key: cord-1047859-a5myh752
authors: Cottam, Eleanor M.; Wadsworth, Jemma; Knowles, Nick J.; King, Donald P.
title: Full Sequencing of Viral Genomes: Practical Strategies Used for the Amplification and Characterization of Foot-and-Mouth Disease Virus
date: 2009-03-16
journal: Molecular Epidemiology of Microorganisms
DOI: 10.1007/978-1-60327-999-4_17
sha: ea918b7ca0a0c7307a2a972d615b263ab6d8b58d
doc_id: 1047859
cord_uid: a5myh752

Nucleic acid sequencing is now commonplace in most research and diagnostic virology laboratories. The data generated can be used to compare novel strains with other viruses and allow the genetic basis of important phenotypic characteristics, such as antigenic determinants, to be elucidated. Furthermore, virus sequence data can also be used to address more fundamental questions relating to the evolution of viruses. Recent advances in laboratory methodologies allow rapid sequencing of virus genomes. For the first time, this opens up the potential for using genome sequencing to reconstruct virus transmission trees with extremely high resolution and to quickly reveal and identify the origin of unresolved transmission events within discrete infection clusters. Using foot-and-mouth disease virus as an example, this chapter describes strategies that can be successfully used to amplify and sequence the full genomes of RNA viruses. Practical considerations for protocol design and optimization are discussed, with particular emphasis on the software programs used to assemble large contigs and analyze the sequence data for high-resolution epidemiology.

chain-termination method initially devised by Fred Sanger in the 1970s (1). The throughput and robustness of these methods have been improved by the use of fluorescent dyes and capillary separation technologies, such that the routine assembly of large fragments of genomic DNA (>10 kb) is now achievable by many modestly equipped laboratories.
For the most part, protocols developed to sequence large fragments of nucleic acid can also be adapted to characterize the genomes of RNA viruses, which typically are 15 kb or less. Full-genome sequences of viruses can be used to address fundamental questions relating to evolution, identification of critical antigenic determinants, and viral molecular epidemiology. Although sequencing small numbers of some viral genomes can be straightforward, specific protocols and work flows are required to effectively manage projects that aim to characterize the molecular epidemiology of viral transmission. Using foot-and-mouth disease virus (FMDV) as an example, this chapter describes strategies that can be successfully used to amplify and sequence the complete genomes of RNA viruses.

Foot-and-mouth disease (FMD) is a highly contagious disease affecting cloven-hoofed livestock (cattle, sheep, pigs, goats, and water buffalo). The causative agent is a virus belonging to the genus Aphthovirus (family: Picornaviridae) that exists as seven antigenically distinct serotypes, each comprising numerous and constantly evolving variants (2). The genome of FMDV is approximately 8,300 nucleotides in length. It comprises a polyadenylated positive-sense RNA that encodes a single polyprotein, which is posttranslationally cleaved into constituent capsid proteins and nonstructural proteins involved in viral replication. In common with most other RNA viruses, the enzyme responsible for replication of the FMDV genome (RNA-dependent RNA polymerase) has poor fidelity, such that changes to the nucleotide sequence occur frequently and are inherited by progeny viruses. This rapid evolution rate of FMDV allows virus transmission trees to be reconstructed with extremely high resolution, opening up the possibility of using these data to retrospectively reveal and identify the origin of unresolved transmission events (3, 4).
In addition to forensic molecular epidemiology, full-genome sequence data have also recently contributed to our understanding of a number of aspects of FMDV evolution, including (i) evolutionary rates (5); (ii) sites and importance of recombination (6, 7); (iii) identification of ordered RNA structures (8); and (iv) contribution and significance of the quasi-species phenomenon to evolution (9). Sequence data from a wide variety of FMDV isolates also play an important role in the reiterative design of oligonucleotide primers used for molecular assays for routine diagnostic use in reference laboratories (for pan-reactive and serotype-specific detection and strain characterization).

The extent of the run length obtained by capillary sequencers places a limit on the maximum distance between oligonucleotide primers (either in the polymerase chain reaction [PCR] amplification or cycle sequencing setup stages). In contrast to DNA targets, which are relatively stable, researchers who study RNA viruses, such as picornaviruses, are familiar with the plasticity of viral genomes. This high variability poses particular challenges for the design of pan-reactive oligonucleotide primers to reliably amplify complete viral complementary DNAs (cDNAs). For viruses such as FMDV, the existence of multiple serotypes (whose nucleotide sequences may vary by as much as 50% in some genome regions) can further complicate the identification of suitable target sequences. As a consequence, the two extremes of the sequencing strategies used for FMDV are illustrated in Fig. 1 (see Fig. 1a, b) and shown by representative agarose gels in Fig. 2. In both of these approaches (adopted for the characterization of FMD outbreaks in the United Kingdom in 2001 and 2007), a large number of specific primers were required. Furthermore, these oligonucleotides are specific for defined lineages of FMDV, limiting their use for study of other genotypes of FMDV.
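The constraint that read length limits the distance between primers can be illustrated with a small tiling calculation. The following is only a sketch: the function name, the 700-nucleotide amplicon length, and the 100-nucleotide overlap are illustrative assumptions and are not values taken from this chapter; real primer placement also depends on sequence conservation, which is not modeled here.

```python
def tile_amplicons(genome_length, amplicon_length, min_overlap):
    """Return (start, end) coordinates (0-based, end-exclusive) of overlapping
    amplicons that together cover a genome of the given length.

    Hypothetical helper: it only spaces fragments evenly and ignores
    whether suitable primer-binding sites exist at those positions.
    """
    if amplicon_length <= min_overlap:
        raise ValueError("amplicon_length must exceed min_overlap")
    if amplicon_length >= genome_length:
        return [(0, genome_length)]

    step = amplicon_length - min_overlap
    tiles = []
    start = 0
    while True:
        end = start + amplicon_length
        if end >= genome_length:
            # Shift the final amplicon back so it ends at the genome terminus.
            tiles.append((genome_length - amplicon_length, genome_length))
            break
        tiles.append((start, end))
        start += step
    return tiles


if __name__ == "__main__":
    # Approximate FMDV genome length from the text; other values are assumed.
    for number, (start, end) in enumerate(tile_amplicons(8300, 700, 100), 1):
        print(f"fragment {number}: {start}-{end}")
```

Under these assumed values each fragment pair needs its own primers, which is one way to see why the strategies described above required a large number of lineage-specific oligonucleotides.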
Other FMD laboratories have used similar approaches also requiring large numbers of primers (11-15). Additional protocols, such as rapid amplification of cDNA ends (RACE), can be used to generate sequence data for the terminal ends of the genome and regions close to the poly(C) tract of FMDV. To reduce the complexity of the cycle-sequencing reactions, recognition sequences for universal sequencing primers (such as M13) can be incorporated into the 5′ ends of the primers used for PCR (see Fig. 1a). Alternative approaches, such as shotgun cloning (for example, Fig. 1c), are also being considered for full-genome sequencing. Initially, these use long-range PCR to amplify large fragments of the virus genome (possibly even encompassing entire genomic sequences). These PCR products are subsequently fragmented and cloned into plasmid vectors prior to sequencing and reconstruction of the viral sequence. Since this approach uses only two viral-specific primers (which can be targeted to highly conserved regions) and is not reliant on internal virus-specific primers, this method may provide a more suitable approach that has a broader sensitivity to different viral variants. However, these methods need to balance the advantages in diagnostic sensitivity that are gained from using a smaller number of primers with the drawback of lower analytical sensitivity that may arise from amplifying large PCR products (in comparison to shorter fragments).

In this chapter, a guide protocol that has been successfully used to sequence FMDV is presented. Although some of the finer details are specific to FMDV, the general approaches described are broadly applicable to other RNA viruses. Indeed, similar methods have been described recently to characterize the genomes of other viruses that infect humans, livestock, and plants (16-23).

1. Using sterile sand and a pestle and mortar, prepare a 10% (w/v) suspension of the tissue sample in phosphate-buffered saline.
Liquid samples (such as serum) can be processed straight to step 3. Depending on application and nature of the sample to be tested (see Note 3), alternative RNA extraction protocols can also be used (such as commercially available silica-based spin columns).
2. Centrifuge at 300g for 10 min.
5. Add 2 µL Superscript III Reverse Transcriptase.
6. Incubate at 42°C for 1-4 h followed by 85°C for 5 min. A specific PCR amplifying the 5′ end of the genome can be used to test that complete first-strand cDNAs have been generated.
7. Clean up cDNA using GFX PCR DNA and Gel Band Purification kit according to the manufacturer's instructions and elute in 50 µL. This step removes unincorporated primers and dNTPs from the RT reaction.
8. Set up a PCR master mix in a clean room using the primer sets required for amplification of the genomic fragments.
9. Add 2.5 µL cDNA to each reaction in a separate area away from the PCR clean room (see Note 5).
10. Run thermocycling program (as described in refs. 2, 3, 10; see Note 6).

1. Run 2 µL of PCR product on a 1.2% (w/v) agarose gel at 105 V for 30 min to check that the reaction has worked.
2. Clean up cDNA using GFX PCR DNA and Gel Band Purification kit according to the manufacturer's instructions.
3. Quantify DNA concentration in purified PCR product. This can be done using a spectrophotometer (e.g., Nanodrop, Thermo Fisher Scientific) or by agarose gel electrophoresis using DNA standards (see Subheading 2.3.).
4. Dilute products to give appropriate concentrations for sequencing.
5. Prepare sequencing reaction using diluted PCR product.

Sequencing viral genomes can quickly accumulate a large amount of data (see Note 7). Software programs (such as Lasergene, http://www.dnastar.com/) can be used to simplify the alignment of individual sequences and to rapidly assemble large contigs.
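The dilution step in the list above (quantify the purified product, then dilute it for sequencing) is simple C1·V1 = C2·V2 arithmetic. The sketch below assumes concentrations in ng/µL and volumes in µL; the function name and the example numbers are hypothetical, since the chapter does not state a target concentration.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul, final_volume_ul):
    """Return (product_volume_ul, diluent_volume_ul) needed to dilute a
    quantified PCR product to a target concentration (C1*V1 = C2*V2).

    All numbers passed in are the user's own; nothing here comes from
    the protocol itself.
    """
    if min(stock_ng_per_ul, target_ng_per_ul, final_volume_ul) <= 0:
        raise ValueError("concentrations and volume must be positive")
    if target_ng_per_ul > stock_ng_per_ul:
        raise ValueError("target concentration exceeds stock concentration")

    product_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return product_ul, final_volume_ul - product_ul


if __name__ == "__main__":
    # Assumed example: 50 ng/uL stock diluted to 5 ng/uL in 20 uL.
    product, diluent = dilution_volumes(50.0, 5.0, 20.0)
    print(f"PCR product: {product:.1f} uL, diluent: {diluent:.1f} uL")
```

The appropriate target concentration depends on the fragment length and on the sequencing chemistry in use, so it should be taken from the sequencing kit instructions rather than from this example.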
The minimum criterion for acceptance of a final sequence is that each nucleotide position should be determined by sequencing reactions in either direction (forward and reverse).

Currently, the genetic evolution and relationships of viruses are studied by analyzing their genetic sequence data by phylogenetic methods. Phylogenetic trees are constructed and used to deduce the genetic relatedness of the viruses. There are different methods for constructing phylogenetic trees; the first approach developed was the maximum parsimony methodology, but more recently maximum likelihood (24) and Bayesian methods (25) are the preferred techniques for tree construction. Other methods based on distance matrices, such as neighbor-joining (26) or unweighted pair-group method with arithmetic mean (UPGMA) (27), which calculate genetic distance from multiple sequence alignments, are simpler to implement but do not invoke an evolutionary model.

Maximum parsimony determines the most parsimonious tree, that is, the tree requiring the fewest evolutionary steps. This method is simple and as such makes very few assumptions about the evolutionary process. However, certain features of genetic evolution of organisms present problems when using this method of tree construction. First, inaccuracies can occur as a result of the existence of homoplasy. Homoplasy describes processes, such as convergent evolution, by which a single mutation can occur twice on independent branches of a tree. Hence, it implies that two sequences sharing a mutation were not necessarily derived from a common ancestor that also contained this mutation. Another hurdle to overcome is back-mutation, by which a mutation reverts to its original genotype. This can cause the specific sequence to appear more ancestral than is necessarily the case.
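The idea of counting evolutionary steps, and the way homoplasy inflates that count, can be sketched with Fitch's small-parsimony algorithm, a standard method that is named here as an illustration and is not a procedure described in this chapter. Trees are written as nested pairs of sample names; the sample names and sequences are made up.

```python
def fitch_changes(tree, states):
    """Minimum number of changes at one alignment column on a rooted
    binary tree (Fitch's algorithm).

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    states -- dict mapping leaf name to the nucleotide at this column
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = node
        left_set, right_set = visit(left), visit(right)
        shared = left_set & right_set
        if shared:
            return shared
        changes += 1  # the two subtrees disagree: one change is required
        return left_set | right_set

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-column minimum changes over an alignment (dict name -> seq)."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )


if __name__ == "__main__":
    column = {"A": "G", "B": "A", "C": "G", "D": "A"}
    # Grouping A with B forces the same substitution to arise twice (homoplasy).
    print(fitch_changes((("A", "B"), ("C", "D")), column))
    # Grouping A with C explains the column with a single change.
    print(fitch_changes((("A", "C"), ("B", "D")), column))
```

Maximum parsimony prefers the topology with the lower total score, which is why a site that truly mutated twice independently can mislead the method into grouping unrelated sequences.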
A further drawback to the method of maximum parsimony is that it takes no account of the rate at which mutations arise and the varying probabilities of different mutations occurring (i.e., transversions vs. transitions). For these reasons, the parametric method of maximum likelihood is usually preferred as it provides the most probable tree that suits a specific determined evolutionary model. Providing that the model employed is a reasonable approximation of the evolutionary processes that gave rise to the observed genetic data, this analysis is potentially more powerful than other methods. The evolutionary model may include a large number of parameters accounting for differences in the probabilities of various character states, differences in the occurrence of particular substitutions/mutations, and differences in the probabilities of change among characters. With sophisticated models such as the Hasegawa-Kishino-Yano (HKY) model (28) and the general time reversible (GTR) model (29), an improved idea of phylogeny is achieved, although fitting an incorrect model can give incorrect results. The suitability of models can be tested using a program such as Modeltest (30).

Maximum likelihood estimation of tree phylogeny is generally preferable to maximum parsimony because it is statistically consistent, has a better statistical foundation, and allows complex modeling of evolutionary processes. However, the maximum likelihood method has a computing limitation for large numbers of sequences. To infer statistical confidence in phylogenies constructed by either maximum parsimony or maximum likelihood, bootstrap analyses (31) are performed.

A further method to infer phylogenies is that of Bayesian inference, which generates a posterior distribution for a parameter based on the prior for that parameter and the likelihood of the data (represented by the sequence alignment).
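The relationship between prior, likelihood, and posterior described above is Bayes' theorem. As a sketch, using notation that is assumed here rather than taken from the chapter, let $D$ be the sequence alignment, $\tau$ the tree topology, and $\theta$ the remaining model parameters (branch lengths and substitution parameters):

```latex
P(\tau, \theta \mid D) \;=\; \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)},
\qquad
P(D) \;=\; \sum_{\tau} \int P(D \mid \tau, \theta)\, P(\tau, \theta)\, d\theta .
```

Maximum likelihood works only with the term $P(D \mid \tau, \theta)$, whereas Bayesian methods target the left-hand side. The normalizing term $P(D)$ sums over all trees and is generally intractable, which is why sampling approaches are used in practice.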
In other words, whereas maximum likelihood analysis investigates the probability of the observed data given a specific evolutionary model, Bayesian inference looks at the probability that a model is correct given the observed data set. With the availability of Markov chain Monte Carlo methods (32), Bayesian inference can be a preferred choice for tree estimation because it can be faster than maximum likelihood, and no bootstrapping is required as the posterior probabilities determine the statistical confidence in the tree.

Although in the majority of instances maximum likelihood or Bayesian inference is preferable for tree construction, in certain situations maximum parsimony can be a viable alternative. When studying closely related sequences over a short time period the likelihood of back-mutation is relatively low, and hence maximum parsimony tree construction is likely to give an accurate estimation of tree phylogeny. Phylogenetic analysis of virus sequences is often performed with the aim of tracing specific virus history, and in these cases the method of statistical parsimony can be used. The distances depicted by parsimony trees represent the actual number of differences between sequences, whereas for a maximum likelihood tree the probability of change is shown (Fig. 3). Often when studying viruses, closely related sequences are being investigated, with a focus on the accumulation of changes, and in this case a simpler representation of the raw data as depicted by parsimony is desirable. The TCS statistical parsimony program (34) can position sequences internally on a branch, which assists in depicting directly ancestral sequences (see Fig. 3b).
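The raw quantity that a parsimony network displays, the number of nucleotide differences between each pair of aligned genomes, can be computed directly. The sketch below is not the TCS algorithm; it only counts differences, it assumes the sequences are already aligned to equal length, and it skips positions where either sequence has a gap or an ambiguity code. The sample names are hypothetical.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def count_differences(seq_a, seq_b):
    """Count positions at which two aligned sequences carry different
    unambiguous nucleotides (gaps and ambiguity codes are ignored)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in UNAMBIGUOUS and y in UNAMBIGUOUS and x != y
    )


def pairwise_differences(alignment):
    """Return {(name_a, name_b): differences} for every pair of samples."""
    return {
        (a, b): count_differences(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Toy alignment; real input would be complete genome sequences.
    toy = {
        "farm1": "ACGTACGT",
        "farm2": "ACGTACGA",
        "farm3": "ACTTACGA",
    }
    for pair, n in pairwise_differences(toy).items():
        print(pair, n)
```

In this toy data set farm2 differs from both other samples by one position while farm1 and farm3 differ by two, the kind of pattern in which a program such as TCS would place one sequence internally between the other two.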
Although the statistical parsimony trees drawn by TCS are not bootstrapped, if the data comprise the complete genome sequences of the sampled viruses, then the tree is as accurate and as representative as it can be: It is not sensitive to the choice of a single arbitrary locus because there are no further genetic data retrievable. A useful Web site that lists available phylogenetic programs for analyzing sequence data is http://evolution.genetics.washington.edu/phylip/software.html.

Newer technologies are currently being developed that offer the potential to eliminate the use of capillary electrophoresis and to provide even greater throughput. Resequencing microarrays have been developed and used to determine the sequence of the severe acute respiratory syndrome (SARS) coronavirus (35, 36). However, development of specific arrays is heavily resource dependent and currently likely to be deployed only in niche markets. Of the newer technologies, sequential ligation systems (SOLiD), solid-phase primer amplification (Solexa), and bead-and-well-based pyrosequencing methods (such as the 454 platform) have the capacity to generate reads of 4-20 Mb in a single run. Although this might be considered excessive for characterization of individual viral genomes, these approaches may allow infrequent mutations within a viral population to be detected. Thus, these methods may be ideal for dissecting the genetic variability within viral populations.

1. In addition to ensuring that all solutions used for RNA extraction are RNase free, pipets and work surfaces should be cleaned using 10% bleach followed by DNAzap (Ambion) prior to and between each sample processed. A logical work flow for processing the samples for sequencing projects is highly recommended (see Fig. 4 for an example).
This is particularly important for high-resolution molecular epidemiological studies since the discrimination of samples may be dependent on the accurate determination of only a few nucleotide differences in the complete genome length (3). Therefore, it is important that care is taken to minimize cross-contamination between samples (particularly post-PCR products). If possible, samples should be processed independently (including suitable negative control material), and the study should be organized to attempt to maximize the differences between successive samples tested.
3. A variety of sample types (including blood, tissues, esophageal-pharyngeal fluid, and cell culture supernatant) can be tested; however, it is usually preferable to test primary material (such as clinical samples) since it is possible that cell culture passage or molecular cloning of viruses can introduce nucleotide changes that can influence the interpretation of results.
4. Once placed in TRIzol reagent, samples can be stored for extended periods (at a wide range of temperatures, −70 to +4°C).
5. The requirement to perform a high number of downstream sequencing reactions may necessitate that a relatively large volume of PCR product is generated, requiring pooling of RNA, cDNA, or post-PCR products. An additional practical consideration is the fidelity of the DNA polymerase used for the PCR amplification step; if possible, proofreading enzymes, which are widely available, should be used.
6. In common with other long PCR methods, the parameters of the protocol used for amplification of viral genomes should be optimized prior to routine use. Steps to be considered include the components of the RT or PCR mixes and the cycling times used for amplification. In initial experiments, a PCR targeting a fragment of the 5′ end of the genome can be used to confirm that full-length cDNA has been produced in the RT reaction.
7. In general, these methods provide an accurate estimation of the viral consensus sequence. However, it is important to recognize that this sequence will be a composite of the component variability that, to a greater or lesser extent, may be present. In spite of concerns that it is theoretically possible that the sequence generated will not represent an actual virus species present in the sample, studies with FMDV indicated that the majority of molecular clones have identical sequences to the consensus (37). Testing of duplicate samples can generate identical results (4), demonstrating that these methods are accurate, and as long as the viral concentrations are relatively high, consensus sequences obtained will mask any individual errors that might arise due to low fidelity of reverse transcriptase and polymerase enzymes. These aspects relating to accurate determination of the sequences of specific viral genomes (rather than consensus sequences) will be of particular concern in studies that aim to characterize the genetic population structure within samples (i.e., the quasi-species nature of a virus). New technologies and approaches (see Subheading 3.5.) may be utilized to address these important questions that underpin our understanding of viral evolution.

This work was funded by Defra research project SE2936. We acknowledge the assistance of colleagues Guido König, Sasmita Upadhyaya, Nigel Ferris, and Geoff Hutchings, and of Michael Quail from the Wellcome Trust Sanger Institute, Cambridge, for collaboration with the shotgun sequencing approach.
1. DNA sequencing with chain-terminating inhibitors
2. Comparative genomics of foot-and-mouth disease virus
3. Molecular epidemiology of the foot-and-mouth disease virus outbreak in the United Kingdom in 2001
4. Integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus
5. Genetic and phenotypic variation of foot-and-mouth disease virus during serial passages in a natural host
6. Recombination patterns in aphthoviruses mirror those found in other picornaviruses
7. Mosaic structure of foot-and-mouth disease virus genomes
8. Detection of genome-scale ordered RNA structure (GORS) in genomes of positive-stranded RNA viruses: implications for virus evolution and host persistence
9. Quasispecies dynamics and RNA extinction
10. Transmission pathways of foot-and-mouth disease virus in the United Kingdom in 2007
11. Comparisons of the complete genomes of Asian, African and European isolates of a recent foot-and-mouth disease virus type O pandemic strain (PanAsia)
12. High throughput sequencing and comparative genomics of foot-and-mouth disease virus
13. Comparison and analysis of the complete nucleotide sequence of foot-and-mouth disease viruses from animals in Korea and other PanAsia strains
14. The nucleotide sequence of foot-and-mouth disease virus O/FRA/1/2001 and comparison with its British parental strain O/UKG/35/2001
15. Complete nucleotide sequence of a Chinese serotype Asia1 vaccine strain of foot-and-mouth disease virus
16. Genome comparison of a novel classical swine fever virus isolated in China in 2004 with other CSFV strains
17. Phylogenetic analysis of WNV in North American blood donors during the 2003-2004 epidemic seasons
18. Comparative analysis of the full genome sequence of European bat lyssavirus type 1 and type 2 with other lyssaviruses and evidence for a conserved transcription termination and polyadenylation motif in the G-L 3′ non-translated region
19. Complete genomes of hepatitis C virus (HCV) subtypes 6c, 6l, 6o, 6p and 6q: completion of a full panel of genomes for HCV genotype 6
20. A newly reported human polyomavirus, KI virus, is present in the respiratory tract of Australian children
21. Complete genome sequencing of a non-syncytium-inducing HIV type 1 subtype D strain from Cape Town, South Africa
22. Study of the genetic stability of measles virus CAM-70 vaccine strain after serial passages in chicken embryo fibroblasts primary cultures
23. Complete nucleotide sequence of a new strain of Tobacco necrosis virus A infecting soybean in China and infectivity of its full-length cDNA clone
25. Bayesian inference of phylogeny and its impact on evolutionary biology
26. The neighbor-joining method: a new method for reconstructing phylogenetic trees
27. A quantitative approach to a problem in classification
28. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA
29. A new method for calculating evolutionary substitution rates
30. MODELTEST: testing the model of DNA substitution
31. Confidence limits on phylogenies: an approach using the bootstrap
32. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method
33. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny
34. TCS: a computer program to estimate gene genealogies
35. Evaluation of Affymetrix severe acute respiratory syndrome resequencing GeneChips in characterization of the genomes of two strains of coronavirus infecting humans
36. Tracking the evolution of the SARS coronavirus using high-throughput, high-density resequencing arrays
37. Analysis of Foot-and-mouth disease virus nucleotide sequence variation within naturally infected epithelium