key: cord-0585368-85usg3uh authors: Morelli, Marco J.; Wright, Caroline F.; Th'ebaud, Gael; Knowles, Nick J.; Herzyk, Pawel; Paton, David J.; Haydon, Daniel T.; King, Donald P. title: Beyond the consensus: dissecting within-host viral population diversity of foot-and-mouth disease virus using next-generation genome sequencing date: 2011-01-27 journal: nan DOI: nan sha: 4a3354298d6d6fec6974b13a8f5d0051e278abeb doc_id: 585368 cord_uid: 85usg3uh The sequence diversity of viral populations within individual hosts is the starting material for selection and subsequent evolution of RNA viruses such as foot-and-mouth disease virus (FMDV). Using next-generation sequencing (NGS) performed on a Genome Analyzer platform (Illumina), this study compared the viral populations within two bovine epithelial samples (foot lesions) from a single animal with the Inoculum used to initiate experimental infection. Genomic sequences were determined in duplicate sequencing runs, and the consensus sequence determined by NGS, for the Inoculum, was identical to that previously determined using the Sanger method. However, NGS reveals the fine polymorphic sub-structure of the viral population, from nucleotide variants present at just below 50% frequency to those present at fractions of 1%. Some of the higher frequency polymorphisms identified encoded changes within codons associated with heparan sulphate binding and were present in both feet lesions revealing intermediate stages in the evolution of a tissue-culture adapted virus replicating within a mammalian host. We identified 2,622, 1,434 and 1,703 polymorphisms in the Inoculum, and in the two foot lesions respectively: most of the substitutions occurred only in a small fraction of the population and represent the progeny from recent cellular replication prior to onset of any selective pressures. We estimated an upper limit for the genome-wide mutation rate of the virus within a cell to be 7.8 x 10-4 per nt. The greater depth of detection, achieved by NGS, demonstrates that this method is a powerful and valuable tool for the dissection of FMDV populations within-hosts. The sequence diversity of viral populations within individual hosts is the starting 24 material for selection and subsequent evolution of RNA viruses such as foot--and--mouth 25 disease virus (FMDV). Using next--generation sequencing (NGS) performed on a Genome 26 Analyzer platform (Illumina), this study compared the viral populations within two 27 bovine epithelial samples (foot lesions) from a single animal with the Inoculum used to 28 initiate experimental infection. Genomic sequences were determined in duplicate 29 sequencing runs, and the consensus sequence determined by NGS, for the Inoculum, 30 was identical to that previously determined using the Sanger method. However, NGS 31 reveals the fine polymorphic sub--structure of the viral population, from nucleotide 32 variants present at just below 50% frequency to those present at fractions of 1%. Some 33 of the higher frequency polymorphisms identified encoded changes within codons 34 associated with heparan sulphate binding and were present in both feet lesions 35 revealing intermediate stages in the evolution of a tissue--culture adapted virus 36 replicating within a mammalian host. We identified 2,622, 1,434 and 1,703 37 polymorphisms in the Inoculum, and in the two foot lesions respectively: most of the 38 substitutions occurred only in a small fraction of the population and represent the 39 progeny from recent cellular replication prior to onset of any selective pressures. We 40 estimated an upper limit for the genome--wide mutation rate of the virus within a cell to 41 be 7.8 x 10 --4 per nt. The greater depth of detection, achieved by NGS, demonstrates that 42 this method is a powerful and valuable tool for the dissection of FMDV populations 43 within--hosts. 44 45 Introduction 46 RNA viruses evolve rapidly due to their large population size, high replication rate and 47 poor proof--reading ability of their RNA--dependent RNA polymerase. These viruses exist 48 as heterogeneous and complex populations comprising similar but non--identical 49 genomes, but the evolutionary importance of this phenomenon remains unclear (15, 16, 50 25). Consensus sequencing identifies the predominant or major viral sequence present 51 in a sample, but is uninformative about minority variants that are present. Evidence for 52 population heterogeneity, where individual sequences differ from the consensus 53 sequence, has been routinely obtained using cloning approaches (1, 8), providing 54 insights into the evolutionary processes that shape viral populations. Unfortunately, 55 these cloning processes are laborious and usually provide only a limited resolution of 56 the mutant spectrum within a sample. 57 Next--Generation Sequencing (NGS) techniques offer an unprecedented 'step--change' 58 increase in the amount of sequence data that can be generated from a sample. Albeit 59 mostly used for de--novo sequencing of large genomes, NGS can be applied to re--60 sequence short viral genomes to obtain an ultra--deep coverage. Therefore, NGS has the 61 potential to provide information beyond the consensus for a viral sample by revealing 62 nucleotide substitutions present in only a small fraction of the population. 90 The samples analysed were collected during an infection experiment, in which a single 91 bovine host was inoculated intradermolingually with a dose of 10 5.7 50% tissue culture 92 infective doses (TCID50) of FMDV (O1BFS 1860). The full--length FMDV genome sequence 93 of this sample had been previously determined using Sanger sequencing (EU448369) 94 and was used as a reference genome in this study. The Inoculum was derived from a 95 bovine tongue vesicle specimen that had been passaged extensively in cell culture (9). 96 Total RNA (TRIzol, Invitrogen, Paisley, UK) was extracted from a sample of the Inoculum 97 as well as two 10% tissue suspensions prepared from epithelial lesions (front left foot 98 [FLF] and back right foot [BRF]) collected from the animal at 2 days post inoculation. 99 Reverse transcription was performed using an enzyme with high specificity 100 (Superscript III reverse transcriptase, Invitrogen), and an oligo--dT primer (see Table 1 ). 101 For each sample, two PCR reactions generating long overlapping fragments (4557 bp 102 and 4317 bp respectively) were carried out using a proof--reading enzyme mixture 103 (Platinum Taq In order to make direct comparisons between the two runs, we trimmed reads from the 138 second run to 50nt. Typically, quality scores decreased along a read, as the reliability of 139 the sequencing process decreased with the number of cycles of the Sequencing 140 Platform. The second run yielded much better qualities following an upgrade of the 141 Illumina platform. For both runs, reads with average error per nt below a fixed 142 threshold (θ = 0.2%) were discarded to generate a flatter error profile along the read 143 (see Appendix and Figure A1 ). The first and last 5 nts of each aligned read were 144 removed from the analysis as they showed a higher number of mismatches to the 145 reference sequence due to insertions or deletions close to the edges of the reads ( Figure 146 A2). More details can be found in Appendix. 147 Validation and analysis of sequence diversity in the samples 148 The frequency of site--specific polymorphisms was estimated from the frequency of 149 mismatches of the aligned reads to the reference genome. A proportion of these 150 mismatches was expected to be artifactual, arising from a base mis--calling in the 151 sequencing process, or from a PCR error in the amplification of the sample. In order to 152 identify polymorphisms arising from possible base mis--calls in the sequencing reaction, 153 we used the quality score of each nucleotide read to compute the average probability of hypothesize that the probability that PCR errors in both runs independently generated 160 identical base changes at the same site is very low. Based on values quoted for the 161 enzymes used, we estimated that the error rate for the combined RT--PCR amplification 162 process to be 7.7x10 --6 per base pair copied (2, 31, 38). We therefore defined 163 polymorphic sites that could not be attributed to sequencing errors and at which both 164 the most common and second most common nucleotides were the same between the 165 two runs to be 'qualitatively validated sites'. For each site in the set of qualitatively 166 validated polymorphisms, we computed the 95% confidence intervals for the 167 polymorphism frequency using the binomial distribution above. If the 95% confidence 168 intervals from each run overlapped, we defined the polymorphism frequency estimates 169 from the two runs to be in quantitative agreement. 170 We assessed the quantitative repeatability of site--specific polymorphism frequency 171 estimates by calculating Spearman Rank correlation coefficients between 172 polymorphism frequencies in the samples within each run and between polymorphism 173 frequencies from runs 1 and 2. 174 We counted the number of transitions (Ts) and transversions (Tv) observed at 175 qualitatively validated sites across the genome, we computed κ=2Ts/Tv, and the 176 relative distribution of mutations across the 1 st , 2 nd , and 3 rd codon positions across the 177 open reading frame (ORF). We obtained an estimate for dN/dS as follows: for each 178 codon of the reference ORF, we computed the expected number of synonymous (si) and 179 non--synonymous sites (ni) and, for each read j spanning that codon, the number of 180 observed synonymous (s r ij) and non--synonymous substitutions (n r ij). Using all the 181 codons where si>0, we then obtained an estimate of the number of synonymous 182 substitutions per synonymous site (pS) and non--synonymous substitutions per non--183 synonymous site (pN): (and analogously for pN) ,where mi is the 184 number of reads covering codon i and ncod is the total number of codons in the ORF. 185 From pN and pS we obtained dN/dS according to (34). 186 We calculated the number of validated sites at which STOP--codons are observed within 187 the reading frame, and used these count to estimate an upper limit on the mutation rate. where N is the number of sites and piX is 218 the fraction of reads bearing nt X at site i. The entropy measures the amount of 219 "disorder" in the population, and it is maximum at a site when all four bases are equally 220 represented. 221 222 In this section, we discuss the results of the for BRF ( Figure 1A Distribution of polymorphisms across the genome 300 There were 12 SSPs, whose average frequency between the two runs is above 1% in the 301 Inoculum, 19 in FLF, and 25 in BRF (see Supplementary Table) . Some of these were 302 clustered in the capsid protein region (beginning of protein VP3) (1 in the Inoculum, 4 303 in FLF and 5 in BRF) and in the 3' untranslated region (UTR) (6 in the Inoculum, 5 in 304 FLF and 6 in BRF). Where single reads spanning these sites within the VP3 or 3'UTR 305 were available, there was no evidence that that these mutations were linked together on 306 individual FMDV genomes. In particular, the first cluster was shared between the two 307 foot samples and corresponded to changes encoding amino acid residues associated 308 with heparan sulphate (HS) binding. The Inoculum used in this experiment had 309 undergone extensive cell culture passage and, in common with other in--vitro adapted 310 hosts drives the reversion of positively charged amino acid residues at specific sites in 312 the viral capsid (20, 37). A consensus level substitution (>50%) exists within both feet 313 samples of run 1 compared to the reference sequence (see above and the 314 Supplementary Table) . This polymorphism corresponded to a change within the 60 th 315 codon of protein VP3 (VP3 60 ). Although below the level of the consensus sequence, 316 additional qualitatively validated SSPs that were present in both feet samples were 317 detected at four further sites (VP2 134 , two codon positions within VP3 56 and VP3 59 ) that 318 impact on the ability of FMDV to bind HS. All but one of the mutations clustered within 319 the 3' UTR of the three samples were located within the first four RNA--RNA pairings 320 either side of the apex of a conserved stem--loop. This structure, one of two stem--loops 321 previously predicted for FMDV and other picornaviruses (6) (33) is thought to generate 322 long--distance RNA--RNA interactions that may impact upon viral replication (40). The 323 presence of shared mutations between the two foot samples suggests a common history 324 for the viruses arising as a result of the shared route of intra--host transmission from 325 initial replication sites in the tongue to epithelial sites in the feet via the blood. 326 However an alternative explanation -that the virus is subject to a common selection 327 pressure in both sites cannot be ruled out. 328 suggesting that the absence of observed genetic variability may be due to lack of power 333 to detect it. By grouping the site--specific polymorphism frequencies into discrete bins, 334 we can examine the proportion of sites experiencing different polymorphic frequencies 335 and thereby obtain a comprehensive picture of the heterogeneity in the viral 336 populations ( Figure 5 ). Across the three samples, most sites exhibit a range of low--337 frequency SSPs between 0.01% --1%. Only a few sites showed higher frequency 338 polymorphism, and these sites were more numerous for the samples from the feet than The presence of STOP codons can be used to obtain an upper limit on the mutation rate 369 (per nucleotide per transcription event) of this virus. We hypothesise that these 370 mutations are lethal and are therefore generated in the last round of cellular replication. 371 Moreover, by assuming a replication strategy involving the minimum number of 372 copying events in the cell (the "stamping machine" strategy, see (44) of the infection across hosts (5) can be explained with this mechanism, which is made 412 more accessible to study by NGS. Moreover, the statistical characteristics of the SSPs we 413 identified (κ, dN/dS) are very similar to those found previously (8), further 414 corroborating the validity of our results. Finally, randomizations of the diversity 415 measured in the capsid region allowed us to obtain simulated clones whose 416 characteristics in terms of mutation were analogous to those found in (8). We conclude 417 that NGS data can be used to examine the nucleotide diversity of each genome position 418 at unprecedented resolution. Observing the mutant spectrum of the viral population at 419 a fine resolution will provide a more sophisticated understanding of evolutionary 420 processes shaping its variability. 421 Comparisons between the sequences recovered from the Inoculum and clinical lesions 422 provide new insights into the impact of early replication events on viral evolution 423 within a host. This study reveals that only a few sites displayed mutations present in a 424 large fraction of the population, i.e. high frequency polymorphisms (>1%), while the 425 vast majority of the polymorphisms were present at lower frequencies. We hypothesize 426 that the high frequency polymorphisms have been selected over multiple rounds of 427 replication within cells, and that the lower frequency polymorphisms most likely 428 directly reflect the high rate of mutation experienced by these viruses, as our estimate 429 of an upper limit of the genome--wide mutation rate suggests. In this study we used a cell 430 culture adapted virus (as the Inoculum) which gave us the opportunity to monitor 431 changes at specific loci associated with the HS binding site that were under selection 432 pressure during initial replication in a mammalian host. Examination of these sites 433 (collated in the Supplementary Table) identifies hot and cold spots in the HRV ORF, and some minority variants, the resolution 440 is not sufficient to observe the micro--evolutionary processes whose signature lies in 441 small fractions of the viral population (<2%). Moreover, their estimation of the 442 substitution rate during the infection is based solely on the count of the nucleotides 443 changed among those analyzed: although the value is compatible to our genome--wide, 444 mutation rate, we believe that considering the cellular process of viral replication (and 445 specifically assuming the minimum number or copying events in a cell) allows us to gain 446 a better insight of the process generating variation in the viral population and obtain a 447 more stringent upper bound. 448 Figure 5 give rise to the amount of diversity we observe. The data collected in studies like this 496 can be used for building models aimed at understanding the link between the micro--497 evolution of FMDV at the cellular scale with the population heterogeneity at the host 498 scale. We anticipate that a model of viral replication across several cell generations 499 within a host will produce a more stringent upper bound to the genome--wide mutation 500 rate. 501 Although further work is required, these findings strongly suggest that data generated 502 through the use of this methodology can provide novel insights into viral evolutionary 503 dynamics at a greater resolution than previously achieved for a positive--stranded virus 504 such as FMDV. In particular, the genome wide assessment of polymorphic frequencies is 505 likely to be an important asset in the parameterization of models that can evaluate the 506 role of quasi--species dynamics in RNA virus evolution. 507 508 The quality scores associated with each nucleotide were lower on the first run and 539 decreased towards the end of reads ( Figure A1 ). In order to make direct comparisons 540 between the two runs, we trimmed reads from the second run to 50nt. Typically, quality 541 scores decreased along a read, as the reliability of the sequencing process decreased 542 with the number of cycles of the Sequencing Platform. As Figure A1 shows, a trade off is 543 present between the number of reads kept and their quality. For both runs, we 544 discarded reads with average error per nt below θ = 0.2%, resulting in a flatter error 545 profile along the read. 546 With this choice of the threshold, 66% of the reads were retained from the first run (a 547 total Reads alignment and trimming 553 The vast majority of the filtered 50nt--reads aligned unambiguously with less than 5 554 mismatches to the reference inoculum genome, previously established using 555 conventional Sanger sequencing (9) (Genbank accession no. EU448369), (run 1: 92.5% 556 for Inoculum, 98.9% for FLF, and 97.8% for BRF, run 2: 95.8% for Inoculum, 98.4% for 557 FLF and 96.2% for BRF). The remaining reads were either ambiguously aligned reads or 558 contained a large number (>4) of mismatches to the reference sequence, and were 559 discarded from the analysis. For each sample, an almost equal number of reads were 560 derived from positive and negative strands of the viral cDNA. 561 Further filtering of the data was undertaken after alignment of the reads. Within the 562 aligned reads, mismatches occurred with similar frequency at each of the 50nt of the 563 reads, except from the edges, where a higher number of mismatches was observed 564 ( Figure A2 ). Presumably, these peaks were due to a small number of sequences with 565 insertions or deletions close to the ends of the reads: for subsequent analysis we 566 trimmed away the first and last 5 nts of each aligned read, leaving only the 40 central 567 nucleotides where the mismatch curve was flat. 568 569 Curing of foot--and--mouth disease virus from persistently infected cells by 578 ribavirin involves enhanced mutagenesis Escherichia coli DNA polymerase III 580 epsilon subunit increases Moloney murine leukemia virus reverse 581 transcriptase fidelity and accuracy of RT--PCR procedures RNAi targeting of West 584 Nile virus in mosquito midguts promotes virus diversification Use of a portable real--time reverse transcriptase--polymerase 589 chain reaction assay for rapid detection of foot--and--mouth disease virus Genetic and phenotypic variation of foot--and--mouth disease virus 593 during serial passages in a natural host Comparative genomics of foot--and--mouth 596 disease virus Rhinovirus genome evolution 599 during experimental human infection 601 Analysis of Foot--and--mouth disease virus nucleotide sequence variation 602 within naturally infected epithelium Transmission pathways of foot--and--mouth disease virus in 607 the United Kingdom in 609 and A. Arias. 2006. Viruses as quasispecies: biological implications. Curr Top Nucleotide 612 sequence heterogeneity of an RNA phage population Rates of spontaneous mutation among RNA viruses Mutation rates among RNA viruses Infidelity of SARS--CoV Nsp14--exonuclease mutant virus 620 replication is revealed by complete genome sequencing Viral population 628 estimation using pyrosequencing Statistical distributions The quasispecies nature and 632 biological implications of the hepatitis C virus The structure and function of a foot--and--mouth disease virus--636 oligosaccharide receptor complex DNA bar coding and pyrosequencing to identify rare HIV 641 drug resistance mutations Is the quasispecies concept relevant to 646 RNA viruses Statistical analysis of the DNA 648 sequence of human chromosome 22 Efficient 652 infection of cells in culture by type O foot--and--mouth disease virus requires 653 binding to cell surface heparan sulfate Low--abundance HIV drug--659 resistant viral variants in treatment--experienced persons correlate with 660 historical antiretroviral use Sequence space coverage, 662 entropy of genomes and the potential to detect non--human DNA in human 663 samples From RNA to 665 quasispecies: a DNA polymerase with proofreading activity is highly 666 recommended for accurate assessment of viral diversity Ultra--deep pyrosequencing of 671 hepatitis B virus quasispecies from nucleoside and nucleotide reverse--672 transcriptase inhibitor (NRTI)--treated patients and NRTI--naive patients Cross--talk between orientation--dependent 676 recognition determinants of a complex control RNA element, the enterovirus 677 oriR Simple methods for estimating the numbers 679 of synonymous and nonsynonymous nucleotide substitutions Increased fidelity reduces poliovirus 682 fitness and virulence under selective pressure in mice Massively 685 parallel pyrosequencing highlights minority variants in the HIV--1 env 686 quasispecies deriving from lymphomonocyte sub--populations Tissue culture adaptation of foot--and--mouth disease virus 690 selects viruses that bind to heparin and are attenuated in cattle Deciphering human 696 immunodeficiency virus type 1 transmission and early envelope 697 diversification by single--genome amplification and sequencing Spontaneous mutation rate 700 of measles virus: direct estimation based on mutations conferring 701 monoclonal antibody resistance The 3' end 703 of the foot--and--mouth disease virus genome establishes two distinct long--704 range RNA--RNA interactions with the 5' end region Low--abundance 710 drug--resistant viral variants in chronically HIV--infected, antiretroviral 711 treatment--naive patients significantly impact treatment outcomes Use of massively parallel ultradeep pyrosequencing to 715 characterize the genetic diversity of hepatitis B virus in drug--resistant and 716 drug--naive patients and to detect minor variants in reverse transcriptase and 717 hepatitis B S antigen The relationship between mutation frequency and replication strategy 720 in positive--sense single--stranded RNA viruses 724 Quantitative deep sequencing reveals dynamic HIV--1 escape and large 725 population shifts during CCR5 antagonist therapy in vivo 731 Characterization of mutation spectra with ultra--deep pyrosequencing: 732 application to HIV--1 drug resistance 737 TABLE 1: Primers. Oligonucleotide primers used for the amplification of the two 738 large, overlapping FMDV genome fragments studied (omitting the S fragment up to 739 and including the poly(c) tract), for both the first and second run 740 PCR Set Primer * Primer Sequence TABLE 2: Statistics of polymorphic sites. General statistics of qualitatively and 744 quantitatively validated SSPs receiving coverage larger than 100x. Ts: transitions in 745 SSPs, Tv: transversions in SSPs, κ=2Ts/Tv, dN: non--synonymous mutations in the 746 ORF, dS: synonymous mutations in the ORF, 1 st , 2 nd and 3 rd : mutations in codon 747 positions in the ORF 1: Coverage of the reference genome. Obtained with the filtered Panel A: first dataset, panel B: second dataset. The three samples (Inoculum, Front 751 Left Foot and Back Right Foot) receive a generous coverage from both runs fluctuations are higher on the first dataset. Average coverage is 4873x (Inoculum), 753 8665x (FLF) and 6594x (BRF) for the first dataset, and 16827x (Inoculum), 11924x 754 (FLF) and 15945x (BRF) for the second dataset. On top of the figure, the sequenced 755 fraction of the genome At few sites, the 762 mismatch frequency is higher; as expected, the number of these peaks is larger in 763 the FLF and BRF than in the Inoculum. A small fraction of sites show perfect 764 agreement of all the reads with the reference genome Obtained by aligning the reads to 768 the reference genome. Panel A: Inoculum, panel B: FLF, panel C: BRF. This second 769 dataset has higher coverage than the first one, and a lower fraction of sites with no 770 mismatches Correlations of polymorphism frequencies in the viral populations The Spearman rank correlation ρ is indicated for each pair 777 of datasets. Only qualitatively validated SSPs receiving coverage above 100x in both 778 runs are shown. The correlation coefficients between the two runs in the Inoculum 779 and FLF are similar, while they are lower for BRF Frequency distribution of the weighted 784 averaged mismatch frequencies between the two runs, for the three samples (the 785 ordinate represents the frequencies of sites showing that fraction of mismatches) Dashed lines: sites receiving 788 coverage of 100 or more in both runs, and classified as validated site specific 789 polymorphisms (SSPs) (2,622 sites for Inoculum, 1,434 for FLF and 1 All lines show a similar trend: a small fraction of the sites (<1%) display no 791 variability in both runs, most of the sites show a very mild amount polymorphism in 792 the viral population (between 0.01% and 1%), while a very small fraction of the 793 sites (0.14% for Inoculum, 0.22% for FLF and 0.39% for BRF) present variation at a 794 level above 1% were tested: only the reads whose average error was below a threshold θ were 801accepted. More stringent thresholds decrease the errors on the reads (small dashed, 802dotted, dot--dashed and dashed lines). The insets show the fraction of reads retained 803 after the filtering process (using a threshold θ=0.2%) and retaining 66% of the reads 804 in the first dataset and 95% of the second dataset. i.e. nucleotides from 5 to 45 in each aligned read. 812 813 814