key: cord-0828607-suig5n17
authors: Roder, AE.; Khalfan, M.; Johnson, KEE.; Ruchnewitz, D.; Knoll, M.; Banakis, S.; Wang, W.; Samanovic, MI.; Mulligan, MJ.; Gresham, D.; Lässig, M.; Łuksza, M.; Ghedin, E.
title: Diversity and selection of SARS-CoV-2 minority variants in the early New York City outbreak
date: 2021-05-06
journal: bioRxiv
DOI: 10.1101/2021.05.05.442873
sha: 53cc032fb2d81822f392490e7bde63830f7a8ab1
doc_id: 828607
cord_uid: suig5n17

High error rates of viral RNA-dependent RNA polymerases lead to diverse intra-host viral populations during infection. Errors made during replication that are not strongly deleterious to the virus can lead to the generation of minority variants. Here we analyzed minority variants within the SARS-CoV-2 data in 12 samples from the early outbreak in New York City, using replicate sequencing for reliable identification. While most minority variants were unique to a single sample, we found several instances of shared variants. We provide evidence that some higher-frequency minority variants may be transmitted between patients or across short transmission chains, while other lower-frequency, more widely shared variants arise independently. Further, our data indicate that even with a small transmission bottleneck, the heterogeneity of intra-host viral populations is enhanced by minority variants present in transmission samples. Our data suggest that analysis of shared minority variants could help identify regions of the SARS-CoV-2 genome that are under increased selective pressure, as well as inform transmission chains and give insight into variant strain emergence.

IMPORTANCE When viruses replicate inside a host, the virus replication machinery makes mistakes. Over time, these mistakes create mutations that result in a diverse population of viruses inside the host. Mutations that are neither lethal to the virus, nor strongly beneficial, can lead to minority variants.
In this study, we analyzed the minority variants in SARS-CoV-2 patient samples from New York City during the early outbreak. We found common minority variants between samples that were closely related and showed that these minority variants may be transmitted from one patient to another. We show that, in general, transmission events between individuals likely contain genetically diverse viral particles, and we find signatures of selection governing intra-host evolution. We conclude that the analysis of shared minority variants can help to identify transmission events and give insight into the emergence of new viral variants.

The circulation of a novel coronavirus was reported in late 2019 out of Wuhan Province, [...] (Supplementary Table 1). We achieved more than 88% coverage of the genome at 5X for all 12 of the NS samples.

To determine the major clades represented within our samples, we mapped them against a global tree using 10,932 global isolates. We characterized the main genetic clades by identifying non-synonymous amino acid mutations that originate in prevalent viral population subtrees and used the Wuhan/Hu-1/2019 strain to root the tree. The New York isolates mapped to two major clades. Ten of the sequences belonged to clade 20C, defined by mutations S:D614G, ORF1b:P314L, ORF3a:Q57H, and ORF1a:T265I, while two sequences, from the two samples from the same patient (NYU-VC-009), mapped to clade 20B, defined by the mutations S:D614G, ORF1b:P314L, N:R203K, N:G204R, and ORF14:G50N (Fig. 2A-B). These two clades were [...]

There were three consensus changes found in all 12 samples, including 5'UTR:C241U, ORF1a:C3037U, and S:A23403G (Fig. 2C). The S:A23403G (aa S:D614G) mutation is a defining mutation associated with European-derived strains of the virus and found to be associated with increased transmission (26, 27).
Of the 20 unique consensus changes, 13 represented non-synonymous changes while seven were synonymous or in non-coding regions. The non-synonymous changes were also found more frequently in multiple samples, representing 62 of the 95 total changes in the data. Of these 95 total changes, the overwhelming majority were transitions, with very few transversions. C to U transitions were the most frequent, followed by G to A and A to G changes (Fig. 2D). As expected, none of the identified consensus changes were unique to our samples; all can be found in many publicly available sequences within the USA East Coast clade.

To identify high confidence minority variants within this data set, we sequenced each sample in duplicate, when starting material allowed (nine of 12 samples). We used a low frequency threshold (0.005) to perform an initial filtering of the minority variants called by timo and compared the minority variants across the replicate sequences. The large majority of minority variants were not reproducible, indicating that they may have been introduced during the amplification or sequencing processes (Fig. 3A). Importantly, we did not find an obvious correlation between viral load and the number of reproducible minority variants in this sample set (r² = 0.271) (Fig. 3B-C). Based on these observations, we filtered our list of variants for only those that existed in both replicates in locations with coverage greater than 200X and an average allele frequency above 0.02. For samples that were only sequenced once due to limited specimen availability, we filtered the minority variants to include only those that were present above our cutoffs and existed in another sample. We used this final list of high confidence minority variants for our analyses.
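The replicate-based filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `VariantCall` record, the `high_confidence_variants` function, and the example positions are hypothetical, and only the two cutoffs (coverage greater than 200X, average allele frequency above 0.02) are taken from the text. The sketch covers samples sequenced in duplicate; the separate rule for singly sequenced samples (presence in another sample) is not implemented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Cutoffs taken from the text; everything else here is illustrative.
MIN_COVERAGE = 200      # reads at the site, required in each replicate
MIN_MEAN_FREQ = 0.02    # average allele frequency across the two replicates


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical record for one minority variant call in one replicate."""
    position: int
    ref: str
    alt: str
    frequency: float
    coverage: int


def high_confidence_variants(
    rep1: Iterable[VariantCall], rep2: Iterable[VariantCall]
) -> List[Tuple[int, str, str, float]]:
    """Keep variants called in both replicates that pass both cutoffs.

    Returns (position, ref, alt, mean_frequency) tuples in rep1 order.
    """
    rep2_by_key = {(v.position, v.ref, v.alt): v for v in rep2}
    kept = []
    for v1 in rep1:
        v2 = rep2_by_key.get((v1.position, v1.ref, v1.alt))
        if v2 is None:
            continue  # not reproducible across replicates
        if v1.coverage <= MIN_COVERAGE or v2.coverage <= MIN_COVERAGE:
            continue  # coverage must exceed 200X in both replicates
        mean_freq = (v1.frequency + v2.frequency) / 2
        if mean_freq <= MIN_MEAN_FREQ:
            continue  # average allele frequency must exceed 0.02
        kept.append((v1.position, v1.ref, v1.alt, mean_freq))
    return kept


# Made-up calls for illustration only (not data from the study).
replicate_a = [
    VariantCall(4285, "C", "U", 0.03, 500),
    VariantCall(100, "G", "A", 0.30, 150),    # fails coverage in this replicate
    VariantCall(9000, "A", "G", 0.012, 800),  # fails mean frequency
    VariantCall(12000, "C", "U", 0.04, 900),  # absent from replicate_b
]
replicate_b = [
    VariantCall(4285, "C", "U", 0.05, 450),
    VariantCall(100, "G", "A", 0.28, 400),
    VariantCall(9000, "A", "G", 0.010, 700),
]
print(high_confidence_variants(replicate_a, replicate_b))
```

Requiring the coverage cutoff in each replicate separately is one reading of "locations with coverage greater than 200X"; applying it to the mean coverage instead would be an equally plausible interpretation.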
Using these cutoffs, we identified 54 minority variants across the 12 NS samples, 29 of which were unique to the samples in which they were detected. High confidence minority variants were detected in eight of the 10 gene coding regions, as well as in the 5' UTR. The highest number of variants were in ORF1a (Fig. 4A). As with the identified consensus changes, there were more transitions than transversions, with C to U transitions accounting for the overwhelming majority of the changes (Fig. 4B). In contrast to the consensus changes, the number of variants was more variable between samples, ranging from as few as one to as many as 13 in one sample (Fig. 4C). Of the 38 different variants identified across the samples, approximately 20% were found in more than one sample. Close to 50% of the shared variants were present in pairs of samples while the others were shared between 3-5 samples. Samples 022 and 023 shared the highest number of variants (Fig. 4C). Thirty-five of the minority variants led to nonsynonymous changes, compared to 12 synonymous changes; both synonymous and nonsynonymous changes were represented within the shared variants (Fig. 4D). There was only one instance of a minority variant that was present at the same location as a consensus change within our data, in ORF1a at amino acid position 1429 (Fig. 4E). Ultimately, we found that most minority variants were unique to a single sample, reinforcing the randomness of errors made by the viral RdRp which result in minority variants.

Transmission of minor variants between hosts.

Upon entry into a new host, the viral population initially grows exponentially. As such, the frequency of a mutation within the population is related to its origination time: a few early mutations of larger frequency are followed by many later mutations of small frequency.
This feature, which is well known in the context of Luria-Delbrück fluctuation assays, can be made quantitative: a mutation originating at time $t$ after the start of growth has an initial frequency $x = \exp(-\lambda t)$, where $\lambda$ is the growth rate of the viral population. If the mutation is nearly neutral, this frequency will stay approximately constant during the subsequent growth process. These [...] replicates) present at an allele frequency above 0.005 and coverage above 200X (Fig. 3A, 5A) and computed the probability that the transmitted viral population is heterogeneous in sequence. To do this, we evaluated the cumulative mutant weight, which is defined as the sum of the [...] and in amino acid sequence [...] (10) < 0.14) (Fig. 6B). [...]

The majority of changes we identified in our data, both at the consensus level and in minority [...]

We also saw only one instance of a minority variant in the same genomic location as a consensus change in our data set, in ORF1a at aa position 1429. We initially expected to see this pattern more frequently, as all mutations in the consensus tree must have been a minority in an intra-host viral population at some point. The fact that we see this pattern infrequently within our data could suggest that selected mutations move from minority to majority very quickly, so that capturing them as minority variants is less likely; or it could suggest the opposite, that it takes a very long time for this change to occur and thus capturing it within a small data set would be rare. This will be an interesting avenue to explore in future studies.

[...] 2, the size of the bottleneck has been reported to be as few as one to as many as one thousand [...]

These data contribute to an argument for transmission of minority variants; however, these [...]

Intersections between the workflow VCF files (produced by Mutect2, Freebayes, timo, VarScan, iVar and haplotype caller) and the golden VCF file were generated using bcftools isec v1.9 (48).
The output from bcftools isec was then analyzed and compared against the respective AF-specific golden VCF to compare allele frequencies using a custom script.

[...] where [...] is the number of singlet variants in host [...] and [...] is the Heaviside step function (Fig. 5A). In these analyses, higher multiplets were excluded because they are a priori unlikely to occur [...]

A pneumonia outbreak associated with a new coronavirus of probable bat origin
A new coronavirus associated with human respiratory disease in China
A Novel Coronavirus from Patients with Pneumonia in China
Coronaviridae Study Group of the International Committee on Taxonomy of V. 2020. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
Cryptic transmission of SARS-CoV-2 in Washington State
Introductions and early spread of SARS-CoV-2 in the New York City area
Sequencing identifies multiple, early introductions of SARS-CoV2
Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics
Moderate mutation rate in the SARS coronavirus genome and its implications
Molecular characterization of SARS-CoV-2 from the first case of COVID-19 in Italy
Shared SARS-CoV-2 diversity suggests localised transmission of minority variants
Haplotype-based variant detection from short-read sequencing. ArXiv 1207.3907v2.
21. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing
An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
A framework for variation discovery and genotyping using next-generation DNA sequencing data
From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline
Simulating
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2
Rapid evolution of RNA genomes
Viral quasispecies
RNA replication errors and the evolution of virus pathogenicity and virulence
A speed-fidelity trade-off determines the mutation rate and virulence of an RNA virus
SARS-CoV-2 hot-spot mutations are significantly enriched within inverted repeats and CpG island loci
Rampant C-->U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term
The Heterogeneous Landscape and Early Evolution of Pathogen-Associated CpG Dinucleotides in SARS-CoV-2
Temporal dynamics of SARS CoV-2 mutation accumulation within and across infected hosts
Reanalysis of deep-sequencing data from Austria points towards
Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2
Li JH, Xu YH. 2021.
Population Bottlenecks and Intra-host Evolution During Human-to-Human Transmission of SARS-CoV-2
SARS-CoV-2 within-host diversity and transmission
Trimmomatic: a flexible trimmer for Illumina sequence data
Fast and accurate short read alignment with Burrows-Wheeler transform
Pheniqs: Fast and flexible quality-aware sequence demultiplexing
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of
Data, disease and diplomacy: GISAID's innovative contribution to global health
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
UFBoot2: Improving the Ultrafast Bootstrap Approximation
TreeTime: Maximum-likelihood phylodynamic analysis
An interactive web-based dashboard to track COVID-19 in real time