key: cord-0783082-unhl0eem
authors: Pater, Adrian A.; Bosmeny, Michael S.; Barkau, Christopher L.; Ovington, Katy N.; Chilamkurthy, Ramadevi; Parasrampuria, Mansi; Eddington, Seth B.; Yinusa, Abadat O.; White, Adam A.; Metz, Paige E.; Sylvain, Rourke J.; Hebert, Madison M.; Benzinger, Scott W.; Sinha, Koushik; Gagnon, Keith T.
title: Emergence and Evolution of a Prevalent New SARS-CoV-2 Variant in the United States
date: 2021-01-19
journal: bioRxiv
DOI: 10.1101/2021.01.11.426287
sha: 3a51670bbdbc412ebd9ada0bf918313e2bcb6d6c
doc_id: 783082
cord_uid: unhl0eem

Genomic surveillance can lead to early identification of novel viral variants and inform pandemic response. Using this approach, we identified a new variant of the SARS-CoV-2 virus that emerged in the United States (U.S.). The earliest sequenced genomes of this variant, referred to as 20C-US, can be traced to Texas in late May of 2020. This variant circulated in the U.S. uncharacterized for months and rose to recent prevalence during the third pandemic wave. It initially acquired five novel, relatively unique non-synonymous mutations. 20C-US is continuing to acquire multiple new mutations, including three independently occurring spike protein mutations. Monitoring the ongoing evolution of 20C-US, as well as other novel emerging variants, will be essential for understanding SARS-CoV-2 host adaptation and predicting pandemic outcomes.

In early 2020 the World Health Organization declared that coronavirus disease 2019 (COVID-19), a potentially fatal respiratory infection caused by SARS-CoV-2, was a global pandemic (1). The high number of SARS-CoV-2 infections worldwide over time has presented the virus with ample opportunity to acquire new mutations. It has been suggested that some mutations already present a fitness advantage for the virus. Notably, the D614G mutation, observed early in the pandemic, is thought to increase the transmissibility of the virus (2, 3).
The N501Y mutation of the spike protein (S) has been implicated in the rapid spread of new variants in the United Kingdom and South Africa (4-6). A growing number of spike protein mutations could enable immune evasion and reduce vaccine efficacy (7).

The U.S. has experienced a surge in cases during the third pandemic wave over the fall of 2020 and winter of 2020/2021. While many variables are likely to drive the increase in cases, it is possible that emergence of a more fit or transmissible SARS-CoV-2 variant could be a contributing factor (5).

Restrictions in population movement during a global pandemic, as well as the rapid acquisition of multiple mutations, could drive emergence of novel region-specific variants. This evolutionary paradigm might explain the rise of distinct SARS-CoV-2 variants now being observed around the world during the COVID-19 pandemic (5, 6, 8).

Here we report the characterization of a SARS-CoV-2 variant, 20C-US, that emerged in and has remained mostly confined to the U.S. Its quiet rise to prominence among other circulating variants in the late summer and early fall of 2020 coincides with the third U.S. pandemic wave. Based on existing genomic data, we predict that this variant may already be the dominant variant of SARS-CoV-2 in the U.S., likely accounting for the majority of COVID-19 cases. In addition to the five signature mutations of the 20C-US variant, new mutations continue to accrue. These include protease, nucleocapsid, and spike protein mutations that highlight the ongoing evolution of SARS-CoV-2.

Results:

Genomic and phylogenetic characterization of 20C-US, a prevalent new SARS-CoV-2 variant in the U.S.

In response to anticipated genetic changes occurring in the SARS-CoV-2 virus, we began sequencing viral genomes for genomic epidemiology and surveillance. With sequencing focused on the U.S.
upper Midwest in the state of Illinois, we generated full genome sequences from samples taken beginning in March 2020 to present. During phylogenetic reconstruction with our Illinois genome sequences, a particular branch within the 20C clade became noticeably more pronounced (Fig. 1A). We identified five closely co-occurring signature mutations that appeared synapomorphic to the new clade within 20C. These mutations resulted in amino acid changes of N1653D and R2613C in ORF1b, G172V in ORF3a, and P67S and P199L in the nucleocapsid (N) gene, the last of which also introduces a stop codon mutation at position Q46 of ORF14 (Table 1).

(Fig. 1C). substantial fraction of genomes comprised this new variant for most U.S. states (Fig. 1C)

To further characterize 20C-US, we identified all GISAID samples with the signature mutations ORF3a:G172V, ORF1b:N1653D, and N:P67S that also possessed N:P199L or any mutation at position 2613 of ORF1b. We then reconstructed a phylogenetic tree with these 4681 sequences. A branching event in this new tree was initiated by two new mutations: a synonymous nucleotide mutation, C14805T, co-occurring with a non-synonymous mutation of the ORF1a gene that changes M2606 to I2606 (Table 2, Fig. 2A). The co-occurrence of these two mutations in the 20C-US lineage was first observed

Within the new clade defined by ORF1a:M2606I, three additional branches of significance emerge. One is a nucleocapsid mutation at position 377, converting an aspartate (D) to tyrosine (Y) (Table 2). Since the summer months of 2020, this mutation has occurred many times in the 20A, 20B, and 20C clades. However, it has clearly established a distinct and well-developed branch within the ORF1a:M2606I lineage of 20C-US (Fig. 2C). When viewed geographically, its distribution closely mirrors that of the ORF1a:M2606I 20C-US genotype (Fig. 2D).
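The sample selection rule described above (three required signature mutations, plus either N:P199L or any change at ORF1b position 2613) can be written as a small predicate. This is a minimal sketch only, not the authors' pipeline: the `GENE:<ref><position><alt>` label format, the function names, and the regular expression are all assumptions made for illustration.

```python
import re
from typing import Iterable

# Signature amino acid changes that every selected sample must carry (from the text).
REQUIRED = frozenset({"ORF3a:G172V", "ORF1b:N1653D", "N:P67S"})

# Assumed label format "GENE:<ref><position><alt>", e.g. "N:P199L" (hypothetical convention).
_LABEL = re.compile(r"^([^:]+):([A-Z*])(\d+)([A-Z*-]+)$")


def mutates_site(label: str, gene: str, position: int) -> bool:
    """Return True if the label describes any change at the given gene position."""
    match = _LABEL.match(label)
    return bool(match) and match.group(1) == gene and int(match.group(3)) == position


def matches_20c_us(mutations: Iterable[str]) -> bool:
    """Apply the selection rule to one sample's list of amino acid changes."""
    muts = set(mutations)
    # All three required signature mutations must be present.
    if not REQUIRED <= muts:
        return False
    # Plus N:P199L, or any substitution at ORF1b position 2613.
    return "N:P199L" in muts or any(mutates_site(m, "ORF1b", 2613) for m in muts)
```

For example, a sample carrying the three required changes together with a hypothetical `ORF1b:R2613H` label would be selected, while a sample lacking `ORF1b:N1653D` would not.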
nsp16, which would be predicted to disrupt hydrogen bonding to an adjacent glutamate and structured water molecule and possibly alter local protein stability (Fig. 3B). These two relatively unique mutations co-occur in 20C-US and could potentially alter genome integrity, mutation retention, transcript integrity, and translation efficiency of viral messenger RNA.

The two largest SARS-CoV-2 viral RNA transcripts, ORF1a and ORF1b, are translated into polyproteins that must be further processed by proteases to release mature, functional viral nsp proteins. Two proteases within the ORF1a gene are responsible for this processing, nsp3 and nsp5 (12). The parental ORF1a:L3352F mutation carried by 20C-US creates a mutation of interactions with viral or cellular factors (22). ORF3a plays a role in viral particle maturation and release at the cell membrane and has been proposed to co-mutate with the spike protein (23). African SARS-CoV-2 variant 501Y.V2.

The 20C-US variant has also recently acquired a Q677H or a Q173K mutation in the spike protein. Q677H is directly adjacent to the furin cleavage site. The furin cleavage site is a novel motif not observed in SARS-CoV viruses that is proposed to significantly enhance infectivity (24). Furin cleavage is a critical priming step essential for efficient entry of SARS-CoV-2 viruses into cells (24). Another mutation of interest is P681H in the spike protein of the novel UK variant 501Y.V1, also because of its close proximity to the furin cleavage site (5). Q677 and P681 are mutated to histidine in 20C-US and 501Y.V1, respectively, suggesting a potentially important effect of histidine near the furin cleavage site. The Q677 amino acid resides in a similar region on the spike protein as D614, which is commonly mutated to a G residue (3) (Fig. 3D).

S:Q173K resides in the S1A domain near the receptor binding domain (RBD), where the well-known E484K and N501Y mutations are found (5, 6).
The S1A domain, like the RBD, exhibits low conservation, which helps SARS-CoV-2 adapt to host cells and host immunity. Although antibodies against the SARS-CoV-2 spike protein are believed to primarily target the RBD (25), antibodies isolated from COVID-19 convalescent patients have been found to bind very tightly to the S1A domain (26). Conversion from a neutral to charged amino acid might alter interactions important for antibody recognition.

In addition, N501Y and N501T mutations in the spike protein are beginning to occur in the 20C-US variant. The significance of S:N501 mutations has been underscored by their recurrence in apparently highly transmissible forms of the SARS-CoV-2 virus (5, 6). These mutations occur in the RBD and are directly implicated in modulation of host cell interaction (Fig. 3D).

We have characterized the emergence and rise of a prevalent SARS-CoV-2 variant within the 20C clade that is highly specific to the continental U.S. 20C-US is predicted to soon surpass 50% penetrance to become the dominant variant in the U.S. (Fig. 4A). It is unclear whether natural selection or genetic drift has driven the rise in prevalence of 20C-US. Nonetheless, its dominance has been largely achieved during the third pandemic wave when cases have risen significantly (Fig. 4B). During this period, Google mobility data are consistent with no major changes in population movement patterns across the U.S. that could account for an increase in the proportion of 20C-US to other variants (Fig. 4C). Recent studies on hospital care for COVID-19 patients in the U.S. indicate that adjusted mortality rates are decreasing and patient outcomes are improving (27, 28). Taken together, these observations suggest the possibility that 20C-US may have some degree of increased transmissibility but not a significantly increased disease severity.
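The notion of penetrance used above, the share of sequenced genomes in a period that belong to the variant, can be illustrated with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the counts in the usage note are made-up toy numbers, not data from this study or from GISAID.

```python
from typing import Dict, Optional, Tuple


def variant_fraction(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each month to variant_genomes / total_genomes.

    counts maps a month label such as "2020-10" to a pair
    (variant_genomes, total_genomes). Months with no genomes yield NaN.
    """
    return {
        month: (variant / total if total else float("nan"))
        for month, (variant, total) in counts.items()
    }


def first_month_above(fractions: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the earliest month whose fraction exceeds the threshold, or None."""
    for month in sorted(fractions):  # "YYYY-MM" labels sort chronologically
        if fractions[month] > threshold:
            return month
    return None
```

With toy counts such as `{"2020-09": (10, 100), "2020-10": (30, 100), "2020-11": (60, 100)}`, the second function would report the month in which the variant first exceeds half of the sequenced genomes. Sequenced genomes are not a random sample of cases, so a fraction computed this way is only a proxy for true prevalence.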
The biological effects of the combined 20C-US mutations, as well as the viral characteristics of 20C-US, like fitness, transmissibility, and virulence, remain to be experimentally characterized. However, we note that all signature mutations that occurred during the establishment of the 20C-

The mechanism and rate of SARS-CoV-2 transmission necessitate strict measures that effectively limit population movement (29). Since the outbreak of the global COVID-19 pandemic, international travel has become highly restricted. Novel variants that emerge in an isolated region or country may transmit locally among that population and develop distinct genotypes and phenotypes. Thus, it would be expected that regional territories would develop their own distinct SARS-CoV-2 variants over time. When searching for novel emerging variants, focusing on local and regional data may provide an advantage. Our ability to identify the 20C-US variant can be partly attributed to our initial focus on Illinois state-level data, since the prevalence of the 20C-US variant was more pronounced in the U.S. Midwest.

While this manuscript was in preparation, the Nextstrain group updated their global phylogenetic analysis server for SARS-CoV-2 to begin designating emerging clades. We found that the 20C-US variant closely tracks with the newly designated 20G clade, demonstrating that this approach will be valuable in helping to identify new potential variants or clades of interest. When (Fig. 4D). A detailed assessment of the emergence and rise to prevalence should also be undertaken for these variants. Unless successful vaccination efforts can be greatly accelerated, we predict the emergence of dominant novel variants in many global regions that are relatively isolated, possibly including Brazil, New Zealand, the African west coast, and Japan.
This study underscores the need for greater genomic surveillance of the SARS-CoV-2 virus, especially at the regional level where novel variants will first emerge. Modern genomic surveillance enables observation of evolution in near real-time, prediction of major shifts in viral fitness, and assurance that vaccines are kept current.

Sequencing of SARS-CoV-2 Samples

Briefly, lower-quality samples (those with large numbers of gaps or lacking sufficient metadata) are filtered out, then the dataset is sorted by the month and the U.S. state in which each sample was acquired. From each of these subsets, Nextstrain attempts to randomly pick an equal number of samples. All these samples are then recombined and processed together.

Data and materials availability: Genome sequences are currently being submitted to, and will be available through, the GISAID initiative (https://www.gisaid.org/). Gagnon laboratory Illinois genome sequence results can be visualized and analyzed on their Nextstrain group page: https://nextstrain.org/groups/illinois-gagnon-public.

WHO declares COVID-19 a pandemic
Spike mutation D614G alters SARS-CoV-2 fitness
Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
Early transmissibility assessment of the N501Y mutant strains of SARS-CoV-2 in the United Kingdom
Estimated transmissibility and severity of novel SARS-CoV-2
Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv
Polymorphism and selection pressure of SARS-CoV-2 vaccine and diagnostic antigens: implications for immune evasion and serologic diagnostic performance
Geographic and Genomic Distribution of SARS-CoV-2 Mutations
Nextstrain: real-time tracking of pathogen evolution
Epistasis and the adaptability of an RNA virus
Viral quasispecies evolution. Microbiology and molecular biology reviews: MMBR
Coronavirus biology and replication: implications for SARS-CoV-2
The Enzymatic Activity of the nsp14 Exoribonuclease Is Critical for Replication of MERS-CoV and SARS-CoV-2
Structural basis and functional analysis of the SARS coronavirus nsp14-nsp10 complex
Coronavirus nonstructural protein 16 is a cap-0 binding enzyme possessing (nucleoside-2'O)-methyltransferase activity
Crystallographic structure of wild-type SARS-CoV-2 main protease acyl-enzyme intermediate with physiological C-terminal autoprocessing site
Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein
The SARS Coronavirus 3a protein causes endoplasmic reticulum stress and induces ligand-independent downregulation of the type 1 interferon receptor
Cell type-specific cleavage of nucleocapsid protein by effector caspases during SARS coronavirus infection
A di-acidic signal required for selective export from the endoplasmic reticulum
SARS-CoV-2 and ORF3a: Nonsynonymous Mutations, Functional Domains, and Viral Pathogenesis. mSystems 5
The role of severe acute respiratory syndrome (SARS)-coronavirus accessory proteins in virus pathogenesis
Molecular conservation and differential mutation on ORF3a gene in Indian SARS-CoV2 genomes
A Multibasic Cleavage Site in the Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells
Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2
Structures of Human Antibodies Bound to SARS-CoV-2 Spike Reveal Common Epitopes and Recurrent Features of Antibodies
Variation in US Hospital Mortality Rates for Patients Admitted With COVID-19 During the First 6 Months of the Pandemic
Trends in COVID-19 Risk-Adjusted Mortality Rates
Scientific and ethical basis for social-distancing interventions against COVID-19. The Lancet. Infectious diseases
MAFFT version 5: improvement in accuracy of multiple sequence alignment
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
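The subsampling step described under "Sequencing of SARS-CoV-2 Samples" (filter out low-quality samples, group by month and U.S. state, then randomly draw an equal number from each group) can be sketched as follows. This is an illustrative sketch, not Nextstrain's actual implementation: the record fields (`gap_fraction`, `month`, `state`), the 5% gap threshold, and the function name are assumptions introduced here.

```python
import random
from collections import defaultdict
from typing import Dict, List


def subsample(records: List[Dict], per_group: int,
              max_gap_fraction: float = 0.05, seed: int = 0) -> List[Dict]:
    """Draw up to per_group records from each (month, state) group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for rec in records:
        # Quality filter: drop samples with too many gaps (threshold is hypothetical).
        if rec.get("gap_fraction", 1.0) > max_gap_fraction:
            continue
        # Metadata filter: both month and state are needed to assign a group.
        if not rec.get("month") or not rec.get("state"):
            continue
        groups[(rec["month"], rec["state"])].append(rec)

    picked = []
    for key in sorted(groups):
        members = groups[key]
        # Small groups contribute everything they have, so the per-group
        # counts are equal only where enough samples exist.
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked
```

Capping each group at the same size keeps heavily sequenced states or months from dominating the combined dataset, which is why the text says an equal number of samples is attempted rather than guaranteed.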