key: cord-1029866-fnaw93s8 authors: Yang, Hsin-Chou; Wang, Jen-Hung; Yang, Chih-Ting; Lin, Yin-Chun; Hsieh, Han-Ni; Chen, Po-Wen; Liao, Hsiao-Chi; Chen, Chun-houh; Liao, James C. title: Subtyping of major SARS-CoV-2 variants reveals different transmission dynamics date: 2022-04-11 journal: bioRxiv DOI: 10.1101/2022.04.10.486823 sha: 3e9a82ad19bb44301b244d73cf19847959087831 doc_id: 1029866 cord_uid: fnaw93s8 SARS-CoV-2 continues to evolve, causing waves of the pandemic. Up to March 2022, eight million genome sequences have accumulated, which are classified into five major variants of concern. With the growing number of sequenced genomes, analysis of the big dataset has become increasingly challenging. Here we developed systematic approaches for comprehensive subtyping and pattern recognition for transmission dynamics. By analyzing the first two million viral genomes as of July 2021, we found that different subtypes of the same variant exhibited distinct temporal trajectories. For example, some Delta subtypes did not spread rapidly, while others did. We identified sets of characteristic single nucleotide variations (SNVs) that appeared to enhance transmission or decrease efficacy of antibodies for some subtypes of the Delta and Alpha variants. We also identified a set of SNVs that appeared to suppress transmission or increase viral sensitivity to antibodies. These findings are later confirmed in an analysis of six million genomes as of December 2021. For the Omicron variant, the dominant type in the world, we identified the subtypes with enhanced and suppressed transmission in an analysis of seven million genomes as of January 2022 and further confirmed the findings in a later analysis of eight million genomes as of March 2022. While the “enhancer” SNVs exhibited an enriched presence on the spike protein, the “suppressor” SNVs are mainly elsewhere. Disruption of the SNV correlation largely destroyed the enhancer-suppressor phenomena. These results suggest the importance of fine subtyping of variants, and point to potential complex interactions among SNVs. shows that the frequency of strains represented by CSSs increased with time, and CSSs almost completely represent the genome variations after July 2020. and Omicron CSSs are provided in Fig. 2B, Fig. 3B, and Fig. 4B , respectively. To characterize and subtype strain variations in more detail, the CSS approach can be applied to individual countries. As the highly contagious variant Delta was first discovered in India, we further subtyped the Delta strain sequenced in India using the CSS approach, which resulted in eleven subtypes ( Figs. 2A and 2B) . For illustration, we focus on the first six subtypes. The first three subtypes (Delta-01, Delta-02, and Delta-03) exhibited increasing temporal trajectories and the other three subtypes (Delta-04, Delta-05, and Delta-06) had much lower temporal trajectories ( Fig. 2A) . Remarkably, the same pattern of the differential temporal trajectories for the subtypes was found in many other countries consistently (Fig. S9) , indicating that the subtypes and their differences are reproducible. The strains in the six Delta subtypes carry all or the majority of the signature SNVs defined for the Delta variant (T19R, T478K, D950N, D614G, L452R, and P681R in the spike protein, 12 SNVs in other proteins, and 1 in the 3' untranslated region) (Fig. 2B ). Yet, the subtypes exhibited distinct temporal trajectories. The first three subtypes with rising temporal trajectories (Delta-01 to 03 in Figs. 2A and 2B) are defined by eight, nine, and eleven signature SNVs, respectively, with 100% allelic associations. We define these signature SNVs as "transmission enhancers", since they are strongly associated with the rapid rise in proportion. The remaining three subtypes with lower temporal trajectories (Delta-04 to 06 in Figs. 2A and 2B) all contain a set of 100% associated signature SNVs, in addition to the enhancer SNVs in some strains. It appears that these signature SNVs "suppressed" the rise of the temporal trajectories. Thus, we define them as "transmission suppressors". As the enhancer SNVs are 100% associated in Delta-01 to 03, we looked for similar strains without the complete set of the enhancer SNVs. The result shows that strains missing any one of the 8 enhancer SNVs in Delta-01 (Fig. 2C) , missing any of the 9 enhancer SNVs in Delta-02 (Fig. 2D) , or missing any of the 11 enhancer SNVs in Delta-that the enhancer SNVs work cooperatively to gain viral fitness (i.e., a synergistic or positive cooperativity effect), and disfavors the possible hitchhiking passenger roles played by some SNVs. In contrast to the first three subtypes, the other three subtypes (Delta-04 to 06) exhibit different temporal patterns ( Fig. 2A) . The majority of the strains in these three subtypes also carried the signature SNVs in the spike protein for the Delta variant (i.e., the T19R-L452R-T478K-D614G-P681R-D950N haplotype) (Fig. 2B) . Nevertheless, the temporal trajectories of the three subtypes are suppressed dramatically after acquiring the suppressor SNVs located mainly on the non-spike proteins (Figs. 2F -2H) . Delta-04 carries the transmission suppressors C20320T (nsp15), G29427A (N), C24745T (S), C6539T (nsp3), and exhibited the low temporal trajectory ( Figs. 2A -2B) . Delta-05 and Delta-06 carried overlapping but different sets of transmission suppressors with some synonymous SNVs (Fig. 2B) . These two subtypes also generally suppressed the rise of the temporal proportion, although not as effective as Delta-04 ( Fig. 2A) . Note that the defining SNVs for Delta-04 to 06 subtypes (CSSs) contain both suppressors and enhancers. We examined the effect of acquiring defining SNVs, starting from strains containing only the enhancers but not the suppressors (Figs. 2F--2H). Strains containing the 5, 3, and 3 enhancers in the defining SNV sets of Delta-04 to Delta-06, respectively, but not the suppressors all exhibited a rising temporal pattern. However, when they acquired all the suppressor SNVs, the rising patterns were suppressed. We did not find sufficient samples or presence of partial set of suppressors. Therefore, whether the transmission suppressor act independently or cooperatively to reduce the temporal trajectories remains inconclusive. However, since we did not find partial suppression of the temporal patterns, it is likely that the suppressors also work cooperatively (see a discussion in the subsection "Suppression effect"). These transmission suppressor SNVs play a suppressing role on the viral transmission through genetic epistasis with other signature SNVs. Since some of the suppressor SNVs in Delta-04 to 06 are synonymous SNVs, the effect may come from codon usage or RNA-level interactions, although the hitchhiking effect cannot be ruled out. Remarkably, the identified enhancer and suppressor SNVs can be further confirmed in an analysis of six million SARS-CoV-2 genomes (Fig. S10) . This reflects the robustness of the identified enhancer and suppressor SNVs. We also applied the CSS-based approach to subtype the Alpha variants identified in UK ( Fig. 3A) , the country where Alpha was first discovered. The first subtype (Alpha-01) presents the much higher temporal trajectory than the other two subtypes (Alpha-02 and Alpha-03). These Alpha subtypes were also found in many other countries, and showed the similar temporal trajectory pattern (Fig. S11) . We defined the signature SNVs of Alpha-01 as transmission enhancers, as they enhanced the transmission of the variant (Fig. 3B) . Alpha-02 or 03 contained some of these enhancers, but most importantly, they acquired a set of suppressors that appeared to suppress the rise of the temporal trajectory (Fig. 3B) . The proportion of the strains pertaining to the first subtype was reduced dramatically if any of the transmission enhancers were missing (Fig. 3C) . This finding suggests that the enhancers work cooperatively. Similar to the Delta subtypes, we did not find sufficient samples or presence of partial set of suppressors (Figs. 3D -3E) . Therefore, whether the transmission suppressor act independently or cooperatively to reduce the temporal trajectories remains inconclusive. As of 12 Jan, 2022 (n = 7,026K), we applied the CSS-based approach to subtype the Omicron variants in the United Kingdom with a larger sample size compared to other countries (Fig. 4A) . Omicron is known as the variant with a large number of signature SNVs especially in the spike protein. The first and second subtypes (Omicron-01 and 02) present the much higher temporal trajectory than the third subtype (Omicron-03). These Omicron subtypes were also found in many other countries, and showed the consistent temporal trajectory patterns (Fig. S12) . We defined the signature SNVs of Omicron-01 and 02 as transmission enhancers, as they enhanced the transmission of the variant (Fig. 4B) . Omicron-03 contained some of these enhancers, but most importantly, they acquired a set of suppressors that appeared to suppress the rise of the temporal trajectory (Fig. 4B) . Furthermore, these findings were successfully confirmed in an analysis of eight million SARS-CoV-2 genomes accumulated as of 23 Feb, 2022 (n = 8,475K) (Fig. S13) . The proportion of the strains pertaining to the first two subtypes were reduced dramatically if any of the transmission enhancers were missing (Fig. 4C) . This finding suggests that the enhancers work cooperatively. Similar to the Delta and Alpha subtypes, we did not find sufficient samples or presence of partial set of suppressors ( Figs. 4D -4E) . Therefore, whether the transmission suppressor act independently or cooperatively to reduce the temporal trajectories remains inconclusive. However, the current observation reveals that the four suppressor SNVs appeared together since the first emergence in November 2021. The set of four suppressor SNVs has a stronger effect on transmission suppression compared to the subsets that they miss any suppressor SNVs (see a discussion in the subsection "Suppression effect"). A further analysis of a larger sample size (n = 8,475K) as of 23 Feb, 2022 identified seven Omicron CSSs (Figs. S14A and S14B), named as Omicron-04 ~ Omicron-10. The first six Omicron subtypes Omicron-04 ~ Omicron-9 exhibited a pattern of enhanced transmission and the final subtype Omicron-10 suppressed transmission. Patterns of enhanced and suppressed transmission in these CSSs were also confirmed in many countries (Figs. S14C -S14V). Compared to the previous results (n = 7,026K) in Because of a high allelic association in the transmission suppressor SNVs L3290L(nsp5)-I1081V(S)-L106F(ORF3a)-D343G(N), more data are needed to distinguish that the suppressing effect is contributed by I1081V solely and/or the full set of transmission suppressor SNVs. SARS-CoV-2 is an RNA virus and can readily acquire mutations during the replication process and generate new variants and subtypes. The subtypes arising from the recent common ancestral variant may have highly correlated SNVs but remarkably different genomic sequences and transmission patterns. In this study, we developed a systematic dimension reduction approach to characterizing the viral subtypes based on the correlated SNV sets with allelic association and monitoring their emergence and growth. We also developed a pattern recognition approach to grouping CSSs and detecting the sets of transmission enhancers and suppressors. By analyzing 8 million genome sequences of SARS-CoV-2, we provided real-world evidence for the viral subtypes. The identified subtypes exhibit differential temporal trajectories. The patterns can be characterized by the sets of transmission enhancers and suppressors located both on the spike protein and elsewhere. This highlights the importance of SNVs in both spike and non-spike proteins. Spike-protein signature SNVs are often used as a proxy for diagnosing variants and explaining increasing viral transmissibility. Our result shows that almost all CSSs contain SNVs on the spike protein (1,053/1,057 = 99.62%), and only 4 CSSs do not contain any defining SNVs on spike. In total, 37.16% of the defining SNVs of a CSS are located in the spike protein, which is larger than the proportion of SNVs in the spike protein in the whole genome (3,822/29,903 = 12.78%). Remarkably, spike-protein is highly related to allelic association; spike-protein is characterized by the high ratio of non-synonymous SNVs vs. synonymous SNVs with allelic association, frequent intergenic allelic association (i.e., Spike-Nucleocapsid association), and frequent intragenic allelic association. These results indicate that spike is constantly under selection and that it is the most important protein coded by the viral genome which determines the overall viral fitness and transmission effectiveness. The non-spike protein signature SNVs have been commonly found in the important variants, however, their roles have been largely under-appreciated. Our results reveal that the non-spike-proteins signature SNVs provide subtle information for the subtypes of a variant. In addition, the non-spike-proteins signature SNVs are also relevant to the viral transmissibility. Allelic association provides a direction for investigating mechanistic interactions. The non-spike-proteins SNVs can directly elevate or reduce the viral transmissibility through a direct genetic epistasis with other SNVs (i.e., genetic buffering). On the other hand, the observed association of non-spike protein SNVs may be explainable by genetic hitchhiking. However, this explanation is disfavored because we did not find "hitchhikers" of less than the whole set of suppressors. The current vaccines have been designed targeting at the spike protein. Our results suggest that the non-spike-proteins SNVs should also be considered to improve the sensitivity of SARS-CoV-2 variants such as Omicron, Delta, and Alpha to pharmacological intervention. Our results show that the ratio of nonsynonymous versus synonymous amino acid change (r) is much higher in the transmission enhancer SNVs compared to the suppressor SNVs. Interestingly, the Alpha variant has a high r value (r = 4.5) for the transmission enhancer SNVs, and the Delta variant has an even higher r value (r = 11.5). Moreover, the Delta variant has a r value (r = 1.375) for the transmission suppressor SNVs much lower than the Alpha variant (r = 4). A higher nonsynonymous vs. This partially reflects the dominance of Delta compared to other variants. Compared to the previous variants, Omicron is known as a variant with a significant enrichment of spike signature SNVs. SNVs in the receptor binding domain (RBD) of the spike protein (S) can alter the affinity to the angiotensin converting enzyme 2 (ACE2) receptor and potentially cause vaccine escape [7, 19, 20] . Omicron has exhibited a more rapid transmission than previous Variants of Concern. Because of a lasting evolution of SARS-CoV-2, there is an unmet need to systematically track the dynamic changes of the viral subtypes, understand their signature SNVs, and improve our understanding and controlling for the COVID-19 pandemic. However, the computation becomes a hurdle when the number of genomes exceeds a million scale. We find that allelic association is a hallmark for an emergence and growth of a subtype and therefore can be employed to detect a strain subtype. We downloaded and preprocessed 1,180K, 1,589K, 1,872K, 2,215K, 6,316K, 7,205K, Deletions were also detected. Annotation of the SNVs was collected from CNCB (https://bigd.big.ac.cn/ncov/release_genome). Statistical analyses were performed by using our self-developed R codes. CSS subtyping was established based on the proposed analysis procedures (Fig. S5A) . Matrix representation for a dimension reduction of the proposed CSS analysis procedure is provided (Fig. S6) Transmission enhancer SNVs and transmission suppressor SNVs were detected by using the proposed procedures (Fig. S5B) CSSs. An optimal parameter vector is critical for the decision rule (a classifier of CSSs). The parameter vector , , , , was updated and optimized by using Particle Swarm Optimization [29] as follows: where is the weight of (default: 0.9 ); acceleration constant for individuals ("particle") ~Uniform 0, (default: 0.2 ); acceleration constant for population ("swarm") ~Uniform 0, (default: 0.2); is the number of initial random particles (default: 200). is the best state for individual at iteration and is the best state for population at iteration . In each updating of and , the best states and were updated simultaneously as follows: Here we considered an objective function (i.e., misclassification frequency in CSSs) as follows: in Fig. 4 is illustrated as an example. In each subfigure, three curves indicate the temporal trajectories for the three subgroups as follows: (i) "Omicron-03" (Fig. 4B) carries Data, disease and diplomacy: GISAID's innovative contribution to global health. Global Challenges Nextstrain: real-time tracking of pathogen evolution A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Characteristics of SARS-CoV-2 and COVID-19 Analysis of genomic distributions of SARS-CoV-2 reveals a dominant strain type with strong allelic associations Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes Cell entry mechanisms of SARS-CoV-2 Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2 Coronavirus biology and replication: implications for SARS-CoV-2 A Genomic Perspective on the Origin and Emergence of SARS-CoV-2. Cell Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins A Novel Coronavirus from Patients with Pneumonia in China Phylogenetic network analysis of SARS-CoV-2 genomes The authors declare no competing interests. Fig. 2 is illustrated as an example. In each subfigure, three curves indicate the temporal trajectories for three subgroups: (i) "Delta-11" (Fig. 2B) carries a full set of transmission suppressor SNVs: T1773T(nsp3)-T3750I(nsp6)-L4252L(nsp9)-A222V(S) (green color and circle symbol); (ii) "Delta-11 with T1773T(nsp3)-T3750I(nsp6)-L4252L(nsp9)" indicates the Delta strains that they carry the signature SNVs similar to Delta-11 but miss a suppressor SNV A222V(S), i.e., the Delta strains which carry only the suppressor triplet T1773T(nsp3)-T3750I(nsp6)-L4252L(nsp9) (blue color and x symbol); (iii) "Delta with A222V(S)" indicates the Delta strains with a transmission suppressor SNV in the spike protein A222V (red color and triangle symbol). The number of total strains per date (gray bar) is displayed with the histogram in the background. "Delta with A222V(S)" has a higher temporal trajectory than "Delta-11", indicating that the set of four transmission suppressor SNVs work cooperatively and have a stronger synergistic effect in suppressing transmission suppression than a single SNV A222V(S). The curves for "Delta-11" and "Delta-11 with T1773T(nsp3)-T3750I(nsp6)-L4252L(nsp9)" are very close, reflecting that the four suppressor SNVs T1773T(nsp3)-T3750I(nsp6)-L4252L(nsp9)-A222V(S) have a high allelic association, and therefore it's hard to observe any missing suppressor SNVs from the set of transmission suppressor SNVs.