key: cord-0925675-f8t6h1ys authors: Weng, Shenghui; Shang, Jingzhe; Cheng, Yexiao; Zhou, Hangyu; Ji, Chengyang; Yang, Rong; Wu, Aiping title: Genetic differentiation and diversity of SARS-CoV-2 Omicron variant in its early outbreak date: 2022-04-25 journal: Biosaf Health DOI: 10.1016/j.bsheal.2022.04.004 sha: 060bb96dcacbbc5d353a68a0d9c7426049cdb85c doc_id: 925675 cord_uid: f8t6h1ys
The recently emerged Omicron variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has quickly spread to most countries. Although many consensus mutations of the Omicron variant have been recognized, little is known about its genetic variation during its transmission in the population. Here, we comprehensively analyzed the genetic differentiation and diversity of the Omicron variant during its early outbreak. We found that Omicron accumulated more structural variations, especially deletions, on the SARS-CoV-2 genome than the other four variants of concern (Alpha, Beta, Gamma, and Delta) in the same timescale. In addition to its 50 consensus mutations, the Omicron variant acquired seven notable new non-synonymous nucleotide substitutions during its spread. Three of them are on the S protein: S_A701V, S_L1081V, and S_R346K, the last of which lies in the receptor-binding domain (RBD). Based on these seven novel mutations, the Omicron BA.1 branch could be divided into five divergent groups spreading across different countries and regions. Furthermore, by assessing the relationship between genetic diversity and transmission rate, we found that the Omicron variant possesses more mutations than the other SARS-CoV-2 variants, which is related to its faster transmission rate. These findings indicate that more attention should be paid to the significant genetic differentiation and diversity of the Omicron variant for better disease prevention and control.
Since the Coronavirus Disease 2019 (COVID-19) outbreak in December 2019, the frequent emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants has raised significant concerns [1]. To prioritize the monitoring of noteworthy SARS-CoV-2 variants, the World Health Organization (WHO) divided highlighted variants into three categories: variants of concern (VOCs), variants of interest (VOIs), and variants under monitoring (VUMs). Omicron is currently the variant with the most mutations, carrying 50 characteristic mutations, 31 of which are on the S protein [4]. Some of these characteristic mutations are also present in other variants [5], while others are unique to Omicron, such as G339D, S375F, G446S, and Q498R [6]. In the first month of its circulation in the human population, Omicron split into four lineages: B.1.1.529, BA.1, BA.2, and BA.3. At the beginning of the Omicron variant's spread, BA.1 was the dominant lineage. Recently, however, the number of sequenced BA.2 genomes in Denmark increased rapidly, and BA.2 has become the dominant lineage there [7]. These observations indicate that the Omicron variant can evolve and differentiate quickly while spreading at high speed. Furthermore, some studies have confirmed that the Omicron variant has a significantly greater immune escape capability than the SARS-CoV-2 strain from the 2019 outbreak. Vaccinated and previously infected people are still at an extremely high risk of being infected with Omicron [8-11]. Although a preliminary understanding of Omicron mutations has been achieved, its internal dynamic evolution and genetic differentiation during transmission in the human population remain unknown. Therefore, it is urgent to understand the evolutionary progress of Omicron in the early outbreak stage, which will significantly help the prevention and control of the Omicron variant's spread.
The genome sequences of SARS-CoV-2 were downloaded from GISAID [12]. The multiple sequence alignment (MSA) and spatiotemporal (metadata) files were downloaded on December 12, 2021. Because the MSA file is updated with a delay, we also downloaded the raw Omicron sequence data and processed them in the same way as the sequences in the MSA file. After excluding sequences with more than 5% unknown bases (N), there were 12,304 Omicron sequences up to December 20, 2021. To analyze the genetic diversity in the later period, we further downloaded 103,688 Omicron sequences collected in England up to January 8, 2022. These sequences were individually aligned to the reference WIV04 (EPI_ISL_402124) by MAFFT [13]. The early-stage sequences of the Alpha, Beta, Gamma, and Delta variants were extracted from the MSA file to form two datasets, one with the same number of sequences and one with the same time duration as the Omicron dataset mentioned above. The initial time point for each of the other four variants (Alpha, Beta, Gamma, and Delta) was set at the day when it had 100 genome sequences.
The mutation information of these sequences was extracted by an R package, which can be found at https://github.com/wuaipinglab/genome_treatment. Mutations before the 300th and after the 29,000th base were discarded because of the low sequencing quality at the head and tail of the genome. Nucleotide substitutions occurring more than three times and insertions or deletions occurring more than once were kept. To remove the interference from low-quality sequencing data, we discarded sequences with an N anywhere from one base before to one base after any of the variant's consensus deletions or insertions. The final numbers of sequences used for each variant are shown in Supplementary Tables S4 and S7.
The phylogenetic tree was downloaded from NextStrain [14]. The five cluster groups (a, b, c, d, and e) were calculated with the Omicron sequences treated above. Only the mutations that appeared in more than half of the sequences were shown.
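The filtering thresholds described above can be illustrated with a minimal Python sketch. It is not the authors' R package: the function names and the representation of a mutation as a `(kind, position, change)` tuple are assumptions made only for this example, and positions are taken as 1-based coordinates on the reference genome.

```python
from collections import Counter

def n_fraction(sequence):
    """Fraction of unknown bases (N) in a genome sequence."""
    if not sequence:
        return 1.0
    return sequence.upper().count("N") / len(sequence)

def keep_sequence(sequence, max_n_fraction=0.05):
    """Keep a sequence unless MORE than 5% of its bases are N."""
    return n_fraction(sequence) <= max_n_fraction

def filter_mutations(per_sequence_mutations, first_pos=300, last_pos=29000):
    """Apply the position and recurrence thresholds from the text.

    per_sequence_mutations: one collection of mutations per genome, each
    mutation a hypothetical (kind, position, change) tuple with kind in
    {"sub", "del", "ins"}.
    """
    counts = Counter()
    for mutations in per_sequence_mutations:
        counts.update(set(mutations))  # count each mutation once per genome

    kept = set()
    for (kind, position, change), n in counts.items():
        # Discard the low-quality head and tail of the genome.
        if position < first_pos or position > last_pos:
            continue
        # Substitutions must occur more than three times,
        # insertions and deletions more than once.
        if kind == "sub" and n > 3:
            kept.add((kind, position, change))
        elif kind in ("del", "ins") and n > 1:
            kept.add((kind, position, change))
    return kept
```

The check for N around consensus deletions or insertions is left out here because it depends on the per-variant consensus coordinates, which are not given in the text.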
The spread of SARS-CoV-2 showed scale-free characteristics. Many infected people could contribute to a large-scale virus spread through multiple gathering events in its early stages. In a scale-free network, a few nodes connect to a large number of nodes. These key intermediate nodes might help to infer the transmission route of Omicron. To discover these key nodes, we first extracted all the mutations (nucleotide substitutions, deletions, and insertions) in the Omicron sequences. We then clustered these genome sequences based on their mutation similarity using the apcluster package in R [15], which yielded 253 clusters. Within each cluster, the earliest strain to appear in each country was selected as a representative sequence, and a total of 782 representative sequences were obtained. Two sequences were assumed to have a propagation relationship if they differed by only one nucleotide substitution, and a link was made between them. A nucleotide substitution was discarded if more than 1,000 of the more than 10,000 sequences had an N at its position. Therefore, although some nucleotide substitutions occurred many times, they were not included in further analysis. Finally, these 782 sequences formed an Omicron propagation network with 8,224 edges. We visualized this network in the software Gephi [16].
We downloaded the monomer structure of the S protein (QHD43416.pdb) from https://zhanggroup.org//COVID-19/ on December 22, 2021. We visualized the S protein using PyMOL [17]. Mutations in Omicron were labeled on the S protein.
The emergence of the Omicron variant has raised significant concerns because of the large number of mutations in its genome. The Omicron variant had 50 consensus mutations, including 43 nucleotide substitutions, six deletions, and one insertion. Of them, 31 were on the S protein (Fig. 1A and Table S1).
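The linking rule used for the propagation network (two representative sequences are connected when they differ by only one nucleotide substitution) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each sequence is represented by the set of its substitutions relative to the reference, "one substitution difference" is interpreted as a symmetric set difference of size one, and the sequence names and mutation labels are made up.

```python
from collections import Counter
from itertools import combinations

def build_links(profiles):
    """Return the edges between sequences that differ by one substitution.

    profiles: dict mapping a sequence name to the set of nucleotide
    substitutions it carries (hypothetical labels such as "C100T").
    """
    names = sorted(profiles)
    edges = []
    for a, b in combinations(names, 2):
        # Substitutions present in exactly one of the two sequences.
        if len(set(profiles[a]) ^ set(profiles[b])) == 1:
            edges.append((a, b))
    return edges

def degrees(edges):
    """Number of links per sequence; high-degree nodes are candidate hubs."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

if __name__ == "__main__":
    example = {
        "s1": set(),
        "s2": {"C100T"},
        "s3": {"C100T", "G200A"},
        "s4": {"A300G", "A301G", "A302G"},
    }
    links = build_links(example)
    print(links)
    print(degrees(links).most_common(1))
```

The pairwise comparison is quadratic in the number of sequences, which is manageable for several hundred representative sequences. The resulting edge list could then be exported to a tool such as Gephi for layout.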
The other four variants of concern possessed relatively fewer mutations: Alpha had 22 mutations, Beta had 18 mutations, Gamma had 23 mutations, and Delta had 29 mutations. In addition, systematic studies from the NextStrain website revealed that Omicron did not come from the previous dominant strain Delta [5, 14, 18] but was an independently emerging variant (Fig. 1B). The number of accumulated genomes increased exponentially within 47 days, indicating that the Omicron variant spread worldwide at a relatively high speed (Fig. 1C). During the first 47 days after their emergence, the Omicron, Alpha, Beta, Gamma, and Delta variants were reported with 12,304, 3,364, 733, 961, and 441 sequenced genomes, respectively (Fig. 1C). We compared the number of mutations these variants accumulated in their first 47 days. The result showed that Omicron contained 398 nucleotide substitutions, similar to the number in the Beta variant and half the number in the other three variants (Fig. 1D and Table S2). However, deletions or insertions occurred significantly more frequently in Omicron than in the other variants, up to twice as many (Fig. 1D and Table S3). We then performed a systematic analysis of these deletions and insertions. Although all five SARS-CoV-2 variants shared a similar regional preference for deletions, deletions in the Omicron variant had a wider distribution on the genome. The deletion regions in the Omicron variant generally covered the regions where most deletions occurred in the other variants (Fig. 1E). Furthermore, more diverse deletion combinations were observed in Omicron (Fig. 1F). The Omicron variant was further divided into four lineages, namely B. 2B) . After a cluster analysis by all nucleotide substitutions, including these ten high-frequency mutations, the early-stage BA.1 sequences could be divided into five groups, groups a-e (Fig. 2B). Except for group a, each group had one or two nucleotide substitutions.
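The early-stage windows compared above can be made concrete with a short Python sketch. The methods state that the initial time point of each of the other four variants was the day on which it had 100 genome sequences, and the comparison covers 47 days. The sketch below assumes, as a simplification not confirmed by the text, that the window starts on that initial day and that only genomes collected within the following 47 days are counted; the function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def initial_day(collection_dates, min_genomes=100):
    """First day on which the cumulative genome count reaches min_genomes."""
    per_day = Counter(collection_dates)
    total = 0
    for day in sorted(per_day):
        total += per_day[day]
        if total >= min_genomes:
            return day
    return None  # the variant never reached the threshold

def early_window(collection_dates, days=47, min_genomes=100):
    """Collection dates that fall within the early-stage window."""
    start = initial_day(collection_dates, min_genomes)
    if start is None:
        return []
    end = start + timedelta(days=days)  # exclusive upper bound
    return [d for d in collection_dates if start <= d < end]
```

Whether genomes collected before the initial day belong to the early-stage dataset is a design choice. Here they are excluded so that every variant is measured over a window of the same length.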
Three nucleotide substitutions on the S protein belonged to groups b, c, and d, respectively. We then built a network in which two sequences were linked if they differed by only one nucleotide substitution. The whole network presented a process of continuous diffusion from the center to the outside. We labeled groups a-e in this network and found that the five groups appeared in different parts of the network (Fig. 2C). Group a was at the center of the network. Groups c, d, and e were on the outside, connecting to group a through several nodes. Notably, group b did not connect with the other groups in the network; the intermediate nodes between group b and the other groups were not included in our sequence dataset. When we mapped the detection time of each node onto the network, we found that groups c and d appeared earliest, followed by groups b and e (Fig. 2D and Fig. S1). The spatiotemporal analyses showed that each of these groups had a unique distribution, although in some countries, such as the United Kingdom and the United States, all groups were detected (Fig. S1A-S1D). These results indicated that the Omicron variant mutated and evolved during its early transmission. In the early-stage Omicron genome sequences, diverse mutations appeared in BA.1. These mutations divided BA.1 into one original group and four subgroups. We calculated the consensus mutations occurring in more than 50% of the sequences within each group. Six unique nucleotide substitutions were notable (Fig. 3A and Table S6). Three unique nucleotide substitutions on the S protein belonged to groups b, c, and d: S_R346K (group c), S_A701V (group d), and S_L1081V (group b). In addition, group b carried L106F on the ORF3a protein, a protein that can induce apoptosis [19]. L106F has been reported in India and Brazil [20, 21].
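The per-group consensus step described above (mutations occurring in more than 50% of the sequences of a group) can be sketched in a few lines of Python. The function names, the group labels, and the mutation labels in the example are assumptions for illustration only. "Group-specific" is taken here to mean a consensus mutation of one group that is not a consensus mutation of any other group, which may differ from the authors' exact definition.

```python
from collections import Counter

def consensus_mutations(group_sequences, threshold=0.5):
    """Mutations carried by more than `threshold` of the sequences.

    group_sequences: list with one collection of mutation labels per genome.
    """
    n = len(group_sequences)
    if n == 0:
        return set()
    counts = Counter()
    for mutations in group_sequences:
        counts.update(set(mutations))  # count each mutation once per genome
    return {m for m, c in counts.items() if c / n > threshold}

def group_specific_mutations(groups, threshold=0.5):
    """Consensus mutations of each group that no other group shares."""
    consensus = {g: consensus_mutations(s, threshold) for g, s in groups.items()}
    specific = {}
    for group, mutations in consensus.items():
        others = set()
        for other_group, other_mutations in consensus.items():
            if other_group != group:
                others |= other_mutations
        specific[group] = mutations - others
    return specific

if __name__ == "__main__":
    toy_groups = {
        "a": [{"m1", "m2"}, {"m1"}, {"m1", "m3"}],
        "b": [{"m1", "x"}, {"m1", "x"}, {"m1"}],
    }
    print(group_specific_mutations(toy_groups))
```

In the toy input, `m1` is shared by both groups and is therefore dropped, while `x` is reported for group b only, mirroring how group a serves as the original group without a substitution of its own.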
To determine whether more genome sequences could contribute to more mutations, we compared the internal diversity among different variants (Alpha, Delta, and Omicron) with the same number of sequences from England (Table S7). Up to January 8, 2022, there were 103,688 sequences of the Omicron variant in England, and Omicron genomes also accumulated faster than those of the other variants. The faster accumulation indicated that the Omicron variant spread faster than the other variants in this region (Fig. 4A). In our results, the deletion diversity within Omicron increased rapidly with the number of sequenced genomes, significantly faster than that of the Alpha and Delta variants (Fig. 4B). In addition, the internal diversity of nucleotide substitutions and that of insertions or deletions shared a similar growth trend in the Alpha, Delta, and Omicron variants (Fig. 4C). When we mapped the locations of deletions and insertions from different variants onto the SARS-CoV-2 genome, we found that, in the sequences covering the same time duration after the initial day of each variant in England (labeled by a dotted line in Fig. 4A), the deletions of the Omicron variant were distributed more widely on the genome than those of the other variants (Fig. 4D). The wider distribution could result from the rapid sequence accumulation of the Omicron variant in the early stage (Fig. 4A). However, when these variants reached the same number of sequences, their deletion distributions tended to be similar (Fig. 4E). The above results indicated that the higher genetic diversity in the Omicron variant could be related to its faster spread in the early outbreak stage. Compared to the other variants of concern, variant Omicron had almost four showed that deletions could affect viral proteins more strongly than single nucleotide substitutions [23]. A systematic analysis revealed that deletions in SARS-CoV-2 had a regional preference.
It was also shown that the recurrent deletions on the N-terminal domain of the S protein partially covered the binding domain of some neutralizing antibodies, indicating a potential role of these deletions in virus evolution [24, 25]. Therefore, in preventing and controlling the COVID-19 pandemic, it is necessary to pay more attention to the internal genetic diversity of the dominant variants, including nucleotide substitutions and deletions or insertions. Previous studies have shown that the Omicron variant consisted of four sub-lineages. These sub-lineages seemed to emerge at similar times, and two of them (BA.1 and BA.2) had spread worldwide [26]. A recent study showed that the BA.2 lineage, which appeared later, might spread faster than BA.1 [7]. Our study showed that the BA.1 lineage continued to differentiate. We divided BA. On the S protein of the Omicron variant, there were 31 consensus mutations. Some of them, such as S477N, Q498R, and N501Y, have already been associated with an increased binding ability to the ACE2 receptor [28-31]. Another consensus nucleotide substitution, K417N, has been confirmed to inactivate some therapeutic neutralizing antibodies [29-31]. The consensus deletion S_del69/70 has been shown to help the virus enter host cells [32]. Beyond these notable mutations, we found that a series of novel mutations continued to emerge on the S protein during the spread of the Omicron variant. Three novel nucleotide substitutions (S_R346K, S_A701V, and S_L1081V) were detected in these early-stage sequences, of which S_R346K was on the receptor-binding domain. S_R346K has been shown to slightly affect the binding between the SARS-CoV-2 virus and class 2 antibodies [33]. Another nucleotide substitution, S_A701V, was one of the dominant mutations in the third pandemic wave in Malaysia [34].
In addition, many studies have shown that Omicron has a strong ability to escape several neutralizing antibodies [8], and that previously infected people are also susceptible [11]. Therefore, it is critical to determine the effect of not only the consensus mutations but also these emerging mutations of the Omicron variant on virus transmissibility and immune escape.
Table 7. Several genome sequences of an early outbreak of the five variants of concern.
The SARS-CoV-2 Omicron variant carries more mutations than previously reported variants. However, the genetic differentiation and diversity that arise within the Omicron variant during its early spread remain unclear. In this study, the genetic differentiation and diversity of the Omicron variant during its early outbreak have been comprehensively analyzed. More deletions accumulated on the Omicron genome than on those of the other four SARS-CoV-2 variants in the same timescale. Seven new notable non-synonymous mutations emerged in addition to the 50 known consensus mutations. The rapid spread of the Omicron variant might lead to its high genetic differentiation and diversity in the population. Our study showed that Omicron underwent remarkably rapid genetic differentiation and gained mutational diversity as it spread. The findings indicate that more attention should be paid to the emerging Omicron sub-lineages in disease prevention and control.
Genomic variation, origin tracing, and vaccine development of SARS-CoV-2: A systematic review
WHO, Tracking SARS-CoV-2 variants
Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa
Sequence analysis of the Emerging SARS-CoV-2 Variant Omicron in South Africa
The mysterious origins of the Omicron variant of SARS-CoV-2
Omicron sub-lineage BA.2 may have "substantial growth advantage"
Omicron escapes the majority of existing SARS-CoV-2 neutralizing antibodies
Considerable escape of SARS-CoV-2 Omicron to antibody neutralization
Omicron thwarts some of the world's most-used COVID vaccines
Increased risk of SARS-CoV-2 reinfection associated with emergence of the Omicron variant in South Africa
GISAID: Global initiative on sharing all influenza data - from vision to reality
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
Nextstrain: real-time tracking of pathogen evolution
APCluster: an R package for affinity propagation clustering
Gephi: an open source software for exploring and manipulating networks
Pymol: An open-source molecular graphics tool
Genomic perspectives on the emerging SARS-CoV-2 omicron variant
The ORF3a protein of SARS-CoV-2 induces apoptosis in cells
Diversity of SARS-CoV-2 genome among various strains identified in Lucknow
Detection of potential new SARS-CoV-2 Gamma-related lineage in Tocantins shows the spread and ongoing evolution of P.1 in Brazil, bioRxiv
Biochemical characterization of protease activity of Nsp3 from SARS-CoV-2 and its inhibition by nanobodies
Going beyond SNPs: The role of structural genomic variants in adaptive evolution and species diversification
Conserved Pattern and Potential Role of Recurrent Deletions in SARS-CoV-2 Evolution
Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape
Where did Omicron come from? Three key theories
Omicron sublineage with potentially beneficial mutation S:346K
The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity
Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding
Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies
Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7
Recurrent emergence of SARS-CoV-2 spike deletion H69/V70 and its role in the variant of concern lineage B.1.1.7
The R346K Mutation in the Mu Variant of SARS-CoV-2 Alter the Interactions with Monoclonal Antibodies from Class 2: A Free Energy of Perturbation Study
Phylogenomic analysis of SARS-CoV-2 from third wave clusters in Malaysia reveals dominant local lineage B.1.524 and persistent spike mutation A701V
This work was supported by the National key research and development program (2021YFC2301300); the CAMS Innovation Fund for Medical Sciences
The authors declare that there are no conflicts of interest.
Conceptualization, Writing - Review and Editing, Validation.