key: cord-0938901-f6t0yvao authors: Ma, Wentai; Yang, Jing; Fu, Haoyi; Su, Chao; Yu, Caixia; Wang, Qihui; de Vasconcelos, Ana Tereza Ribeiro; Bazykin, Georgii A.; Bao, Yiming; Li, Mingkun title: Genomic perspectives on the emerging SARS-CoV-2 omicron variant date: 2022-01-13 journal: Genomics Proteomics Bioinformatics DOI: 10.1016/j.gpb.2022.01.001 sha: 303690d9885147f1298a8b6a5c42a343e5329b6f doc_id: 938901 cord_uid: f6t0yvao

A new variant of concern for SARS-CoV-2, Omicron (B.1.1.529), was designated by the World Health Organization on November 26, 2021. This study analyzed the viral genome sequencing data of 108 samples collected from patients infected with Omicron. First, we found that the enrichment efficiency of viral nucleic acids was reduced due to mutations in the regions where the primers anneal. Second, the Omicron variant possesses an excessive number of mutations compared to other variants circulating at the same time (62 vs. 45), especially in the Spike gene. Mutations in the Spike gene confer alterations in 32 amino acid residues, more than those observed in other SARS-CoV-2 variants. Moreover, a large number of nonsynonymous mutations occur in the codons for the amino acid residues located on the surface of the Spike protein, which could potentially affect the replication, infectivity, and antigenicity of SARS-CoV-2. Third, there are 53 mutations between the Omicron variant and its closest sequences available in public databases. Many of these mutations were rarely observed in public databases and had a low mutation rate. In addition, the linkage disequilibrium between these mutations was low, with a limited number of mutations (6) concurrently observed in the same genome, suggesting that the Omicron variant would be in a different evolutionary branch from the currently prevalent variants.
To improve our ability to detect and track the source of new variants rapidly, it is imperative to further strengthen genomic surveillance and data sharing globally in a timely manner.

On November 22, 2021, the first genome sequence of a new variant of concern (VOC), Omicron (also known as B.1.1.529), was released in GISAID (Global Initiative on Sharing All Influenza Data) (EPI_ISL_6590782) [1]. The sample was obtained from a patient who arrived in Hong Kong on November 11 from South Africa via Doha in Qatar (https://news.sky.com/story/covid-19-how-the-spread-of-omicron-went-frompatient-zero-to-all-around-the-globe-12482183). To date, the first known Omicron variant sample was collected on November 5, 2021 in South Africa (EPI_ISL_7456440). As of December 12, 2021, over 2000 Omicron sequences had been submitted to GISAID from South Africa, Botswana, Ghana, the United Kingdom, and many other countries. The emergence of this variant has attracted much attention due to the sheer number of mutations in the Spike gene, which may affect viral transmissibility, replication, and antibody binding, and due to its dramatic increase in South Africa [2]. Preliminary studies showed that the new variant could substantially evade immunity from prior infection and vaccination [3, 4]. Meanwhile, a preprint report proposed that the emergence of the Omicron variant was associated with an increased risk of SARS-CoV-2 reinfection [5]. However, it is still unclear where the new variant came from. In this study, we characterized the genomic features of the Omicron variant using data from 108 patients infected with the Omicron variant, which were generated by the Network for Genomic Surveillance in South Africa (NGS-SA) [2, 6], and we speculate that the new variant is unlikely to have been derived from recently discovered variants through either mutation or recombination.
Among 207 Omicron samples sequenced and shared by NGS-SA, 108 samples had more than 90% of the viral genome covered by at least 5-fold, and these were used in the subsequent analysis. Notably, two sequencing protocols were implemented. The first was to enrich the viral genome with the Midnight V6 primer sets followed by sequencing on the GridION platform (hereinafter referred to as Midnight, dx.doi.org/10.17504/protocols.io.bwyppfvn). The second protocol involved enrichment by the Artic V4 primer set, and the amplicons were sequenced on the Illumina MiSeq platform (hereinafter referred to as Artic, dx.doi.org/10.17504/protocols.io.bdp7i5rn). Fifty samples were sequenced using both protocols, and we found a high consistency in the major allele frequency between the two protocols (Figure S1). Artic data were preferred due to higher sequencing depth. The sequencing depth profile of the SARS-CoV-2 genome was similar among samples sequenced by the same protocol but differed markedly between the two protocols (Figure 1A). The sequencing depth varied among different genomic regions, reflecting the differential enrichment efficiency of the primers used for amplification. Moreover, we found that the large number of mutations possessed by the Omicron variant had a significant impact on the enrichment efficiency of the primers. In particular, the enrichment efficiency of seven primers in the Artic protocol was affected by at least one mutation, and that of three primers in the Midnight protocol was affected (Figure 1A). The worst coverages, in the three regions for Primers 76, 79, and 90 of the Artic protocol, were all associated with the presence of mutations in the regions where these primers anneal; their sequencing depths were reduced by 2586-, 246-, and 234-fold, respectively, compared to the expected depth (Figure 1B). In contrast, the efficiency of Midnight primers was less influenced by mutations in the Omicron variant.
The three affected primers, Primers 10, 24, and 28, showed no reduction, 2-fold, and 28-fold reduction, respectively, in sequencing depth compared to the expected depth. The number of mutations (with major allele frequency ≥ 70%) of the Omicron variant varied from 61 to 64, and 61 of them were identified in more than 90% of the samples, which included 54 SNPs, six deletions, and one insertion. All these mutations were fixed at the individual level (Figure 2A). The total number of mutations was significantly higher than that of other variants detected in South Africa in November RaTG13) and Wuhan-Hu-1 [8-12]. The dramatic changes in the Spike protein and RBD regions may substantially change the antigenicity and susceptibility to pre-existing antibodies. Most mutations occurred on the surface of the trimeric Spike protein, especially in the RBD region (Figure S2). Eight of the 16 mutations in the RBD region (K417N, G446S, E484A, Q493R, G496S, Q498R, N501Y, and Y505H) were located at positions that were proposed to be critical for viral binding to the host receptor angiotensin-converting enzyme 2 (ACE2) [13]. Among them, the K417N and N501Y mutations, which were also identified in the Beta variant, were reported to influence binding to human ACE2 [14]; N501Y confers a higher affinity of the viral Spike protein to ACE2 [15]. How other mutations affect the affinity to ACE2 of humans and other animal hosts is still unknown. Moreover, some other amino acid changes in the Spike protein are known to be associated with changes in replication and infectivity of the virus.
For example, Δ69-70 could enhance infectivity associated with increased cleaved Spike incorporation [16]; P681H could potentially confer a replication advantage through increased cleavage efficacy by furin and adaptation to resist innate immunity [3, 17]; H655Y was suspected to be an adaptive mutation that could increase the infectivity of the virus in both human and animal models [16]. In addition, amino acid mutations in other proteins, such as R203K and G204R in the Nucleocapsid protein, could also potentially increase the infectivity, fitness, and virulence of the virus [18]. Of note, the function of these mutations was investigated because they were present in other VOCs. The effect of other less frequent mutations and the combination of the aforementioned mutations on the biology of the virus warrants further investigation. Mutations in the RBD region, which is the target of many antibodies, may compromise neutralization by existing antibodies induced by vaccination or natural infection [19]. Recent studies have shown severely reduced neutralization of the Omicron variant by monoclonal antibodies and vaccine sera [4, 20, 21]. Meanwhile, preliminary studies suggested that the Omicron variant caused three times more reinfection than previous strains, further supporting the speculation that the new variant can evade immunity from prior infection and vaccination [5]. However, the escape from pre-existing immunity was incomplete, and a vaccine booster shot is likely to provide a high level of protection against the Omicron variant [4]. Here, we analyzed the epitope regions of 182 protein complex structures of antibodies that bind to SARS-CoV-2 Spike, the RBD, or N-terminal domain (NTD) from the Protein Data Bank. We found that mutations in the Omicron variant were enriched in the epitope region of the Spike protein (Figure 3A).
The median number of antibodies bound to the Omicron mutation sites was 53, which was significantly higher than the number bound to other positions (median 3, P < 0.001, Mann-Whitney U test). Moreover, by analyzing deep mutational scanning data [22], we found that these mutations could potentially impact the binding of different classes of antibodies (Figure 3B), which were classified by the location and conformation of antibody binding [23], suggesting that the therapeutic strategy of antibody cocktails may also be affected. In addition to the 61 shared mutations, some specific mutations were identified in Most Omicron lineage-specific mutations (52/54) were identified in public databases (Figure 4C). However, they were unlikely to be present in one sequence by chance. First, over half of the mutations were rarely detected in the populations, i.e., 33 mutations were detected in fewer than 1000 samples out of five million sequences (16 mutations were detected in fewer than 100 samples). Second, the mutation rate (represented by the number of occurrences of a mutation on the phylogenetic tree) was extremely low for 13 of the mutations (occurring only once in the evolution of SARS-CoV-2, mutation rate = 1). Third, the linkage disequilibrium between these mutations was low, and only four mutation pairs had r² greater than 0.8. We further examined whether any combination of these mutations appeared in public databases and found that the maximum number of mutations in the same genome was six. Therefore, the evolutionary trajectory of the Omicron lineage cannot be resolved by the current genome data. The unique genome features of the Omicron variant make it the most unusual SARS-CoV-2 variant to date. The excess number of nonsynonymous mutations in the Spike gene implies that the Omicron variant might have evolved under selection pressure, which may come from antibodies or adaptation to new hosts.
It is speculated that it may have been incubated in a patient chronically infected with SARS-CoV-2, e.g., HIV patients with immunocompromising conditions. This hypothesis was supported by the accelerated viral evolution observed in immunocompromised patients and has been previously proposed to explain how the Alpha variant was generated [24, 25] (https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sarscov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563). If this hypothesis is true for the Omicron variant, we suspect that the original virus that infected the patient might still be missing in public databases, because the current closest sequences were circulating in the population one and a half years ago, an interval that is too long even for a chronic infection. Another hypothesis involves a spillover from humans to animals followed by a spillback from animals to humans; such a process has been proposed to be possible in mink [26]. Interestingly, a recent study proposed that the progenitor of the Omicron variant seemed to have evolved in mice for some time before jumping back into humans [27]. The binding affinity test between the Omicron RBD and animal The sequencing data were retrieved from the SRA database in NCBI (BioProject: PRJNA784038), which was generated by the Network for Genomic Surveillance in South Africa (NGS-SA) [2, 6]. In total, 211 samples were downloaded on November 30, 2021 (Table S1). The virus lineage was assigned by Pango [29]; four samples that could not be assigned to the Omicron lineage were discarded. All the remaining 207 samples were assigned to Omicron BA.1. Quality control and adaptor trimming were performed by fastp [30]. The resultant reads were mapped to Wuhan-Hu-1 (NC_045512.2) using minimap2 (-ax sr) [31]. Primer alignment and trimming were performed by the align_trim function from Artic (https://artic-tools.readthedocs.io/en/latest/commands/#align_trim).
The mpileup file and the read count file were generated by SAMtools [32] and Varscan2 [33] . The consensus sequence was obtained using the following criteria: 1) depth ≥ 5-fold; 2) frequency of the major allele ≥ 70%. The sequencing depth was calculated for each nonoverlapping window with a size of 100 bp, except for the last window, which ranged from 29801 to 29880 bp. The fold change of each primer region was calculated by the sum of the depth of all samples in this region divided by the expected value (assuming no differences among regions). We downloaded the structures of 182 protein complexes of antibodies that bind to the SARS-CoV-2 Spike or its RBD or NTD from the Protein Data Bank (all structures available before August 8, 2021, www.rcsb.org). The residues in the Spike protein involved in binding to antibodies were identified by a distance of less than 4.5 Å between two counterparts in which van der Waals interactions occur. Deep mutational scanning results were obtained from https://jbloomlab.github.io/SARS2_RBD_Ab_escape_maps/, which includes information on sites in the SARS-CoV-2 RBD where mutations reduce binding by antibodies/sera [22] . The escape score at each position was calculated as the mean of the scores of all antibodies belonging to the same class. We downloaded the cryogenic electron microscopy structure of SARS-CoV-2 Spike extracellular domain (PDB: 6VYB) and crystal structure of RBD-hACE2 complex (PDB: 6LZG) from the Protein Data Bank (PDB) (https://www.rcsb.org/). The structure of the RBD region was extracted from the RBD-hACE2 complex. All structures were visualized by PyMOL software (https://pymol.org/2/). Omicron mutations relative to the Wuhan-Hu-1 are labeled on the structure except for those invisible in the structure. The amino acid sequences were converted from nucleotide sequences using MEGA-X (10.1.8) [34] . Phylogenetic construction was performed by IQ-TREE (1.6.12) [35] . 
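As an illustration only, the consensus criteria, the 100-bp windowed depth, and the primer-region fold change described above could be sketched in Python as follows. The function names, the per-site allele-count input, and the way the expected depth is derived (total depth spread evenly over the genome) are assumptions made for this sketch; they are not taken from the pipeline used in the study, which relied on SAMtools and VarScan2.

```python
MIN_DEPTH = 5          # criterion 1: depth >= 5-fold
MIN_MAJOR_FREQ = 0.70  # criterion 2: major allele frequency >= 70%
WINDOW = 100           # nonoverlapping window size in bp


def call_consensus_base(allele_counts):
    """Return the major allele at one site, or 'N' if either criterion fails.

    allele_counts is a hypothetical dict mapping allele -> read count.
    """
    depth = sum(allele_counts.values())
    if depth < MIN_DEPTH:
        return "N"
    allele, count = max(allele_counts.items(), key=lambda kv: kv[1])
    return allele if count / depth >= MIN_MAJOR_FREQ else "N"


def window_depths(per_base_depth, window=WINDOW):
    """Mean depth in consecutive nonoverlapping windows.

    The final window simply holds whatever bases remain (an assumption).
    """
    means = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def fold_change(region_depth_sum, total_depth_sum, region_len, genome_len):
    """Observed depth in a region divided by the depth expected if reads
    were spread evenly over the genome (one possible reading of the text).
    Values below 1 indicate reduced enrichment."""
    expected = total_depth_sum * region_len / genome_len
    return region_depth_sum / expected
```

Under this reading, a region that receives half of its evenly distributed share of the total depth has a fold change of 0.5, i.e., a 2-fold reduction.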
The GTR+F model was used for nucleotide sequences, while the Blosum62 model was used for amino acid sequences. The estimation of the time to the most recent common ancestor (TMRCA) and mutation rate was performed by BEAST (2.6.4) [36] using 108 sequences collected between November 13, 2021, and November 23, 2021. The HKY85 nucleotide substitution model and strict molecular clock were used. The distance between two SARS-CoV-2 sequences was represented by the mutation difference, which was calculated by an online tool at the National Genomics Data Center [1, 37]. The r² statistic was used to measure the strength of the linkage disequilibrium between each pair of mutations [38]. The calculation of linkage disequilibrium was based on all unique haplotypes from public databases.

genomes were most similar to SARS-CoV-2 [8, 9], two pangolin coronaviruses (Pangolin MP789 and Pangolin GXP5L) [10, 11], and sequences of other recently collected VOCs (EPI_ISL_6141707, EPI_ISL_6774033, EPI_ISL_6898988, EPI_ISL_6585201 for the Alpha, Beta, Gamma, and Delta variants, respectively. All sequences were collected in November 2021, and those collected in South Africa were preferred) were included in the analysis of the phylogenetic tree. The Wuhan-Hu-1 sequence is shown as the outgroup of the tree for better visualization [12]. The number of mutations relative to Wuhan-Hu-1 is listed on the right of the tree. Insertion or deletion of multiple bases was considered as a single mutation. ACE2, angiotensin-converting enzyme 2; VOC, variant of concern; RBD, receptor-binding domain. Each dot represents a mutation (major allele frequency ≥ 70%).
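For readers who want a concrete picture of the pairwise r² statistic and of the count of mutations observed together in one genome, a minimal Python sketch is given here. It assumes each unique haplotype is represented as a set of mutation labels; that representation, the function names, and the placeholder labels in the usage note are illustrative assumptions and do not describe the tool actually used in the study. The sketch uses the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)), with D = pAB − pA pB.

```python
def ld_r2(haplotypes, mut_a, mut_b):
    """Pairwise r^2 between two mutations across a list of haplotypes.

    haplotypes: list of sets of mutation labels (hypothetical input format).
    Returns None when either mutation is absent or fixed, because r^2 is
    undefined in that case.
    """
    n = len(haplotypes)
    p_a = sum(mut_a in h for h in haplotypes) / n
    p_b = sum(mut_b in h for h in haplotypes) / n
    p_ab = sum((mut_a in h) and (mut_b in h) for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def max_cooccurring(haplotypes, query_mutations):
    """Largest number of the query mutations found together in one haplotype."""
    query = set(query_mutations)
    return max((len(query & h) for h in haplotypes), default=0)
```

For example, with the placeholder haplotypes `[{"m1", "m2"}, {"m1", "m2"}, set(), set()]`, `ld_r2(..., "m1", "m2")` is 1.0 because the two mutations always occur together, and `max_cooccurring(..., ["m1", "m2", "m3"])` is 2.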
[1] disease and diplomacy: GISAID's innovative contribution to global health
[2] Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa
[3] The P681H mutation in the Spike glycoprotein confers Type I interferon resistance in the SARS-CoV-2 Alpha (B.1.1.7) variant
[4] Omicron extensively but incompletely escapes Pfizer BNT162b2 neutralization
[5] Increased risk of SARS-CoV-2 reinfection associated with emergence of the Omicron variant in South Africa
[6] A year of genomic surveillance reveals how the SARS-CoV-2 pandemic unfolded in Africa
[7] KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences
[8] Coronaviruses with a SARS-CoV-2-like receptor-binding domain allowing ACE2-mediated entry into human cells isolated from bats of Indochinese peninsula
[9] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[10] Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)?
[11] Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins
[12] A new coronavirus associated with human respiratory disease in China
[13] Structural and functional basis of SARS-CoV-2 entry by using human ACE2
[14] Experimental evidence for enhanced receptor binding by rapidly spreading SARS-CoV-2 variants
[15] The N501Y Spike substitution enhances SARS-CoV-2 infection and transmission
[16] Recurrent emergence of SARS-CoV-2 Spike deletion H69/V70 and its role in the Alpha variant B.1.1.7
[17] Functional evaluation of the P681H mutation on the proteolytic activation the SARS-CoV-2 variant B.1.1.7 (Alpha) Spike
[18] Nucleocapsid mutations R203K/G204R increase the infectivity, fitness, and virulence of SARS-CoV-2
[19] Defining variant-resistant epitopes targeted by SARS-CoV-2 antibodies: A global consortium study
[20] Reduced neutralization of SARS-CoV-2 Omicron variant by vaccine sera and monoclonal antibodies
[21] Omicron escapes the majority of existing SARS-CoV-2 neutralizing antibodies
[22] Mapping mutations to the SARS-CoV-2 RBD that escape binding by different classes of antibodies
[23] SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies
[24] Persistence and evolution of SARS-CoV-2 in an immunocompromised host
[25] SARS-CoV-2 evolution during treatment of chronic infection
[26] Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans
[27] Evidence for a mouse origin of the SARS-CoV-2 Omicron variant
[28] Where did "weird" Omicron come from?
[29] A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
[30] Fastp: an ultra-fast all-in-one FASTQ preprocessor
[31] Minimap2: pairwise alignment for nucleotide sequences
[32] The sequence alignment/map format and SAMtools
[33] VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing
[34] MEGA X: molecular evolutionary genetics analysis across computing platforms
[35] IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era
[36] BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis
[37] The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR
[38] Linkage disequilibrium--understanding the evolutionary past and mapping the medical future

We thank Dr. Jennifer Giandhari, Dr. Eduan Wilkinson, and Dr. Tulio de Oliveira from

The authors declare that they have no competing interests.