key: cord-0767546-o8dmg1ij
authors: Weng, Shenghui; Zhou, Hangyu; Ji, Chengyang; Li, Liang; Han, Na; Yang, Rong; Shang, Jingzhe; Wu, Aiping
title: Conserved Pattern and Potential Role of Recurrent Deletions in SARS-CoV-2 Evolution
date: 2022-03-07
journal: Microbiol Spectr
DOI: 10.1128/spectrum.02191-21
sha: cc858a48ed1eaf295422161c142421750456aa22
doc_id: 767546
cord_uid: o8dmg1ij

SARS-CoV-2 has continued to adapt to human hosts throughout the worldwide pandemic that began in 2019. The virus evolves through multiple means, such as single nucleotide mutations and structural variations, which has greatly complicated the prevention and control of COVID-19. Structural variations, which involve multiple nucleotide changes such as insertions and deletions, have a greater impact than single nucleotide mutations on both genome structure and protein function. In this study, we found that deletions occurred frequently not only in SARS-CoV-2 but also in other SARS-related coronaviruses. These deletions showed an obvious location bias and formed 45 recurrent deletion regions (RDRs) in the viral genome. Some of these deletions showed proliferation advantages, including four high-frequency deletions (nsp6 Δ106-108, S Δ69-70, S Δ144, and Δ28271) that were detected in around 50% of SARS-CoV-2 genomes and 19 other medium-frequency deletions. In addition, the association between deletions and the WHO-reported variants of concern (VOC) and variants of interest (VOI) of SARS-CoV-2 indicated that each variant had a unique combination of deletions. In the spike (S) protein, the deletions in SARS-CoV-2 were mainly in the N-terminal domain. Some deletions, such as S Δ144/145 and S Δ243-244, have been confirmed to block the binding sites of neutralizing antibodies.
Overall, this study revealed a conserved regional pattern and the potential effect of some deletions in SARS-CoV-2 over the whole genome, providing important evidence for potential epidemic control and vaccine development.

IMPORTANCE Mutations in SARS-CoV-2 have been studied extensively, but previous studies of structural variations discussed mainly those on the spike protein. To study the role of structural variations in virus evolution, we described their distribution across the whole genome. Conserved deletion patterns were found among SARS-CoV-2, SARS-CoV-2-like, and SARS-CoV-like viruses. Integrating the deleted positions yielded 45 recurrent deletion regions (RDRs) in SARS-CoV-2. In these regions, four high-frequency deletions appeared in parallel in multiple strains. Furthermore, in the spike protein, the deletions in SARS-CoV-2 were mainly in the N-terminal domain, blocking the binding sites of some neutralizing antibodies, while the structural variations in SARS-related coronaviruses were mainly in the N-terminal domain and the receptor binding domain. The receptor binding domain is closely related to host recognition, so deletions in this domain may play a role in host adaptation.

The SARS-CoV-2 variant B.1.617.2, named the Delta strain by the World Health Organization (WHO), has shown increased transmission and immune escape capabilities (3, 4). People with previously induced antibodies are still at risk of infection by this variant (5). Therefore, there is an urgent need to understand the molecular mechanism underlying the adaptive evolution of SARS-CoV-2. SARS-CoV-2 can evolve rapidly through genome variation, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). SVs include short fragment insertions, deletions, sequence reversals, and recombination.
Current research has focused mainly on SNPs (6), but SVs can involve more nucleotides and may therefore have a greater impact on genomic structure or protein function. Many SVs arise during viral passage, but only a small fraction are retained and spread. These preserved deletions may have played a role in the evolution of SARS-CoV-2 (7). Previous studies have shown that fragment deletions can affect the proliferation and transmission of SARS-CoV-2 (8, 9). For instance, a 382-nucleotide deletion in ORF8 that weakened the virulence of SARS-CoV-2 was reported in Singapore in the early stages of the epidemic (8). A Δ500-532 deletion in nonstructural protein 1 (nsp1), which seemed to occur early in the epidemic, was shown to reduce the host IFN-β response (10). Another 34-nucleotide deletion, in ORF6, was found in France. This variant was shown to induce the overexpression of several specific cytokines, including CCL2/MCP1, PTX3, and TNF-α, which are involved in the regulation and transduction of NF-κB signaling (11). Recently, Δ69-70 and Δ144 were found in the S protein of the B.1.1.7 lineage of SARS-CoV-2. S Δ69-70 was shown to increase the virus's ability to release the S2 structure, which can augment viral infectivity and improve viral syncytium production (12). Based on a bioinformatic analysis, Reham et al. found that S Δ144 can alter the pocket structure on the N-terminal domain (NTD) of the S protein and reduce the affinity between the NTD and endogenous host antibodies (13). As site mutations and structural variations accumulate, more SARS-CoV-2 variants with divergent mutations continue to appear in the literature. For instance, the outbreak of the B.1.1.7 variant (the Alpha strain) occurred first in the United Kingdom, followed by the outbreak of the B.1.617.2 variant (the Delta strain) in India.
These variants have been observed to evade vaccine immunity (3, 14). Four recurrent deletion regions (RDRs) in the S protein, which include S Δ69-70 and S Δ144, prevent the virus from being bound by neutralizing antibodies (15). These observations suggest that deletion is one of the ways in which SARS-CoV-2 escapes adaptive immunity and adapts to its host. Therefore, a systematic analysis of the pattern of deletions in SARS-CoV-2 and their potential effect on immune escape is urgently needed. In this study, we comprehensively analyzed deletions and insertions in SARS-CoV-2, together with those in SARS-CoV-2-like and SARS-CoV-like viruses. We found conserved patterns of deletions not only in SARS-CoV-2 but also in SARS-CoV-2-like and SARS-CoV-like viruses. Among all RDRs, SARS-CoV-2 evolved four high-frequency deletions that were found in over 48% of sequenced strains, most of which belonged to the dominant Alpha strain (lineage B.1.1.7). It is worth noting that deletions from RDRs were detected, in different combinations, in all six variants of concern as defined by the WHO (16). Furthermore, the NTD and RBD regions of the S protein contain multiple RDRs, which may promote rapid viral adaptation to the host.

Common RDRs in SARS-CoV-2, SARS-CoV-2-like, and SARS-CoV-like viruses. Previous studies have shown a regional preference of deletions in the NTD of the S protein in SARS-CoV-2, forming four recurrent deletion regions (RDR1-4) (15). Here, we systematically analyzed deletion and insertion events in 1,289,583 high-quality SARS-CoV-2 genomes downloaded on July 05, 2021, from GISAID (see Methods). In total, 1,007 unique deletions and 387 unique insertions (Tables S1 and S2) were detected.
The most frequent unique insertion occurred only 397 times in the dataset, whereas the most frequent unique deletion occurred 685,744 times (53.18%), which indicates that deletions were the major structural variations in SARS-CoV-2. Across the entire genome, these deletions showed a clear regional preference (Fig. 1A) and occurred mainly in the nsp1, nsp2, nsp3, nsp4, nsp6, S, and N proteins and in accessory proteins (Fig. S1A). Integrating these deleted positions generated 45 RDRs (Table S3). Furthermore, we found that the diversity of deletions increased significantly with time (Fig. S1B). To investigate these biased RDRs in other coronaviruses, we collected genome sequences of SARS-CoV-2-like and SARS-CoV-like viruses from different hosts and generated a set of deletions relative to SARS-CoV-2 or SARS-CoV (Table S4). In these sequences, we found that most deletions fell in three regions, forming three high deletion/insertion areas (HDA 1-3) in the front part of nsp3, S, and ORF8, respectively (Fig. 1B and C). This pattern indicated that these three HDAs were conserved among SARS-related coronaviruses. In these three proteins in the SARS-CoV-2 genome, there were twelve RDRs, seven RDRs, and one large 436-nt RDR, respectively. These facts led us to speculate about the roles of these RDRs and HDAs in the evolution of coronaviruses. In addition, when we extracted the aligned sequences around S Δ69-70 in SARS-CoV-2-like viruses, we found that the high-frequency S Δ69-70 deletion in SARS-CoV-2 also existed in these SARS-related sequences (Fig. 1D). This finding further indicated that deletions in HDAs were not randomly distributed. These deletions with a location preference could be the result of adaptive selection.

Diversity of deletion types in RDRs. Among the 45 RDRs in the SARS-CoV-2 genome, the distribution of deletions showed location-dependent characteristics. In the S protein, RDRs were located in the NTD.
In the first fifth of the ORF3a sequence, there was one long 122-nt RDR. RDRs were also identified in a cluster between the M protein and the N protein that covered four accessory proteins (ORF6, ORF7a, ORF7b, and ORF8). The two longest RDRs, one of 402 nt and one of 436 nt, were located in this cluster (Fig. 2A). The discontinuous transcription mechanism of SARS-CoV-2 may underlie the location preference of these RDRs (17). When we studied the association between the length of these RDRs and their deletion types, we found that more deletion types were identified in longer RDRs. More than 30 deletion types were detected in each of the five longest RDRs. In particular, 5 deletion types were identified in the 240-nt RDR in nsp3. In the S protein, the 49-nt RDR22 and the 37-nt RDR23 had as many as 32 and 42 deletion types, respectively. The relatively high diversity of these short RDRs could be due to the important role of the S protein in the adaptive evolution of SARS-CoV-2. These results showed that deletions are prone to occur in some specific RDRs. Although 45 RDRs comprising 842 deletion types were identified in the SARS-CoV-2 genome (Table S3), only 4 high-frequency deletions were observed in over 600,000 strains (48.58%) among all genomes (Fig. 2B). They were nsp6 Δ106-108, Δ28271 (a single-nucleotide intergenic deletion before the N protein), and two widely studied deletions, S Δ69-70 and S Δ144. It is worth noting that two of these deletions, S Δ69-70 and S Δ144, have been reported to occur spontaneously in immunodeficient patients (12, 18). In addition, 19 medium-frequency deletions were identified in nsp1, nsp6, S, and N, as well as in four accessory proteins (ORF3a, ORF6, ORF7a, and ORF8) (Fig. 2C).
By analyzing the temporal and spatial distribution of these four high-frequency deletions, we found that they shared a similar growth pattern on six continents. They all emerged first in Europe at the end of 2020 and then spread to other regions (Fig. S2). Phylogenetic trees were used to analyze whether these deletions were distributed repeatedly and widely. The results showed that these four high-frequency deletions were distributed in parallel branches (Fig. 2D). Their repeated and independent occurrence indicated potential advantages for viral adaptation.

Relationships between deletions and SARS-CoV-2 variants. The four high-frequency deletions shared a similar spatiotemporal pattern, which suggested that they were associated with one another. Venn diagram analysis showed that these four deletions mostly appeared together, co-occurring in about 580,000 (45%) sequenced SARS-CoV-2 genomes (Fig. 3A). Apart from the co-occurrence of all four high-frequency deletions, other combinations of these deletions also occurred repeatedly. We named the combinations of deletions group a to group o (Fig. 3B). Five of these combinations were observed more than 10,000 times, which is relatively frequent, since it exceeds the frequency of the medium-frequency deletions shown in Fig. 2C. Recently, the WHO reported variants of concern (VOC) and variants of interest (VOI) of SARS-CoV-2 based on their potential transmission risk and immune escape abilities. When we calculated the ratio of each combination in different variants, we found that, in addition to the combination of all four high-frequency deletions, the combinations with three of the four high-frequency deletions also belonged mainly to the Alpha strain (Fig. 3B). Apart from the four high-frequency deletions, we observed that each SARS-CoV-2 variant had its own unique combination of deletions and mutations (Fig. 3C and D).
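The group labels a to o mentioned above are consistent with simple counting: four deletions give 2^4 - 1 = 15 non-empty combinations, and "a" through "o" is exactly 15 letters. The short Python sketch below enumerates them. The helper name `deletion_groups` and the ordering (largest combination first) are assumptions made for illustration; the actual letter assignment is the one shown in Fig. 3B and may differ.

```python
from itertools import combinations
from string import ascii_lowercase

# The four high-frequency deletions named in the text.
DELETIONS = ["nsp6 Δ106-108", "S Δ69-70", "S Δ144", "Δ28271"]


def deletion_groups(deletions):
    """Map letters to every non-empty combination of the given deletions.

    Hypothetical ordering: combinations are listed from largest to
    smallest, so group "a" is the co-occurrence of all deletions.
    """
    combos = []
    for size in range(len(deletions), 0, -1):
        combos.extend(combinations(deletions, size))
    # zip stops at the shorter input, so 15 combinations get letters a..o.
    return dict(zip(ascii_lowercase, combos))


groups = deletion_groups(DELETIONS)
for letter, combo in groups.items():
    print(letter, ", ".join(combo))
```

Counting how many genomes fall into each group then reduces to looking up the set of high-frequency deletions carried by each genome in this table.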
These results showed that variants formed at different times and in different environments carried their own unique deletions and mutations.

Association of deletions and mutations in the S protein. To explore whether these deletions could influence viral antigenicity or change the binding regions for neutralizing antibodies, we first analyzed the relationship between deletions, mutations, antigenic sites, and the binding regions for neutralizing antibodies in the S protein (Fig. 4 and Fig. S3). We found that the mutation-prone regions were staggered relative to the RDRs. The NTD of the S protein was a high-risk region for deletions, while multiple high-frequency mutations were found in the S2 part of the S protein (Fig. 4A and B). In SARS-CoV-2, IgG and IgA epitopes are mainly located in the S2 domain (Fig. 4C) (19). Previous studies have shown that most mutations do not change the antigenic sites of the virus (20). We further collected the currently reported neutralizing antibodies against the SARS-CoV-2 S protein. These antibodies could be divided into three types according to their binding sites (NTD, RBM, and HR2) (Table S5). Deletions in the NTD partially overlapped with the binding sites of these neutralizing antibodies (Fig. 4D). When we mapped RDRs and HDAs to the 3-D structure of the S protein, we found that, in SARS-CoV-2-like viruses, HDAs mainly aggregated in the RBD. In SARS-CoV-2, however, RDRs were mainly in the NTD. RDRs and HDAs partially overlapped in the NTD region covering S Δ69-70 (Fig. 4E). The observed distribution of deletion sites therefore differed between SARS-CoV-2 and SARS-CoV-2-like viruses. However, since all these strains belong to the same species, the observed difference may be due to the limited evolutionary time. Given a long period of evolution, deletions in SARS-CoV-2 and SARS-CoV-2-like viruses may converge on a similar pattern.
All the results above indicated that some deletions in the S protein may contribute to viral adaptation, including viral transmissibility and immune escape (20).

Deletions occurred frequently and widely in SARS-CoV-2, yet recent studies mainly focused on the S protein. Our analyses showed an overall distribution profile of deletions over the entire SARS-CoV-2 genome. We found that these deletions had a significant regional preference. We further extended this study to SARS-CoV-2-like and SARS-CoV-like viruses and found that deletion and insertion events in these sequences also had a regional preference. These results implied that RDRs may have played a role in the evolution of SARS-CoV-2 and SARS-related coronaviruses. Among all RDRs, the four high-frequency deletions were detected mainly in the Alpha variant, which indicated that the rapid increase of these deletions was due to the widespread outbreak of the Alpha strain. Among these four deletions, S Δ144 has already been shown to be involved in viral escape from neutralizing antibodies (15), and S Δ69-70 was involved in increasing cell entry efficiency (12). However, the function of the medium-frequency deletions remains unclear. Furthermore, the cooperation of these deletions with some SNPs may play a role in SARS-CoV-2 adaptation and evolution. Therefore, further studies are urgently needed to understand their role in viral evolution and transmission. RNA-RNA interactions may trigger these deletions during viral replication. Lei et al. determined the SARS-CoV-2 RNA genome structure by icSHAPE and found a large number of RNA-RNA interaction regions. Omer et al. found that SARS-CoV-2 had many short- and long-distance RNA-RNA interactions within cells (21). Another study revealed that structural variants were enriched at the transcription regulatory sites (TRS) of the SARS-CoV-2 genome (17).
Here, we also found that a portion of the identified RDRs was located in front of the ORFs. More studies are required to reveal the reasons for the occurrence of deletions and their location preference. The occurrence of deletions has been shown to lead to the immune escape of SARS-CoV-2 strains from neutralizing antibodies such as 4A8 (15). In this study, we found that the neutralizing antibody binding sites, which were mainly located in the NTD and RBD of SARS-CoV-2, overlapped with RDRs in the NTD. Furthermore, these RDRs and the mutations in the S protein were present in a staggered arrangement. RDRs appeared mostly in the NTD of the S protein, while most of the high-frequency mutations were present in S2. The complementary relationship between deletions and mutations indicated that SARS-CoV-2 evolved by using deletions to partly escape host immunity. The deletions with regional preference may work synergistically with other mutations to yield more comprehensive and rapid adaptability. Since current SARS-CoV-2 vaccines were mainly developed against the S protein, these observations on insertions and deletions raise many questions. For instance, can vaccine development tolerate these SVs? Are the vaccines already on the market significantly or partly weakened by these deletions? Some studies have demonstrated the role of deletions in the S protein in viral adaptation, especially in changing NTD antigenicity with respect to potently neutralizing convalescent plasma or specific neutralizing antibodies (20). The S Δ144/145 and S Δ243-244 deletions were confirmed to lie at the binding sites of the neutralizing antibody 4A8, and these two deletions were shown to abolish 4A8 binding (15). Therefore, these SVs (deletions and insertions) require careful monitoring and tracking in the future.

Sequence source. The aligned SARS-CoV-2 sequences were acquired on July 8, 2021, from the GISAID database (22).
All sequences were collected before July 5, 2021. Sequences longer than 29,000 nt had already been aligned with MAFFT in the GISAID database. After downloading these aligned sequences, we performed quality control according to the sequence quality standard of the National Information Center, together with host screening. Sequences with more than 15 Ns or 50 merged bases were discarded, and only those isolated from human samples were kept. In the end, 1,289,583 sequences were used for deletion and insertion identification and further analysis. At the same time, we collected from GISAID the lineage information, sampling time, and location information of these SARS-CoV-2 sequences.

Insertion, deletion, and mutation identification. Sequences related to SARS-CoV and SARS-CoV-2 were collected and processed with the tool 'Genome-to-Variants' from the website of the China National Center for Bioinformation (https://ngdc.cncb.ac.cn/ncov/online/tool/variation) to obtain mutation information (23). The tool aligns each sequence to its reference sequence and then lists variations in VCF format. The reference sequences for SARS-CoV-like and SARS-CoV-2-like sequences were NC_004718.3 (NCBI Reference Sequence) and EPI_ISL_402124 (GISAID Reference Sequence), respectively. The insertion and deletion locations were extracted from the VCF file. Because this online tool cannot readily handle the large SARS-CoV-2 dataset, we processed those sequences with an R script, which is described in Fig. S4 in the supplemental material. The sequence EPI_ISL_402124 was used as the reference sequence to extract mutation and SV information. We extracted the reference sequence and compared it to the other sequences one by one. To avoid interference from sequencing quality, only sites where a gap was aligned against a normal base (A, T, C, or G) were treated as insertions or deletions. The R script is available at https://github.com/wuaipinglab/genome_treatment.
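The quality filter and the gap-based deletion calling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R script: the function names `passes_quality` and `call_deletions` are hypothetical, "merged bases" is assumed to mean IUPAC ambiguity codes other than N, and the input is assumed to be a reference and a query taken from the same alignment, so both strings have equal length.

```python
NORMAL_BASES = set("ACGT")
# Assumption: "merged bases" are IUPAC ambiguity codes other than N.
MERGED_BASES = set("RYSWKMBDHV")


def passes_quality(seq, max_n=15, max_merged=50):
    """Keep a sequence only if it has at most 15 Ns and 50 merged bases."""
    seq = seq.upper()
    n_count = seq.count("N")
    merged_count = sum(1 for base in seq if base in MERGED_BASES)
    return n_count <= max_n and merged_count <= max_merged


def call_deletions(ref_aln, query_aln):
    """Return deletions as (start, length) in 1-based reference coordinates.

    A deletion column is one where the query has a gap ("-") opposite a
    normal reference base (A, C, G, or T). Consecutive deletion columns
    are reported as one deletion.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    deletions = []
    ref_pos = 0
    start, length = None, 0
    for ref_base, query_base in zip(ref_aln.upper(), query_aln.upper()):
        if ref_base == "-":
            continue  # column absent from the reference (an insertion)
        ref_pos += 1
        if query_base == "-" and ref_base in NORMAL_BASES:
            if start is None:
                start, length = ref_pos, 0
            length += 1
        elif start is not None:
            deletions.append((start, length))
            start = None
    if start is not None:
        deletions.append((start, length))
    return deletions


# Toy example: two reference bases (positions 3 and 4) are missing.
print(call_deletions("ATGCCGTA", "AT--CGTA"))
```

Note that this sketch also reports gaps at the ends of a query as deletions; whether terminal gaps should instead be treated as missing data is a design choice the text does not specify.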
Phylogenetic tree analysis. The phylogenetic trees of SARS-CoV-like and SARS-CoV-2-like viruses were built from ORF1b with the software FastTree, version 2.1.9 (24), using the parameters "FastTree -gtr -nt". The SARS-CoV-2 phylogenetic tree was also constructed with FastTree, using full-length sequences and the same parameters. The phylogenetic tree for each high-frequency deletion was shown on a background. The background for viral evolution was composed of the latest sequence in each PANGO lineage. The PANGO lineages with deletions were highlighted in red.

Recurrent deletion region identification. To assemble reasonable recurrent regions, all deletions that occurred fewer than five times were removed. The remaining deletions were joined according to their location on the SARS-CoV-2 reference genome. The assembled areas were defined as recurrent deletion regions (RDRs). For further analysis of the characteristics of these RDRs, the number of deletion types in each RDR was recorded. In SARS-CoV-like viruses, the insertion and deletion positions were uniformly corrected based on the starting position of each protein relative to SARS-CoV-2.

RDR visualization in protein structures. The simulated S protein structure, which belongs to lineage A, was obtained from the Zhang Yang lab website (http://zhanglab.ccmb.med.umich.edu/COVID-19/). The 3D structure visualization was done in PyMOL (25).

Neutralizing antibody. The SARS-CoV-2 antibody information was collected from the coronavirus antibody database CoV-AbDab (26). Antibodies able to neutralize the SARS-CoV-2 virus were selected. Their target sites on the virus were collected from the original research articles, which are listed in Table S3 in the supplemental material. We used an R script to display these neutralizing antibodies with their detailed binding positions on the S protein.
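The recurrent deletion region identification described above (drop rare deletions, join the rest by genome location, count deletion types per region) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_rdrs` is hypothetical, deletions are assumed to be given as (start, length) pairs with an occurrence count, and "joined by their location" is assumed here to mean merging deletions that share at least one reference position. The coordinates and counts in the example are toy values, not data from the study.

```python
def build_rdrs(deletion_counts, min_count=5):
    """Merge recurrent deletions into recurrent deletion regions (RDRs).

    deletion_counts maps (start, length) in 1-based reference coordinates
    to the number of genomes carrying that deletion. Deletions seen fewer
    than min_count times are removed, and overlapping deletions are merged.
    """
    kept = sorted(
        (start, start + length - 1)
        for (start, length), count in deletion_counts.items()
        if count >= min_count
    )
    rdrs = []
    for start, end in kept:
        if rdrs and start <= rdrs[-1]["end"]:
            # Overlaps the current region: extend it and count the type.
            rdrs[-1]["end"] = max(rdrs[-1]["end"], end)
            rdrs[-1]["n_types"] += 1
        else:
            rdrs.append({"start": start, "end": end, "n_types": 1})
    return rdrs


# Toy input: (start, length) -> occurrences. Values are made up.
toy_counts = {
    (100, 6): 50,  # kept
    (103, 3): 7,   # kept, overlaps the deletion above
    (200, 3): 4,   # removed: seen fewer than five times
    (300, 1): 9,   # kept, forms its own region
}
for rdr in build_rdrs(toy_counts):
    region_length = rdr["end"] - rdr["start"] + 1
    print(rdr["start"], rdr["end"], region_length, rdr["n_types"])
```

Whether directly adjacent deletions that do not overlap should also be merged into one region is not stated in the text; changing the comparison to `start <= rdrs[-1]["end"] + 1` would merge them.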
1. Emergence of a new SARS-CoV-2 variant in the UK
2. Variation in SARS-CoV-2 outbreaks across sub-Saharan Africa
3. Reduced neutralisation of the Delta (B.1.617.2) SARS-CoV-2 variant of concern following vaccination
4. Reduced neutralization of SARS-CoV-2 B.1.617 by vaccine and convalescent serum
5. Increased transmissibility and global spread of SARS-CoV-2 variants of concern as at
6. Detection of a SARS-CoV-2 variant of concern in South Africa
7. Going beyond SNPs: the role of structural genomic variants in adaptive evolution and species diversification
8. Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study
9. Emerging of a SARS-CoV-2 viral strain with a deletion in nsp1
10. Genomic monitoring of SARS-CoV-2 uncovers an Nsp1 deletion variant that modulates type I interferon response
11. Characterization of SARS-CoV-2 ORF6 deletion variants detected in a nosocomial cluster during routine genomic surveillance
12. Recurrent emergence of SARS-CoV-2 spike deletion H69/V70 and its role in the variant of concern lineage B.1.1.7
13. Bioinformatics prediction of B and T cell epitopes within the spike and nucleocapsid proteins of SARS-CoV2
14. SARS CoV-2 escape variants exhibit differential infectivity and neutralization sensitivity to convalescent or post-vaccination sera
15. Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape
16. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
17. Structural variants in SARS-CoV-2 occur at template-switching hotspots
18. Neutralising antibodies in Spike mediated SARS-CoV-2 adaptation
19. Viral epitope profiling of COVID-19 patients reveals cross-reactivity and correlates of severity
20. SARS-CoV-2 variants, spike mutations and immune escape
21. The short- and long-range RNA-RNA Interactome of SARS-CoV-2
22. Data, disease and diplomacy: GISAID's innovative contribution to global health
23. An online coronavirus analysis platform from the National Genomics Data Center
24. FastTree 2 - approximately maximum-likelihood trees for large alignments
25. Pymol: an open-source molecular graphics tool
26. CoV-AbDab: the coronavirus antibody database

We acknowledge the members of the Wu laboratory for insightful discussions regarding this study. We gratefully acknowledge the laboratories that shared the sequence data via GISAID. We have no conflicts of interest to declare.