key: cord-0661067-3mjcg56t authors: Yin, Changchuan; Yau, Stephen S.-T. title: Inverted repeats in coronavirus SARS-CoV-2 genome and implications in evolution date: 2020-11-24 journal: nan DOI: nan sha: 8b0ed525d72a40b12bd329dd4246a2dee17dfb42 doc_id: 661067 cord_uid: 3mjcg56t

The coronavirus disease (COVID-19) pandemic, caused by the coronavirus SARS-CoV-2, has caused 60 million infections and 1.38 million fatalities. Genomic analysis of SARS-CoV-2 can provide insights on drug design and vaccine development for controlling the pandemic. Inverted repeats in a genome greatly impact the stability of the genome structure and regulate gene expression. Inverted repeats are involved in cellular evolution and genetic diversity, genome arrangements, and diseases. Here, we investigate the inverted repeats in the coronavirus SARS-CoV-2 genome. We found that the SARS-CoV-2 genome has an abundance of inverted repeats. The inverted repeats are mainly located in the gene of the Spike protein. This result suggests that the Spike protein gene undergoes recombination events and is therefore essential for fast evolution. Comparison of the inverted repeat signatures in human and bat coronaviruses suggests that SARS-CoV-2 is most closely related to the SARS-related coronavirus SARSr-CoV/RaTG13. The study also reveals that the recent SARS-related coronavirus, SARSr-CoV/RmYN02, has a large number of inverted repeats in the spike protein gene. In addition, this study demonstrates that the inverted repeat distribution in a genome can be considered a genomic signature. This study highlights the significance of inverted repeats in the evolution of SARS-CoV-2 and presents the inverted repeats as a genomic signature in genome analysis.

The novel human coronavirus SARS-CoV-2 (formerly 2019-nCoV), the causative agent of the Coronavirus Disease-2019 (COVID-19) pandemic, first emerged in Wuhan, China, in December 2019 and had claimed 1.38 million lives worldwide as of Nov. 24, 2020 (Max Roser and Hasell, 2020).
Understanding the molecular structure and evolution of the SARS-CoV-2 genome is urgent. SARS-CoV-2 is a beta coronavirus, like MERS-CoV and SARS-CoV. All three of these coronaviruses have their origins in bats, yet the zoonotic origin of SARS-CoV-2 is still unconfirmed. Zhou et al. (2020c,b) showed that the bat SARS-related coronavirus strain SARSr-CoV/RaTG13, identified from a bat Rhinolophus affinis in Yunnan province, China, in July 2012, shares 96.2% nucleotide identity with SARS-CoV-2. A recent study identified a new SARSr-CoV/RmYN02 (2019) from Rhinolophus malayanus, which is closely related to SARS-CoV-2 (Zhou et al., 2020a). SARSr-CoV/RmYN02 shares 93.3% nucleotide identity with SARS-CoV-2 and comprises natural insertions at the S1/S2 cleavage site of the Spike protein. The unique S1/S2 cleavage site in the Spike protein of SARS-CoV-2 may facilitate the zoonotic spread of SARS-CoV-2. However, how these CoVs are related in origin is not entirely clear.

The SARS-CoV-2 coronavirus contains a linear single-stranded positive-sense RNA genome (Fig.1). The SARS-CoV-2 RNA genome of 29.9 kb has a total of 11 genes with 11 open reading frames (ORFs) (Yoshimoto, 2020), consisting of the leader sequence (5'UTR), the coding regions, and the 3'UTR pseudoknot stem-loop (Wu et al., 2020). The coding regions include ORF1ab, which encodes 16 non-structural proteins (Finkel et al., 2020), the genes of the structural proteins (spike (S), envelope (E), membrane (M), and nucleocapsid (N)) (Gordon et al., 2020), and the genes of several accessory proteins. ORF1ab encodes replicase polyproteins required for viral RNA replication and transcription (Chen et al., 2020b; Cavasotto et al., 2020). Nonstructural protein 1 (nsp1) likely inhibits host translation by interacting with the 40S ribosomal subunit, leading to host mRNA degradation through cleavage near their 5'UTRs. Nsp1 promotes viral gene expression and immunoevasion in part by interfering with interferon-mediated signaling.
Nonstructural protein 2 (nsp2) interacts with the host factors prohibitin 1 and prohibitin 2, which are involved in many cellular processes, including mitochondrial biogenesis. The third non-structural protein (nsp3) is the papain-like proteinase. Nsp3 is an essential component, and the largest, of the replication and transcription complex. The papain-like proteinase cleaves nonstructural proteins 1-3 and blocks the host's innate immune response, promoting cytokine expression (Serrano et al., 2009; Lei et al., 2018). Nsp4, encoded in ORF1ab, is responsible for forming double-membrane vesicles (DMVs). The other non-structural proteins are the 3CLpro protease (3-chymotrypsin-like proteinase) and nsp6. The 3CLpro protease is essential for RNA replication. It is responsible for processing the C-terminus of nsp4 through nsp16 in coronaviruses (Anand et al., 2003). Together, nsp3, nsp4, and nsp6 can induce DMVs (Angelini et al., 2013).

SARS-coronavirus has a unique RNA replication facility, including two RNA-dependent RNA polymerases (RNA pol). The first RNA polymerase is the primer-dependent non-structural protein 12 (nsp12), and the second RNA polymerase is nsp8, which has primase capacity for de novo replication initiation without primers (Te Velthuis et al., 2012). Nsp7 and nsp8 are essential proteins in the replication and transcription of SARS-CoV-2. Nsp7 is responsible for nuclear transport. The SARS-coronavirus nsp7-nsp8 complex is a multimeric RNA polymerase for both de novo initiation and primer extension (Prentice et al., 2004; Te Velthuis et al., 2012). Nsp8 also interacts with the ORF6 accessory protein. The nsp9 replicase protein of SARS-coronavirus binds RNA and interacts with nsp8 for its functions (Sutton et al., 2004). Helicase (nsp13) possesses helicase activity, thus catalyzing the unwinding of dsRNA or structured RNA into single strands. Importantly, nsp14 may function as a proofreading exoribonuclease for virus replication; hence, the SARS-CoV-2 mutation rate remains low.

Figure 1: The structural diagram of the SARS-CoV-2 genome (GenBank: NC_045512). The diagram was made using DNA Feature Viewer (Zulkower and Rosser, 2020).

Furthermore, the SARS-CoV-2 genome encodes several structural proteins. The structural proteins possess much higher immunogenicity for T cell responses than the non-structural proteins (Li et al., 2008). The structural proteins include spike (S), envelope (E), membrane protein (M), and nucleoprotein (N) (Marra et al., 2003; Ruan et al., 2003). The Spike glycoprotein has two domains, S1 and S2. Spike protein S1 attaches the virion to the host cell membrane through the receptor ACE2, initiating the infection (Wan et al., 2020; Wong et al., 2004). After being internalized into the endosomes of the cells, the S glycoprotein is cleaved by cathepsin CTSL. The spike protein domain S2 mediates fusion of the virion and cellular membranes by acting as a class I viral fusion protein. Notably, the spike glycoprotein of SARS-CoV-2 contains a furin-like cleavage site (Coutard et al., 2020). A recent study indicates that SARS-CoV-2 is more infectious than SARS-CoV according to the changes in S protein-ACE2 binding affinity. The envelope (E) protein interacts with membrane protein M in the budding compartment of the host cell. The M protein holds dominant cellular immunogenicity (Liu et al., 2010). Nucleoprotein (ORF9a) packages the positive-strand viral RNA genome into a helical ribonucleocapsid (RNP) during virion assembly through its interactions with the viral genome and membrane protein M (He et al., 2004). Nucleoprotein plays an important role in enhancing the efficiency of subgenomic viral RNA transcription and viral replication.

In addition to the coding regions, the SARS-CoV-2 genome contains hidden structures that can maintain genome stability, regulate gene replication and expression, and control the virus life cycle.
The noncoding genome structures include leader sequences, transcriptional regulatory sequences (TRS), G-quadruplex structures, frame-shifting regions, and repeats. The first non-coding structure is the 5' leader sequence of about 265 bp, which is a unique characteristic of coronavirus replication and plays critical roles in the gene expression of coronavirus during its discontinuous sub-genomic replication (Li et al., 2005). SARS-CoV-2 contains G-quadruplex structures (Ji et al., 2020). It is well established that sequences with G-blocks (adjacent runs of guanines) can potentially form non-canonical G-quadruplex (G4) structures (Choi and Majima, 2011; Métifiot et al., 2014). The G4 structures are formed by stacking two or more G-tetrads held together by Hoogsteen hydrogen bonds; they are often sites of genomic instability and serve one or more biological functions (Bochman et al., 2012).

An inverted repeat is a single-stranded sequence of nucleotides followed downstream by its reverse complement. The intervening sequence between the initial sequence and the reverse complement is called a spacer. When the spacer length is zero, the inverted repeat is called a palindrome. For example, the inverted repeat 5'-ATTCGCGAAT-3' is a palindrome: the palindrome-first sequence is 5'-ATTCG-3', and the palindrome-second sequence is 5'-CGAAT-3'. When the spacer length is non-zero, the repeat is called a general inverted repeat. In a general inverted repeat, we still denote the initial sequence as the palindrome-first sequence and the downstream reverse complement as the palindrome-second sequence. For example, in the general inverted repeat 5'-TTTAGGT...ACCTAAA-3', the palindrome-first sequence is 5'-TTTAGGT-3', and the palindrome-second sequence is 5'-ACCTAAA-3'.
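The definitions above can be checked mechanically. The following is a minimal sketch in Python, not code from this study; the helper names `reverse_complement` and `is_palindrome` are illustrative, and the sequences are written in the DNA alphabet (A, C, G, T) as in the examples above.

```python
# Watson-Crick complement table for the DNA alphabet used in the text.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """True if seq is an inverted repeat with an empty spacer,
    i.e. the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


# Palindrome example from the text: the second half is the
# reverse complement of the first half.
print(reverse_complement("ATTCG"))    # palindrome-second of 5'-ATTCGCGAAT-3'
print(is_palindrome("ATTCGCGAAT"))

# General inverted repeat example from the text (spacer omitted).
print(reverse_complement("TTTAGGT"))  # palindrome-second of 5'-TTTAGGT...ACCTAAA-3'
```

Because a palindrome in this sense reads the same on both strands, testing `seq == reverse_complement(seq)` is equivalent to comparing its two halves.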
Through self-complementary base pairing, an inverted repeat can form a stem-loop (hairpin) structure in an RNA molecule, where the palindrome-first and palindrome-second sequences make the stem, and the spacer sequence makes the loop. It should be noted that an inverted repeat may not have perfect complementary base pairing between the palindrome-first and palindrome-second sequences, so the stem formed by an imperfect inverted repeat can have mismatches, insertions, or deletions.

Inverted repetitive sequences are principal components of the archaeal and bacterial CRISPR-Cas systems (Mojica et al., 2005), which function as adaptive antiviral defense systems. Inverted repeats have important biological functions in viruses. Inverted repeats delimit the boundaries of transposons in genome evolution and form stem-loop structures that affect genome stability and flexibility. Inverted repeats are described as hotspots of eukaryotic and prokaryotic genomic instability (Voineagu et al., 2008), replication (Pearson et al., 1996), and gene silencing (Selker, 1999). Therefore, inverted repeats are involved in cellular evolution and genetic diversity, mutations, and diseases.

Despite the paramount roles of the non-coding structures, they are not as immediately visible as the coding regions. This study aims to identify one of the crucial non-coding structures, inverted repeats, in the SARS-CoV-2 genome and to investigate the relationship between the inverted repeats and virus evolution.

The complete genomes of coronaviruses were scanned for inverted repeats using Palindrome analyser (Brázda et al., 2016). Palindrome analyser (http://bioinformatics.ibp.cz/) is a web-based server for retrieving palindromic and inverted repeats in DNA or RNA sequences. The server describes the features of inverted repeats, including similarity analysis, localization, and visualization.
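To make the notion of scanning a sequence for perfect inverted repeats concrete, here is a brute-force sketch in Python. It is an illustration only and is not the algorithm used by the Palindrome analyser server. The function name, the parameters `min_len`, `max_len`, and `max_spacer`, and the rule for discarding shorter repeats (drop a hit whenever its stem can be extended by one more base pair, outward or into the spacer) are all assumptions made for this example.

```python
from typing import List, Tuple

# Watson-Crick pairs; any other symbol (e.g. N) is treated as unpairable.
_PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(_PAIR[base] for base in reversed(seq))


def _extendable(seq: str, start: int, arm: int, spacer: int) -> bool:
    """True if the stem can grow by one base pair outward or inward."""
    second = start + arm + spacer
    left, right = start - 1, second + arm
    outward = left >= 0 and right < len(seq) and _PAIR.get(seq[left]) == seq[right]
    inward = spacer >= 2 and _PAIR.get(seq[start + arm]) == seq[second - 1]
    return outward or inward


def find_perfect_inverted_repeats(
    seq: str, min_len: int = 6, max_len: int = 15, max_spacer: int = 10
) -> List[Tuple[int, int, int]]:
    """Return (start, arm_length, spacer_length) for perfect inverted repeats.

    Positions are 0-based. RNA input is mapped to the DNA alphabet (U -> T).
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    hits = []
    for start in range(n):
        for arm in range(min_len, max_len + 1):
            first = seq[start:start + arm]
            if len(first) < arm:
                break  # ran past the end of the sequence
            if any(base not in _PAIR for base in first):
                continue  # skip arms containing ambiguous symbols
            target = reverse_complement(first)
            for spacer in range(max_spacer + 1):
                second = start + arm + spacer
                if second + arm > n:
                    break
                if seq[second:second + arm] != target:
                    continue
                # Keep only repeats not nested in a longer perfect repeat.
                if arm < max_len and _extendable(seq, start, arm, spacer):
                    continue
                hits.append((start, arm, spacer))
    return hits


# The general inverted repeat from the text, embedded with a 3-nt spacer.
print(find_perfect_inverted_repeats("GGTTTAGGTCCCACCTAAAGG"))
```

The scan is quadratic in the window sizes and is meant for clarity rather than speed. A small `max_spacer` finds hairpin-like repeats only; finding arms that lie thousands of bases apart would require a much larger spacer limit or an index-based search.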
To ensure consistency in comparing coronavirus genomes, we only extracted the inverted repeats with perfect complementary base pairing between the palindrome-first and palindrome-second sequences. Note that a short inverted repeat of length P can lie inside a long inverted repeat of length Q (Q > P); in this case, we only extracted the inverted repeat of length Q and excluded the inverted repeat of length P. The retrieved inverted repeats were mapped onto the protein genes in a genome according to the positions of the palindrome-first and palindrome-second sequences of the inverted repeats.

The distributions of inverted repeats on protein genes in the different genomes are assessed by the Wasserstein distance, also known as the earth mover's distance. The Wasserstein distance corresponds to the minimum amount of work required to transform one distribution into the other. The p-th Wasserstein distance between two probability distributions $\mu$ and $\nu$ is defined as follows (Vallender, 1974),

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\gamma(x, y) \right)^{1/p},$$

where $\Gamma(\mu, \nu)$ denotes the set of probability distributions on $\mathbb{R} \times \mathbb{R}$ with marginals $\mu$ and $\nu$.

The following complete genomes of SARS-CoVs and SARS-related coronaviruses (SARSr-CoVs) were downloaded from NCBI GenBank: SARS-CoV-2 (GenBank: NC_045512.2) (Wu et al., 2020), SARS-CoV/BJ01 (GenBank: AY278488), SARSr-CoV/RaTG13 (GenBank: MN996532) (Zhou et al., 2020c), SARSr-CoV/RmYN02 (GISAID: EPI_ISL_412977) (Zhou et al., 2020a; Shu and McCauley, 2017), and MERS-CoV (GenBank: NC_019843) (Zaki et al., 2012).

Long inverted repeats are deemed to greatly influence the stability of the genomes of various organisms. The longest inverted repeat identified in the SARS-CoV-2 genome is a 15 bp sequence: the palindrome-first sequence 5'-ACTTACCTTTTAAGT-3' is at 8474-8489 (nsp3 gene), and the palindrome-second sequence 5'-ACTTAAAAGGTAAGT-3' is at 13295-13310 (nsp10 gene). The repeats of 11-15 bp are predominantly located in the gene of the Spike (S) protein (Fig.2(a) and (b)).
The other three protein genes (nsp3, RdRp, and N protein) are also enriched with long inverted repeats. Long inverted repeats often contribute to the stability of a genome because of the stable stems they form. The results also suggest that recombinations took place at the gene of the Spike protein during evolution. Together, the four protein genes (S, nsp3, RdRp, and N protein) with abundant inverted repeats are evolving dramatically and are critical for virus survival; they can therefore be pharmaceutical targets (Gao et al., 2020).

The relation of virus genomes may provide insights on the zoonotic origin and evolution of the viruses. To examine how closely human and bat CoVs are related, we evaluate and compare the distributions of inverted repeats of 11-15 bp in five CoV genomes: SARS-CoV-2 (Fig.2(a)), SARS-CoV (Fig.3(a)), MERS-CoV (Fig.4(a)), SARSr-CoV/RaTG13 (Fig.5(a)), and SARSr-CoV/RmYN02 (Fig.6(a)). The numbers of inverted repeats of 11-15 bp on each protein gene in the genomes are shown in Fig.2(b), Fig.3(b), Fig.4(b), Fig.5(b), and Fig.6(b). The repeat numbers are counted by both the palindrome-first and palindrome-second sequences of the inverted repeats. Taking into account inverted repeats over the wider range of 8-15 bp, we computed the pairwise Wasserstein distances of the repeat numbers of protein genes in three closely related SARSr-CoVs: the distance between SARS-CoV-2 and SARSr-CoV/RaTG13 is 6.8571, the distance between SARS-CoV-2 and SARSr-CoV/RmYN02 is 5.7143, and the distance between SARSr-CoV/RaTG13 and SARSr-CoV/RmYN02 is 6.3571. Therefore, we conclude that the SARS-CoV-2 strain is more closely related to SARSr-CoV/RaTG13 (2013) than to SARSr-CoV/RmYN02 (2019). Both SARS-CoV-2 and SARSr-CoV/RmYN02 may have evolved from SARSr-CoV/RaTG13. We also observe that the Spike protein gene in SARSr-CoV/RmYN02 (Fig.6(b)) has more long inverted repeats than its counterparts in SARS-CoV-2 (Fig.2(b)) and SARSr-CoV/RaTG13 (Fig.5(b)).
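As an illustration of how a Wasserstein distance between two sets of per-gene repeat numbers might be computed, the sketch below implements the first (p = 1) Wasserstein distance for one-dimensional data in plain Python. The text does not specify how the repeat numbers were converted into distributions or which value of p was used, so two assumptions are made here: each genome's per-gene counts are treated as an equally weighted sample, and p = 1. The gene list and all counts are hypothetical placeholders, not data from this study.

```python
from typing import Sequence


def wasserstein_1d(u: Sequence[float], v: Sequence[float]) -> float:
    """First Wasserstein distance between two equally weighted samples
    of the same size on the real line.

    For such samples the optimal transport plan pairs the i-th smallest
    value of u with the i-th smallest value of v, so the distance is the
    mean absolute difference of the sorted values.
    """
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


# Hypothetical per-gene inverted repeat counts for two genomes
# (placeholders for illustration only).
genes = ["nsp3", "RdRp", "S", "N"]
counts_genome_a = {"nsp3": 4, "RdRp": 2, "S": 6, "N": 2}
counts_genome_b = {"nsp3": 3, "RdRp": 2, "S": 8, "N": 1}

distance = wasserstein_1d(
    [counts_genome_a[g] for g in genes],
    [counts_genome_b[g] for g in genes],
)
print(f"W1 distance between the two count profiles: {distance:.4f}")
```

Under these assumptions a smaller value means the two count profiles are more alike. For samples of unequal size or with weights, SciPy's `scipy.stats.wasserstein_distance` computes the same first Wasserstein distance for one-dimensional distributions.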
Consistent with this abundance of long inverted repeats in its Spike gene, the Spike protein in SARSr-CoV/RmYN02 contains natural insertions at the S1/S2 cleavage site. This cleavage site may originate from recombination events in the Spike gene caused by inverted repeats. The total frequencies of inverted repeats of different lengths in the human and bat CoVs also suggest that SARS-CoV-2 is closely related to SARSr-CoV/RaTG13 (Fig.7). Notably, Fig.7 shows that the numbers of inverted repeats of all lengths increase from SARS-CoV (in 2003) to SARS-CoV-2 (in 2019). From these repeat analyses, we may infer that during evolution, recombinations may occur and produce accumulating inverted repeats under natural selection. Recombination can thus be one of the driving forces of fast evolution.

The COVID-19 pandemic has caused substantial health emergencies and economic stress in the world. Vaccine development is critical to mitigating the pandemic. The finding in this study that the genes of three proteins, nsp3, RdRp, and the Spike protein, are rich in inverted repeats suggests that these three proteins are functionally significant for virus survival and should be targets of drug design and vaccine development.

If we relax the requirement of perfectly matching pairs in the inverted repeats, we expect that much longer inverted repeats can be identified, and the number of inverted repeats in the virus genome will increase significantly. The imperfect inverted repeats are the natural forms of the repeats that maintain the genome structures. Because the distribution and types of perfect inverted repeats in a genome are unique, and extracting the perfect inverted repeats is parameter-free, the perfect inverted repeats can be considered a genomic signature. The signatures from perfect inverted repeats are consistent and can therefore be used for distinguishing closely related viruses and differentiating virus mutation variants.
The quantitative comparison of the signature can also provide phylogenetic taxonomy when appropriate numerical metrics for the signatures are realized. Therefore, the perfect inverted repeats can be an effective barcode to delimit species and genotypes.

Coronavirus main proteinase (3CLpro) structure: basis for design of anti-SARS drugs
Severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles
DNA secondary structures: stability and function of G-quadruplex structures
Palindrome analyser-A new web-based server for predicting and evaluating inverted repeats in nucleotide sequences
Functional and druggability analysis of the SARS-CoV-2 proteome
Mutations strengthened SARS-CoV-2 infectivity
Emerging coronaviruses: genome structure, replication, and pathogenesis
Conformational changes of non-B DNA
The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade
The coding capacity of SARS-CoV-2
Machine intelligence design of 2019-nCoV drugs
A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
Characterization of protein-protein interactions between the nucleocapsid protein and membrane protein of the SARS coronavirus
Discovery of G-quadruplex-forming sequences in SARS-CoV-2
Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein
T cell responses to whole SARS coronavirus in humans
siRNA targeting the leader sequence of SARS-CoV inhibits virus replication
The membrane protein of severe acute respiratory syndrome coronavirus acts as a dominant immunogen revealed by a clustering region of novel functionally and structurally defined cytotoxic T-lymphocyte epitopes
The genome sequence of the SARS-associated coronavirus
Coronavirus Pandemic (COVID-19)
G-quadruplexes in viruses: function and potential therapeutic applications
Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements
Inverted repeats, stem-loops, and cruciforms: significance for initiation of DNA replication
Identification and characterization of severe acute respiratory syndrome coronavirus replicase proteins
Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
Gene silencing: repeats that count
Nuclear magnetic resonance structure of the nucleic acid-binding domain of severe acute respiratory syndrome coronavirus nonstructural protein 3
GISAID: Global initiative on sharing all influenza data-from vision to reality
The nsp9 replicase protein of SARS-coronavirus, structure and functional insights
The SARS-coronavirus nsp7+nsp8 complex is a unique multimeric RNA polymerase capable of both de novo initiation and primer extension
Calculation of the Wasserstein distance between probability distributions on the line
Replication stalling at unstable inverted repeats: interplay between DNA hairpins and fork stabilizing proteins
Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus
A 193-amino acid fragment of the SARS coronavirus S protein efficiently binds angiotensin-converting enzyme 2
A new coronavirus associated with human respiratory disease in China
The proteins of severe acute respiratory syndrome coronavirus-2 (SARS CoV-2 or n-COV19), the cause of COVID-19
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein
Addendum: A pneumonia outbreak associated with a new coronavirus of probable bat origin
A pneumonia outbreak associated with a new coronavirus of probable bat origin
DNA features viewer, a sequence annotations formatting and plotting library for python

We sincerely appreciate the researchers worldwide who sequenced and shared the complete genome data of SARS-CoV-2 and other coronaviruses from GISAID (https://www.gisaid.org/).

We declare we have no competing interests.

• COVID-19: coronavirus disease 2019
• SARS: severe acute respiratory syndrome
• SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
• MERS-CoV: Middle East Respiratory Syndrome coronavirus
• CRISPR: clusters of regularly interspaced short palindromic repeats
• ACE2: angiotensin-converting enzyme 2
• NCBI: National Center for Biotechnology Information (USA)