key: cord-0935472-rtcwfik6
authors: Yin, Changchuan; Yau, Stephen S.-T.
title: Inverted repeats in coronavirus SARS-CoV-2 genome manifest the evolution events
date: 2021-08-31
journal: J Theor Biol
DOI: 10.1016/j.jtbi.2021.110885
sha: 60d36d8658d8db381f548be7d40a4d651742beb5
doc_id: 935472
cord_uid: rtcwfik6

The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, presents the world with a great unforeseen challenge. The structure and evolution of the virus genome are central to vaccine development, monitoring of transmission trajectories, and prevention of zoonotic infections by new coronaviruses. Of particular interest are the genomic elements known as inverted repeats (IRs), which maintain genome stability, regulate gene expression, and are targets of mutations. However, little research attention has been given to IR content analysis in the SARS-CoV-2 genome. In this study, we propose a geometric analysis method and use it to investigate the distributions of IRs in the genomes of SARS-CoV-2 and related coronaviruses. The method represents each genomic IR sequence pair as a single point and constructs a geometric shape of the genome from the IRs. The IR shape can thus be considered a signature of the genome. The genomes of different coronaviruses are then compared using the constructed IR shapes. The results demonstrate that the SARS-CoV-2 genome, specifically, has an abundance of IRs, and that the IRs in coronavirus genomes increase during evolution events.

An inverted repeat (IR) is a sequence that matches its downstream reverse complement sequence. The initial sequence and the reverse complement may be separated by a spacer, which can vary from zero to thousands of bases. An IR with a zero-length spacer is called a palindrome. For example, the inverted repeat 5'-TTTACGTAAA-3' is a palindrome: the palindrome-first is 5'-TTTAC-3', and the palindrome-second is 5'-GTAAA-3'.
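The definition above can be illustrated with a minimal sketch. It is a naive string-matching scan written for this illustration, not the in-house programs used in the study; the names `reverse_complement`, `find_inverted_repeats`, `arm_len`, and `max_spacer` are hypothetical, and positions are 0-based.

```python
# Hypothetical helper names; a naive sketch, not the study's in-house programs.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_inverted_repeats(genome: str, arm_len: int, max_spacer: int):
    """Yield (first_start, second_start, spacer) for each inverted repeat.

    The arm of length arm_len starting at first_start (the palindrome-first)
    must be followed, after `spacer` bases, by its reverse complement
    (the palindrome-second). A spacer of 0 is a palindrome.
    """
    genome = genome.upper()
    n = len(genome)
    for i in range(n - 2 * arm_len + 1):
        target = reverse_complement(genome[i:i + arm_len])
        for spacer in range(max_spacer + 1):
            j = i + arm_len + spacer
            if j + arm_len > n:
                break  # the second arm would run past the end of the genome
            if genome[j:j + arm_len] == target:
                yield (i, j, spacer)


# The palindrome from the text: palindrome-first TTTAC, palindrome-second GTAAA.
hits = list(find_inverted_repeats("TTTACGTAAA", arm_len=5, max_spacer=0))
print(hits)
```

The scan costs on the order of genome length times `max_spacer` times `arm_len`, which is acceptable for a roughly 30 kb coronavirus genome with a modest spacer bound. Larger inputs would call for an indexed search.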
When the spacer in an inverted repeat is non-zero, the repeat is a general inverted repeat. For convenience, we still denote the initial sequence in a general IR as the palindrome-first and the down[...] (Pearson et al., 1996)

In the IRs of a genome, the palindrome-first and palindrome-second sequences have a strong tendency to form a stem structure when the palindrome-first or palindrome-second sequence is long, or when the spacer between the palindrome-first and palindrome-second sequences is short.

The Delaunay triangulation provides a unique way of triangulating the [...], where sup denotes the supremum and inf denotes the infimum. Because two IR graphs may not have the same number of nodes, a direct one-to-one comparison of their nodes is impossible. Instead, the similarity of two IR graphs can be measured by the Hausdorff distance. If two IR graphs are similar, more of their parts can be superimposed and the Hausdorff distance is small; otherwise, the Hausdorff distance is large.

The following is the procedure for the IR graph analytics in a genome. (1) By string matching, scan the whole genome for inverted repeats using [...]

To identify and analyze IRs in genomes, the complete genomes of coronaviruses were scanned for IRs by in-house computer programs in this study. The computer programs for IR analysis, including IR retrieval, similarity, and [...]

[...] SARS-CoV/BJ01 (Fig. 2), and four bat-CoV genomes: bat-CoV/Pangolin, MERS-CoV, swine-CoV/SADS, and bat-CoV/ZC45 (Fig. 3) [...]

From the qualitative and quantitative analysis, we observe that the SARS-CoV-2 strain (Fig. 1(d)) is more closely related to SARSr-CoV/ZC21 and SARSr-CoV/RaTG13 (Fig. 2(a,b)) than to SARSr-CoV/RmYN02 and SARS-CoV (Fig. 2(c,d)). The IR distribution results (Table 1) show that human CoVs such as [...] we may consider SARS-CoV-2, SARSr-CoV/RaTG13 and SARSr-CoV/ZC21 to be three very close strains.
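The Hausdorff comparison used to judge how close two IR shapes are can be sketched as follows. The sketch assumes the standard definition for two point sets $X$ and $Y$ (symbol names chosen here for illustration), $d_H(X, Y) = \max\{\sup_{x \in X} \inf_{y \in Y} d(x, y),\ \sup_{y \in Y} \inf_{x \in X} d(x, y)\}$, with Euclidean $d$. For finite sets, sup and inf reduce to max and min. The point coordinates below are made-up placeholders, not IR points from any genome.

```python
import math


def directed_hausdorff(xs, ys):
    """Largest distance from a point in xs to its nearest neighbour in ys."""
    return max(min(math.dist(x, y) for y in ys) for x in xs)


def hausdorff(xs, ys):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(xs, ys), directed_hausdorff(ys, xs))


# Hypothetical IR point sets with different numbers of nodes; no one-to-one
# pairing of points is needed.
shape_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shape_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]

print(hausdorff(shape_a, shape_b))
```

Every point of `shape_a` also appears in `shape_b`, so the distance here is driven entirely by the one extra point in `shape_b`. This shows how a single outlying IR point can dominate the measure.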
Table 1 shows that the Pangolin-CoV genome contains the highest frequency of IRs and is far from SARS-CoV-2. The result also shows that the human CoVs (human-CoV/OC43, human-[...]

The relative evolutionary proximities among the bat and human CoVs are further inferred from the Hausdorff distances of the IR points in the CoV genomes (Fig. 5). The results also show that the closest relative of SARS-CoV-2 is SARSr-CoV/RaTG13, followed by bat-CoV/RpYN06. In [...] to civets-SCoV, supporting the hypothesis that civets could be the intermediate [...] (Table 1).

We observed that long IRs of at least 12 bp in CoV genomes increase over evolutionary time. For example, swine-CoV/SADS, which is very virulent and is considered to be in its raw native state, contains the fewest IRs (Fig. 3(c)), whereas the human-CoVs, which have evolved in humans for a long period, are enriched in IRs (Fig. 4).

In this study, we present a novel geometric method to represent the genome architecture using the IR contents of the genome. The method maps the IR points into a geometric shape so that the IR distribution can be [...]

Competing interests
We declare that we have no competing interests.

[...] SARS-CoV-2. Archives of Virology
SARS-CoV-2 hot-spot mutations are significantly enriched within inverted repeats and CpG island loci
Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats. Emerging Microbes & Infections
Comparing images using the Hausdorff distance.
Pattern Analysis and Machine Intelligence
The architecture of SARS-CoV-2 transcriptome
Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins
[...] of the 2019 novel coronavirus
Coronavirus Pandemic (COVID-19)
Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements
Role of inverted DNA repeats in transcriptional and post-transcriptional gene silencing
Inverted repeats, stem-loops, and cruciforms: significance for initiation of DNA replication
Gene silencing: repeats that count
GISAID: Global initiative on sharing all influenza data - from vision to reality
[...] coronavirus found in a patient with characteristic symptoms
Inverted repeats in coronavirus SARS-CoV-2 genome and implications in evolution. Communications in Information and Systems
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein
Identification of novel bat coronaviruses sheds light on the evolutionary origins of SARS-CoV-2 and related viruses
Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Short inverted repeats contribute to localized mutability in human somatic cells

Highlights
Presenting geometric graph method for inverted repeat (IR) analysis
Finding the correlation between IR distributions and evolution events
Comparing the IR distributions of SARS-CoV-2 and bat and human CoVs
Inferring mutations on IRs as the major evolution driver in SARS