key: cord-0788540-xx0t5119 authors: Klimczak, Leszek J.; Randall, Thomas A.; Saini, Natalie; Li, Jian-Liang; Gordenin, Dmitry A. title: Similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemic date: 2020-08-03 journal: bioRxiv DOI: 10.1101/2020.08.03.234005 sha: b3efc72e144915f3cd017fa0eba0e69875234c46 doc_id: 788540 cord_uid: xx0t5119 Genomes of tens of thousands of SARS-CoV2 isolates have been sequenced across the world and the total number of changes (predominantly single base substitutions) in these isolates exceeds ten thousand. We compared the mutational spectrum in the new SARS-CoV-2 mutation dataset with the previously published mutation spectrum in hypermutated genomes of rubella - another positive single stranded (ss) RNA virus. Each of the rubella isolates arose by accumulation of hundreds of mutations during propagation in a single subject, while SARS-CoV-2 mutation spectrum represents a collection events in multiple virus isolates from individuals across the world. We found a clear similarity between the spectra of single base substitutions in rubella and in SARS-CoV-2, with C to U as well as A to G and U to C being the most prominent in plus strand genomic RNA of each virus. Of those, U to C changes universally showed preference for loops versus stems in predicted RNA secondary structure. Similarly, to what was previously reported for rubella, C to U changes showed enrichment in the uCn motif, which suggested a subclass of APOBEC cytidine deaminase being a source of these substitutions. We also found enrichment of several other trinucleotide-centered mutation motifs only in SARS-CoV-2 - likely indicative of a mutation process characteristic to this virus. Altogether, the results of this analysis suggest that the mutation mechanisms that lead to hypermutation of the rubella vaccine virus in a rare pathological condition may also operate in the background of the SARS-CoV-2 viruses currently propagating in the human population. inhibiting viral and retrotransposon proliferation [5] [6] [7] . ADARs (ADAR1 and ADAR2) are double-strand (ds) RNA-specific enzymes converting adenine 33 to inosine (A to I). Since inosine pairs with cytosine, this will result in A to G changes after the 34 next round of replication. The preference of ADARs for certain deamination motifs -reflecting a 35 combination of immediate nucleotide context and the anticipated dsRNA formed by folding -was 36 assessed for in vitro editing of several RNA substrates. Based on these data, software was 37 developed aimed to assign predictive ADAR deamination scores to any A position in a given 38 RNA molecule [8] . The ADAR editing sites that were deduced in RNAs of cultured stimulated 39 immune cells [9] agreed with the preferences defined in the in vitro study. It remains to be 40 established whether these preferences would hold for a wide variety of RNA substrates in 41 conditions of controlled in vivo expression of either ADAR1 or ADAR2. Unlike ADARs, the structure of APOBEC enzymes allow deamination only in single-strand (ss) RNA or in ssDNA. At least two APOBECs, APOBEC1 and APOBEC3A are capable of 45 deaminating cytosine to uracil in RNA, however an RNA editing capacity of other APOBECs 46 cannot be excluded [10] [11] [12] . Cytosine deamination in RNA creates the normal RNA base -47 uracil, which can be then accurately copied in subsequent rounds of RNA replication. DNA substitutions in each SARS-CoV-2 and rubella MAFs, dividing a base substitution count by the 150 number of the mutated base in the reference sequence. We then assessed similarity of base 151 substitution densities distributions between rubella and each of SARS-CoV-2 filtered MAF using 152 non-parametric Spearman correlation with the null hypothesis that, there is no positive 153 correlation between spectra in rubella and SARS-CoV-2. Statistical evaluation of mutagenesis in trinucleotide-centered mutation motifs Calculating enrichment and statistical evaluation of mutagenesis in a small number of 157 trinucleotide-centered mutation motifs identified from mechanistic knowledge turned productive 158 in our prior assessments of mutagenesis associated with established mechanisms and known 159 preference to certain trinucleotide motifs [14, 19, 26, 27] . In this study we extended statistical 160 evaluation to all 192 possible trinucleotide centered motifs. Trinucleotide and single-nucleotide frequencies in the genomic background were calculated 162 using two alternative methods: (i) context-based -counts in the 41 nt windows centered around each mutation location; (ii) reference-based -counts in the whole reference genome. The output for each analysis (.out) in dot-bracket notation was input into BBEdit (https://www.barebones.com/products/bbedit/) and all characters in both sequence and notation 205 rows were made space delimited. Each of these rows were pasted into Excel and turned into 206 space delimited cells. Sequence and notation were separately copied and pasted using the 207 "Transform" function into a new Excel spreadsheet. A column with the nucleotide position was 208 added and the file saved as a tab delimited text file *RNAfold.txt. For each resulting file the first 209 column was the nucleotide position, the second column is the nucleotide, and the third column 210 was the annotation of that nucleotide in dot-bracket notation. The *RNAfold.txt files were used to 211 add a stem-loop annotation column "RNAfold" to all MAF files using the vlookup function in Excel and saved as a tab delimited text file. For searching for motifs and trinucleotides, *RNAfold.txt files were used to create a searchable 215 text files as follows. Columns two and three of each file were copied into a two new text files. On 216 command line, the two columns in each file were merged using 217 awk '{$(NF-1)=$(NF-1)""$NF;$NF=""}1' OFS=" " The output file from this was opened in BBEdit, the line breaks were removed, resulting in a file Design of the analysis The overarching hypothesis of this study was that some of the processes generating RNA 248 mutation load in population of SARS-CoV-2 genomes are similar (but not necessarily identical) to the processes that generated changes in genomic RNAs of hypermutated rubella viruses. For that purpose, we obtained the viral genome FASTA files and processed them to obtain have originated from the already mutated virus spreading the same mutation(s) into genomes of 277 multiple (up to thousands) downstream isolates (see column "times_refPos_mutated_inMAF" in 278 S1A Table) . Therefore, we also annotated each mutation in a sample by the number of different 279 samples in which such a mutation was found. Since each genome of an individual isolate 280 contained only few mutations and many of these mutations were identical in multiple isolates, we built our analysis to evaluate the overall spectrum of non-redundant mutation events that Table) Table) would be the least affected by functional selection and thus, most accurately 298 represent the impact of unconstrained mutational processes operating on the viral genome. While this group is smaller, it still contains a sufficient number of changes (4,740 mutation 300 events) for detecting trends in the mutational patterns. The mutation spectrum of 7,416 NoDupsFunc events (S2B Table) was also analyzed. Each of the three SARS-CoV-2 mutation 14 302 spectra was compared to the combined mutation spectrum (993 base substitutions) from six 303 independent isolates originated from the hypermutated rubella-vaccine virus [19] (S3A Table) . Unlike many SARS-CoV-2 isolates, where individual mutated event could be carried from one 305 isolate to another, each rubella isolate contained mutations that had occurred independently 306 from the vaccine virus in each subject. Thus, the total of the mutational events in six rubella isolates was, at least in part, representative of the mutation spectrum. However, mutation 308 spectra in each rubella isolate may represent an unknown level of selection. Indeed, a number 309 of mutations was observed in more than one rubella isolate (see column "times_refPos_mutated_inMAF" in S3A Table) . Some level of selection was also indicated by 311 analysis of synonymous and nonsynonymous substitutions in each codon [19] . Therefore, we 312 made separate comparisons of the rubella mutation spectra with each of the three SARS-CoV-2 313 non-redundant MAFs (Fig 1) : the non-duplicated mutation events (NoDups), and its two subsets -the non-duplicated mutation events with potential of functional significance (NoDupsFunc) and 315 the non-duplicated non-functional mutation events (NoDupsNonFunc). Table) . The only apparent discrepancy between the two viruses was in a high density of the G to U 367 changes in the plus-strand of SARS-CoV-2, while they were nearly absent in rubella. We note 368 that the density of G to U changes in minus-strand (reported as the complementary C to A 369 changes in plus-strand in Fig 2) was similar to other low abundant changes in both viruses. A 370 possible origin of the increased G to U changes in SARS-CoV-2 genomes will be detailed in Tables and Methods) . We then compared mutation counts in loop vs stem for each type of 384 base substitutions (Fig 3 and S5 Table) . 407 outstanding feature of the G to U changes showed up in SARS-CoV-2, but not in rubella (see 408 above). Mutational motif preferences in SARS-CoV-2 and rubella genomes suggest APOBEC 411 cytidine deaminases as a source of C to U base substitutions in the plus RNA strand. Since base substitution spectra in SARS-CoV-2 and rubella are correlated, it is likely that they 413 also shared common mechanisms which generated these changes. It is well established that Bars represent minimum estimates of mutation load that can be assigned to motif-specific 435 mutagenic mechanism (MutLoad) as described in Methods. Statistical evaluation of Table. 445 446 The only revealed similarity between statistically-significant enriched motifs in rubella and in 447 SARS-CoV-2 was for the uCn to uUn changes, consistent with the tCn to tTn ssDNA 448 mutagenesis specificity of a subgroup of APOBEC cytidine deaminases. However, even within 449 the APOBEC-like group of motifs there was a difference between strong enrichment with uCa to 450 uUa motif in rubella and the lack of statistically significant preference for this motif in SARS- CoV-2. There were also three groups of motifs significantly enriched in SARS-CoV-2, but not in 452 rubella (see Tab 1 and Discussion for possible mechanistic assignment of these motifs). We also assessed the potential loop vs stem preference for trinucleotide motifs containing C to 455 U and G to U single base substitutions that showed overall loop vs stem preference. None of for loops versus stems in the RNA secondary structure (Fig 3 and S5 Table) . This phenomenon 506 is similar to the preference of ssDNA and mRNA editing in loops by the APOBEC3A cytidine 507 deaminase [35, 38] . Agnostic analysis of enrichments in all 192 possible trinucleotide mutation 508 motifs highlighted statistically significant excess of uCa to uUa motif in rubella, however these 23 509 changes were not prevalent in SARS-CoV-2. Mutations in plus-strands of both viruses showed 510 statistically significant enrichments with uCg to uUg and uCu to uUu motifs (Fig 4A,B) . These 511 motifs belong to a group uCn to uUn (tCn to tTn in DNA) which is characteristic of several 512 APOBEC cytidine deaminases ([14] and references therein). We note that although these 513 signatures were enriched in the non-functional mutations (NoDupsNonFunc), they did not pass 514 the 0.05 FDR threshold in filtered datasets that included mutations with potential functional 515 effects (NoDupsFunc). These differences in the mutation signatures between SARS-Cov-2 and 516 rubella may be due to different APOBEC family members performing editing or due to the 517 confounding presence of other sources of C to U mutagenesis, such as spontaneous cytosine 518 deamination that frequently occurs in ssDNA [39] or oxidative mutagenesis capable of 519 generating C to T mutations in ssDNA in vivo [40, 41] . In support of the role of oxidative 520 damage in SARS-CoV-2 genomes, is the increased prevalence of G to U substitutions which is 521 consistent with the oxidation of guanines in the RNA plus-strand (Fig 2) There were two more groups of trinucleotide mutation motifs involving C to U (and 531 complementary G to A) substitutions in plus RNA strand specifically enriched for SARS-CoV-2 532 (Fig 4A,B) . The aCn to aUn (reverse complement nGu to nAu) group of motifs may represent a 533 preference previously unknown for APOBECs in RNA or just a mutagenic mechanism yet to be 534 defined. The cGn to cAn group of motifs seen in the plus-strand may be in fact due to mutations 24 535 of the reverse complement motif nCg to nUg in the minus-strand. nCg to nTg (CpG to TpG) 536 germline and somatic mutagenesis is universally present in DNA of species with 5-537 methylcytosine and is generated by systems specialized to mutagenesis in methylated CpG shows with high statistical confidence that nCg to nUg (CpG to UpG) mutagenesis in the minus 543 strand is enriched (Fig 4B) supporting the role of nCg-(CpG)-specific cytosine deamination in 544 minus RNA strand in SARS-CoV-2 genomic mutagenesis. In summary, comparison of base substitution spectra and signatures between hypermutated 547 rubella isolates and the SARS-CoV-2 multi-genome dataset demonstrates both similarities and 548 differences in the mutational processes active upon the two plus-strand RNA viruses. It is 549 important to understand the mechanisms that contribute to mutagenesis of viral genomes, since 550 hypermutation of even inactivated rubella vaccine virus was shown to generate reactivated viral 551 particles [19] . We demonstrate here that the APOBEC-specific uCa to uUa changes that are 552 highly enriched in hypermutated rubella, are much less prevalent in SARS-CoV-2. We propose 553 that assessment of uCa to uUa signature in viral genomes can provide insights into the potential 554 hypermutation risk of SARS-CoV-2. Moreover, understanding the genomic mutational patterns 555 is important for predicting virus evolution. Our study has highlighted several distinct features of 556 SARS-CoV-2 mutational spectrum that, after validation with independent dataset(s) can be used 557 to build predictive models for this and related SARS viruses. 25 559 Acknowledgments We thank Dr. Natalya Degtyareva and Mr. Adam Burkholder for critical reading of the 561 manuscript and Mr. Michael Doubintchik for help in the initial evaluation of the on-line databases 562 of SARS-CoV-2 genomes. We gratefully acknowledge the authors, as well as the originating 563 laboratories for submitting the sequences to GISAID's EpiCoV™ Database on which this 564 research is based; complete list of contributors is provided in S9 Table. 565 26 566 567 Journal 570 of virology Viral quasispecies Coronaviruses: an RNA 576 proofreading machine regulates replication fidelity and diversity Epub 2011/05/20 Nidovirus RNA polymerases: Complex 580 enzymes handling exceptional RNA genomes Molecular Basis of Genetic Variation of Viruses Adenosine deaminase acting on RNA (ADAR1), a suppressor of double-586 stranded RNA-triggered innate immune responses Creative deaminases, self-inflicted damage, and genome evolution Predicting sites of ADAR editing in double-stranded 593 Epub 2011/05/19 Identification of the long, edited dsRNAome of LPS-stimulated 596 immune cells APOBEC3A Is Implicated in a Novel Class of G-to-A mRNA Editing in WT1 Transcripts. PloS 601 one APOBEC3A cytidine deaminase induces RNA editing in monocytes and macrophages Functions and 608 regulation of the APOBEC family of proteins. Seminars in cell & developmental biology Clusters of Multiple Mutations: Incidence and Molecular 612 Mechanisms. Annual review of genetics An APOBEC3A 616 hypermutation signature is distinguishable from the signature of background mutagenesis by 28 617 APOBEC3B in human cancers Base 620 damage within single-strand DNA underlies in vivo hypermutability induced by a ubiquitous 621 environmental agent APOBECs and virus restriction Hypermutation in human cancer genomes: footprints and 628 mechanisms The RNA editing enzyme 631 APOBEC1 induces somatic mutations and a compatible mutational signature is present in 632 esophageal adenocarcinomas Infectious 636 vaccine-derived rubella viruses emerge, persist, and evolve in cutaneous granulomas of 637 children with primary immunodeficiencies Rubella virus replication and links to teratogenicity Roles of Host Gene and Non-coding RNA 645 Expression in Virus Infection Architecture and biogenesis of plus-strand RNA virus 647 replication factories Evidence for host-650 dependent RNA editing in the transcriptome of SARS-CoV-2 Versatile 654 and open software for comparing large genomes ANNOVAR: functional annotation of genetic variants from 658 high-throughput sequencing data The Impact of 661 Environmental and Endogenous Damage on Somatic Mutation Load in Human Skin Fibroblasts Mutation signatures specific to DNA alkylating agents in yeast and cancers A fast, lock-free approach for efficient parallel counting of 668 occurrences of k-mers ViennaRNA Package 2.0 Global initiative on sharing all influenza data -from vision 676 to reality. Euro surveillance : bulletin Europeen sur les maladies transmissibles = European 677 communicable disease bulletin Nextstrain: real-681 time tracking of pathogen evolution Comparative genomics suggests limited 685 variability and similar evolutionary patterns between major clades of SARS-CoV-2 Mutation Patterns of Human SARS-CoV-2 and Bat RaTG13 Coronavirus Genomes Are Strongly Biased Towards C>U Transitions, Indicating Rapid 689 Evolution in Their Hosts Instability and decay of the primary structure of DNA Quantification 694 of ongoing APOBEC3A activity in tumor cells by monitoring RNA editing at hotspots The 698 repertoire of mutational signatures in human cancer Passenger 706 hotspot mutations in cancer driven by APOBEC3A and mesoscale genomic features Heat-induced deamination of cytosine residues in deoxyribonucleic 710 acid Oxidative stress-induced mutagenesis in single-strand DNA occurs primarily at cytosines and is 714 DNA polymerase zeta-dependent only for adenines and guanines Mutational signatures of redox stress in yeast single-strand DNA and of aging in human 719 mitochondrial DNA share a common feature The emerging role of RNA modifications in the regulation of mRNA 723 stability Discovery 726 and characterization of artifactual mutations in deep coverage targeted capture sequencing data 727 due to oxidative DNA damage during sample preparation Uracil 731 Accumulation and Mutagenesis Dominated by Cytosine Deamination in CpG Dinucleotides Genome-Wide Estimates of Mutation Rates and Spectrum in 736 Schizosaccharomyces pombe Indicate CpG Sites are Highly Mutagenic Despite the Absence of 737 DNA Methylation. G3 (Bethesda, Md) Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral 741 defense Cytosine deamination and selection of 745 CpG suppressed clones are the two major independent biological forces that shape codon 33 746 usage bias in coronaviruses RNA virus attenuation by 750 codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies Mutational patterns correlate with genome organization in SARS and other 754 coronaviruses Mean values of ADAR scores calculated as described in Methods. 761 Source data are in S1 Dataset Comparison of trinucleotide-centered motif C to U mutation densities between 764 locations prone to loop or to stem formation in viral RNA genomes Densities are 766 calculated by dividing counts of each motif mutations in either loop or in stem by counts of this 767 motif in the loop-forming or in stem-forming regions of the reference sequence. Statistical 768 comparison between mutagenesis in stem vs loop for every base substitution was done by two-769 tailed Fisher's exact test. P-values were corrected by FDR including 16 motifs containing C to U 770 base substitution Comparison of trinucleotide-centered motif G to U mutation densities between 776 locations prone to loop or to stem formation in viral RNA genomes Densities are 778 calculated by dividing counts of each motif mutations in either loop or in stem by counts of this 779 motif in the loop-forming or in stem-forming regions of the reference sequence. Statistical 780 comparison between mutagenesis in stem vs loop for every base substitution was done by two-781 tailed Fisher's exact test. P-values were corrected by FDR including 16 motifs containing G to 782 U base substitution ADAR scores and complete ADAR analysis for SARS-CoV-2 filtered MAFs (contains 788 source data for S1 Fig) Mutation calls and RNA fold prediction in SARS-CoV-2 genomes 791 S1A Table. Complete list of mutation calls in all SARS-CoV-2 genomes in TCGA compatible 792 Mutation Annotation Format (MAF) Annotation of predicted RNA-fold in SARS-CoV-2 reference positions. 794 795 S2 Table. SARS-CoV-2 filtered Mutation Annotation Files (MAFs) NoDupsNonFunc -de-duplicated set of mutations from all samples of the dataset NoDupsFunc -de-duplicated set of mutations from all samples of the dataset NoDups -de-duplicated set of mutations from all samples of the dataset 801 802 S3 Table. Mutation calls and RNA fold prediction in rubella genomes 803 S3A Table. The list of 993 mutations in six rubella isolates Sequences are shown in DNA format (T instead of U) to maintain compatibility with other