key: cord-0874190-v8suvr98 authors: Ramazzotti, Daniele; Angaroni, Fabrizio; Maspero, Davide; Mauri, Mario; D’Aliberti, Deborah; Antoniotti, Marco; Graudenzi, Alex; Piazza, Rocco title: Large-scale analysis of synonymous viral variants reveals global adaptation of the SARS-CoV-2 to the human codon usage date: 2021-04-23 journal: bioRxiv DOI: 10.1101/2021.04.23.441151 sha: 2abee3d5a9140c661a10098ab2a8abf897d19ce0 doc_id: 874190 cord_uid: v8suvr98 Many large national and transnational studies were dedicated to the analysis of SARS-CoV-2 genome. Most studies are focused on missense and nonsense mutations, however approximately 30% of the SARS-CoV-2 variants are synonymous, therefore changing the target codon without affecting the corresponding protein sequence. Here, by performing a large-scale analysis of more than 40000 SARS-CoV-2 genome sequences, we show that, over time, the virus is adapting its codon usage to that of the human host through the accumulation of silent mutations, thus likely improving its effectiveness in using the human aminoacyl-tRNA set. The sole exception to this rule is represented by mutations bearing the signature of the defensive APOlipoprotein B Editing Complex (APOBEC) machinery, which show a negative profile. Taken globally, this study indicates that codon usage adaptation may play a relevant role in the evolution of SARS-CoV-2, which appears to be much more complex than previously anticipated. One-Sentence Summary Synonymous SARS-CoV-2 mutations positively impact viral evolution by increasing adaptation to the human codon usage. The COVID-19 pandemic caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has currently affected approximately 102.1 million people, with 2.2 million COVID-related deaths and half a million new cases per day (1) (updated 2 nd February 2021), hence posing a major health threat on a global scale. 5 This led to a surge of active research aimed at studying SARS-CoV-2 biology: many large national and transnational efforts were dedicated to the analysis of SARS-CoV-2 genome, with more than 100000 raw sequencing datasets currently available. These analyses succeeded in reconstructing the phylogeny of the virus and, at least in part, to implement lineage tracking strategies (2, 3, 4, 5, 6, 7) . However, virtually none of the variants detected so far, with the few 10 relevant exceptions of the p.D614G Spike mutation (8), and the mutations associated with lineages B.1.1.7, B1.1.28 and P1 (9, 10, 11), were unequivocally shown to play a functional role. This evidence led to the conclusion that the vast majority of the reported SARS-CoV-2 sequence changes are neutral or purifying, therefore having limited or no impact on the aggressiveness of COVID-19 pandemic. 15 To the best of our knowledge, most of the studies performed so far focused on missense and nonsense mutations, i.e., mutations that either cause a change in the sequence of the target protein or generate a premature termination codon. However, a large fraction of the SARS-CoV-2 mutations (approximately ~30%) are synonymous, therefore changing the target codon without affecting the associated protein sequence on mRNA translation, as opposed to 20 missense/nonsense variants. This bulk set of mutations are usually discarded during variant analyses, as they are considered as functionally irrelevant. However, it is well known from decades (12, 13, 14, 15 ) that silent mutations may have a profound impact on protein expression, by changing the aminoacyl-tRNA molecule responsible for translating the mutated codon into the corresponding amino acid, a well-known process coordinated by the ribosomal machinery. A silent substitution causing a switch to a more abundant aminoacyl-tRNA is prone to increase the overall protein translating efficiency, while the opposite may occur if the tRNA pairing with the mutated codon is less abundant than the original one (16, 17, 18, 19) . Therefore, we hypothesized that synonymous 5 SARS-CoV-2 variants may play a functional role by adapting the sequence of the virus to the codon usage (CU) of the human host, thus improving its effectiveness in using the human aminoacyl-tRNA set, ultimately making the translation of viral particles more efficient. To investigate this hypothesis, we analyzed a total of 40743 viral genomes and 239546 silent viral mutations identified in 9 different studies (see Materials and Methods), showing: 10 1) that at the global level, the adaptation of mutated codons to the human CU is significantly greater than the wild-type viral counterpart, with the sole exception of mutations bearing the APOlipoprotein B Editing Complex (APOBEC) signature, showing a negative profile, and 2) that the viral adaptation to the human codon usage is increasing over time during the course of the pandemic. 15 To analyze the possible adaptation of the viral codons to the human CU during the course of COVID-19 pandemic, we initially analyzed high-quality samples from NCBI BioProject: PRJEB37886 (Dataset 1). This dataset comprises a total of 27485 COVID-19 cases, where viral SARS-CoV-2 RNA was sequenced using an Amplicon strategy with high coverage. After alignment of the raw viral data to the SARS-CoV-2-ANC reference genome (5) and variant 20 calling (see Materials and Methods for further details), silent variants were further annotated in terms of reference and mutated codon, variant frequency (VF) and sampling time (Supplementary File 1). The relative codon usage (RCU) adaptation was calculated as the ratio between the human CU of the mutated codon and the wild-type counterpart. As previously shown by our team and by others (6, 20, 21), a relevant subset of viral mutations, linked to the C>T signature, are generated by the activation of the human APOBEC machinery. APOBEC represents a family of cytidine deaminases playing a critical role in the intrinsic responses of the host to infections operated by a large set of viruses, among which retroviruses, herpesviruses, papillomaviruses, parvoviruses, hepatitis B and retrotransposons. APOBEC cytidine deaminases target the viral genome, introducing large numbers of C>T mutations, 5 eventually causing the functional inactivation of the pathogen. Therefore, APOBEC mutations globally represent a defense raised by the host cells to suppress viral infection (22). To define the contribution of the APOBEC machinery, we isolated APOBEC C>T | G>A mutations from the entire set of synonymous variants identified in Dataset 1 and we analyzed the relative codon usage profile of mutated vs. wild-type codons in this subgroup. Interestingly, we 10 noticed that the APOBEC variants are associated with a slight decrease in the log2-transformed RCU (see Figure 1A and Supplementary Figure 1A , median log2-transformed RCU≈-0.172.); as opposite, the non-APOBEC (named as OTHER in the following) group showed a marked increase in the log2-transformed RCU (see Figure 1A and Supplementary Figure 1A , median log2-transformed RCU≈0.148; two-sided t-student p-value<2 -16 ), supporting the hypothesis that 15 synonymous mutations not driven by APOBEC can indeed increase the adaptation of the viral sequences to the human CU. To validate our findings, we considered two additional datasets; the first one (Dataset 2, Figure 1B and Supplementary Figure 1B Figure 1C and Supplementary Figure 1C) including data from NCBI BioProject: PRJNA613958. All the analyses corroborated our hypothesis, confirming the increase in the log2-transformed RCU for the synonymous mutations belonging to the OTHER group. Importantly, this process is not restricted to individual viral genes, as it is widespread across the entire SARS-CoV-2 genome, suggesting that the virus may benefit from a global adaptation to the human CU (Figure 1, Supplementary Figures 1 and 2 As the number of available viral sequences is quickly increasing over time, it will be interesting 10 to isolate individual silent variants showing relevant changes in their VF over large time spans in order to study specific codon usage adaptation patterns in the context of individual viral genes/proteins. In conclusion, global as well as time-series analyses of silent mutations indicate that codon usage adaptation may play a relevant role in the evolution of SARS-CoV-2, suggesting that the 15 evolution of the viral genome has been more intense than previously anticipated. Further studies will be required to thoroughly dissect the overall impact and clinical relevance of these findings. In the next future it will be important to assess if specific viral genomes with a high codon usage adaptation also show an increased infectiousness and/or clinical aggressiveness, which, at the time being, is very difficult, owing to the limited availability of clinical data in publicly available 20 SARS-CoV-2 sequencing studies. In this regard and given the profound impact of COVID-19 pandemic, every effort should be made to share clinical information together with sequencing data, which is unfortunately rarely done. 16. Lavner, Y., Kotlar, D. (2005) . "Codon bias as a factor in regulating expression via translation rate in the human genome". Gene, 345(1) , 127-138. 17. Duret, L. (2000) . "tRNA gene number and codon usage in the C. elegans genome are coadapted for optimal translation of highly expressed genes". Trends in Genetics, 16(7 Authors declare that they have no competing interests. All data are available in the main text or the supplementary materials or can be downloaded from the original SARS-CoV-2 sequence repositories. Figs. S1 to S4 The p-value of the two-sided t-student test is reported for every dataset. A) Density plot of the VF-weighted log2-transformed RCU for the APOBEC (blue) and OTHER (orange) substitutions over the course of the disease. B) Surface plot representing the mean RCU of APOBEC and OTHER variants binned for VF and 5 SARS-CoV-2 genome position. The log2 mean binned variant count is reported as a blue-to-red color gradient. and Notes to the APOBEC group is presented with respect to collection date (binned every two weeks; grey lines represent the median values). The three datasets under study are presented in the different panels: (A) Dataset 1, (B) Dataset 2, and (C) Dataset 3. The p value of the one-sided Mann-Kendall trend test 20 We acknowledge that this research was partially supported by the Italian Ministry of University and Research (MIUR) -Department of Excellence project PREMIA (PREcision MedIcine Approach: bringing biomarker research to clinic)