key: cord-1040032-kiv7egfv authors: Tyagi, Neetu; Sardar, Rahila; Gupta, Dinesh title: Comparative analysis of codon usage patterns in SARS-CoV-2, its mutants and other respiratory viruses date: 2021-03-03 journal: bioRxiv DOI: 10.1101/2021.03.03.433699 sha: deb25b647e2b11a44e150dfb9dcf862431b78f00 doc_id: 1040032 cord_uid: kiv7egfv
The Coronavirus disease 2019 (COVID-19) outbreak caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) poses a worldwide human health crisis, causing respiratory illness with a high mortality rate. To investigate the factors governing codon usage bias in respiratory viruses, we performed codon usage bias (CUBs) analysis on SARS-CoV-2 isolates from different geographical locations (~62K), including two recently emerged strains, VUI202012/01 from the United Kingdom (UK) and 501Y.V2 from South Africa (SA), and on other respiratory viruses. The analysis includes RSCU analysis, GC content calculation, ENC analysis, dinucleotide frequency analysis and neutrality plot analysis. The study has two primary aims: first, to identify differences in codon usage bias amongst all SARS-CoV-2 genomes and, second, to compare their CUBs properties with those of other respiratory viruses. A biased nucleotide composition was found, as most of the highly preferred codons were A/U-ending in all the respiratory viruses studied here. Compared with the human host, the RSCU analysis identified 11 over-represented codons and 9 under-represented codons in SARS-CoV-2 genomes. Correlation analysis of ENC and GC3s revealed that mutational pressure is the leading force determining the CUBs. The results of the present study yield a better understanding of codon usage preferences in SARS-CoV-2 genomes and point to the possible evolutionary determinants responsible for the biases found among the respiratory viruses, thereby revealing a unique feature of SARS-CoV-2 evolution and adaptation.
To the best of our knowledge, this is the first attempt at comparative CUBs analysis on the worldwide genomes of SARS-CoV-2, including newly emerged strains and other respiratory viruses.
A phylogenetic tree of the complete genome sequences was constructed with MEGA 6.0 software [47], using the Neighbor-joining (NJ) algorithm and the Kimura 2-parameter model with 1000 bootstrap replicates.
To investigate the factors affecting the synonymous codon usage bias, relative synonymous codon usage (RSCU) values were calculated using CodonW software (http://codonw.sourceforge.net/). The RSCU values were calculated using the formula:
$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$
where X_{ij} represents the number of occurrences of the j-th codon for the i-th amino acid and n_i represents the number of synonymous codons (the degeneracy) for that amino acid, ranging from 1 to 6. RSCU is the ratio of the observed codon count to the count expected under equal usage of all synonymous codons for a given amino acid, and its value is not affected by the length of the sequence or by amino acid frequency [34]. A higher RSCU value (RSCU>1) indicates positive codon bias and marks a preferred codon, whereas a lower RSCU value (RSCU<1) represents negative codon bias and marks an under-preferred codon. The RSCU values of all respiratory viruses were compared with those of the host (H. sapiens) and visualised with a heatmap in R.
To further investigate the synonymous codon usage pattern, the ENC-plot was generated by plotting the effective number of codons (ENC) values against the GC3 values. The ENC measures the deviation from a random codon usage pattern; its value ranges from 20 to 61. A lower ENC value (<35) corresponds to a strong codon usage bias, whereas higher ENC values (>35) represent low codon bias [48]. The standard ENC values were calculated using the formula:
$$\mathrm{ENC}_{\mathrm{expected}} = 2 + S + \frac{29}{S^{2} + (1 - S)^{2}}$$
where S represents the given GC3s value.
Codon usage disparity is governed mainly by two important factors, mutation pressure and natural selection.
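As an illustration of the RSCU definition above, the following is a minimal Python sketch, not the CodonW implementation used in the study. The function name `rscu` and the example sequence are hypothetical, and the sketch assumes the standard genetic code and an in-frame coding sequence. Unlike some tools, it also reports the single-codon amino acids (Met, Trp), whose RSCU is always 1.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order (assumption: no alternative code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for an in-frame coding sequence (DNA or RNA)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    # Count only unambiguous codons; codons containing N etc. are skipped.
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families (stop codons excluded).
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for members in families.values():
        total = sum(counts[c] for c in members)
        if total == 0:
            continue  # amino acid absent: RSCU undefined, so omit it
        n_i = len(members)  # degeneracy of this amino acid
        for c in members:
            # X_ij divided by the mean count over the n_i synonymous codons
            values[c] = counts[c] * n_i / total
    return values


if __name__ == "__main__":
    example = "CTGCTGCTGTTA"  # hypothetical toy sequence: Leu x4
    for codon, value in sorted(rscu(example).items()):
        print(codon, round(value, 2))
```

Within each amino acid family the RSCU values sum to the degeneracy n_i, which is a convenient sanity check on any implementation.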
In the neutrality plot analysis, the main factors affecting the CUBs were determined by plotting the mean GC content at the 1st and 2nd codon positions (GC12, x-axis) against the GC content at the 3rd codon position (GC3, y-axis), with both values calculated by CodonW. Plotting GC12 against GC3 helps analyse the correlation between the base compositions at the three codon sites, and thus helps determine the main factor responsible for the codon usage bias. The slope of the regression line indicates the effect of mutational pressure [49].
The dinucleotide frequency analysis, performed using DAMBE, is another way of relating genome composition to codon usage bias. The relative abundance of each dinucleotide was determined by the odds ratio, defined as the ratio of observed to expected dinucleotide frequencies. An odds ratio >1.23 was considered over-represented, whereas a value <0.78 was considered under-represented [50].
We collected 61,962 SARS-CoV-2 whole-genome sequences submitted to GISAID until April 2020, consisting of 13,280, 15,506, 20,779, 2,660, 2,615, 4,183, and 2,939 sequences of the G, GH, GR, L, O, S and V clades, respectively. We also collected the recently reported SARS-CoV-2 variant sequences for VUI202012/01 (n = 528) and 501Y.V2 (n = 184) from the UK and SA, respectively.
As expected, the phylogenetic tree indicates that the newly emerged SARS-CoV-2 is evolutionarily closer to the SARS-CoV and MERS viruses. The phylogenetic tree also shows the SARS-CoV-2 genome to be relatively distant from the influenza A virus strains (H1N1 and H3N2), which clustered with the Respiratory Syncytial Virus (RSV), forming a separate clade (Figure 1a).
We calculated the nucleotide composition of all the respiratory virus genomes studied here. The nucleotide composition of SARS-CoV-2, SARS-CoV, and MERS was found to be similar, in the order U>A>G>C, whereas RSV follows the order A>U>G>C, and H1N1 and H3N2 follow the order A>G>U>C.
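The dinucleotide odds ratio described in the methods above can be sketched in a few lines of Python. This is an illustration only, not the DAMBE implementation used in the study. It assumes that the expected frequency of a dinucleotide XY is the product of the mononucleotide frequencies of X and Y, which is one common convention; the text itself only states "observed over expected". The function names and the test sequence are hypothetical, while the thresholds 1.23 and 0.78 are those given in the text.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: observed/expected} for a nucleotide sequence.

    Assumption: expected frequency of XY = f(X) * f(Y).
    Ambiguous bases (e.g. N) count towards the sequence length but are not
    reported as dinucleotides.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))  # overlapping pairs

    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            fx, fy = mono[x] / n, mono[y] / n
            if fx == 0 or fy == 0:
                continue  # ratio undefined when a base is absent
            fxy = di[x + y] / (n - 1)
            ratios[x + y] = fxy / (fx * fy)
    return ratios


def classify(ratio, high=1.23, low=0.78):
    """Label an odds ratio using the thresholds quoted in the text."""
    if ratio > high:
        return "over-represented"
    if ratio < low:
        return "under-represented"
    return "within expected range"


if __name__ == "__main__":
    toy = "ACGT" * 25  # hypothetical toy sequence, not study data
    for pair, rho in sorted(dinucleotide_odds_ratios(toy).items()):
        print(pair, round(rho, 3), classify(rho))
```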
We found that the nucleotides at the 3rd codon position follow the trend U3>A3>C3>G3 for all viruses except SARS-CoV and MERS, for which the trend is U3>A3>G3>C3 (see S. Table 1). The average GC content of all the respiratory viruses is 0.40, with a standard deviation of 0.02. The codon adaptation index (CAI) of all respiratory viruses ranges from 0.68 to 0.72. A higher CAI value indicates better adaptation of the virus to the human host.
CUB exists in many RNA viral genomes and is generally determined by mutation and selection pressure. RSCU analysis was performed to investigate the variation in codon usage bias across all the studied virus genomes. With respect to the H. sapiens host, we found 11 codons that are over-represented (preferred) in the SARS-CoV-2 clades, i.e., UUA, UUG, AGU, CCU, AUU, GUU, GCU, AGA, CAA, GAA and GGU. Of these, 6 end with U, 4 with A and 1 with G. The 9 under-represented or randomly used codons were CUC, CUG, UCC, CCC, AUA, GUG, GCC, CAG and GGG, comprising 4 G-ending, 4 C-ending, and 1 A-ending codons (see S. Table 2). Most of the identified codons have previously been reported as preferred codons [51]. From the analysis, we found that all the viruses studied here are highly biased towards A/U-ending codons. In contrast, the under-represented codons were found to be C/G-ending, as reported in recent studies conducted on SARS-CoV-2 genomes.
No significant difference in CUB pattern was observed among the SARS-CoV-2 clades. Moreover, a significantly different pattern was observed in all the SARS-CoV-2 clades with reference to the human genome. For a few codons, RSV and the influenza A viruses show a slight variation in CUB compared with the other respiratory viruses, as shown in S. Table 2. The phylogenetic analysis also supports this pattern, as these three viruses (RSV, H1N1 and H3N2) clustered together, forming a separate clade.
In summary, our results suggest that the codon usage pattern is highly similar across the SARS-CoV-2 clades, including the newly discovered variants from the UK and SA. The nucleotide composition analysis revealed that in SARS-CoV-2 the composition follows the order U>A>G>C. We also observed that U and A occurred most frequently at the third codon position, confirming a previous finding [43]. All the viruses included in this study show a higher frequency of A and U nucleotides, suggesting that the bias in the genome composition of respiratory viruses is also reflected in their codon usage patterns.
We next sought to determine the factors involved in shaping the CUBs. Previous studies suggest that codon usage bias is affected by several factors, of which the two most widely accepted are mutation pressure and natural selection [55]. Other influencing factors include GC content, hydrophobicity, and GC3. As stated earlier, we found the A3 and U3 frequencies to be higher than those of G3 and C3, suggesting a contribution of mutational force in shaping codon usage among respiratory viruses. To further investigate the contribution of mutational force, an ENC-plot was generated, in which the ENC value is plotted against GC3. If the codon usage bias were affected only by the GC3 value, all the points would lie exactly on the standard curve. In the ENC-GC3 plot, all the points clustered together just below the standard curve, indicating that G+C composition plays an important role in shaping the codon usage bias in all the respiratory viruses under study. The points below the curve also indicate that some independent factors other than mutational force, such as natural selection, might also play a role in shaping the CUBs in respiratory viruses [51]. Furthermore, to examine the relative contribution of the forces shaping codon bias, the neutrality plot was generated.
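The standard curve referred to in the ENC-GC3 discussion above is the ENC expected when codon usage is constrained only by GC3s composition. The short Python sketch below evaluates that curve using the commonly cited expression 2 + S + 29 / (S^2 + (1 - S)^2), and measures how far an observed point lies from it. The function names are hypothetical, and the numbers in the usage example are illustrative placeholders, not values from this study.

```python
def expected_enc(gc3s):
    """Expected ENC for a given GC3s fraction (0..1) under GC3-only constraint."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def enc_deviation(observed_enc, gc3s):
    """Observed minus expected ENC; negative values lie below the curve."""
    return observed_enc - expected_enc(gc3s)


if __name__ == "__main__":
    # Trace the standard curve at a few GC3s values.
    for s in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"GC3s={s:.2f}  expected ENC={expected_enc(s):.2f}")

    # Illustrative placeholder point (not study data).
    obs_enc, obs_gc3s = 45.0, 0.30
    print("deviation from curve:", round(enc_deviation(obs_enc, obs_gc3s), 2))
```

A point with a negative deviation lies below the standard curve, which is the situation the text interprets as evidence that factors other than GC3 composition also contribute.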
In the plot,
To the best of our knowledge, this is the first attempt at comparative CUBs analysis on worldwide SARS-CoV-2 genomes, including the newly emerged strains and other respiratory viruses. We conclude that the reported SARS-CoV-2 mutations have no significant impact on codon usage preferences. In summary, the codon usage bias was found to be highly similar and relatively low in all the respiratory viruses. The study also indicates that mutational pressure and natural selection are the leading forces determining the codon usage bias amongst all the respiratory viruses.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses
Emerging coronaviruses: Genome structure, replication, and pathogenesis
Origin and evolution of pathogenic coronaviruses
A decade after SARS: Strategies for controlling emerging coronaviruses
A Novel Coronavirus from Patients with Pneumonia in China
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Cross-species transmission of the newly identified coronavirus 2019-nCoV
Coronavirus diversity, phylogeny and interspecies jumping
Review Bovine Coronavirus
Coronavirus avian infectious bronchitis virus
Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin
Role of respiratory viruses in acute upper and lower respiratory tract illness in the first year of life: A birth cohort study
Comparative Review of SARS-CoV-2, SARS-CoV, MERS-CoV, and Influenza A Respiratory Viruses
Phylogenetic network analysis of SARS-CoV-2 genomes
Genomic variance of the 2019-nCoV coronavirus
Genomics Genome-wide codon usage pattern analysis reveals the correlation between codon usage bias and gene expression in Cuscuta australis
Distinct viral clades of SARS-CoV-2: Implications for modeling of viral spread
Geographic and Genomic Distribution of SARS-CoV-2
Houriiyah Tegally Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa
Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: A proposal for a synonymous codon choice that is optimal for the E. coli translational system
Temperature influences synonymous codon and amino acid usage biases in the phages infecting extremely thermophilic prokaryotes
Synonymous codon usage pattern in model legume Medicago truncatula
Genome-wide analysis of codon usage bias in four sequenced cotton species
Codon usage bias and its influencing factors for Y-linked genes in human
Multi-omics Data Integration, Interpretation, and Its Application
Multivariate analyses of codon usage of SARS-CoV-2 and other betacoronaviruses
Characterization of codon usage pattern in SARS-CoV-2
Analysis of codon usage of severe acute respiratory syndrome corona virus 2 (SARS-CoV-2) and its adaptability in dog
Global initiative on sharing all influenza data - from vision to reality
Molecular Evolutionary Genetics Analysis Version 6.0
Homo sapiens; Saccharomyces cerevisiae; Escherichia coli; Bacillus subtilis; Dictyostelium discoideum; Drosophila melanogaster)
Directional mutation pressure and neutral molecular evolution
Analysis of base and codon usage by rubella virus
A comprehensive analysis of genome composition and codon usage patterns of emerging coronaviruses
Codon usage and phenotypic divergences of SARS-CoV-2 genes
Synonymous but not the same: the causes and consequences of codon bias
Codon usage pattern and its influencing factors in different genomes of hepadnaviruses
Revelation of Influencing Factors in Overall Codon Usage Bias of Equine Influenza Viruses
Analysis of Nipah Virus Codon Usage and Adaptation to Hosts