key: cord-0910652-hidirfkv authors: Tort, Fernando L.; Castells, Matías; Cristina, Juan title: A COMPREHENSIVE ANALYSIS OF GENOME COMPOSITION AND CODON USAGE PATTERNS OF EMERGING CORONAVIRUSES date: 2020-04-12 journal: Virus Res DOI: 10.1016/j.virusres.2020.197976 sha: 7a9e47966415478924fcc17d876c7984ef246b0c doc_id: 910652 cord_uid: hidirfkv Abstract An outbreak of atypical pneumonia caused by a novel Betacoronavirus (βCoV), named SARS-CoV-2 has been declared a public health emergency of international concern by the World Health Organization. In order to gain insight into the emergence, evolution and adaptation of SARS-CoV-2 viruses, a comprehensive analysis of genome composition and codon usage of βCoV circulating in China was performed. A biased nucleotide composition was found for SARS-CoV-2 genome. This bias in genomic composition is reflected in its codon and amino acid usage patterns. The overall codon usage in SARS-CoV-2 is similar among themselves and slightly biased. Most of the highly frequent codons are A- and U-ending, which strongly suggests that mutational bias is the main force shaping codon usage in this virus. Significant differences in relative synonymous codon usage frequencies among SARS-CoV-2 and human cells were found. These differences are due to codon usage preferences. The family Coronaviridae consists of four genera, namely, Alphacoronavirus (αCoV), Betacoronavirus (βCoV), Gammacoronavirus (γCoV) and Deltacoronavirus (δCoV) (Chen et al., 2020) . Viruses from this family possess a single stranded, positive-sense RNA genome ranging from 26 to 32 kilobases in length (Su et al., 2016) . Coronaviruses (CoV) have been identified in several avian hosts, as well as in mammals, including humans, bats, civets, mice, dogs, cats, cows and camels (Clark, 1993; Cavanagh, 2007; Zhou et al., 2018) . Although several CoV are pathogenic to humans, most of them are associated with mild clinical symptoms (Su et al., 2016) . Nevertheless, two notable exceptions have been described: severe acute respiratory syndrome (SARS) coronavirus (SARS-CoV), a novel βCoV that emerged in southern China in 2002 (Peiris et al., 2004) and Middle East respiratory syndrome (MERS) coronavirus (MERS-CoV), which was first detected in Saudi Arabia in 2012 (Zaki et al., 2012) . Before the SARS epidemic, bats were not known to be hosts for CoVs. In the last 15 years, bats have been found to be hosts of more than 30 CoVs. Interactions among various bat species themselves, batanimal and bat-human interactions, such as the presence of live bats in wildlife wet markets in China, have been proved to be important for interspecies transmission of CoVs (Wong et al., 2019) . In fact, both SARS-CoV and MERS-CoV likely originated in bats, and genetically diverse CoVs that are related to these viruses were discovered in bats worldwide (Cui et al., 2019) . Previous studies in different bat species from China permitted the identification of at least 41 new βCoV, all of them were Rhinolophus spp. bat-CoV (Li et al., 2017) . At this moment, in Wuhan, the capital city of Hubei province of the People's Republic of China, an outbreak of atypical pneumonia caused by a novel coronavirus (SARS-CoV-2) is currently underway. The outbreak appears to have started from a zoonotic transmission at a market in Wuhan where animals and meat were sold (Chan et al., J o u r n a l P r e -p r o o f 2020a). As February 26th, 2020, there have been 77,780 cases of SARS-CoV-2 confirmed in China, including 2,666 deaths (WHO, 2020) . In addition, SARS-CoV-2 has been reported in 33 countries outside China, with 2,459 cases confirmed and 34 deaths (WHO, 2020) . Very recent studies revealed that SARS-CoV-2 can be considered a new human-infecting βCoV (Lu et al., 2020) . The World Health Organization declare this 2019-nCoV outbreak as a public health emergency of international concern on January 30 th , 2020 (WHO, 2020b) and the disease caused by this specific virus species have recently been designated as (WHO, 2020) . In this sense, the Coronavirus Study Group of the International Committee on Taxonomy of Viruses (ICTV), formally recognizes this virus as a relative to severe acute respiratory syndrome SARS-CoVs and designates it as severe acute respiratory syndrome coronavirus 2: SARS-CoV-2 (Gorbalenya et al., 2020) . In order to gain insight into the emergence, evolution, adaptation and spread of the SARS-CoV-2 viruses, a comprehensive analysis of genome composition and codon usage of βCoV circulating in China was performed. Available and comparable complete genome sequences of 81 βCoV strains isolated in China, including SARS-CoV-2 as well as βCoV isolated from different hosts, were obtained from GenBank database (available at: http://www.ncbi.nlm.nih.gov). For accession number, strain name, host and date of isolation, see Supplementary Material Table 1 . For each strain ORFs1a+ORF1b+S+E+N+M were concatenated. Sequences were aligned using the MACSE program (Ranwez et al., 2011) . MACSE algorithm is a useful tool for accommodating sequencing errors and other biological deviations from J o u r n a l P r e -p r o o f the coding frame (Ranwez et al., 2011) . The alignment is available upon request. The dataset comprised a total of 725,433 codons. Codon usage, amino acid usage, dinucleotide frequencies, base composition, the relative synonymous codon usage (RSCU) (Sharp and Lee, 1986) , total GC content, GC content in the third position of the codon (GC3s) and effective numbers of codons (ENC) were calculated using the program CodonW (written by John Peden) as implemented in the Galaxy server version 1.4.4 (Afgan et al., 2018) . The relationship between compositional variables and samples was obtained using multivariate statistical analyses. Principal component analysis (PCA) is a type of multivariate analysis that allows a dimensionality reduction. Singular Value Decomposition (SVD) method was used to calculate the principal components (PC). Unit variance was used as scaling method. By the same approach, Heatmaps were also constructed, which is a data matrix for visualizing values in the dataset by the use of a color gradient. This gives a good overview of the largest and smallest values in the matrix. Rows and/or columns of the matrix are clustered so that sets of rows or columns rather than individual ones can be interpreted. PCA and Heatmaps analysis were done using the ClustVis program (Metsalu et al., 2015) . The RSCU values of human cells were obtained from Kazusa database (available at: http://www.kazusa.or.jp/codon/). To study codon usage preferences in SARS-CoV-2 in relation to the codon usage of human cells, we employed the codon adaptation index (CAI) (Sharp & Li, 1987) . CAI was calculated using the approach of Puigbo et al. (2008a) (available at: http://genomes.urv.es/CAIcal) for human cells. This method allows to compare a given codon usage (in our case, SARS-CoV-2) to a predefined reference set (human). A J o u r n a l P r e -p r o o f statistically significant difference among CAI values was addressed applying a Wilcoxom & Mann-Whitney test (Wessa, 2012) . In order to discern if statistically significant differences in the CAI values arise from codon preferences, we used e-CAI (Puigbo et al., 2008b) to calculate the expected value of CAI (eCAI) at the 95 % confident interval. A Kolmogorov-Smirnov test for the expected CAI was also performed (Puigbo et al., 2008b) . In order to study the genetic composition of SARS-CoV-2 emerging in China, 81 ORFs sequences from βCOV strains isolated in China (including SARS-CoV-2 and SARS-CoV from humans and βCoV isolated from bats, civets and ferrets) were aligned and the nucleotide frequencies at third codon position, total GC content, GC content at the third codon position, ENC and dinucleotide frequencies were established for all strains ORFs and PCA was performed. The results of these studies are shown in Figure 1 . Heatmap analysis on genomic composition of all βCoV enrolled in these studies revealed a distinct genome composition of SARS-CoV-2 strains isolated in China, in relation to all other strains isolated from humans as well as bats, civets or ferrets (see To study if these differences in genomic composition can be observed at codon and amino acid usage, these variables were established for all βCoV strains enrolled in these studies and their relations were observed by Heatmap analysis. The results of these studies are shown in Figure 2 . Significant differences in codon and amino acid usage was observed among SARS-CoV-2 strains and all other βCoV strains included in these studies (see Fig. 2 ). Among β-CoV strains isolated from bats, a significant variation was also observed. Linkage analysis suggests a closer relation among SARS-CoV-2 and β- CoV isolated from bats, and a more distant relation with SARS-CoV and other β-CoV isolated from civet and ferret enrolled in these analyses (Fig. 2) . To gain insight into the genomic composition of SARS-CoV-2 strains, the nucleotide frequencies found for this virus where compared to the nucleotide frequencies found for other human CoVs. The results of these studies are shown in Fig.3 . Some general characteristics of CoVs were observed, since the U-frequencies is significantly above average, while the C-frequencies are below the expected frequencies. In the case of purines, A is preferred over G (Fig. 3) . As it can be seen in Fig. 3 , most variation occurs in the C/U and not the A/G section. Moreover, the results of these studies also revealed the presence of species-specific trends, since the C/U ratio differs profoundly per coronavirus type (see Fig.3 ). Comparison of SARS-CoV-2 with the other two recent zoonotic transmission to the human population (SARS and MERS-CoVs) reveals that SARS-CoV-2 is quite extreme with a C-count of 18.4 % (Fig. 3) . On the other hand, SARS-CoV-2 has the highest A-count among CoVs enrolled in these studies (29.9 %, mean A-count of 27.6 ± 1.1 for all human CoVs enrolled in these studies) (see Fig. 3 ). To study the extent of codon usage bias in SARS-CoV-2, the ENC's values were calculated for all SARS-CoV-2 strains enrolled in these studies. A mean value of 48.54 ± 2.34 was obtained. Due to the fact that all values obtained were >40, the results of these studies suggest that the overall codon usage among SARS-CoV-2 is similar among themselves and slightly biased. Mean ENC values of 49.11 ± 0.02, 49.66 ± 0.52 and 49.21 ± 0.05 were found for SARS and βCoV isolated from bats and civets, respectively. ENC quantifies how far a codon usage departs from equal usage of J o u r n a l P r e -p r o o f synonymous codons and is a measure of codon usage biases in genomes that ranges from 20 (maximal bias) to 61 (unbiased) (Wright, 1990) . Although βCoV ENC values are roughly similar, in the case of SARS-CoV-2 strains we observed a range of ENC values from 45.18 to 50.09. Since codon usage by its very nature is multivariate, it is necessary to analyze the data using different and complementary approaches. An ENC-GC3S plot (ENC plotted against GC content at the third codon position) can be used as a method that quantifies how far the codon usage of a gene departs from equal usage of synonymous codons (Wrigth, 1990) . If GC3S is the only determinant factor shaping the codon usage pattern, the values of ENC would fall on a continuous curve, which represents random codon usage (Jiang et al., 2007) . If G+C compositional constraint influences the codon usage, then the GC3S and ENC correlated spots would lie on or below the expected curve (Tsai et al., 2007) . When the ENC-GC3S plot was constructed with values obtained for all 81 βCoV strains enrolled in this analysis (including SARS-CoV-2 strains), all spots lie below the expected curve, indicating that G+C compositional constraints may play a role in all βCoVs codon usage (see Fig. 4 ). In order to compare the codon usage preferences of SARS-CoV-2 with those of human cells, the RSCU values of the codons in SARS-CoV-2 ORFs were calculated and compared with those of human. The results of these studies are shown in Table 1 . , and GGU (Gly). As it can be seen, most of the highly preferred codons are A and U-ending, while most of the highly underrepresented codons are C and G-endind codons, particularly CG containing codons. These results strongly suggest that mutational bias is a main force shaping codon usage in SARS-CoV-2 (see Table 1 ). In fact, when the occurrences of dinucleotides are established, they are not random and no dinucleotide is present at the expected frequencies (see Supplementary Material Table 2 (Table 1) . Highly underrepresented codons ACG (Thr), GCG (Ala) and GGG (Gly) were observed in all three viruses in relation to human cells (Table 1) . On the other hand, significant differences in frequencies of two glutamine codons (GAG and GAA) were observed among SARS-CoV-2 and SARS and MERS-CoVs (Table 1) CoVs (1.04 and 1.05, respectively). These results suggest that although a general codon usage pattern among human βCoV can be found, each virus may evolve to a unique codon usage in its adaptation to the host'cells. In order to compare the codon usage preferences of SARS-CoV-2 with those of humans, CAI values for all triplets were calculated, using human codon usage as reference set. The results of these studies are shown in Table 2. CAI index ranges from 0 to 1, being 1 if the frequency of codon usage by SARS-CoV-2 equals the frequency of usage of the reference set. A mean value of 0.710 was obtained for SARS-CoV-2 genes in relation to human (see Table 2 ). To evaluate if the differences were statistically significant, we performed a Wilcoxon & Mann-Whitney test. The results of this tests revealed that the differences in CAI values were statistically significant (T = 256, p-value < 0.001). To discern if the statistically significant differences in CAI values arise from codon preferences (Puigbo et al., 2008a) , the expected CAI (e-CAI) values were calculated for SARS-CoV-2 sequences in relation to human codon usage reference set. The e-CAI algorithm (Puigbo et al., 2008b) generated 500 random sequences with the same nucleotide and amino acid composition as the sequences of interest (in this case SARS-CoV-2 sequences). Then, we calculated the CAI values for all of them, and applied a Kolmogorov-Smirnov test for the e-CAI of these random sequences in order to show whether the generated sequences follow a normal distribution. The results of these studies revealed an e-CAI value of 0.719 (p < 0.05). Kolmogorov-Smirnov test revealed a normal distribution of the generated sequences (Kolmogorov-Smirnov test of e-CAI value of 0.028, which is below critical value of 0.061). Taking all these results together, our studies revealed that the CAI values for SARS-CoV-2 genes are different from the CAI values obtained from human cells. Again, these results suggest that these differences are related to codon usage preferences. On January 30 th 2020, the World Health Organization declared the current SARS-CoV-2 outbreak a public health emergency of international concern (WHO, 2020). To gain insight into the biology and evolution of emerging SARS-CoV-2, a comprehensive analysis of genome composition, codon and amino acid usage of βCoV strains isolated in China from humans, bats, civets and ferret hosts was performed, including SARS-CoV-2 strains recently isolated from current outbreak. The results of these studies revealed that SARS-CoV-2 strains enrolled in these analyses have a distinct genome composition in relation to other βCoV strains isolated from human (SARS-CoV), bats, civets and ferrets (see Fig. 1 ). This is in agreement with very recent studies revealing that SARS-CoV-2 is sufficiently divergent from SARS-CoV to be considered a new human-infecting βCoV (Lu et al., 2020; Zhu et al., 2020) . This distinct genomic composition is also reflected in its codon and amino acid usage patterns (see Fig. 2 ). Moreover, correlation distances and average linkage suggests a closer relation among SARS-CoV-2 and bats βCoV isolated in China and a more distant relation to SARS-CoV or SARS-like βCoV isolated from civets or ferrets (see Fig. 2 ). This is in agreement with recent results suggesting that bats might be the original host of this virus, an animal sold at the seafood market in Wuhan (Lu et al., 2020; Chan et al., 2020b) . This speaks of the importance of bats as a reservoir of potential emerging CoV. Moreover, significant numbers of new βCoV have been discovered in Chinese bats species, particularly Rhinolophus affinis (Li et al., 2017) . Interestingly, a clear degree of variation in codon and amino acid usage was observed among βCoV isolated J o u r n a l P r e -p r o o f from bats included in these studies (see Fig. 2 ). Nevertheless, bats might represent an intermediate host facilitating the emergence of the SARS-CoV-2 in humans (Wong et al., 2019) . More studies will be needed to address this important issue. In these studies, a biased nucleotide composition was found for SARS-CoV-2 genome (Fig. 3) . This bias can also have a major influence on derived parameters, as previously demonstrated for other CoVs (Berkhout and Van Hemert, 2015) . This is in agreement with the results of this work, since the nucleotide composition of SARS-CoV-2 has a strong influence in the codons that are used by this virus for the translation of its RNA genome. Previous studies on codon usage in RNA viruses have shown that mutational pressure is the major factor in shaping codon usage patterns in comparison with natural selection (Jenkins and Holmes, 2003; Wang et al., 2011) . Although mutational pressure is still a major driving force, it is certainly not the only evolutionary force that might be considered in RNA viruses. In these studies, a significant A genomic content was found in SARS-CoV-2 in comparison with other human CoVs (Fig. 3) . Previous studies done in Human Immunodeficiency virus (HIV) found that this virus has an A-rich genome and this property has been proposed to help the virus to avoid recognition by the innate immune system (Vabret et al.,2012) . This could provide a strong selective pressure on retroviruses as well as many others RNA viruses, including CoVs (Kindler and Thiel, 2014; van Hemert et al., 2014) . Moreover, previous studies done in HIV suggest that an RNA genome with A-rich domains may provide a molecular signature that is recognized during virus replication (van Hemert et al., 2013) . In the context of CoV, A-rich regions, like the Transcription Regulation Sequence (TRS), which is involved in transcription of sub-genomics RNAs have been established (Pyrc et al., 2004) . This suggest that nucleotide bias may serve distinct biological function in SARS-CoV-2 as J o u r n a l P r e -p r o o f well as in other CoVs and is in direct relation to the characteristic codon usage of these viruses (Berkhout and Van Hemert, 2015) . A mean ENC value of 48.54 ± 2.34 was obtained for SARS-CoV-2 strains enrolled in these studies, suggesting that the overall codon usage among SARS-CoV-2 is similar among themselves and slightly biased. This is in agreement with mean ENC values obtained for other CoV, like SARS-CoV (ENC = 48.99) (Gu et al., 2004) ; Bovine Coronavirus (BCoV) (ENC = 43.78) (Castells et al., 2017) ; MERS-CoV (ENC = 55.50) (Alnazawi et al., 2017) or Avian CoV (ENC = 51.33) (NSP2; Brandao, 2013) . ENC-GC3s plot of the values obtained for SARS-CoV-2 revealed that all spots cluster below the expected curve, suggesting that G+C compositional constraints play a role in SARS-CoV-2 (see Fig. 3 ). In these studies, significant differences in RSCU frequencies among SARS-CoV-2 and human cells were found (Table 1) . A strong bias toward A and U ending codons was found. This in agreement with very recent studies revealing a significant predominance of A and U at third codon positions in CoV genomes (Sheikh et al., 2020; Wang et al., 2018; Kandeel et al., 2020) . These results also suggest that these differences are related to codon usage preferences. Previous studies have shown that both cytosine deamination and selection of CpG-suppressed clones are the major factors that shape codon bias in CoV genomes (Woo et al., 2007) . MERS-CoV codon usage revealed a bias among hydrophobic amino acids, being CCG (Pro) and GUU (Val) the least and most frequently used codons (Chen et al., 2017) . The same results were found in this work for SARS-CoV-2 ORFs, since CCG y GUU resulted to be the least and most frequently hydrophobic codons (11 and 352 times, respectively). Regarding hydrophilic amino acids, the least and most frequently used codons in MERS-CoV were CGG (Arg) and GAU (Asp), respectively (Chen et al., J o u r n a l P r e -p r o o f 2017). These codons were also the least and most frequently hydrophilic codons used in SARS-CoV-2 ORF's (11 and 310 times, respectively). Similarly, CpG and UpU dinucleotide frequencies resulted to be the lowest and highest frequencies found in MERS-CoV (Chen et al., 2017) . The same results were obtained in these studies on SARS-CoV-2, since CpG and UpU were the lowest and highest dinucleotide frequencies found (0.22 and 1.96, respectively) (see Supplementary Material Table 2 ). Again, these analyses revealed that genomic composition affects the codon usage pattern in both, MERS-CoV and SARS-CoV-2 viruses. The results of these studies revealed that SARS-CoV-2 strains enrolled in these analyses have a distinct genome composition in relation to other βCoV strains. This distinct genomic composition is also reflected in its codon and amino acid usage patterns. Most of the highly frequent codons are A-and U-ending, which strongly suggests that mutational bias is the main force shaping codon usage in this virus. Significant differences in RSCU frequencies among SARS-CoV-2 and human cells were found. These differences are due to codon usage preferences. Comparative genomic analysis MERS CoV isolated from humans and camels with special reference to virus encoded helicase The Galaxy platform for accessible, reproducible and collaborative biomedical analyses On the biased nucleotide composition of the human coronavirus RNA genome The evolution of codon usage in structural and non-structural viral genes: the case of Avian coronavirus and its natural host Gallus gallus Genome-wide analysis of codon usage bias in Bovine Coronavirus Coronavirus avian infectious bronchitis virus A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster Genomic characterization of the 2019 Microbes & Infec Emerging coronaviruses: genome structure, replication, and pathogenesis Analysis of the codon usage pattern in Middle East Respiratory Syndrome Coronavirus Bovine coronavirus Spread, circulation, and evolution of the Middle East respiratory syndrome coronavirus Origin and evolution of pathogenic coronaviruses Severe acute respiratory syndrome-related coronavirus: The species and its viruses -a statement of the Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales The extent of codon usage bias in human RNAviruses and its evolutionary origin From SARS and MERS CoVs to SARS-CoV-2, moving toward more biased codon usage in viral structural and nonstructural genes To sense or not to sense viral RNA-essentials of coronavirusinnate immune evasion MEGA-X: Molecular evolutionary genetics analysis across computing platforms Extensive diversity of coronaviruses in bats from China Severe acute respiratory syndrome CAIcal: a combined set of tools to assess codon usage adaptation E-CAI: a novel server to estimate an expected value of Codon Adaptation Index (eCAI) Genome structure and transcriptional regulation of human coronavirus NL63 Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Clustvis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap Posterior summarization in Bayesian phylogenetics using Tracer MACSE: Multiple alignment of coding sequences accounting for frameshifts and stop codons The codon adaptation index -a measure of directional synonymous codon usage bias, and its potential applications Analysis of preferred codon usage in the coronavirus N genes and their implications for genome evolution and vaccine design Epidemiology, genetic recombination, and pathogenesis of coronaviruses Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10 Virus Evol. 4,vey016 Analysis of codon usage bias and base compositional constraints in iridovirus genomes The biased nucleotide composition of HIV-1 triggers type I interferon response and correlates with subtype D increased pathogenicity On the nucleotide compositionand structure of retroviral RNA genomes The A-nucleotide preference of HIV-1 in the context of its structured RNA genome Free Statistics Software, Office for Research Development and Education, version 1.1.23-r7 Analysis of codon usage in bovine viral diarrhea virus Global Epidemiology of Bat Coronaviruses World Health Organization (2020) Coronavirus disease (COVID-19) outbreak. Situation report -36 Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV) Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study The "effective number of codons" used in a gene Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak A pneumonia outbreak associated with a new coronavirus of probable bat origin Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin A novel Coronavirus from patients with pneumonia in China a RSCU, relative synonymous codon usage Cod, codons; HC, human cells; CoV-2, SARS-CoV-2 Highly increased codons in SARS-CoV-2 with respect to human (∆≥0.30) are shown underlined and in Codon adaptation of SARS-CoV-2 genes in relation to human codon usage, displayed as CAI a values CAI-Hs; codon adaptation index in relation to H. sapiens reference codon usage set. In all cases, mean ± standard deviation values are shown. NA We acknowledge Drs. Pilar Moreno, Gonzalo Moratorio and Rodney Colina for critical reading of this work.