key: cord-0708967-3mj670c2
authors: Gu, Wanjun; Zhou, Tong; Ma, Jianmin; Sun, Xiao; Lu, Zuhong
title: Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales
date: 2004-02-25
journal: Virus Res
DOI: 10.1016/j.virusres.2004.01.006
sha: 51d7f503fd0210340c4af70eac1f146c86ef36c1
doc_id: 708967
cord_uid: 3mj670c2

In this study, we calculated the codon usage bias in severe acute respiratory syndrome Coronavirus (SARSCoV) and performed a comparative analysis of synonymous codon usage patterns in SARSCoV and 10 other evolutionarily related viruses in the order Nidovirales. Although codon usage bias varies significantly among SARSCoV genes, the overall bias in SARSCoV is slight and is mainly determined by base composition at the third codon position. By comparing synonymous codon usage patterns in different viruses, we observed that the pattern in these virus genes is virus specific and phylogenetically conserved, but not host specific. Phylogenetic analysis based on codon usage pattern suggested that SARSCoV has diverged far from all three known groups of Coronavirus. Compositional constraints could explain most of the variation in synonymous codon usage among these virus genes, while gene function is also correlated with synonymous codon usage to a certain extent. However, translational selection and gene length have no effect on the variation of synonymous codon usage in these virus genes.

Synonymous codons are not used equally, either within or between genomes (Grantham et al., 1980; Martin et al., 1989; Lloyd and Sharp, 1992). Compositional constraints and natural selection are thought to be the two main factors accounting for codon usage variation among genes in different organisms (Karlin and Mrazek, 1996; Lesnik et al., 2000). The diverse patterns of codon usage in mammals may arise from compositional constraints of the genomes (Karlin and Mrazek, 1996; Francino and Ochman, 1999; Majumdar et al., 1999; Ghosh et al., 2000). In contrast, in some unicellular organisms, such as Escherichia coli and Saccharomyces cerevisiae, highly expressed genes have a strong selective preference for codons with a high concentration of the corresponding acceptor tRNA molecule, whereas lowly expressed genes display a more uniform pattern of codon usage (Gouy and Gautier, 1982; Grantham et al., 1981; Ikemura, 1981, 1985; Lesnik et al., 2000). Moreover, mutational pressure rather than translational selection is the most important determinant of codon bias in some human RNA viruses (Levin and Whittome, 2000; Jenkins et al., 2001; Jenkins and Holmes, 2003). Furthermore, replicational and transcriptional selection is responsible for the codon usage variation among the genes of Borrelia burgdorferi (McInerney, 1998). In other studies, codon usage was also found to be related to gene function (Chiapello et al., 1998; Epstein et al., 2000; Ma et al., 2002), protein secondary structure (Chiusano et al., 1999, 2000; Oresic and Shalloway, 1998; Xie and Ding, 1998; Gupta et al., 2000), cellular location of gene products (Chiapello et al., 1999) and gene length (Coghlan and Wolfe, 2000; Marais and Duret, 2001; Moriyama and Powell, 1998).

Severe acute respiratory syndrome (SARS) is a respiratory disease that was recently reported in Asia, North America and Europe (Chan-Yeung and Yu, 2003; Drazen, 2003). Although the genome sequence of severe acute respiratory syndrome Coronavirus (SARSCoV) has been published and many studies have been performed on SARSCoV in recent months (Paul et al., 2003; Qin et al., 2003; Marra et al., 2003; Snijder et al., 2003), little genomic analysis is available for this virus. Codon usage data of SARSCoV might give some clues to the features of the SARSCoV genome and some evolutionary information about this virus. Here, we analyzed the codon usage data of this virus and of other viruses in the order Nidovirales. The key evolutionary determinants of codon usage bias in these viruses were also investigated.

Abbreviations: bp, base pair; SARSCoV, severe acute respiratory syndrome Coronavirus; RSCU, relative synonymous codon usage; ENC, effective number of codons; CA, correspondence analysis; GC3S, the frequency of G+C at the synonymous third position of sense codons; A3S, T3S, G3S and C3S, the adenine, thymine, guanine and cytosine content at synonymous third positions; ORF, open reading frame; PCR, polymerase chain reaction; S.D., standard deviation.

Corresponding author: Z. Lu. Tel.: +86-25-83619983; fax: +86-25-83619983. E-mail address: zhlu@seu.edu.cn.

SARSCoV is a large, enveloped, positive-stranded RNA virus, which belongs to the order Nidovirales, family Coronaviridae, genus Coronavirus in virus taxonomy (Marra et al., 2003). The complete genome and coding sequences of the SARSCoV TOR2 isolate were obtained from GenBank (Version 134.0). To keep the estimates of codon usage bias statistically meaningful, only sequences longer than 150 bp were analyzed (Table 1). To compare the codon usage pattern among different viruses, coding genes of 10 other viruses belonging to the order Nidovirales (six viruses in the genus Coronavirus, four viruses in the genus Arterivirus) were also parsed from GenBank (Version 134.0) (Table 2).

Table footnote b: f1 and f2, respectively, represent the first axis mean value and the second axis mean value in CA of each genome.

Relative synonymous codon usage (RSCU) values of each codon in a gene were used to examine synonymous codon usage without the confounding influence of amino acid composition (Sharp and Li, 1986). N3S, the frequency of base N at synonymous third codon positions, was also used to calculate the extent of base composition bias. Additionally, the effective number of codons of a gene (ENC) was used to quantify the codon usage bias of a gene (Wright, 1990); it is the best overall estimator of absolute synonymous codon usage bias (Comeron and Aguade, 1998). The ENC value ranges from 20 (when only one codon is used per amino acid) to 61 (when all synonymous codons are used equally for each amino acid).
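As an illustration, ENC can be computed from codon counts following the formulation of Wright (1990): a homozygosity F = (n Σp² − 1)/(n − 1) is calculated for each amino acid, the F values are averaged within each degeneracy class, and the averages are combined as ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The Python sketch below is not the program used in this study. The function name, the use of the standard genetic code, and the handling of short sequences (skipping amino acids observed fewer than twice or with non-positive F, and raising an error when a degeneracy class is empty) are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def effective_number_of_codons(seq):
    """Return ENC (Wright, 1990) for an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # Group sense codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    # Homozygosity F for each amino acid, collected by degeneracy class.
    homozygosity = defaultdict(list)
    for family in families.values():
        k = len(family)
        n = sum(counts[c] for c in family)
        if k == 1 or n < 2:  # Met/Trp, or too few observations (assumption)
            continue
        f = (n * sum((counts[c] / n) ** 2 for c in family) - 1) / (n - 1)
        if f > 0:  # non-positive F is skipped (assumption)
            homozygosity[k].append(f)

    mean_f = {k: sum(v) / len(v) for k, v in homozygosity.items()}
    # If isoleucine is unusable, F3 is taken as the mean of F2 and F4.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if any(k not in mean_f for k in (2, 3, 4, 6)):
        raise ValueError("sequence too short to estimate ENC")

    enc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(enc, 61.0)  # values above 61 are set to 61
```

A sequence that uses a single codon for every amino acid gives F = 1 in every class and therefore ENC = 20, the lower bound quoted above.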

Correspondence analysis (CA) was used to investigate the major trends in codon usage variation among genes. Each gene is represented as a 59-dimensional vector, in which each dimension corresponds to the RSCU value of one sense codon (AUG and UGG are excluded, as are the three stop codons).
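The 59-dimensional representation can be sketched as follows. RSCU is taken here as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally (Sharp and Li, 1986). The function names, the alphabetical ordering of the codons, and the choice to assign 0.0 to the codons of an amino acid that does not occur in the gene are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# Synonymous families: amino acid -> list of its codons (stops excluded).
FAMILIES = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        FAMILIES[_aa].append(_codon)


def rscu(seq):
    """Return {codon: RSCU} for the 59 codons that have synonymous alternatives."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for family in FAMILIES.values():
        if len(family) == 1:  # ATG (Met) and TGG (Trp) have no synonyms
            continue
        total = sum(counts[c] for c in family)
        for codon in family:
            # Observed count over the count expected under equal usage.
            # Absent amino acids get 0.0 (assumption for this sketch).
            values[codon] = counts[codon] * len(family) / total if total else 0.0
    return values


def rscu_vector(seq):
    """Return the RSCU values as a list in a fixed (alphabetical) codon order."""
    values = rscu(seq)
    return [values[codon] for codon in sorted(values)]
```

For example, `rscu("GCTGCTGCCGCA")` gives 2.0 for GCT, 1.0 for GCC and GCA, and 0.0 for GCG, because alanine is encoded four times and equal usage would predict one occurrence of each codon.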

CA based on RSCU values relies on two main steps (Mardia et al., 1979). The first step measures the similarity in codon usage among all genes using the squared Euclidean distance, and the resulting distance table is used to compute the coordinates of the genes in a multidimensional space. The second step visualizes these Euclidean distances by positioning the genes through successive orthogonal projections of the cloud of points. Essentially, this process consists of finding new variables f1, f2, ..., f58 that are linear transformations of the 59 original RSCU variables. The f-variables are calculated and ordered according to their relative variance: f1 accounts for the largest share of the variance; f2 accounts for the next largest share and is, by construction, not correlated with f1. The same applies to f3, f4, and so on, until f58. Genes with similar codon usage are therefore neighbors on the components of projection.
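The first step described above, building the table of squared Euclidean distances between genes, can be sketched as below. The function names and the dictionary layout (gene name mapped to its RSCU vector) are assumptions for the example; the projection step was carried out in statistical software and is not reproduced here.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_table(vectors):
    """Return {(gene_a, gene_b): distance} for every ordered pair of genes.

    `vectors` maps a gene name to its RSCU vector (hypothetical layout).
    """
    names = list(vectors)
    return {
        (a, b): squared_euclidean(vectors[a], vectors[b])
        for a in names
        for b in names
    }
```

Genes with similar RSCU vectors have small entries in this table, which is why they end up close together after projection.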

Linear regression analysis was used to assess the correlation between codon usage bias and nucleotide composition. A one-tailed t-test was used to compare codon usage between different gene groups (Ewens and Grant, 2001). The null hypothesis is that the mean values of the codon usage indices in the gene groups being compared are the same. The t-statistic is calculated under this null hypothesis, the P-value is derived from it, and a difference is taken as significant when the P-value is below 0.05.
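A minimal sketch of the two-sample t-statistic is given below. The text does not state which variant of the t-test was used, so the unequal-variance (Welch) form and the function name are assumptions. The sketch returns the statistic and its degrees of freedom only; the one-tailed P-value would then be read from the t-distribution with those degrees of freedom.

```python
import math


def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for two independent samples.

    Uses the unequal-variance (Welch) form, which is an assumption here.
    """
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se_a, se_b = var_a / na, var_b / nb
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df
```

In a one-tailed test the sign of t matters: a negative t supports the alternative that the first group has the smaller mean.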

A C++ program was developed to calculate the codon usage indices for each gene. CA and other statistical analyses were performed with the statistical software SPSS 11.0.

The details of the coding genes in SARSCoV and the overall RSCU values of the 61 sense codons in SARSCoV are shown in Tables 1 and 3, respectively. All preferentially used codons in SARSCoV are A-ended or U-ended codons (Table 3). SARSCoV is a GC-poor genome with a GC content of 37.52%. Because of compositional constraints, A-ended and/or U-ended codons are expected to be preferentially used in this genome. To study the codon usage variation among different SARSCoV genes, the ENC and GC3S values of the SARSCoV genes were calculated (Table 1). The ENC values of the SARSCoV genes vary from 42.19 to 59.06, with a mean value of 48.99 and S.D. of 6.41. Because the ENC values of all SARSCoV genes are high (ENC > 40), codon usage bias in the SARSCoV genome is slight. However, there is marked variation in codon usage pattern among different SARSCoV genes (S.D. = 6.41). Similarly, the GC3S values of the SARSCoV genes, which range from 28.3 to 58.1% with a mean of 37.23% and S.D. of 8.78%, also confirm the heterogeneity of synonymous codon usage among different SARSCoV genes.

CA was performed on all identified ORFs from the 11 virus genomes as a single dataset, which consists of 103 coding sequences. CA detected one major trend on the first axis, which accounted for 15.40% of the total variation; none of the other axes individually accounted for more than 7.60% of the total variation. A plot of the first and second axis values of each gene is shown in Fig. 1. Although this graph is somewhat complex, with some overlap among genes from different genomes, it is clear that genes from a particular genome tend to cluster together. The separation of one virus genome from the other virus genomes is significant on both axes (t-test, P-value < 10^-15 on the first axis and P-value < 10^-3 on the second axis). So, similar to codon usage in mammals and bacteria, synonymous codon usage in these viruses is also virus specific.

To test whether there is a correlation between virus codon usage and the host, the 103 virus genes were divided into several groups according to the virus host. For example, because both SARSCoV and human Coronavirus 229E infect humans, the genes of these two viruses were placed in one group. Next, a t-test was used to test whether the separation of viral genes from viruses infecting different hosts is significant. The P-value is 0.57 on the first axis and 0.08 on the second axis, which suggests that codon usage in these virus genes is not host specific.

In Fig. 1, all virus genes in the genus Coronavirus are plotted in red, and all viral genes in the genus Arterivirus are plotted in blue. Coronavirus genes are mainly located on the left side of the plot, while a majority of Arterivirus genes are located on the right side. The separation of Coronavirus genes and Arterivirus genes on the first axis is statistically significant (t-test, P-value < 10^-15). Hence, synonymous codon usage appears to be conserved between phylogenetically related viruses.

SARSCoV genes were also widely spread along the first axis (Fig. 1). Six of the eleven SARSCoV genes were located in the cluster of Coronavirus genes, while the other five SARSCoV genes were located in the cluster of Arterivirus genes. Therefore, SARSCoV might have diverged far from all three known Coronavirus groups. Compared with the other viruses in the genus Coronavirus, it might be more closely related, in evolutionary terms, to the genus Arterivirus.

Linear regression analysis was performed to test whether synonymous codon usage bias is correlated with nucleotide composition. The R² values and significance levels of these regression analyses are listed in Table 4. The first axis value of each gene in CA is closely correlated with all the base compositions at the third codon position, while the second axis value of each gene is correlated with some base compositions at the third codon position to a certain extent. Therefore, compositional constraint mainly determines the variation of synonymous codon usage among these virus genes.
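For reference, the R² of a simple linear regression of an axis value on a third-position base composition can be obtained by ordinary least squares, as in the sketch below. The regressions reported here were run in statistical software; the function name and return layout are assumptions for the example.

```python
def linear_regression(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, R^2 equals the squared Pearson correlation.
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared
```

An R² close to 1 means that the base composition explains nearly all of the variation in the axis value, whereas a value close to 0 means that it explains almost none.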

Furthermore, we plotted the first axis values in CA against the GC3S values of each gene (Fig. 2). The mean GC3S value of genes in coronaviruses ranges from 26.09 to 37.32%, and it ranges from 45.18 to 53.76% in arteriviruses (Table 2). Although codon usage bias appears to be conserved between evolutionarily related viruses (Section 3.3), the patterns of codon usage in different virus genes also appear to be a direct function of the GC content at the third codon position of these genes.

Table 4. Summary of linear regression analysis between the first two axes in CA and the nucleotide contents on the third codon position in all selected virus genes.

The plot of ENC against GC3S is another effective way to explore codon usage variation among genes (Wright, 1990). The ENC value of each virus gene was plotted against its corresponding GC3S (Fig. 3). The solid line represents the expected curve if codon usage is determined only by GC content at the third codon position. A large proportion of points lie near the solid line in the left region of this distribution, which also suggests that mutational bias is the main factor determining the codon usage variation among these genes. However, some points lie below the expected curve. Hence, in addition to mutational bias, some additional factors might drive the codon usage variation among these genes.
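The expected curve in such a plot follows the relation given by Wright (1990), ENC = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3S value expressed as a fraction. The sketch below computes GC3S for a coding sequence and the expected ENC for a given s. The function names and the use of the standard genetic code are assumptions made for the example.

```python
from collections import Counter

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))
DEGENERACY = Counter(CODON_TABLE.values())  # codons per amino acid


def gc3s(seq):
    """Fraction of G+C at the third position of codons that have synonyms."""
    seq = seq.upper().replace("U", "T")
    total = gc = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE.get(codon)
        # Skip unknown codons, stop codons, and Met/Trp (no synonyms).
        if aa is None or aa == "*" or DEGENERACY[aa] == 1:
            continue
        total += 1
        gc += codon[2] in "GC"
    if total == 0:
        raise ValueError("no synonymous codons found")
    return gc / total


def expected_enc(s):
    """Expected ENC when codon usage depends only on GC3S (s between 0 and 1)."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)
```

The curve peaks at s = 0.5, where the expected ENC is 60.5, and falls towards 31 as s approaches 0; a gene whose observed ENC lies well below the curve at its own GC3S is more biased than base composition alone would predict.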

To test whether translational selection or gene function is correlated with the observed variation in codon bias, all virus genes were grouped into several classes according to gene function. Because most of these viruses contain genes coding for RNA polymerase, envelope protein and structural glycoprotein, these three gene groups were selected to test whether there is some correlation between codon usage and gene function. A one-tailed t-test was then performed on the ENC values of these genes with the null hypothesis that there is no correlation between codon usage bias and gene function. Some associations were found. Average codon usage bias is higher in the RNA polymerase gene group than in the envelope gene group (t-test, P-value = 0.031), and it is higher in the polymerase gene group than in the structural glycoprotein gene group (t-test, P-value = 0.002). However, there is no significant difference in codon usage between the structural glycoprotein gene group and the envelope protein gene group (t-test, P-value = 0.74). Because the structural glycoprotein and the envelope protein are both structural proteins in these viruses and RNA polymerase is a nonstructural protein, codon usage in structural genes differs significantly from that in nonstructural genes. On the other hand, structural genes are generally more highly expressed than nonstructural genes. So, if translational selection also contributed to codon usage bias in these genes, codon usage bias in structural genes should be higher than in RNA polymerase genes. However, RNA polymerase genes (ENC = 49.25) were found to have greater codon usage bias than structural genes (ENC = 54.60 for the envelope gene and ENC = 55.33 for the structural glycoprotein gene). Hence, codon usage bias in these virus genes is not related to gene expression level. Furthermore, we also performed a linear regression analysis on the ENC value and gene length of each gene, but there was no significant correlation between codon usage and gene length in these virus genes (P-value > 0.05). So, gene function, rather than translational selection or gene length, is another factor accounting for codon usage variation among these virus genes.

Our analysis revealed that synonymous codon usage in SARSCoV is only slightly biased, and that the bias is mainly determined by base composition at the third codon position. Comparative analysis of codon usage bias in the order Nidovirales also suggested that codon usage in these viruses is virus specific and that mutational bias is the main factor driving the codon usage variation among these viruses. Gene function is also related to codon usage bias in these viruses to some extent. However, translational selection and gene length might have no effect on the codon usage pattern in these viruses. Some published results have shown that the overall extent of codon usage bias in RNA viruses is low and that there is little variation in bias between genes (Levin and Whittome, 2000; Jenkins et al., 2001; Jenkins and Holmes, 2003). Although SARSCoV is a newly detected RNA virus infecting humans, the synonymous codon usage pattern in SARSCoV described here is also in accordance with the published codon usage patterns of human RNA viruses (Jenkins and Holmes, 2003). Because mutation rates in RNA viruses are much higher than those in DNA viruses (Drake and Holland, 1999), it is understandable that mutation pressure is the main determinant of codon usage bias in SARSCoV. Our analysis also revealed that there is no host-specific codon usage pattern in these viruses. So, the host genome might have no obvious effect on the evolution of these viruses.

Some phylogenetic analyses of SARSCoV (Qin et al., 2003; Marra et al., 2003) have shown that SARSCoV does not closely resemble any of the three previously known groups in the genus Coronavirus. However, Snijder et al. (2003) have proposed that SARSCoV is most closely related to group 2 coronaviruses. Based on the different codon usage patterns in different coronaviruses, we found that the codon usage pattern of each virus is phylogenetically distinct and that SARSCoV might have diverged far from all three known Coronavirus groups, which is in accordance with the results proposed by Qin et al. (2003) and Marra et al. (2003).

The codon usage patterns and phylogenetic results presented here are useful for understanding the processes governing the evolution of SARSCoV, especially the roles played by mutation pressure and natural selection. Further, such information might be helpful in understanding the pathogenesis and the origin of SARSCoV.

Outbreak of severe acute respiratory syndrome in Hong Kong Special Administrative Region: case report

Codon usage and gene function are related in sequences of Arabidopsis thaliana

Codon usage as a tool to predict the cellular location of eukaryotic ribosomal proteins and aminoacyl-tRNA synthetases

Correlations of nucleotide substitution rates and base composition of mammalian coding sequences with protein structure

Second codon positions of genes and the secondary structures of proteins. Relationships and implications for the origin of the genetic code

Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae

An evaluation of measures of synonymous codon usage bias

Mutation rates among RNA viruses

Case clusters of the severe acute respiratory syndrome

A functional significance for codon third bases

Isochores result from mutation not selection

Studies on codon usage in Entamoeba histolytica

Codon usage in bacteria: correlation with gene expressivity

Codon catalog usage and the genome hypothesis

Codon catalog usage is a genome strategy modulated for gene expressivity

Studies on the relationships between the synonymous codon usage and protein secondary structural units

Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system

Codon usage and tRNA content in unicellular and multicellular organisms

Evolution of base composition and codon usage bias in the genus Flavivirus

The extent of codon usage bias in human RNA viruses and its evolutionary origin

What drives codon choices in human genes?

Ribosome traffic in E. coli and regulation of gene expression

Codon usage in nucleopolyhedroviruses

Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae

Cluster analysis of the codon use frequency of MHC genes from different species

Compositional correlation studies among the three different codon positions in 12 bacterial genomes

Synonymous codon usage, accuracy of translation, and gene length in Caenorhabditis elegans

Multivariate analysis

The genome sequence of the SARS-associated Coronavirus

Variation in G + C content and codon choice: differences among synonymous codon groups in vertebrate genes

Replicational and transcriptional selection on codon usage in Borrelia burgdorferi

Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli

Specific correlations between relative synonymous codon usage and protein secondary structure

Characterization of a novel Coronavirus associated with severe acute respiratory syndrome

A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01)

Codon usage in regulatory genes in Escherichia coli does not reflect selection for 'rare' codons

Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes

Unique and conserved features of genome and proteome of SARS-Coronavirus, an early split-off from the Coronavirus group 2 lineage

The 'effective number of codons' used in a gene

The relationship between synonymous codon usage and protein structure

This research is a part of Projects 60121101 and 60223002 supported by National Natural Science Foundation of China.