key: cord-0682910-l4947bew
authors: LÜ, Hui; ZHAO, Wei‐Ming; ZHENG, Yan; WANG, Hong; QI, Mei; YU, Xiu‐Ping
title: Analysis of Synonymous Codon Usage Bias in Chlamydia
date: 2005-01-24
journal: Acta Biochim Biophys Sin (Shanghai)
DOI: 10.1111/j.1745-7270.2005.00009.x
sha: 3407ec3e0f4bd479cea93cdf1270e2fabcbad190
doc_id: 682910
cord_uid: l4947bew

Abstract

Chlamydiae are obligate intracellular bacterial pathogens that cause ocular and sexually transmitted diseases, and are associated with cardiovascular diseases. The analysis of codon usage may improve our understanding of the evolution and pathogenesis of Chlamydia and allow reengineering of target genes to improve their expression for gene therapy. Here, we analyzed the codon usage of C. muridarum, C. trachomatis (here indicating biovar trachoma and LGV), C. pneumoniae, and C. psittaci using the codon usage database and the CUSP (Create a codon usage table) program of EMBOSS (The European Molecular Biology Open Software Suite). The results show that the four genomes have similar codon usage patterns, with a strong bias towards the codons with A and T at the third codon position. Compared with Homo sapiens, the four chlamydial species show seven or eight discordant preferred codons. The ENC (effective number of codons used in a gene)‐plot reveals that the genetic heterogeneity in Chlamydia is constrained by the G+C content, while translational selection and gene length exert relatively weaker influences. Moreover, mutational pressure appears to be the major determinant of the codon usage variation among the chlamydial genes. In addition, we compared the codon preferences of C. trachomatis with those of E. coli, yeast, adenovirus and Homo sapiens. There are 23 codons showing distinct usage differences between C. trachomatis and E. coli, 24 between C. trachomatis and adenovirus, 21 between C. trachomatis and Homo sapiens, but only six codons between C. trachomatis and yeast.
Therefore, the yeast system may be more suitable for the expression of chlamydial genes. Finally, we compared the codon preferences of C. trachomatis with those of six eukaryotes, eight prokaryotes and 23 viruses. There is a strong positive correlation between the differences in coding GC content and the variations in codon bias (r=0.905, P<0.001). We conclude that the variation of codon bias between C. trachomatis and other organisms is much less influenced by phylogenetic lineage and primarily determined by the extent of disparities in GC content.

Edited by You‐Xin JIN

trachoma, biovar lymphogranuloma venereum (LGV) and biovar mouse pneumonitis (C. muridarum, MoPn). C. trachomatis is the causative agent of trachoma, which leads to preventable blindness worldwide, and it also causes several sexually transmitted diseases, such as urethritis, cervicitis and salpingitis [5, 6]. C. pneumoniae, which causes pneumonia, sinusitis and bronchitis, has attracted great scientific attention because it has also been found to be associated with atherosclerosis, acute myocardial infarction and chronic neurological diseases [7]. Romero et al. [8] have reported that four factors (strand-specific mutational biases, replicational-transcriptional selection, the hydropathy of each encoded protein and the degree of amino acid conservation) are involved in the codon usage bias in C. trachomatis. However, it is not clear whether genetic and environmental factors affect codon usage in Chlamydia. In this study, we have analyzed the codon usage data of MoPn, Ct (here indicating biovar trachoma and LGV), C. pneumoniae and C. psittaci and examined how other factors may affect codon usage variation in Chlamydia. We have also compared the codon preferences of Ct with those of Escherichia coli, Saccharomyces cerevisiae, adenovirus and Homo sapiens.
Knowledge of the codon usage pattern in Chlamydia and a comparison of codon preference between Chlamydia and other species may therefore assist in the development of nucleic acid vaccines and improve the understanding of factors shaping codon usage patterns.

Complete genomic sequences of MoPn, Ct, C. pneumoniae and C. psittaci were obtained from GenBank (Bethesda, Maryland, USA; http://www.ncbi.nlm.nih.gov/). Codon usage data were analyzed using the codon usage database (Chiba, Japan; http://www.kazusa.or.jp/codon/) and the CUSP program of EMBOSS (The European Molecular Biology Open Software Suite, Cambridge, UK; http://bioinfo.pbi.nrc.ca:8090/EMBOSS/). The protein-coding sequences (*.ffn files) of complete genomes were downloaded from the GenBank FTP site (ftp://ftp.ncbi.nih.gov/). Because there is a negative correlation between codon usage bias and gene length (that is, codon usage is restricted in short coding sequences), genes with fewer than 100 codons were excluded from the analysis. GC3 indicates the G+C content at the third position of synonymously degenerate codons. The effective number of codons of a gene (ENC) was used to quantify how far the codon usage of a gene departs from equal usage of synonymous codons [9]. ENC can take values from 20 (when only one codon is used per amino acid) to 61 (when all synonyms are used in equal frequencies). ENC appears to be a good measure of the extent of codon preference in a gene. Therefore, the GC3, ENC and gene length of each gene were calculated using the program INteractive Codon Analysis 1.0 (Division of Biology, Faculty of Science, Zagreb, Croatia; http://www.bioinfo-hr.org/inca). The codon usage pattern across genes was examined by the ENC-plot, which is a plot of ENC versus GC3. We compared the codon preferences among Ct, six eukaryotes, eight prokaryotes and 23 viruses.
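The GC3 value and the expected ENC curve used in the ENC-plot can be sketched in a few lines. This is a minimal illustration and not the INCA implementation used in the study: the function names `gc3` and `expected_enc` are hypothetical, the toy sequence is not a chlamydial gene, and the expected curve assumes the approximation given in the ENC paper [9], ENC = 2 + s + 29/(s^2 + (1 - s)^2), where s is GC3.

```python
# Met (ATG) and Trp (TGG) have no synonymous alternative, and stop codons
# encode no amino acid, so all of them are skipped when computing GC3.
NON_DEGENERATE = {"ATG", "TGG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc3(cds):
    """Fraction of G or C at the third position of degenerate codons."""
    cds = cds.upper()
    gc = total = 0
    # Walk the sequence codon by codon; an incomplete trailing codon is ignored.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in NON_DEGENERATE or codon in STOP_CODONS:
            continue
        total += 1
        if codon[2] in "GC":
            gc += 1
    if total == 0:
        raise ValueError("no degenerate codons in sequence")
    return gc / total


def expected_enc(s):
    """Expected ENC when codon usage is determined only by GC3 (s)."""
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


if __name__ == "__main__":
    toy = "GCTGCCATGAAATTAGGA"  # toy sequence for illustration only
    s = gc3(toy)
    print(f"GC3 = {s:.2f}, expected ENC = {expected_enc(s):.1f}")
```

A gene whose observed ENC lies well below `expected_enc(gc3(gene))` would be a candidate for selection beyond compositional constraint. The length filter (fewer than 100 codons excluded) is not applied in this sketch.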
The genomic sequences were retrieved from GenBank. The correlation between codon usage variation among Chlamydia genes and two parameters (G+C content at synonymous sites and gene length) was analyzed using the linear regression analysis model from Microsoft Excel with significance levels of P<0.05 or P<0.01. A similar method was used to analyze the relationship between interspecific codon usage variation and coding GC content.

Overall codon usage patterns and codon usage data in MoPn, Ct, C. pneumoniae and C. psittaci are listed in Table 1. C. pecorum was not analyzed because sufficient data were not available. In the four genomes that were analyzed, the amino acids Arg, Leu, Gly and Val have different codon usage biases because they have six-fold and four-fold coding degeneracy, while the preferred codons of amino acids that have two-fold or three-fold coding degeneracy are uniform. The amino acids Arg, Leu and Ser have six-fold coding degeneracy. For MoPn and C. pneumoniae, Arg uses CGT most frequently, while AGA is most commonly used for Ct and C. psittaci. Although the most and the least commonly used codons of Arg are different, all four genomes prefer to use the codons with A and T ending, not with G and C ending. Codons with A and T ending are used 1.37 times more often than codons with G and C ending for MoPn, 1.7 times more often for Ct, 1.16 times more often for C. pneumoniae and 2 times more often for C. psittaci. AGA and CGA are the two codons with A ending among the six codons coding for Arg. The four chlamydial species prefer to use AGA rather than CGA. Leu is encoded by six codons with two having A at the third position: TTA and CTA. TTA is used 1.55 ± 0.65 times more often than CTA. Leu uses TTA the most and CTG/CTC the least. TTA is used 3.21 times more often than the codon with the lowest frequency for MoPn, 2.68 times for Ct, 2.52 times for C. pneumoniae and 7.31 times for C. psittaci.
As a result, Leu uses codons with A and T ending 0.88 times more often than codons with G and C ending for MoPn, 0.76 times more often for Ct, 0.79 times more often for C. pneumoniae, and 1.06 times more often for C. psittaci. For Ser, TCT is most commonly used, and TCG is used with the lowest frequency. Codons with A and T ending are used 1.07 times more often than codons with G and C ending for MoPn, 0.93 times more often for Ct, 1.07 times more often for C. pneumoniae, and 0.98 times more often for C. psittaci. In short, Arg, Leu and Ser prefer to use the codons with A and T ending, and the usage bias of C. psittaci is more evident than that of the others. The amino acids Ala, Gly, Pro, Thr and Val have four-fold coding degeneracy (XYA, XYC, XYG and XYT). For Ala and Pro, XYT is used most frequently while XYG is used the least. For Gly, C. psittaci uses XYT the most and XYG the least, while the other species show a different bias; that is, they use XYA the most and XYC the least. For Thr, the usage of XYA is approximately the same as that of XYT; XYA and XYT are the most frequently used synonymous codons and XYG is the least commonly used. Val uses XYT most often and XYC the least, except for C. pneumoniae, for which XYG is used least often by Val. The usage of codons with A and T ending is 2.46 ± 1.05 times higher than codons with G and C ending for Ala, 0.90 ± 0.19 times higher for Gly, 2.99 ± 1.49 times higher for Pro, 1.30 ± 0.34 times higher for Thr and 1.26 ± 0.68 times higher for Val. The amino acids Asn, Asp, Cys, His, Phe and Tyr have two-fold codon degeneracy (XYC and XYT). They prefer to use XYT (10.1 to 36.2 per 1000 codons) rather than XYC (4.54 to 20.8 per 1000 codons). XYT is used 0.91 ± 0.32 times more often than XYC for Asn, 1.91 ± 0.41 times more often for Asp, 0.59 ± 0.29 times more often for Cys, 1.35 ± 0.37 times more often for His, 0.77 ± 0.27 times more often for Phe and 0.88 ± 0.55 times more often for Tyr.
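The A/T-ending versus G/C-ending comparisons above come down to comparing summed codon frequencies within one amino acid. The sketch below shows one way to compute such a ratio, assuming frequencies are given per 1000 codons as in the codon usage database. The function name `at_gc_ending_ratio` and the toy numbers are hypothetical, and the text does not state whether its "times more often" figures are this plain ratio or a variant of it (for example, the ratio minus one), so the output should not be read as reproducing the reported values.

```python
def at_gc_ending_ratio(freqs):
    """Ratio of A/T-ending to G/C-ending usage for one amino acid.

    freqs maps each synonymous codon (DNA alphabet, upper case) to its
    frequency per 1000 codons.
    """
    at = sum(f for codon, f in freqs.items() if codon[2] in "AT")
    gc = sum(f for codon, f in freqs.items() if codon[2] in "GC")
    if gc == 0:
        raise ValueError("no G/C-ending usage; ratio is undefined")
    return at / gc


# Toy frequencies for the six Arg codons (illustrative, not taken from Table 1).
arg = {"CGT": 10.0, "CGC": 5.0, "CGA": 6.0, "CGG": 3.0, "AGA": 12.0, "AGG": 6.0}
ratio = at_gc_ending_ratio(arg)
print(f"A/T-ending to G/C-ending ratio for Arg: {ratio:.2f}")
```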
For the amino acids Gln, Glu and Lys, whose two-fold degeneracy is of the form XYA or XYG, XYG (8.12 to 23.6 per 1000 codons) is used less often than XYA (26.2 to 51.5 per 1000 codons). XYA is used 1.44 ± 0.74 times more than XYG for Gln, 1.07 ± 0.35 times more for Glu and 1.69 ± 0.59 times more for Lys. Ile is the only amino acid that has three-fold codon degeneracy (XYA, XYC and XYT). XYT is used most frequently (0.78 ± 0.31 times more than XYC and 1.66 ± 0.52 times more than XYA). In contrast to all the other amino acids, for which codons with G or C at the third position are used the least, Ile uses XYA with the least frequency. However, for Ile, codons with A and T ending (XYA and XYT) are used 1.46 ± 0.35 times more often than codons with G and C ending (XYC). As a whole, all chlamydial species or biovars analyzed show significant preference for one particular codon for each amino acid. They show a high bias of codon usage toward the codons with T and/or A ending rather than C and/or G ending for all degenerate codons (1.16-5.96 times). At the same time, there are some differences in codon usage patterns among various chlamydial species or biovars. C. psittaci shows the greatest bias towards optimal codons. For example, the codon used for Arg with the highest frequency (18 per 1000 codons) was used 15 times more than the codon with the lowest frequency (1.12 per 1000 codons) in C. psittaci, but only 2.7 times more in MoPn, 6.5 times more in Ct and 3.3 times more in C. pneumoniae. Generally, codons used more than twice as frequently as host consensus codons are regarded as preferred codons of heterologous genes. Compared with Homo

Fract refers to the proportion of all synonymous codons encoding the same amino acid. The frequency of each codon that appears in the coding sequence of the individual gene is 1/1000. Shaded codons are the preferred codons in Chlamydia. Triplets in bold face indicate a high frequency in coding the amino acid.
Rimmed codons appear during low-frequency coding of the amino acid. Ct denotes biovar trachoma and LGV.

The codon usage pattern also varies between genes in the same chlamydial genome. Plotting ENC values against GC3 is one effective way to explore this heterogeneity [9]. The ENC value of each chlamydial gene is plotted against its corresponding GC3 in Fig. 1(A). Due to limited data, C. psittaci was not included in the analysis. The curve shows the expected position of genes whose codon usage is only determined by variation in GC3 content. If a particular gene is subject to G+C compositional constraints, it will lie on or just below the expected curve. If a gene is subject to selection for translationally optimal codons, it will lie considerably below the expected curve. Among most chlamydial genes, the GC3 values vary from 0.08 to 0.44, while the ENC values vary from 40 to 56. If the genes have low codon usage bias, the translational selection factor does not appear to be important for gene expression. Genes with lower GC3 values also have lower ENC values, indicating a stronger codon bias. A large number of points lie near the solid curve on the left side of this distribution, suggesting that these genes are subject to GC compositional constraints. Statistically, the relationship between ENC and GC3 is significantly positive (P<0.001), suggesting that mutational bias may be the major determinant of codon usage variation among chlamydial genes. The genetic heterogeneity correlates positively with the A+T content at the third position, which is consistent with the preference for codons with T and/or A ending as discussed above. The relationship between gene length and synonymous codon usage bias has been reported for Drosophila melanogaster, Escherichia coli, Saccharomyces cerevisiae, Pseudomonas aeruginosa and Yersinia pestis [10-12]. Here, the plot of gene length against ENC or against GC3 [Fig.
1(B,C)] appears to assume the shape of a normal distribution. Shorter genes show a much wider variance in ENC values than longer genes. We have analyzed the relationship between ENC value and gene length, and the relationship between GC3 and gene length in chlamydial genes. None of the correlations were statistically significant. Evidently, gene length affects codon usage of Chlamydia only in a minor way. Similar results were also found in S. pneumoniae, P. aeruginosa and SARS coronavirus [12-14].

As mentioned above, MoPn, Ct, C. pneumoniae and C. psittaci adopt similar codon usage patterns. Thus, the codon preferences of Ct, as a representative of Chlamydiae, were compared with those of E. coli, yeast, adenovirus and Homo sapiens to see which would be a suitable host for the optimal expression of Chlamydia genes. As shown in Table 2, there are 23 codons showing a Ct-to-E. coli ratio higher than 2 or lower than 0.50, 24 codons showing a Ct-to-adenovirus ratio higher than 2 or lower than 0.50 and 21 codons showing a Ct-to-human ratio higher than 2 or lower than 0.50, but only 6 codons showing a Ct-to-yeast ratio higher than 2 or lower than 0.50, suggesting that codon usage of Ct genes more closely resembles that of yeast genes than that of E. coli, adenovirus and human genes. Thus, to express chlamydial genes efficiently in E. coli or human cell systems, codon optimization of the chlamydial genes may be required. At the same time, we can speculate that the Chlamydia genes may be more efficiently expressed in the yeast system. On the basis of the above observations, we compared the codon usage of several other eukaryotes and prokaryotes with that of Chlamydia (Table 3).
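The counting rule behind these comparisons (a codon is flagged when its Ct-to-host frequency ratio is higher than 2 or lower than 0.5) can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function name `discordant_codons`, the handling of codons absent from the host table, and the toy frequencies are all assumptions.

```python
def discordant_codons(source, host, upper=2.0, lower=0.5):
    """Return codons whose source-to-host frequency ratio is > upper or < lower.

    source and host map codons to frequencies per 1000 codons.
    """
    flagged = []
    for codon in sorted(source):
        s = source[codon]
        h = host.get(codon, 0.0)
        if h == 0.0:
            # Assumption: a codon unused by the host but used by the source
            # counts as discordant, because the ratio is unbounded.
            if s > 0.0:
                flagged.append(codon)
            continue
        ratio = s / h
        if ratio > upper or ratio < lower:
            flagged.append(codon)
    return flagged


# Toy frequencies per 1000 codons (illustrative, not taken from Table 2).
ct = {"AGA": 12.0, "CGC": 2.0, "AAA": 40.0}
host = {"AGA": 4.0, "CGC": 10.0, "AAA": 30.0}
flagged = discordant_codons(ct, host)
print(len(flagged), flagged)
```

Applied to full codon tables, `len(flagged)` would correspond to the per-host counts quoted in the text, under the assumptions stated above.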
To examine whether different species comply with the same codon usage rule, we compared Ct not only with six eukaryotes and eight prokaryotes, but also with 23 viruses, taking into account that both Chlamydia and viruses are obligate intracellular microorganisms whose codon usage may be restricted by their hosts (Table 3). From Table 3, it is clear that Ct presents similar codon preferences to Vibrio cholerae, Saccharomyces cerevisiae and Schizosaccharomyces pombe, because fewer than 10 codons show Ct-to-species ratios either higher than 2 or lower than 0.5, irrespective of their phylogenetic lineages. However, there are 23 codons with a Ct-to-E. coli ratio either higher than 2 or lower than 0.5, 40 codons with a Ct-to-Mycobacterium tuberculosis ratio either higher than 2 or lower than 0.5, 47 codons with a Ct-to-Bifidobacterium adolescentis ratio either higher than 2 or lower than 0.5, 21 codons with a Ct-to-human ratio either higher than 2 or lower than 0.5, 28 codons with a Ct-to-Eremothecium gossypii ratio either higher than 2 or lower than 0.5 and 31 codons with a Ct-to-Neurospora crassa ratio either higher than 2 or lower than 0.5, indicating that the codon preferences are significantly different between them. If codon usage is a major determinant of gene expression, Ct genes may be expressed more efficiently in species such as Vibrio cholerae, Saccharomyces cerevisiae and Schizosaccharomyces pombe. From Table 3, it can also be seen that the codon preference of Ct is most similar to that of the vaccinia virus, followed by the human coronavirus and human immunodeficiency virus, but is least similar to that of the human adenovirus and human herpesvirus. Thus, it can be speculated that Ct genes may be expressed well in the vaccinia virus system. This hypothesis remains to be tested.
To investigate whether the changes in coding GC content among various species are associated with the observed variations in codon bias, the absolute values of the difference in coding GC content between Ct and other species were calculated. A linear regression analysis was then performed to examine the relationship between the differences in coding GC content and the corresponding numbers of codons with obviously different usage frequency. As a result, a strong positive correlation between differences in coding GC content and variations in codon bias (r=0.905, P<0.001, Table 3) was observed, indicating that the codon usage variation between Ct and other species is strongly constrained by coding GC content. In summary, the variation of codon usage bias between Ct and other organisms appears to be determined by the extent of disparities in coding GC content, and is less influenced by phylogenetic lineage.

http://www.abbs.info; www.blackwellpublishing.com/abbs

Table 2 Comparison of codon preferences between Ct and E. coli, yeast, adenovirus (ad), and Homo sapiens. 1/1000 represents the frequency of each codon that appears in the whole coding gene. Ct/E. coli, Ct/yeast, Ct/ad and Ct/human indicate the ratio of codon usage frequency in Ct to that in E. coli, Saccharomyces cerevisiae, adenovirus and Homo sapiens respectively. A ratio higher than 2 or lower than 0.5 indicates that the codon preference differs greatly, and vice versa.

Table 3 Comparison of codon preference among different species. GCcod indicates the G+C content of protein genes. GCd indicates the absolute value of the difference in coding GC content between Ct and other species. Number indicates the number of codons with great diversities in codon bias (ratios higher than 2 or lower than 0.5) between Ct and other organisms.

Discussion

Previous analyses of codon usage have suggested that both a huge interspecific variation and a clear intragenomic variability exist. Codon usage bias is found to be related to different biological factors, such as tRNA abundance, strand-specific mutational bias, gene expression level, gene length, amino acid composition, protein structure, mRNA structure and GC composition [15-18]. However, directional mutation pressure on DNA sequences and natural selection affecting gene translation are the two major factors that have been widely accepted to account for both interspecific codon usage variation and intragenomic codon usage variability. With regard to the codon usage variation among genes within the same organism, this phenomenon has been observed in a wide range of species. In some unicellular organisms, such as E. coli and Saccharomyces cerevisiae, highly expressed genes have a strong selective preference for the codons complementary to the most abundant tRNA species, whereas lowly expressed genes display more uniform codon usage patterns largely compatible with the mutational bias in the absence of translational selection [19, 20]. In mammals and birds, the diverse patterns of codon usage may arise from compositional constraints of the genomes [21-23]. Romero et al. have conducted a correspondence analysis for C. trachomatis genes and found that the most important source of variation among the genes is whether the sequence is located on the leading or lagging strand of replication, resulting in an over-representation of G or C, respectively [8]. In the present study, we used the ENC-plot to analyze the factors affecting codon usage variation among genes and extended the analysis to other chlamydial species.
Our analysis has reinforced the abovementioned findings. Here, genetic heterogeneity in the Chlamydia species is observed to be constrained by GC content, while the gene length has only a minor impact on the codon choice. In various species of Chlamydia, genetic heterogeneity seems to be the result of similar factors. In C. trachomatis, the major trend in codon choices is not as strong as in other species [8], so it can be postulated that there may be several major factors which shape chlamydial gene codon usage. In addition to strand-specific mutational biases, GC3 may be another important factor. Certainly, illustrating all the factors which shape chlamydial codon usage variation is a complex issue. However, mutational pressure, not the selective forces acting at the translational level, may play an important role in determining the codon usage variation among chlamydial genes. All these findings that hold true for C. trachomatis as well as for other tested chlamydial species support the "mutational bias-translational selection" hypothesis.

Our studies indicate that the codon usage variation between Ct and other species (including eukaryotes, prokaryotes and viruses) is much less influenced by phylogenetic lineage and primarily determined by the extent of disparities in coding GC content. It can be inferred that codon usage variability among different species may not depend on their phylogenetic relationships, but on their coding GC content. Consequently, the coding GC content may be more useful in predicting the amino acid or nucleotide sequence than in phylogenetic reconstruction. Our conclusion is consistent with previous studies that were mostly focused on genome GC content and limited to the three domains of life (Bacteria, Archaea, and Eukarya). For example, Chen et al.
have proposed that only two parameters, genome GC content and context-dependent nucleotide bias, effectively differentiate the genome-wide codon bias of 100 eubacterial and archaeal organisms [24]. Moreover, they found that genome GC content variation is the most important parameter differentiating codon bias between different organisms [24]. It has been reported that seven GC-rich microbial genomes belonging to different domains of life adopt similar codon usage patterns regardless of their phylogenetic lineages [25]. In the present study, we have examined the relationship between the coding GC content and cross-species disparities in codon usage. Since the whole genome contains coding regions and non-coding regions, coding GC content may presumably be a more accurate reflection of the codon usage bias than genome GC content. Most importantly, our analysis revealed that coding GC content is correlated with cross-species disparities in codon usage not only in all three domains of life but also in non-cellular microbes.

In the present study, a comprehensive analysis of codon usage and genome base composition in chlamydial species has revealed that: (1) MoPn, Ct, C. pneumoniae and C. psittaci adopt similar codon usage patterns, although C. psittaci shows the greatest bias towards optimal codons; (2) the chlamydial species prefer to use the codons with A and T at the third codon position; and (3) the gene codon usage pattern is significantly different between Chlamydia (MoPn, Ct, C. pneumoniae and C. psittaci) and human genomes. Compared with human genomes, the four chlamydial genomes do not have the same preferred codons. Furthermore, the biased trend towards A and T coincides with high A+T content at silent sites in Chlamydia (the mean value is 70%). Since Chlamydia are AT-rich organisms, it is reasonable that A and/or T ending codons are predominant in their genomes.
These findings suggest that it is still necessary to differentiate various chlamydial species, and even biovars, when designing specific strategies for optimizing chlamydial codons, even though there are significant similarities in the codon usage pattern among all tested Chlamydia species. An assumption made in the present study, that chlamydial genes may be expressed more efficiently in Saccharomyces cerevisiae and vaccinia virus systems, is potentially important. This may serve as a guide for manipulating the expression of the targeted genes. Chlamydial genes optimized with host-preferred codons are likely to show improved expression levels in a given host. Our preliminary experiments have shown that the chlamydial major outer membrane protein (MOMP) gene optimized with human-preferred codon usage has a higher level of expression in mammalian cells than the wild-type MOMP gene (data not shown). Thus, yeast and vaccinia virus expression systems may be better suited to the production of chlamydial proteins. We plan to use yeast and vaccinia virus expression systems to test our hypothesis. In summary, our work has provided a basic understanding of the evolution and pathogenesis of Chlamydia, with some new insights into the mechanisms for codon usage bias and vaccine development to prevent chlamydial diseases.
References

[1] Codon frequencies in 119 individual genes confirm consistent choices of degenerate bases according to genome type
[2] Expression of proteins encoded by foreign genes in Saccharomyces cerevisiae
[3] Evolution of codon usage patterns: The extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae
[4] Synonymous codon usage in Cryptosporidium parvum: Identification of two distinct trends among genes
[5] Chlamydial infections
[6] Epidemiology of ocular Chlamydial infection in a trachoma-hyperendemic area
[7] Evidence of chronic Chlamydia pneumoniae infection in patients with Behcet's disease
[8] Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces
[9] The "effective number of codons" used in a gene
[10] Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli
[11] Factors affecting codon usage in Yersinia pestis
[12] Gene expressivity is the main factor in dictating the codon usage variation among the genes in Pseudomonas aeruginosa
[13] Analysis of factors shaping S. pneumoniae codon usage
[14] Analysis of synonymous codon usage in SARS coronavirus and other viruses in the Nidovirales
[15] Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes
[16] DNA G+C content of the third codon position and codon usage biases of human genes
[17] Directional mutation pressure, selective constraints, and genetic equilibria
[18] Noise in eukaryotic gene expression
[19] Ribosome traffic in E. coli and regulation of gene expression
[20] Codon usage in yeast: Cluster analysis clearly differentiates highly and lowly expressed genes
[21] The influence of translational selection on codon usage in fishes from the family Cyprinidae
[22] Studies on codon usage in Entamoeba histolytica
[23] What drives codon choices in human genes?
[24] Codon usage between genomes is constrained by genome-wide mutational processes
[25] Seven GC-rich microbial genomes adopt similar codon usage patterns regardless of their phylogenetic lineages