key: cord-0752095-a73rdm51 authors: Zhang, Yanping; Jin, Xiaojie; Wang, Haiyan; Miao, Yaoyao; Yang, Xiaoping; Jiang, Wenqing; Yin, Bin title: Compelling Evidence Suggesting the Codon Usage of SARS-CoV-2 Adapts to Human After the Split From RaTG13 date: 2021-10-08 journal: Evol Bioinform Online DOI: 10.1177/11769343211052013 sha: 5f8facf812166115bde5b79e07d046badd132b63 doc_id: 752095 cord_uid: a73rdm51

SARS-CoV-2 must use host resources efficiently in order to survive and propagate. Among the multiple layers of the regulatory network, mRNA translation is the rate-limiting step in gene expression. Synonymous codon usage usually conforms to tRNA concentration to allow fast decoding during translation. It has been reported that SARS-CoV-2 has adapted to the codon usage of human lungs, so that the virus can proliferate rapidly in the lung environment. While this notion neatly explains the adaptation of SARS-CoV-2 to lungs, it does not explain why other viruses lack this advantage. In this study, we retrieve the GTEx RNA-seq data for 30 tissues (from over 17 000 individuals). We calculate the RSCU (relative synonymous codon usage) weighted by gene expression in each human sample, and investigate the correlation of RSCU between the human tissues and SARS-CoV-2 or RaTG13 (the closest coronavirus to SARS-CoV-2). Lung has the highest RSCU correlation with SARS-CoV-2 among all tissues, suggesting that the lung environment is generally suitable for SARS-CoV-2. Interestingly, for most tissues, the SARS-CoV-2-human correlation is higher than the RaTG13-human correlation. This difference is most pronounced in lungs. In conclusion, the codon usage of SARS-CoV-2 has adapted to human lungs to allow fast decoding and translation. This adaptation probably took place after SARS-CoV-2 split from RaTG13, because RaTG13 is less well correlated with human tissues.
This finding depicts the trajectory of adaptive evolution from the ancestral sequence to SARS-CoV-2, and also explains why SARS-CoV-2, rather than other viruses, could adapt so well to the human lung environment.

Viruses need to adapt to their hosts to survive and proliferate [1]. To replicate rapidly, the key question for a virus is how to use the limited resources of the host cell efficiently. The central dogma dictates that virus proliferation (at least for RNA viruses like SARS-CoV-2) includes both replication and translation. While the elements required for replication are evolutionarily conserved across coronaviruses [2], the mechanism of viral translation is less well understood. This prompts us to consider how the virus translates its own RNAs in host cells. Among the various biological processes, mRNA translation is the most energy-consuming and rate-limiting step [3]. Viral RNAs therefore have to compete with host RNAs for the translation machinery. It is believed that higher translation efficiency is favored by natural selection [4-6], and this is especially crucial for viruses. Conceivably, if viral genes are not highly expressed, optimizing the viral protein sequence or structure is of little use, because viral genes must be expressed at a sufficiently high level before they can exert their functions.

An effective way to raise mRNA translation efficiency is to optimize synonymous codon usage [7-9]. Although synonymous codons encode identical amino acids, their cognate tRNAs differ in concentration [10]. Synonymous codons with higher tRNA availability are advantageous because they are decoded and translated faster. Since tRNA availability is highly correlated with relative synonymous codon usage (RSCU) [11, 12], it is generally accepted that frequently used codons are optimized (optimal) codons that are suitable for fast translation as well as favored by natural selection [13]. The RSCU of eukaryotes has adapted to tRNA concentration during long-term evolution [12]. For viruses that invade a host, the viral codon usage is usually uncorrelated with the host codon usage, which means that viral genes are less able to use the tRNA pool and have lower translation efficiency. Indeed, manipulating codon usage has already been used as a strategy to control viral gene expression [14]. Conversely, from the virus perspective, an effective way to invade the host is to make its codon usage similar to that of the host transcriptome [1]. Since RSCU is measured from the genes expressed in a particular tissue or sample, different tissues of the same species can have distinct RSCU spectra.

A previous study found an interesting co-evolution pattern between virus and host [15]. When the codon usage of the virus is similar to that of the host, the translation efficiency of host genes is suppressed, so that the virus gains an advantage through translational selection. However, these analyses were mainly conducted at the unicellular level (yeast and human cells), where tissue-specific gene expression is not an issue. In reality, infection by SARS-CoV-2 is tissue-selective, which reminds us that different tissues should be analyzed separately. An analogous conclusion was proposed based on proteomic evidence: although the GC content of SARS-CoV-2 is poorly correlated with that of the human genome (possibly to avoid the human immune response [16]), human genes with codon usage similar to that of SARS-CoV-2 were down-regulated upon infection [17].

It has been reported that although the codon usage of SARS-CoV-2 is not correlated with the RSCU of the human genome, it is correlated with the codon usage of lung-expressed genes [1]. In other words, SARS-CoV-2 is adapted to human lungs. This observation neatly explains why SARS-CoV-2 can be rampant in the lung environment.
However, this raises the question of why other viruses do not have the same advantage in infecting human lungs. Answering it takes 2 steps. The first is to examine multi-tissue transcriptome data to confirm that the codon usage of SARS-CoV-2 is indeed most similar to that of human lungs (compared with other human tissues). The second is to analyze the codon usage of a closely related coronavirus to see whether it is less correlated with human lungs. RaTG13 is a coronavirus found in bats (Rhinolophus affinis). Although its evolutionary divergence from SARS-CoV-2 is still under debate [18], it is widely accepted that RaTG13 is the closest sequence relative of SARS-CoV-2 [19-21]. In this study, by retrieving the GTEx [22] RNA-seq data of 30 human tissues (from over 17 000 individuals) and analyzing the RSCU of SARS-CoV-2, RaTG13, and human tissues, we conclude that the codon usage of SARS-CoV-2 is most adapted to human lungs, and that this adaptation took place after the split from RaTG13. Our finding explains the uniqueness of SARS-CoV-2 and shows the evolutionary trajectory of codon usage in coronaviruses.

The RNA-seq data of human tissues were downloaded from the GTEx portal [22]. The expression level of each gene (measured in TPM, transcripts per million) is given in the downloaded file. We downloaded the SARS-CoV-2 and RaTG13 genomes from the NCBI website (https://www.ncbi.nlm.nih.gov/genome/). The coding sequences of the human genome were downloaded from the Ensembl website, version hg19 (ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/).

The RSCU is defined as the frequency of a codon divided by the mean frequency of all its synonymous codons [23]. Frequency is calculated over a particular set of genes, usually the highly expressed genes (HEG) in a given sample. Under this definition, for an amino acid with N synonymous codons, the RSCU values of all its codons sum to exactly N.
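The RSCU definition above can be illustrated with a minimal Python sketch. This is not the authors' pipeline (the paper reports using R for its statistics); the function names `codon_counts` and `rscu` are hypothetical, and the choices to drop stop codons and to skip amino acids that never occur are assumptions made only for this illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional TCAG ordering ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds):
    """Count in-frame codons of one coding sequence (DNA or RNA letters)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(counts):
    """RSCU = codon count divided by the mean count of its synonymous family."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # assumption: stop codons are left out
            families.setdefault(aa, []).append(codon)
    values = {}
    for fam in families.values():
        total = sum(counts.get(c, 0) for c in fam)
        if total == 0:
            continue  # amino acid never used: RSCU is undefined, so skip it
        mean = total / len(fam)
        for c in fam:
            values[c] = counts.get(c, 0) / mean
    return values


# Toy example: four alanine codons (GCT twice, GCC once, GCA once).
toy = rscu(codon_counts("GCTGCTGCCGCA"))
print({c: toy[c] for c in ("GCT", "GCC", "GCA", "GCG")})
# {'GCT': 2.0, 'GCC': 1.0, 'GCA': 1.0, 'GCG': 0.0}
```

The four alanine values sum to 4, the number of synonymous codons for alanine, as the definition requires. Single-codon amino acids (Met, Trp) always get RSCU = 1 under this sketch.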
For a codon, RSCU > 1 indicates a frequently used (optimal) codon, and RSCU < 1 indicates a non-optimal codon. Because the definition of HEG in a sample is arbitrary, it is also feasible to calculate RSCU by weighting each gene by its expression level instead of discarding the lowly expressed genes. In this study, the "all genes weighted" RSCU is calculated by weighting all genes by expression, whereas the "tissue-specific HEG" RSCU excludes the 500 genes with the highest mean expression across all tissues. The excluded 500 genes represent house-keeping genes.

Statistics such as the Pearson correlation coefficient (PCC) were computed in the R language and environment. Delta PCC = $\mathrm{PCC}_{\text{SARS-CoV-2}} - \mathrm{PCC}_{\text{RaTG13}}$, where $\mathrm{PCC}_{\text{SARS-CoV-2}}$ is the PCC of RSCU between SARS-CoV-2 and human samples, and $\mathrm{PCC}_{\text{RaTG13}}$ is the PCC of RSCU between RaTG13 and human samples.

We retrieve GTEx transcriptome data of 30 human tissues (from over 17 000 individuals) to examine the correlation of codon usage between human tissues and SARS-CoV-2. The RSCU values of SARS-CoV-2 are calculated with a standard pipeline [1]. For human tissues, the RSCU values can be calculated either from the gene sequences of an arbitrarily selected set of HEG or, alternatively, by weighting each gene by its expression level so that the arbitrary cutoff is avoided. In theory, the 2 approaches should produce similar RSCU values because the HEG have higher weights.

We first weight each gene by expression level and calculate the RSCU for each sample. We then calculate the Pearson correlation coefficient (PCC) between SARS-CoV-2 and every human sample. The 30 tissues are ranked by median PCC. While lung samples generally have the highest RSCU correlation with SARS-CoV-2 (Figure 1A), it is striking that the PCC is globally negative, with a median PCC = −0.32 and 95% quantile [−0.36, −0.19] for all human samples (Figure 1B).
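The expression weighting and the per-sample correlation described above can be sketched in Python as follows. The paper does not spell out the weighting formula, so this sketch assumes that each gene's codon counts are multiplied by its TPM and then pooled. The function names and the toy gene names are hypothetical, and the RSCU dictionaries passed to `rscu_pcc` are assumed to have been computed separately from the pooled counts.

```python
import math
from collections import Counter


def weighted_codon_counts(gene_codon_counts, tpm):
    """Pool per-gene codon counts, weighting each gene by its TPM in one sample."""
    total = Counter()
    for gene, counts in gene_codon_counts.items():
        weight = tpm.get(gene, 0.0)  # genes absent from the sample get weight 0
        for codon, n in counts.items():
            total[codon] += weight * n
    return total


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("PCC is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def rscu_pcc(rscu_a, rscu_b):
    """PCC between two RSCU dictionaries over the codons present in both."""
    codons = sorted(set(rscu_a) & set(rscu_b))
    return pearson([rscu_a[c] for c in codons], [rscu_b[c] for c in codons])


# Toy data: two hypothetical genes and their TPM in one sample.
genes = {"geneA": {"GCT": 2, "GCC": 1}, "geneB": {"GCT": 1, "GCG": 3}}
sample_tpm = {"geneA": 10.0, "geneB": 5.0}
pooled = weighted_codon_counts(genes, sample_tpm)
print(dict(pooled))  # GCT: 25.0, GCC: 10.0, GCG: 15.0
```

Under this assumption a highly expressed gene contributes to the pooled counts in proportion to its abundance, which is why the result should resemble an RSCU computed from a HEG subset.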
Since we calculated the human RSCU by weighting each gene by expression level, we ask whether another approach could improve the correlation. We select the top 50% of genes by expression level and calculate RSCU from the raw sequences of these genes (not weighted by expression level). The RSCU under this approach does not differ much from the weighted one, with median PCC = −0.31 and 95% quantile [−0.35, −0.19] when compared to SARS-CoV-2.

Unexpectedly, the RSCU values of human and SARS-CoV-2 are negatively correlated. Given that SARS-CoV-2 is very successful in invading human lungs, there should be features unique to lungs that SARS-CoV-2 can take advantage of. It is known that some house-keeping genes are constantly highly expressed in all tissues (such as the constitutively expressed RPL family genes), while other genes are selectively highly expressed in particular tissues (termed tissue-specific HEG). Our purpose is to reflect tissue-specific codon usage, but the weighting method overrates the contribution of house-keeping genes and underestimates the importance of tissue-specific HEG. Among the tens of thousands of genes in the human genome, we rank the genes by mean expression across all GTEx samples, and empirically find that the top 500 genes are constantly highly expressed in most samples. We therefore regard the top 500 genes as house-keeping genes. The robustness of this cutoff is discussed later.

To reflect the contribution of tissue-specific HEG, we remove the top 500 house-keeping genes and apply the weighting method to calculate RSCU. The PCC values between SARS-CoV-2 RSCU and human RSCU are calculated for each tissue separately. Strikingly, the correlation is no longer "all negative" but shows a considerable fraction of positive correlations in each tissue (Figure 1C).
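The house-keeping filter described above (rank genes by mean expression across all samples, then drop the top 500 before weighting) could look like the following Python sketch. The data layout (a dictionary of per-sample TPM dictionaries), the function names, and the treatment of a gene missing from a sample as TPM 0 are assumptions for illustration only; the sample and gene names in the toy data are placeholders.

```python
def top_expressed_genes(tpm_by_sample, k=500):
    """Return the k genes with the highest mean TPM across all samples."""
    genes = set().union(*(s.keys() for s in tpm_by_sample.values()))
    n = len(tpm_by_sample)
    # Assumption: a gene missing from a sample counts as TPM 0 in that sample.
    mean_tpm = {
        g: sum(s.get(g, 0.0) for s in tpm_by_sample.values()) / n for g in genes
    }
    # Sort by descending mean TPM; break ties by gene name for reproducibility.
    ranked = sorted(mean_tpm, key=lambda g: (-mean_tpm[g], g))
    return set(ranked[:k])


def drop_genes(tpm, excluded):
    """Copy of one sample's TPM table without the excluded genes."""
    return {g: v for g, v in tpm.items() if g not in excluded}


# Toy data: one gene high everywhere, two genes high in a single tissue.
samples = {
    "lung_1": {"RPL3": 900.0, "SFTPC": 800.0, "ALB": 1.0},
    "liver_1": {"RPL3": 950.0, "SFTPC": 2.0, "ALB": 700.0},
}
house_keeping = top_expressed_genes(samples, k=1)
lung_specific = drop_genes(samples["lung_1"], house_keeping)
print(house_keeping, lung_specific)
```

Keeping `k` as a parameter makes it straightforward to rerun the analysis with a different cutoff when checking robustness.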
Lung is still the most highly correlated tissue, and the median PCC across all tissues is −0.06 with 95% quantile [−0.26, 0.15] (Figure 1D). Of all the lung samples, 93% show positive correlation with SARS-CoV-2 and 3.6% show significantly positive PCC. These results agree with our intuitive expectation.

To test the robustness of the tissue-specific HEG approach, we alternatively selected the top 300, top 800, and top 1000 genes as the house-keeping genes, and calculated the RSCU and PCC for the remaining genes weighted by expression level. The aforementioned patterns are robust. Lung is always the tissue showing the highest correlation with SARS-CoV-2, and 87% to 94% of the lung samples show significantly positive PCC. Therefore, we conclude that the tissue-specific HEG, rather than the constitutively highly expressed house-keeping genes, are suitable for distinguishing the codon usage of different tissues. Lung remains the tissue most highly correlated with SARS-CoV-2, and a large fraction of lung samples show positive correlation with the codon usage of SARS-CoV-2.

Given the observation that the codon usage of SARS-CoV-2 conforms to that of human lung-expressed genes, one would naturally ask why only SARS-CoV-2, rather than other closely related coronaviruses, has succeeded in this adaptation. To answer this question, we retrieve the sequence of RaTG13, the closest coronavirus to SARS-CoV-2. We calculate the correlation of RSCU between RaTG13 and human tissues using the tissue-specific HEG (excluding the top 500 house-keeping genes). For a given human sample, we define delta PCC = $\mathrm{PCC}_{\text{SARS-CoV-2}} - \mathrm{PCC}_{\text{RaTG13}}$. A positive delta PCC means that SARS-CoV-2 is better adapted to human than RaTG13 is. We rank the human tissues by increasing delta PCC. Notably, all samples have a positive delta PCC, and lungs generally have the highest delta PCC (Figure 2A).
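The delta PCC comparison above reduces to a per-sample subtraction followed by a per-tissue ranking, sketched here in Python. Ranking tissues by the median delta PCC is an assumption (the text states only that tissues are ranked by increasing delta PCC). The function names, sample identifiers, and numeric values below are invented placeholders, not results from the paper.

```python
from statistics import median


def delta_pcc(pcc_sars2, pcc_ratg13):
    """Per-sample delta PCC = PCC(SARS-CoV-2) - PCC(RaTG13)."""
    return {s: pcc_sars2[s] - pcc_ratg13[s] for s in pcc_sars2 if s in pcc_ratg13}


def rank_tissues(delta, tissue_of):
    """Tissues sorted by increasing median delta PCC (the median is an assumption)."""
    by_tissue = {}
    for sample, d in delta.items():
        by_tissue.setdefault(tissue_of[sample], []).append(d)
    return sorted(
        ((tissue, median(values)) for tissue, values in by_tissue.items()),
        key=lambda pair: pair[1],
    )


# Placeholder PCC values for three hypothetical samples.
pcc_sars2 = {"s1": 0.10, "s2": 0.05, "s3": -0.02}
pcc_ratg13 = {"s1": 0.02, "s2": 0.01, "s3": -0.05}
tissue_of = {"s1": "lung", "s2": "lung", "s3": "liver"}

ranking = rank_tissues(delta_pcc(pcc_sars2, pcc_ratg13), tissue_of)
for tissue, med in ranking:
    print(f"{tissue}: median delta PCC = {med:.3f}")
```

A positive value for every sample would correspond to the pattern reported in Figure 2A, with the last tissue in the ranking being the one where SARS-CoV-2 gains the most over RaTG13.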
This result shows that SARS-CoV-2 consistently has better codon adaptation to human tissues than RaTG13, and that the difference peaks in lungs.

Note that the RSCU above is calculated with the tissue-specific HEG. We have shown that using this set of genes is better than using all genes. To illustrate this point, we use one lung sample (ID: Lung_GTEX-QDT8-0926-SM-32PL2) to show how the choice of gene set affects the RSCU and the correlation. First, we calculate RSCU with all genes weighted by expression level in this lung sample. The PCC with SARS-CoV-2 (Figure 2B) and the PCC with RaTG13 (Figure 2C) are both negative and do not differ much. However, if we calculate RSCU with the tissue-specific HEG by excluding the top 500 house-keeping genes, the PCC with both SARS-CoV-2 and RaTG13 increases greatly, and the PCC with SARS-CoV-2 (Figure 2D) is clearly higher than the PCC with RaTG13 (Figure 2E). This pattern, in which SARS-CoV-2 is better correlated with human, suggests that the adaptation took place after SARS-CoV-2 split from RaTG13.

From the perspectives of structural biology, biochemistry, and molecular biology, there are already plenty of explanations for how SARS-CoV-2 could adapt so well to the human environment. However, several essential links are still missing. (1) Why could SARS-CoV-2, rather than other closely related coronaviruses, adapt so well? (2) If high expression of viral genes is not achieved, the optimization of protein sequence or structure is almost futile (because quantity and quality are both important for the functioning of viral genes). These questions can be addressed in the light of evolution [24]. Mutation is the raw material of natural selection and evolution [25-27]. Different mutated sequences are subjected to natural selection, which eliminates the less adaptive ones [28]. The transition from the ancestral sequence to the present SARS-CoV-2 sequence reflects this long-lasting evolutionary process.

By comparing the codon usage profiles of SARS-CoV-2, RaTG13, and numerous human tissues, we find that SARS-CoV-2 always has better correlation with human tissues than its closest relative RaTG13. This observation helps reconstruct the evolutionary history of SARS-CoV-2. After the split from RaTG13, SARS-CoV-2 optimized its codon usage to become more similar to that of human tissues, especially lungs. This allows SARS-CoV-2 to use the resources (such as tRNA pools) of human cells more efficiently, and explains why SARS-CoV-2 is so unique compared with other viruses.

The adaptive codon usage of SARS-CoV-2 facilitates viral RNA translation. A high translation rate of viral genes ensures a high amount of viral proteins, which is a prerequisite for successful viral invasion and proliferation. For the host, the optimization of the viral sequence is undesirable, but for the virus itself, this codon optimization might be a milestone in evolution. Interestingly, the previous study on host-virus co-evolution also found that the codon usage correlation is higher between viruses and symptomatic hosts (compared with other natural hosts) [15], which supports our conclusion that SARS-CoV-2 optimized its codon usage after splitting from RaTG13.

We also emphasize that the correlation of codon usage between host and parasite should be investigated tissue by tissue, that is, taking tissue-specific gene expression into account rather than using the whole reference genome. As we have shown, different human tissues exhibit quite different correlations with the virus. SARS-CoV-2 is highly correlated with human lungs, indicating that it is well adapted to the lung environment. The whole genome sequence (of a host species) should be used only when no high-quality tissue-specific transcriptome data are available.
Note that codon optimization is a necessary but not sufficient requirement for successful viral invasion. Codon optimization ensures a relatively high level of viral gene expression, but the expressed viral proteins must also be structurally optimal to invade host cells. Natural selection on codon optimization acts on synonymous mutations, while selection on protein sequence or structure acts on missense mutations. One could conclude that any mutation in the viral coding sequence is potentially subject to natural selection. Therefore, during this global pandemic, no single mutation in the SARS-CoV-2 sequence should be ignored, since it might change viral adaptiveness and even virulence.

The codon usage of SARS-CoV-2 has adapted to human lungs to allow fast decoding and translation. This adaptation probably took place after SARS-CoV-2 split from RaTG13, because RaTG13 is less well correlated with human tissues. This finding depicts the trajectory of adaptive evolution from the ancestral sequence to SARS-CoV-2, and also explains why SARS-CoV-2, rather than other viruses, could adapt so well to the human lung environment.
1. GC usage of SARS-CoV-2 genes might adapt to the environment of human lung expressed genes
2. No species-level losses of s2m suggests critical role in replication of SARS-related coronaviruses
3. SARS-CoV-2 has the advantage of competing the iMet-tRNAs with human hosts to allow efficient translation
4. Dietary nitrogen alters codon bias and genome composition in parasitic microorganisms
5. Selection-driven cost-efficiency optimization of transcripts modulates gene evolutionary rate in bacteria
6. Cost-efficiency tradeoff is optimized in various cancer types revealed by genome-wide analysis
7. Codon optimization of Iranian human papillomavirus type 16 E6 oncogene for Lactococcus lactis subsp cremoris MG1363
8. Codon usage determines translation rate in Escherichia coli
9. Codon usage influences the local rate of translation elongation to regulate co-translational protein folding
10. Solving the riddle of codon usage preferences: a test for translational selection
11. Codon usage and tRNA content in unicellular and multicellular organisms
12. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis
13. Natural selection on synonymous mutations in SARS-CoV-2 and the impact on estimating divergence time
14. Downregulating viral gene expression: codon usage bias manipulation for the generation of novel influenza A virus vaccines
15. Dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection
16. Base composition and host adaptation of the SARS-CoV-2: insight from the codon usage perspective
17. SARS-CoV-2 codon usage bias downregulates host expressed genes with similar codon usage
18. Pros and cons of the application of evolutionary theories to the evolution of SARS-CoV-2
19. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event
20. Addendum: a pneumonia outbreak associated with a new coronavirus of probable bat origin
21. The divergence between SARS-CoV-2 and RaTG13 might be overestimated due to the extensive RNA modification
22. The genotype-tissue expression (GTEx) project
23. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications
24. Nothing in biology makes sense except in the light of evolution
25. Retrieving the deleterious mutations before extinction: genome-wide comparison of shared derived mutations in liver cancer and normal population
26. Mutation profiling of a limbless pig reveals genome-wide regulation of RNA processing related to bone development
27. Comparative genomic analysis of a naturally born serpentized pig reveals putative mutations related to limb and bone development
28. Fast evolution of SARS-CoV-2 driven by deamination systems in hosts. Future Virol

We thank the members of our group who gave suggestions on our project. At this time of SARS-CoV-2, we especially thank all the medical workers fighting against SARS-CoV-2.

Wenqing Jiang https://orcid.org/0000-0002-9605-5799
Bin Yin https://orcid.org/0000-0002-1762-7585