key: cord-1000134-ps1hx4th
authors: Wei, Changshuo; Chen, Yan-Ming; Chen, Ying; Qian, Wenfeng
title: The missing expression level-evolutionary rate anticorrelation in viruses does not support protein function as a main constraint on sequence evolution
date: 2021-03-13
journal: Genome Biol Evol
DOI: 10.1093/gbe/evab049
sha: d0888dec3e097a9fe0dae594569caabd022227b6
doc_id: 1000134
cord_uid: ps1hx4th

One of the central goals in molecular evolutionary biology is to determine the sources of variation in the rate of sequence evolution among proteins. Gene expression level is widely accepted as the primary determinant of protein evolutionary rate, because it scales with the extent of selective constraints imposed on a protein, leading to the well-known negative correlation between expression level and protein evolutionary rate (the E-R anticorrelation). Selective constraints have been hypothesized to entail the maintenance of protein function, the avoidance of cytotoxicity caused by protein misfolding or nonspecific protein-protein interactions, or both. However, empirical tests evaluating the relative importance of these hypotheses remain scarce, likely due to the non-trivial difficulties in distinguishing the effect of a deleterious mutation on a protein’s function vs. its cytotoxicity. We realized that examining the sequence evolution of viral proteins could overcome this hurdle. It is because purifying selection against mutations in a viral protein that result in cytotoxicity per se is likely relaxed, while purifying selection against mutations that impair viral protein function persists. Multiple analyses of SARS-CoV-2 and nine other virus species revealed a complete absence of any E-R anticorrelation. As a control, the E-R anticorrelation does exist in human endogenous retroviruses where purifying selection against cytotoxicity is present. Taken together, these observations do not support the maintenance of protein function as the main constraint on protein sequence evolution in cellular organisms.

Understanding variation in the rate of sequence evolution among proteins encoded by the same genome has always been one of the central goals in molecular evolutionary biology.

Two evolutionary mechanisms have been proposed to explain the widely observed negative correlation between the expression level and the protein evolutionary rate (the E-R anticorrelation) in cellular organisms, i) the maintenance of protein function and/or ii) the avoidance of cytotoxicity. However, empirical tests evaluating the relative importance of these mechanisms on the rate of protein sequence evolution remain scarce. We estimated the E-R correlation in 10 virus species and observed a complete absence of any E-R anticorrelation. The observation does not support function maintenance as the primary selective constraint on protein sequence evolution.

The rate of protein sequence evolution varies by orders of magnitude among proteins encoded by a single genome (Koonin and Wolf 2010; Li et al. 1985; Wilson et al. 1977; Zuckerkandl and Pauling 1965) , and understanding sources of such variation is a central goal in the molecular evolution research field (Echave et al. 2016; Kimura 1968; Pal et al. 2006; Rocha 2006; Wilson et al. 1977; Zhang and Yang 2015) . The protein evolutionary rate is determined primarily by the level of the selective constraint-the factor(s) reducing the extent of divergence for a protein sequence relative to a neutral evolutionary process, owing to the operation of purifying selection. The selective constraint has been found to scale with a given protein's gene expression level: the occurrence of a deleterious mutation in a more highly expressed gene will result in a larger number of improperly formed proteins. As a consequence, the expression level (E) is negatively correlated with protein evolutionary rate (R) in cellular organisms of all three domains of life (Pal et al. 2001; Zhang and Yang 2015) . This phenomenon is known as the E-R anticorrelation.

Two mutually non-exclusive explanations have been proposed for the ubiquitous E-R anticorrelation reported in cellular organisms (Pal et al. 2001; Zhang and Yang 2015) . First, the function maintenance hypothesis explained the E-R anticorrelation based on the idea that the observed optimal expression level of a protein represents a stable equilibrium between the benefit of protein function and the biochemical cost of protein expression (Cherry 2010; Gout et al. 2010) . Therefore, around the optimal expression level, the functional benefit of having one more protein molecule can be approximated by the biochemical cost of expressing this molecule. Similarly, the functional decrease induced by a slightly deleterious mutation can be approximated by the loss of the cellular resources synthesizing nonfunctional protein molecules (or more accurately, the functionally impaired protein components), which is proportional to both the effect size of the deleterious mutation and the expression level of the gene (Cherry 2010; Gout et al. 2010) .

While the function maintenance hypothesis in theory can explain the E-R anticorrelation,

it has not been extensively tested using empirical data (Zhang and Yang 2015) .

A second explanation for the E-R anticorrelation is the cytotoxicity avoidance hypothesis.

Here, the term "cytotoxicity" is used in its broad sense: the negative consequence caused by the misfolding of a protein (Drummond et al. 2005; Drummond and Wilke 2008; Vavouri et al. 2009; Yang et al. 2010) or by its nonspecific interaction (i.e., misinteraction) with other proteins in the cell (Levy et al. 2012; Vavouri et al. 2009; Yang et al. 2012; Zhang et al. 2008) . Such cytotoxicity also scales with the gene expression level as a deleterious mutation will render more misfolded or misinteracted proteins if it occurs in a more highly expressed gene. Thus, the term selective constraint-in addition to the intuitive meaning of "maintaining the function of a protein"-has been hypothesized to entail the selective pressure for avoiding cytotoxicity (Drummond et al. 2005) . Although the role of cytotoxicity avoidance in protein sequence evolution has been supported by results from many studies (reviewed in Zhang and Yang 2015) , there has been considerable debate over its validity in recent years (Biesiadecka et al. 2020; Plata and Vitkup 2018; Razban 2019; Usmanova et al. 2021 ).

In principle, both the function maintenance hypothesis and the cytotoxicity avoidance hypothesis can be used to explain the E-R anticorrelation observed in cellular organisms ( Figure 1A , left). However, a rigorous evaluation of the relative importance of these hypotheses using empirical data remains absent, likely because it is highly complicated and challenging to experimentally decipher, precisely, whether a deleterious mutation disrupts the function of a protein, induces cytotoxicity, or both.

We speculated that viruses might provide an excellent opportunity for evaluating the relative importance of these two hypotheses (Figure 1A , right). On the one hand, viruses do not have cell structures where cytotoxicity can play a direct role. Even if viral proteins cause some host cell cytotoxicity, such cytotoxicity may not significantly reduce the proliferation rate of the virus, because it is likely masked (at least to some extent) by the rapid cytopathogenic damage that the regular viral infect-and-replicate life cycle causes to the host cells. Consider that viruses hijack the machinery of host cells, assemble new viruses, then burst out from and kill host cells, doing so within a rapid time frame (Bojkova et al. 2020; El-Sayed et al. 2016) . In sharp contrast, consider that cytotoxicity plays a role mainly in a chronic manner because it may take up to years for misfolded proteins to accumulate and aggregate to an effective level (Bergh et al. 2015; Gáspári and Perczel 2010) . Consistently, protein aggregation associated diseases often occur in the long-lived cells, such as neurons (Chiti and Dobson 2017) , in elder animals.

On the other hand, it is conceivable that slightly deleterious mutations that impair viral protein functions should remain being selected against, because the impaired infect-andreplicate functions should reduce the proliferation efficiency of viruses. Furthermore, the extent of such selective constraint is still proportional to the viral gene expression level:

the equilibrium between the functional benefit and the biochemical cost of protein expression should remain effective in determining the optimal expression level of viral proteins. The reason is explained as follows. Although viruses do not have their own protein-synthesis machinery, they often shut off transcription and translation of the host cell genes (Walsh and Mohr 2011) and regard the current protein-synthesis resources of the infected host cells as their own to express viral proteins. Consequently, viruses incur a fitness cost for the production of viral proteins of impaired functions, as such resources could have been used to synthesize fully functional viral proteins.

Taken together, we reasoned that a mutation in the viral genome would likely evolve under relaxed purifying selection or even neutrally if it merely causes cytotoxicity to the host cell; in contrast, the evolution of viral genes remains constrained by the maintenance of protein function. Therefore, it should be informative to determine if the E-R anticorrelation exists among viral proteins. The cytotoxicity avoidance hypothesis is not supported if the E-R anticorrelation persists in viruses, and the function maintenance hypothesis is not supported if the E-R anticorrelation is missing in viruses (Figure 1B-C) .

Intriguingly, a previous study reported that the E-R anticorrelation exists among 10

Mononegavirales virus species (Pagán et al. 2012) although the aim of the study was not to use viruses as a system to determine the molecular mechanism(s) of selective constraints.

The authors ultimately reported the apparent existence of the E-R anticorrelation in these viruses only after they excluded 10 "outlier" open reading frames (ORF) from a total of 82 ORFs in their analysis. We were deeply interested in these findings, and we suspect that excluding these "outlier" ORFs which were not specified a priori may have masked the real situation. We also suspect that another limitation of their study was the relatively low number of sequenced variants for the virus species they examined (the median number was 18 for a virus species), which would have limited the accuracy in the estimation of the protein evolutionary rate.

Given that the current SARS-CoV-2 pandemic is occurring in the world nearly ubiquitously equipped with the capacity for high-throughput sequencing, the thousands of confirmed variants for this virus provide us with the opportunity to rigorously evaluate the relative explanatory power of the function maintenance hypothesis and the cytotoxicity avoidance hypothesis. SARS-CoV-2 is a positive-sense single-stranded RNA virus, which replicates using an RNA-dependent RNA polymerase and does not integrate into the genome of its host's cells Zhou et al. 2020) . Note that SARS-CoV-2 often lyses host cells within only dozens of hours after infecting them (Bojkova et al. 2020) . We evaluated the presence of E-R anticorrelation in SARS-CoV-2 and in nine other virus species. Fundamentally, our finding that these viral proteins do not exhibit any E-R anticorrelation does not support the role of function maintenance as the main determining factor in protein sequence evolution.

To determine the E-R correlation coefficient in the SARS-CoV-2 genome, we estimated the protein evolutionary rate for each of the nine ORFs from the reported sequences of 7310 SARS-CoV-2 variants isolated from patients. Specifically, the protein evolutionary rate for an ORF was defined as the average pairwise amino acid differences among http://mc.manuscriptcentral.com/gbe variants, normalized by the ORF length (Figure 2A and Table 1 ). To determine if these rates are correlated with mRNA expression levels, we further estimated the respective expression levels of these ORFs from high-throughput RNA sequencing data (Kim et al. 2020) for Vero cells infected by SARS-CoV-2 ( Figure 2B and Table 1) .

We detected no E-R anticorrelation in SARS-CoV-2 (ρ = 0.60, P = 0.097, N = 9, Spearman's correlation, Figure 2C ). As a control, we estimated whether E-R anticorrelation exists in Vero cells; these cells were originally isolated from the kidneys of green monkey Chlorocebus sabaeus and are widely used to culture viruses (Rhim et al. 1969 ). Specifically, we estimated the protein sequence divergence using a comparison between green monkey (C. sabaeus) and its close relative, macaque (Macaca mulatta, Figure 2A ), and correlated the protein sequence divergence values with the mRNA levels of nuclear genes for Vero cells from the same RNA sequencing data set ( Figure 2B ). We found that expression level and evolutionary rate were negatively correlated in Vero cells (ρ = -0.18, P = 1.77×10 -72 , N = 9481, Spearman's correlation, Figure 2C ), confirming as expected that these host cells do exhibit the E-R anticorrelation.

Presumably, the absence of the E-R anticorrelation in SARS-CoV-2 could be caused by the narrower range in which the ORFs of SARS-CoV-2 are expressed (174 times between the most highly and the most lowly expressed ORFs, N and ORF6, respectively, Table 1 ).

To test if this relatively narrow range of gene expression has disabled the detection of the E-R anticorrelation in SARS-CoV-2, we restricted the detection of the E-R correlation coefficient in the Vero cell to a subset of endogenous genes. The expression levels of the top 20% endogenous genes in the Vero Cell are within a range of 150 times, which is smaller than 174 times as in SARS-CoV-2. Among these genes, expression level remained negatively correlated with evolutionary rate (ρ = -0.08, P = 4.4×10 -4 , N = 1915, Spearman's correlation, Figure S1 ), suggesting that the expression range of 174 times in SARS-CoV-

The lack of E-R anticorrelation in SARS-CoV-2 suggests that function maintenance unlikely plays the primary role in determining the protein evolutionary rate for cellular genes. However, there are some other possible explanations for our finding that SARS-CoV-2 does not exhibit the E-R anticorrelation. One possible explanation is the potential impact of specific SARS-CoV-2 protein synthesis dynamics (Brierley et al. 1989; . That is, the translation of ORF1ab is known to be subject to a programmed ribosomal frameshift between ORF1a and ORF1b, and the encoded polypeptides are known to be subject to further proteolysis in host cells; this ORF thus encodes for a total of 11 or 15 nonstructural proteins (nsps, Figure 3A ). As a consequence, although these nsps have the same mRNA abundance, they can vary in protein abundance, owing for example to the programmed ribosomal frameshift, as well as potential variation in translation elongation rate (resulted from codon usage bias or other cis regulatory elements) Zhao et al. 2021; Zhao et al. 2020 ) and protein stability ).

To evaluate this possibility, we tested if the E-R anticorrelation exists for the 23 proteins encoded by the SARS-CoV-2 genome (15 nsps by ORF1ab plus eight proteins by the other eight ORFs). Specifically, we evaluated the expression level for each protein based on mass spectrometry data for SARS-CoV-2 infected Vero cells (Davidson et al. 2020) . As our analysis based on RNA expression data, this protein data-based analysis found no evidence for the E-R anticorrelation (ρ = 0.36, P = 0.09, N = 23, Spearman's correlation, Figure 3B and Table S1 ), suggesting that the use of RNA expression data in our analysis does not account for the undetected E-R anticorrelation in SARS-CoV-2.

Three SARS-CoV-2 proteins (S, E, and M) are known to be packaged for secretion at the endoplasmic reticulum. Secreted proteins-especially those that undergo N-linked glycosylation-and membrane proteins have been postulated to be under relatively weak selective constraints for avoiding cytotoxicity; this thinking is based on the stringent quality control mechanisms of the endoplasmic reticulum or on the impacts of their subcellular locations. Indeed, there have been reports indicating that some of these proteins exhibit a weaker E-R anticorrelation (Feyertag et al. 2017) or even a positive E-R correlation (Feyertag et al. 2019) . To control for impacts specific to confounding factors from protein secretion, we performed an additional analysis that excluded proteins S, E, and M from the 23 proteins encoded by the SARS-CoV-2 genome. The E-R anticorrelation remained absent (ρ = 0.28, P = 0.24, N = 20, Spearman's correlation, Figure 3C ), suggesting that these three proteins do not account for the undetected E-R anticorrelation in SARS-CoV-2.

Another possible explanation for our finding that SARS-CoV-2 does not exhibit the E-R anticorrelation is the potential impact from its host-jumping into humans .

That is, the evolutionary rate estimated from the sequencing data for SARS-CoV-2 variants in human hosts may not genuinely reflect its long-term evolutionary rate (Longdon et al. 2014 ), due to for example relaxed purifying selection in a subset of viral ORFs in human hosts. Seeking to exclude this possibility, we estimated the protein evolutionary rates in a phylogenetic tree of SARS-CoV-2 and three related coronaviruses that have not achieved zoonotic transfer into humans (RaTG13 isolated from bats and GX-P5E and GD-1 from pangolin samples, Figure 4A ).

For each ORF, we calculated the number of nonsynonymous substitutions per nonsynonymous site (d N ) and the number of synonymous substitutions per synonymous site (d S ). We used their ratio (d N /d S ) to infer the protein evolutionary rate in an effort to control for the genomic variation on mutation rates ( Table 1) . As in our analysis based on virus sequence data for human hosts, in this animal host-based analysis we found no evidence for the E-R anticorrelation among the nine SARS-CoV-2 ORFs (ρ = 0.62, P = 0.09, N = 9, Spearman's correlation, Figure 4B ). It is worth noting that the saturation of synonymous changes could have led to an inaccurate estimation of d S , and consequently, the d N /d S . Nevertheless, we also detected no E-R anticorrelation in SARS-CoV-2 when d N was used as the protein evolutionary rate ( Figure S2A ). All these observations suggest that the estimation of protein evolutionary rate using variants collected in human hosts does not account for the undetected E-R anticorrelation in SARS-CoV-2. Also note that Influenza A virus is segmented whereas the others (including SARS-CoV-2) are nonsegmented (https://viralzone.expasy.org/). There have been hundreds of variants isolated for each of these virus species (Hatcher et al. 2017; Shu and McCauley 2017) .

For each virus species, we estimated the evolutionary rate as the average pairwise sequence difference among isolated variants for each ORF (normalized by the ORF length). Gene expression levels were estimated from the RNA sequencing or genome tiling array data of cell cultures infected by the corresponding viruses (Albarino et al. 2018; Assarsson et al. 2008; Blanco-Melo et al. 2020; Cheng et al. 2017; Zhang et al. 2020) or were retrieved from studies that estimated mRNA levels based on the band densities from electrophoresis gels (Barik 1992; Cattaneo et al. 1987; De et al. 1990; Takeuchi et al. 1993 ). We did not detect the E-R anticorrelation for any of these nine virus species (Spearman's correlation, Figure 5A -I).

Seeking to identify whether a common trend exists between the expression level and evolutionary rate of these viruses, we tested if the E-R correlation coefficients of the ten viruses (including SARS-CoV-2) were with the same sign. Six positive and four negative correlation coefficients (regardless of the statistical significance) were observed ( Figure   5J ), which does not support a common trend of the E-R correlation among these viruses (P = 0.75, the binomial test with the hypothesized probability equal to 50% for both positive and negative correlation coefficients). But recalling that the relatively small number of ORFs (and therefore, limited statistical power) for each virus species could prevent robust trend detection, we attempted to synthesize the data across all ten virus species examined in our study. We understand that the evolutionary rates and expression levels among these viruses are not directly comparable; therefore, we performed a metaanalysis to combine the correlation coefficients calculated for the individual virus species.

Briefly, Fisher's z-transformation was performed to obtain the weight for each virus species, and the generic inverse-variance pooling method was applied to estimate a pooled correlation coefficient. The pooled E-R correlation coefficient ρ was equal to 0.08 (P = 0.16, N = 392, Figure 5J ), with 95% confidence intervals [-0.03, 0.17], a finding supporting the lack of the E-R anticorrelation for these viruses.

It is worth noting that the evolutionary rate estimated from the genomic sequences of virus variants are polymorphic and therefore, may not genuinely reflect the substitution rate in the long-term evolution. Seeking to exclude this possibility, we further estimated the evolutionary rate from the orthologous genes of sister virus species or subspecies.

Specifically, we estimated the evolutionary rates as the d N /d S ratio for each orthologous ORF pair between SARS-CoV-2 and SARS-CoV; the latter caused the severe acute respiratory syndrome in 2003. Again we detected no E-R anticorrelation (ρ = -0.07, P = 0.88, N = 9, Spearman's correlation, Figure 6A ).

Note that SARS-CoV-2 and SARS-CoV are often not considered to be two virus species.

The evolutionary rates between SARS-CoV-2 and its sister species (MERS-CoV) were not presented because their sequence divergence was too high to warrant precise sequence alignment. For example, the protein similarity of the ORF N was only 43% between SARS-CoV-2 and MERS-CoV in the alignable region, and that of the ORF S was only 35%. Such extensive sequence divergence, even compared with the sister species, was also common for some other virus species used in Figure 5 . Nevertheless, we successfully estimated the d N /d S ratios between human cytomegalovirus (Human betaherpesvirus 5) and Chimpanzee cytomegalovirus (Panine betaherpesvirus 2), and among five Orthopoxvirus species

We detected no E-R anticorrelation for cytomegalovirus or Orthopoxvirus (Figure 6B-C) or in a meta-analysis that estimated the pooled Spearman's correlation coefficient from three groups of viruses (ρ = 0.13, P = 0.11, N = 158, Figure 6D ). Keeping in mind the possibility of saturated synonymous changes, we also estimated the E-R correlation with d N instead of d N /d S being used to estimate the protein evolutionary rate. We again found no evidence for the E-R anticorrelation ( Figure S2B-D) . Taken together, the E-R anticorrelation was not evident in virus evolution in the short term (evolutionary rates estimated from pairwise sequence difference among variants) or in the long term (evolutionary rates estimated as d N /d S of orthologous genes of virus species).

No E-R anticorrelation was detected for any of the ten virus species we examined in this study, indicating that an E-R anticorrelation will not be evident when protein sequence evolution is constrained only by the maintenance of protein function. Therefore, this observation contradicts the function maintenance hypothesis but is consistent with the cytotoxicity avoidance hypothesis.

The cytotoxicity avoidance hypothesis further predicts that the E-R anticorrelation should be present among viral ORFs if the selection against cytotoxicity exists ( Figure 7A) . We realized that this prediction could be tested with human endogenous retroviruses (HERVs), which are the remnants of ancient integration of exogenous retroviruses that occurred in the germline (Griffiths 2001) . There are tens of thousands of HERVs in the human genome (Belshaw et al. 2004 ), each including some or all of the four ORFs, gag (encoding core proteins), pro (protease), pol (reverse transcriptase), and env (envelope proteins). The vast majority of HERVs are presumably neither infectious nor functional, but numerous HERVs retained their protein expression activity in human cells (Andersson et al. 2002; Seifarth et al. 2005) . The cytotoxicity induced by HERVs can reduce the fitness of the host organisms, which will, in turn, compromise the propagation of these viral elements. Consequently, any purifying selection against cytotoxicity could reasonably be expected to play a role in the sequence evolution of HERVs ( Figure 7A) . Therefore, the cytotoxicity avoidance hypothesis can be tested with HERVs; the hypothesis is not supported if the E-R anticorrelation among ORFs of HERV is not detected.

To test if the E-R anticorrelation exists among HERVs, we identified orthologous endogenous retrovirus pairs between humans (Homo sapiens) and chimpanzees (Pan troglodytes, Figure 7B ) and estimated their protein evolutionary rate from their sequence divergence ( Figure 7C ). We further estimated the expression level for each HERV ORF from RNA sequencing data (Tokuyama et al. 2018 ). We found that expression level was negatively correlated with protein evolutionary rate among HERV ORFs (ρ = -0.20, P = 1.1×10 -13 , N = 1316, Spearman's correlation, Figure 7D and Figure S3 ), indicating the cooccurrence between the E-R anticorrelation and purifying selection against cytotoxicity. Furthermore, the correlation was not significantly different from the E-R anticorrelation trend among human endogenous genes (P = 0.29, subsampling test, Figure 7E ). Taken together, these observations confirm that the E-R anticorrelation can be evident in viruses when purifying selection against cytotoxicity is present.

In this study, we show that the widely reported E-R anticorrelation in cellular organisms is missing in viruses, where selection against cytotoxicity is relaxed but selection against impaired function remained (Figures 2-6) . Furthermore, this correlation exists when selection against cytotoxicity is present as viral sequences are integrated into the host genome (Figure 7) . These observations suggest that, whereas the maintenance of protein function definitely plays some roles in constraining protein sequence evolution, it is not likely the main component of the selective constraint that generates the E-R anticorrelation.

Instead, the selective constraint may be better interpreted as the consequence of the avoidance of cytotoxicity or other mechanisms not yet known. Nevertheless, we realize that the lack of E-R anticorrelation observed for the viruses in our study is subject to at least three caveats, which we discuss below.

First, the E-R anticorrelation is expected under the assumption that positive selection does not play a role, because beneficial mutations are considered too rare to affect the protein evolutionary rate (Zhang and Yang 2015) . In principle, the absence of the E-R anticorrelation in SARS-CoV-2 could result from the violation of this assumption; i.e., it is conceivable that positive selection could have extensively driven the sequence evolution of viral proteins, due to for example the escape from the host immune systems. However, this is unlikely for several reasons. First, there are multiple reports of strong signals for purifying selection among the SARS-CoV-2 variants isolated from patients Tang et al. 2020) . Moreover, to our knowledge there are only a few examples of signals suggesting positive selection of SARS-CoV-2 in human hosts.

Although not highlighted in our results, note that we did conduct additional analyses to examine this caveat possibility that positive selection could have driven the evolution of SARS-CoV-2. We estimated the d N /d S ratio on the phylogenetic tree of SARS-CoV-2 and three related coronaviruses (shown in Figure 4A ) using Phylogenetic Analysis by Maximum Likelihood (PAML) (Yang 2007) . The d N /d S ratio was smaller than 0.1 and significantly smaller than 1 (P < 0.001, likelihood ratio tests) for each of the nine SARS-CoV-2 ORFs (Table S2) . Further, the proportion of sites showing a signal for positive selection (d N /d S > 1) was equal to zero for each ORF (Table S2) . Similar results have also been reported in previous studies of Influenza A virus and other RNA viruses (Pybus et al. 2007; Suzuki 2006) . All of these observations suggest that positive selection is rare and is unlikely to account for the observed absence of the E-R anticorrelation in viruses.

A second caveat is that the avoidance of cytotoxicity could still exert some role in the evolution of viral genomes. For example, if cytotoxicity from a viral protein reduces the health of a host cell, then this reduced health could in turn reduce the proliferation rates for the infecting viruses; this scenario would cause selection-based avoidance of this type of cytotoxicity on viral genome evolution. However, we suspect that such an effect is unlikely to be strong, since SARS-CoV-2 kills the host cell within a time window of only dozens of hours (Bojkova et al. 2020) , whereas cytotoxicity is known to act mainly in long-lived cells, such as in neurons (Chiti and Dobson 2017; David et al. 2010 ). More obviously, if such avoidance of cytotoxicity in viral genome evolution is still effective to some extent, our data-driven challenge of the function maintenance hypothesis remains: the selective constraint from function maintenance would not be sufficient to create any trend suggesting the E-R anticorrelation.

Third, our study so far tested the function maintenance and cytotoxicity avoidance as two competing hypotheses because they are the two most popular theories used to explain the E-R anticorrelation. We do realize the possibility that the E-R anticorrelation observed in cellular organisms is explained by the additional mechanisms not yet known. Nevertheless, the identification of such novel mechanisms will not affect our conclusion that the function maintenance hypothesis is not well supported by the empirical data because we have reported that the maintenance of function (alone, or together with additional unknown mechanisms that operate in viruses) is not sufficient to generate a significant E-R anticorrelation in protein sequence evolution.

We noted in the introduction that Pagán et al. claimed the detection of the E-R anticorrelation in ten negative-stranded Mononegavirales virus species (Pagán et al. 2012 ).

However, there are at least two major limitations of their study. First, the detection of the E-R anticorrelation in Pagán et al. was dependent on excluding 10 "outlier" ORFs from a total of 82 ORFs (Pagán et al. 2012) . We believe that this procedure was inappropriate because these "outliers" were not specified a priori. As a matter of fact, no E-R anticorrelation (P = 0.17) was detected by Pagán et al. before these "outliers" were removed, which is actually consistent with our results. Furthermore, it is notable that the number of sequenced variants for each virus species in Pagán et al. was much smaller than that of our data sets. Specifically, in Pagán et al., the number of isolated and sequenced variants for each virus species used to estimate the evolutionary rate varied between 6 (Sendai virus) and 110 (Newcastle disease virus), with the median equal to 18. In fact, five out of their ten virus species have also been included in our analysis (Ebola virus, human parainfluenza virus type 3, human respiratory syncytial virus, measles virus, and mumps virus, Figure 5 ). In sharp contrast to their work, the variant numbers we assessed for these five virus species were between 209 (measles virus) and 1006 (human respiratory syncytial virus). This much-enlarged number of sequenced variants per virus species in our study enabled a significant improvement in accuracy for estimation of protein evolutionary rate, which bolstered our confidence for a complete absence of any E-R anticorrelation in viruses.

Some viruses can infect multiple cellular species, leading to symptoms in some hosts (symptomatic hosts) but not in others (asymptomatic hosts). It is conceivable that reducing virus-associated cytotoxicity in the asymptomatic host can help propagate and spread the virus . By this logic, it should be informative to test if the E-R anticorrelation is present in asymptomatic hosts but is absent in symptomatic hosts for the same virus. However, we realized that technical hurdles in partitioning the evolutionary histories between the symptomatic and asymptomatic hosts for the same virus could hinder the investigation in this vein. Taking the Zika virus for example, the partition between the viral genome evolution in its symptomatic host (humans) vs. in its asymptomatic host (mosquitos) is not trivial. It is because the Zika virus has been reported transmitted from mosquitos to humans time and again (Gutierrez-Bugallo et al. 2019) . The sequence difference between the virus variants isolated from two humans can arise from the evolutionary history either in humans or in mosquitos. Nevertheless, comparing the E-R correlation coefficients for the same virus between its asymptomatic and symptomatic hosts deserves further investigation once the technical hurdle is overcome in the future.

In sum, our results provide an empirical test for evaluating the relative importance between the function maintenance hypothesis and the cytotoxicity avoidance hypothesis. Our results indicate that a given mutation may not confer its impacts through a specific loss of function, instead, through other mechanisms such as cytotoxicity. In addition to this theoretical purport, our findings also have medical implications. That is, the currently widespread default thinking from experimental biologists and medical scientists that disease-associated mutations are likely to impact some specific molecular function should be expanded: the potentially high likelihood that a given mutation confers its effects via cytotoxicity per se should be taken into consideration. Consistent with this idea, 83% of disease-causing missense polymorphisms observed in human genomes have been estimated to affect protein stability, potentially influencing the misfolding propensity of proteins (Wang and Moult 2001) . This new insight will help to computationally predict disease-causing mutations from genome-wide sequencing data, an ongoing research direction that has been widely valued in recent years (Cooper and Shendure 2011; Wu et al. 2014) , by pointing to the direction to identify nonsynonymous mutations that can reduce the structural stability of proteins. 

The mRNA level of an ORF of coronaviruses cannot be estimated from the abundance of mapped reads in RNA sequencing data due to the nested nature of the genome and subgenomes (Kim et al. 2020) .

Nevertheless, such nested subgenomes were produced by the discontinuous negativestrand RNA synthesis (i.e., leader-to-body fusion); therefore, the abundance of a subgenome can be estimated from the number of the leader-to-body junction reads ( Figure   2B ). The reason is briefly described as follows. There are two types of transcriptionregulating sequences (TRS): TRS-L is downstream to the leader sequence, and the body TRSs (TRS-B) are upstream to individual ORFs. During the synthesis of the negativestrand RNA, the negative-strand TRS-B may hybridize with the TRS-L in the positivesense genomic RNA, and the synthesis continues using the leader sequence as the template, resulting in a leader-to-body junction. For each subgenome, only the most upstream (5′-) ORF is translated (when leaky scanning of ribosomes is not under consideration). Therefore, the mRNA level of an ORF in a subgenome can be inferred from the number of leader-containing reads. ORF1ab is translated from the genomic RNA. Therefore, although there is no leader-to-body junction, leader-containing reads can still be used to estimate its mRNA level.

The RNA sequencing reads of coronavirus-infected cell lines were aligned to the references with STAR 2.7.3a (Dobin et al. 2013) The TRS-L of SARS-CoV-2 is the 70 th -75 th nucleotides in its genome, and that of MERS-CoV is the 62 th -67 th nucleotides. Therefore, only the reads with the 5′-junction site located between the 55 th -85 th nucleotides of the SARS-Cov-2 genome (or between 45 th -80 th nucleotides of the MERS-CoV genome) were counted as they represented subgenomes generated by canonical leader-to-body fusions. A leader-to-body junction read was assigned to the subgenome according to the ORF sequence immediately downstream of its 3′-junction site; a leader-to-body junction read will be discarded if its 3′-junction site is beyond 50 nucleotides upstream of the start codon of any ORFs. The mRNA level of ORF1ab was calculated from the number of the leader-containing reads aligned to the genome RNA without gaps. ORF10 in SARS-CoV-2 was excluded in this study because it does not have a TRS-B, nor was its subgenome detected (Kim et al. 2020 ). Zhang et al.

provided three replicate RNA-seq data sets for MERS-CoV infected Calu-3 cells . Considering that the ranks of mRNA level among ORFs were consistent in the three replicates, replicate #3 was used in this study.

The expression level of an ORF was estimated from the abundance of sequencing reads mapped to it. The RNA sequencing data were aligned to the corresponding references with STAR using default parameters.

The expression level of an ORF was given by RSEM (Li and Dewey 2011) . It is worth noting that the Influenza A virus subtype H1N1 strain used for estimating expression level (A/Puerto Rico/8/1934) is not directly related to the virus variants used for estimating evolutionary rate, the viruses from the 2009 pandemic. Nevertheless, we surmised that the relative expression levels of ORFs unlikely changed dramatically among Influenza A virus variants since a stoichiometric balance among viral proteins has to be largely maintained to ensure the correct packaging of the viral particle. Albarino et al. (2018) provided three RNA-seq data sets at different time points after Huh7 cells were infected by Ebola viruses; the average expression level at three time points was used for each ORF. ORFs were retrieved from a previous study that performed RNA sequencing experiments on human cytomegalovirus-infected CD34+ hematopoietic stem cells (Cheng et al. 2017 ).

The RNA sequencing data of cells harvested at two days after infection, when the viral ORFs were most actively transcripted, were used in this study.

The expression level of an ORF was retrieved from the band densities from electrophoresis gels estimated in previous studies (Barik 1992; Cattaneo et al. 1987; De et al. 1990; Pagán et al. 2012; Takeuchi et al. 1993 ).

with each other, and their encoded proteins are translated with various mechanisms such as leaky scanning, ribosomal frameshifting, and RNA editing. Three scenarios were considered when we estimate the expression level of overlapping ORFs. First, overlapping ORFs share the same start codon (e.g., ORF1a/ORF1ab in SARS-CoV-2). Second, overlapping ORFs are translated in different frames due to leaky scanning (e.g., ORF3a/ORF3b in SARS-CoV-2). In both scenarios, we assigned the estimated mRNA level to the longest ORF (e.g., ORF1ab or ORF3a) for simplicity; the evolutionary rate was also estimated from this ORF. In the third scenario, for instance P/V/C in measles virus and in human parainfluenza virus type 3, there are overlapping ORFs sharing the same start codon (P and V) as well as overlapping ORFs in different translation frames (C and P) in the same genomic region; we split the total estimated mRNA level to individual ORFs according to the reported relative protein abundance (ViralZone, https://viralzone.expasy.org/857, retrieved on June 18, 2020).

The RNA sequencing data of SARS-CoV-2 infected Vero cells were aligned to the genome of C. sabaeus with STAR using default parameters. Similarly, the RNA sequencing data for human peripheral blood mononuclear cells were aligned to the human genome with STAR using default parameters.

Each read was assigned to its aligned gene under the parameter --quantMode TranscriptomeSAM. Considering that multiple splicing isoforms may exist for a gene, the abundance of each isoform was estimated, and the total abundance of all isoforms of a gene was used as its expression level.

The RNA sequencing data for human peripheral blood mononuclear cells were aligned to the human genome with STAR using the parameters as follows: --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 30. The annotation file of HERV ORFs were downloaded from gEVE on August 5, 2020 (genomebased endogenous viral element database, http://geve.med.u-tokai.ac.jp) (Nakagawa and Takahashi 2016) . Noted that some HERV ORFs were annotated as multiple truncated regions. For each annotated ORF or ORF region, the expression level was defined as the ratio between the aligned reads number, which was estimated with Telescope 1.0.3 (Bendall et al. 2019) with the default parameters, and the length of ORF or ORF region.

Since annotated regions of the same HERV ORF can have different estimated expression levels, we used the average expression level of all expressed annotated ORF regions of the same ORF to represent the expression level of this ORF.

The abundance of proteins encoded by the SARS-CoV-2 genome was estimated from the number of peptide-spectrum matches (PSMs) of a protein, normalized by the number of its theoretical peptides. The number of PSMs was retrieved from a previous proteomic analysis on SARS-CoV-2 infected Vero cells using tandem mass spectrometry (Davidson et al. 2020) . The theoretical peptides of proteins digested by trypsin were in silico generated using in-house python script, in which the maximum number of missed arginine or lysine during digestion was set to two. Only the peptide with a length between 7-140 amino acids was counted. The estimated protein abundances of the nine ORFs of SARS-CoV-2 were highly correlated with the corresponding mRNA abundances (r = 0.99, P = 6×10 -7 , Pearson's correlation, 

The protein evolutionary rate of ten virus species examined in this study was inferred from the average pairwise difference among variants. Specifically, for each virus species, the reference sequence of each ORF was aligned against the genomes of individual virus variants by EMBOSS needle (Rice et al. 2000) . The aligned sequences (without gaps) were in silico translated to protein sequences. The number of different amino acids was counted for each variant pair, and the average was estimated from all possible variant pairs, which was further normalized by the length of the peptide sequence.

The rate of SARS-CoV-2 protein sequence during the evolution in animal hosts was estimated from the d N /d S ratio among SARS-CoV-2 and three related coronaviruses were similarly estimated using their respective sister species, SARS-CoV (GenBank:

NC_004718.3) and chimpanzee cytomegalovirus (GenBank: NC_003521).

We estimated the evolutionary rate of proteins encoded in the genome of Vero cells from the protein divergence between C. sabaeus and M. mulatta. For each gene, the fractions of identical amino acids in the sequence alignment between the two species, "C.

sabaeus % ID" and "M. mulatta % ID", were retrieved from Ensembl (release 99). The homologous gene pair was filtered if the two fractions differ by greater than 5%. The mean of the two factions was estimated and its difference from 100% was used to infer the protein evolutionary rate. The evolutionary rate of human proteins was similarly estimated as the average fraction of identical amino acids in the sequence alignment between the human and the chimpanzee, "H. sapiens % ID" and "P. troglodytes % ID".

The endogenous retrovirus protein sequences in humans and chimpanzees were downloaded from gEVE. A humanchimpanzee orthologous endogenous retrovirus pair for an annotated ORF region was identified following three criteria: 1) The orthologous pair presents the syntenic reciprocal best BLAST hits; 2) the identity of the aligned sequence was greater than 85%; and 3) the number of aligned amino acids was no less than the 85% of full protein length in either annotated region. The evolutionary rate of each annotated ORF region was estimated from the protein divergence between human and chimpanzee, defined as the number of varied amino acids divided by the average protein length of the orthologous pair. Similar to the estimation of expression levels, we used the average evolutionary rate of all annotated ORF regions to represent the evolutionary rate of the ORF when multiple regions are annotated for the same ORF.

All scripts used to analyze the data and to generate the figures are available at https://github.com/ChangshuoWei/E-R_anticorrelation-associated-hypotheses.

We acknowledge the authors and laboratories for generating and submitting the sequences to GISAID Database on which this research is based. The list is detailed in Tables S3 and   S4 . We thank Dr. (B) The E-R anticorrelation has been observed in all three domains of cellular organisms.

(C) Viruses can be used to test the relative importance of the two hypotheses explaining the E-R anticorrelation. If the E-R anticorrelation remains strong among viral ORFs, the cytotoxicity avoidance hypothesis is not supported (left), because the avoidance of cytotoxicity appears to be unnecessary for creating an E-R anticorrelation. In contrast, if the E-R anticorrelation is missing in viruses, the function maintenance hypothesis is not supported (right), because the maintenance of protein function per se is insufficient to create an E-R anticorrelation.

(A) A schematic shows how the protein evolutionary rate is estimated for Vero cell and SARS-CoV-2. The red stars on the right panel denote amino acid alternations.

(B) A schematic shows how expression level is estimated for Vero cell and SARS-CoV-2.

On the left penal, the expression level for each Vero cell gene is estimated as the normalized reads abundance aligned to it. On the right penal, the genome and two out of (J) A meta-analysis of E-R correlation for ten virus species. Spearman's correlation coefficient for individual viruses is shown. The combined correlation coefficient is given by meta-analysis of correlation coefficients (metacor function in the "meta" package of R, which combines correlation coefficients from ten individual viruses into one pooled correlation estimate). Here, the random-effects model with Sidik-Jonkman estimator is used for the estimation of between-study heterogeneity τ 2 , and z-transformed correlations are used for the meta-analysis. The results of the fixed-effect model is presented because the test of heterogeneity gave a P value = 0.16 (I 2 = 21.5%). The effective sample size N is also shown for the meta-analysis. has been added to all expression data prior to applying the logarithm transform, to avoid the logarithm of zero.

(D) Similar to Figure 5J , showing a meta-analysis of E-R correlation for three groups of viruses in Figure 6A -C. The results of the fixed-effect model is presented because the test of heterogeneity gave a P value = 0.88 (I 2 = 0.0%). The effective sample size N is also shown for the meta-analysis.

(A) The selection against cytotoxicity is restored when a viral genome is integrated into a host genome.

(B) The syntenic reciprocal best BLAST hits endogenous retrovirus pairs were identified as human-chimpanzee orthologous pairs.

(C) A schematic shows the estimation of the protein evolutionary rate for each HERV ORF.

(D) The E-R correlation in HERVs (ρ = -0.20, P = 1.1×10 -13 , N = 1316, Spearman's correlation). The expression levels of HERV ORFs are estimated from RNA-seq data for human peripheral blood mononuclear cells from six healthy individuals. The plot shows the expression level in healthy #1 (replicate #1). Replicates #2-#6 are shown in Figure S3 .

The HERV ORFs are split into six bins according to their expression levels: (-∞, -9.5), [-9 .5, -8.5), [-8.5, -7 .5), [-7.5, -6 .5), [-6.5, -5.5) , and [-5.5, +∞) . The mean (± standard errors) of the evolutionary rate is shown for each bin. Spearman's correlation coefficient is calculated from the unbinned data.

(E) The E-R anticorrelation among HERVs (indicated by the red arrow) was not significantly different from that among human non-virus protein-coding genes (ρ = -0.21, P = 6.2×10 -182 , N = 17,410). The one-tailed P value (0.29) was calculated from a subsampling test. The histogram shows the distributions of 10,000 E-R correlation coefficients estimated from individual subset of 1316 genes. Subsampling of 1316 genes expressed in human blood 10,000 times

Transcriptional analysis of viral mRNAs reveals common transcription patterns in cells infected by five different filoviruses

Developmental expression of HERV-R (ERV3) and HERV-K in human tissue

Kinetic analysis of a complete poxvirus transcriptome reveals an immediate-early class of genes

Transcription of human respiratory syncytial virus genome RNA in vitro: requirement of cellular factor(s)

Long-term reinfection of the human genome by endogenous retroviruses

Telescope: Characterization of the retrotranscriptome by accurate estimation of transposable element expression

Structural and kinetic analysis of proteinaggregate strains in vivo using binary epitope mapping

An Overexpression Experiment Does Not Support the Hypothesis That Avoidance of Toxicity Determines the Rate of Protein Evolution

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19

Proteomics of SARS-CoV-2-infected host cells reveals therapy targets

Characterization of an efficient coronavirus ribosomal frameshifting signal: requirement for an RNA pseudoknot

Altered transcription of a defective measles virus genome derived from a diseased human brain

Dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection

Overdosage of Balanced Protein Complexes Reduces Proliferation Rate in Aneuploid Cells

Transcriptome-wide characterization of human cytomegalovirus in natural infection and experimental latency

Expression level, evolutionary rate, and the cost of expression

Protein Misfolding, Amyloid Formation, and Human Disease: A Summary of Progress Over the Last Decade

Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data

Widespread protein aggregation as an inherent part of aging in C. elegans

Characterisation of the transcriptome and proteome of SARS-CoV-2 reveals a cell passage induced in-frame deletion of the furin-like cleavage site from the spike glycoprotein

Characterization of an in vitro system for the synthesis of mRNA from human parainfluenza virus type 3

STAR: ultrafast universal RNA-seq aligner

Why highly expressed proteins evolve slowly

Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution

Causes of evolutionary rate variation among protein sites

MUSCLE: a multiple sequence alignment method with reduced time and space complexity

Influenza A Virus and Influenza B Virus Can Induce Apoptosis via Intrinsic or Extrinsic Pathways and Also via NF-kappaB in a Time and Dose Dependent Manner

N-glycoproteins exhibit a positive expression level-evolutionary rate correlation

Secreted Proteins Defy the Expression Level-Evolutionary Rate Anticorrelation

Chapter 2 -Protein Dynamics as Reported by NMR

The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution

Endogenous retroviruses in the human genome sequence

Vector-borne transmission and evolution of Zika virus

Virus Variation Resource -improved response to emergent viral outbreaks

Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats

The Architecture of SARS-CoV

Evolutionary rate at the molecular level

Constraints and plasticity in genome and molecular-phenome evolution

MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms

Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins

Cellular crowding imposes global constraints on the chemistry and evolution of proteomes

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes

The evolution and genetics of virus host shifts

gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes

Level of gene expression is a major determinant of protein evolution in the viral order Mononegavirales

Highly expressed genes in yeast evolve slowly

An integrated view of protein evolution

Protein Stability and Avoidance of Toxic Misfolding Do Not Explain the Sequence Constraints of Highly Expressed Proteins

Phylogenetic evidence for deleterious mutation load in RNA viruses and its contribution to viral evolution

Protein Melting Temperature Cannot Fully Assess Whether Protein Folding Free Energy Underlies the Universal Abundance-Evolutionary Rate Correlation Seen in Proteins

Biological characteristics and viral susceptibility of an African green monkey kidney cell line (Vero)

EMBOSS: the European Molecular Biology Open Software Suite

The quest for the universals of protein evolution

Comprehensive analysis of human endogenous retrovirus transcriptional activity in human tissues with a retrovirus-specific microarray

Genomic Diversity of Severe Acute Respiratory Syndrome-Coronavirus 2 in Patients With Coronavirus Disease

GISAID: Global initiative on sharing all influenza data -from vision to reality

Natural selection on the influenza virus genome

In vitro transcription and replication of the mumps virus genome

On the origin and continuing evolution of SARS-CoV-2

ERVmap analysis reveals genome-wide transcription of human endogenous retroviruses

The relationship between the misfolding avoidance hypothesis and protein evolutionary rates in the light of empirical evidence

Intrinsic protein disorder and interaction promiscuity are widely associated with dosage sensitivity

Viral subversion of the host protein synthesis machinery

A novel coronavirus outbreak of global health concern

SNPs, protein structure, and disease

Biochemical evolution

Zhang YZ 2020. A new coronavirus associated with human respiratory disease in China

Integrating multiple genomic data to predict disease-causing nonsynonymous single nucleotide variants in exome sequencing studies

Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins

Protein misinteraction avoidance causes highly expressed proteins to evolve slowly

Impact of translational error-induced and error-free misfolding on the rate of protein evolution

PAML 4: phylogenetic analysis by maximum likelihood

Constraints imposed by non-functional protein-protein interactions on gene expression and proteome size

Determinants of the rate of protein sequence evolution

Competing endogenous RNA network profiling reveals novel host dependency factors required for MERS-CoV propagation

Disome-seq reveals widespread ribosome collisions that promote cotranslational protein folding

Cis-regulatory mechanisms and biological effects of translation elongation

Evolutionary Divergence and Convergence in Proteins

The authors declare that they have no competing interests.

All data that were used to support the findings of this study are available in the public databases.