key: cord-0839767-1wi1dhv6 authors: Kustin, Talia; Stern, Adi title: Biased mutation and selection in RNA viruses date: 2020-09-28 journal: Mol Biol Evol DOI: 10.1093/molbev/msaa247 sha: 601b1d77d56ff8caf77ab36402b9ec42501ed739 doc_id: 839767 cord_uid: 1wi1dhv6 RNA viruses are responsible for some of the worst pandemics known to mankind, including outbreaks of Influenza, Ebola, and the recent COVID-19. One major challenge in tackling RNA viruses is the fact they are extremely genetically diverse. Nevertheless, they share common features that include their dependence on host cells for replication, and high mutation rates. We set out to search for shared evolutionary characteristics that may aid in gaining a broader understanding of RNA virus evolution, and constructed a phylogeny-based dataset spanning thousands of sequences from diverse single-stranded RNA viruses of animals. Strikingly, we found that the vast majority of these viruses have a skewed nucleotide composition, manifested as adenine rich (A-rich) coding sequences. In order to test whether A-richness is driven by selection or by biased mutation processes, we harnessed the effects of incomplete purifying selection at the tips of virus phylogenies. Our results revealed consistent mutational biases towards U rather than A in genomes of all viruses. In +ssRNA viruses we found that this bias is compensated by selection against U and selection for A, which leads to A-rich genomes. In -ssRNA viruses the genomic mutational bias towards U on the negative strand manifests as A-rich coding sequences, on the positive strand. We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Genomes of all replicating entities, including viruses and cellular hosts, have been shaped by millions of years of evolution. The rapid progress of genomics in the past few decades has brought about enormous amounts of genomic information, and today there are thousands of genomes of viruses available, which allow studying the processes that govern the evolution of these genomes (Belshaw et al. 2008; Duffy et al. 2008; Pybus and Rambaut 2009) . RNA viruses are an extremely diverse collection of entities, spanning a diverse range of hosts, morphologies, genome organizations, and genetic composition. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. We hence reasoned that we may find common genomic signatures shared by RNA viruses, which in turn may allow us to learn more about the drivers of virus evolution. One example of a process that may affect viral genomes is host editing by cellular enzymes. Two notable examples are ADAR, which promotes A>G mutations, and APOBEC3 (A3), which promotes C>U mutations in single-stranded DNA, manifested as G>A mutations on the coding RNA strand of HIV (Bishop et al. 2004; Samuel 2012) . In principle, A3 promotes hyper-mutated viral genomes, which are unlikely to be capable of replicating, and hence undergo purifying selection. However, there has been extensive debate whether A3 may sometimes operate in a suboptimal manner, leading to genomes that are "viable", i.e., replication competent (Jern et al. 2009; Sadler et al. 2010; Cuevas et al. 2015; Delviks-Frankenberry et al. 2016 ). If indeed A3 or ADAR enzymes lead to viable replicating genomes, we would expect to see footprints of their activity in contemporary virus genomes. Another notable example of a shared common signature across RNA viruses is the depletion of CG and UA dinucleotides across almost all known RNA viruses Greenbaum et al. 2009; Cheng et al. 2013; Tulloch et al. 2014) . This under-representation is shared by viral hosts as well; TA dinucleotides (UA in RNA) are under-represented in most organisms, likely due to RNA-degrading enzymes located in the cytoplasm, and CG are underrepresented in plants and vertebrates, likely due to deamination processes (Burge et al. 1992; ). It appears that cells have evolved mechanisms to detect foreign genetic material bearing high levels of CG dinucleotides: recently it has been shown that the cellular enzyme ZAP restricts HIV genomes bearing RNA with multiple CGs (Takata et al. 2017) , and we and others have shown strong selection against introduction of CG in HIV and RNA viruses in general (Burns et al. 2009; Atkinson et al. 2014; Stern et al. 2017; Theys et al. 2018; Caudill et al. 2020) . We thus set out to test if there are additional shared genomic and evolutionary features in RNA viruses and compiled a large dataset of sequences from pathogenic single strand RNA viruses from Baltimore classes IV and V (+ssRNA and -ssRNA viruses, respectively). We focus on these classes in order to (a) avoid the confounding effects of double stranded viruses such as stableness of double stranded DNA/RNA, and (b) avoid reverse-transcribing viruses, whose replication cycle is unique compared to other RNA viruses, and includes a DNA stage. One of the key challenges in our study was to disentangle the roles of mutation and selection. Any extant sequence is a product of evolution from an ancestral sequence, and this process includes the action of both mutation and natural selection, occurring repetitively. Indeed, in the examples above we see that either increased introductions of mutations (via A3 enzymes or ADAR) or selection (mediated by ZAP restriction of CG rich sequences) may both lead to unique genomic signatures. To disentangle the effects of mutation versus selection, we harness the notion of incomplete purifying selection operating on viral genomes, whereby selection is relaxed at the tips of phylogeny (Fitch et al. 1997; Pybus et al. 2007; Strelkowa and Lassig 2012; Gire et al. 2014) . By contrasting between rates of substitutions at internal versus external branches of phylogenies we were able to test for the presence of mutational biases (i.e., mutations that are biased towards specific nucleotides) or for selection for specific types of mutations. Overall, our results suggest a consistent selective advantage for the abundance of the A nucleotide across almost all vertebrate RNA viral genomes. We first generated an extensive dataset of ~65,900 full coding sequences from pathogenic single strand RNA viruses, which we name PhyVirus (Fig. 1, Table S1 ). Hosts include a wide array of animals, spanning from arachnids, to birds, to fish, to mammals. As expected, the dataset contained a disproportionate number of human viruses; yet reassuringly, hundreds to thousands of sequences were available from other phylogenetic clades. We implemented an automated process to generate multiple sequence alignments and their corresponding phylogenetic trees (Materials and Methods). We focused only on alignments of coding sequences (rather than longer genomic alignments) so as to mitigate as much as possible the effects of recombination. We further iteratively ensured that phylogenies are limited to sequences with a high degree of homology, by focusing only on phylogenies where any given branch length is smaller than 0.5 substitutions/site (Materials and Methods). Finally we also ensured that phylogenies were not dramatically affected by mutational saturation (Fig. S1 ). We have made the PhyVirus dataset available online at https://www.sternadi.com/phyvirus, where all alignments, phylogenies, as well as metadata files are accessible to the wide public. We first calculated the fraction of A, C, G, and U in the coding sequences of all viral families. To our surprise we found an over-representation of A across literally all viral families, accompanied by a strong diminution of C ( Fig. 2A) . The fraction of A ranged from ~28% to ~40% in most sequences, reaching a high of 49% in VPg sequences of Rhinovirus. The exceptions were the positive single strand RNA (+ssRNA) families togaviruses, where more C was observed, and Caliciviruses, which were relatively homogenous in nucleotide content. As well, some abundance of U was noted in Coronaviruses and some -ssRNA families and of G in flaviviruses. We next examined the nucleotide composition after breaking down by host classification, to test whether composition was dependent on the host. In general, we did not notice nucleotide composition dependence on the host, with some minor exceptions, mainly in the Picornaviridae family (Fig. 2B ). Finally, when analyzing coding sequences of double strand DNA and RNA viruses, RNA bacteriophages, or coding sequences of hosts, we did not find any consistent preference for A ( Fig. S2A -C), and we note that mixed evidence exists regarding A-richness in retroviruses (van Hemert and Berkhout 1995). We next went on to examine the nucleotide composition across the three codon positions. This analysis revealed an interesting and consistent pattern: first codon positions were found to be enriched for A and G, second codon positions were found to be enriched for A and U, whereas the third codon positions were enriched for A and U in -ssRNA viruses, and for U only in +ssRNA viruses (Fig. 2D ). Once again this was in stark contrast to high GC content at third codon positions of host coding sequences (Kudla et al. 2006) . The pattern at the codon level also led to a particular pattern of amino-acid frequencies, with some differences between the host and viral amino acid frequencies observed (Fig. S3 ). We went on to test if a similarly consistent pattern is found in non-coding regions. In general, RNA viruses are devoid of non-coding sequences, and thus these sequences are quite short. Our analysis revealed a base composition that was quite different from that in the coding regions, with no consistent enrichment for A or any other nucleotide (Fig. S2D ). It seemed that a different set of "rules" apply to the non-coding regions, likely driven by the regulatory roles of non-coding RNA in RNA viruses. Most often these sequences are under strong purifying selection to maintain particular RNA structures (Robertson 1979; Desselberger et al. 1980; Le et al. 1992; Thurner et al. 2004) , and this most likely leads to a base composition that is specific to every virus and its non-coding region. Finally, we examined whether we find longer patterns of biased composition, and focused on the frequencies of dinucleotides. As has been noted previously Greenbaum et al. 2009; Cheng et al. 2013; Tulloch et al. 2014 ), we observed a strong and consistent depletion of CG and to a lesser extent a depletion of UA dinucleotides across all viruses except for togaviruses (Fig. 2C ). Conversely, we saw an enrichment for CA and UG. This may be explained by ADAR editing (UA>UG or reverse complement of UA>CA), although other ADAR editing products were less observed (AG and CU). Alternatively, CA and UG may compensate for the lack of CG and UA, since they are both one transition mutation away (the most frequent mutation that occurs naturally in viruses) from either CG or UA (but see (Di Giallonardo et al. 2017) ). Returning to our observations of A-richness, we observed no enrichment for longer patterns that include A, suggesting that A in itself is the unique factor in the virus coding sequences. Moreover, a phylogenetic analysis of substitution patterns revealed that all three types of to-A substitution (C>A, G>A, and U>A) are high (Fig. S4 ). The hallmark of incomplete purifying selection is higher dN/dS at external nodes (tips) as compared to internal nodes (Zhang et al. 2005 ). We thus first tested for the presence of incomplete purifying selection across all our datasets, by contrasting the rate of nonsynonymous to synonymous (dN/dS) substitutions between the internal branches and the external branches (Materials and Methods). Notably, external branch lengths may dramatically differ, and depend heavily on density of sampling. We thus tested various branch length cutoffs to define inclusion or exclusion of a branch as internal or external. We found a higher dN/dS ratio at the tips in 64% of the alignments in our dataset. However only 42% of the datasets displayed significant support using a likelihood ratio test (P< 0.05, after false discovery correction (Benjamini and Hochberg 1995) ) for the two-rate model that allows for different dN/dS ratios at different branches (here, internal versus external branches). In general, datasets that did not display significant support often contained fewer or less divergent sequences, or had longer branch lengths (less dense sampling). We conclude that incomplete purifying selection is probably pervasive, but significant support requires more data and denser sampling. Importantly, the alignments that did pass significance testing were a faithful taxonomical representation of our full dataset (Fig. S5 ) and were not significantly different in terms of estimated age (Fig. S1 ). We continued our analysis only with the alignments where we observed significant support for incomplete purifying selection. We next set out to contrast between the rate of to-A substitution at the internal branches and the external branches. Modeling of directional selection for A resulted in incorrect inference ( Fig. S6) , and to overcome this we performed mutational mapping along the phylogeny (Materials and Methods). We first inferred the extent to which mutation rates are biased in our datasets by focusing only on substitutions at the external branches, where selection is relaxed. We found that +ssRNA viruses display a strong mutational bias towards U, in strong contrast to the overall A-rich genome they present, but in line with their content at third codon positions (Fig. 3C ). -ssRNA viruses maintain a mutational bias towards A in their coding sequences (with the exception of hantaviruses), which is effectively a bias towards U on their genomic strand. Interestingly, our data contains two families with an ambisense coding strategy, Arenaviridae and Phenuiviridae, in which proteins are encoded from both the positive and negative strands. By default, these families are characterized as -ssRNA viruses, since only one strand (the negative one) is packed. We tested whether the differences in mutational biases between -ssRNA and +ssRNA viruses hold when separating the coding sequences of the ambisense viruses based on the strand they reside on. We observed that regardless of the strand, all coding sequences of ambisense viruses are A-rich (Fig. S8A) . However, the mutational bias was different based on the strand of the coding sequence: coding sequences on positive strands displayed a mutational bias towards U, whereas coding sequences on negative strands displayed a mutational bias towards A (Fig. S8B) . Rates to each nucleotide are normalized so as to sum up to one (Materials and Methods). See Fig. S7 for comparison between internal and external biases. (D) Families that displayed a significant association between branch location and type of substitution based on a Hansel-Mantel test (Materials and Methods). (E) Violin plots of inferred odds ratios of to-X/to-Y (where X stands for a given nucleotide and Y stands for any other nucleotide apart from X) for each nucleotide at internal versus external branches across all viral families. Panels (C) to (E) show only virus phylogenies where incomplete purifying selection was observed to be significantly supported. We were intrigued by the finding of the mutational bias towards U and sought to test if this phenomenon presents itself in the recent COVID-19 epidemic caused by the SARS-CoV-2 virus, a +ssRNA from the Coronaviridae family. Uniquely, SARS-CoV-2 evolution should reflect short term evolution since the virus has been spreading for merely a few months, and hence observed diversity reflects for the most part mutational biases rather than selection. Extensive sequencing of the virus around the globe allowed us to analyze mutational patterns that show a strong abundance of substitutions towards U (Fig. 4 ) (see also (Simmonds 2020) ), further supported by within-host diversity analyses (Fig. S9 ). We note that sequencing errors (and in particular deamination/oxidation) may lead to an increase in C>U and G>U, however this should rarely affect the consensus sequence of a virus, which is typically based on dozens to hundreds of sequencing reads (see also Fig. S9 ; Materials and Methods). All in all, the SARS-CoV-2 sequences support our observation of to-U mutational bias in viral genomes. We discuss the finding of mutations towards U in RNA viruses more in depth below. We next created contingency tables of inferred to-X/to-Y (where X stands for a given nucleotide and Y stands for any other nucleotide apart from X) substitutions at internal branches/external branches, allowing us to test for an association between branch location and direction of substitution (Fig. 3B) . Our results showed a highly consistent pattern across almost all +ssRNA viruses, supporting selection for A and selection against U (Fig. 3D-E) . In -ssRNA viruses the pattern was mixed, and we did not see any consistent signs of selection for or against any nucleotide. To conclude, our analysis shows (a) a consistent mutational bias towards U in genomes of all viruses, which leads to a mutational bias towards A in coding strands of -ssRNA viruses, and (b) selection for A in most +ssRNA viruses, which presumably compensates for the bias towards U caused by the mutational process. We put forth three possible explanations why A may be selected for in viruses. Second, it is possible that codon usage bias and translational optimization have led to the particular sequence composition observed herein. Accordingly, if codons with more A are associated with more abundant tRNAs, viral genes should be translated more efficiently. Varying codon usage has been reported for many different viruses, yet the underlying reasons for this variation remain obscure (Jenkins and Holmes 2003; Gu et al. 2004; Kryazhimskiy et al. 2008; Wong et al. 2010; Belalov and Lukashev 2013; Cardinale et al. 2013; Tian et al. 2018; Chen et al. 2020) . The breadth of the PhyVirus dataset allows probing codon bias in depth, and we calculated the relative synonymous codon usage (RSCU), the effective number of codons (ENC) (Fig. S10A ) and the tRNA adaptation index (tAI) (Fig. S10B ) (Materials and Methods), focusing on human viruses for the latter. Our correspondence analysis (CA) analysis showed that RSCU differences between viral sequences are not attributed to viral host classification, type of protein, or viral family, as reflected in the silhouette values ( Fig. 5A-C) . Silhouette scores below zero reflect bad clustering, where clusters are embedded within each other, whereas values around zero reflect an almost complete overlap of clusters, suggesting that the clustering variables do not explain differences in RSCU values. When probing which factors are responsible for the first and second components of the CA (23% and 10% of the variability in the data, respectively), we observed a strong correlation of the first axis with the synonymous nucleotide content of the third codon position !" , but very weak to no correlations of both axes with tAI (Fig 5D) , and of tAI with ENC (Fig. 5F ). If codon usage had been driven by selection for enhanced translation of proteins, we would have expected one or more of the following: higher correlation with tAI, low ENC (Fig. S10A ), unique RSCU profile of genes known to be highly expressed in viruses, such as capsid products (Fig 5B) or unique RSCU profile based on host/type of virus (Fig. 5A,C) .We do not observe any of these phenomena, whereas the correlation with !" suggests that other forces drive the codon composition. We conclude that translational optimization is not likely the driver of the sequence composition observed herein, and that the particular codon composition of a viral sequence is likely a by-product of other factors. We continue to a third possible explanation for selection towards A: selection for amino-acids encoded by A-rich codons. While there are many possible reasons leading to selection for specific amino acids, we speculated that the major histocompatibility complex (MHC) class I system may play a role in selection for specific peptides, as it is has been shown to drive the evolution of many vertebrate viruses (Kuntzen et al. 2007; Foll et al. 2014; Carlson et al. 2015; Kløverpris et al. 2015) . To test the effect of MHC on composition of viral genomes, we predicted which peptides derived from virus genomes would be weakly or strongly detected by the MHC system. Remarkably, peptides preferentially displayed by MHC systems were found to be encoded by A/G-poor and C/U-rich sequences (Fig. 6) . In other words, there should be a selective advantage for A/G-rich and C/U poor sequences that would allow escape from the MHC system. While clearly this subject merits further in-depth investigation, selection due to the MHC class I system would explain our results, in particular the selection against U and for A. MHC epitope prediction was run on all translated PhyVirus coding sequences across a variety of MHC alleles, including human, chimp, gorilla, rhesus macaque, bovine, porcine and mouse alleles (Table S2 ). Peptides were classified into peptides that would be strongly detected by the MHC system ('strong'), weakly detected ('weak'), or not at all ('none') (see also Fig. S11 ). We have found that the vast majority of RNA viruses examined herein have skewed nucleotide composition in their coding sequences, with most viruses bearing A-rich and C-poor sequences. This pattern appears to be quite consistent across hosts ranging from fish, to insects, to mammals, with the caveat that the largest number of sequences in our data was from mammalian viruses (including human viruses). The A-rich pattern disappeared when analyzing viruses of bacteria, viral non-coding regions, or coding sequences of hosts ( Fig. S2B-D) . One of the original goals of this analysis was to test whether we observe the presence of signatures of RNA editing by A3 enzymes or ADAR enzymes, or restriction by cellular enzymes such as ZAP in long-term evolution of viruses. Interestingly, we did not find any such evidence for A3, at least not in a widespread manner, found partial evidence for possible ADAR editing, and although we found a decreased CG presence in all viral families excluding Togaviridae we Frequency cannot prove that ZAP is the underlying reason for this phenomenon. In any case, it is not likely that restriction by ZAP would explain the A-richness observed in all single strand viruses. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses. In order to tease apart the roles of selection and mutation, we used the notion of incomplete purifying selection, which allows us to separate between recent and non-recent evolution. Our results revealed that both mutational biases and selection operate in viral genomes. In both +ssRNA and -ssRNA, we observed a mutational bias towards U on the genomic strand of these viruses, which is counteracted by selection against U and towards A in the +ssRNA viruses. We begin by discussing the mutational biases we observed. In the absence of selection, which is what we measure at external branches of trees, any biased introduction of one nucleotide should lead to its pair being introduced at an equal proportion. For example, if we denote # as the probability of erroneously incorporating an A, when this A is reverse-complemented this will lead to # = $ (Fig. S12) . While for some -ssRNA viruses we see a similar mutational bias towards both A and U, this equality in mutational biases does not hold for any of the +ssRNA viruses (Fig. 3C) . Even more intriguingly, we see a mutational bias that differs between negative and positive strands of the ambisense viruses (Fig. S8B ). Yet if we consider the genomic strand only, this bias collapses to a mutational bias that introduces more U on the genomic strand. This suggests that the mutational process is biased towards one of the strands and acts as a non -symmetrical process (Fig. S12 ). One possible explanation for this directional mutation bias may be genomic damage in the form of spontaneous deamination. Interestingly, this is consistent with a model suggesting that DNA damage is a major source of replication errors in humans (Gao et al. 2019) . For single strand RNA viruses, it is possible that the genomic damage will affect packaged genomes more than strands replicated within the cells, leading to this non-symmetrical bias. Another explanation for this mutational bias is host mediated enzymatic deamination of virus genomes that are packed as virions. We note that the genomic mutational bias towards U we find in viruses falls in line with previously published work that shows that mutation is universally biased towards AT in several species, including human and bacteria (Hershberg and Petrov 2010; Lynch 2010) . This suggests that there may be a commonality in the unknown mechanism that creates this bias. Furthermore, in double strand genomes it is nearly impossible to determine the specific mutation that causes the bias observed (to-A, to-T or both) due to the complementary nature of the genome. If the mechanism generating biases is the same in viruses and in cells, based on our analysis we speculate that the bias is towards U mutations rather than towards A. Finally, we turn to examine the underlying reasons for selection towards A. We have proposed three explanations: first, A is a weak RNA binder thus selection to A will promote less RNA secondary structures and will aid viruses in avoiding the host defense mechanisms. Second, translational selection promotes specific codons and causes bias in the nucleotide content and third, there may be selection for amino-acids encoded by codons with A. At this stage, our analyses suggest that avoidance of secondary structure or translational selection are most probably not the sole underlying cause for selection towards A, and we show tentative evidence suggesting that the MHC class I system may drive selection for codons with elevated A. In line with the above, we noted throughout our analysis that flaviviruses were outliers; their sequences were A and G rich, and they are the only +ssRNA viral family where we did not infer selection for A and selection against U (Fig. 3D-E) . When probing the sequences in this family, we noted that the majority of sequences were of vector-borne viruses (e.g., dengue virus, Zika virus, and West Nile virus). In general, vector-borne viruses were rare in other virus families, suggesting that our observations regarding selection for A do not hold for vector-borne viruses. Carefully tying this together with our hypotheses in the previous section, we note that insects lack both an MHC system and an interferon response (Flajnik and Kasahara 2001; Secombes and Zou 2017) which augments the immune response to dsRNA, and this might be the underlying reason for the lack of selection against U and towards A observed in flaviviruses. Alternatively, other characteristics of the life cycle and replication of flaviviruses may be responsible for the absence of selection we observed in these viruses. To conclude, we have found similar patterns of coding sequencing composition across a wide variety of RNA viruses. We found that both mutation towards U and selection for A drive these patterns. In general, we show here that probing viral sequences and phylogenies allows a better understanding of mechanisms that shape the evolution of viruses, and in particular, allows insights into possible footprints of host activity, potentially illuminating the interaction between hosts and viruses. Sequences for the PhyVirus dataset were primarily obtained from NIAID Virus Pathogen Database and Analysis Resource (ViPR) (Pickett et al. 2012) , and were augmented by sequences of Influenza from the NIAID Influenza Research Database (IRD) (Zhang et al. 2017) . The sequences were retrieved as single gene/protein (as opposed to genome segments) during January 2019. Host information was retrieved from ViPR and IRD. Notably, around 9,700 sequences lacked host assignment. We manually sampled a few dozen sequences and checked their host assignment in the associated publication. Most were human viruses but other hosts were present as well. We note that this does not affect any of the analyses in this study which were almost always agnostic with respect to host. Our data contains multiple non-duplicated sequences from the same viral species in order to build a comprehensive evolutionary and phylogenetic history as much as possible. These features of the dataset are summarized at Table S1 . We would like to acknowledge the ViralZone resource (https://viralzone.expasy.org/) (Hulo et al. 2011) for providing comprehensive and accessible information about viral families and genomes. An in-house computational pipeline was used for clustering the PhyVirus dataset into multiple sequence alignments and associated phylogenetic trees. We first used MegaBLAST (Morgulis et al. 2008) to create clusters of homologous sequences, by using each sequence as a query against all sequences of the PhyVirus dataset as a database, using an e-value of 10 %&! . We then aligned sequences using PRANK (Löytynoja 2014) with default settings, and reconstructed phylogenies using the maximum-likelihood method based PhyML (Guindon and Gascuel 2003) , with default settings. We next sought to ensure that phylogenies do not contain sequences that are too remote from each other. To this end we implemented an iterative scheme where we "cut" phylogenies into two or more at branches whose length was larger than 0.5. The 0.5 cutoff was chosen based on manual curation and inspection, and allowed us to avoid grouping together very remotely related sequences. The phylogenies were then rooted using midpoint rooting for analyses that required a rooted tree. Finally, clusters with less than ten sequences were omitted from the analysis. This pipeline resulted in 465 alignments from thirteen viral families that contain overall 65,951 sequences (Fig. 1) . For analyses that required codon-based alignments, we performed codon alignment using PRANK. We manually obtained non-coding sequences for a select number of virus families: Picornavirus alignments were obtained from http://www.virology.wisc.edu/acp/Aligns/seq_align.html (Palmenberg and Sgro 2002; Palmenberg et al. 2009 ), full genome dengue and ebolavirus sequences identifiers were downloaded from ViPR (Pickett et al. 2012) , allowing us to thus obtain complete record and features from NCBI. dsDNA, dsRNA and RNA phages sequences retrieval and processing dsDNA and dsRNA sequences were obtained from VIPR. Bacteriophage sequences were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/) using the Cystoviridae and Leviviridae taxonomy codes. Coding sequences were extracted using genomic coordinates from NCBI. Codon usage data was downloaded from the Codon Usage Database https://www.kazusa.or.jp/codon/ (Nakamura et al. 2000) and filtered for the host classes in our data, focusing on hosts that have more than 50 coding sequences in the Codon Usage Database, overall we have analyzed 41 mammalian species, ranging from mouse, cow, monkey and human. Using the codon usage table, we counted the number of nucleotides and amino acids in coding sequences and their frequencies. Nucleotide frequencies were calculated by viral family, by host and by codon position. We first averaged over each alignment, and then averaged by viral family and codon position. This was done to avoid biasing the calculation when a very large number of sequences were available for a particular gene. Dinucleotide odds ratios '( ( , ∈ { , , , }) were calculated as described previously (Cheng et al. 2013) : , where ' and ( denote the nucleotide frequencies and '( denotes the frequency of the dinucleotide in the sequence. The BASEML program from the PAML package (Yang 2007 ) was used to run the unrestricted non-reversible (UNR) and general time reversible (GTR) models in order to infer the frequencies of substitution between all pairs of nucleotides. Since the UNR model requires a rooted tree, we implemented mid-point rooting on all of the phylogenies. About half of the datasets displayed slight significant support for the UNR over the GTR model, yet results of UNR and GTR were very consistent with respect to the frequencies of substitution (data now shown). We applied a mutation-selection model of direction selection that we have previously developed (Stern et al. 2017) to test for selection for a specific nucleotide. Briefly, the model allows an increase in the substitution rate at a proportion ( ) of sites by rescaling rows and columns of the substitution matrix going to and from the selected nucleotide. The model further accounts for incomplete purifying selection at the external branches of the tree by rescaling the among-site variation distribution (see (Stern et al. 2017 ) for more details). Notably, the original directional model was run on a phylogeny where the root sequence was known. The model was here modified, and assumed a stationary distribution at the root. As commonly practiced, this distribution was inferred based on the distribution of nucleotides at the leaves. Moreover, the original model was agnostic regarding which nucleotide is under selection, and hence assumed that all four nucleotides may be under selection with a probability of /4. We here changed the model to allow for selection for only selected nucleotide (A, C, G, or U) with a probability of . The null model ( = 0) allowed the use of a likelihood ratio test to assess for significant selection towards a specific nucleotide. Results of this analysis revealed supposedly pervasive selection towards all four nucleotides, and we concluded this is erroneous inference that is due to problematic assumptions of the model that assume a same set of substitution rate parameters across all sites (Fig. S6) . In order to assess for incomplete purifying selection we used the branch-site model of the CODEML program from the PAML package (Yang et al. 2000) . As described previously (Pybus et al. 2007) , we compared between a 1-ratio model which assumes one value for all branches of the phylogeny and a 2-ratio model that assumes one value of * for the external branches and another value of + for internal branches. The underlying assumption is that relaxed selection at external branches will lead to an increase in at external branches, i.e., Since tip lengths vary across the datasets and some external branches are very long (suggesting purifying selection may exert its effect on them), we used different branch length cutoffs to determine which external branches are treated as external branches and which external branches are categorized as internal branches when attempting to detect incomplete purifying selection. The cutoffs that we used were 0.05, 0.04, 0.03, 0.02, 0.01, 0.005, 0.001, and no cutoff (i.e., the classic definition of external branch: a branch that had no progeny branches). To determine which datasets display significant support for incomplete purifying selection we first performed a likelihood ratio test between the 1-ratio null model and the 2ratio alternative model. All p-values were corrected for multiple testing using false discovery rate (FDR) (Benjamini and Hochberg 1995) . Only datasets showing significant support under one of the cutoffs described above, and where + < * , were considered as showing evidence of incomplete purifying selection. Ancestral sequence reconstruction was performed using the FastML program under a Jukes and Cantor model for nucleotides and applying joint reconstruction of characters across the phylogenies (Ashkenazy et al. 2012) . This enabled us to map the different substitutions across each phylogenetic tree. We then classified substitutions as external or internal, based on the maximal cutoff value that allowed for detection of incomplete purifying selection. The mutational bias of each nucleotide was calculated based on external substitutions only, by calculating the fraction of substitutions towards ( ∈ { , , , }) divided by the fraction of ancestral nucleotides that are not . For convenience, the biases were then normalized so they sum up to one. To determine if there is selection for or against a specific nucleotide we constructed a 2x2 contingency table for each viral family, with the type of mutation (e.g., to-A and to-C/G/U) at the columns, and type of branch (external/internal) at the rows. Cells then included the number of substitutions for each intersection of categories. We then used the Mantel-Haenszel (MH) test to test for an association between branch type and substitution type. We performed the MH test for each viral family and for each of the four nucleotides, and corrected for multiple testing using FDR (Benjamini and Hochberg 1995) . For the between host analysis SARS-CoV-2 a multiple sequence alignment containing 53,997 sequences was downloaded from GISAID (https://www.gisaid.org/) on July 7th 2020. For each sequence we counted the numbers of each mutation type relatively to the EPI_ISL_402125 sequence (NCBI accession MN908947) from Wuhan, China. We have also reconstructed the most recent ancestor (MRCA) of the SARS-CoV-2 human clade using the bat coronavirus RaTG13 sequence (accession number: MN996532.1) as an outgroup, and the MRCA we obtained was identical to the EPI_ISL_402125 sequence. Each mutation was counted only once, under the assumption that shared mutations were due to shared ancestry. We further discarded positions that have been documented as prone to errors based on the following resource: https://github.com/W-L/ProblematicSites_SARS-CoV2, although we note that retaining or discarding these positions had almost no effect on the results. For the within-host analysis we analyzed deep sequencing data of 212 SARS-CoV-2 samples that we have recently sequenced (Miller et al. 2020) . To mitigate sequencing errors, we considered only positions with coverage above 1000 and mutation frequencies above 5%. We have calculated several measures of codon usage bias. First, relative synonymous codon usage (RSCU) was calculated for each gene as previously described (Sharp and Li 1986) , where each sequence is represented as a 59 dimensional vector. We then performed correspondence analysis (CA) to reduce dimensionality and to detect major trends in codon usage variation among the sequences. In order to asses separation of the sequences on the first two CA axes we calculated the silhouette score (Rousseeuw 1987 ) based on different clustering categories (host classification, protein type for the six main protein types shared among all viral families depicted in Fig. 5 , and viral family). Next, we calculated the effective number of codons (ENC) as previously described (Wright 1990) , where ENC values range from 20 (when only one codon is used per amino acid) to 61 (when all synonymous codons are equally used for each amino acid). We also calculated !" , which is the frequency of GC content at the synonymous third codon position. RSCU, CA, ENC and !" were calculated using the codonW software (J.F.Peden, unpublished, available at http://codonw.sourceforge.net/). Finally, we calculated the tRNA adaptation index (tAI) using the tAI package (https://github.com/mariodosreis/tai) with genomic tRNA information from Homo sapiens that was obtained from GtRNAdb (Chan and Lowe 2016) . To predict peptides that can serve as epitopes for MHC class I recognition we used the NetMHCpan4 program (Jurtz et al. 2017) . We ran the program over all PhyVirus sequences using 249 mammalian alleles from the following organisms: human, chimpanzee, swine, mouse, gorilla, rhesus macaque and bovine (Table S2 ). The prediction was performed for peptide lengths of nine with default parameters. We calculated the nucleotide content for strong and weak binding areas and for areas with no binding prediction. In our calculation we first considered only nucleotides that determine the amino acid type unequivocally (for example for valine we counted G and U only, since the wobble position can be either one of the four nucleotides). A similar analysis was performed considering all three nucleotides that code for these peptides, yielding essentially the same results (Fig. S11) . A t-test was performed to determine if the nucleotide content was significantly different between the MHC binding strengths. Multiple test correction was performed using FDR. The PhyVirus dataset is available online at https://www.sternadi.com/phyvirus, and includes all alignments, phylogenies, as well as metadata files. Raw sequencing data for the Miller et al. 2020 SARS-CoV-2 samples are available in the NCBI SRA database under BioProject ID PRJNA647529. FastML: a web server for probabilistic reconstruction of ancestral sequences The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication Causes and implications of codon usage bias in RNA viruses Pacing a small cage: mutation and RNA viruses Controlling the false discovery rate: a practical and powerful approach to multiple testing APOBEC-mediated editing of viral RNA Over-and under-representation of short oligonucleotides in DNA sequences Genetic inactivation of poliovirus infectivity by increasing the frequencies of CpG and UpA dinucleotides within and across synonymous capsid region codons Base composition and translational selection are insufficient to explain codon usage bias in plant viruses HIV-1 adaptation to HLA: a window into virus-host immune interactions CpG-creating mutations are costly in many human viruses GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes Dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection CpG usage in RNA viruses: data and hypotheses Extremely High Mutation Rate of HIV-1 In Vivo dsRNA sensing during viral infection: lessons from plants, worms, insects, and mammals Minimal Contribution of APOBEC3-Induced G-to-A Hypermutation to HIV-1 Recombination and Genetic Variation The 3' and 5'-terminal sequences of influenza A, B and C virus RNA segments are highly conserved and show partial inverted complementarity Dinucleotide Composition in Animal RNA Viruses Is Shaped More by Virus Family than by Host Species Rates of evolutionary change in viruses: patterns and determinants Differentiating between selection and mutation bias Long term trends in the evolution of H(3) HA1 human influenza type A Comparative genomics of the MHC: Glimpses into the evolution of the adaptive immune system Influenza virus drug resistance: a time-sampled population genetics perspective Overlooked roles of DNA damage and maternal age in generating human germline mutations Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Patterns of oligonucleotide sequences in viral and host cell RNA identify mediators of the host innate immune system Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood Evidence That Mutation Is Universally Biased towards AT in Bacteria A test of neutral molecular evolution based on nucleotide data ViralZone: a knowledge resource to understand virus diversity The extent of codon usage bias in human RNA viruses and its evolutionary origin Likely role of APOBEC3G-mediated G-to-A mutations in HIV-1 evolution and drug resistance NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? Heterogeneity of genomes: measures and values The A-rich RNA sequences of HIV-1 pol are important for the synthesis of viral cDNA Role of HLA Adaptation in HIV Evolution Natural Selection for Nucleotide Usage at Synonymous and Nonsynonymous Sites in Influenza A Virus Genes High guanine and cytosine content increases mRNA levels in mammalian cells Viral Sequence Evolution in Acute Hepatitis C Virus Infection Conserved tertiary structure elements in the 5' untranslated region of human enteroviruses and rhinoviruses Phylogeny-aware alignment with PRANK Rate, molecular spectrum, and consequences of human mutation Adaptive Protein Evolution at the Adh Locus in Drosophila Full genome viral sequences inform patterns of SARS-CoV-2 spread into and within Israel. medRxiv:2020 Database indexing for production MegaBLAST searches Codon usage tabulated from international DNA sequence databases: status for the year 2000 Alignments and comparative profiles of picornavirus genera Sequencing and analyses of all known human rhinovirus genomes reveal structure and evolution ViPR: an open bioinformatics database and analysis resource for virology research Evolutionary analysis of the dynamics of viral infectious disease Phylogenetic evidence for deleterious mutation load in RNA viruses and its contribution to viral evolution 5' and 3' terminal nucleotide sequences of the RNA genome segments of influenza virus Silhouettes: a graphical aid to the interpretation and validation of cluster analysis APOBEC3G contributes to HIV-1 variation through sublethal mutagenesis ADARs: viruses and innate immunity Evolution of Interferons and Interferon Receptors Codon usage in regulatory genes in Escherichia coli does not reflect selection for 'rare' codons Rampant C→U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short-and Long-Term Evolutionary Trajectories The Evolutionary Pathway to Virulence of an RNA Virus Clonal interference in the evolution of influenza CG dinucleotide suppression enables antiviral defence targeting non-self RNA Within-patient mutation frequencies reveal fitness costs of CpG dinucleotides and drastic amino acid changes in HIV Conserved RNA secondary structures in Flaviviridae genomes The adaptation of codon usage of +ssRNA viruses to their hosts RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies. Elife 3:e04531. van der Kuyl AC, Berkhout B. 2012. The biased nucleotide composition of the HIV genome: a constant factor in a highly variable virus The tendency of lentiviral open reading frames to become A-rich: constraints imposed by viral genome organization and cellular tRNA availability Codon usage bias and the evolution of influenza A viruses. Codon Usage Biases of Influenza Virus The 'effective number of codons' used in a gene PAML 4: phylogenetic analysis by maximum likelihood Codon-substitution models for heterogeneous selection pressure at amino acid sites Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level Influenza Research Database: An integrated bioinformatics resource for influenza virus research We thank Rasmus Nielsen, Eran Bacharach, Pleuni Pennings and Tzachi Hagai for stimulating discussions and comments on the manuscript. This work was supported in part by a fellowship to TK from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University, and funding to AS from the Koret-UC Berkeley-Tel Aviv University Initiative in Computational Biology and Bioinformatics and from the Israeli Science Foundation (1333/16).