key: cord-0035394-bnnw8cpn
authors: Goz, Eli; Zur, Hadas; Tuller, Tamir
title: Hidden Silent Codes in Viral Genomes
date: 2017-06-30
journal: Evolutionary Biology: Self/Nonself Evolution, Species and Complex Traits Evolution, Methods and Concepts
DOI: 10.1007/978-3-319-61569-1_5
sha: 001b7a3388e3c1e712a01e30cdd8dbd3e8535e1e
doc_id: 35394
cord_uid: bnnw8cpn

Viruses are small infectious agents that replicate only inside the living cells of other organisms and comprise approximately 94% of the nucleic acid-containing particles in the oceans. They are believed to play a central role in evolution, are responsible for various human diseases, and have important contributions to biotechnology and nanotechnology. Viruses undergo evolutionary selection for efficient transmission from host to host by exploiting the host’s gene expression machinery (e.g., ribosomes) for the expression of the genes encoded in their genomes. As a result, viral genes tend to be expressed via non-canonical mechanisms that are very rare in living organisms. Many of the gene expression stages and other aspects of the viral life cycle are encoded in the viral transcripts via ‘silent codes’, and are induced by mutations that are synonymous to the viral amino acid content. In a series of studies that included the analyses of dozens of organisms from the three domains of life, it was shown that there are overlapping ‘silent codes’ in the genetic code that are related to all stages of gene expression regulation. The aim of this chapter is to summarize the current knowledge related to the silent codes in viral genomes and the open questions in the field.

Proteins are the principal actors in all intracellular activities. Gene expression is the process by which the information encoded in a gene is used to synthesize the corresponding protein.
The major cellular biophysical stages of gene expression are transcription, splicing (in eukaryotes), mRNA degradation, translation, and protein degradation; each of these stages has several substages (e.g., initiation, elongation, and termination of translation). For many years, researchers regarded the promoter (which mainly determines the transcription initiation rate) as the 'module' that includes almost all the information related to gene expression regulation, while the information related to protein structure is contained in the coding sequence via the genetic code. However, in recent years, it was shown that such modularity is only a rough approximation of reality (Quax et al. 2015; Supek 2016; Sauna and Kimchi-Sarfaty 2013; Fredrick and Ibba 2010; Cannarozzi et al. 2010; Bahir et al. 2009; Gorgoni et al. 2014; Gu et al. 2010; Zafrir and Tuller 2017; Tuller and Zur 2015; Zafrir and Tuller 2015a; Yofe et al. 2014; Diament et al. (in press); Dana and Tuller 2014a; Zur and Tuller 2012; Tuller et al. 2010a, b, 2011a; Zafrir et al. 2016; Goz et al. 2017). Various signals (codes) related to all the stages of gene expression regulation, including its dynamics and amplitude, also appear in the coding sequence (ORF) itself and in the untranslated regions (UTRs), and are involved in biophysical interactions with the other segments of the transcript and with various macromolecules involved in gene expression regulation (Quax et al. 2015; Supek 2016; Sauna and Kimchi-Sarfaty 2013; Fredrick and Ibba 2010; Cannarozzi et al. 2010; Bahir et al. 2009; Gorgoni et al. 2014; Gu et al. 2010; Zafrir and Tuller 2015b, 2017; Tuller and Zur 2015; Yofe et al. 2014; Diament et al. (in press); Dana and Tuller 2014b; Zur and Tuller 2012; Tuller et al. 2010a, b, 2011a; Zafrir et al. 2016; Goz et al. 2017) (see Figs. 1 and 2).
Transcripts also tend to include information affecting additional phenomena such as co-translational protein folding (Thommen et al. 2016; Chaney and Clark 2015) and regulation by the bacterial immune system (Terns and Terns 2011). Notably, many of these 'silent' codes are encoded in the coding regions via the redundancy of the genetic code. A given protein can be encoded by a number of codon combinations that grows exponentially with its length; replacing a codon with a synonymous one can significantly affect the expression of the transcript (Quax et al. 2015; Supek 2016; Sauna and Kimchi-Sarfaty 2013; Fredrick and Ibba 2010; Cannarozzi et al. 2010; Bahir et al. 2009; Gorgoni et al. 2014; Gu et al. 2010; Zur 2015, 2017; Zafrir and Tuller 2015a; Yofe et al. 2014; Diament et al. (in press); Dana and Tuller 2014b; Tuller et al. 2010a, b, 2011a; Tuller 2012, 2013; Zafrir et al. 2016; Goz et al. 2017; Bazzini et al. 2016; Morgunov et al. 2014; Sin et al. 2016). The information related to these codes is considered 'hidden' as it is partially encoded in synonymous/'silent' aspects of the transcript, and is much harder to model than the genetic code. Regulation of gene expression is clearly at the heart of every biological system. Thus, understanding how aspects of this process are encoded in the transcript should have important ramifications for every biomedical discipline (e.g., human health, synthetic biology, molecular evolution, genetics, systems biology, etc.).

Fig. 1 Some of the interactions of the mRNA molecule with the gene expression machinery. The affinities of these interactions are encoded in the UTRs and ORFs of the genes

It is important to emphasize that while there are studies that suggest codon usage bias is related to mutation drift (Bulmer 1991), various lines of evidence have recently demonstrated that codon usage bias directly affects the translation elongation speed.
Specifically, based on direct experimental measurements of ribosome densities (which are related to elongation rates) over the entire transcriptome at single-codon resolution, it was shown that different codons have different elongation rates, which correlate with the corresponding tRNA levels (Dana and Tuller 2014b; Gardin et al. 2014). It was also experimentally shown that changing the codon content of a protein directly affects protein levels (Gustafsson et al. 2004), and thus the organism's fitness.

Fig. 2 Some signals related to gene translation regulation that are encoded in the coding sequence and 5'UTR (see references in the main text)

2 The Importance of Understanding the 'Silent'

Viruses are small infectious agents that replicate only inside the living cells of other organisms. They consist of genetic material (RNA or DNA molecule(s)) and often additional enzymes, enclosed within a protective coat of lipids and proteins. The viral genome contains all the information necessary to initiate and complete a replication cycle within a cell. Viruses can infect all living organisms (fungi, plants, bacteria, mammals, etc.); we eat and breathe billions of virus particles regularly and carry viral genomes as part of our own genetic material. Viruses are by far the most abundant biological entities in the oceans, comprising approximately 94% of the nucleic acid-containing particles (Zimmer 2011). They are believed to play a central role in evolution as they are an important natural means of transferring genes between different species (Zimmer 2011). In addition, viruses are responsible for various human diseases: some of them are common (e.g., common cold, influenza, chickenpox, cold sores), others are severe and fatal (e.g., Ebola, AIDS, avian influenza, and SARS).
Viruses also have important implications for biotechnology (e.g., they are often used as vectors to introduce genes into cells), and even for materials science and nanotechnology (e.g., they can be used as organic nanoparticles) (Fischlechner and Donath 2007). Viral genomes undergo evolutionary selection for efficient transmission from host to host, and for exploiting the gene expression machinery of the host (e.g., ribosomes, various transcription/translation factors, etc.) for efficient synthesis of the encoded proteins and efficient expression of various types of genes (e.g., see Gale et al. 2000). As a result, viral genes tend to be expressed via non-canonical mechanisms that are either specific to viruses or very rare in living organisms (e.g., see Gale et al. 2000; Firth and Brierley 2012; Rohde et al. 1994; Brierley 1995; Lopez-Lastra et al. 2010; Fig. 3). For example, viruses tend to include overlapping ORFs, and they often include a long ORF translated into a single polyprotein that is cleaved post-translationally into a set of mature proteins. Viruses also tend to initiate translation from internal ribosome entry sites (IRES), and not via canonical scanning from the 5' end of the transcript. Furthermore, viral genes frequently contain strong functional mRNA structures related to all stages of their gene expression regulation. Often, the viral genetic material is RNA rather than DNA and can undergo a series of transformations before translation into proteins. Finally, most viral genomes are very compact and include all their gene expression information in a very short genome (typically a few thousand nucleotides). One gene expression aspect common to all viruses is that they must use the ribosomes (and other expression machinery) of their host.
It is important to emphasize that viruses evolve to include non-canonical gene expression mechanisms because these mechanisms contribute to their fitness. Specifically, during viral development the canonical gene expression mechanisms in the cell are often 'shut down' (e.g., due to down-regulation of relevant initiation factors); since viruses bypass these canonical mechanisms (via non-canonical mechanisms, e.g., IRES), they can successfully exploit the intracellular gene expression resources (e.g., ribosomes and tRNAs). Some of these non-canonical mechanisms (e.g., overlapping ORFs) enable a more efficient (in terms of energy) production of viruses and decrease the probability of deleterious mutations (Holmes 2009). Among other things, this means that the viral silent gene expression codes are harder to understand, as they are relatively rare and unique (Gale et al. 2000; Firth and Brierley 2012; Rohde et al. 1994; Brierley 1995; Lopez-Lastra et al. 2010; Adrian et al. 2005; Holland 2012). Various studies in recent years have provided statistical evidence that silent aspects of the viral genomes are related to their fitness. Specifically, it was suggested that very basic features, such as mRNA folding, codon decoding times, and the distributions of codon or nucleotide pairs (or other low-order statistics of genomic sequences), may be induced by synonymous mutations and play an important role in controlling the viral life cycle (Bahir et al. 2009; Cuevas et al. 2012; Lobo et al. 2009; Jenkins et al. 2001; Greenbaum et al. 2008; van Hemert et al. 2007; Pride et al. 2006; Cardinale and Duffy 2011; Shackelton et al. 2006; Carbone 2008; Gu et al. 2004; Sau et al. 2005a, 2007; Zhao et al. 2008; Cheng et al. 2012; Lucks et al. 2008; Mueller et al. 2006). In this subsection, we detail the different silent aspects of viral gene expression that have been reported thus far.
The most basic property of the viral coding sequences is the frequencies of the different codons. The tendency to choose specific codons has been shown to affect/regulate intracellular mechanisms (Supek 2016; Sauna and Kimchi-Sarfaty 2013; Tuller and Zur 2015; Novoa et al. 2012): for example, it may affect translation elongation (Dana and Tuller 2014b; Gardin et al. 2014), translation initiation (Zur and Tuller 2013; Kozak 1986), splicing (Zafrir and Tuller 2015b; Chamary and Hurst 2005), mRNA folding (Gu et al. 2010; Zur and Tuller 2012; Tuller et al. 2010), protein folding (Pechmann and Frydman 2013; Kramer et al. 2009), and more; thus, we expect viral codon bias to be under selection pressure. Indeed, many studies have suggested that viral codons may be under selection to improve the viral fitness, for example, via adaptation to the host tRNA pool (or other translation resources) (Bahir et al. 2009; Burns et al. 2006; Tao et al. 2009; Jia et al. 2009; Zhou et al. 2010; Liu et al. 2010, 2011; Das et al. 2006; Cai et al. 2009; Sau et al. 2005b; Wong et al. 2010; Zhong et al. 2007; Zhang et al. 2013; Novella et al. 2004; Michely et al. 2013; Roychoudhury and Mukherjee 2010; Ma et al. 2011; Aragones et al. 2010; Tsai et al. 2007; Su et al. 2009; Bull et al. 2012; Zhao et al. 2005). The adaptation of the viral codon usage bias to the tRNA pool is expected to improve translation efficiency via better allocation of the limited translation resources (e.g., ribosomes and tRNA molecules) (Dana and Tuller 2014b; Rocha 2004; Sharp et al. 2005). Many measures have been developed to study the effects and extent of codon usage bias (Sharp and Li 1987; dos Reis et al. 2004; Sabi et al. 2016; Wright 1990). For example, Bahir et al. (2009) analyzed a large data set of viruses that infect hosts ranging from bacteria to humans.
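As a concrete illustration of what a codon usage bias measure looks like, the following minimal sketch computes the relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons of its amino acid were used equally. It is only one simple measure, chosen here for illustration; the function name `rscu` and the input handling are assumptions of this sketch, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for amino acids that occur in `cds`.

    `rscu` is a hypothetical helper name; `cds` is an in-frame coding
    sequence (DNA or RNA letters). Stop codons are skipped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = defaultdict(list)  # amino acid -> its synonymous codons
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    result = {}
    for aa, synonymous in families.items():
        total = sum(counts[c] for c in synonymous)
        if aa == "*" or total == 0:
            continue
        for c in synonymous:
            # Observed count over the count expected under equal usage.
            result[c] = counts[c] * len(synonymous) / total
    return result


values = rscu("AAAAAAAAG")  # lysine encoded as AAA, AAA, AAG
```

An RSCU of 1 means a codon is used as often as expected under equal synonymous usage; comparing such per-codon values between a virus and its host is one simple way to quantify similarity in codon preferences.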
That study showed that bacteria-infecting viruses are strongly adapted to their specific hosts in terms of codon usage bias but that they differ from other unrelated bacterial hosts. Viruses that infect humans, but not those that infect other mammals or birds, show a strong resemblance to most mammalian and avian hosts in terms of codon preferences. This observation can be partially explained by the following points: (1) There is similarity in codon usage among most mammals (Bahir et al. 2009). (2) The codon usage bias among bacteria is very high (Bahir et al. 2009). (3) Bacteria (and thus probably also their viruses) usually undergo stronger selection for codon usage bias and for various aspects of translation optimality (among other reasons, due to their larger population size) relative to most eukaryotes (dos Reis et al. 2004; dos Reis and Wernisch 2009). (4) Additional explanations may be related to the recent expansion of humans and the coevolution of their viruses, or to the hypothesis that large portions of the human genome are actually of viral origin (Bahir et al. 2009; Kazazian 2004). Pavesi et al. suggested that the fact that viruses undergo selection to include specific codons can help detect new and ancestral viral coding regions (Pavesi et al. 2013). Aragonès et al. suggested that the Hepatitis A virus undergoes various types of adaptations to fine-tune the translation kinetics, among others, via selection on codon usage bias (Aragones et al. 2010). A study by Bull et al. (2012) has shown that when reengineering the major capsid gene of the bacteriophage T7 with varying levels of suboptimal synonymous codons, the fitness of the constructs declines linearly with the number of suboptimal changes. These experiments/analyses suggest a direct relation between codon usage bias and fitness/fitness recovery. Similarly, a related study by Lauring et al.
(2012) compared the wild-type poliovirus to synthetic viruses carrying reengineered capsid sequences with hundreds of synonymous mutations. They found that such mutations are related to the rewiring of the population's mutant network, which reduced its robustness to mutations and attenuated the virus in an animal model of infection. It is important to mention that some of these codon usage bias patterns may be associated with regulatory signals not necessarily directly related to tRNA levels (Gog et al. 2007); alternative or partial explanations for viral codon usage bias are mutational bias, asymmetrical mutational bias in the two DNA strands, temperature, viral replication mechanisms, protein folding, dinucleotide distribution, mRNA folding, and more (Das et al. 2006; Zhang et al. 2011, 2013; Sau and Deb 2009; Adams and Antoniw 2004; Cardinale et al. 2013; Berkhout et al. 2002; Pinto et al. 2007; Cladel et al. 2008; Choi et al. 2005; Zhou et al. 2013; Burns et al. 2009; Liu et al. 2012). The effect on chromatin structure and nucleosome positioning is another potential constraint on the viral codon frequency distribution, as viruses are exposed to histones produced by the host (Eslami-Mossallam et al. 2016; Cohanim and Haran 2009; Babbitt and Schulze 2012). Interestingly, a recent line of studies suggested that the distribution of codon pairs is an important feature under selection in various viruses, which may be used for their attenuation when developing new vaccines (Coleman et al. 2008; Mueller et al. 2008; Martrus et al. 2013). However, this feature is debated: while some researchers believe that it is related directly to the distribution of codon pairs (Coleman et al. 2008; Mueller et al. 2008; Martrus et al. 2013), others have suggested that it is related to the distribution of dinucleotides (Tulloch et al. 2014), which affect RNA folding (see, e.g., Babak et al.
2007), or may be related to the enhanced innate immune responses to viruses with elevated CpG/UpA dinucleotide frequencies rather than the viruses themselves being intrinsically defective (Tulloch et al. 2014; Belalov and Lukashev 2013). These possible explanations still connect viral fitness to silent features of the genome, demonstrating their importance and influence on viral fitness and evolution. Finally, it is important to emphasize that many silent viral codes are localized to specific regions within the genome (Dumans et al. 2004). Intriguingly, a recent study has provided evidence of selection for distinct compositions of synonymous codons in viral genes that are expressed at different stages of the viral life cycle (e.g., early and late viral genes): it was shown that in the bacteriophage lambda, evolution of viral coding regions is driven, among others, by codon 'selection' which is specific to the expression time of the gene during the viral development (e.g., early expressed genes versus late expressed genes). Specifically, during the initial stages of infection the decoding rates in early genes were found to be higher than those in late genes, whereas during the progressive stages the decoding rates in late genes were higher than those in early genes (Fig. 4). This study is important since it is the first to show that the selection for codon usage in the virus is directly related to translation elongation rates. In addition, it was shown for the first time that codon elongation rates change during viral evolution; thus, this is expected to affect the codons 'selected' for each viral gene based on its expression time during the viral development cycle. Currently, due to the absence of experimental measurements, this result has been demonstrated only in one virus (bacteriophage lambda), since to perform such an analysis one needs to infer the codon decoding rates at different time points of the viral development. This can be achieved only via relevant experiments (Ingolia et al.
2009) and data filtering (Dana and Tuller 2014b); the viral genome alone is not enough. Specifically, one type of relevant experiment is Ribo-Seq, which provides large-scale information (over the entire transcriptome) on the probability of seeing a ribosome over each codon in the transcriptome in vivo (Ingolia et al. 2009). These experiments, when performed at different viral development stages, can be used for estimating the decoding rates of different codons in different viral conditions (Liu et al. 2013). Since Ribo-Seq data includes various sources of bias and noise, the data should be analyzed with tools tailored specifically for parameter estimation and bias filtering in Ribo-Seq experiments (Dana and Tuller 2014b; Diament and Tuller 2016). As can be seen in Fig. 4a, to estimate codon decoding rates we do not compute a simple average of the normalized Ribo-Seq footprint count (NFC). The main reasons that a simple average does not work are: (1) Ribo-Seq includes various types of non-trivial biases (e.g., very extreme values in certain positions due to the biochemistry of the protocol) (Tuller 2012, 2014b; Diament and Tuller 2016; Gerashchenko and Gladyshev 2017). (2) Codons upstream of slower codons will have more reads due to traffic jams (Dana and Tuller 2014b). (3) Codons downstream of slower codons will have more reads due to incomplete halting of ribosome movement during the Ribo-Seq experiment (Hussmann et al. 2015). Consequently, the NFC distribution always has a very thick right tail. It was shown via simulations of the Ribo-Seq procedure (Dana and Tuller 2014a, b) that without the aforementioned problems, the NFC distribution is close to normal (it resembles a Gaussian without the thick right tail). Thus, to estimate the nominal decoding rate, we must filter the right tail.
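To make the effect of the right tail concrete, here is a minimal sketch under stated assumptions. The caption of Fig. 4 describes decomposing the NFC distribution into a normal and an exponential component by log-likelihood fitting; the sketch below instead uses a simpler method-of-moments fit, and it assumes that each NFC value is the sum of an independent normal term and an exponential term (an exponentially modified Gaussian). The function names, parameters, and simulated values are all hypothetical and are not taken from the cited studies.

```python
import math
import random


def simulate_nfc(n, mu, sigma, tau, seed=0):
    """Simulated NFC values: normal(mu, sigma) plus exponential(mean tau).

    Purely synthetic data for illustration, not real Ribo-Seq counts.
    """
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau) for _ in range(n)]


def fit_normal_plus_exponential(values):
    """Method-of-moments estimates (mu, sigma, tau) for the assumed model.

    Uses mean = mu + tau, variance = sigma^2 + tau^2 and
    skewness = 2 * tau^3 / (sigma^2 + tau^2)^(3/2).
    This is a stand-in for the log-likelihood fit mentioned in Fig. 4.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    if std == 0.0:
        return mean, 0.0, 0.0
    skew = sum((v - mean) ** 3 for v in values) / n / std ** 3
    skew = min(max(skew, 0.0), 2.0)  # the model only allows skewness in [0, 2]
    tau = std * (skew / 2.0) ** (1.0 / 3.0)
    mu = mean - tau  # mean of the normal component, with the tail removed
    sigma = math.sqrt(max(var - tau ** 2, 0.0))
    return mu, sigma, tau


samples = simulate_nfc(20000, mu=1.0, sigma=0.2, tau=0.5)
naive_mean = sum(samples) / len(samples)
mu_hat, sigma_hat, tau_hat = fit_normal_plus_exponential(samples)
```

In this simulation the naive mean is pulled upward by the exponential tail, whereas `mu_hat` recovers a value close to the mean of the normal component, which is the quantity the text uses as the nominal decoding time.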
It was shown via Ribo-Seq simulations that the suggested filtering estimates the correct decoding times, whereas, for the reasons explained above, the mean of the entire NFC distribution does not correlate with the actual decoding times (Dana and Tuller 2014b). We believe that in the future, similar results will be reported for additional viruses. Various previous studies have suggested that the UTRs of many viruses include important functional structures (Watts et al. 2009; Firth et al. 2011; Brown et al. 1992; Hyde et al. 2014; Abbink and Berkhout 2003). For example, it was demonstrated that extensive structural elements that modulate RNA replication via different conformations appear in the 5′ and 3′ UTRs of Dengue and other flaviviruses. The promoter for Dengue virus RNA synthesis is a large stem-loop structure located at the 5′ end of the genome. This structure specifically interacts with the viral polymerase NS5 and promotes RNA synthesis at the 3′ end of a circularized genome. The circular conformation of the viral genome is mediated by long-range RNA-RNA interactions that span thousands of nucleotides (Fig. 5). As another example, the genomes of human hepatitis C virus (HCV), and the animal pestiviruses responsible for bovine viral diarrhea (BVDV) and hog cholera (HChV), have a conserved (and probably functional) stem-loop structure in the 3' 200 bases of the 5'UTR (Brown et al. 1992). A different study (Hyde et al. 2014) suggested that the pathogenic alphaviruses use secondary structural motifs within the 5'UTR as part of an evasion mechanism by which viruses avoid immune restriction.

Fig. 4 a Schematic description of the ribosome profiling method, generation of the NFC distributions, and estimation of typical decoding rates of codons. Translation of mRNA codons (black circles) by ribosomes (blue shapes) is arrested, then exposed mRNA is digested.
Protected mRNA footprints are then sequenced, mapped onto the genome, and normalized per gene by their mean read count value, resulting in NFC profiles. Then, NFC values of each specific codon type (NFC values of codons of type 'AAA' are demonstrated) are collected from all analyzed genes and presented via a histogram, where the x-axis represents the NFC values and the y-axis represents the fraction of the time (probability) each NFC value appears in the analyzed genes, thus creating the NFC distribution of a codon (e.g., the codon 'AAA' appears with an NFC value of 1 in 1.6% of its occurrences in the analyzed genes). The combined normal/exponential model fitting of the codon NFC distribution is plotted as a curve. The position of the mean NFC value is presented with a vertical line. The NFC distribution can be decomposed into a normal and an exponential component using a log-likelihood fitting. The mean of the normal component is used for computing the Mean of the Typical Decoding Rate (MTDR) of coding regions. b Relative expression levels of each of the lambda phage gene groups (early/late) in Ribo-Seq read count per nucleotide. c Adaptation of translation elongation efficiency in early and late genes to different bacteriophage development stages. Relative translation elongation efficiency coefficient (RTEC) as a function of time from the beginning of the lytic stage (0-20 min); the RTEC is defined in terms of MTDR_E and MTDR_L, the MTDR of early and late genes, respectively. We can see that the RTEC of early genes is higher at the beginning and becomes lower with time (as expected); the first point (t = 0), when there are no measurements of expression, is ignored. d Selection for translation elongation efficiency in bacteriophage coding regions. At each time point, average MTDR values of wild-type early/late genes (vertical bars) were compared to MTDR values of 100 corresponding randomized variants (histograms).
Average wild-type MTDR values of each group are significantly higher (p < 0.05) than expected at random. The late genes were sampled to control for the length factor.

Interestingly, Firth et al. (2011) analyzed the ~150 nt 3′-adjacent to the stop codon (UGA) in Sindbis, Venezuelan equine encephalitis, and related alphaviruses, and in the plant virus genera (Furovirus, Pomovirus, Tobravirus, Pecluvirus and Benyvirus); they found a phylogenetically conserved stem-loop structure. Mutational analysis of the predicted structure demonstrated that the stem-loop increases read-through by up to ten-fold. Thus, this structure has an important function: increasing read-through probability. An interesting question is whether such important functional structures also appear inside the coding regions of viruses. To check this possibility, the strength of the structures within the coding regions of viral genomes can be compared to the ones we 'expect to obtain' under a 'null evolutionary model' that generates viral genomes with similar properties to the original genome (such as encoded proteins, GC content, codon frequencies, and identical distances/alignment scores between the viral strains of the same virus). Two recent studies have performed such analyses (Goz and Tuller 2015, 2016). In these papers (Goz and Tuller 2015, 2016), 1666 genomes of the four Dengue serotypes and the HIV genome were analyzed, using statistical/computational analyses to detect dozens of positions suspected to undergo selection for weak/strong local mRNA folding (probably many of them are related to viral fitness), while controlling for the false discovery rate. An extensive position-specific selection for global and local mRNA structures in these viruses was demonstrated (Goz and Tuller 2015, 2016) (see also Goz et al. 2017). In addition, since robustness to mutations is an important factor that influences viral evolution (especially in the case of RNA viruses) (Lauring et al.
2013), it was particularly interesting to provide evidence related to the robustness of some of these structures to mutations/errors (Goz and Tuller 2016) (Fig. 6). Inference of the HIV RNA structure (Watts et al. 2009) suggested that there is a correlation between high levels of RNA structure and sequences that encode inter-domain loops in HIV proteins. It was shown that RNA structure can affect translation elongation rates (Tuller et al. 2011a; Dana and Tuller 2012); it was also shown that the elongation rates can affect co-translational folding (Zur and Tuller 2016; Yang et al. 2014; Faure et al. 2016). Thus, it is possible that, among others, the RNA structure modulates ribosome elongation to promote native protein folding. It was shown that unstructured RNA regions tend to include splice site acceptors and hypervariable regions. The HIV genome also includes a functional ribosomal gag-pol frameshift stem-loop. These results suggest that the coding regions, and not only the UTRs, of various viruses are populated with local RNA structures that are important for the viral life cycle and fitness.

Fig. 5 Illustration of the functional RNA structures at the UTRs of the Dengue virus. Among others, these structures are related to genome cyclization, which is mediated by long-range RNA-RNA interactions and enables the polymerase to reach the 3′ end of long RNA molecules (adapted from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3187688/)

As we mentioned in the previous section, we expect the viral coding region to include many codes/patterns that are important for the viral fitness and are longer and more complex than the single codon distribution. To show this, we recently performed large-scale analyses of almost all viruses with available genomes, together with their hosts, using a novel method for detecting hidden silent codes (that cannot be explained by codon bias) in the viral genetic material.
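The score underlying this method, the average longest host repetitive score (AHRS) described in the caption of Fig. 7, can be sketched directly from its definition: for each position in the viral coding region, take the length of the longest subsequence starting there that also appears in the host, then average these lengths. The following is a minimal, unoptimized sketch; the function names are hypothetical, the host is assumed to be given as a single string, and the randomized genomes needed to assess significance are not generated here.

```python
def longest_host_match(virus, host, start):
    """Length of the longest substring of `virus` beginning at `start`
    that also occurs somewhere in `host` (0 if even one letter is absent)."""
    length = 0
    # If a substring occurs in the host, so do all of its prefixes,
    # so the match can be extended one letter at a time.
    while start + length < len(virus) and virus[start:start + length + 1] in host:
        length += 1
    return length


def ahrs(virus, host):
    """Average, over all positions of `virus`, of the longest host match.

    Naive implementation intended for short sequences; genome-scale use
    would need an index such as a suffix array (not shown).
    """
    if not virus:
        raise ValueError("viral sequence must be non-empty")
    total = sum(longest_host_match(virus, host, i) for i in range(len(virus)))
    return total / len(virus)


score = ahrs("ACGT", "ACGA")  # per-position matches: 3, 2, 1, 0
```

According to Fig. 7b, a score computed this way is interpreted only relative to the scores of randomized viral genomes that preserve the proteins, codon frequencies, dinucleotide frequencies, and GC content.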
The new statistical measure compares the mean repetitive patterns in the viral and host genomes and identifies signals that are not expected to appear in these genomes based only on the distribution of single codons (Zur and Tuller 2015; Goz and Tuller 2017) (see Fig. 7). Based on this analysis, we were able to detect significant patterns of such codes (repetitive sequences) in a high percentage of the analyzed viruses (33-90% for different groups of viruses classified according to their host) and in 90% of the bacteriophages (Goz and Tuller 2017).

Fig. 6 a Modification of the wild-type secondary structure (left) after introducing a single-point G → U mutation (right); the mutated nucleotides are marked in red; the distance between the wild-type and mutated secondary structures (number of changes in the base-pair connections required to transform one structure into the other) in this example is d = 13. b Prediction of MFE (minimum free folding energy) in local windows (red broken-line square) along the coding sequence (brown): each position i in the sequence was assigned the MFE value predicted in the 150 nt window starting at this position. c Computation of the structure-based mutational robustness (SMR): L is the sequence length; d is the base-pair distance between the secondary structure of the wild-type sequence (S(WT)) and the secondary structure of the mutant (S(MT)), averaged over all single-point mutants at all positions along the sequence (total of 3L mutants). d Evidence that specific regions of HIV structural genes undergo an evolutionary selection for strong folding. Each panel corresponds to wild-type (blue) and mean randomized (green) MFE profiles for one gene. The y-axis corresponds to the MFE (kcal/mol) in the 150 nt genomic window starting at positions specified along the x-axis (nucleotide coordinate given with respect to the start of the coding region); red points: positions with MFE-related p-value < 0.01 (in these positions the wild-type folding is stronger than in 99% of the randomized variants); yellow points: MFE-selected positions (MFE p-value < 0.01 and BH-FDR = 0.01), which span genomic regions that are conjectured to undergo an evolutionary selection for strong folding. We can see clusters of MFE-selected positions in each one of the structural genes (env, gag, pol); in other genes, no evidence of selection for strong folding was found. e Structure-based mutational robustness of RRE. X-axis: variant id (1: wild type; 2-1001: randomized variants, structure preserving and dinucleotide and amino acid preserving). Y-axis: structure-based mutational robustness (SMR). The red line corresponds to the wild-type SMR value. The p-value (portion of randomized variants with a higher robustness than in wild-type) and z-score (number of standard deviations the wild-type SMR is higher than the mean randomized SMR) are specified in red. We can conclude that RRE is significantly more robust than expected at random, and this robustness cannot be explained by the specific secondary structure of the corresponding region, its folding strength and/or other sequence attributes such as composition of dinucleotides and amino acids

Fig. 7 a The statistical approach for evaluating the tendency of a viral coding region to include long subsequences that tend to appear in the host. At each position in the coding region, the length of the longest subsequence starting in this position that also appears in the host is computed. The average longest host repetitive score (AHRS) is the average of all these lengths. b To evaluate the statistical significance of the AHRS in the viral coding sequences, the score was compared to the ones obtained for randomized versions of the viral genomes maintaining the proteins, codon frequencies, dinucleotide frequencies, and GC content.
The figure includes the analysis for the bacteriophage lambda.

It is important to mention that there are some preliminary studies regarding gene expression engineering/modeling and other related aspects (see, e.g., Gorgoni et al. 2014; Tuller et al. 2011a; Sin et al. 2016; Konur et al. 2016; Sanassy et al. 2015; Wu et al. 2016; Schoech and Zabet 2014; Cheng et al. 2016; Pan et al. 2016; Haldane et al. 2014; Raveh et al. 2016; Reuveni et al. 2011), but none dealing with complete viruses. Thus, one open question is related to the development of practical strategies for engineering viruses based on the hidden/silent information. Developing approaches for controlling these codes should enable us to manipulate (e.g., increase or decrease) the expression levels of viral genes, and thus to modulate various viral phenotypes such as replication rates. Based on such an approach, it should be possible to efficiently engineer viruses (Wimmer et al. 2009) for various objectives related to human health, such as the design of live attenuated and killed vaccines (Lauring et al. 2010). Today, almost all the approaches for designing vaccines are based on non-synonymous alterations of the viral genomes, ignoring the largest fraction of the information (i.e., the silent information) encoded in the viral genome. Indeed, some preliminary studies have suggested that modulating simple features, such as codon and codon-pair usage and local mRNA folding, can be used for the development of live attenuated vaccines (Coleman et al. 2008; Goz and Tuller 2015; Nogales et al. 2014). Such an approach can also be generalized to engineer bacteriophages for various objectives, such as 'fighting' pathogenic bacteria that are resistant to antibiotics and engineering the human microbiome. It may also be used to design better oncolytic viruses with improved replication/fitness in cancerous cells but not in healthy ones.

Is it possible that some of the silent codes are related to the immune system?
In this book chapter, we emphasized the relation between sequence patterns in the viral coding sequences and transcripts, and viral fitness, via their effect on gene expression. However, it is possible that some of these patterns are related not only to gene expression, but also to the evolution of the virus for escaping the host immune system. It is important to emphasize that in most of the analyses that we and others have reported (some are mentioned above), the amino acid content of the viral genes was controlled for. Thus, the reported signals cannot trivially be attributed only to the classical mechanisms, such as viral recognition by the host (e.g., by antibodies), as these mechanisms are traditionally believed to be based on interactions between proteins. However, it is very plausible that they are related to alternative known and/or unknown mechanisms. One highly relevant mechanism of this kind in bacteria is clustered regularly interspaced short palindromic repeats (CRISPR; see Fig. 8) (Marraffini 2015; Horvath and Barrangou 2010). This mechanism is based on creating fragments of the viral genome that are transcribed to short RNA molecules (crRNAs); these short RNA molecules match a certain region in the viral genome and 'guide' a protein complex (the Cas-crRNA complex) that cuts the viral DNA in this region and inactivates the virus. Since this mechanism is based on the recognition of short DNA subsequences that should appear in the virus/phage but not in the host, it may trigger evolution of the nucleotide composition of the virus/phage toward similarity to the host. This may result in similar patterns of codons, and in longer sequences that appear in both the phage and the host, explaining some of the results reported here (Goz and Tuller 2017).

Finally, it is important to emphasize that, similarly to viral adaptation to the host, silent features of the coding regions are expected to affect related phenomena such as horizontal gene transfer (HGT).
In this case, a transferred gene is expected to be successfully expressed in a new host if its silent features are compatible (Tuller et al. 2011b; Tuller 2011, 2012; Roller et al. 2013; Medrano-Soto et al. 2004). Thus, many of the results reported here may be generalized to the case of HGT. It is important to emphasize that a central HGT mechanism is transduction, the process in which bacterial DNA is moved from one bacterium to another by a virus/bacteriophage (Soucy et al. 2015). Thus, the reported relations between (1) the host silent codes and (2) the transferred gene silent codes have much overlap: The fact that viral fitness is related to the similarity of its silent aspects/codes to the host should directly improve its ability to transfer genes; it is also directly related to the fact that the silent aspects/codes in the transferred genes are more adapted to the new host, since the virus undergoes evolution to be better adapted to the host. Some preliminary studies of heterologous gene expression have suggested that introducing a foreign gene with a distinct codon distribution to the host results in a decrease in the host's fitness and the gene's protein levels (Gustafsson et al. 2004; Tuller et al. 2011; Welch et al. 2009). Computational models have suggested that this is partially due to the fact that such genes recruit more ribosomes (slower codons result in ribosomes spending more time on the mRNA); consequently, the number of available ribosomes decreases, the global initiation rates of the host genes decrease, and thus the host fitness decreases (Raveh et al. 2016; Tuller et al. 2011b; Tuller 2011), though many additional explanations exist (Tuller 2012; Welch et al. 2009; Angov 2011). However, additional experimental studies should be performed to better understand the effect of the codon bias of a transferred gene on the transferred gene expression and the host fitness.

Fig. 8 The clustered regularly interspaced short palindromic repeat (CRISPR)-Cas system provides adaptive immunity against foreign elements in prokaryotes: Upon viral injection, a small sequence of the viral genome, known as a spacer, is integrated into the CRISPR locus to immunize the host cell. Spacers are transcribed into small RNA guides that direct the cleavage of the viral DNA by Cas nucleases (Horvath and Barrangou 2010)

A novel long distance base-pairing interaction in human immunodeficiency virus type 1 RNA occludes the gag start codon Codon usage bias amongst plant viruses Codon usage: nature's roadmap to expression and folding of proteins Fine-tuning translation kinetics selection as the driving force of codon usage bias in the hepatitis A virus capsid Considerations in the identification of functional RNA structural elements in genomic alignments Codons support the maintenance of intrinsic DNA polymer flexibility over evolutionary timescales Viral adaptation to host: a proteome-based analysis of codon usage and amino acid preferences Codon identity regulates mRNA stability and translation efficiency during the maternal-to-zygotic transition Rationally designed, heterologous S. cerevisiae transcripts expose novel expression determinants Causes and implications of codon usage bias in RNA viruses Rationally designed, heterologous S.
cerevisiae transcripts expose novel expression determinants Codon and amino acid usage in retroviral genomes is consistent with virus-specific nucleotide pressure Ribosomal frameshifting viral RNAs Secondary structure of the 5' nontranslated regions of hepatitis C virus and pestivirus genomic RNAs Slow fitness recovery in a codon-modified viral genome The selection-mutation-drift theory of synonymous codon usage Modulation of poliovirus replicative fitness in HeLa cells by deoptimization of synonymous codon usage in the capsid region Genetic inactivation of poliovirus infectivity by increasing the frequencies of CpG and UpA dinucleotides within and across synonymous capsid region codons Characterization of synonymous codon usage bias in the duck plague virus UL35 gene A role for codon order in translation dynamics Codon bias is a major factor explaining phage evolution in translationally biased hosts Single-stranded genomic architecture constrains optimal codon usage Base composition and translational selection are insufficient to explain codon usage bias in plant viruses Biased codon usage near intron-exon junctions: selection on splicing enhancers, splice-site recognition or something else? 
Roles for synonymous codon usage in protein biogenesis Differential dynamics of the mammalian mRNA and protein expression response to misfolding stress High codon adaptation in citrus tristeza virus to its citrus host An internal RNA element in the P3 cistron of wheat streak mosaic virus revealed by synonymous mutations that affect both movement and replication CRPV genomes with synonymous codon optimizations in the CRPV E7 gene show phenotypic differences in growth and altered immunity upon E7 vaccination The coexistence of the nucleosome positioning code with the genetic code on eukaryotic genomes Virus attenuation by genome-scale changes in codon pair bias The fitness effects of synonymous mutations in DNA and RNA viruses Determinants of translation elongation speed and ribosomal profiling biases in mouse embryonic stem cells Properties and determinants of codon decoding time distributions The effect of tRNA levels on decoding times of mRNA codons Synonymous codon usage in adenoviruses: influence of mutation, selection and protein hydropathy Estimation of ribosome profiling performance and reproducibility at various levels of resolution Three dimensional genomic organization of eukaryotic genes is correlated with their expression and function Estimating translational selection in eukaryotic genomes Solving the riddle of codon usage preferences: a test for translational selection Synonymous genetic polymorphisms within Brazilian human immunodeficiency virus Type 1 subtypes may influence mutational routes to drug resistance Multiplexing genetic and nucleosome positioning codes: a computational approach Role of mRNA structure in the control of protein folding Non-canonical translation in RNA viruses Stimulation of stop codon readthrough: frequent presence of an extended 3' RNA structural element Viruses as building blocks for materials and devices How the sequence of a gene can tune its translation Translational control of viral gene expression in eukaryotes 
Measurement of average decoding rates of the 61 sense codons in vivo Ribonuclease selection for ribosome profiling Controlling translation elongation efficiency: tRNA regulation of ribosome flux on the mRNA Evidence of translation efficiency adaptation of the coding regions of the bacteriophage lambda Patterns of evolution and host gene mimicry in influenza and other RNA viruses A universal trend of reduced mRNA stability near the translation-initiation site in prokaryotes and eukaryotes Codon conservation in the influenza A virus genome defines RNA packaging signals Widespread signatures of local mRNA folding structure selection in four dengue virus serotypes Evidence of a direct evolutionary selection for strong folding and mutational robustness within HIV coding regions Widespread selection for complex patterns of synonymous information in viral coding regions Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales Codon bias and heterologous protein expression Biophysical fitness landscapes for transcription factor binding sites The evolution and emergence of RNA viruses Understanding biases in ribosome profiling experiments reveals signatures of translation dynamics in yeast A viral RNA structural element alters host recognition of nonself RNA Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling Evolution of base composition and codon usage bias in the genus Flavivirus Analysis of synonymous codon usage in the UL24 gene of duck enteritis virus Mobile elements: drivers of genome evolution An integrated in silico simulation and biomatter compilation approach to cellular computation Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes The ribosome as a platform for co-translational processing, folding and targeting of newly synthesized proteins Rationalizing the development of live attenuated virus vaccines Codon usage 
determines the mutational robustness, evolutionary capacity, and virulence of an RNA virus The role of mutational robustness in RNA virus evolution Analysis of synonymous codon usage in porcine reproductive and respiratory syndrome virus The characteristics of the synonymous codon usage in enterovirus 71 virus and the effects of host on the virus in codon usage pattern Patterns and influencing factor of synonymous codon usage in porcine circovirus High-resolution view of bacteriophage lambda gene expression by ribosome profiling Translation initiation of viral mRNAs Virus-host coevolution: common patterns of nucleotide motif usage in Flaviviridae and their hosts Genome landscapes and bacteriophage codon usage CRISPR-Cas immunity in prokaryotes Changes in codon-pair bias of human immunodeficiency virus type 1 have profound effects on virus replication in cell culture The characteristics of the synonymous codon usage in hepatitis B virus and the effects of host on the virus in codon usage pattern Successful lateral transfer requires codon usage compatibility between foreign genes and recipient genomes Evolution of codon usage in the smallest photosynthetic eukaryotes and their giant viruses Optimizing membrane-protein biogenesis through nonoptimal-codon usage Reduction of the rate of poliovirus protein synthesis through large-scale codon deoptimization causes attenuation of viral virulence by lowering specific infectivity Live attenuated influenza virus vaccines by computer-aided rational design virus attenuation by genome-scale changes in codon pair bias Influenza A virus attenuation by codon deoptimization of the NS gene for vaccine development Positive selection of synonymous mutations in vesicular stomatitis virus Speeding with control: codon usage, tRNAs, and ribosomes Online model selection for synthetic gene networks Viral proteins originated de novo by overprinting can be identified by codon usage: application to the "gene nursery" of Deltaretroviruses 
Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding Codon usage and replicative strategies of hepatitis A virus Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses Codon bias as a means to fine-tune gene expression A model for competition for ribosomes in the cell Genome-scale analysis of translation elongation with a ribosome flow model Codon usage bias from tRNA's point of view: redundancy, specialization, and efficient decoding for translation optimization Plant viruses as model systems for the study of non-canonical translation mechanisms in higher plants Environmental shaping of codon usage and functional adaptation across microbial communities A detailed comparative analysis on the overall codon usage pattern in herpesviruses Meta-stochastic simulation of biochemical models for systems and synthetic biology Temperature influences synonymous codon and amino acid usage biases in the phages infecting extremely thermophilic prokaryotes Factors influencing the synonymous codon and amino acid usage bias in AT-rich Pseudomonas aeruginosa phage PhiKZ Synonymous codon usage bias in 16 Staphylococcus aureus phages: implication in phage therapy Studies on synonymous codon and amino acid usage biases in the broad-host range bacteriophage KVP40 Understanding the contribution of synonymous mutations to human disease Facilitated diffusion buffers noise in gene expression Evolutionary basis of codon usage and nucleotide composition bias in vertebrate DNA viruses The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications Variation in the strength of selected codon usage bias among bacteria Quantitative assessment of ribosome drop-off in E. 
coli Horizontal gene transfer: building the web of life Categorizing host-dependent RNA viruses by principal component analysis of their codon usage preferences The code of silence: widespread associations between synonymous codon biases and gene function Analysis of synonymous codon usage in classical swine fever virus CRISPR-based adaptive immune systems Co-translational protein folding: progress and methods Analysis of codon usage bias and base compositional constraints in iridovirus genomes Translation efficiency is determined by both codon bias and folding energy An evolutionarily conserved mechanism for controlling the efficiency of protein translation Codon bias, tRNA pools, and horizontal gene transfer. Mob Genet Elem Tuller T (2012) The effect of codon usage on the success of horizontal gene transfer. In: In lateral gene transfer in evolution Composite effects of gene determinants on the translation speed and density of ribosomes Association between translation efficiency and horizontal gene transfer within microbial communities Multiple roles of the coding sequence 5' end in gene expression regulation RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies Host-related nucleotide composition and codon usage as driving forces in the recent evolution of the Astroviridae Architecture and secondary structure of an entire HIV-1 RNA genome Design parameters to control synthetic gene expression in Escherichia coli Synthetic viruses: a new opportunity to understand and prevent viral disease Codon usage bias and the evolution of influenza A viruses. 
Codon usage biases of influenza virus The 'effective number of codons' used in a gene Multiensemble Markov models of molecular thermodynamics and kinetics Codon-by-codon modulation of translational speed and accuracy via mRNA folding An intronic code for gene expression regulation in S. cerevisiae Selection for nucleotide composition adjacent to intronic splice sites improves splicing efficiency via its effect on pre-mRNA local folding in fungi Nucleotide sequence composition adjacent to intronic splice sites improves splicing efficiency via its effect on pre-mRNA local folding in fungi Unsupervised detection of regulatory gene expression information in different genomic regions enables gene expression ranking Selection for reduced translation costs at the intronic 5' end in fungi Analysis of synonymous codon usage in hepatitis A virus Analysis of synonymous codon usage patterns in torque teno sus virus 1 (TTSuV1) Gene codon composition determines differentiation-dependent expression of a viral capsid gene in keratinocytes in vitro and in vivo Analysis of synonymous codon usage in 11 human bocavirus isolates Mutation pressure shapes codon usage in the GC-Rich genome of foot-and-mouth disease virus Analysis of synonymous codon usage in foot-and-mouth disease virus The effects of the synonymous codon usage and tRNA abundance on protein folding of the 3C protease of foot-and-mouth disease virus Zur H, Tuller T (2012) Strong association between mRNA folding strength and protein abundance in S. cerevisiae Exploiting hidden information interleaved in the redundancy of the genetic code without prior knowledge Predictive biophysical modeling and understanding of the dynamics of mRNA translation and its evolution

Acknowledgements This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. TT is partially supported by the Minerva ARCHES award.