Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes 1 Human cell-dependent, directional, time-dependent changes in the mono- and 1 oligonucleotide compositions of SARS-CoV-2 genomes 2 3 Yuki Iwasaki1, Takashi Abe2, Toshimichi Ikemura1 4 1. Department of Bioscience, Nagahama Institute of Bio-Science and Technology. 5 Shiga, Japan 6 2. Graduate School of Science and Technology, Niigata University, Niigata, Japan 7 8 Abstract 9 Background 10 When a virus that has grown in a nonhuman host starts an epidemic in the human 11 population, human cells may not provide growth conditions ideal for the virus. 12 Therefore, the invasion of severe acute respiratory syndrome coronavirus-2 (SARS-13 CoV-2), which is usually prevalent in the bat population, into the human population is 14 thought to have necessitated changes in the viral genome for efficient growth in the new 15 environment. In the present study, to understand host-dependent changes in coronavirus 16 genomes, we focused on the mono- and oligonucleotide compositions of SARS-CoV-2 17 genomes and investigated how these compositions changed time-dependently in the 18 human cellular environment. We also compared the oligonucleotide compositions of 19 SARS-CoV-2 and other coronaviruses prevalent in humans or bats to investigate the 20 causes of changes in the host environment. 21 Results 22 Time-series analyses of changes in the nucleotide compositions of SARS-CoV-2 23 genomes revealed a group of mono- and oligonucleotides whose compositions changed 24 in a common direction for all clades, even though viruses belonging to different clades 25 should evolve independently. Interestingly, the compositions of these oligonucleotides 26 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 2 changed towards those of coronaviruses that have been prevalent in humans for a long 27 period and away from those of bat coronaviruses. 28 Conclusions 29 Clade-independent, time-dependent changes are thought to have biological significance 30 and should relate to viral adaptation to a new host environment, providing important 31 clues for understanding viral host adaptation mechanisms. 32 33 Keyword 34 “COVID-19”, “SARS-CoV-2”, “Oligonucleotide composition”, “Time-series analysis”, 35 “Big data”, “Zoonotic virus”, “RNA virus”, “Viral adaptation”, “Coronavirus” 36 37 Background 38 Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), an RNA virus 39 belonging to the betacoronavirus genus, began to spread in the human population in 40 2019. This viral strain is believed to have been originally prevalent in bats and 41 transferred to the human population through intermediate hosts [1]. Viral growth 42 requires a wide variety of host factors (nucleotide pools, proteins, RNA, etc.) and 43 should evade the diverse antiviral mechanisms of host cells (antibodies, killer T cells, 44 interferon, RNA interference, etc.) [2-4]. Since ancestral SARS-CoV-2 strains are 45 thought to be endemic in bats, they should be well adapted to their host environment; 46 when the virus invades the human population, human cells may not provide growth 47 conditions ideal for the virus. For efficient growth and rapid spread of the infection, 48 changes in the viral genome should be required. Analyses of time-dependent changes in 49 SARS-CoV-2 in the human population can be used to characterize how and why viral 50 genomes change to adapt to a new host environment. 51 Due to the great threat of COVID-19 and remarkable development of 52 sequencing technology, a massive number of SARS-CoV-2 genome sequences are 53 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 3 available in databases, even though the epidemic has lasted for approximately 10 54 months. These sequence data have provided a wide range of insights into SARS-CoV-2 55 [5,6]. Phylogenetic methods based on sequence alignment have been widely used in 56 molecular evolution studies [7,8], and these methods are well refined and essential for 57 studying phylogenetic relationships between different viral species and variations in the 58 same viral species at the single-nucleotide level. However, when dealing with a massive 59 number of genome sequences, methods based on sequence alignment become 60 problematic because they require a large amount of computational resources. 61 We have continued to develop sequence alignment-free methods focused on 62 the oligonucleotide compositions of genome sequences [9-12]. Notably, oligonucleotide 63 composition varies widely among species, including viruses, and is designated as 64 genome signatures [13]. These compositions can be treated as numerical data, and a 65 massive amount of sequence data can easily be subjected to various statistical analyses. 66 Furthermore, even genomic fragments without orthologous and/or paralogous pairs can 67 be compared [11,12,14-17]. Specifically, our previous work on influenza A-type virus 68 genomes found that the oligonucleotide compositions of the viral genomes differed 69 between hosts (e.g., humans and birds), even for viruses within the same subtype (e.g., 70 H1N1 and H3N2 of type A) [11,12,14]; we also examined changes in the 71 oligonucleotide compositions of influenza H1N1/09, which have been epidemic in 72 humans beginning in 2009, and found that their compositions changed to approach 73 those of the seasonal flu strains H1N1 and H3N2 [11]. Furthermore, although epidemics 74 of the H1N1 and H3N2 strains began several decades apart, these strains showed highly 75 similar chronological changes from the start of these epidemics. These evolutionary yet 76 reproducible changes suggest that mutations to adapt to a new host environment 77 inevitably accumulate when the host species of a virus changes, and these changes can 78 be efficiently detected by analyzing oligonucleotide compositions. 79 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 4 Several groups, including ours, have examined changes in SARS-CoV-2 80 genomes during the early stages of the SARS-CoV-2 epidemic and found clear 81 directional changes in a group of mono- and oligonucleotides detectable on even a 82 monthly basis [15,18,19]. These directional changes will allow us to predict changes in 83 the near future. Notably, near-future prediction and verification should be the most 84 direct ways to test the reliability of the obtained results, models and ideas (e.g., those 85 discovered for influenza viruses), providing a new paradigm for molecular evolutionary 86 studies. In this context, the present study analyzed the genome sequences of over 87 seventy thousand SARS-CoV-2 strains isolated from December 2019 to September 88 2020. 89 90 91 Results 92 Directional changes in the mononucleotide compositions (%) of SARS-CoV-2 93 For fast-evolving RNA viruses, diversity within the viral population arises rapidly as 94 the epidemic progresses and subpopulation structure forms; the GISAID consortium has 95 defined at least seven main clades (G, GH, GR, L, V, S and Others). Notably, the 96 elementary processes of molecular evolution are based on random mutations, and 97 strains belonging to different clades are thought to have evolved independently. 98 Therefore, the observation of highly similar time-dependent changes independent of 99 clade has certain biological meanings and may be inevitable for efficient growth in 100 human cells. From this perspective, we first examined time-dependent changes in the 101 mononucleotide compositions (%) of SARS-CoV-2 strains isolated from December 102 2019 to September 2020. 103 Among the seven clades (G, GH, GR, L, V, S and Others) reported by the 104 GISAID consortium, we used six clades (G, GH, GR, L, V and S), excluding Others, in 105 the analysis. For the time-series analysis, we calculated the average mononucleotide 106 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 5 compositions (%) of the genomes in each clade collected monthly; in Fig. 1A, the 107 mononucleotide composition of each clade is shown as a colored line, while that for the 108 monthly collected genomes belonging to all clades is shown as a dashed line. 109 Regardless of clade, the composition of C decreased, while that of U increased 110 in a time-dependent manner, but the changes in A and G composition were less clear 111 (Fig. 1A). Correlation coefficients between the mononucleotide composition and month 112 from the start of the epidemic showed a high negative correlation for C and a high 113 positive correlation for U for all clades, but there was no clear directionality for A and 114 G (Fig. 1A and Tables 1, 2). These results indicate that the mononucleotide composition 115 of this virus may be prone to biased mutations that reduce C and increase U or the 116 mutated strains tend to be more favorable for growth in human cells. 117 118 Directional changes in short oligonucleotide compositions 119 Oligonucleotides are known to act as functional motifs, such as binding sites for a wide 120 variety of proteins and target sites for RNA modifications. Therefore, directional 121 changes in some oligonucleotides independent of clade may relate to certain processes 122 for adaptation to the new host environment. Our previous work on influenza A viruses 123 found that their oligonucleotide compositions varied among prevalent hosts [11,12]; 124 notably, although influenza virus isolated from humans tended to prefer A and U (but 125 not G and C) more than viruses isolated from birds, the human viruses showed a 126 preference for GGCG and GGGG, which are G- or C-rich. Importantly, there are 127 various examples of oligonucleotides whose changes in composition cannot be 128 explained by changes in mononucleotide composition alone, and these changes may 129 relate to the molecular mechanisms of viral adaptation to a new host. 130 From this perspective, we next analyzed time-dependent changes in di- and 131 trinucleotide compositions and found that a group of di- and trinucleotides showed a 132 highly positive or negative correlation (Figs. 1B, S1 and Tables 1, 2). Interestingly, a 133 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 6 group of A- or G-rich oligonucleotides, such as GAG and GGA, showed a high positive 134 correlation independent of clade, which was not expected from the changes in 135 mononucleotide compositions alone. To confirm the extent of these changes, we also 136 calculated the fold change in composition for the first isolated month and the last 137 examined month (Fig. 2) and found clear increases and decreases in mono- and 138 oligonucleotide compositions common among the six clades, which supports the result 139 presented in Fig. 1 and Tables 1 and 2. 140 141 Changes towards the sequences of other coronaviruses prevalent in humans 142 In a previous study of SARS-CoV-2 [16], we analyzed mono- and dinucleotide 143 compositions for the first four epidemic months without separating the sequences by 144 clade. Notably, the directional changes shown in Figs. 1 and 2 and Tables 1 and 2 were 145 absolutely consistent with the previous results, even when the six clades were separately 146 analyzed. In the previous study, time-series analysis of ebolavirus at the beginning of 147 the epidemic in West Africa in 2014 also showed directional changes in a group of 148 mono- and dinucleotide compositions, but these directional increases/decreases tended 149 to slow approximately 10 months after the start of the epidemic. The increase/decrease 150 trend for SARS-CoV-2 is far from slowing after 10 months, and the next important 151 questions are how long these directional changes in this virus will last and whether there 152 are possible goals to these changes. 153 To conduct this near-future prediction, the following information concerning 154 influenza viruses should be useful. As mentioned before, mono- and oligonucleotide 155 compositions in influenza H1N1/09 changed towards those of seasonal influenza strains 156 such as the H1N1 and H3N2 subtypes [11]. Furthermore, all the human subtypes 157 showed directional changes away from the compositions of all avian influenza A 158 subtypes and closer to those of the human influenza B type, which has been prevalent 159 only in humans [14]. If we assume that changes similar to those in the influenza virus 160 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 7 will occur, the mono- and oligonucleotide compositions of interest for SARS-CoV-2 are 161 expected to change towards those of other coronaviruses that have been prevalent in 162 humans and away from those of coronaviruses prevalent in bats. To test this hypothesis, 163 we analyzed the following coronaviruses: 238 human-CoV strains (alphacoronaviruses 164 229E and NL63: betacoronaviruses HKU1 and OC43) and 166 bat-CoV strains 165 (alphacoronaviruses and betacoronaviruses, including the SARS virus). 166 As shown in Fig. 3A, we compared the mononucleotide compositions of 167 SARS-CoV-2 with those of the human- and bat-CoV strains; the data for bat SARS 168 among bat-CoV strains, which is thought to be the original strain that caused the current 169 COVID-19 pandemic, are marked in pink. Interestingly, concerning the human- and 170 bat-CoV strains, differences in mononucleotide composition were more pronounced 171 between hosts than between the alpha and beta linages, and the levels for all six clades 172 of SARS-CoV-2 were between those for the two hosts. Fig. 3B shows the results of di- 173 and trinucleotides, for which the directional, time-dependent changes were primarily 174 common among the six clades. The increases and decreases in nucleotide composition 175 observed for SARS-CoV-2 in Figs. 1 and 2 are indicated by hollow up and down 176 arrows, respectively. Interestingly, all changes of interest tended to move away from the 177 compositions of bat SARS and approach those of human-CoV, supporting the view that 178 the directional changes of interest have biological significance and are possibly 179 inevitable, as observed for influenza viruses. Assuming that approaching the levels in 180 human-CoV strains is the hypothetical goal of the directional change of SARS-CoV-2, 181 the current compositions are far from this hypothetical goal (Fig. 3); therefore, we 182 predict that directional changes of interest will continue in the near future. 183 Then, assuming that the average value for all human-CoV strains is a 184 hypothetical goal, we investigated how SARS-CoV-2 has approached this possible goal. 185 Specifically, we calculated the square of the difference between the composition of each 186 nucleotide in SARS-CoV-2 and the average value for human-CoV strains and plotted 187 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 8 the values of the difference according to the elapsed month for each nucleotide. 188 Changes in the compositions of both C and U clearly reduced this difference, as the 189 compositions of these nucleotides approached the hypothetical goal (Fig. 4A); their 190 linear reduction supports the prediction that directional changes in the composition of C 191 and U will continue for the foreseeable future. In contrast, A and G did not show 192 directional changes in composition, which is most likely due to the absence of clear 193 differences in the A and G compositions of human- and bat-CoV, i.e., there is no 194 possible target (Fig. 3A). Fig. 4B shows examples of di- and trinucleotides whose 195 compositions have moved towards the hypothetical goal, but Fig. 4C shows a few 196 exceptional nucleotides whose compositions have not changed towards the hypothetical 197 goal but have changed with a common directionality among the six clades. In Fig. 4D, 198 correlation coefficients between the above difference and the elapsed month are 199 presented. Most nucleotides of interest showed a negative coefficient (i.e., a directional 200 change towards human-CoV), but three oligonucleotides, GG, AGC and CAU, showed 201 positive coefficients indicating an increase in the difference (i.e., moving away from the 202 human-CoV level). For these opposing directional changes, certain causes specific to 203 SARS-CoV-2 may be assumed. 204 205 Motifs for RNA-binding proteins 206 Next, we considered the mechanisms that move oligonucleotide compositions away 207 from those of bat coronaviruses and closer to those of human coronaviruses. Certain 208 human cellular factors involved in viral growth may be candidates in such mechanisms. 209 When considering possible protein factors, oligonucleotides longer than trinucleotides 210 should be a focus. As an attempt, we here focused on host RNA-binding proteins 211 because their binding to hepatitis C virus is known to be involved in the growth of this 212 RNA virus [20]. We thus searched for motifs for human RNA-binding proteins in 213 coronavirus genomes (see Methods section) and found multiple loci with binding motifs 214 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 9 for each protein. Table 3 (and Table S10) lists the motifs for which a directional time-215 dependent change was primarily common among six clades. Table 3 and Fig. 5A show 216 that only ELAVL1 showed a positive correlation, but the other nine proteins in Table 3 217 showed a negative correlation for almost all clades; the results for other motifs are 218 presented in Table S12. 219 We next compared the numbers of these motifs in SARS-CoV-2 with the 220 numbers of human- and bat-CoV motifs (Fig. 5B). Of the ten proteins shown in Table 3, 221 the only elevated motif, that for ELAVL1 binding, was found in a significantly higher 222 number of loci in human-CoV than in bat-CoV, but motifs for PCBP2 and SRSF1 223 binding, which tended to decrease (Table 3), were found in significantly fewer loci in 224 human-CoV. These observations appear to be consistent with the features found in the 225 mono-, di- and trinucleotide compositions of interest. However, unlike these changes, 226 there was significant diversity within even a single clade, which appears to be greater 227 than the differences between hosts, with the possible exception of ELAVL1. In regard 228 to long oligonucleotides, they should carry out a variety of functions, and mutations that 229 accumulate in their functional motifs may have complex effects on the presence of 230 functional motif sequences, so an analysis from a new perspective appears to become 231 important. 232 233 Discussion 234 We first discuss possible molecular mechanisms related to time-dependent directional 235 changes in mononucleotide composition. Fig. 1A shows that the frequency of C tended 236 to decrease in SARS-CoV-2, while that of U tended to increase. Since a similar change 237 was previously found for MERS and all A-type influenza subtypes [12,14], these 238 changes may have biological significance for a wide range of RNA viruses that invade 239 from nonhuman hosts. One possible mechanism is the host RNA-editing function; 240 Simmonds (2020) proposed that the C→U hypermutation in SARS-CoV-2 may be due 241 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 10 to the influence of APOBEC family proteins in humans [19]. APOBEC is an antiviral 242 protein in various animal species, including humans, that can convert C to U by the 243 deacetylation of C [21-23]. Such RNA editing is also known to act as a defense 244 mechanism against various viruses, including retroviruses [24]. The APOBEC gene 245 family has generated various paralogs during mammalian evolution, with seven known 246 APOBEC genes in humans and ten in bat families [25-27]. The prevalence C→U 247 change in SARS-CoV-2 upon transfer of its host environment from bats to humans 248 suggests that these changes may be due to human-specific APOBEC genes. 249 We next discuss changes in short oligonucleotides. Directional changes in 250 some oligonucleotides, such as GAG and GGA, cannot be explained by APOBEC-251 induced C→U mutations alone. Although the evidence is weak, these oligonucleotides 252 are part of the binding motifs of several RNA-binding proteins, such as SRSF1 and 253 PCBP2 (Table S9); the number of loci for these motifs has decreased independently of 254 clade. In contrast, the number of motif loci for only ELAVL1 among the ten proteins 255 listed in Table 3 has increased independently of clade. As an RNA-binding protein that 256 binds A- or U-rich elements, ELAVL1 binding to mRNA is known to contribute to 257 RNA stability [28, 29]; SARS-CoV-2 and human-CoV, which are prevalent in humans, 258 may contain increased binding motifs for ELAVL1 for efficient growth in the human 259 cellular environment. However, for further analysis, information on RNA-binding 260 proteins in bat cells is needed. 261 262 Conclusions 263 In the present study, we found that the compositions of a group of mono- and 264 oligonucleotide in SARS-CoV-2 genomes have changed in a host cell-dependent 265 manner. This is totally consistent to our previous finding for influenza A and B viruses 266 [11,12,14], supporting the previous prediction that the host-dependent directional 267 changes of various mono- and oligonucleotides should inevitably occur in zoonotic 268 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 11 RNA viruses that have invaded from nonhuman hosts. Phylogenetic methods based on 269 sequence alignment [7,8] are well refined and undoubtedly essential for studying the 270 phylogenetic relationships between viruses. The present alignment-free method to 271 analyze mono- and oligonucleotide compositions can also serve as a powerful tool for 272 molecular evolutionary studies of viruses, revealing directional changes in viruses and 273 predicting the possible goals of these changes. 274 275 276 Methods 277 SARS-CoV-2 genome sequences 278 Human SARS-CoV-2 genome sequences were downloaded from the GISAID database 279 (https://www.gisaid.org/); sequences that were complete, showed high coverage and had 280 been isolated from humans were downloaded on Sep 17, 2020. Among the acquired 281 sequences, strains with an unknown isolation month were excluded from the analysis, 282 and the polyA tail was removed. A list of all 72,314 strains used is provided in Table 283 S1. 284 285 Genome sequences of coronaviruses prevalent in humans or bats 286 The complete sequences of two types of human coronavirus (human-CoV) strains, 287 alphacoronaviruses (27 229E and 55 NL63 strains) and betacoronaviruses (18 HKU1 288 and 138 OC43 strains), were obtained from the NCBI virus database 289 (https://www.ncbi.nlm.nih.gov/labs/virus/). The complete genome sequences of two 290 types of bat coronavirus (bat-CoV) strains, alphacoronaviruses (87 strains) and 291 betacoronaviruses (79 strains, including 34 SARS-CoV), isolated from three types of 292 bats (Chiroptera, Vespertilionidae and Rhinolophidae) were obtained from the NCBI 293 virus database (https://www.ncbi.nlm.nih.gov/labs/virus/), and the polyA tail of each 294 sequence was removed. The strains are listed in Table S2. 295 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 12 296 Time-series analysis of changes in oligonucleotide compositions 297 In the time-series analysis, the average mono- and oligonucleotide compositions (%) of 298 viruses collected in each month were calculated for each clade. To avoid statistical 299 fluctuations due to the small sample size, months in which fewer than 10 strains had 300 been collected were excluded from the monthly analysis. 301 302 RNA-binding motif analysis 303 RNA-binding motifs were obtained from the ATtRACT database [30]. In this database, 304 multiple binding motifs are registered as corresponding to one RNA-binding protein; 305 we calculated the total number of loci containing the binding motifs for each protein in 306 the viral genomes. 307 308 309 310 311 List of abbreviations 312 SARS-CoV-2: Severe acute respiratory syndrome coronavirus-2 313 human-CoV: human coronavirus 314 bat-CoV: bat coronavirus 315 316 Ethics approval and consent to participate 317 Not applicable 318 319 Consent for publication 320 Not applicable 321 322 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 13 Availability of data and materials 323 The sequence dataset analyzed in this study are stored in GISAID. Other data are 324 available from YI. 325 326 Competing interests 327 The authors declared that there are no conflicts of interests. 328 329 Funding 330 This work was supported by JSPS KAKENHI Grant Number 18K07151, by AMED 331 under Grant Number JP20he0622033 and by COVID-19 Counterplan Research Project 332 (supervised by Prof. Tatsumi Hirata, NIG) from the Research Organization of 333 Information and Systems (ROIS). 334 335 Authors' contributions 336 YI conceived the approach and conducted this analysis. TA developed the algorithm. TI 337 supervised this study. 338 339 Acknowledgements 340 We gratefully acknowledge the authors submitting their sequences from GISAID’s 341 Database and also the valuable comments of Dr. Yashushi Hiromi of National Institute 342 of Genetics (Mishima). We thank Springer Nature Author Services for editing this 343 manuscript for English language. 344 345 Figure legends 346 Fig. 1. Time-dependent directional changes in nucleotide compositions. (A) 347 Average mononucleotide compositions (%) in the SARS-CoV-2 genomes of each clade 348 isolated in each month are plotted against the elapsed month. To compare the four 349 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 14 mononucleotides, the scale widths on the vertical axis are set to the same values. The 350 colored lines distinguishing the clade (G, GH, GR, L, V and S) are shown at the bottom 351 of the figure. The dashed line shows the averaged compositions for all strains isolated in 352 each month. (B) The average di- and trinucleotide compositions that primarily undergo 353 common directional changes among the six clades are plotted against the elapsed 354 month. 355 356 Fig. 2. Fold changes in nucleotide composition between the epidemic start and the 357 last month of analysis. A bar plot shows the fold change in composition of each mono- 358 or oligonucleotide; this value was calculated by dividing the nucleotide composition in 359 the last month of analysis by that at the start of the epidemic. Each bar is colored to 360 indicate the clade, as described in Fig. 1. Since we analyzed strains belonging to 361 different clades separately, data from the first or last month differed among clades; see 362 also the Methods section. 363 364 Fig. 3. Nucleotide compositions of human and bat coronavirus sequences. A 365 boxplot shows the nucleotide compositions in human-CoV (alpha 229E, alpha NL63, 366 beta HKU1 and beta OC43), bat-CoV (bat SARS, alphacoronavirus and 367 betacoronavirus) and SARS-CoV-2 strains. Bat SARS are marked pink. A hollow arrow 368 indicates the direction of change in oligonucleotide composition observed for SARS-369 CoV-2 in Figs. 1 and 2. (A) Mononucleotides. To compare the four mononucleotides, 370 the scale widths on the vertical axis scale are set to the same values. (B) Di- and 371 trinucleotides. 372 373 Fig. 4. Differences in nucleotide composition between SARS-CoV-2 and human-374 CoV. (A) Values for the square of the difference in mononucleotide composition 375 between SARS-CoV-2 isolated in each month and human-CoV are plotted against the 376 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 15 elapsed month. The data are presented as colored or dashed lines, as described in Fig. 1. 377 (B and C) Oligonucleotide compositions that approach and move from those of human-378 CoV are presented, respectively. (D) The correlation coefficients between the elapsed 379 month from the start of the epidemic and the above differences in mono- and 380 oligonucleotides whose directionality of change is common among six clades are 381 presented. The results for A and G mononucleotides, which show nondirectional 382 change, are also presented. 383 384 Fig. 5. Time-dependent changes in the numbers of RNA-binding motif loci. (A) The 385 numbers of loci containing RNA-binding motifs per genome are plotted against the 386 elapsed month. Here, we selected RNA-binding proteins for which the number of motif 387 loci increased or decreased by at least one for all six clades from the epidemic start. The 388 data are presented as colored or dashed lines, as described in Fig. 1A. (B) A boxplot 389 shows the number of loci containing RNA-binding motifs in human-CoV (alpha 229E 390 and NL63: beta HKU1 and OC43), bat-CoV (bat SARS, alphacoronavirus and 391 betacoronavirus) and SARS-CoV-2 strains. Bat SARS are marked pink. A hallow arrow 392 indicates the direction shown in Fig. 5A with which the oligonucleotide compositions of 393 SARS-CoV-2 changed. 394 395 396 Table 1. Correlation coefficients for time-dependent changes in mono- and 397 oligonucleotide compositions in SARS-CoV-2 that have increased. 398 399 Table 2. Correlation coefficients for time-dependent changes in mono- and 400 oligonucleotide compositions in SARS-CoV-2 that have decreased. 401 402 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 16 Table 3. The number motif-containing loci for RNA-binding proteins whose 403 occurrences have increased or decreased between strains of the first and last month of 404 the analysis. 405 406 407 Additional file 1 408 Fig. S1: Average di- and trinucleotide compositions (A and B) of for SARS-CoV-2 409 strains collected in each elapsed month. 410 Fig. S2: Oligonucleotide compositions of human and bat coronavirus sequences. 411 Fig. S3: Differences in oligonucleotide composition between SARS-CoV-2 and human-412 CoV. 413 414 415 Additional file 2 416 Table S1: List of SARS-CoV-2 strains used in the analysis. 417 418 Table S2: List of human-and bat-CoV strains used in the analysis. 419 420 Table S3: Number of SARS-CoV-2 strains in each clade isolated in each elapsed month. 421 422 Table S4: Average oligonucleotide compositions for SARS-CoV-2 strains in each clade 423 isolated in each elapsed month. 424 425 Table S5: Correlation coefficients for time-dependent changes in oligonucleotide 426 compositions of SARS-CoV-2. 427 428 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 17 Table S6: Fold change in compositions between strains of the first and last month of the 429 analysis. 430 Table S7: Distance between the oligonucleotide composition of SARS-CoV-2 isolated 431 in each elapsed month and that of human-CoV. 432 433 Table S8: Correlation coefficients for time-series changes in the distance between 434 oligonucleotide compositions of SARS-CoV-2 and human-CoV. 435 436 Table S9: List of RNA-binding motifs. 437 438 Table S10: Numbers of motif-containing loci for RNA-binding proteins whose 439 abundance increases or decreases between strains of the first and last month of the 440 analysis. 441 442 Table S11: P-value from t-test to analyze the number of RNA-binding motif loci whose 443 abundance increases or decreases between strains of the first and last month of the 444 analysis. 445 446 Table S12: Correlation coefficients for time-dependent changes in the number of loci 447 containing RNA-binding motifs. 448 449 450 Reference 451 1. Singhal T: A review of coronavirus disease-2019 (COVID-19). Indian J Pediatr. 452 2020; 87:281-86. 453 2. García-Sastre A: Inhibition of interferon-mediated antiviral responses by influenza 454 A viruses and other negative-strand RNA viruses. Virology. 2001;279: 375–84. 455 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 18 3. Voinnet O: Induction and suppression of RNA silencing: insights from viral 456 infections. Nat. Rev. Genet. 2005;6:206–20. 457 4. Randall RE, Goodbourn S: Interferons and viruses: an interplay between induction, 458 signalling, antiviral responses and virus countermeasures. J. Gen. Virol. 2008;89:1–459 47. 460 5. Konno Y, Kimura I, Uriu K, et al: SARS-CoV-2 ORF3b is a potent interferon 461 antagonist whose activity is increased by a naturally occurring elongation variant. 462 Cell Rep. 2020;32:108185. 463 6. Zhou et al: A novel bat coronavirus closely related to SARS-CoV-2 contains natural 464 insertions at the S1/S2 cleavage site of the spike protein. Curr Biol. 2020;30:2196-465 203. 466 7. Nei M: Molecular evolutionary genetics. Columbia University Press: New York. 467 1987. 468 8. Kumar S, Nei M, Dudley J, Tamura K: MEGA: a biologist-centric software for 469 evolutionary analysis of DNA and protein sequences, Brief Bioinform. 2008;9:299–470 306. 471 9. Abe T, Kanaya S, Kinouchi M, et al: Informatics for unveiling hidden genome 472 signatures, Genome Res. 2003;13:693–702. 473 10. Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T: Novel phylogenetic 474 studies of genomic sequence fragments derived from uncultured microbe mixtures 475 in environmental and clinical samples, DNA Res. 2005;12:281–90. 476 11. Iwasaki Y, Abe T, Wada K, Itoh M, Ikemura T,: Prediction of directional changes of 477 influenza A virus genome sequences with emphasis on pandemic H1N1/09 as a 478 model case. DNA res 2011;18:125-36 479 12. Iwasaki Y, Abe T, Wada Y, Wada K, Ikemura T: Novel bioinformatics strategies for 480 prediction of directional sequence changes in influenza virus genomes and for 481 surveillance of potentially hazardous strains. BMC Infect Dis. 2013;13:386- 482 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 19 13. Karlin S, Campbell AM, Mrazek J: Comparative DNA analysis across diverse 483 genomes. Annu. Rev. Genet. 1998;32:185–225. 484 14. Wada Y, Wada K, Iwasaki Y, Kanaya S, Ikemura T: Directional and reoccurring 485 sequence change in zoonotic RNA virus genomes visualized by time-series word 486 count. Sci Rep. 2016;6:36197. 487 15. Wada K, Wada Y, Iwasaki Y, Ikemura T: Time-series oligonucleotide count to 488 assign antiviral siRNAs with long utility fit in the big data era. Gene Ther. 489 2017;24:668–73. 490 16. Wada K, Wada Y, Ikemura T: Time-series analyses of directional sequence changes 491 in SARS-CoV-2 genomes and an efficient search method for candidates for 492 advantageous mutations for growth in human cells. Gene. 2020;5:100038. 493 17. Qiu Y, Abe T, Nakao R, Satoh K, Sugimoto C: Viral population analysis of the 494 taiga tick, Ixodes persulcatus, by using Batch Learning Self-Organizing Maps and 495 BLAST search. Journal of Veterinary Medical Science, 2019;81(3):401-10. 496 18. Mercatelli D, Giorgi FM: Geographic and genomic distribution of SARS-CoV-2 497 mutations. Front Microbiol. 2020;22:11:1800. 498 19. Simmonds P: Rampant C→U hypermutation in the genomes of SARS-CoV-2 and 499 other coronaviruses: causes and consequences for their short- and long-term 500 evolutionary trajectories. mSphere. 2020;24:e00408-20. 501 20. Paek KY, Kim CS, Park SM, Kim JH, Jang SK: RNA-binding protein hnRNP D 502 modulates internal ribosome entry site-dependent translation of hepatitis C virus 503 RNA. J Virol. 2008;82:12082-93. 504 21. Harris RS, Bishop KN, Sheehy AM, Craig HM, Petersen-Mahrt SK, Watt IN, 505 Neuberger MS, Malim MH: DNA deamination mediates innate immunity to 506 retroviral infection. Cell. 2003;113:803–809. 507 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 20 22. Mangeat B, Turelli P, Caron G, Friedli M, Perrin L, Trono D: Broad antiretroviral 508 defence by human APOBEC3G through lethal editing of nascent reverse transcripts. 509 Nature. 2003;424:99–103. 510 23. Zhang H, Yang B, Pomerantz RJ, Zhang C, Arunachalam SC, Gao L: The cytidine 511 deaminase CEM15 induces hypermutation in newly synthesized HIV-1 DNA. 512 Nature. 2003. 424:94–98. https://doi.org/10 .1038/nature01707. 513 24. Harris RS, Dudley JP: APOBECs and virus restriction. Virology. 2015;479–514 480:131–45. 515 25. Sawyer SL, Emerman M, Malik HS: Ancient adaptive evolution of the primate 516 antiviral DNA-editing enzyme APOBEC3G. PLoS Biol. 2004;2:E275. 517 26. Münk C, Willemsen A, Bravo IG: An ancient history of gene duplications, fusions 518 and losses in the evolution of APOBEC3 mutators in mammals. BMC Evol Biol. 519 2012;12:71. 520 27. Henry M, Terzian C, Peeters M, Wain-Hobson S, Vartanian JP: Evolution of the 521 primate APOBEC3A cytidine deaminase gene and identification of related coding 522 regions. PLoS One. 2012;7:e30036. 523 28. Wang W, Caldwell MC, Lin S, Furneaux H, Gorospe M: HuR regulates cyclin A 524 and cyclin B1 mRNA stability during cell proliferation. EMBO J. 525 2000;19(10):2340-50. 526 29. Lal A, Mazan-Mamczarz K, Kawai T, Yang X, Martindale JL, Gorospe M: 527 Concurrent versus individual binding of HuR and AUF1 to common labile target 528 mRNAs. EMBO J. 2004;23(15):3092-102. 529 30. Giudice G, Sánchez-Cabo F, Torroja C, Lara-Pezzi E: ATtRACT-a database of 530 RNA-binding proteins and associated motifs. Database (Oxford). 2016;7:baw035. 531 532 533 534 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 21 Table 1 535 Clade G Clade GH Clade GR Clade L Clade V Clade S U 0.97 0.92 0.91 0.72 0.66 0.73 UA 0.89 0.80 0.92 0.62 0.30 0.50 AUU 0.81 0.78 0.92 0.90 0.27 0.05 CAU 0.56 0.59 0.54 0.51 0.55 0.17 UGU 0.68 0.82 0.52 0.59 0.88 0.50 UUA 0.96 0.72 0.93 0.80 0.19 0.34 UUG 0.67 0.78 0.20 0.92 0.69 0.05 UUU 0.94 0.82 0.89 0.80 0.41 0.93 536 Table 2 537 Clade G Clade GH Clade GR Clade L Clade V Clade S C -0.95 -0.83 -0.95 -0.98 -0.97 -0.35 AG -0.77 -0.71 -0.27 -0.57 -0.67 -0.44 CA -0.73 -0.93 -0.85 -0.16 -0.45 -0.82 CC -0.93 -0.87 -0.75 -0.90 -0.90 -0.09 CU -0.81 -0.40 -0.79 -0.52 -0.15 -0.10 GA -0.28 -0.90 -0.80 -0.62 -0.73 -0.29 GG -0.60 -0.79 -0.65 -0.03 -0.10 -0.23 UC -0.33 -0.23 -0.10 -0.58 -0.93 -0.41 AGC -0.57 -0.87 -0.41 -0.80 -0.78 -0.12 CCC -0.69 -0.81 -0.67 -0.94 -0.81 -0.50 GAC -0.73 -0.91 -0.65 -0.35 -0.18 -0.55 GAG -0.11 -0.58 -0.64 -0.50 -0.81 -0.02 GGA -0.73 -0.84 -0.75 -0.65 -0.81 -0.52 538 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 22 Table 3 539 Clade G Clade GH Clade GR Clade L Clade V Clade S PTBP1 -2.81 -3.46 -4.92 -13.50 -11.66 3.95 HNRNPL -3.03 -1.78 -0.52 -3.62 -6.48 0.71 NOVA1 -0.04 -2.77 -0.70 -3.37 -5.04 1.09 SRSF2 -1.17 -3.02 -1.09 -3.10 -3.10 0.78 ZFP36 1.67 -3.50 -1.98 -3.48 -4.78 1.86 HNRNPA1 -0.16 -2.50 -0.50 -2.64 -3.93 0.44 ELAVL1 2.82 0.47 2.87 0.59 0.03 2.39 TIA1 -0.82 -2.05 -1.20 -1.72 -4.41 1.67 PCBP2 -0.37 -2.50 -1.03 -2.28 -2.35 0.11 SRSF1 -0.63 -2.29 -1.26 -2.55 -1.84 0.36 540 541 542 543 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 23 544 545 546 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 24 547 548 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 25 549 550 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 26 551 552 553 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 27 554 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/