key: cord-0876755-bawgldfi authors: Chaw, Shu-Miaw; Tai, Jui-Hung; Chen, Shi-Lun; Hsieh, Chia-Hung; Chang, Sui-Yuan; Yeh, Shiou-Hwei; Yang, Wei-Shiung; Chen, Pei-Jer; Wang, Hurng-Yi title: The origin and underlying driving forces of the SARS-CoV-2 outbreak date: 2020-04-14 journal: bioRxiv DOI: 10.1101/2020.04.12.038554 sha: acb4b163cb42cad5f83f54b4d198d893f266a752 doc_id: 876755 cord_uid: bawgldfi The spread of SARS-CoV-2 since December 2019 has become a pandemic and impacted many aspects of human society. Here, we analyzed genetic variation of SARS-CoV-2 and its related coronavirus and found the evidence of intergenomic recombination. After correction for mutational bias, analysis of 137 SARS-CoV-2 genomes as of 2/23/2020 revealed the excess of low frequency mutations on both synonymous and nonsynonymous sites which is consistent with recent origin of the virus. In contrast to adaptive evolution previously reported for SARS-CoV in its brief epidemic in 2003, our analysis of SARS-CoV-2 genomes shows signs of relaxation of selection. The sequence similarity of the spike receptor binding domain between SARS-CoV-2 and a sequence from pangolin is probably due to an ancient intergenomic introgression. Therefore, SARS-CoV-2 might have cryptically circulated within humans for years before being recently noticed. Data from the early outbreak and hospital archives are needed to trace its evolutionary path and reveal critical steps required for effective spreading. Two mutations, 84S in orf8 protein and 251V in orf3 protein, occurred coincidentally with human intervention. The 84S first appeared on 1/5/2020 and reached a plateau around 1/23/2020, the lockdown of Wuhan. 251V emerged on 1/21/2020 and rapidly increased its frequency. Thus, the roles of these mutations on infectivity need to be elucidated. Genetic diversity of SARS-CoV-2 collected from China was two time higher than those derived from the rest of the world. In addition, in network analysis, haplotypes collected from Wuhan city were at interior and have more mutational connections, both of which are consistent with the observation that the outbreak of cov-19 was originated from China. SUMMARY In contrast to adaptive evolution previously reported for SARS-CoV in its brief epidemic, our analysis of SARS-CoV-2 genomes shows signs of relaxation of selection. The sequence similarity of the spike receptor binding domain between SARS-CoV-2 and a sequence from pangolin is probably due to an ancient intergenomic introgression. Therefore, SARS-CoV-2 might have cryptically circulated within humans for years before being recently noticed. Data from the early outbreak and hospital archives are needed to trace its evolutionary path and reveal critical steps required for effective spreading. Two mutations, 84S in orf8 protein and 251V in orf3 protein, occurred coincidentally with human intervention. The 84S first appeared on 1/5/2020 and reached a plateau around 1/23/2020, the lockdown of Wuhan. 251V emerged on 1/21/2020 and rapidly increased its frequency. Thus, the roles of these mutations on infectivity need to be elucidated. A newly emerging coronavirus was detected in patients during an outbreak of 58 respiratory illnesses starting in mid-December of 2019 in Wuhan, the capital of Hubei 59 Province, China [1 , 2, 3] . Due to the similarity of its symptoms to those induced by the 60 severe acute respiratory syndrome (SARS) and genome organization similarity, the causal 61 virus was named SARS-CoV-2 by the International Committee on Taxonomy of Viruses [4] . 62 As of 3/16/2020, 167,515 cases of SARS-CoV-2 infection have been confirmed in 114 63 countries, causing 6,606 fatalities. As a result, WHO declared the first pandemic caused by a 64 coronavirus on 3/11/2020 (https://www.who.int/emergencies/diseases/novel-coronavirus-65 2019/situation-reports). As the virus continues to spread, numerous strains have been isolated 66 and sequenced. On 3/18/2020, more than 500 complete or nearly complete genomes have 67 been sequenced and made publicly available. Several issues concerning the origin, time of virus introduction to humans, 79 evolutionary patterns, and the underlying driving force of the SARS-CoV-2 outbreak remain 80 to be clarified [12, 13] . Here, we analyzed genetic variation of SARS-CoV-2 and its related 81 coronaviruses. We discuss how mutational bias influences genetic diversity of the virus and 82 attempt to infer forces that shape SARS-CoV-2 evolution. Molecular evolution of SARS-COV-2 and related coronaviruses 85 The resulting phylogeny reveals that RaTG13 is the closest relative of SARS-COV-2, 86 followed by pangolin_2019 and pangolin_2017, then CoVZC45 and CoVZXC21, and other 87 SARS-related sequences as outgroups (S1 Fig). According to general time reversible model, 88 transition occurred more frequent than transversion with C-T and A-G changes account for 89 45% and 28%, respectively, of all six types of nucleotide changes. We next estimated the 90 strength of selection for each coding region using the dN and dS. While purifying selection 91 tends to remove amino acid-altering mutations, thus reducing dN and dN/dS, positive 92 selection has the opposite effect, increasing dN and dN/dS [14] . Between SARS-CoV-2 and 93 RaTG13, orf8 gene exhibits the highest dN (0.032) followed by spike (0.013) and orf7 94 (0.011), all above the genome average of 0.007 (Table 1) . dS varies greatly among CDSs 95 with the highest of 0.313 in spike and the lowest of 0.018 in envelope (genome average 96 0.168). Finally, dN/dS is the highest in orf8 (0.105) followed by orf7 (0.061) and orf3 (0.060), 97 with the genome average of 0.042. Since spike shows both high dS and dN, its protein 98 evolution rate (dN/dS) is only 0.040. Thus, while the coronavirus evolved very rapidly, it has 99 actually been under tremendous selective constraint [13] . Spike protein similarity between SARS-CoV-2 and pangolin_2019 led to the idea that 101 the receptor binding domain (RBD) within the SARS-CoV-2 spike protein originated from 102 pangolin_2019 via recombination [15] [16] [17] [18] . If that were the case, we would expect the 103 divergence at synonymous sites (dS) to also be reduced in the RBD region. However, while 104 dN in the RBD region is 0.023, approximately one third of the estimate for the rest of the 105 spike gene (0.068), dS in the RBD (0.710) is actually slightly higher than in the rest of the 106 spike sequence (0.651). This argues against the recombination scenario. We noticed that the 107 dS of the whole spike and the RBD, are 2-and 3-fold, respectively, higher than the genome 108 average. Since synonymous sites are typically less influenced by selection, the increased 109 divergence in dS may reflect an underlying elevated mutation rate. skewed. While the former shows excess of both high and low frequency mutations, the latter 116 mainly exhibits an excess of low frequency changes (Fig. 1a) . The excess of low frequency 117 mutations is consistent with the recent origin of SARS-CoV-2 [19]. Both population 118 reduction and positive selection can increase high frequency mutations [20, 21] . However, 119 the first scenario is contradicted by the recent origin of the virus. If positive selection has 120 been operating, we would expect an excess of high frequency non-synonymous as well as 121 synonymous changes. Furthermore, the ratio of nonsynonymous to synonymous changes is 122 2.46 (138/56) among singleton variants, but only 1.42 (17/12) among non-singletons. Both of 123 these observations suggest that the majority of amino acid-altering mutations are selected 124 against, with no positive selection in evidence. The skew of synonymous variants toward high frequency deserves further discussion, 126 as it relates to the underlying force driving the SARS-CoV-2 outbreak. The puzzle is 127 probably rooted in how high and low frequency mutations are inferred. The results shown in 128 the Fig. 1a are based on an outgroup comparison. The divergence at synonymous sites 129 between SARS-CoV-2 and RaTG13 is 17%, approximately 3-fold greater than between 130 humans and rhesus macaques [22] . With such high level of divergence, the possibility of 131 multiple substitutions cannot be ignored, especially since substitution in the coronavirus 132 genome is strongly biased toward transitions (see above). Indeed, among all non-singleton 133 mutations listed in Table 2 , 62% of the changes are C-T transitions. To get around the potential problem caused by multiple substitutions, we cross- Table 2 ). All three sequences from Singapore share the T nucleotide also found in 139 the RaTG13 outgroup. Using the outgroup comparison, the C found in the rest of the human 140 SARS-CoV-2 sequences is a derived mutation. However, the T at this position is restricted to 141 genomes collected from Singapore on 2/4 and 2/6/2020 and not found in earlier samples. It is 142 thus more sensible to infer that this T is a back mutation derived from C rather than an novel mutations might be occurred within this patient cannot be 100% ruled out, the 155 alternative explanation that this patient may have been co-infected by two viral strains seems 156 more plausible. After cross-referencing with the haplotype network and the phylogeny, all mutations 158 listed as high frequency in Table 2 and Fig. 1a were re-assigned to the other side of the 159 frequency spectra. We only see an excess of singleton mutations, consistent with a recent In addition to L84S, a G-T transversion at 26114 which caused an amino acid change 180 in orf3 protein (G251V) is also at intermediate frequency (Table 2) . 251V was first seen on 181 1/22/2020 and gradually increased its frequency to 13% by our sampling date (Fig. 3) . We 182 note that the emergence of 84S in orf8 and 251V in orf3 are consistent with the lockdown of 183 Wuhan on 1/23/2020. The former first appeared in early January, gradually increased its 184 frequency, and reached a plateau around 1/23/2020. The latter showed up on 1/22/2020 and 185 rapidly increased its frequency within two weeks. (Table 1) , such inflation would have a large effect on dN/dS estimates. 193 We therefore excluded singletons from dN and dS estimation. The dN/dS of orf8 gene in episode I and II and orf3 gene in episode II show strong 195 signatures of positive selection, consistent with increase of 84S and 251V frequency during 196 these periods, and may suggest a role of adaptation (Table 3 ). The overall dN/dS within each 197 episode was 5-10 times higher than dN/dS between coronavirus genomes derived from 198 different species (Table 1 ). The elevated dN/dS of SARS-CoV-2 is either due to its 199 adaptation to human hosts or relaxation of selection. For a recently emerged virus, it is 200 reasonable to expect operation of positive selection at the early stage. In that case, the dN/dS 201 during episode I should be greater than during episode II [23, 24] . However, dN/dS was smaller in episode I than in episode II across the majority of the 203 genome, suggesting that elevation of dN/dS is probably mostly due to the relaxation of 204 selection. We further divided episode I into Ia and Ib, according to the appearance of 84S in 205 orf8 protein on 1/6/2020. The genome-wide dN/dS values were 0.27 and 0.23 for episode 1a 206 and 1b, respectively (S1 Table) . Therefore, as shown in the frequency spectra, the signature 207 of positive selection is weak at the early stage of the epidemic. influenced by the genome sampling scheme. Since the earliest available genome was sampled 216 on 12/24/2019 almost one month after the outbreak, the real origin of the current outbreak 217 may actually be earlier than our estimation. 218 We estimated genetic variation, including the number of segregating sites, Watterson's estimator of θ, and nucleotide diversity (π) of the SARS-CoV-2. Since both π 220 and θ are estimators of 4Nu (N and u are the effective population size and mutation rate, 221 respectively), they should be close to each other at the mutation-drift equilibrium [25] . The haplotype network also supports this notion (Fig. 2) . Usually, ancestral 231 haplotypes have a greater probability of being in the interior, have more mutational 232 connections, and are geographically more widely distributed. The H1 haplotype is at the 233 center of the network and is found in four countries and many places in China. In addition, a 234 large portion of haplotypes is directly connected to H1. Therefore, it is likely that H1 is the 235 ancestral haplotype. As 45% of H1 are found in Wuhan, this location is the most plausible 236 origin of the ongoing pandemic. hosts, the idea that they share five out of six critical amino acids within RBD through 252 convergent evolution seems far-fetched. 253 We therefore hypothesize that, instead of convergent evolution, the similarity of RBD 254 between SARS-CoV-2 and pangolin_2019 was caused by an ancient inter-genomic Therefore, these two mutations may have some functional consequences and be worth 309 investigating further. By the time we prepared this manuscript, the 215V frequency ceased to 310 increase. However, a parallel mutation has occurred in a different genomic background, 311 further supporting the idea that this mutation may require further study. The authors thank those who contributed to sequence generation and sharing (The 348 detail is listed in S2 Table) . We also thank Chung-I Wu and Wen-Ya Ko for their 349 constructive comments and suggestions. This work was supported by Ministry of Science and Mutation types and numbers are given along the branch. Mutations that are involved in 525 different evolutionary pathways or occurred more than once are enclosed. Also see Table 2 526 for comparison. Six genomes, EPI_ISL_ 408511, 408512, 410480, 408483, 407079, 407079, 527 were excluded from this analysis because they contain too many 'N' in the sequences. 528 Fig. 3 Mutation frequency of 84S is in orf8 and 215V is in orf3. A 614 orf1ab G G G G A 2 H116Q B 1190 orf1ab C C C C T 3 P308S C 5084 orf1ab A A A A G 2 A1606T D 9438 orf1ab C C C C T 3 T3058I E 11083 orf1ab G T G G T 9 L3606F F 18488 orf1ab T T T T C 2 I6074V G 21707 S C C N/A C T 5 H48Y H 22661 S G G G G T 5 V366F I 26144 orf3 G G G G T 18 G251V J 27147 M G G G G C 2 I208T K 28077 orf8 G G G G C 4 V61L L 28144 orf8 C C C T C 99 38 L84S M 28854 N C C C C T 5 S194L N 28878 N G G G G A 6 S202N O 29019 N A A A A T 2 D249H P 29303 N C C C C T 2 K343I Synonymous α 2662 orf1ab C T T C T 3 C2397T β 8782 orf1ab T T T C T 100 37 C8517T γ 10138 orf1ab T T T C T 134 3 C9873T δ 15324 orf1ab C C C C T 2 C15059T ϵ 17373 orf1ab T C T C T 132 5 C17108T ζ 18060 orf1ab T T A C T 131 6 C17795T η 18603 orf1ab T T C T A 2 T18338C θ 23569 S A C A T C 2 T2007C ι 23605 S N/A N/A N/A T G 2 T2043G κ 24034 S T C C C T 131 6 C2472T λ 24325 S A A A A G 2 A2763G μ 26729 M T T T T C 4 T207C ν 29095 N T T T C T The dashed line indicates the date of the Wuhan, China, lockdown. The location of each sequence is given (above the slash) followed by its sampling date 543 (below the slash). For multiple sequences sampled on the same date from the same location, Mutation types and numbers are given along the branch. Mutations that are involved in different evolutionary pathways or occurred more than once are enclosed. Also see Table 2 for comparison. Six genomes, EPI_ISL_ 408511, 408512, 410480, 408483, 407079, 407079 , were excluded from this analysis because they contain too many 'N' in the sequences. Bat SARS-like coronavirus isolate bat-SL-CoVZC45 Bat SARS-like coronavirus isolate bat-SL-CoVZXC21 Bat SARS coronavirus HKU3-1 BtRs-BetaCoV-GX2013 100 100 100 100 100 0.020 S1 Fig Appendix Figure 1 . The neighbor-joining tree of SARS-CoV-2 related coronaviruses constructed by concatenating coding sequences based on the Kimura 2-parameter model implemented in MEGA-X. Identification of a novel coronavirus 356 causing severe pneumonia in human: a descriptive study A pneumonia outbreak associated 359 with a new coronavirus of probable bat origin A new coronavirus associated with 362 human respiratory disease in China Coronaviridae Study Group of the International Committee on Taxonomy of V. The species 365 Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-366 Hosts and Sources of Endemic Human 369 Origin and evolution of pathogenic coronaviruses The 2019-new 374 coronavirus epidemic: Evidence for virus evolution Genome Composition and Divergence of 377 the Novel Coronavirus (2019-nCoV) Originating in China Genomic characterisation and epidemiology of 380 2019 novel coronavirus: implications for virus origins and receptor binding Genomic characterization of the 2019 384 novel human A novel bat coronavirus reveals natural 388 insertions at the S1/S2 cleavage site of the Spike protein and a possible recombinant origin of HCoV-389 19 Moral imperative for the immediate release of 2019-nCoV sequence data. 391 National Science Review On the origin and continuing evolution of 393 SARS-CoV-2. National Science Review Sinauer Associates; 1997. xv, 487 p Evidence of recombination in 396 coronaviruses implicating pangolin origins of nCoV-2019 Isolation and Characterization of 2019-399 nCoV-like Coronavirus from Malayan Pangolins The proximal origin of SARS-CoV-2. 402 Nature Medicine Identification of 2019-nCoV 404 related coronaviruses in Malayan pangolins in southern China Origin time and epidemic dynamics of the 2019 novel coronavirus Hitchhiking under positive Darwinian selection Compound tests for the detection of hitchhiking under positive 411 selection Rate of evolution in 414 brain-expressed genes in humans and other primates Characterization of severe acute 418 respiratory syndrome coronavirus genomes in Taiwan: molecular epidemiology and genome 419 evolution Molecular evolution of the SARS coronavirus during the course of the SARS 422 epidemic in China Molecular population genetics Distinct hepatitis B virus 427 dynamics in the immunotolerant and early immunoclearance phases H7N9 Avian Influenza A Viruses HIV drug resistance Molecular Evolutionary Genetics 486 Analysis across Computing Platforms A new method for estimating synonymous and nonsynonymous rates 489 of nucleotide substitution considering the relative likelihood of nucleotide and codon changes PAML 4: phylogenetic analysis by maximum likelihood DNA Sequence Polymorphism Analysis of Large Data Sets Bayesian phylogenetic 498 and phylodynamic data integration using BEAST 1.10 Relaxed phylogenetics and dating with 501 confidence Appendix Figure 2 . Unrooted neighbor-joining tree of SARS-CoV-2 constructed by concatenating coding sequences based on the Kimura 2-parameter model implemented in MEGA-X. Nonsingleton changes are shown along the branches. The location of each sequence is given (above the slash) followed by its sampling date (below the slash). For multiple sequences sampled on the same date from the same location, the index, a, b, c, d, and etc. is given. Details are listed in Supplemental File 2.Appendix Appendix Figure 3 . Frequency spectra of SARS-CoV-2 carrying 84L (n=98) (A) and 84S (n=39) (B) in orf8. The direction of changes was cross-referenced with the haplotype network shown in Fig. 2