key: cord-0901526-rvygtpvb
authors: Shan, Ke-Jia; Wei, Changshuo; Wang, Yu; Huan, Qing; Qian, Wenfeng
title: Host-specific asymmetric accumulation of mutation types reveals that the origin of SARS-CoV-2 is consistent with a natural process
date: 2021-08-30
journal: Innovation (N Y)
DOI: 10.1016/j.xinn.2021.100159
sha: fa76a8b70738f37d06e43e53cca2ffba56fd4bc1
doc_id: 901526
cord_uid: rvygtpvb

The capacity of RNA viruses to adapt to new hosts and rapidly escape the host immune system is largely attributable to de novo genetic diversity that emerges through mutations in RNA. Although the molecular spectrum of de novo mutations—the relative rates at which various base substitutions occur—are widely recognized as informative towards understanding the evolution of a viral genome, little attention has been paid to the possibility of using molecular spectra to infer the host origins of a virus. Here, we characterize the molecular spectrum of de novo mutations for SARS-CoV-2 from transcriptomic data obtained from virus-infected cell lines, enabled by the use of sporadic junctions formed during discontinuous transcription as molecular barcodes. We find that de novo mutations are generated in a replication-independent manner, typically on the genomic strand, and highly dependent on mutagenic mechanisms specific to the host cellular environment. De novo mutations will then strongly influence the types of base substitutions accumulated during SARS-CoV-2 evolution, in an asymmetric manner favoring specific mutation types. Consequently, similarities between the mutation spectra of SARS-CoV-2 and the bat coronavirus RaTG13 which have accumulated since their divergence strongly suggest that SARS-CoV-2 evolved in a host cellular environment highly similar to that of bats before its zoonotic transfer into humans. Collectively, our findings provide data-driven support for the natural origin of SARS-CoV-2.

Since the first reports of coronavirus disease 2019 (COVID- 19) , controversies have persisted regarding the origin of its causative agent, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 1 While many studies have proposed that a natural origin of SARS-CoV-2 provides the simplest explanation for its emergence, 2-5 counter arguments have speculated that an accidental laboratory escape of an engineered SARS-like coronavirus could not be excluded. 6, 7 One factor driving this prolonged controversy is a lack of empirical data that clearly supports either possibility. Moreover, the search for related viruses in wild animals that are sufficiently genetically similar to SARS-CoV-2

has not yet shown fruitful results. Since its divergence from RaTG13-the genetically most similar virus identified to date-approximately 50 years ago, [8] [9] [10] [11] SARS-CoV-2 had accumulated ~500 mutations before its jump to human hosts (RaTG13 accumulated ~600 mutations meanwhile).

Per the traditional aphorism, "the absence of evidence is not evidence of absence", especially when considering the vast number of unexplored wild animals and the even greater number of viruses they harbor. While the considerable efforts of many research groups to search in nature for a closely related coronavirus may yet provide many insights into the origins of SARS-CoV-2, we instead turned our attention to the ~500 base substitutions that have accumulated in SARS-CoV-2 since, we hypothesized, they could provide us with unprecedented statistical power to test if they accumulated through an evolutionary process that was consistent with those occurring in known, natural coronaviruses.

Virus evolution begins with a de novo mutation in its genome, thus providing new variations in the genetic material that are retained or lost under different selection pressures. On the one hand, mutations that confer a fitness advantage or disadvantage will increase or decrease in frequency through natural selection. For example, some mutations may affect transmission efficiency or capacity to escape from the host immune system. On the other hand, neutral mutations, which have little fitness effect, remain J o u r n a l P r e -p r o o f unaffected by natural selection, resulting in their genomic accumulation of an equal chance through random genetic drift.

Since de novo mutations are the starting point for genetic variation, their molecular spectrum-i.e., the relative rates at which all 12 possible types of base substitutions arise-has been widely recognized as an essential parameter for understanding genome evolution. Since this spectrum of mutation rates can be used to predict sequence changes under neutral processes, this metric can serve as a null model for optimizing phylogenetic reconstructions based on maximum likelihood or for detecting genomic signals of positive selection. [12] [13] [14] [15] [16] [17] [18] In addition to its applications in evolutionary biology, the molecular spectrum has been used to describe somatic mutations accumulated in the genome of cancer cells and to identify etiological agents involved in tumorigenesis. Various mutational processes, such as exposure to mutagens and enzymatic modification, will each generate unique combinations of mutation types, termed "mutational signatures". Therefore, the molecular spectrum can be used to infer the suite of operative mutational processes through which somatic mutations accumulated in the genome of a cancer cell. [19] [20] [21] For example, excess C>A or G>T transversions, mainly caused by polycyclic aromatic hydrocarbons, has been identified as a mutational signature for tobacco smoking in the development of lung cancers. 21, 22 Following the same logic, we propose that the molecular spectrum of mutations that accumulated during the evolution of a viral genome may be informative for inferring the ancestral hosts of that virus, for viruses share the same sets of mutagens in the cellular environments as their hosts. However, we realize that this strategy heavily relies on the validity of three assumptions. First, the cellular environment is substantially variable among different hosts such that they can create mutational signatures sufficiently distinct in the viral genome for tracing its transmission history. Second, de novo mutations in the viral genome are predominantly introduced through processes specific to the host cellular environment, rather than through inherently viral mechanisms of mutagenesis. Third, the molecular spectrum of mutations accumulated in the evolution of a given virus is largely J o u r n a l P r e -p r o o f determined by de novo mutations rather than by natural selection, which in principle could blur any mutational signatures. We realize that the key to testing these assumptions is to characterize the molecular spectrum of de novo mutations in SARS-CoV-2, before natural selection has a chance to affect their apparent frequency.

In this study, we first tested each of these three assumptions using a computational strategy specifically developed for detecting de novo mutations in SARS-CoV-2 from the transcriptome of virus-infected cell lines. After validating the three assumptions, we constructed a phylogenetic tree for SARS-CoV-2 and related coronaviruses and identified hundreds of mutations that accumulated in its genome prior to jumping to human hosts.

Last, we investigated whether the accumulation of these mutations was compatible with other viruses in the phylogenetic tree that are reported to have a natural origin. Our datadriven investigation provides transparent and empirical support for the natural origin of SARS-CoV-2.

The identification of de novo mutations in SARS-CoV-2 has been technically challenging. For example, the molecular spectrum of de novo mutations cannot be inferred from within-individual polymorphisms in samples of bronchoalveolar lavage fluid 23-25 or from mutations which accumulated among patients (i.e., among-patient polymorphisms), 16, 26, 27 because we are specifically concerned with the extent to which the molecular spectrum of mutations that accumulated during virus evolution reflect the molecular spectrum of de novo mutations (refer to the third assumption aforementioned).

For evolutionary genomics studies, de novo mutations are ideally detected in newly synthesized virus genomes, before natural selection has a chance to act, for example, using RNA sequencing data from SARS-CoV-2-infected cells. However, de novo RNA mutations in SARS-CoV-2 cannot be directly inferred from mismatches between sequencing reads and the reference genome because of the high error rate inherent to high throughput sequencing (10 −3 to 10 −4 errors per nucleotide), which is approximately two J o u r n a l P r e -p r o o f orders of magnitude higher than the average de novo RNA mutation rate (10 −5 to 10 −6 mutations per nucleotide). 28 Furthermore, errors generated during library preparationthat is, in the procedures of reverse transcription and polymerase chain reaction (PCR)based amplification-will also increase the complexity of identifying bona fide RNA mutations. 29 Nevertheless, experimental strategies have been designed to detect rare de novo mutations in RNA viruses, such as for poliovirus 30 Figure S1A, left panel) . This two-round transcription mechanism is also employed by J o u r n a l P r e -p r o o f SARS-CoV-2 to synthesize various positive-sense subgenomes which function as viral mRNAs for the translation of viral proteins ( Figure S1A, right panel) .

The SARS-CoV-2 subgenomes are nested within the genomic RNA and are produced by discontinuous transcription from positive-sense genomic RNA into intermediate negative-sense subgenomic RNA (i.e., via polymerase jumping, Figure   S1B ). In addition to canonical junctions generated by the leader-to-body fusion occurring between the leader and one of the eight body transcription-regulating sequences (TRS), a huge number of noncanonical fusions can be found at random sites in the viral genome. 35 Most resultant noncanonical junctions are present at a low frequency, likely resulting from sporadic errors in discontinuous transcription. 36 Nevertheless, the negative-sense subgenomic RNA bearing such sporadic junctions can serve as a template for transcription into multiple identical positive-sense subgenomic RNAs. 37 We realized that sporadic junctions could serve as the molecular barcode for a negative-sense subgenomic RNA, since it is unlikely that two independently synthesized, positive-sense subgenomic RNAs will share identical genomic coordinates of a pair of upstream and downstream junction sites. Therefore, these "junction barcodes" can be used to group sequencing reads into families; each read family will include all sequencing reads derived from the positive-sense subgenomic RNAs that have been transcribed from the same negative-sense subgenomic RNA. Repeated detection of the same mismatch within a read family implies that an RNA mutation was present in the negative-sense subgenomes ( Figure 1B) . In contrast, errors generated during reversetranscription or sequencing can be excluded, as they will appear randomly (i.e., not at identical sites).

Using the junction-barcoding approach (Figure S2 Figure S4 ). To estimate the rate of each mutation type, we controlled for the nucleotide composition of the viral genome and the potential coverage bias generated during high-throughput sequencing. For this calculation, we estimated the coverage for each site by all read families (similar to RNA mutation calling in Figure S2 , but no mismatch was required) and aggregated this coverage according to the nucleotide (A, C, G, or U) in the reference genome. We divided the number of mutations of each type by the total coverage of all sites with the nucleotide in the reference genome, and used this ratio to infer the molecular spectrum of mutations in SARS-CoV-2 in Vero cells ( Figure 1E ).

Among the 197 RNA mutations we identified in SARS-CoV-2, 122 were transitions (purine to purine or pyrimidine to pyrimidine interchanges) and 75 were transversions (interchanges of a purine to a pyrimidine or vice versa), which significantly deviated from the randomly expected ratio (4:8, P = 3×10 −16 , binomial test, Figure 1D ). On average, transitions occurred three times more frequently (1.1×10 −5 substitutions per site) than transversions (3.6×10 −6 substitutions per site, Figure 1E ), which was possibly attributable to the structural similarity between bases that are substituted in transitions.

In particular, C>U mutations appeared to be the most abundant transition (P = 6×10 −5 , binomial test with probability equal to 1/4, Figure 1D ). Among the eight types of RNA transversions, G>U mutations occurred much more frequently than the other seven mutation types (P = 8×10 −6 , binomial test with probability equal to 1/8, Figure 1D ), reaching frequencies comparable to that of transitions. We then focused on these two major signatures of SARS-CoV-2 mutations, i.e., over-representation of G>U and C>U mutations, in subsequent analyses.

Given that the de novo mutations provide the raw materials for virus evolution (Figure 2A) , before the further investigation of the molecular mechanisms underlying the overrepresentation of G>U and C>U mutations, we first sought to determine the levels at J o u r n a l P r e -p r o o f which SARS-CoV-2 evolution in human patients was affected by the molecular spectrum of de novo mutations. To this end, we retrieved 34,853 high-quality sequences of SARS-CoV-2 variants isolated from patients worldwide (Table S1) from GISAID (Global Initiative on Sharing All Influenza Data), 38 and reconstructed the genomic sequence of their last common ancestor.

We then identified genetic differences between each variant and the ancestor and treated those differences observed in at least two patients as among-patient polymorphisms. This process enabled the removal of singletons that were potentially It is worth noting that despite their apparent similarity, the de novo mutation identified in this study is by definition different from the among-patient polymorphisms, 16, 26 because the latter has been influenced by natural selection related to the processes of infection, propagation, or release from infected cells. 13, 39 Only with the characterization of the molecular spectrum of de novo mutations can we thus examine the relative influence of mutation vs. selection in driving the genomic evolution of SARS-CoV-2 (Figure 2A) . The observation that the molecular spectrum of among-patient polymorphisms largely resembled that of de novo mutations indicated that the proportions of deleterious mutations were largely uniform among the 12 types of base substitutions.

Presumably, the over-representation of G>U and C>U mutations could be consequences of transcriptional errors that produce RNAs carrying different sequences from that of the template (i.e., replication-dependent mutations), or these mismatches could result from exposure to environmental mutagens that can induce mutations in the absence of transcriptional machinery (i.e., replication-independent mutations). 40 We realized that these two mechanisms could be distinguished by comparing the frequency of G>U (or C>U) mutations with that of its complement mutation, C>A (or G>A). For example, a mutation observed in the negative-sense genome (e.g., C>A, which will be transcribed into a G>U mutation in the positive-sense genome) that is generated during positive to negative strand transcription should occur at approximately equal frequency in negative to positive strand transcription (leading to a C>A mutation in the positive-sense genome) because the same polymerase performs both functions. Consequently, when the SARS-CoV-2 replication cycle (i.e., two rounds of transcription) is completed, the molecular spectrum of replication-dependent mutations should be "symmetric", meaning that the frequencies of G>U and its complement mutation, C>A, should be similar ( Figure 3A, left panel) . Alternatively, if a mutation is generated in a replicationindependent manner, for example, induced by mutagens specifically in the positive-sense (or negative-sense) single-strand RNA of SARS-CoV-2, then the frequency of complement mutations will not necessarily be symmetrical ( Figure 3A , middle panel).

We reasoned that among-patient polymorphisms in SARS-CoV-2 could be used to investigate whether the mutagenic mechanisms underlying the overrepresentation of some mutation types were replication-dependent or -independent because polymorphisms were generated by complete replication cycles. The polymorphism data revealed that G>U transversions occurred with greater frequency than C>A (P = 2×10 −116 , Fisher's exact test) and that C>U transitions occurred with greater frequency than G>A (P = 1×10 −199 , Figure 3B ). This asymmetric distribution indicates that the observed overrepresentation of G>U and C>U mutations unlikely results from a replicationdependent process.

Presumably, replication-independent mutations can occur in either double-strand or single-strand RNA. We reasoned that mutations arising in double-strand RNA should J o u r n a l P r e -p r o o f also lead to a symmetric molecular spectrum ( Figure 3A, right panel) . This possibility was excluded by the observation that, among SARS-CoV-2 isolates from human patients, G>U and C>U polymorphisms were distributed in asymmetrically greater numbers ( Figure 3B) , supporting the likelihood that the mechanism responsible for introducing these mutations involved single-strand RNA. Furthermore, since the negative-sense RNA of SARS-CoV-2 is mainly present in the double-strand RNA (i.e., paired with positivesense RNA; Figure 3A ), we therefore proposed that the observed G>U and C>U mutations were most likely introduced to the single-stranded positive-sense RNA.

Thus far, our results indicated that the disproportionate abundance of G>U and C>U mutations in SARS-CoV-2 likely arose in positive-sense single-strand RNAs. There are two possible mechanisms that could account for this outcome. First, the positive-sense RNA is more vulnerable to mutagens-for example due to destruction of the RNA secondary structure by translating ribosomes, which subsequently exposes the single strand RNAs to mutagens. Second, the viral genetic information spends the majority of its life cycle as a positive-sense RNA. We thus reasoned that the molecular spectrum of mutations in negative-sense single-stranded RNA viruses could be used to investigate which of the two mechanisms underlay the emergence of single-strand RNA mutations, since the negative-sense RNA was predominant in these viruses ( Figure 3C ).

We first assessed this mechanism by analyzing the genetic polymorphisms among 1839 confirmed variants (Table S2 ) of the negative-sense single-stranded RNA virus, Influenza A virus, collected during the 2009 pandemic. 38 The asymmetric frequency of the among-patient polymorphisms that we observed in SARS-CoV-2, a positive-sense RNA virus (i.e., over-representation of G>U and C>U on the positive strand), was reversed in Influenza A virus: G>U and C>U polymorphisms were more abundant in the negative-sense genome (P = 2×10 −10 and 0.05, respectively, Fisher's exact tests, Figure   3D ). This result allowed us to exclude the possibility that a positive-sense-specific mechanism was responsible for the over-representation of G>U and C>U mutations. Asymmetric accumulation of G>U and C>U mutations were observed in the negativesense genomic RNA of the Ebola virus (Figure 3E) , which further supported a replication-independent mutation mechanism on single-strand RNAs that acts on the strand carrying the genetic information.

In light of these findings, we next sought to determine whether these replicationindependent mutations were introduced to the genomic-strand RNA in the extracellular virion environment, where the viral RNA is protected by the capsid, or if they occurred in the cellular environment following host invasion ( Figure 4A ). To address this issue, we reasoned that positive-sense single-stranded, persistent yeast RNA viruses (e.g., Saccharomyces 20S RNA narnavirus and Saccharomyces 23S RNA narnavirus) could be used to test if the eukaryotic cellular environment was able to induce G>U or C>U mutations in single-strand RNAs ( Figure 4A) . These viruses represented a strong experimental model for this question because they persist in yeast cells as naked RNA, without a capsid. 41 Therefore, the introduction of G>U and C>U mutations is dependent on mutagenic mechanisms within the yeast cellular environment, and without which the asymmetric accumulation of mutations will not be detectable ( Figure 4A ).

Both CirSeq 42 and ARC-seq 32 experiments were previously conducted in budding yeast, although the aims of the previous studies were to identify mutations in endogenous RNAs. We reasoned that some reads might be derived from the persistent yeast RNA viruses, which can be used to calculate the symmetry of mutations in the virus genomes.

While the yeast strain used in the CirSeq generated minimal reads from either 20S or 23S RNA narnaviruses, the ARC-seq data contained 43,034 sequencing reads from the 20S RNA narnavirus genome (but no reads from the 23S RNA narnavirus). Among the reads that mapped to the 20S RNA narnavirus genome, we identified a significantly higher abundance of G>U mutations than C>A mutations (46 vs. 22, P = 0.003, Fisher's exact J o u r n a l P r e -p r o o f test) and more C>U than G>A mutations (157 vs. 69, P = 1×10 −8 , Fisher's exact test, Figure 4B ). These results indicated that the yeast cellular environment was sufficient to induce the asymmetric emergence of G>U and C>U mutations in a positive-sense singlestranded RNA viral genome.

Postulating that mutagens in the cellular environment cannot discriminate endogenous and viral RNAs, we further predicted that G>U and C>U mutations should also be over-represented in yeast endogenous mRNAs. To test this prediction, we characterized the molecular spectrum of mRNA mutations in the same ARC-seq data set for the budding yeast. 32 The results showed that the molecular spectrum of mutations was highly similar between the 20S RNA narnavirus and that of the yeast-derived mRNAs (Pearson's correlation coefficient r = 0.93, P = 2×10 −5 , Figure 4C -D). Furthermore, G>U and C>U mutations occurred more frequently in the endogenous mRNAs than C>A and G>A mutations, respectively (P < 10 −2311 and 10 −9126 , respectively, Fisher's exact tests, Figure 4C) . A similar molecular spectrum of mRNA mutations was also observed in the CirSeq data for the budding yeast 42 (Pearson's correlation coefficient r = 0.84, P = 1×10 −3 , Figure S5 ). The similarities between these molecular spectra indicated the possibility that mutations were induced by the same mutagens in the cellular environment of yeast.

Although the particular mutagenic mechanisms that caused the observed asymmetric G>U mutations in the cellular environment remain unknown, we hypothesized that reactive oxygen species (ROS) could serve as a strong candidate, 43 particularly considering that oxidative stress is associated with the infection of some respiratory viruses. 44 Specifically, some ROS can oxidize guanine to 8-oxoguanine and thereby induce G>U transversions after an additional round of transcription. 45,46 Alternatively, we also suspected that chemicals with similar property to polycyclic aromatic hydrocarbons, which are also well-known to induce G>T somatic mutations in the lung cancer samples among tobacco smokers, 47 could serve as potential cytosolic mutagens of viral RNA. For J o u r n a l P r e -p r o o f C>U mutation, potential candidates that could induce its accumulation included RNA editing activity by cytidine deaminases. 48, 49 Given the broad array of potential mechanisms, we reasoned that regardless of the exact nature of the mutagens that caused asymmetric accumulation of G>U or C>U RNA mutations in the cellular environment, these mutagens were unlikely to discriminate between RNA and DNA. 50 Consequently, somatic mutations in DNA would also arise, 51-53 particularly in the coding strand, which is exposed to the cellular environment in the single-strand state during transcription 54-56 (illustrated in Figure 5A ). Based on this assumption, we investigated the capacity of the cellular environment to generate G>T and C>T somatic mutations in the coding strand of genomic DNA in order to subsequently infer its capacity to induce G>U and C>U mutations in RNA viruses. To this end, we retrieved the somatic mutations identified for each of the 36 human tissues 55 from publicly available Genotype-Tissue Expression (GTEx) data. 57 We characterized the molecular spectra of somatic mutations that emerged in these 36 human tissues ( Figure S6) and projected them into a two-dimensional space based on the levels of asymmetry in G>T (vs. C>A) and C>T (vs. G>A) base substitutions ( Figure   5B ). Among them, 20 tissues, including the lung, showed asymmetry in both G>T and C>T mutations, while 10 tissues showed asymmetry only in G>T mutations. The cellular environments of the remaining six tissues could not induce detectable asymmetry in either G>T or C>T. These results indicated that human tissues exhibit wide-ranging differences in their capacity to induce various types of de novo mutations.

To further investigate the cellular environment that likely drove SARS-CoV-2 evolution in human patients, we plotted the asymmetries of G>U and C>U for SARS-CoV-2 among-patient polymorphisms (abbreviated as pSCV2 in the figure) into the twodimensional space and found that the cellular environment where SARS-CoV-2 propagated in patients was most similar to that of the lung (Figure 5B ). This finding is in agreement with numerous reports that showed the airborne transmission 58 of SARS-CoV-2 and supports a cellular environment-dependent genomic evolution of SARS-CoV-2. 

Thus far, our results showed that the molecular spectrum of mutations that accumulated during SARS-CoV-2 evolution is reflective of the asymmetric emergence of de novo mutations (Figure 2 ) caused by host cellular environments (Figures 3-4) . Given that different types of base substitutions are disproportionately induced in various cell environments (Figure 5) , we reasoned that the ancestral cellular environment where SARS-CoV-2 propagated prior to its transmission to humans could, in principle, be inferred from the mutations in the SARS-CoV-2 genome which accumulated during that period. These mutations could be identified from a phylogenetic tree including SARS-CoV-2 and related coronaviruses.

We built an evolutionary tree ( Figure 6A ) including the last common ancestor of SARS-CoV-2 isolated from patients and its closely related coronaviruses isolated from Rhinolophus bats (RaTG13, RshSTT200, and ZC45) and pangolins (GD-1 and GX-P5L), using Rc-o319 from bats as an outgroup. 8,59- 63 We then reconstructed the ancestral sequence for each internal node (N1-N5) and determined which mutations accumulated in the evolutionary history represented by each branch in the phylogenetic tree (B1-B9, Figure 6A ). Based on the parsimony principle we labeled seven branches (B1-B4 and B6-B8) that represented the evolutionary history exclusively in the cellular environments of bats and two that represented a mixed evolutionary history in bats and pangolins (B5 and B9, Figure 6A) .

The results showed that the molecular spectra of the nine branches appeared similar ( Figure 6A , inset on top of each branch), and were highly correlated with each other ( Figure 6B) . However, these spectra were only moderately correlated with the spectrum of mutations that accumulated during SARS-CoV-2 evolution in human patients, the spectrum of de novo viral mutations in Vero cells, or the spectrum of somatic mutations in the lung (Figure 6B) . For example, the asymmetric emergence and accumulation of G>U mutations in the viral genome, which was observed in Vero cells ( Figure 1D ) and among human patients (Figure 2C) , respectively, was no longer detectable in the seven J o u r n a l P r e -p r o o f bat-exclusive or two bat-pangolin branches (P > 0.5 in all one-sided Fisher's exact tests).

This finding is in agreement with previous reports of low production of ROS and high concentrations of endogenous antioxidants in bat cells. 64 These observations indicated that bats (probably also pangolins) provided a cellular environment for the genomic evolution of RNA viruses that substantially differed from that of humans.

We determined the 529 base substitutions that apparently accumulated in the SARS-CoV-2 genome since its divergence from RaTG13 (represented by branch B0 in Figure   6A ). The molecular spectrum of this branch (Figure 6A , inset in the top left corner) was highly correlated with the bat-related branches (B1-B9), but showed a much lower correlation with the spectrum of mutations that accumulated in human patients (pSCV2, Figure 6B -C). The patterns held when only fragments of the SARS-CoV-2 genome (e.g., the coding sequence of the spike protein) were used for the comparison (Figure S7) .

These observations suggested that after its divergence from RaTG13, SARS-CoV-2 likely evolved in a host cellular environment similar to that of bats, prior to its zoonotic transfer into humans.

The apparent similarity in the molecular spectra between the mutations accumulated in branch B0 and those in the bat-related branches (B1-B9) could be either attributable to their propagation in a common cellular environment, or a common inherent mutational bias caused by shared genetic variation in the genes that modulate the relative rates of We constructed phylogenetic trees separately for SARS-CoV-related and MERS-CoV-related viruses, and labeled putative host species for each branch according to the parsimony principle ( Figure 7A ). If the molecular spectrum was shaped largely by the host cellular environment, we predicted that all branches representative of evolutionary J o u r n a l P r e -p r o o f history in the same host species would exhibit similar molecular spectra. On the contrary, if the molecular spectrum was an inherent feature encoded in the virus genome that becomes more distinct as genetic distance increases, we predicted that the lineages of SARS-CoV-2, SARS-CoV, and MERS-CoV would each exhibit their own mutational signatures.

To test these two predictions, we estimated the proportions of each base-substitution type for each branch (Figure 7A) . In order to visualize the similarity across molecular spectra, we performed principal component analyses, projecting these branches into a two-dimensional space (Figure 7B-C) . The results showed that 17 branches with a reported evolutionary history exclusive to bats clustered together (Figure 7B) , while the viruses from three distinct lineages of betacoronaviruses did not ( Figure 7C ). This observation indicated that the molecular spectrum of virus genome evolution mainly reflected the cellular environment where viruses propagated, rather than their phylogenetic relationship. We consequently used the 95% confidence ellipse estimated from these 17 bat-exclusive branches ( Figure 7B ) to define the borderline of the bat cellular environment in the two-dimensional space.

Branches that represented the host history in camels (B18 and B19) fell outside of the 95% confidence ellipse (Figure 7B ), indicating that camels had a distinct cellular environment from that of bats. Branches that represented host history entirely in humans (pSCV2), or a host history partly in humans (B10 for a SARS patient and pMERS for among-patient polymorphisms in MERS), also fell outside of the 95% confidence ellipse ( Figure 7B) . Notably, they appeared to cluster together with the spectra of de novo mutations detected in SARS-CoV-2 (mSCV2), Ebola virus (mEbola), or poliovirus (mPV) that was cultivated in primate cell lines. Furthermore, the 95% confidence ellipse of these six human-related molecular spectra was not overlapped with that estimated from the spectra of the 17 bat-exclusive branches (Figure 7B) , highlighting the potential application of our approach for detecting a jumping event from bats to a new host.

The branch leading to SARS-CoV-2 (B0) was located within the 95% confidence ellipse defined by the 17 bat-exclusive branches (Figure 7B) , and in particular, was J o u r n a l P r e -p r o o f within the 95% confidence ellipse defined by the 13 Rhinolophus-exclusive branches.

Considering the consistency of this approach in identifying well-established host jumping events (e.g., from bats to camels for MERS-CoV as shown in branches B18 and B19, and from bats to humans for SARS-CoV as shown in branch B10, Figure 7B ), we concluded that since its divergence from RaTG13, SARS-CoV-2 most likely propagated primarily in a cellular environment highly similar to bats, particularly Rhinolophus bats, prior to its zoonotic spillover into humans.

The central dogma of molecular biology asserts that the genetic information of cellular organisms is stored in DNA, which must be transcribed into mRNA for transmission of the genetic information into functional proteins. Although the presence of mRNA mutations has been confirmed in a few cellular organisms, 29,32,42,68,69 they affect only a limited number of proteins due to the transient nature of mRNA. However, for RNA viruses whose genomic information is stored in RNA, mutations in RNA can have a long-term influence because such mutations have a chance of being inherited. In this study, we exploit this influence to infer the evolutionary history for RNA viruses, in particular SARS-CoV-2.

It is worth noting that four out of five branches that represented a mixed host history in bats and in a non-human organism (pangolins in B5 and B9, civets in B11, camels in B20, and hedgehogs in B25) fell within the 95% confidence ellipse estimated from the 17 bat-exclusive branches (Figure 7B) Although this study focuses on the relative rates among base substitution types (i.e., molecular spectrum), we estimate an average rate of 1.73×10 −5 de novo mutations per nucleotide in SARS-CoV-2 ( Figure 1E) . Despite the proofreading mechanism 70 provided by its nonstructural protein 14, SARS-CoV-2 mutations still occur at a rate three to four orders of magnitude higher than DNA mutations, 29 which enable rapid immunological escape, while leave genomic integrity maintained by natural selection for infectivity.

In addition, our analyses focused on the detection of point mutations. Although not highlighted in the results, we also estimated the molecular spectrum of indels with a similar computational strategy that used junctions as molecular barcodes for the intermediate negative-sense subgenomic RNA (see Figure S2 and Material and methods). This work identified two small insertions and 96 deletions (Figure S3B-C) , the majority of which were shorter than six nucleotides.

Although the significant illness and death caused by the SARS-CoV-2-induced coronavirus disease 2019 pandemic has led to a multitude of studies of this virus, basic understanding is still lacking for several of its key features, such as its origin. 1-7 We show that the mutations accumulated in SARS-CoV-2 prior to its transmission to humans are fully consistent with a natural evolutionary process in a Rhinolophus bat host. In addition to this theoretical purport, our methods will also be useful to identify the natural hosts of other RNA viruses and could be potentially applied towards the prevention of future outbreaks. adrenal gland; 4: artery aorta; 5: artery coronary; 6: artery tibial; 7: brain caudate basal ganglia; 8: brain cortex; 9: brain frontal cortex BA9; 10: brain hippocampus; 11: brain hypothalamus; 12: brain nucleus accumbens basal ganglia; 13: brain putamen basal ganglia (C) The distribution of Pearson's correlation coefficient (r) for the bootstrapped mutation spectra. In all 10,000 paired bootstrapped observations, r(B0, B1) was greater than r(B0, pSCV2), meaning that the P value was smaller than 0.0001. Numbers in the brackets represent the 95% confident intervals (CI) of r. 

On the origins of SARS-CoV-2

The proximal origin of SARS-CoV-2

No credible evidence supporting claims of the laboratory engineering of SARS-CoV-2

Origins of SARS-CoV-2: Focusing on Science

On the origin of SARS-CoV-2-The blind watchmaker argument

Investigate the origins of COVID-19

Science, not speculation, is essential to determine how SARS-CoV-2 reached humans

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Exploring the natural origins of SARS-CoV-2 in the light of recombination

Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic

From animal to human: interspecies analysis provides a novel way of ascertaining and fighting COVID-19. The Innovation 1

PAML 4: phylogenetic analysis by maximum likelihood

Rate, molecular spectrum, and consequences of human mutation

Defining the impact of mutation accumulation on replicative lifespan in yeast using cancer-associated mutator phenotypes

Precise estimates of mutation rate and spectrum in yeast

Mutation rates and selection on synonymous mutations in SARS-CoV-2

Estimating the pattern of nucleotide substitution

Phylogenetic estimation of context-dependent substitution rates by maximum likelihood

COSMIC: the Catalogue Of Somatic Mutations In Cancer

The repertoire of mutational signatures in human cancer

Signatures of mutational processes in human cancer

Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins

Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats

Detection and Characterization of Bat Sarbecovirus Phylogenetically Related to SARS-CoV-2

A novel SARS-CoV-2 related coronavirus in bats from Cambodia

Bats oxidative stress defense

Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China

Middle East respiratory syndrome coronavirus infection in dromedary camels in Saudi Arabia

Origin and evolution of pathogenic coronaviruses

Lost in transcription: transient errors in information transfer

RNA polymerase errors cause splicing defects and can be regulated by differential expression of RNA polymerase subunits

Coronavirus biology and replication: implications for SARS-CoV-2