key: cord-1012868-z11exbyu authors: Wang, Hongru; Pipes, Lenore; Nielsen, Rasmus title: Synonymous mutations and the molecular evolution of SARS-Cov-2 origins date: 2020-10-02 journal: bioRxiv DOI: 10.1101/2020.04.20.052019 sha: f908113c74d56005693a3c9f9bdfb838310c8baf doc_id: 1012868 cord_uid: z11exbyu Human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is most closely related, by average genetic distance, to two coronaviruses isolated from bats, RaTG13 and RmYN02. However, there is a segment of high amino acid similarity between human SARS-CoV-2 and a pangolin isolated strain, GD410721, in the receptor binding domain (RBD) of the spike protein, a pattern that can be caused by either recombination or by convergent amino acid evolution driven by natural selection. We perform a detailed analysis of the synonymous divergence, which is less likely to be affected by selection than amino acid divergence, between human SARS-CoV-2 and related strains. We show that the synonymous divergence between the bat derived viruses and SARS-CoV-2 is larger than between GD410721 and SARS-CoV-2 in the RBD, providing strong additional support for the recombination hypothesis. However, the synonymous divergence between pangolin strain and SARS-CoV-2 is also relatively high, which is not consistent with a recent recombination between them, instead it suggests a recombination into RaTG13. We also find a 14-fold increase in the dN/dS ratio from the lineage leading to SARS-CoV-2 to the strains of the current pandemic, suggesting that the vast majority of non-synonymous mutations currently segregating within the human strains have a negative impact on viral fitness. Finally, we estimate that the time to the most recent common ancestor of SARS-CoV-2 and RaTG13 or RmYN02 based on synonymous divergence, is 51.71 years (95% C.I., 28.11-75.31) and 37.02 years (95% C.I., 18.19-55.85), respectively. The Covid19 pandemic is perhaps the biggest public health and economic threat that the world 23 has faced for decades ( donors associated with them. To identify possible viral strains that may have contributed, by 10 recombination, to the formation of human SARS-CoV-2, we searched NCBI and EMBL virus 11 entries along with GISAID Epiflu and EpiCov databases for similar sequences using BLAST in 12 100bp windows stepping every 10bp (Fig. 1b) . The majority of the genome (78.1%, 2330/2982 13 of the windows) has one unique best hit, likely reflecting the high genetic diversity of the 14 coronavirus. 21.9% of the genomic regions has multiple best hits, which suggests that these 15 regions might be more conserved. Among the windows with unique best hits, 97.0% 16 (2260/2330) of them were the RaTG13 or RmYN02 bat strains and 1.9% of them, including the 17 ACE2 contact residues region of the S protein, were the pangolin SARS-CoV-2 virus. These 18 observations are consistent with previous results that RaTG13 and RmYN02 are the most 19 closely related viral strains, while the region containing the ACE2 contact residues is more 20 closely related to the pangolin virus strain (Lam, et al. 2020 in some genomic regions may suggest recombination between the ancestral lineage of SARS-4 CoV-2 and distantly related virus lineages, although more formal analyses are needed to 5 determine the recombination history (see also Boni, et al. 2020 for further discussion). 6 Searching databases with BLAST using the most closely related viral strains, RaTG13 and 7 RmYN02, we observe a very similar pattern, as that observed for SARS-CoV-2, in terms of top 8 hits across the genome (Fig. 1b) , suggesting that these possible recombination events with 9 distantly related lineages are not unique to the SARS-CoV-2 lineage, but happened on the 10 ancestral lineage of SARS-CoV-2, RaTG13, and RmYN02. A notable exception is a large region 11 around the S gene, where RmYN02 show little similarity to both SARS-CoV-2 and RaTG13. 12 13 We focus further on studying the synonymous evolution of SARS-CoV-2, and analyzing Wuhan- We performed recombination analyses across the five viral genomes based on the 1 concensus of the seven recombination-detection methods implemented in RDP5 (see Methods). 2 We identified nine recombination regions affecting at least one of the sequences 3 (Supplementary Table 5 ). Phylogenetic analyses of these regions confirm phylogenetic 4 incongruence when compared with genome-wide trees ( Fig. 2 and Supplementary Figure 1-3) . 5 Particularly, a recombination signal is found in a region encompassing the RBD of the S protein, 6 suggesting that the human SARS-CoV-2 (Wuhan-Hu-1) sequence is a recombinant with the 7 Pangolin-CoV (GD410721) as the donor (Supplementary Table 5 consistently support RmYN02 as the nearest outgroup to human SARS-CoV-2, in contrast to 13 previous analyses before the discovery of RmYN02, which instead found RaTG13 to be the 14 nearest outgroup (Lam, et al. 2020; Wu, et al. 2020 ). This observation is also consistent with the 15 genome-wide phylogeny constructed in previous study (Zhou, Chen, et al. 2020) . 16 We plot the overall sequence similarity (% nucleotides identical) between SARS-CoV-2 17 GD410721 into an ancestor of the human strain (Lam, et al. 2020 ; Xiao, et al. 2020; Zhang, et 1 al. 2020), or alternatively, from some other species into RaTG13, as previously hypothesized 2 (Boni, et al. 2020) . We note that RmYN02 is not informative about the nature of this event as it 3 harbors a long and divergent haplotype in this region, possibly associated with another 4 independent recombination event with more distantly related viral strains (Fig. 1e) . The other 5 four sequences are all highly, and approximately equally, divergent from RmYN02 in this large 6 region ( Fig. 1e) , suggesting that the RmYN02 strain obtained a divergent haplotype from the 7 recombination event. When BLAST searching using 100-bp windows along the RmYN02 8 genome, we find no single viral genome as the top hit, instead the top hits are found 9 sporadically in different viral strains of the SARS-CoV lineage (Fig. 1f) , suggesting that the 10 sequence of the most proximal donor is not represented in the database. 11 12 While the overall divergence in the S gene encoding the spike protein could suggest the 14 presence of recombination in the region, previous study (Lam, et al. 2020 ) reported that the tree 15 based on synonymous substitutions supported RaTG13 as the sister taxon to the human SARS-16 CoV-2 also in this region. That would suggest the similarity between GD410721 and human 17 SARS-CoV-2 might be a consequence of convergent evolution, possibly because both strains 18 adapted to the use of the same receptor. An objective of the current study is to examine if there 19 are more narrow regions of the spike protein that might show evidence of recombination. We 20 investigate this issue using estimates of synonymous divergence per synonymous site (d S ) in 21 sliding windows of 300 bp. However, estimation of d S is complicated by the high levels of 22 divergence and extremely skewed nucleotide content in the 3rd position of the sequences 23 (Table 1) which will cause a high degree of homoplasy. We, therefore, entertain methods for 24 estimation that explicitly account for unequal nucleotide content and multiple hits in the same shown that for short sequences, some counting methods, such as the YN00 method, can 1 perform better in terms of Mean Squared Error (MSE) for estimating d N and d S (Yang and 2 Nielsen 2000) . However, it is unclear in the current case how best to estimate d S . For this 3 reason, we performed a small simulations study (see Methods) for evaluating the performance 4 of the maximum likelihood (ML) estimator of d N and d S (as implemented in codeml (Yang 2007) ) 5 under the F3x4 model and the YN00 method implemented in PAML. In general, we find that 6 estimates under the YN00 are more biased with slightly higher MSE than the ML estimate for 7 values in the most relevant regime of d S < 1.5 (Fig. 3) . However, we also notice that both 8 estimators are biased under these conditions. For this reason, we perform a bias correction 9 calibrated using simulations specific to the nucleotide frequencies and d N /d S ratio observed for 10 that there is a trade-off between mean and variance ( Fig. 3) SARS-CoV-2 and RmYN02. However, a substantial amount of this divergence might be caused 23 by recombination with more divergent strains. We, therefore, also estimate d N and d S for the 24 regions with inferred recombination tracts (Supplementary Table 5 ) removed from all sequences (Table 3) . We then find values of d S = 0.1462 (95% C.I., 0.1340-0.1584) and d S = 1 0.1117 (95% C.I., 0.1019-0.1215) between SARS-CoV-2 and RaTG13 and RmYN02, 2 respectively. This confirms that RmYN02 is the virus most closely related to SARS-CoV-2. The 3 relative high synonymous divergence also shows that the apparent high nucleotide similarity 4 between SARS-CoV-2 and the bat strains (96.2% (Zhou, Yang, et al. 2020 ) and 97.2% (Zhou, 5 Chen, et al. 2020)) is caused by conservation at the amino acid level (d N / d S = 0.0410 and 6 0.0555) exacerbated by a high degree of synonymous homoplasy facilitated by a highly skewed 7 nucleotide composition at the third position of codons (with an AT content >72%, Table 1 ). 8 The synonymous divergence to the pangolin sequences GD410721 and GX_P1E in 9 genomic regions with inferred recombination tracts removed is 0.5095 (95% C.I., 0.4794-10 0.5396) and 1.0304 (95% C.I., 0.9669-1.0939), respectively. Values for other comparisons are 11 shown in Tables 2 and 3. In comparisons between SARS-CoV-2 and more distantly related 12 strains, d S will be larger than 1, and with this level of saturation, estimation of divergence is 13 associated with high variance and may be highly dependent on the accuracy of the model 14 assumptions. This makes phylogenetic analyses based on synonymous mutations unreliable 15 when applied to these more divergent sequences. Nonetheless, the synonymous divergence 16 levels seem generally quite compatible with a molecular clock with a d S of 0.9974 (95% C.I., bp sliding windows along the genome. Notice that we truncate the estimate of d S at 3.0. 1 Differences between estimates larger than 2.0 should not be interpreted strongly, as these 2 estimates have high variance and likely will be quite sensitive to the specifics of the model 3 assumptions. 4 We find that d S (SARS-CoV-2, GD410721) approximately equals d S (GD410721, 5 RaTG13) and is larger than d S (SARS-CoV-2, RaTG13) in almost the entire genome showing 6 than in these parts of the genome GD410721 is a proper outgroup to (SARS-CoV-2, RaTG13) 7 assuming a constant molecular clock. One noticeable exception from this is the RBD region of 8 the S gene. In this region the divergence between SARS-CoV-2 and GD410721 is substantially 9 lower than between GD410721 and RaTG13 (Fig. 4a,4c) . The same region also has much 10 smaller divergence between SARS-CoV-2 and GD410721 than between SARS-CoV-2 and 11 RaTG13 (Fig. 4a,4c) . The pattern is quite different than that observed in the rest of the genome, 12 most easily seen by considering the ratio of d S (SARS-CoV-2, GD410721) to d S (SARS-CoV-2, 13 RaTG13) (Fig. 2b, 2d) . In fact, the estimates of d S (SARS-CoV-2, RaTG13) are saturated in this 14 region, even though they are substantially lower than 1 in the rest of the genome. This strongly 15 suggests a recombination event in the region and provides independent evidence of that 16 previously reported based on amino acid divergence (e.g., (Zhang, et al. 2020) ). 17 The combined evidences from synonymous divergence and the topological 18 recombination inference, provide strong support for the recombination hypothesis. However, 19 these analyses alone do not distinguish between recombination into RaTG13 from an unknown 20 source as previously hypothesized (Boni, et al. 2020 ) and recombination between SARS-CoV-2 21 and GD410721 as proposed as one possible explanation by Lam et al. (Lam, et al. 2020) . To 22 distinguish between these hypotheses we searched for sequences that might be more closely 23 related, in the RBD region, to RaTG13 than SARS-CoV-2 and we plotted sliding window 24 similarities across the genome for RaTG13 (Fig. 1c) . We observe relatively low sequence 25 identity between RaTG13 and all three other strains in the ACE2 contact residue region of the spike protein, which is more consistent with the hypothesis of recombination into RaTG13, as 1 proposed in (Boni, et al. 2020 ). Moreover, our BLAST search analyses of RaTG13 in this region 2 show highest local sequence similarity with GX pangolin virus strains which is the genome-wide 3 outgroup for the three other sequences (Lam, et al. 2020 ). This observation is more compatible 4 with the hypothesis of recombination from a virus related to GX pangolin strains, than with 5 recombination between SARS-CoV-2 and GD410721. 6 Unfortunately, because of the high level of synonymous divergence to the nearest 7 outgroup, tree estimation in small windows is extremely labile in this region. In fact, synonymous 8 divergence appears fully saturated in the comparison with GX_P1E, eliminating the possibility to 9 infer meaningful trees based on synonymous divergence. However, we can use the overall 10 maximum likelihood tree using both synonymous and nonsynonymous mutations (Fig. 2d) . The 11 ML tree using sequence from the ACE2 contact residue region supports the clustering of SARS-12 CoV-2 and GD410721, but with unusual long external branches for all strains except SARS-13 CoV-2, possibly reflecting smaller recombination regions within the ACE2 contact residue 14 region. 15 16 The use of synonymous mutations provides an opportunity to calibrate the molecular clock 18 without relying on amino acid changing mutations that are more likely to be affected by 19 selection. The rate of substitution of weakly and slightly deleterious mutations is highly 20 dependent on ecological factors and the effective population size. Weakly deleterious mutations 21 are more likely to be observed over small time scales than over long time scales, as they are 22 unlikely to persist in the population for a long time and go to fixation. This will lead to a 23 decreasing d N /d S ratio for longer evolutionary lineages. Furthermore, changes in effective 24 population size will translate into changes in the rate of substitution of slightly deleterious 25 mutations. Finally, changes in ecology (such as host shifts, host immune changes, changes in cell surface receptor, etc.) can lead to changes in the rate of amino acid substitution. For all of 1 these reasons, the use of synonymous mutations, which are less likely to be the subject of 2 selection than nonsynonymous mutations, are preferred in molecular clock calculations. For 3 many viruses, the use of synonymous mutations to calibrate divergence times is not possible, 4 as synonymous sites are fully saturated even at short divergence times. However, for the 5 comparisons between SARS-CoV-2 and RaTG13, and SARS-CoV-2 and RmYN02, 6 synonymous sites are not saturated and can be used for calibration. We find an estimate of ω = To calibrate the clock we use the estimate provided by (http://virological.org/t/phylodynamic-21 analysis-of-sars-cov-2-update-2020-03-06/420) of =1.04×10 -3 substitutions/site/year (95% CI: 22 0.71x10-3, 1.40x10-3). The synonymous specific mutation rate can be found from this as 23 d S /year = S = /(pS +ωpN), where ω is the d N /d S ratio, and pN and pS are the proportions of 24 nonsynonymous and synonymous sites, respectively. The estimate of the total divergence on 25 the two lineages is then = + . Inserting the numbers from Table 3 for the 26 divergence between SARS-CoV-2 and RaTG13 and RmYN02 ,respectively, we find a total 1 divergence of 96.92 years and 74.05 years respectively. Taking into account that RaTG13 was 2 isolated July 2013, we find an estimated tMRCA between that strain and SARS-CoV-2 of 3 =(96.92 +6.5)/2 = 51.71 years. Similarly, we find an estimate of divergence between SARS-4 CoV-2 and RmYN02 of =74.05/2 = 37.02 years, assuming approximately equal sampling 5 times. The estimate for SARS-CoV-2 and RaTG13 is compatible with the values obtained using 6 different methods for dating (Boni, et al. 2020) . The variance in the estimate in d S is small and 7 the uncertainty is mostly dominated by the uncertainty in the estimate of the mutation rate. We Table 5 ), the estimate is 55.02 years with an approx. 95% C.I. of (29.4, 80.7). 17 As more SARS-CoV-2 sequences are being obtained, providing more precise estimates of the 18 mutation rate, this confidence interval will become narrower. However, we warn that the 19 estimate is based on a molecular clock assumption and that violations of this assumption 20 eventually will become a more likely source of error than the statistical uncertainty quantified in 21 the calculation of the confidence intervals. We also note that, so far, we have assumed no 22 variation in the mutation rate among synonymous sites. However, just from the analysis of the 23 300 bp windows, it is clear that is not true. The variance in the estimate of d S among 300 bp 24 windows from the RaTG13-SARS-CoV-2 comparison is approximately 0.0113. In contrast, in suggesting substantial variation in the synonymous mutation rate along the length of the 1 genome. Alternatively, this might be explained by undetected recombination in the evolutionary 2 history since the divergence of the strains. This strongly suggests that the previously reported amino acid similarity between pangolin 20 viruses and SARS-CoV-2 is not due to convergent evolution, but more likely is due to 21 recombination. In the recombination analysis, we identified recombination from pangolin strains 22 into SARS-CoV-2, which provides further support for the recombination hypothesis. However, 23 we also find that the synonymous divergence between SARS-CoV-2 and pangolin viruses in this 24 region is relatively high, which is not consistent with a recent recombination between the two. It 25 instead suggests that the recombination was into RaTG13 from an unknown strain, rather than between pangolin viruses and SARS-CoV-2, as proposed in (Boni, et al. 2020 ). The alternative 1 explanation of recombination into SARS-CoV-2 from the pangolin virus, would require the 2 additional assumption of a mutational hotspot to account for the high level of divergence in the 3 region between SARS-CoV-2 and the donor pangolin viral genome. To fully distinguish between 4 these hypotheses, additional strains would have to be discovered that either are candidates for 5 introgression into RaTG13 or can break up the lineage in the phylogenetic tree between 6 pangolin viruses and RaTG13. 7 The fact that synonymous divergence to the outgroups, RaTG13 and RmYN02, is not 8 fully saturated, provides an opportunity for a number of different analyses. First, we can date the 9 time of the divergence between the bat viruses and SARS-CoV-2 using synonymous mutations 10 alone. In doing so, we find estimates of 51.71 years (95% C.I., 28.11-75.31) and 37.02 years 11 (95% C.I., 18.19-55.85), respectively. Most of the uncertainty in these estimates comes from 12 uncertainty in the estimate of the mutation rate reported for SARS-CoV-2. As more data is being 13 produced for SARS-CoV-2, the estimate should become more precise and the confidence 14 interval significantly narrowed. We note that the mutation rate we use here are estimated based 15 on the entire genome, which may differ from that in non-recombination regions. To address this 16 problem, we downloaded all the SARS-CoV-2 sequences that are available until 2020-08- 17 17 from GISAID, and obtained an estimate of 1:0.81 for the ratio of mutation rates in the 18 recombination and non-recombination regions, using the "GTRGAMMA" model implemented in 19 the RAxML (Stamatakis 2014). Given the length ratio between the two partitions is 1:4, the 20 difference between the partitions will cause a slight overestimate of the mutation rate by ~5%, 21 which is relatively small compared to the confidence intervals and the potential for other 22 unknown sources of uncertainty. However, we warn that a residual cause of unmodeled 23 statistical uncertainty is deviations from the molecular clock. Variation in the molecular clock 24 could be modeled statistically (see e.g., (Drummond, et al. 2006 ) and (Lartillot, et al. 2016) ), but 25 the fact that synonymous mutations are mostly saturated for more divergent viruses that would be needed to train such models, is a challenge to such efforts. On the positive side, we note that 1 the estimates of d S given in Table 3 in general are highly compatible with a constant molecular 2 clock. Boni et al. (Boni et al. 2020) obtained divergence time estimates similar to ours using a 3 very different approach based on including more divergent sequences and applying a relaxed 4 molecular clock. We see the two approaches as being complimentary. In the traditional relaxed 5 molecular clock approach more divergent sequences are needed that may introduce more 6 uncertainty due to various idiosyncrasies such as alignment errors. Furthermore, the relaxed 7 molecular clock uses both synonymous and non-synonymous mutations and is, therefore, more 8 susceptible to the effects of selection. Our approach allows us to focus on just the relevant in-9 group species and to use only synonymous mutations. The disadvantage is that we cannot 10 accommodate a relaxed molecular clock. However, the fact that both approaches provide 11 similar estimates is reassuring and suggests that neither idiosyncrasies of divergent sequences, 12 natural selection, or deviations from a molecular clock has led to grossly misleading conclusions 13 Another advantage of estimation of synonymous and nonsynonymous rates in the 14 outgroup lineage, is that it can provide estimates of the mutational load of the current pandemic. 15 The d N /d S ratio is almost 14 times larger in the circulating SARS-CoV-2 strains than in the 16 outgroup lineage. While some of this difference could possibly be explained by positive 17 selection acting at a higher rate after zoonotic transfer, it is perhaps more likely that a 18 substantial proportion of segregating nonsynonymous mutations are deleterious, suggesting a 19 very high and increasing mutation load in circulating SARS-CoV-2 strains. 20 21 The pangolin virus sequences, GD410721 and GX_P1E, were downloaded from GISAID with 2 accession numbers EPI_ISL_410721 and EPI_ISL_410539, respectively, and RmYN02 3 sequence was provided by E. C. Holmes. All other sequences analyzed in this study were 4 downloaded from either NCBI Genbank or National Microbiology Data Cente (NMDC). The 5 accession codes for non-human sequences can be found in Supplementary Table 2 and the 6 accession codes for human sequences can be found in Supplementary Table 3 . replicates. The associated distance matrix for (b) and (c) can be found in Table 3 . Basic local alignment search tool Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 An exact nonparametric method for inferring mosaic 15 structure in sequence triplets Hearing silence: non-neutral evolution at 17 synonymous sites in mammals Relaxed phylogenetics and dating with 19 confidence characterization and infectivity of a novel SARS-like coronavirus in Chinese bats From SARS and MERS CoVs to SARS CoV-2: Moving toward more biased codon usage in viral structural and nonstructural genes MAFFT multiple sequence alignment software version 7: 9 improvements in performance and usability Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins A mixed relaxed clock model Emergence of SARS-CoV-2 through recombination and strong purifying 19 selection Phylogeny-aware alignment with PRANK nonsynonymous nucleotide substitution rates, with application to the chloroplast genome IQ-TREE: a fast and effective 4 stochastic algorithm for estimating maximum-likelihood phylogenies Possible emergence of new geminiviruses by 6 frequent recombination Recombination and lineage-specific 8 mutations led to the emergence of SARS-CoV-2 Evaluation of methods for detecting recombination from DNA 10 sequences: computer simulations Identification of breakpoints in 12 intergenotypic recombinants of HIV type 1 by bootscanning Analyzing the mosaic structure of genes Detection of novel coronaviruses in bats in Myanmar Receptor Recognition by the Novel 19 Coronavirus from Wuhan: an Analysis Based on Decade-Long Structural Studies of SARS 20 Codon-substitution models for 1 heterogeneous selection pressure at amino acid sites COVID-19 Outbreak A Genomic Perspective on the Origin and Emergence of SARS Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein A pneumonia outbreak associated with a new coronavirus of probable bat origin The mean of d S estimates using different methods; ML.corr and yn00.corr are the bias corrected versions of the ML and yn00 methods, respectively. (b) Errors in d S estimates as measured using the ratio of square root of mean squared error (MSE) to true d S . All the estimates are based on 10,000 simulations. ML: maximum-likelihood estimates using the f3x4 model in codeml; ML.corr, maximumlikelihood estimates with bias correction; yn00, count-based estimates in (Yang and Nielsen 2000); yn00.corr, yn00 estimates with bias correction