key: cord-0257763-0akpus5m authors: Tay, J. H.; Porter, A. F.; Wirth, W.; Duchene, S. title: The emergence of SARS-CoV-2 variants of concern is driven by acceleration of the evolutionary rate date: 2021-08-31 journal: nan DOI: 10.1101/2021.08.29.21262799 sha: 0e3edd05ec3f0f55182ca938339b15d94d19cc9b doc_id: 257763 cord_uid: 0akpus5m

The ongoing SARS-CoV-2 pandemic has seen an unprecedented amount of rapidly generated genome data. These data have revealed the emergence of lineages with mutations associated with transmissibility and antigenicity, known as variants of concern (VOCs). A striking aspect of VOCs is that many of them involve an unusually large number of defining mutations. Current phylogenetic estimates of the evolutionary rate of SARS-CoV-2 suggest that its genome accrues around 2 mutations per month. However, VOCs can have around 15 defining mutations and it is hypothesised that they emerged over the course of a few months, implying that they must have evolved faster for a period of time. We analysed genome sequence data from the GISAID database to assess whether the emergence of VOCs can be attributed to changes in the evolutionary rate of the virus and whether this pattern can be detected at a phylogenetic level using genome data. We fit a range of molecular clock models and assessed their statistical fit. Our analyses indicate that the emergence of VOCs is driven by an episodic increase in the evolutionary rate, to around 4-fold the background phylogenetic rate estimate, that may have lasted several weeks or months. These results underscore the importance of monitoring the molecular evolution of the virus as a means of understanding the circumstances under which VOCs may emerge.

Interestingly, FLC models where VOC clades were defined as foreground had decisively lower statistical

Table 1: Model selection results for complete genomes. Estimates of log marginal likelihoods using path sampling and stepping-stone (ps logML and ss logML, respectively).
Log Bayes factors (BF) are shown for the best-fitting model, relative to all others (larger numbers mean lower statistical fit), and thus they are 0.0 for the top model.

The FLC shared stems model had a mean background evolutionary rate of 0.58×10⁻³ subs/site/year (95% CI: 0.51-0.65×10⁻³), while that for the VOC stems was 2.45×10⁻³ subs/site/year (95% CI: 1.15-4.72×10⁻³). As such, the VOC stems rate was around 4-fold higher than the background (mean 4.25, 95% CI: 2.61-8.19) (Fig 1). Although the FLC stems model that assigned each VOC stem branch a different rate had very high uncertainty, it also suggested much higher rates for these branches. Clearly, these estimates were several fold higher than that of the background branches, and in spite of their high uncertainty at least 0.90 of the posterior density was above the mean background rate (Fig 1).

This preprint (which was not certified by peer review) was posted August 31, 2021; https://doi.org/10.1101/2021.08.29.21262799. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The coefficient of rate variation for both relaxed clock models, UCG and UCLN, was indicative of departure from clocklike evolution in the data. To investigate whether VOC stem branch rates differed from the rest, we extracted individual branch rates and compared the VOC stem branch rates to the mean of all other branches. We found evidence that VOC stem branch rates were higher than the mean of other branches, with higher mean values, but very high uncertainty and 95% credible intervals that overlapped with the mean of other branches (Fig 2).
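The roughly 4-fold difference between the VOC stem rate and the background rate reported above can be checked with a short calculation. The sketch below is illustrative only: the function name `ratio_summary` is hypothetical, the equal-tailed interval is an assumption (the source does not say which interval type it reports), and the point estimate uses only the two posterior means quoted in the text. A ratio of posterior means need not equal the posterior mean of the per-sample ratio, which is one reason the value below differs slightly from the reported mean of 4.25.

```python
from statistics import mean


def ratio_summary(foreground, background, mass=0.95):
    """Summarise foreground/background rate ratios from paired posterior draws.

    Returns (mean ratio, lower bound, upper bound) using an equal-tailed
    interval taken from order statistics. This is an assumed summary, not
    necessarily the one used in the source.
    """
    if len(foreground) != len(background):
        raise ValueError("samples must be paired, one ratio per posterior draw")
    ratios = sorted(f / b for f, b in zip(foreground, background))
    n = len(ratios)
    lower = ratios[int((1 - mass) / 2 * (n - 1))]
    upper = ratios[int(round((1 + mass) / 2 * (n - 1)))]
    return mean(ratios), lower, upper


# Posterior means quoted in the text (subs/site/year).
background_mean = 0.58e-3
voc_stem_mean = 2.45e-3

# Ratio of the two posterior means, a rough check on the "around 4-fold" claim.
point_ratio = voc_stem_mean / background_mean
print(f"ratio of posterior means: {point_ratio:.2f}")
```

With real output from a clock analysis, `ratio_summary` would be called on the two rate columns of the posterior sample, keeping draws paired so that each ratio comes from a single sampled state.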
The mean evolutionary rate of branches other than the VOC stems was 0.65×10⁻³ subs/site/year (95%

substitutions along the complete genome. We focus on the best-fitting model (FLC shared stems), with similar results for the second-best model (FLC stems). The duration of time along these branches represents the time required before VOCs started to diversify, but it is important to note that these durations are contingent on sampling bias, and could therefore be shorter than estimated here. Under the FLC shared stems model, the durations of the stem branches leading up to VOCs were: 14 weeks (95% CI: 6-24) for Alpha, 4 (95% CI: 2-8) for Beta, 17 (95% CI: 8-28) for Gamma, and 6 (95% CI: 3-11) for Delta (Supplementary material Fig S2).

The expected numbers of substitutions along the complete genome were: 21 (95% CI: 14-32) for Alpha, 6 (95% CI: 3-11) for Beta, 26 (95% CI: 18-35) for Gamma, and 9 (95% CI: 6-16) for Delta. Although these numbers are loosely associated with the defining mutations, they are not directly comparable because they involve substitutions along the entire genome and they correspond to the inference from a standard phylogenetic substitution model (the GTR+Γ in this case).

(MacLean et al., 2021). We suggest that model testing may be preferable to using highly parametric models, such as relaxed molecular clock models, for this purpose, because they tend to

constraining monophyly in VOCs, which we also did for other clock models to ensure that the prior on tree topology was the same.

We used the default priors for the substitution model.
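The expected numbers of substitutions reported for the VOC stem branches can be approximated as rate × branch duration × genome length. The sketch below uses the VOC stem rate and the mean stem durations quoted in the text. The genome length of 29,903 nt (the Wuhan-Hu-1 reference length) and the 52-week year are assumptions, not values taken from the source, and the names `expected_substitutions` and `STEM_WEEKS` are hypothetical.

```python
GENOME_LENGTH = 29_903   # assumption: Wuhan-Hu-1 reference length in nucleotides
WEEKS_PER_YEAR = 52.0    # assumption: simple conversion from weeks to years
VOC_STEM_RATE = 2.45e-3  # subs/site/year, FLC shared stems mean from the text

# Mean stem-branch durations in weeks, as quoted in the text.
STEM_WEEKS = {"Alpha": 14, "Beta": 4, "Gamma": 17, "Delta": 6}


def expected_substitutions(rate, weeks, genome_length=GENOME_LENGTH):
    """Expected substitutions on a branch: rate x time (in years) x sites."""
    return rate * (weeks / WEEKS_PER_YEAR) * genome_length


for name, weeks in STEM_WEEKS.items():
    value = expected_substitutions(VOC_STEM_RATE, weeks)
    print(f"{name}: about {value:.1f} substitutions over {weeks} weeks")
```

These point calculations come out close to, but not identical with, the reported posterior means, as expected when a product of posterior means is compared with the posterior mean of a product and when the alignment length may differ from the assumed reference length.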
The coalescent exponential tree prior has two parameters; the standard deviation of the lognormal distribution, and the shape of the Γ distribution. For these parameters we specified an exponential prior with mean 0.33. We ran our analyses using a Markov chain Monte Carlo of length 5×10⁷, sampling every 5×10³ steps and discarding 10% of the chain as burn-in. We repeated the analyses once to verify convergence of independent chains and we ensured that the effective sample size of all parameters was at least 200.

to assess their variance. Our model testing approach considered the UCLN, SC, and all FLC models in Table 1 and Supplementary material. We did not calculate log marginal likelihoods for the RLC because this is a model averaging method, where the number of parameters is less tractable than in other models. As a result it is difficult to conceive proper priors for all parameters, which is a fundamental aspect of Bayesian model

Bayes factors
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
SARS-CoV-2 evolution during treatment of chronic infection
SARS-CoV-2 variants of interest and concern naming scheme conducive for global discourse
Spatiotemporal invasion dynamics of SARS-CoV-2 lineage B.1.1.7 emergence
Computing Bayes factors using thermodynamic integration
Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen
SARS-CoV-2 viral variants-tackling a moving target