key: cord-0260577-bg2bo2v3
authors: Ghafari, Mahan; Liu, Qihan; Dhillon, Arushi; Katzourakis, Aris; Weissman, Daniel B
title: Investigating the evolutionary origins of the first three SARS-CoV-2 variants of concern
date: 2022-05-11
journal: bioRxiv
DOI: 10.1101/2022.05.09.491227
sha: 3da0068b6edcafd412252a733883c8fa114b8a1d
doc_id: 260577
cord_uid: bg2bo2v3

The emergence of Variants of Concern (VOCs) of SARS-CoV-2 with increased transmissibility, immune evasion properties, and virulence poses a great challenge to public health. Despite unprecedented efforts to increase genomic surveillance, fundamental facts about the evolutionary origins of VOCs remain largely unknown. One major uncertainty is whether the VOCs evolved during transmission chains of many acute infections or during long-term infections within single individuals. We test the consistency of these two possible paths with the observed dynamics, focusing on the clustered emergence of the first three VOCs, Alpha, Beta, and Gamma, in late 2020, following a period of relative evolutionary stasis. We consider a range of possible fitness landscapes, in which the VOC phenotypes could be the result of single mutations, multiple mutations that each contribute additively to increasing viral fitness, or epistatic interactions among multiple mutations that do not individually increase viral fitness, a “fitness plateau”. Our results suggest that the timing and dynamics of the VOC emergence, together with the observed number of mutations in VOC lineages, are in best agreement with the VOC phenotype requiring multiple mutations and VOCs having evolved within single individuals with long-term infections.

For the first 8 months of the SARS-CoV-2 pandemic, the virus exhibited a very slow pace of adaptation, with D614G being the only persistent adaptive substitution that appears to have resulted in an increased transmissibility of the virus [1-3].
However, during the second half of 2020, three designated variants of concern (VOCs) of SARS-CoV-2, Alpha, Beta, and Gamma, emerged independently and in quick succession [4-6]. No other VOC emerged until Delta and Omicron in 2021, which appear to be very different, both genetically and phenotypically, from the three original VOCs [7, 8]. The VOCs are characterised by a large number of mutations relative to the genetic background from which they first emerged, and exhibit altered phenotypes resulting in varying combinations of increased transmissibility, virulence, and immune evasion [6, 9-11].

Phylogenetic analyses show that a large number of mutations, mostly located in the spike protein, have independently evolved in multiple lineages of SARS-CoV-2, including the Alpha, Beta, and Gamma variants, and are likely playing a key role in the adaptive evolution of SARS-CoV-2 [7, 12]. Experimental measurements and molecular dynamics simulations also show that some of these mutations have synergistic interactions for important functional traits [13, 14], indicating that they may have a greater combined fitness benefit to the virus. Some of the distinctive mutations in the VOCs, including the E484K and N501Y mutations found in the first three VOCs, have also been observed in chronic infections such as those in certain immunocompromised individuals [15-17], suggesting that the VOCs may have arisen from such infections. Other possible explanations for the emergence of VOCs include prolonged circulation of the virus in areas of the world with poor genomic surveillance, or reverse zoonosis into other animals such as rodents, followed by sustained transmission and adaptive evolution within the animal population and a spillover back into humans (see [18] for a recent review on the possible origins of variants of SARS-CoV-2).
While finding the evolutionary process(es) that may have led to the emergence of VOCs has profound consequences for understanding the fate of the SARS-CoV-2 pandemic, there have so far been no systematic investigations to assess the likelihood of any particular evolutionary pathway that would lead to the emergence of VOCs. In this work we investigate whether the emergence of VOCs was the result of evolution via sustained transmission chains between acutely infected individuals or prolonged infections, and evaluate plausible fitness landscapes. We also discuss the potential implications of our results for the future of the pandemic and potential measures that might lower the rate at which new VOCs emerge.

Emergence of VOCs: an evolutionary puzzle

The Alpha, Beta, and Gamma VOCs arose independently and in quick succession, with several shared mutations, in three different countries and began to spread globally (Figure 1). This long waiting time followed by clustered emergence of a handful of lineages was not predicted by any simple evolutionary theory. Typically, one would assume that either the beneficial mutation supply is small, in which case one expects a long waiting time for the first VOC but also long gaps before subsequent VOCs, or the mutation supply is large, in which case one expects many VOCs with only a short waiting time [19]. Moreover, each VOC had >6-10 mutations distinguishing it from then-dominant genotypes, which was also unexpected. One of the key evolutionary questions is whether VOCs evolved over the course of many acute infections or within single chronically infected hosts. Both possibilities have serious issues. The many-acute-infections hypothesis needs to explain how the virus acquired so many changes, as the mutant lineages would have had to remain at frequencies below the detection threshold in different countries for several months.
The chronic-infection hypothesis needs to explain both why adaptation to the within-host environment led to a transmission advantage between hosts, and why there was no 'leakage' of some intermediate mutations at the between-host level before the emergence of the VOCs, i.e., why genotypes with some of the VOC mutations did not escape from the chronically infected patients earlier.

Between-host model of VOC emergence

We assume the effective virus population size is Ne = N/σ², where N is the number of infectious individuals worldwide and σ² is the variance in offspring number (secondary cases). We treat each acute infection as one generation, assuming a tight transmission bottleneck of a single virion [20-22]. Viruses mutate at rate μ per base per generation (see Methods section). For a mutant virus population with selective advantage s relative to the background, the average number of secondary cases increases by a factor 1+s. We also assume that the number of secondary cases approximately follows a negative binomial distribution with mean Rt and dispersion parameter k, so that σ² ≈ Rt(1 + Rt/k). There is substantial uncertainty in the amount of overdispersion in the pandemic, and consequently similar uncertainty in the effective population size. Therefore, we consider a range of values for k to see if any would be consistent with the observed dynamics of the VOC emergence. We also note that while the importance of spatial structure is clearly visible in the spatially restricted initial spread of the VOCs from real-world data, we expect that we can neglect it when analysing their emergence. This is because spatial structure should not have a large impact on viral dynamics until a lineage becomes locally common, and the specific mutations differentiating the VOCs were all locally rare prior to their emergence.

infections.
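As a rough illustration of the between-host model's effective population size, the following Python sketch evaluates σ² ≈ Rt(1 + Rt/k) and Ne = N/σ². The function names and the example values (N = 10⁶, Rt = 1, and the listed values of k) are illustrative assumptions and are not taken from Table 1.

```python
def offspring_variance(r_t: float, k: float) -> float:
    """Approximate variance in secondary cases: sigma^2 ~ Rt * (1 + Rt / k)."""
    if r_t <= 0 or k <= 0:
        raise ValueError("r_t and k must be positive")
    return r_t * (1.0 + r_t / k)


def effective_population_size(n_infectious: float, r_t: float, k: float) -> float:
    """Effective population size Ne = N / sigma^2."""
    return n_infectious / offspring_variance(r_t, k)


if __name__ == "__main__":
    # Hypothetical values: smaller k means more overdispersion and a smaller Ne.
    for k in (0.1, 0.5, 1.0):
        ne = effective_population_size(1e6, 1.0, k)
        print(f"k = {k}: Ne ~ {ne:.3g}")
```

The sketch makes the dependence explicit: for fixed N and Rt, lowering k (stronger overdispersion) increases σ² and therefore reduces Ne, which is why a range of k values corresponds to a range of effective population sizes.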
Because there is very limited data with which to constrain the within-host evolutionary dynamics of chronic infections with SARS-CoV-2, we simply treat it as a 'black box' and assume that, with some probability Pf, a new infection is chronic and may lead to the production of a VOC (Table 1; Methods section). We also assume that within-host substitutions required for the production of the VOC occur at a constant rate μC per generation (see Table 1). (Here a generation is still defined as the typical length of an acute infection.) Given that we know only three VOC lineages emerged by late 2020, we expect Tobs N Pf ~ 3, where Tobs ~ 180-317 days is the expected time to the emergence of the first VOC since the beginning of the pandemic based on phylogenetic estimates (see Table 1). Therefore, given the typical variation in the population size throughout the pandemic for biologically relevant parameter combinations, N ~ 1×10⁶ to 1×10⁷, we expect that values of Pf ~ 5×10⁻⁹ to 1×10⁻⁷ will maximize the likelihood of the within-host model, and we focus on these.

Fitness landscapes

One possible explanation for the temporal clustering of VOCs with large numbers of mutations is that the underlying fitness landscape may have some structure that causes the dynamics to deviate from our usual expectations. Unfortunately, the full space of possible fitness landscapes is enormous and impossible to explore exhaustively. To investigate the possible effects of the landscape on the dynamics, we therefore focus on three limiting local fitness landscapes that span a range of biologically plausible scenarios (Figure 1A). Importantly, these landscapes describe only between-host fitness, which could be very different from within-host fitness. As mentioned above, we treat within-host dynamics implicitly using an effective substitution rate and so do not need an explicit fitness landscape for it.
In all three landscapes, the peak is a VOC phenotype with fitness advantage s over the ancestor. We assume that Alpha, Beta, and Gamma are similar enough that they can be approximately described by the same landscape and the same value of s, which we infer from the early rate of increase of the VOCs (see Methods).

Landscape 1 is the simplest possibility: a single mutation on the ancestral background is sufficient to confer the full advantage. In Landscape 2, we test whether simply increasing the number of mutations involved can explain the temporal clustering. In this landscape, the VOC phenotype is produced by a combination of K > 1 mutations, each providing an independent fitness benefit s/K. In Landscape 3, we test whether epistasis may have an effect: the VOC phenotype again requires K mutations, but we now assume that they provide no fitness benefit until the full combination is acquired, i.e., the population must cross a fitness plateau. As mentioned above, there is experimental evidence for this form of epistasis among the VOC mutations [13, 14]. We expect that shallow fitness valleys will produce similar dynamics to Landscape 3, as will shallow upward slopes with a large jump in fitness at the end [24]. Note that mutations in all three landscapes can be acquired via the between- or within-host evolutionary pathways (Figure 1B).

For each evolutionary scenario, we test whether there are parameter values consistent with the data on the timing of the emergence of Alpha, Beta, and Gamma variants of SARS-CoV-2 (see Methods; Table 1).
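The three between-host landscapes described above can be summarised as a function from the number of VOC mutations carried by a genotype to its relative fitness. The Python sketch below is one minimal reading of those definitions; the function name, the landscape labels, and the choice to express fitness relative to an ancestor with fitness 1 (so that the VOC peak is 1+s) are assumptions made for illustration, as is treating the Landscape 2 benefits as strictly additive.

```python
def relative_fitness(n_mutations: int, k_required: int, s: float, landscape: str) -> float:
    """Relative between-host fitness (ancestor = 1) for a genotype carrying
    n_mutations of the k_required VOC mutations.

    landscape: "single"   -> Landscape 1, one mutation confers the full advantage s
               "additive" -> Landscape 2, each of K mutations contributes s / K
               "plateau"  -> Landscape 3, no benefit until all K mutations are present
    """
    if n_mutations < 0 or k_required < 1:
        raise ValueError("need n_mutations >= 0 and k_required >= 1")
    if landscape == "single":
        return 1.0 + s if n_mutations >= 1 else 1.0
    # Mutations beyond the K required ones are ignored in this sketch.
    m = min(n_mutations, k_required)
    if landscape == "additive":
        return 1.0 + s * m / k_required
    if landscape == "plateau":
        return 1.0 + s if m == k_required else 1.0
    raise ValueError(f"unknown landscape: {landscape}")


if __name__ == "__main__":
    s, k = 0.7, 3  # hypothetical values for illustration
    for name in ("additive", "plateau"):
        print(name, [round(relative_fitness(m, k, s, name), 3) for m in range(k + 1)])
```

Written this way, the only difference between Landscapes 2 and 3 is the fitness of the intermediate genotypes (0 < m < K): they are favoured by selection in the additive case and neutral on the plateau.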
For these parameter values, we further investigate whether they correspond to biologically reasonable scenarios in terms of the frequencies of the intermediate mutations prior to the emergence of VOCs, the total number of mutations required to produce VOCs, the total number of successful VOC lineages produced over time, and the timing between the emergence of different VOC lineages.

Landscape 1: single mutations

We start with the simplest possible fitness landscape, in which a single mutation conferring a fitness advantage s relative to the genetic background of circulating lineages is required for the emergence of VOCs. We first consider the between-host evolutionary pathway. As long as the effective population size of the pandemic was not much smaller than the census size (i.e.,

Figure 3A), inconsistent with the observed dynamics. We can therefore rule out this scenario.

predicts similarly long waiting times for the emergence of Alpha, Beta, and Gamma, inconsistent with the observed temporal clustering. Therefore, there is no biologically reasonable combination of parameters that results in the clustered emergence of VOCs in late 2020 via the Landscape 1 between-host evolutionary pathway.

On the other hand, if VOCs arose from chronic infections, then their emergence was a two-step process: first, chronic infections had to occur, and then the VOC mutation had to arise in them. The waiting time for the first step is determined by NPf; note that the number of chronic infections depends on the census size N rather than Ne, i.e., it is insensitive to the amount of overdispersion. The second step follows an exponential distribution within each chronic host, with rate μC. The subsequent spread of the VOC from the original chronic host to the rest of the population then takes much less time than these two steps.
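The two-step process described above can be sketched as a small simulation. This is a simplified illustration rather than the simulation used in the paper: it assumes a constant census size (so that chronic infections appear as a Poisson process with constant rate NPf per generation), it assumes every chronic infection eventually produces one VOC after an exponential waiting time with rate μC, and it ignores the final spreading step. The function name and the parameter values in the usage example are hypothetical.

```python
import random


def simulate_voc_emergence(n_pf: float, mu_c: float, n_generations: float, seed: int = 0):
    """Return sorted VOC production times (in generations).

    n_pf : expected number of new chronic infections per generation (N * Pf)
    mu_c : within-host rate at which a chronic infection produces the VOC
    """
    if n_pf <= 0 or mu_c <= 0:
        raise ValueError("n_pf and mu_c must be positive")
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # Step 1: waiting time until the next chronic infection appears.
        t += rng.expovariate(n_pf)
        if t > n_generations:
            break
        # Step 2: exponential waiting time within that chronic host.
        produced_at = t + rng.expovariate(mu_c)
        if produced_at <= n_generations:
            times.append(produced_at)
    return sorted(times)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the output.
    times = simulate_voc_emergence(n_pf=0.05, mu_c=0.1, n_generations=200, seed=1)
    if len(times) >= 3:
        print("first VOC at generation", round(times[0], 1))
        print("gap between first and third VOC", round(times[2] - times[0], 1))
```

Comparing the time to the first VOC with the gap between the first and third VOC over many seeds gives a feel for how NPf and μC jointly shape both the initial delay and the degree of temporal clustering.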
Figure 4 shows that to match observed VOC dynamics we must assume that the level of overdispersion is very high (i.e., very low mutation supply, Neμ), effectively blocking the between-host evolutionary pathway, while simultaneously assuming that chronic infections are very frequently produced in the population (i.e., NPf ~ 1) and that there is a relatively long waiting time before the production of each VOC

(Figure 9). This creates a phylogenetic relationship between VOC clades that is similar to what we observe for Alpha, Beta, and Gamma variants [4-6].

Another possibility for why the VOCs were not detected until mid to late 2020 is that they may

Shifting landscape

We have assumed a static fitness landscape prior to the emergence of the first three VOCs; here we consider the plausibility of that assumption. During the first year of the pandemic, a novel virus was spreading in an immunologically naïve population [12]. As more individuals became infected and developed natural immunity, it is possible that the fitness landscape for the virus shifted as selection for immune escape increased [41]. However, by the time the first three VOCs emerged in late 2020, the majority of the world's population was still susceptible to the disease and may not have even been exposed to it. Therefore, it is unlikely that the build-up of natural immunity alone was the reason behind their increased selective advantage. In contrast, the global dominance of Omicron in late 2021 was largely due to its immune escape properties

Effective population size

We approximate the between-host evolution of SARS-CoV-2 as a haploid population of size N(t) which is equal to the number of daily infectious individuals with SARS-CoV-2 worldwide. Since

Table 1).

Each VOC mutation is fixed within the host at rate μC such that the fixation time is an exponentially distributed number with mean 1/μC.
Each mutation may then spread to the rest of the population with a probability that is proportional to its fitness as determined by the Dirichlet-

Tables

Table 1:

(B) Evaluating the temporal clustering of the first three VOC lineages. For each simulation run, represented by a point on the graph, we measure the time that it takes for a single adaptive mutation to establish in the population and the time difference between the establishment of the first and third successful VOC lineage. The red dashed rectangle shows the region of the parameter space corresponding to the emergence of the first three SARS-CoV-2 VOCs with the cross sign ("X") representing the mean value. We can see that by having a combination of a relatively high level of overdispersion, high IFR, and low between-host mutation rate, there is a lower chance of intermediate mutations reaching fixation via the between-host path. Instead, multiple VOCs can emerge in quick succession during chronic infections such that a relatively large fraction of the simulation runs yield a temporal clustering that matches the emergence of the first three VOCs in late 2020 (i.e., they fall inside the enclosed area). The inset shows that 20.5% and 13.7% of the runs for the K=3 and K=6 scenarios, respectively, produce fewer than three successful VOC lineages by the end of the simulation period. Each run stops once the frequency of the VOC population reaches 75%.

in the production of a VOC, such that IFR=0.2%, μ=2×10⁻⁵, k=0.2, and s=1.0. The inset shows M with respect to the waiting time for the establishment of the first VOC lineage since the start of the pandemic, T0. For the K=1 and K=3 scenarios, too many and too few VOC lineages, respectively, are produced by late 2020. Only for the K=2 scenario do we see an intermediate number of VOC lineages being produced in the right time span. (B) Evaluating the temporal clustering of the first three VOC lineages.
For each simulation run, represented by a point on the graph, we measure the time that it takes for a single adaptive mutation to establish in the population and the time difference between the establishment of the first and third successful VOC lineage. The red dashed rectangle shows the region of the parameter space corresponding to the emergence of the first three SARS-CoV-2 VOCs with the cross sign ("X") representing the mean value. We see a noticeable overlap between the K=2 scenario and the red rectangle, suggesting that a fraction of the simulation runs exhibit temporal clustering dynamics for VOC emergence. The inset shows that 99.2% and 25.7% of the runs for the K=3 and K=2 scenarios, respectively, produce fewer than three successful VOC lineages by the end of the simulation period. Each run stops once the frequency of the VOC population reaches 75%.

, and μC=0.1. For K=6 (orange), Pf=4.5×10⁻⁸, and μC=0.25. In both scenarios, the between-host parameters μ=1×10⁻⁵, IFR=0.5%, k=0.1, and s=0.7 are the same. The inset shows M with respect to the waiting time for the establishment of the first VOC lineage since the start of the pandemic, T0. The region corresponding to the waiting time for the emergence of the first three SARS-CoV-2 VOCs is highlighted in red. Both scenarios produce roughly the same number of VOC lineages. However, on average, T0 is slightly longer for the K=6 scenario. (B) Evaluating the temporal clustering of the first three VOC lineages. For each simulation run, represented by a point on the graph, we measure the time that it takes for a single adaptive mutation to establish in the population and the time difference between the establishment of the first and third successful VOC lineage. The red dashed rectangle shows the region of the parameter space corresponding to the emergence of the first three SARS-CoV-2 VOCs with the cross sign ("X") representing the mean value.
We can see that a noticeable fraction of simulation runs for both scenarios yield a temporal clustering that matches the emergence of the first three VOCs in late 2020 (i.e., they fall inside the enclosed area). The inset shows that 35.5% and 25.9% of the runs for the K=3 and K=6 scenarios, respectively, produce fewer than three successful VOC lineages by the end of the simulation period. Each run stops once the frequency of the VOC population reaches 75%.

SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity. Nature Communications
SARS-CoV-2 one year on: evidence for ongoing viral adaptation
Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Virological
Detection of a SARS-CoV-2 variant of concern in South Africa
Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus, Brazil. Science
The biological and clinical significance of emerging SARS-CoV-2 variants
Considerable escape of SARS-CoV-2 Omicron to antibody neutralization
Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding
Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England. Science
Escape of SARS-CoV-2 501Y.V2 from neutralization by convalescent plasma
The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages
SARS-CoV-2 variant prediction and antiviral drug design are enabled by RBD in vitro evolution
Molecular dynamic simulation reveals E484K mutation enhances spike RBD-ACE2 affinity and the combination of E484K, K417N and N501Y mutations (501Y.V2 variant) induces conformational change greater than N501Y mutant alone, potentially resulting in an escape mutant.
bioRxiv
Persistence and Evolution of SARS-CoV-2 in an Immunocompromised Host
SARS-CoV-2 evolution during treatment of chronic infection
Persistent SARS-CoV-2 infection and intra-host evolution in association with advanced HIV infection. medRxiv
Evidence that Adaptation in Drosophila Is Not Limited by Mutation at Single Sites
Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2
SARS-CoV-2 within-host diversity and transmission
Acute SARS-CoV-2 infections harbor limited within-host diversity and transmit via tight transmission bottlenecks
Nextstrain: real-time tracking of pathogen evolution
The rate at which asexual populations cross fitness valleys. Theoretical Population Biology
Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China
A human coronavirus evolves antigenically to escape antibody immunity
Emergence in Southern France of a new SARS-CoV-2 variant of probably Cameroonian origin harbouring both substitutions N501Y and E484K in the spike protein. medRxiv
Proceedings of the National Academy of Sciences of the United States of America, 2021. 118(47): p. e2114828118.
30. B. B. O. Munnink, et al., Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans
Transmission of SARS-CoV-2 delta variant (AY.127) from pet hamsters to humans, leading to onward human-to-human transmission: a case study
Evidence for a mouse origin of the SARS-CoV-2 Omicron variant
Multiple spillovers from humans and onward transmission of SARS-CoV-2 in white-tailed deer
Recombination, Reservoirs, and the Modular Spike: Mechanisms of Coronavirus Cross-Species Transmission
Exploring the Natural Origins of SARS-CoV-2 in the Light of Recombination
A coalescent-based method for detecting and estimating recombination from gene sequences
Generation and transmission of interlineage recombinants in the SARS-CoV-2 pandemic
Emergence and widespread circulation of a recombinant SARS-CoV-2 lineage in North
Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa
Adaptation from standing genetic variation. Trends in Ecology & Evolution
Considerable escape of SARS-CoV-2 Omicron to antibody neutralization
Recurrent SARS-CoV-2 Mutations in Immunodeficient Patients. medRxiv
Better Tests, Better Care: Improved Diagnostics for Infectious Diseases. Clinical Infectious Diseases
Lessons for preparedness and reasons for concern from the early COVID-19 epidemic in Iran
Assessing the Burden of COVID-19 in Developing Countries: Systematic Review
Age-specific mortality and immunity patterns of SARS-CoV-2
Tracking excess mortality across countries during the COVID-19 pandemic with the World Mortality Dataset. Elife
Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China. Nature Medicine
Superspreading and the effect of individual variation on disease emergence.
Nature
Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
Purifying Selection Determines the Short-Term Time Dependency of Evolutionary Rates in SARS-CoV-2 and pH1N1 Influenza
Lineage replacement and evolution captured by the United Kingdom Covid Infection Survey. medRxiv

Supplementary figure 1: Distribution of waiting times for the establishment of consecutive pairs of VOC lineages via the between-host pathway assuming a fitness landscape with a single adaptive mutation. The distribution of times that it takes between the production of the i-th and (i+1)-th lineage, Ti:(i+1), for the first 5 established VOC lineages described in Figure 3. T0:1 is the waiting time for the establishment of the first VOC lineage (equivalent to T0).

Supplementary figure 2: Distribution of waiting times for the establishment of consecutive pairs of VOC lineages via the within-host pathway assuming a fitness landscape with a single adaptive mutation. The distribution of times that it takes between the production of the i-th and (i+1)-th lineage, Ti:(i+1), for the first 5 established VOC lineages described in Figure 4. T0:1 is the waiting time for the establishment of the first VOC lineage (equivalent to T0).

Supplementary figure 3: Distribution of waiting times for the establishment of consecutive pairs of VOC lineages via the between-host pathway assuming an additive fitness landscape. The distribution of times that it takes between the production of the i-th and (i+1)-th lineage, Ti:(i+1), for the first 5 established VOC lineages described in Figure 5. T0:1 is the waiting time for the establishment of the first VOC lineage (equivalent to T0).

The distribution of times that it takes between the production of the i-th and (i+1)-th lineage, Ti:(i+1), for the first 5 established VOC lineages described in Figure 6.
T0:1 is the waiting time for the establishment of the first VOC lineage (equivalent to T0).

The distribution of times that it takes between the production of the i-th and (i+1)-th lineage, Ti:(i+1), for the first 5 established VOC lineages described in Figure 7. T0:1 is the waiting time for the establishment of the first VOC lineage (equivalent to T0).

Supplementary figure 6: Distribution of waiting times for the establishment of consecutive pairs of VOC lineages via the within-host pathway assuming a fitness plateau landscape. The distribution of times that it takes between the production of the i-th and (i+1)-th lineage, Ti:(i+1), for the first 5 established VOC lineages described in Figure 8. T0:1 is the waiting time for the establishment of the first VOC lineage (equivalent to T0).