key: cord-0791567-0yl9pvvn authors: Kurpas, Monika; Kimmel, Marek title: Mutation patterns in SARS-COV-2 Alpha and Beta variants indicate non-neutral evolution date: 2022-03-01 journal: bioRxiv DOI: 10.1101/2022.02.28.482283 sha: 2ed39290b2eecde1d3e5008dbc19b077a6f0772d doc_id: 791567 cord_uid: 0yl9pvvn

Due to the emergence of new variants of the SARS-CoV-2 coronavirus, the question of how the viral genomes evolved, leading to the formation of highly infectious strains, becomes particularly important. Two early emergent strains, Alpha and Beta, characterized by a significant number of missense mutations, provide natural testing samples. In this study we explore the history of each of the segregating sites present in the Alpha and Beta variants of concern, to address the question of whether the defining mutations accumulated gradually, leading to the formation of the sequence characteristic of these variants. Our analysis exposes data features that suggest non-neutral evolution of SARS-CoV-2 genomes, leading to the emergence of variants of concern. We observe only a small number of possible combinations of mutations, indicating rapid evolution of genomes. In addition, mutation patterns observed in whole-genome samples of the Alpha and Beta variants indicate the presence of stronger selection than in the remaining genome samples.

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which causes the current COVID-19 pandemic, is a typical RNA virus and is expected to mutate at a pace of 10^-4 nucleotide substitutions per site per year [18, 12]. Although most of these mutations are either deleterious or neutral, some of them may affect the transmissibility and infectivity of the emerging strain. Accumulation of mutations may also lead to immune escape, increasing the likelihood of reinfection. These features are observed in several strains, called 'variants of concern' (VOCs), each characterized by a set of mutations.
1.1 B.1.1.7 (Alpha) variant

The B.1.1.7 variant, later recognized as a variant of concern, was first detected in November 2020 in a sample taken on September 20, 2020 in the United Kingdom. With transmissibility increased by 43-90% [4] and an about twofold replicative advantage [7], the Alpha variant began to spread, quickly outnumbering the original Wuhan strain. The B.1.1.7 variant is characterized by 14 non-synonymous mutations and 3 deletions [8, 15] (Tab. 1).

In this study we explore the history of each of the segregating sites present in the Alpha and Beta VOCs. We ask whether the defining mutations accumulated gradually until they formed the sequence characteristic of the Alpha or Beta variant, or whether this phenomenon can be explained by recombination of two genomes carrying subsets of the mutations. We also check whether mutation patterns observed in whole-genome samples of viral variants classified as VOCs indicate the presence of stronger selection than in non-VOC samples.

The analysis was carried out using 384,741 nucleotide sequences of SARS-CoV-2 genomes, downloaded from the GISAID (Global Initiative on Sharing All Influenza Data) database [17, 2].

We created statistics for each week since the beginning of the pandemic, recording the total number of genomes and the number of Alpha- and Beta-variant genomes in a given week. We reviewed all 384,741 subsequences of SARS-CoV-2 genomes. For each position in the subsequence we checked whether the given genome has a VOC-defining mutation at the corresponding place. If so, we saved the accession number and collection date of that genome. These data enabled us to quantify the change in the abundance of individual mutations over time, to study possible combinations of 2, 3, 4 etc. mutations present together in one genome, and to determine the dates when such combinations arose.
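The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names, the input layout (accession, collection date, set of mutation labels) and the example records are assumptions made here. It takes one reading of "combination", namely the full subset of VOC-defining mutations carried by a genome, and sets the observed number of distinct combinations of each size against the number of possible subsets, C(n, k), which is only an upper bound and not necessarily the expected count used in the study.

```python
from datetime import date
from math import comb


def combination_first_seen(genomes, defining_sites):
    """Map each observed combination of VOC-defining mutations to the
    earliest (collection_date, accession) at which it was seen.

    genomes: iterable of (accession, collection_date, mutation_labels)
    defining_sites: iterable of VOC-defining mutation labels
    """
    defining = frozenset(defining_sites)
    first_seen = {}
    for accession, collected, mutations in genomes:
        # Keep only the VOC-defining mutations carried by this genome.
        combo = frozenset(mutations) & defining
        if not combo:
            continue
        if combo not in first_seen or collected < first_seen[combo][0]:
            first_seen[combo] = (collected, accession)
    return first_seen


def observed_vs_possible(first_seen, n_sites):
    """For each combination size k, return (observed distinct combinations,
    number of possible k-subsets of the n_sites defining mutations)."""
    observed = {}
    for combo in first_seen:
        observed[len(combo)] = observed.get(len(combo), 0) + 1
    return {k: (observed.get(k, 0), comb(n_sites, k))
            for k in range(1, n_sites + 1)}


if __name__ == "__main__":
    # Hypothetical records; accession names and dates are made up.
    records = [
        ("acc-1", date(2020, 9, 20), {"N501Y", "P681H"}),
        ("acc-2", date(2020, 10, 1), {"N501Y"}),
    ]
    seen = combination_first_seen(records, ["N501Y", "A570D", "P681H"])
    print(observed_vs_possible(seen, 3))
```

Keying the dictionary by a frozenset makes the order of mutation labels irrelevant, so the same combination reported in a different order is counted once.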
We compared the observed counts of combinations in the tested samples with the expected number of combinations, given the count of segregating sites.

To check whether selection pressure is higher among genomes belonging to the VOC strain than among the remaining strains, we divided our dataset into two groups: Alpha strain genomes and remaining genomes (in the second experiment we did the same for Beta strain genomes). We then divided both groups by week and chose weeks with a suitable number of sequenced VOC strain genomes. The numbers of genomes sequenced in each week are presented in Fig. S1. The selection criterion for a week was the number of VOC genomes, which should not be larger than 500 (due to computational limitations) and should not exceed the number of non-VOC genomes sequenced in that week. In the case of Alpha these were weeks 45 (115 genomes

Inference from evolutionary models of DNA often exploits summary statistics of sequence data, a common one being the so-called Site Frequency Spectrum (SFS). In a sequencing experiment with a known number of sequences, we can estimate, for each site at which a novel somatic mutation has arisen, the number of genomes that carry that mutation. These numbers are then grouped into sites that have the same number of copies of a mutant.

3 Results

Based on the data from processing of subsequences containing segregating sites for the Alpha and Beta SARS-CoV-2 variants, we generated timelines for each of the defining mutations (Fig. 2 A

For both the Alpha and Beta variants we calculated how many genomes carry a given number of mutations from the VOC-defining set (Fig. 4 A and B). We calculated the number of observed unique combinations of VOC-defining mutations for both Alpha (Table 3) and Beta (Table 4) is the dynamics of increase in number of unique combinations over time (Fig. 6).
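The Site Frequency Spectrum introduced in the methods above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the procedure used in the study: sequences are assumed to be already aligned to a reference of equal length, any unambiguous base that differs from the reference is counted as a mutant copy, and ambiguous characters and gaps are ignored. The function name is hypothetical.

```python
from collections import Counter


def site_frequency_spectrum(sequences, reference):
    """Return {k: number of sites at which exactly k genomes carry a
    non-reference base}, for k >= 1.

    sequences: aligned genome strings, each as long as `reference`
    reference: reference genome string
    """
    valid = set("ACGT")
    carriers = [0] * len(reference)
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        for i, (base, ref_base) in enumerate(zip(seq, reference)):
            # Count only unambiguous differences; skip N, gaps, etc.
            if base in valid and ref_base in valid and base != ref_base:
                carriers[i] += 1
    # Group sites by their number of mutant copies, dropping invariant sites.
    return dict(sorted(Counter(c for c in carriers if c > 0).items()))
```

For a sample of n genomes the keys of the returned dictionary range from 1 to n; the entry for k = 1 counts sites where the mutation appears in a single genome.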
We observe that genomes carrying combinations of a higher number of mutations (even the full set) emerge earlier than genomes carrying only some of them (e.g. combinations of 5 or 6 mutations).

The results of the analysis of the genomes sequenced in weeks 45, 46 and 49 for the Alpha variant, and of all genomes for the Beta variant, are presented in the form of log-log cumulative tails (Figs 7-8 and S2-S3). We observe that the slope of the cumulative tails differs between the sample with Alpha genomes and the sample with the remaining ones. In the case of the Beta variant we do not observe such a significant effect.

In the case of exponential population growth, Durrett [6] provided an approximate large-sample and large-population expression, which leads to the conclusion that, assuming neutral evolution, the SFS cumulative tail on the log-log scale should be approximated by a straight line with slope -1 (marked in Figs 7-8 and S2-S3). Analysis of the obtained results shows that the cumulative SFS tails calculated from non-Alpha genomes can be approximated by a straight line with slope -1, characteristic of neutral evolution. In contrast, the slope of the cumulative SFS tails obtained for Alpha variant genomes indicates the presence of selective pressure on the evolution of these genomes.

In this study we analysed SARS-CoV-2 genomes to see how the individual mutations that define the Alpha and Beta variants appeared over time. Our analyses showed that these mutations did not arise gradually, but rather co-evolved rapidly, leading to the emergence of the full VOC strain. We do not observe the transient states that would be expected under neutral evolution. These results seem to indicate that the segregating sites in the Alpha and Beta variants evolved under strong positive selection. Another possible explanation might be a recombination event between viruses carrying subsets of VOC-defining mutations.
Research has shown that such a phenomenon is common in bat coronaviruses [13] and might indeed also be affecting the evolution of SARS-CoV-2 [11]. The observed mutation patterns may also be due to mutation hotspots, which were detected in the region encoding the Spike protein [14].

In addition to the factors described above, we cannot rule out the possibility that genomes carrying subsets of VOC-defining mutations avoided collection and sequencing. In the data gathered by GISAID we can clearly see temporal differences in the number of sequenced genomes (as shown in Fig. S1), but more importantly, most of the collected genomes come from Europe and the United States. The underrepresentation of sequences from other parts of the world could be the reason why genomes containing subsets of mutations have been overlooked.

We carried out an additional analysis of the early evolution of the B.1.1.7 VOC, in week 45 of the epidemic, when only 115 samples of the variant were present and its abundance was still increasing roughly exponentially. We used the model developed in [5]. The model assumes that at some time labeled t_0 = 0, a strain of viruses, such as the VOC B.1.1.7 (clone 0), arises and grows deterministically in size at rate r_0, acquiring mutations at rate θ_0 per time unit per genome. At time t_1 > 0, a subclone (clone 1) arises, which differs from the original clone with respect to growth rate (now equal to r_1 > r_0) and mutation rate (now equal to θ_1). We call this the "selective event". The new clone arises on the background of a haplotype already harboring K mutations. Finally, at t_2 > t_1 > 0, a sample of n of the variant's RNA genomes is sequenced. Without getting into details, as explained in [5], the emerging substrain leaves a signature ("bulge") on the SFS cumulative tail T(x), the characteristics of which can be estimated from equations in [5]. Figure 9 illustrates the fit. The conclusion is that we observe a substrain of B.
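The neutrality check on the SFS cumulative tail discussed in the results can be sketched as follows. This is a rough illustration, not the authors' code: the tail is taken here as the number of sites carried by at least k genomes (the exact convention in the paper may differ), the slope is estimated by ordinary least squares on log10-transformed values, and the function names are hypothetical. A fitted slope close to -1 would be consistent with the neutral expectation cited from Durrett [6]; a clear departure would be consistent with selection.

```python
import math


def cumulative_tail(sfs):
    """Turn an SFS {k: number of sites with k mutant copies} into a list of
    (k, T(k)) pairs, where T(k) is the number of sites with >= k copies."""
    running = 0
    tail = []
    for k in sorted(sfs, reverse=True):
        running += sfs[k]
        tail.append((k, running))
    tail.reverse()
    return tail


def loglog_slope(points):
    """Least-squares slope of log10(T) against log10(k)."""
    xs = [math.log10(k) for k, _ in points]
    ys = [math.log10(t) for _, t in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct k values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A least-squares fit over all points is the simplest choice; in practice the low-count end of the tail is noisy, so restricting the fit to a range of k may be preferable.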
GISAID database
Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. NCBI Reference Sequence: NC_045512
Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England
Statistical inference for the evolutionary history of cancer genomes
Population genetics of neutral mutations in exponentially growing cancer cell populations. The Annals of Applied Probability: an official journal of the Institute of Mathematical
SARS-CoV-2 variant of concern 202012/01 has about twofold replicative advantage and acquires concerning mutations
SARS-CoV-2 variants, spike mutations and immune escape
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
Emergence of SARS-CoV-2 through recombination and strong purifying selection
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet
Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen
Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon entropy and k-means clustering
Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations
EMBOSS: the European Molecular Biology Open Software Suite
GISAID: Global initiative on sharing all influenza data - from vision to reality
Epidemiology, genetic recombination, and pathogenesis of coronaviruses
Detection of a SARS-CoV-2 variant of concern in South Africa
Evaluating the effects of SARS-CoV-2 spike mutation D614G on transmissibility and pathogenicity

Monika Kurpas was financially supported by subsidy for the maintenance and development of research potential 02/040/BKM21/1017 granted by the Polish Ministry of Science and Higher Education. Marek Kimmel was supported by the NSF/DMS Rapid Collaborative grant to

MKi suggested the problem, designed and supervised the research. MKu designed algorithms to collect weekly statistics of viral genomes. MKu performed the analyses and visualized the results. MKi and MKu prepared the manuscript. All authors reviewed and approved the final version.

The authors declare that they have no competing interests.

All relevant data are included within the manuscript and the Supporting Information files.