title: Uneven growth of SARS-CoV-2 clones evidenced by more than 500,000 whole-genome sequences
authors: Zeng, Hong-Li; Liu, Yue; Thorell, Kaisa; Nordén, Rickard; Aurell, Erik
date: 2021-04-06
journal: bioRxiv
DOI: 10.1101/2021.04.06.437914

We have computed the frequencies of the alleles of the "UK variant" (B.1.1.7) and the "South Africa variant" (B.1.351) of SARS-CoV-2 from the large GISAID repository. We find that the frequencies of the mutations in the UK variant overall rose towards the end of 2020, as widely reported in the literature and in the general press. However, we also find that these frequencies vary in different patterns rather than in concert. For the South Africa variant we find a more complex scenario, with the frequencies of some mutations rising and others remaining close to zero. Our results indicate that what is generally reported as one variant is in fact a collection of variants with different genetic characteristics.

COVID-19 has so far led to the confirmed deaths of more than 2,700,000 people (1) and has caused the largest disruption to the world economy and human life for several generations. While several efficient vaccines have been developed and some countries have already progressed far towards herd immunity, most of the world is still in the midst of the pandemic. As its elimination in many countries will likely only happen on time scales of years rather than months, a better understanding of the biology of SARS-CoV-2 will remain of high importance.

The GISAID repository (2) contains a rapidly increasing collection of SARS-CoV-2 whole-genome sequences, and has been used to identify mutational hotspots and potential drug targets (3) as well as to infer epistatic fitness parameters (4, 5). In recent months nature has in effect performed an experiment through the growth of the Variants of Concern (VOC) B.1.1.7 and B.1.351.
These variants of the virus are commonly referred to as the "UK variant" and the "South Africa variant", as they were first identified in south-east England (6) and South Africa (7, 8).

The frequencies of the mutations at the different positions hence give information on whether these variants in fact grow as large clones, whether they have mutated or recombined into several clones, or whether they were several clones from the beginning. We find that the second scenario holds for the UK variant, while a combination of the second and third scenarios holds for the South Africa variant.

The GISAID repository (2) holds a large collection of SARS-CoV-2 whole-genome sequences. In the following we have used genomes qualified as "high quality" and annotated with a sampling date up to the end of February 2021. We note that the submission date to GISAID is later than the sampling date, typically by two weeks or more. The data used hence represent a large part of all the whole-genome sequences available on GISAID up to mid-March 2021. The total number of SARS-CoV-2 genomes used in this study is 562,477. The data have been stratified by sampling time, as shown in the figure captions.

Table 2. Of these 17 mutations, three appeared well before this variant was defined and have an unrelated time course, see Fig. 2. In the following we have assumed that these three mutations, C1059T (T265I in NSP2), C21614T (L18F in Spike) and G25563T (Q57H in ORF3a), mostly pertain to other clones and/or to another reference sequence. We have not retained data from these loci. The other loci include two that are also present in B.1.1.7 and follow its time course; the frequencies of the rest remain an order of magnitude lower in the GISAID data used here.

The frequencies of the 22 retained mutations of the UK variant increase after late summer / early autumn 2020, see Fig. 3.
The lines in this figure connect the frequencies of the second most common allele (first minor allele), each computed from sequences with the same month of sampling in the GISAID data. With one exception (16176, discussed above), this second most common allele agrees with the mutation at this locus as given in (6).

We analyzed the consensus sequences deposited in the GISAID database (2). The SARS-CoV-2 dataset downloaded from the GISAID website is stored on a desktop computer with 64 GB of RAM, named "hlz", at Nanjing University of Posts and Telecommunications (NJUPT). The allele frequencies were computed and the visualizations produced using MATLAB R2020a on "hlz". The work mainly focused on the allele-frequency analysis of the mutations and deletions listed for B.1.1.7 (6), also known as the "UK variant", and B.1.351 (8), also known as the "South Africa variant". For a given time period ∆t, the frequency of a given nucleotide x at locus i is computed by eq. 1. The outliers indicated by arrows in Fig. 1 and Fig. 2 were identified manually.

as originally given in "Technical briefing 1" (6)

as given in "Technical briefing 6" (8), Table 4a. This information with annotations is given as Table 2 below.

a Genomic position as in (6), Table 1 and text above Table 1. Positions refer to the SARS-CoV-2 sequence Wuhan-Hu-1 with GenBank accession number "MN908947.3".
b Frequencies of alleles have been computed from the entire data set (reference) after multiple sequence alignment as described. Alleles at one locus have then been sorted by frequency as Allele 1 (major allele), Allele 2 (first minor allele), etc.
* The question mark "?" indicates different nucleotides in the deletions.
c In time-sorted GISAID data, alleles at this locus behave opposite to what would be expected if the wild type at this locus were C. Using the same convention as for the other loci, we have taken the mutation at this locus to be T16176C.
d In time-sorted GISAID data the most common allele at this locus is initially G, later overtaken by C.
However, the time course is very different from the rest of the UK variant. Possibly this points to the use of another reference sequence for this single mutation in gene M in (6). Using the same convention as for the other loci, the mutation at this locus would be G26801C.
e This locus is one of three annotated as 28280 GAT -> CTA in (6).

a Genomic position as in (8)

analysis is required to reach this conclusion. In this work we have used well over half a million whole-genome SARS-CoV-2 sequences from GISAID, and for the last two months the plots of the monthly frequency data above are based on the order of 100,000 sequences.

The instability of clones is supported by recent observations that point towards the emergence of multiple lineages of SARS-CoV-2 within the same individual (12, 13, 14, 15, 16). In all cases the patients had prolonged viremia and received convalescent plasma treatment and/or monoclonal antibody therapy. Treatment with convalescent plasma or monoclonal antibodies applies selection pressure on the viral population within the host, which may drive the emergence of antibody-resistant clones. Also, the large number of viral genomes present simultaneously in a single patient creates opportunities for within-host recombination.

The phenotypic effects of the described mutations in the spike protein of SARS-CoV-2 are just beginning to be unraveled. For example, the N501Y substitution increases the affinity for ACE2 binding (17). Also, compensatory mutations have been described, as in the case of the E484K substitution in combination with del69-70, where a reduction in antibody sensitivity is compensated by increased infectivity.

Coronaviruses, the larger family to which SARS-CoV-2 belongs, in general exhibit a large amount of recombination (18, 19). There are reports that this is also the case for SARS-CoV-2 (20, 21, 22). Large-scale recombination would be important in the COVID-19 pandemic for several reasons.
First, it increases the resilience of the viral population against hostile agents: changes beneficial to the virus can spread faster and more reliably throughout the population. Second, it leads to a form of evolution that optimizes fitness and is less affected by traits inherited by chance. While a clone replicating asexually will likely have points of weakness, in a recombining population such errors are shared around and eliminated. Third, a substantial amount of recombination is a confounder for phylogenetic reconstruction. Crudely put, phylogenetic trees reconstructed from population-wide sequence data may not reflect the actual evolution in such populations, an issue that has been discussed in bacterial phylogenetics for some time (23, 24, 25). Lastly, a population under strong recombination is expected to be in Kimura's Quasi-Linkage Equilibrium (26, 27, 28), which allows efficient and accurate inference of evolutionary parameters from sequence information (4, 5). On a positive note, this opens up the perspective of a systematic search for new drugs and combinatorial drug treatments by leveraging large-scale whole-genome sequencing data.

Proceedings of the National Academy of Sciences
Genome-wide covariation in SARS-CoV-2, bioRxiv
Investigation of novel SARS-CoV-2 Variant of Concern 202012/01
Investigation of SARS-CoV-2 variants of concern in England
Github
Proc. Natl. Acad. Sci

We thank Richard Neher for comments on a first version of the manuscript, and for pointing out that in the annotation used by Nextstrain, C26801G is counted in clade 20E (EU1). The work of HLZ was sponsored by the National Natural Science Foundation of China (11705097). The work of EA was supported by the Swedish Research Council grant 2020-04980.
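As an illustration of the allele-frequency analysis described in the methods, the following is a minimal Python sketch. It is not the authors' MATLAB code, and the equation they refer to as eq. 1 is not reproduced in this text. The sketch assumes the plain counting estimate: the frequency of nucleotide x at locus i in a period ∆t is the number of sequences sampled in ∆t that carry x at i, divided by the number of sequences sampled in ∆t. The function names, the input format (pairs of sampling month and aligned sequence) and the toy sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def monthly_allele_frequencies(records, locus):
    """Return {month: {nucleotide: frequency}} at one alignment column.

    records: list of (sampling_month, aligned_sequence) pairs; this input
             format is an assumption, and all sequences must share one
             multiple sequence alignment.
    locus:   1-based position in the aligned reference coordinates.
    """
    counts = defaultdict(Counter)
    for month, seq in records:
        counts[month][seq[locus - 1]] += 1
    # Divide each count by the number of sequences sampled in that month.
    return {
        month: {x: n / sum(c.values()) for x, n in c.items()}
        for month, c in counts.items()
    }


def ranked_alleles(records, locus):
    """Alleles at a locus sorted by overall count: major first, then minors."""
    overall = Counter(seq[locus - 1] for _, seq in records)
    return [x for x, _ in overall.most_common()]


# Toy data, invented purely for illustration (not GISAID sequences).
toy = [
    ("2020-11", "ACGT"), ("2020-11", "ACGT"), ("2020-11", "ACGT"),
    ("2020-12", "ATGT"), ("2020-12", "ACGT"),
    ("2020-12", "ATGT"), ("2020-12", "ATGT"),
]
freqs = monthly_allele_frequencies(toy, locus=2)
first_minor = ranked_alleles(toy, locus=2)[1]
# Monthly time course of the first minor allele (0.0 where it is absent).
course = {m: f.get(first_minor, 0.0) for m, f in sorted(freqs.items())}
print(first_minor, course)
```

In this sketch the alleles are ranked over the entire data set and the first minor allele is then followed month by month, mirroring the convention of Allele 1 (major allele) and Allele 2 (first minor allele) used in the table notes. Gap characters and ambiguous bases are counted like any other symbol here; a real analysis would need an explicit rule for them.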