key: cord-0932216-m8ks54x0 authors: Chiara, Matteo; Horner, David S; Gissi, Carmela; Pesole, Graziano title: Comparative genomics reveals early emergence and biased spatio-temporal distribution of SARS-CoV-2 date: 2021-02-19 journal: Mol Biol Evol DOI: 10.1093/molbev/msab049 sha: 07e4fb494c60e5fb4e1b8c424ef96b853882768d doc_id: 932216 cord_uid: m8ks54x0 Effective systems for the analysis of molecular data are fundamental for monitoring the spread of infectious diseases and studying pathogen evolution. The rapid identification of emerging viral strains, and/or genetic variants potentially associated with novel phenotypic features is one of the most important objectives of genomic surveillance of human pathogens and represents one of the first lines of defense for the control of their spread. During the COVID 19 pandemic, several taxonomic frameworks have been proposed for the classification of SARS-Cov-2 isolates. These systems, which are typically based on phylogenetic approaches, represent essential tools for epidemiological studies as well as contributing to the study of the origin of the outbreak. Here we propose an alternative, reproducible and transparent phenetic method to study changes in SARS-CoV-2 genomic diversity over time. We suggest that our approach can complement other systems and facilitate the identification of biologically relevant variants in the viral genome. To demonstrate the validity of our approach, we present comparative genomic analyses of more than 175,000 genomes. Our method delineates 22 distinct SARS-CoV-2 haplogroups, which, based on the distribution of high frequency genetic variants, fall into 4 major macro haplogroups. We highlight biased spatio-temporal distributions of SARS-CoV-2 genetic profiles and show that 7 of the 22 haplogroups (and of all of the 4 haplogroup clusters) showed a broad geographic distribution within China by the time the outbreak was widely recognised—suggesting early emergence and widespread cryptic circulation of the virus well before its isolation in January 2020. General patterns of genomic variability are remarkably similar within all major SARS-CoV-2 haplogroups, with UTRs consistently exhibiting the greatest variability, with s2m, a conserved secondary structure element of unknown function in the 3’ UTR of the viral genome showing evidence of a functional shift. While several polymorphic sites that are specific to one or more haplogroups were predicted to be under positive or negative selection, overall our analyses suggest that the emergence of novel types is unlikely to be driven by convergent evolution and independent fixation of advantageous substitutions, or by selection of recombined strains. In the absence of extensive clinical metadata for most available genome sequences, and in the context of extensive geographic and temporal biases in the sampling, many questions regarding the evolution and clinical characteristics of SARS-CoV-2 isolates remain open. However, our data indicate that the approach outlined here can be usefully employed in the identification of candidate SARS-CoV-2 genetic variants of clinical and epidemiological importance. The ongoing Coronavirus disease 2019 pandemic (Poon and Peiris 2020) poses the greatest global health and socioeconomic threat since world war II. The first case of COVID-19 was reported in Wuhan city, Hubei province, China, in late December 2019 (Zhou et al. 2020) , although retrospective analyses have placed the onset as early as December 1st (Duchene et al, 2020) , and other, sometimes controversial, studies indicate widespread distribution of the causal agent significantly earlier (Apolone et al 2020 , Deslandes et al 2020 , La Rosa et al. 2020 . At the time of writing, COVID-19 has affected more than 200 countries worldwide, with more than 65 Million confirmed individual infections and a death toll in excess of 1.5 million. Varying criteria for reporting COVID-19-related deaths, the fact that very mild or asymptomatic infections can often go undetected, differences in testing strategies, and other demographic factors (Dowd et al 2020 , Lavezzo et al. 2020 , Niedzwiedz et al 2020 , suggest that both these figures are likely to represent substantial underestimates of its worldwide impact. The first complete genome sequences of the viral pathogen were determined in early January 2020 by second generation metatranscriptomic sequencing (Wu et al, 2020) , allowing the rapid development of diagnostic tests (Corman et al. 2020) and the development of molecular monitoring strategies (Qiang et al. 2020) . The viral genome is approximately 30.000 nt in size and shows high similarity (~79%) with SARS-CoV-1 (Lu et al. 2020) , a beta-coronavirus of the subgenus Sarbecovirus, and the causal agent of a large scale epidemic of viral pneumonia (Severe Acute Respiratory Syndrome, SARS) that affected China and other 25 countries in 2003 (Vijayanand et al. 2004 ). The International Committee on Taxonomy of Viruses (ICTV) designated the novel pathogen SARS-CoV-2. Phylogenetic analyses have assigned SARS-CoV-2 to the Severe acute respiratory syndrome-related coronavirus (SARSr-CoV) group, where it forms a relatively distant sister group to SARS-CoV-1, interleaved with various SARSr-CoVs isolated from non-human mammalian species (Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, 2020). SARS-CoV-2 shows the highest levels of genome identity (96%) with a bat SARSr-CoV denoted RaTG13, which was isolated in the Yunnan province in 2013 (Zhou et al. 2020) . Despite this sequence similarity, SARS-CoV-2 differs from RaTG13 in several key features. Arguably, the most important of these is the presence of a polybasic furin cleavage site insertion (residues PRRA) at the junction of the S1 and S2 subunits of the Spike protein (Coutard et al. 2020) . This insertion, which may increase the infectivity of the virus, is not present in related beta-coronaviruses, although similar polybasic insertions are observed in other human coronaviruses, including HCoV-HKU1, as well as in highly pathogenic strains of avian influenza virus (Nao et al. 2017) . Additionally, the RBD (Recognition Binding Domain) of the SARS-CoV-2 spike protein is significantly more similar (97% identity) to that of SARSr-CoVs isolated from specimens of Malayan pangolins (Manis javanica) illegally imported into southern China (Guangdong and Guangxi provinces) than to the RDB of RaTG13 (89% identity) suggesting that recombination in pangolins or other mammalian "intermediate" or "amplifying" hosts may have predated human transmission (Lam et al. 2020; Wong et al. 2020) . As the COVID-19 pandemic has progressed, the number of viral isolates for which genomic sequences are available has increased substantially and in excess of 175.000 viral genomes are currently publicly available (EpiCov, 2020) . As expected, considering its recent emergence and the reportedly low mutation rates of coronaviruses in general (Sanjuán et al. 2010 ), SARS-CoV-2 genomes show low levels of genetic diversity (average pairwise identity of 99.99%). These considerations notwithstanding, phylogenetically distinct groups of SARS-CoV-2 isolates have been identified (Benvenuto el al 2020; Ceraolo and Giorgi 2020; Forster et al. 2020; Gudbjartsson et al. 2020; Phan, 2020; Walker et al. 2020) and several of these clusters show highly biased geographic distributions. Many independent studies have tentatively linked particular genomic signatures to increased/decreased virulence or possible adaptation to human hosts (Grubaugh et al. 2020; Korber et al. 2020; Pachetti et al. 2020) . While, in the absence of careful validation, it is impossible to determine whether emerging SARS-CoV-2 genetic variants are of physiological relevance and indeed reflect adaptive evolution rather than being neutral characters derived from genetic drift and founder effects, the importance of establishing simple and reproducible systems for the delineation of genetic diversity in human pathogens is widely acknowledged (Deng et al. 2016; Armstrong et al. 2019 ). Currently, the GISAID (Shu and McCauley, 2017) and Netxstrain portals (Hadfield et al. 2018) , represent the most comprehensive repositories of SARS-CoV-2 genome sequences, and thus constitute reference points for molecular epidemiological, population genetics and comparative genomic studies of the novel pathogen. Both systems provide tools for comparative and phylogenetic analyses, as well as instruments dedicated to genome and variant annotation. Additionally, Nexstrain provides resources for temporo-spatial graphical data representation. Other dedicated computational infrastructures, including the EBI COVID-19 Data Portal (https://www.covid19dataportal.org/) and the SARS-CoV-2 resource portal (https://www.ncbi.nlm.nih.gov/sars-cov-2/), also facilitate access and retrieval of COVID-19-related data including raw sequences reads. While all these platforms constitute invaluable resources for the SARS-CoV-2 research community, the exponential expansion of the viral population during the current pandemic, as well as notable temporo-geographic biases in sampling, low sequence variability, recombination (de Wit et al. 2016; Boni et al. 2020; Phan, 2020) and limited metadata associated with many sequences pose significant challenges for the rapid and effective identification of relevant genomic variants and viral haplotypes and might potentially hinder pure phylogenetics approaches. Additionally, while molecular phylogenetic methods are clearly the most suitable approach for viral transmission tracing from genomic sequences, they can be computationally burdensome and the optimization of substitution models as well as the interpretation of complex topologies imply a requirement for a certain level of technical expertise. Moreover, widespread variation in allele frequencies is normally observed during an outbreak and requires ad hoc analyses to identify minimum levels of divergence required to delineate different clusters of genomic sequences while avoiding excessive fragmentation. Currently, the approach proposed by Rambaut et al is regarded as the most evolutionarily accurate and consistent available method for nomenclature and classification of emerging SARS-CoV-2 viral strains. This framework applies rigorous phylogenetic analyses, combined with simple rules -based on prevalence and phylogeographic distribution -to identify novel groups or viral lineages and their ancestry and can efficiently accommodate novel sequences as they are generated. However, it is not specifically designed to identify/pinpoint highly prevalent genetic variants or rapidly emerging viral isolates, as reflected by the fact that the system currently identifies more than 230 distinct lineages which show a median size of 22.5 genomes. In the light of the above considerations we and others (Chiara et al, 2020 have proposed phenetic clustering for the delineation of current and emerging viral genetic diversity considering only genetic variants that are highly prevalent (AF > 1%) in the viral population. These methods are conceptually more simple than those based on true phylogenetic analyses, and have the advantage of providing a more general picture of genetic variants and haplotypes. However, allele frequencies are typically estimated from all sequences available at the time of study, implying the potential loss of information regarding variation in frequencies of even clinically or statistically relevant variants over time. This consideration is further complicated by extensive geographic biases in sequence sampling. In an attempt to partially address these considerations, here we extend a simple framework inspired by MultiLocus Strain Typing (MLST), a classification approach often used for microbes (Maiden, 2006) to the geo-spatial analysis of SARS-CoV-2 genomic sequences. Thus, we estimate changes in "fixed" viral allele and haplotype frequencies in geographic contexts across the time-course of the pandemic. To demonstrate the validity of our approach, we have studied more than 175,000 complete SARS-CoV-2 genomes and derive intriguing observations regarding the origin (and by inference the timing) of the emergence of pandemic strains as well as the evolutionary mechanisms associated with the emergence of novel variants. We retrieved 178,191 SARS-CoV-2 genomic sequences labeled as high coverage and putatively complete, derived from 121 countries in 5 continents from the GISAID EpiCoV portal (as available on November 10th 2020). We observed that a considerable number (49,260) of these reportedly complete genomic assemblies presented incomplete 3' or 5' UTRs (median size 29743 nts; reference genome size of 29,903 nts, including a polyA tail of 33 nts). Additionally, 25,865 genomes contained a large number of gaps and/or uncalled bases -ranging from 151 to 3,652. Stringent criteria were used to retain only sequences which were more likely to represent a nearly complete assembly of the SARS-CoV-2 genome (more than 29,850 nts in size) and containing only a limited number of uncertain bases (less than 150 Ns). A total of 102,951 genomes were therefore selected as the "high quality set" (Supplementary Table S1 ). Analysis of raw genetic distances between these SARS-CoV-2 genomes (see Materials and Methods) revealed a mean of 0.535 variant sites between pairs of most similar sequences, slightly lower than the equivalent figure for late-phase isolates of SARS-CoV-1 from the SARS 2003-2004 epidemics (0.78 polymorphic sites) (Song et al. 2005) . Furthermore, 63,869 (62.53%) of the high quality SARS-CoV-2 genomes analysed, have a perfect sequence identity (0 polymorphic sites) with respect to at least one other genome. Mutation rates were estimated according to the formula described in Zhao et al. 2004 and, consistent with the observed pattern of genetic distances, were marginally lower for SARS-CoV-2 (1.89 ± 0.53 x 10 -3 substitutions per site per year) than for SARS-CoV-1 (2.38 ± 0.47 x 10 -3 substitutions per site per year). While these values are in line with those reported by other studies (Boni et al, 2020) , we note that estimates of viral evolutionary rates can vary considerably with the timescale of measurement, owing to exponential population growth and/or varying selective pressures. Indeed, when more distantly related species were used to calibrate the estimation of evolutionary rates, substantially lower estimates were obtained (Boni et al, 2020) A total of 28,222 distinct variable sites were observed among 102,951 high quality genomes included in these analyses (Supplementary Table S2 ). Considering the date of sampling as reported for each isolate, only 818 (2,89%) of these reached an allele frequency above 1%, the threshold above which a variant is considered fixed in a natural population (Wong et al. 2003 ) for at least one day. Strikingly, the majority of these variants remain highly frequent only for a limited period of time (Supplementary Figure 1A , median 15 days, upper quartile 35 days) possibly indicating a strong effect of drift and/or sampling biases. Notably, and consistent with this hypothesis a significant positive correlation is observed between the total time of circulation (that is the total number of days in which a variant is observed at over 1% frequency among sampled genomes) and maximum allele frequency reached (Supplementary Figure 1B) . Interestingly, according to our analyses only 182 high frequency variants (0.644%) (Supplementary Table 3 ) achieve an allele frequency greater than 1% for more than 50 days in total. Importantly, when the entire collection of 178,191 SARS-CoV-2 genomes is considered, the total number of variant sites is significantly increased (38581 sites,Supplementary Table S2 ), but the number and type of high (≥ 1%) frequency variant sites remains relatively constant (811 in total), suggesting a robust estimate of comprehensive allele frequencies for these sites. Of the 7 sites that show a reduced prevalence, 5 are located within 200nt of the 3' terminal and 2 within the first 52 nt of the 5' end of the genome, suggesting that their apparent reduction in frequency might be associated with the incompleteness of some genome sequences. The MEME and FEL methods (Kosakovsky- Pond et al. 2020) were applied to the concatenated alignments of protein coding genes of the 102,951 high quality complete genomes to identify signals of adaptive evolution. A total of 849 sites were associated (pvalue ≤ 0.05) with signatures of selection according to both methods (Supplementary Table S4 ). Of these, 398 were associated with positive selection, while 451 sites were deemed to be under negative selection. A highly significant over-representation of both sites under negative and positive selection is observed among the 818 high frequency polymorphic sites of protein coding genes. In particular, 36 of these sites (OR=2.97, Fisher p-value 6.26e-08) are highlighted as evolving under positive selection, while 61 sites (OR=4.44 , Fisher p-value 2.20e-16) were predicted to be under negative selection. (Strikingly, this pattern is even more pronounced, when only the 182 sites that display an allele frequency of 1% or greater for 50 or more days are considered, with 17 (Fisher p-value 5.45e-09, OR=6.62) and 30 (Fisher p-value 1.20e-21, OR=10.31) sites respectively, predicted to be under positive or negative selection. We and others (Chiara et al 2020 have proposed conceptually simple systems to monitor SARS-CoV-2 genetic diversity and complement phylogenetic methods. Here we develop this philosophy, employing an approach inspired by MultiLocus Strain typing (Maiden, 2006) , where only variants that reach a relatively high prevalence (typically 1%, a frequency that is often considered to represent fixation (Wong et al. 2003 ) are considered and clustering of allele presence/absence profiles identifies recurrent viral haplotypes. In the context of the extreme variation between allele frequencies observed for SARS-CoV-2, exclusion of low frequency variants, which, as previously demonstrated, typically show short temporal persistence, is potentially helpful in the search to capture significant trends in variation of genomic diversity. To accommodate these considerations, we propose a set of arbitrary, but empirically reasonable conditions for the operational classification of SARS-CoV-2 haplotypes: 1. Given that closely related genomes show an average of 0.535 polymorphic sites, we suggest that distinct haplogroups (HGs) should differ by least 2 high frequency polymorphic sites. 2. To avoid an excessive fragmentation, each haplogroup should incorporate at least 100 distinct genomes. 3. Since the variability of SARS-CoV-2 is limited, haplogroups that share one or more high frequency polymorphic sites (have one or more nucleotide substitution in common) should form "macro haplogroups" (MHGs). Genetic markers defining macro haplogroups should be "completely" fixed (i.e. AF >0.9) in the associated haplogroups. 4. To minimize the effects of sampling biases, analyses of allele frequencies should be performed for distinct time windows over the course of the COVID-19 pandemic. 5. To mitigate the impact of short term effects, only variants that show high frequency (above 1%) for a relatively long span of time, (arbitrarily set to a minimum of 50 -not necessarily consecutive -days in this study) should be included in the final analyses. To facilitate the classification of newly sequenced SARS-CoV-2 genomes, an automated software package that implements the criteria devised in the present study is made publicly available as a standalone tool at https://github.com/matteo14c/assign_CL_SARS-CoV-2 and, through a dedicated galaxy web-server at http://corgat.cloud.ba.infn.it/galaxy . Table 2, Supplementary Table S1 ) were recovered. A total of 82 (of 182) distinct high frequency genetic variants, were "completely" fixed (relative Allele Frequency > 0.9) in one or more haplogroups (Table 1 ). Figure 1B (and Supplementary Figure S2B ) shows that each haplogroup is defined by a characteristic molecular signature consisting between 2 and 15 high frequency alleles (Supplementary Table S5 ). HG1 is the only exception in this respect, as it is composed of genomes that are highly similar to the reference. Consistent with other reports (Benvenuto et al. 2020; Ceraolo and Giorgi, 2020; Forster et al. 2020; Gudbjartsson et al. 2020; Phan, 2020; Walker et al. 2020) , we observe ( Figure 2 Figure S4 ). Finally, in the UK HG5, the most prevalent haplogroup in the country for a long period of time, is now being gradually replaced by HG21 (Supplementary Figure S4) . Notably, and consistent with our previous observations, a rapid decrease in the prevalence of genomes associated with MHG1, MHG2, and MHG4 is observed after time T0+70, whereas relatively high prevalence of these genomes were observed in many countries during the early phase of the pandemic. Importantly, fromSupplementary Figure S5 , we notice that there is a modest prevalence of MHG3 in China, the country that is currently considered the origin of the outbreak, where it accounts for only 5.1 % of all the genomes available. Of the 82 high frequency polymorphic sites that reach complete, or nearly complete fixation in at least one haplogroup (Table 1, Figure 1B ) only two variants (11083G->T; 14805C->T;) show an allele frequency ≥ 0.01 in more than one macro haplogroup, while the remaining 80 have AF≥0.01 in only one macro haplogroup, and can therefore be considered MHG "specific". These observations strongly support our contention that high frequency variable sites, as defined here, are effective for the discrimination/classification of distinct genomic signatures in SARS-CoV-2. Strikingly, 24 of the 77 sites associated with protein coding genes that are fixed in and specific to at least one haplogroup are predicted to be under positive (9) or negative (14) selection according to FEL or MEME (Table 1) . Although this observation might be suggestive of distinct phenotypic features/properties for the different SARS-CoV-2 types, as previously suggested by other authors (Grubaugh et al. 2020; Korber et al. 2020; Pachetti et al. 2020) , in the absence of experimental validation, such inferences should be treated carefully. Strikingly, we notice that, among the 82 high frequency variants, that define the major haplogroups of SARS-CoV-2, 28 are also present in one or more genomes of SARSr-CoV-2 isolated from bat and/or pangolin specimens (Supplementary Figure S6) . These alleles appear to be highly admixed among SARSr-CoV-2 coronaviruses isolated from pangolins and bats, suggesting possible parallelism/convergence, but potentially suggestive of extensive recombination between immediate ancestors of SARS-CoV-2. To investigate possible scenarios of emergence of novel genome types, the allele frequency distribution of the 82 genetic variants that define the major haplogroups of SARS-CoV-2 (AF ≥ 0.01) were compared at intervals of 10 days since 26th Dec 2019 (the collection date of the reference genome). Within haplogroups, distributions of allele frequency are highly stable and do not change over time (Supplementary Figures S7, S8, S9 ). Since by definition major SARS-CoV-2 viral haplotypes identified in this study are formed by at least 2 or more characteristic genomic variants, this suggests that the majority of the genomic signatures that define distinct haplogroups concomitantly reached a high allelic frequency. This is even more evident when allele frequency distributions are compared within macro haplogroups, as several clusters of alleles show a rapid emergence and almost immediate fixation ( Figure 4 Supplementary Table S6, Supplementary Table S7) . Notably, we observe that genetic variants associated with HG18 and HG21, are rapidly becoming more prevalent. Importantly we notice that both groups incorporate emerging variants that alter the sequence of the spike protein (Supplementary Figures S10A and S10B): L18F (HG21), and A222V (both HG18 and HG21). Both variants are predicted to be under positive selection according to FEL and MEME (Table 1 ). The rapid emergence of the A222V variant, which probably originated in Spain, has already been described in (Hodcroft et al 2020) . Although it might be tempting to speculate that, similar to the D614G variant in the spike protein -the hallmark of MHG3 -these rapidly emerging variants might be associated with increased viral fitness, other nonsynonymous spike protein variants, such as S477N in HG15, also show signatures of positive selection and apparent increase in allele frequency, but subsequently exhibit rapid decrease in prevalence (Supplementary Figure S10C ). In addition to rapid selection of standing variation as an adaptive process (Nowak and Schuster,1989) , other evolutionary processes, including genetic drift and founder effects, can explain rapid changes in allele frequency (Holland et al. 1991; Lynch et al. 2016 ). Indeed, the proposed importance of "superspreader" events and individuals and the inferred overdispersion of R 0 associated with SARS-CoV-2 transmission patterns (eg Gomez-Carballa et al 2020; Endo et al 2020) might be consistent with an important role for founder effects, particularly in the context of containment strategies imposed in many countries after the initial outbreak and after the initiation of "second waves". Comparison of isolation dates ( Figure 5 , and Supplementary Fig S11) , suggest that the majority of currently observed haplogroups of SARS-CoV-2 were not present during the initial phases of the pandemic and seem to emerge at a later time. Using arbitrary thresholds, based on days of first isolation, haplogroups can be roughly divided into "early" (HG1-HG3, HG5, HG9, HG11 appearing within 30 days of the isolation of the reference genome), "middle" (HG4, HG6-HG8, HG10, HG12-HG15, and HG19: appearing between 30 to 100 days), and "late" (HG17,HG18,HG20-HG22: appearing after 100 days) ( Table 2) . Consistent with our previous findings, we observe that "early" haplogroups have a probable origin in China, as shown in the comparison of the phenetic patterns and the localities of the first 50 isolates (in terms of date, see Supplementary Figure S12 ). Conversely, the "middle" and "late" haplogroups are likely to have become distinct outside China (Supplementary Figure S13 and S14). To discriminate between possible evolutionary scenarios associated with the rapid emergence of novel haplogroups in SARS-CoV-2, we reasoned that while founder effects or selection should be associated with an overall reduction in genomic diversity of viral subpopulations, convergent evolution should not alter the underlying allele frequency distribution (except at the sites under selection). Accordingly, we compared average allele diversity between genomes associated with "early", "middle" and "late" emergence within each major haplogroup ( Figure 6A) . The results show a statistically significant reduction of genetic diversity for middle and late HGs (i.e. emerging after ≥30 days) compared to early haplogroups (Wilcoxon Sum and Rank test p-values 1.145e-15, 2.367e-14 and 4.45e-13 for early vs middle in HG2, HG3 and HG4 respectively; P-value ≤ 2e-16 for early compared to late). Intriguingly, a moderate but still statistically significant reduction is observed also when middle and late haplogroups in MHG3 are compared (p-value 0.000103). Importantly, ( Figure 6B ), we observe that evolutionary rates are highly homogeneous and do not show detectable changes between haplogroups, suggesting that reduced diversity of late clusters is not associated with a reduction of evolutionary rates. According to our starting hypothesis, and in the light of the biased geographic sampling and prevalence of different HGS, these results suggest that the emergence of novel SARS-CoV-2 genome types is unlikely to be driven by widespread convergent evolution and independent fixation of advantageous substitutions. Remarkably, our analyses do not support an increased genomic diversity for haplogroups included in MHG3 compared to other MHGs), although the 14408 C>T substitution causing the P323L variant in the nsp12 gene (RdRp),was previously described as associated with an increased genomic variability (Pachetti et al. 2020) . We speculate that biased/incomplete sampling of MHG3 during the early phase of the pandemic, and the fact that Pachetti et al compared raw non-normalized genetic distances (instead of normalized evolutionary rates) are the most likely explanation for this discrepancy. Profiles of genomic variability for each of the haplogroups and macro haplogroups defined in this study were established using windows of 100 bp in size and sliding by 50 bp. As shown in Figure 7 and Supplementary Figure S15 , the observed patterns are remarkably similar suggesting common patterns of variation. Density of polymorphic sites is significantly enriched (Adjusted Fisher test p-value ≤1e-15 and ≤1e-12 respectively) in both the 5' and 3' UTR regions, while protein coding loci (CDS) show less variability. Strikingly, a single genomic region in the 3' UTR accumulates ~10x more mutations than any CDS, and ~ 2x more than any other UTR region, and is the single most variable region in the genome of SARS-CoV-2 ( Figure 8A ). This highly variable genomic region is associated with a conserved secondary structure ( Figure 8A ), known as s2m (Tengs and Jonassen, 2016) . S2m is a 43-nucleotide element that has been described in several families of singlestranded RNA viruses, including Astroviridae, Caliciviridae, Picornaviridae and Coronaviridae (Tengs and Jonassen, 2016) . The molecular function of this potentially mobile structural element is not well understood. Current hypotheses include a role in the hijacking of host protein synthesis through interactions with ribosomal proteins (Robertson et al. 2005) , and RNA interference (RNAi) via processing of the s2m elements into a mature microRNA (Tengs et al. 2013) . In coronaviruses, the highly conserved nature s2m has also allowed the development of a PCR-based virus discovery strategy (Jonassen, 2008) . As outlined inFigure 8B, compared to the consensus secondary structure of s2m described in the Rfam database, the reference genome of SARS-CoV-2 harbors a nucleotide substitution at a highly conserved and structurally important position, with possible impacts on structural stability (the T at the SARS-CoV-2 genomic position 29,758, indicated by an arrow in Figure 8A ). Secondary structure predictions suggest that of all possible 129 single nucleotide substitutions in the presumably ancestral sequence of s2m, as observed in the genome of RaTG13 SARSr-CoV-2, this would be the second most destabilizing in terms of Table S8 ). Based on this observation and on the high levels of variation of the entire s2m region, it is tempting to speculate that s2m could either be subject to diversifying selection in SARS-CoV-2, or have lost significant purifying constraints. Strikingly, we observe that the G->T substitution at 29,742, which is a hallmark of haplogroup 11, would result in a substantially increased stability of s2m (Supplementary Table S7 ), with an MFE that becomes substantially lower than that of the s2m structure of the reference genome. Interestingly we observe that this variant also achieves a relatively high frequency (max 6.4%) also in haplogroup 8, a possible hint of convergent evolution. Conversely, 5 other recurrent substitutions in s2m, that achieve an allele frequency of 1% during the time course of the SARS-CoV-2 pandemic (29,742 G>A and 29,734 G>T, 29736 G>T, 29751 G>T and 29747 G>T) are not associated with a specific haplogroup and are predicted to result only in a marginal decrease of the MFE of the s2m secondary structure (Supplementary Table S8 ). Interestingly, we notice ( Figure 8C ) that the same consideration applies to the majority of the nucleotide substitutions that are observed in the SARS-CoV-2 s2m element. Indeed, with respect to the background of all possible nucleotide substitutions that could occur in s2m of the SARS-CoV-2 reference genome, the set of variants that is actually observed in extant SARS-CoV-2 genomes are associated with only a modest increases in the thermodynamic stability of the structure. A co-folding analysis of all distinct variants of the s2m elements found in the 102,951 complete and high quality genomes -according to the criteria defined in this study -suggest a very degenerate secondary structure of s2m in SARS-CoV-2 ( Figure 8D ). Notably while a substitution that restores the presumably ancestral state of s2m (i.e., the secondary structure of RaTG13 SARSr-CoV-2) is observed (29758 T>G), this substitution is associated only with a very limited number of genomes (103, AF=0,00100841). Effective approaches for the interpretation of genome sequences are fundamental for monitoring and understanding the dynamics of the spread and evolution of pathogens, and the SARS-CoV-2 paradigm, given both its global significance and the availability of modern sequencing technologies is particularly instructive in this sense. In the current study we propose a rational and reproducible approach for the delineation of genomic diversity in SARS-CoV-2 which also takes into account the temporal distribution of allele frequency by building on an informative set of variable sites, which show high prevalence in the viral population for a relevant frame of time. Applying our system to the entire collection of (more than 175,000) genomic sequences, as available on 10th November 2020, we derive interesting observations concerning evolutionary patterns of SARS-CoV-2. We observe a low level of variability and infer a relatively low mutation rate (1.84 sites per 10 -3 nt per year) in SARS-CoV-2 which is consistent with previous estimates (Zhao et al. 2004; Sanjuán et al. 2010) . The presence of representatives of different viral haplogroups during the early phases of the pandemic (within 25 days of the report of the first case of COVID19 in Wuhan) in several distinct geographic regions of China, is suggestive of an early circulation of SARS-CoV-2 in humans, probably well before the major outbreak of COVID19 in Wuhan. These observations, coupled with evidence provided by several other studies, would indicate an early spillover of SARS-CoV-2 to humans (Apolone et al 2020; Deslandes et al 2020; La Rosa et al. 2020; Zehender et al. 2020; ) . Careful monitoring of the evolutionary rates of SARS-CoV-2 over a longer period of time, and ideally also on an unbiased/matched number of genomes isolated from different geographic areas, are required to confirm these inferences. In this respect, the fact that a relevant number of SARS-CoV-2 high-frequency, and macro haplogroup specific polymorphic sites are found also in viral strains isolated from pangolins and bats specimens highlights an unexplored diversity shared by SARS-CoV-2 and SARSr-CoV-2 viral genomes. Moreover, the observed pattern of admixed SARS-CoV-2 haplogroupspecific alleles in the genomes of closely related SARSr-CoV-2 coronaviruses isolated from bat and pangolin specimens (Supplementary Figure S6 ), is highly consistent with possible recombination events, as suggested also by previous studies (Boni et al. 2020; Lam et al. 2020; Wong et al. 2020) . In the light of these observations, it is evident that additional sampling of a substantially larger number of viral specimens, isolated from non-human hosts will be required to reconstruct a more complete phylogeny and to possibly trace back the "original" spillover event. Indeed, notwithstanding the high levels of similarity to SARS-CoV-2 (in the order of 97%), RaTG13, the most closely related viral genome isolated from a bat specimen, is estimated to have diverged from SARS-CoV-2 more than 25 to 40 years ago (Boni et al. 2020 ). Our classification system, based on 82 high frequency, stable polymorphic sites, identifies a total of 22 distinct prevalent haplogroups and 4 macro haplogroups of SARS-CoV-2 genome types, all having a highly biased phylogeographic distribution (Figure 2-3) . We note that several polymorphic sites that are specifically associated (completely fixed) with haplogroups and macro haplogroups are predicted to be either under positive or negative selection according to state of the art methods for the study of evolutionary constraints in protein coding genes (Table 1) . Interestingly, several of these sites have previously been highlighted by other studies and tentatively associated with increased virulence and/or increased mutation rates of SARS-CoV-2 (Grubaugh et al. 2020; Korber et al. 2020; Pachetti et al. 2020) . While fixation of advantageous variants has previously been proposed as an effective and widespread mechanism for the rapid increase of the fitness of a viral population (Moya et al. 2004) , the functional relevance of these genomic variants remains, for now, unclear. We emphasize the importance functional and clinical validation as reduced levels of variability, high levels of recombination, transmission dynamics, and, particularly, biased sampling of genomic sequences, might impair the accuracy of methods based on phylogenetic reconstruction of ancestral states for the identification of selective signatures (Ives et al. 2007; Som et al. 2015) . In this respect, our observation of reduced genetic variability of middle and "late" viral haplogroups belonging to ( Figure 6A) , coupled with the highly biased phylogeographic distribution of SARS-CoV-2 genome types (Figure 2-3) , might be more consistent with genetic drift and founder effects rather than ongoing adaptive evolution. We The observation of highly divergent/geographically biased patterns of allele frequency distributions in the SARS-CoV-2, coupled with large differences in the number of genomic sequences sampled from different geographic areas or countries might represent a relevant limitation for this work. Indeed, these considerations might compromise estimation of allele frequencies, and thus limit the sensitivity of our approach in the identification of relevant/important genetic variants for which only a limited number of representative genomic sequences are available. For example, the majority of the largest HGs identified by our approach are associated with countries from which the largest number of genomes are available (UK, USA and Australia). This suggests that currently available sampling offers only a partial overview of SARS-CoV-2 genomic variability. Importantly, approaches/nomenclature systems based on true phylogenetic analyses do not suffer from this limitation, as the delineation of distinct lineages is not determined by their overall prevalence. However, and for the same reason, results of phylogenetic analyses might be more difficult to interpret and do not facilitate the rapid identification of highly prevalent/emerging genetic variants. As such, we believe that systems for the monitoring of the evolution of SARS-CoV-2 should integrate both types of approaches and routinely incorporate geographic and temporal dimensions. Notably, we observe a highly consistent pattern of nucleotide substitution in SARS-CoV-2 genomes between all haplogroups and macro haplogroups, characterized by an increased variability at UTRs, in spite of the fact that a significant proportion of genomic assemblies annotated as "full-length" in GISAID are incomplete at the terminal ends. Although this incomplete representation of genomic sequences does not affect the classification system proposed in this study, it might result in an inaccurate/incomplete representation of ongoing evolution of SARS-CoV-2. This is exemplified by the s2m element, a highly conserved secondary structure element located in the 3' UTR which carries a substitution in the reference genome of SARS-CoV-2 that destabilizes the secondary structure and is possibly functionally significant. The s2m element shows a patchy phylogenetic distribution among distinct groups of positive sense RNA viruses (picornaviruses, astroviruses and coronaviruses) and likely represents a "mobile" element (Tengs and Jonassen, 2016) . When present it is always found in the 3' region of such genomes (Jonassen, 2008) and shows extremely high levels of conservation at the structural and primary sequence levels. However, the patchy distribution within groups of viruses with extremely similar gene complements may argue against an essential role for this element -implying that it is advantageous in specific conditions. The substantial increase of genomic variability observed in the s2m locus, compared with the rest of the genome (as well as the observation that among 73 available SARS-CoV-1 genome sequences from the 2003-2004 epidemic and other SARSr-CoV-2 genomes, no variation is observed in s2m), suggest changes in selective pressures acting on this element. It remains unclear whether these changes reflect ongoing widespread diversifying selection in SARS-CoV-2, or whether the original disruptive substitution inactivated s2m function, leading to a general loss of significant purifying constraints. Patterns of single nucleotide substitutions in s2m provide contrasting evidence concerning the evolutionary patterns of this secondary structure element in SARS-CoV-2, as the most prevalent substitutions (29,742 G->T) seems to be associated with a considerable increase in secondary structure stability, but the majority of the substitutions observed in extant SARS-CoV-2 genomes are not optimal in terms of the recovery of a highly stable secondary structure, and, in particular, do not recapitulate the consensus sequence/structure. The functional role of s2m remains unclear: it has been proposed, on the basis of structural similarity to rRNA structural elements, that it might be involved in the regulation of translational efficiency of viral transcripts (Robertson et al. 2005) . We also note that viral RNA secondary structures have been implicated in the suppression of antigen presentation (EBNA-1 (Apcher et al. 2010; Tellam et al. 2012) ) and in the suppression of host innate immune responses (Vandevenne et al. 2010; McFadden et al. 2013; Witteveldt et al. 2014; Murat and Tellam, 2015) . We believe that detailed experimental evaluation of both s2m function and the possible phenotypic consequences of changes observed among SARS-CoV-2 isolates should represent a priority topic in coronavirus research. While, many questions concerning the mechanisms of evolution and the phenotypic characteristics of SARS-CoV-2 (increased/decreased virulence) remain open, approaches such as that presented here facilitate rapid grouping of frequent SARS-CoV-2 haplotypes and can be useful both for monitoring the changing prevalence of different types of SARS-CoV-2 and for the study of the molecular processes that underlie the emergence of novel viral types. A collection of 178,191 putatively complete, high coverage SARS-CoV-2 genomes and associated metadata was retrieved from the GISAID EpicoV (Shu and McCauley, 2017) platform on 10th November 2020. A total of 13 SARSr-CoV genome sequences isolated from non-human hosts, including bats and pangolins (Lam et al. 2020; Wong et al. 2020) , were also retrieved from the GISAID EpiCoV portal at the same date. SARS-CoV-2 sequence comparisons were performed using the reference Refseq were considered. For sites predicted to be under negative selection, only FEL was used, since MEME can not identify purifying selection (Murrell et al, 2012) . A total of 68 viral genomes from the SARS 2003 outbreak were retrieved from the NCBI virus database (Goodacre et al. 2018) . Calculation of evolutionary rates of SARS-CoV-2 and estimation of times of divergence were performed according to the formula described in Zhao et al. 2004 , based on genetic distances as determined in this study. Analyses of prevalence of allele frequency over time were executed based on the collection dates of individual genomes as reported in the GISAID metadata table. The collection date of the reference genomic sequence of SARS-CoV-2 in GISAID (26th December 2020), was set as time T0. Consecutive, non-overlapped intervals of 10 days were considered. For the analysis of allele frequency within haplogroups and major haplogroups the cumulative (from T0) sequence distribution was computed at every interval. Global analyses of allele prevalence within the complete collection of SARS-CoV-2 genomes were executed on the equivalent time intervals of 10 days, in this case however to capture local effects in allele frequency variation, overlapped (by 5 days) intervals were considered and the local distribution of allele frequency was computed by taking into account only the genomes isolated within each specific interval of time. A total of 3608 genomic sequences, for which collection dates were not reported in GISAID, were excluded from these analyses. Comparison of levels of variability of "early" and "late" clusters of SARS-CoV-2 genomes were established by 100 random samples of 150 genomes (batch), matched by date of collection, from each defined cluster (see below). The total number of distinct variant sites was calculated for each random batch of genomes, in order to derive a distribution of genomic variability. Significance between distributions of genomic diversity were established by means of the Wilcoxon, Sum and Rank test as implemented in the standard R libraries (R Core Team, 2019). Variability with respect to the reference NC_045512.2 SARS-CoV-2 genome was computed on sliding windows of 100 bp, overlapped by 50 bp, by counting the proportion of variable sites contained in each window (number of variable sites in the window, divided by the total number of variable sites in the entire genome) with a custom Perl script. A Fisher-exact test, contrasting the local variability in a window with the average variability in the genome, was used to identify hypervariable regions. P-values were corrected using the Benjamini-Hochberg procedure for the control of False Discovery Rate. Predictions of the secondary structure of the "Coronavirus stem-loop II-like motif" (s2m) and its Minimum Folding Energy (MFE) calculation were performed with the RNAfold program (Mathews et al. 2004 ) of the Vienna package (Gruber et al. 2008) , by artificially implanting each of the possible 129 substitutions in the 43 nt-long s2m sequence identified in the reference SARS-CoV-2 genome and in the presumably ancestral sequence of s2m, i.e. that observed in the genome of the RatG13 SARSr-CoV-2. Prediction of the consensus co-folding structure of s2m in SARS-CoV-2 was obtained by applying the R-scape (Rivas et al. 2017) program to the alignment of all s2m sequences found in the collection of the 11,633 high quality complete genome analysed in this study. Consensus secondary structure of the s2m element of Coronaviruses was as in the model RF00164 (https://rfam.org/family/RF00164) of the RFAM database. Graphical representation of the data, basic statistical analyses and clustering of viral genomes were performed by means of the standard libraries of the R programming language. A software package for the assignment of SARS-CoV-2 haplogroups as proposed in this Figure 1A . The dendrogram on the left indicates haplogroups with similar allele frequency profiles. The size of each "bubble" is proportional to the frequency of that allele in a given cluster. Barplot on the right panel indicates the number of genomes assigned to every haplogroup, scaled by logarithm base 10. Heatmap of worldwide prevalence of SARS-CoV-2 haplogroups. Only countries for which at least 100 distinct genomes of SARS-CoV-2 are available in a public repository are shown. Color codes on the top indicate individual haplogroups, according to Figure 1 . Pie-chart of prevalence of types of SARS-CoV-2 macro-haplogroups in different continents. Color code as in Figure 1 Bubbleplots of allele frequency in the 4 macro haplogroups of SARS-CoV-2 genomes, at different intervals in time. Each bubbleplot displays the allele frequency of the 82 high frequency polymorphic sites calculated at different, non-overlapping, intervals of 10 days. ("T_" with time 0= 26th December 2019). The size of each bubble is proportional to the allele frequency. Color codes according to haplogroups as in Figure 1 . The barplots above indicate the number of genomes of each macro-haplogroup observed at each time interval considered, scaled by logarithm base 10. Violin plots of isolation dates of SARS-CoV-2 strains assigned to each haplogroup of SARS-CoV-2 genomes. Color codes according to Figure 1 . Haplogroups are indicated on the X axis. Isolation dates are reported on the Y axis. Analysis of variability and structural stability of the s2m secondary structure element. A) Barplot of variability of different categories of genomic elements in the genome of SARS-CoV-2. Variability is reported as the proportion of polymorphic sites. B) Consensus secondary structure of the s2m element of coronaviruses according to the RFAM model RF00164 (https://rfam.org/family/RF00164). The arrow indicates the nucleotide substitution observed in the s2m of the reference genome of SARS-CoV-2 (position 29,758). C) Boxplot of MFE (minimum free energy) of predicted s2m secondary structures. Initial: MFE of the s2m element in the SARS-CoV-2 reference genome. Not observed: MFE of secondary structures associated with single nucleotide substitutions that are not observed in s2m of extant SAR-CoV-2 genomes. Observed: MFE of secondary structures associated with nucleotide substitutions found in the s2m element of extant SAR-CoV-2 genomes D) Prediction of secondary structure co-folding of s2m of SARS-CoV-2 according to the Rscape program. Color codes are used to indicate the level of conservation of single nucleotide residues according to the convention used in RFAM (white <=50%, gray >50% and <= 75%, black >75% and < 90%, red>=90% Unexpected detection of SARS-CoV-2 antibodies in the prepandemic period in Italy. Tumori Epstein Barr virusencoded EBNA1 interference with MHC class I antigen presentation reveals a close correlation between mRNA translation initiation and antigen presentation Pathogen Genomics in Public Health The 2019-new coronavirus epidemic: Evidence for virus evolution Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Unpublished data Genomic variance of the 2019-nCoV coronavirus Comparative genomics suggests limited variability and similar evolutionary patterns between major clades of SARS-Cov-2. bioRxiv Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR Coronaviridae Study Group of the International Committee on Taxonomy of Viruses The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2 The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade Demographic science aids in understanding the spread and fatality rates of COVID-19 Genomic Epidemiology: Whole-Genome-Sequencing-Powered Surveillance and Outbreak Investigation of Foodborne Bacterial Pathogens Clinical characteristics of fatal and recovered cases of coronavirus disease 2019 in Wuhan, China: a retrospective study SARS-CoV-2 was already spreading in France in late SARS and MERS: recent insights into emerging coronaviruses Temporal signal and the phylodynamic threshold of SARS-CoV-2. Virus Evol MUSCLE: multiple sequence alignment with high accuracy and high throughput Endo A; Centre for the Mathematical Modelling of Infectious Diseases Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China EpiCoV Data Curation Team. 2020. 55,000 viral genomic sequences of hCoV-19 shared with unprecedented speed via GISAID Phylogenetic network analysis of SARS-CoV-2 genomes COVID-19 vulnerability: the potential impact of genetic susceptibility and airborne transmission Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 superspreaders A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear. Cell. Epub ahead of print The Vienna RNA websuite Spread of SARS-CoV-2 in the Icelandic Population Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020 Quantitation of relative fitness and great adaptability of clonal populations of RNA viruses Within-species variation and measurement error in phylogenetic comparative methods Detection and sequence characterization of the 3'-end of coronavirus genomes harboring the highly conserved RNA motif s2m Spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell. Epub ahead of print HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins Suppression of a SARS-CoV-2 outbreak in the Italian municipality of Vo First detection of SARS-CoV-2 in untreated wastewaters in Italy SMS: Smart Model Selection in PhyML Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Genetic drift, selection and the evolution of the mutation rate cluster: Cluster Analysis Basics and Extensions Multilocus sequence typing of bacteria MUMmer4: A fast and versatile genome alignment system Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure Influence of genome-scale RNA structure disruption on the replication of murine norovirus--similar replication kinetics in cell culture but attenuation of viral fitness in vivo FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix The population genetics and evolutionary epidemiology of RNA viruses Effects of messenger RNA structure and other translational control mechanisms on major histocompatibility complex-I mediated antigen presentation Detecting individual sites subject to episodic diversifying selection Genetic Predisposition To Acquire a Polybasic Cleavage Site for Highly Pathogenic Avian Influenza Virus Hemagglutinin Ethnic and socioeconomic differences in SARS-CoV-2 infection: prospective cohort study using UK Biobank Error thresholds of replication in finite populations mutation frequencies and the onset of Muller's ratchet Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant Genetic diversity and evolution of SARS-CoV-2 Emergence of a novel human coronavirus threatening human health Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs The structure of a rigorously conserved RNA element within the SARS virus genome R: A language and environment for statistical computing. R Foundation for Statistical Computing Viral mutation rates GISAID: Global initiative on sharing all influenza datafrom vision to reality Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human Causes, consequences and solutions of phylogenetic incongruence Messenger RNA sequence rather than protein sequence determines the level of self-synthesis and antigen presentation of the EBV-encoded antigen, EBNA1 A mobile genetic element with unknown function found in distantly related viruses Distribution and Evolutionary History of the Mobile Genetic Element s2m in Coronaviruses Innate immune response and viral interference strategies developed by human herpesviruses Severe acute respiratory syndrome (SARS): a review Genetic structure of SARS-CoV-2 reflects clonal superspreading and multiple independent introduction events A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach The influence of viral RNA secondary structure on interactions with innate host cell defences A population threshold for functional polymorphisms Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019 A new coronavirus associated with human respiratory disease in China Analysis of genomic distributions of SARS-CoV-2 reveals a dominant strain type with strong allelic associations Genomic characterization and phylogenetic analysis of SARS-COV-2 in Italy Moderate mutation rate in the SARS coronavirus genome and its implications A pneumonia outbreak associated with a new coronavirus of probable bat origin We thank ELIXIR Italy for providing the computing and bioinformatics facilities. We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID's EpiFlu™ Database on which this research is based. The list is detailed in Supplementary Table S1