key: cord-0285698-yoe84ta7
authors: Chiara, Matteo; Horner, David S.; Gissi, Carmela; Pesole, Graziano
title: Comparative genomics provides an operational classification system and reveals early emergence and biased spatio-temporal distribution of SARS-CoV-2
date: 2020-06-30
journal: bioRxiv
DOI: 10.1101/2020.06.26.172924
sha: 56caae445019d6baf1bc315ffe80603325bb4040
doc_id: 285698
cord_uid: yoe84ta7

Effective systems for the analysis of molecular data are of fundamental importance for real-time monitoring of the spread of infectious diseases and the study of pathogen evolution. While the Nextstrain and GISAID portals offer widely used systems for the classification of SARS-CoV-2 genomes, both present relevant limitations. Here we propose a highly reproducible method for the systematic classification of SARS-CoV-2 viral types. To demonstrate the validity of our approach, we conduct an extensive comparative genomic analysis of more than 20,000 SARS-CoV-2 genomes. Our classification system delineates 12 clusters and 4 super-clusters in SARS-CoV-2, with a highly biased spatio-temporal distribution worldwide, and provides important observations concerning the evolutionary processes associated with the emergence of novel viral types. Based on the estimates of SARS-CoV-2 evolutionary rate and genetic distances of genomes of the early pandemic phase, we infer that SARS-CoV-2 could have been circulating in humans since August-November 2019. The observed pattern of genomic variability is remarkably similar between all clusters and super-clusters, being UTRs and the s2m element, a highly conserved secondary structure element, the most variable genomic regions. While several polymorphic sites that are specific to one or more clusters were predicted to be under positive or negative selection, overall, our analyses also suggest that the emergence of novel genome types is unlikely to be driven by widespread convergent evolution and independent fixation of advantageous substitutions. While, in the absence of rigorous experimental validation, several questions concerning the evolutionary processes and the phenotypic characteristics (increased/decreased virulence) remain open, we believe that the approach outlined in this study can be of relevance for the tracking and functional characterization of different types of SARS-CoV-2 genomes.

The ongoing Coronavirus disease 2019 (COVID-19) pandemic [1] poses the greatest global health and socioeconomic threat since world war II. The first case of COVID-19 was reported in Wuhan city, Hubei province, China, in late December 2019 [2] , although retrospective analyses have placed the onset as early as December 1st.

The most common symptoms include fever, dry-cough and a general sense of fatigue, while some patients also experience muscle pain, nasal congestion, runny nose, sore throat, or diarrhoea [3] [4] . In a minority of the patients the infection may cause pneumonia, severe acute respiratory syndrome, kidney failure and even death [5] [6] . Symptoms are usually mild in children and young adults , while elderly, immunosuppressed, and individuals affected by a cardiac disease or diabetes are at higher risk of developing severe symptoms [7] .

COVID-19 is primarily transmitted between people through respiratory droplets and contact routes [8] , although some recent studies suggest that airborne transmission might also be possible [9] [10] . The incubation period typically ranges between 2 and 14 days, although longer incubation times have been also reported [11] . Notwithstanding the development of several promising therapeutic approaches [12] , at present no universally approved therapeutic strategy is available for the treatment of COVID-19.

At the time of writing, COVID-19 has affected more than 200 countries worldwide, with more than 9 Million confirmed individual infections and a death toll in excess of 480 thousand. The limited availability of diagnostic kits in several countries, varying criteria for reporting COVID-19-related deaths, and the fact that very mild or asymptomatic infections can often go undetected [13] , suggest that both these figures are likely to represent substantial underestimates of the worldwide impact of SARS-CoV-2. The first complete genomic sequences of the viral pathogen were determined in early January 2020 by Next Generation Sequencing metatranscriptomics [14] , allowing the rapid development of diagnostic tests (Corman et al., 2020) and the development of molecular monitoring strategies [15] [16] .

The genome is approximately 30.000 nt in size and shows high similarity (~79%) with SARS-CoV [17] , a beta-coronavirus of the subgenus Sarbecovirus, and the causal agent of a large scale epidemic of viral pneumonia (Severe Acute Respiratory Syndrome, SARS) that hit China and other 25 countries in 2003 and 2004 [18] . The International Committee on Taxonomy of Viruses (ICTV) has designated the novel pathogen as SARS-CoV-2.

Phylogenetic analyses of conserved protein domains of more than 2500 coronaviruses have been used to assign SARS-CoV-2 to the group of the Severe acute respiratory 2 syndrome-related coronavirus species (SARSr-CoV), where it forms a relatively distant sister group to SARS-CoV, interleaved with various SARSr-CoV isolated from non-human mammalian species [19] .

SARS-CoV-2 shows the highest levels of genome identity (96%) with a bat SARS-related (SARS-r) CoV denoted RaTG13, isolated in the Yunnan province [20] . Despite this sequence similarity, SARS-CoV-2 differs from RaTG13 in several key features. Arguably, the most important is the presence of a polybasic furin cleavage site insertion (residues PRRA) at the junction of the S1 and S2 subunits of the Spike protein [21] . This insertion, which may increase the infectivity of the virus, is not present in related beta-coronaviruses, although similar polybasic insertions are observed in other human coronaviruses, including HCoV-HKU1, as well as in highly pathogenic strains of avian influenza virus [22] .

Additionally, the RBD (Recognition Binding Domain) of the SARS-CoV-2 spike protein is significantly more similar (97% identity) to that of SARSr-CoVs isolated from specimens of Malayan pangolins ( Manis javanica ) illegally imported into southern China (Guangdong and Guangxi provinces) than to the RDB of RaTG13 ( 89% identity ). This data suggest the possibility that pangolins or other mammalian species could have acted as "intermediate" or "amplifying" hosts for human transmission of SARS-CoV-2 [23] [24] .

As the COVID-19 pandemic has progressed, the number of viral isolates for which a genomic sequence is available has increased substantially. Currently, in the excess of 50 .000 viral genomes are publicly available in dedicated repositories [25] . As expected, considering the recent ancestry, and reportedly low mutation rates of coronaviruses [26] , SARS-CoV-2 genomes are highly similar (average identity 99.99% ) and show reduced genetic diversity. This low diversity notwithstanding, the availability of a considerable number of genomic sequences, with worldwide sampling, allows the identification of phylogenetically distinct clusters of SARS-CoV-2 sequences [27] [28] [29] [30] [31] [32] . Interestingly, several of these clusters show highly biased geographic distributions, and in many independent studies, the genomic signatures of different clusters have been tentatively linked to increased/decreased virulence or possible adaptation to human hosts [33] [34] [35] .

While it should be stressed that, in the absence of careful experimental validation, it is extremely difficult to determine if the identified SARS-CoV-2 genetic variants have increased/decreased virulence and reflect adaptive evolution or genetic drift and founder effects, the importance of establishing a simple and reproducible system for the delineation of genomic diversity of human pathogens is universally acknowledged [36] [37] . Indeed such a classification system could be extremely useful for the rapid identification of new emerging types and, more importantly, for a fine grained monitoring of the diffusion of pathogens in 3 different geographic contexts over time.

Currently, the most widely used models for the classification of SARS-CoV-2 viral types are those provided by the curators of the GISAID [38] and Netxstrain portals [39] , which incorporate the most complete and accurate resources for comparative genomics of SARS-CoV-2, Providing extensive databases of more than 40000 viral sequences with associated metadata, including date and place of isolation, as well as information concerning health and background data of the patients. GISAID also provides tools for comparative genomic and phylogenetic analyses to facilitate the annotation of the genomes, their classification in types/groups and the functional interpretation of the data. Similarly, Nexstrain collects a selection of publicly available genomes and associated metadata, to provide informative analyses and graphical representation of the spread of SARS-CoV-2 over time as well as detailed and curated phylogenetic analyses for the classification of genome types. Although both platforms constitute invaluable resources for the SARS-CoV-2 research community, the associated classification systems suffer from some inherent limitations. The most important is that the criteria used to establish the clusters/types of SARS-CoV-2 are not set out clearly, and that the delineation of clusters is based on arbitrary manual annotation of phylogenetic trees, rather than on inherent genomic features of SARS-CoV-2, limiting the reproducibility of results. Furthermore, Nextstrain provides classification only for a restricted selection of the available viral genomes (currently less than 4000). Finally, limited variability and the recurrent events of recombination observed in Sarbecovirus [40] [41] may hinder the validity of classification systems based purely on phylogenetic analyses and manual annotation.

In the light of these considerations, in the present work we propose a simple set of rules, which could serve as an operational classification system of SARS-CoV-2 genomic sequences. The proposed classification system is inspired by MultiLocus Strain Typing (MLST), a classification approach normally used for microbes [ 42] . Thus, for a given dataset of available SARS-CoV-2 genomes, it combines the empirical study of genomic variability with the analyses of the high frequency variant sites that are fixed in the extant viral population. In this way our approach derives a simple but effective set of rules for a systematic and highly reproducible delineation of SARS-CoV-2 genomes, that are therefore grouped in clusters and superclusters.

To demonstrate the validity of our approach, we conducted an extensive comparative genomic analysis of more than 20,000 SARS-CoV-2 complete genomes. By applying the proposed classification system we derived important observations concerning different types of SARS-CoV-2, their evolutionary pattern and geographic distribution, and the mechanisms associated with the emergence of possible novel types.

A collection of 20,521 putatively complete high coverage SARS-CoV-2 genomes and associated metadata was retrieved from the GISAID EpicoV [38] platform on 28th May 2020. A total of 13 SARSr-CoV genomic sequences isolated from non-human hosts, including bats and pangolins [23] [24] , were also retrieved from the GISAID EpiCoV portal at the same date. SARS-CoV-2 sequence comparisons were performed using the reference Refseq [43] assembly NC_045512.2, collected on 26th December 2019 and identical to the sequence of the oldest SARS-CoV-2 isolates, dating back to 24th December 2019 (EPI_ISL_402123).

SARS-CoV-2 genomes were aligned to the 29,903 nt-long reference assembly of SARS-CoV-2 by means of the nucmer program [44] . Custom Perl scripts were used to infer the size of each genomic assembly and the number of uncalled bases/gaps (denoted by N in the genomic sequence). Then, we analysed only the 11,633 high quality complete genomes, defined as those longer than 29,850 nt and including less than 150 ambiguous sites.

Variant sites, including substitutions and small insertion and deletions, were identified by using the show-snps utility of the nucmer package. Output files were processed by the means of a custom Perl script, and converted into a phenetic matrix, with variable positions on the rows and viral isolates in the columns. Values of 1 and 0 were respectively used to indicate presence or absence of a variant.

Genetic distances between genomic sequences were established from this phenetic matrix using the dist function of the R stat package with default parameters (Euclidean distances) [45] [46] . Clusters were established by means of hierarchical clustering algorithms, with complete linkage as implemented in the hclust R standard libraries function. The cutree function was used to separate distinct clusters at the desired level of divergence (2 distinct variant sites).

Functional effects of genetic variants, as identified from genome alignments, were predicted by means of a custom Perl script, based on the annotation of the NC_045512.2 SARS-CoV-2 reference assembly.

Identification of sites possibly under selection was performed by applying the MEME and FEL methods, as implemented in the Hyphy package [47] , on the concatenated alignment of protein coding sequences of all the 11,633 previously identified high quality complete SARS-CoV-2 genomes. A p-value of 0.05 was considered for the significance threshold.

A total of 68 viral genomes of the SARS 2003 outbreak were retrieved from the NCBI virus database [48] . Classification/association of strains to the 3 (early/middle/late) phases of the SARS 2003 epidemic are according to Song et al 2005 [49] .

Calculation of evolutionary rates of SARS-CoV-2 and estimation of times of divergence were performed according to the formula described in Zhao et al [50] , based on genetic distances as determined in this study.

Analyses of prevalence of allele frequency over time were executed based on the collection dates of individual genomes as reported in the GISAID metadata table. The collection date of the reference genomic sequence of SARS-CoV-2 in GISAID (26th December 2020), was set as time 0. Consecutive, non-overlapped intervals of 10 by 10 days were considered. A total of 1381 genomic sequences, for which collection dates were not reported in GISAID, were excluded from these analyses.

Comparison of levels of variability of "early" and "late" clusters of SARS-CoV-2 genomes were established by 100 random resampling of 150 genomes (batch), matched by date of collection, from each defined cluster (see below). The total number of distinct variant sites was calculated for each random batch of genomes, in order to derive a distribution of genomic variability. Significance between distributions of genomic diversity were established by means of the Wilcoxon, Sum and Rank test as implemented in the standard R libraries [45] .

Variability with respect to the reference NC_045512.2 SARS-CoV-2 genome was computed on sliding windows of 100 bp, overlapped by 50 bp, by counting the proportion of variable sites contained in each window (number of variable sites in the window, divided by the total number of variable sites in the entire genome) with a custom Perl script. A Fisher-exact test, contrasting the local variability in a window with the average variability in the genome, was used to identify hypervariable regions. P-values were corrected using the Benjamini-Hochberg procedure for the control of False Discovery Rate.

Predictions of the secondary structure of the "Coronavirus stem-loop II-like motif" (s2m) and its Minimum Folding Energy (MFE) calculation were estimated with the RNAfold program [51] of the Vienna package [52] , by artificially implanting each of the possible 129 substitutions in the 43 nt-long s2m sequence identified in the reference SARS-CoV-2 genome and in the presumably ancestral sequence of s2m, as observed in the genome of the RatG13 SARSr-CoV-2.

Prediction of the consensus co-folding structure of s2m in SARS-CoV-2 was obtained by applying the R-scape [53] program to the alignment of all s2m sequences found in the collection of the 11,633 high quality complete genome analysed in this study.

Consensus secondary structure of the s2m element of Coronaviruses was as in the model RF00164 ( https://rfam.org/family/RF00164 ) of the RFAM database (reference).

Graphical representation of the data, basic statistical analyses and clustering of viral genomes were performed by means of the standard libraries of the R programming language.

The software for the operational classification of SARS-CoV-2 lineages proposed here, is publicly available through this github repository https://github.com/matteo14c/assign_CL_SARS-CoV-2 .

We retrieved 20,521 SARS-CoV-2 genomic sequences labeled as high coverage and putatively complete, covering 86 countries in 5 continents from the GISAID EpiCoV portal.

Surprisingly, we observed that a considerable number (6462) of the reportedly complete genomic assemblies of SARS-CoV-2 presented incomplete 3' or 5' UTRs (median size 29782 nts; reference genome size of 29,903 nts, including a polyA tail of 33 nts).

Additionally, 4310 genomes contained a large number of gaps and/or uncalled basesranging from 151 to 3500. Stringent criteria were used to retain only sequences which were more likely to represent a nearly complete assembly of the SARS-CoV-2 genome (more than 29,850 nts in size) and contained only a limited number of missing positions (less than 150

Ns). A total of 11,633 genomes were therefore selected and analysed (Supplementary Table   S1 ).

Comparative analysis of genetic distances between these SARS-CoV-2 genomes (see year). Interestingly, we notice that 6384 (54.87%) of the high quality SARS-CoV-2 genomes here analysed have a perfect sequence identity (0 polymorphic sites) with respect to their closest neighbor. All in all these observations are consistent with the reportedly low mutation rate of coronaviruses with respect to other single strand RNA viruses [26, 50] .

A total of 6319 distinct variant sites were observed between the 11,633 high quality genomes included in these analyses (Supplementary Table S2 ). 99.22% of these sites have an allele frequency below 1%, the threshold above which a variant is often considered fixed in a natural population [54] . This value is not surprising, consideringly the relatively recent ancestry and the low mutation rate of SARS-CoV-2. Thus, only 50 variant positions (0.88%)

show an allele frequency of 1% or greater (Table 1) .

Importantly, when the entire collection of 20,521 SARS-CoV-2 genomes considered in this study is taken into account, the total number of variant sites is significantly increased (9398 sites, Supplementary Table S2) (Table 1 ). This suggests that the apparent reduction in allele frequency at these 4 sites is most probably due to an incomplete representation of the terminal ends of the genomes. Table 1) .

The MEME and FEL methods [47] were applied to the concatenated alignments of protein coding genes of the 11,633 high quality complete genomes to identify signals of adaptive 8 evolution. A total of 194 sites were associated (p-value ≤ 0.05) with signatures of selection according to at least one method (Supplementary Table S3 ). Of these, 118 sites were associated with positive selection, while 76 sites were deemed to be under negative selection.

Strikingly, a highly significant over-representation of both sites under negative and positive selection is observed among the 42 high frequency polymorphic sites of protein coding genes ( tree. This is exemplified by the fact that at present, more than 1500 genomes are not assigned to a type according to GISAID.

As outlined in the previous section currently available genomic sequences of SARS-CoV-2

show a limited level of variability and a low number of high frequency polymorphic sites.

Moreover, a relatively high proportion of these sites may evolve under adaptive selection, possibly consistent with functional consequences of variants. Based on these observations, we propose a simple set of empirical rules for the operational classification of SARS-CoV-2 genomes:

1. Similar to approaches used for Multi Locus Strain typing [42] , classification of SARS-CoV-2 genomes should be based on variants that are shared by a relevant

proportion of viral population. Since a prevalence of 1% is normally considered to identify genetic variants that have reached fixation in a population [54] , we propose to use this cut-off as the natural threshold for identifying high frequency variants .

2. In the light of the fact that closely related genomes show an average of 0.64 polymorphic sites, we suggest that distinct clusters/types should differ at least for 2 high frequency polymorphic sites 3. To avoid an excessive fragmentation, each cluster/type should incorporate at least 100 distinct genomes.

4. Since the variability of SARS-CoV-2 is limited, clusters/types that share one or more high frequency polymorphic sites should form groups of higher order (super-cluster).

To evaluate the validity of the proposed clustering approach, we applied the criteria defined above for clustering both the 11,633 high quality genomes (high quality set), as defined by our stringent criteria, and the entire collection of 20,521 genomes (extended set).

Considerations regarding the apparent reduction in allele frequency of polymorphic sites at the ends of the genome (see above), prompted us to exclude these sites from the analyses of the extended set.

Irrespective of the dataset considered ( Figure 1A , Supplementary Figure S2A ), our criteria consistently identified 12 clusters (C1-C12) and 4 super-clusters (SC1-SC4) ( Table   2,Supplementary Table S1 ). As outlined in Figure 1B (and Supplementary Figure S2B) , each cluster and supercluster is defined by a characteristic molecular signature consisting of states at 2 to 9 high frequency variable sites. Cluster 1 is the only exception in this respect, since it is formed by genomes that are highly similar to the reference.

A comparison of our proposed clustering systems with the current classifications of Nexstrain and GISAID ( Table 2 ), shows that our method recapitulates all of the major groups of genomes defined by GISAID and Nextstrain, as well as identifies additional clusters undescribed by only one (clusters 9 and 10) or both (clusters 11 and 12) other classification methods. Importantly, although the novel clusters defined in this study are formed by a relatively limited number of genomes (243 and 315 respectively for cluster 11 and 12), they

show a similar genetic distance from the reference genome as all the other previously defined larger clusters ( Figure 1B) . Notably, a heatmap of worldwide prevalence of SARS-CoV-2 types ( Figure 2 ) shows that cluster 12 represents more than 40% of the genomes isolated in India, but shows only a very modest prevalence in all other countries worldwide. To facilitate the classification of newly sequenced SARS-CoV-2 genomes an automated software package that implements the criteria devised in the present study has is made publicly available at https://github.com/matteo14c/assign_CL_SARS-CoV-2 .

Of the 50 high frequency polymorphic sites included in our analyses 35 reach complete, or nearly complete fixation (relative AF ≥ 0.9) in at least one cluster (Table 1, Figure 1B super-cluster, and can be therefore considered super-cluster "specific". These observations strongly support our hypothesis that high frequency variable sites, as defined here, are highly effective for the discrimination/classification of distinct types of SARS-CoV-2.

Strikingly, 15 of the 35 sites that are fixed in and specific to at least one cluster are predicted to be under positive (8) or negative (7) selection according to FEL or MEME (Table   1 ). Although this observation might be suggestive of distinct phenotypic features/properties for the different SARS-CoV-2 types, as previously suggested by other authors [33] [34] [35] , we remark that in the absence of experimental validations, these data should be interpreted very carefully.

As shown in Figure 3 and consistently with previous reports, we observe a highly biased geographic distribution of SARS-CoV-2 genome types worldwide [27] [28] [29] [30] [31] [32] . Figure S9) . Overall, the low levels of variability of SARS-CoV-2, the emergence of novel genome types at different times during the pandemic, and the highly biased phylogeographic distribution (Figure 2 ), suggest that novel genome types of the "late" pandemic phases likely derived from the fixation of novel alleles in "early" pre-existent types.

We speculate that several possible alternative evolutionary processes can explain this pattern of rapid allele fixation, including genetic drift and founder effects [55] [56] , convergent evolution, and/or rapid selection of standing variation for the adaptation to a novel environment [57] . To discriminate between these scenarios, we reasoned that while founder effects and selection should be associated with an overall reduction in genomic diversity of relevant viral sub-populations population, convergent evolution should not alter the underlying allele frequency distribution of the population (except at the sites under selection).

We then compared the overall allele diversity between batches of genomes from all the clusters forming the super-clusters SC2 and SC3, batches equivalent both in number and timescale (see Materials and Methods) ( Figure 5A ). SC1 was not considered due to being composed of a single cluster, while considerations regarding the limited availability of genomes matched by collection date prompted us to exclude SC4 from these analyses. The results show a statistically significant (p-value Wilcoxon 1.3e-07 for C3, 1.5e-07 for C4 and 9.21e-08, 1.21e-07 and 1.13e-07 for C6, C7 and C8 respectively) reduction of genetic diversity for late clusters (i.e. clusters emerging after ≥60 days) compared to early clusters.

Importantly ( Figure 5B , Table3), we observe that evolutionary rates are highly homogeneous and do not show detectable changes between clusters, suggesting that reduced diversity of late clusters is not associated with a reduction of evolutionary rates. According to our starting hypothesis, and in the light of the strong phylogeographic bias, these results suggest that the emergence of novel SARS-CoV-2 genome types is unlikely to be driven by widespread convergent evolution and independent fixation of advantageous substitutions.

Remarkably, our analyses do not support an increased genomic diversity for clusters included in SC3 ( Figure 5A ), although their common molecular signature, i.e., the prevalent polymorphic site at position 14408 in the nsp12 gene (RdRp), was previously described as associated with an increased genomic variability ( Pachetti et al. [33] ). We speculate that biased/incomplete sampling of SC3 during the early phase of the pandemic, and the fact that Pachetti et al compared raw non-normalized genetic distances (instead of normalized evolutionary rates) are the most likely explanation for this discrepancy.

Patterns of genomic variability were established on sliding windows of 100 bp in size and overlapping by 50 bp, for all the clusters and super-clusters defined in this study. As shown in Figure 6 and Supplementary Figure S10 , the observed patterns are remarkably similar between all the clusters and super-clusters, suggesting common patterns of variation.

and ≤1e-12 respectively) in both the 5' and 3' UTR regions, while protein coding loci (CDS)

show less variability. Strikingly, a single genomic region in the 3' UTR accumulates~10x more mutations than any CDS,~2x more than any other UTR region, and is the single most variable region in the genome of SARS-CoV-2 ( Figure 7A ). This highly variable genomic region is associated with a conserved secondary structure ( Figure 7A ), already known in literature and referred to as s2m.

S2m is a 41-nucleotide genetic element with a highly conserved secondary -as well as primary and tertiary -structure, and has been described in several families of single-stranded RNA viruses, including Astroviridae, Caliciviridae, Picornaviridae and

Coronaviridae [58] . The molecular function of this potentially mobile structural element is not well understood. Current hypotheses include hijacking of host protein synthesis through interactions with ribosomal proteins [59] , and RNA interference (RNAi) via processing of the s2m elements into a mature microRNA [60] . In coronaviruses, the highly conserved nature s2m has also allowed the development of a PCR-based virus discovery strategy [61] .

As outlined in Figure 7B , compared to the consensus secondary structure of s2m described Table S7 ). Based on this observation and on the high levels of variation of the entire s2m region, it is tempting to speculate that s2m could be under diversifying selection in SARS-CoV-2. Strikingly, we observe that the G->T substitution at 29,742, which is a hallmark of cluster C11, would result in a substantially increased stability of s2m (Supplementary Table S7 Table S7 ). Interestingly, we notice ( Figure 7C ) that the same consideration applies to the majority of the nucleotide substitutions that are observed in the SARS-CoV-2 s2m element. Indeed, with respect to the background of all possible nucleotide substitutions that could occur in s2m of the SARS-CoV-2 reference genome, the set of variants that is actually found in extant SARS-CoV-2 genomes seems to produce a less stable secondary structure.

A co-folding analysis of all distinct variants of the s2m elements found in the 11,633

complete and high quality genomes -according to the criteria defined in this study -suggest a very degenerate secondary structure of s2m in SARS-CoV-2 ( Figure 7D ).

Notably while a substitution that restores the presumably ancestral state of s2m (i.e., the secondary structure of RaTG13 SARSr-CoV-2) is observed (29753 T->G), this substitution is associated only with a very limited number of genomes (AF=0.0012). In the current study we propose a rational and highly reproducible approach for the classification of SARS-CoV-2 genomes, inspired by Multi Locus Strain Typing (MLST), a molecular approach commonly used for the classification of pathogens [42] . Building on a highly informative set of variable sites, which show high prevalence in the viral population, our system is conceptually simple and represents a general and robust alternative to methods based on phylogenetic analyses. Applying our system to the entire collection of genomic sequences, as available on 28th May 2020, we derive interesting observations concerning evolutionary patterns of SARS-CoV-2.

By comparison of more than 20,000 complete or nearly complete genomic sequences we The evolutionary rates estimated in this study, coupled with the number of observed polymorphic sites would retrospectively date the spillover of SARS-CoV-2 in humans to August-November 2019, an estimate that is consistent with other recent studies [62] .

Careful monitoring of the evolutionary rates of SARS-CoV-2 over a longer period of time, and ideally also on an unbiased/matched number of genomes isolated from different geographic areas, are required to confirm these findings.

In this respect, the fact that a relevant number of SARS-CoV-2 high frequency polymorphic sites are observed even in viral strains isolated from pangolins and bats specimens highlights an unexplored diversity shared by SARS-CoV-2 and SARSr-CoV-2 viral genomes. This suggests that additional sampling of a larger number of non-human specimens is required to reconstruct a more complete phylogeny and to possibly trace back the "original" spillover event. Indeed, notwithstanding the high levels of similarity to SARS-CoV-2 (in the order of 97%), RaTG13, the most closely related viral genome isolated from a bat specimen, is supposed to have diverged from SARS-CoV-2 more than 25 to 40 years ago [63] .

Our classification system, based on 50 high frequency polymorphic sites, identifies a total of 12 distinct clusters and 4 super-clusters of SARS-CoV-2 genome types, all having a highly biased phylogeographic distribution (Figure 2-3) . We note that several polymorphic sites that are specifically associated (completely fixed) with clusters and super-clusters are predicted to be either under positive or negative selection according to state of the art methods for the study of evolutionary constraints in protein coding genes (Table 1) . Interestingly, several of these sites have been previously highlighted by other studies and tentatively associated with increased virulence and/or increased mutation rates of SARS-CoV-2 [33] [34] [35] .

While fixation of advantageous variants has previously been proposed as an effective and Notably, we observe a highly consistent pattern of nucleotide substitution in SARS-CoV-2 genomes between all clusters and superclusters, characterized by an increased variability at UTRs, in spite of the fact that a significant proportion of genomic assemblies annotated as "full-length" in GISAID are incomplete at the terminal ends. Although this incomplete representation of genomic sequences does not affect the classification system proposed in this study, it might result in an inaccurate/incomplete representation of ongoing evolution of SARS-CoV-2. This is exemplified by the s2m element, a highly conserved secondary structure element located in the 3' UTR which carries a substitution in the reference genome of SARS-CoV-2 that destabilizes the secondary structure and is possibly deleterious. In the absence of clear data on a function to be experimentally tested, it is very difficult to speculate whether the s2m of SARS-CoV-2 maintains the same functional activity. However, the substantial increase of genomic variability observed in the s2m locus, compared with the rest of the genome, suggests that this element might be subject to ongoing widespread diversifying selection in SARS-CoV-2, or at least have lost significant purifying constraints.

Patterns of single nucleotide substitutions in s2m provide contrasting evidence concerning the evolutionary patterns of this secondary structure element in SARS-CoV-2, as the most prevalent substitutions (29,742 G->T) seems to be associated with a considerable increase in secondary structure stability, but the majority of the substitutions observed in extant SARS-CoV-2 genomes are not optimal in terms of the recovery of a highly stable secondary structure.

In 

Emergence of a novel human coronavirus threatening human health

A pneumonia outbreak associated with a new coronavirus of probable bat origin

A Review of Coronavirus Disease-2019 (COVID-19)

COVID-19 pandemic-a focused review for clinicians

The trinity of COVID-19: immunity, inflammation and intervention

Characteristics of and Important Lessons From the

COVID-19) Outbreak in China: Summary of a

Cases From the Chinese Center for Disease Control and Prevention

Pattern of early human-to-human transmission of Wuhan

Aerosol and surface stability of SARS-CoV-2 as compared with SARS-CoV-1

COVID-19 vulnerability: the potential impact of genetic susceptibility and airborne transmission

The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application

Pharmacologic Treatments for

COVID-19): A Review

Suppression of COVID-19 outbreak in the municipality of Vo

A new coronavirus associated with human respiratory disease in China

Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR

Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Severe acute respiratory syndrome (SARS): a review

Coronaviridae Study Group of the International Committee on Taxonomy of Viruses

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

A pneumonia outbreak associated with a new coronavirus of probable bat origin

The spike glycoprotein of the new 20 coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade

Genetic Predisposition To Acquire a

Polybasic Cleavage Site for Highly Pathogenic Avian Influenza Virus Hemagglutinin

Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins

Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019

EpiCoV Data Curation Team. 55,000 viral genomic sequences of hCoV-19 shared with unprecedented speed via GISAID

Viral mutation rates

Genetic diversity and evolution of SARS-CoV-2

Phylogenetic network analysis of SARS-CoV-2

Proceedings of the National Academy of Sciences

The 2019-new coronavirus epidemic: Evidence for virus evolution

Genomic variance of the 2019-nCoV coronavirus

Spread of SARS-CoV-2 in the

Icelandic Population

Genetic structure of SARS

reflects clonal superspreading and multiple independent introduction events

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

Distinct Viral Clades of SARS-CoV-2: Implications for Modeling of Viral Spread

Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2

Genomic Epidemiology: Whole-Genome-Sequencing-Powered Surveillance and Outbreak Investigation of Foodborne Bacterial Pathogens

Global initiative on sharing all influenza data -from vision to reality

Nextstrain: real-time tracking of pathogen evolution

Genetic diversity and evolution of SARS-CoV-2

SARS and MERS: recent insights into emerging coronaviruses

Multilocus sequence typing of bacteria

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

MUMmer4: A fast and versatile genome alignment system

R: A language and environment for statistical computing. R Foundation for Statistical Computing

cluster: Cluster Analysis Basics and Extensions

HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies

A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection. mSphere

Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human

Moderate mutation rate in the SARS coronavirus genome and its implications

Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure

The Vienna RNA

Issue suppl_2

A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs

A population threshold for functional polymorphisms

Genetic drift, selection and the evolution of the mutation rate

Quantitation of relative fitness and great adaptability of clonal populations of RNA viruses

Error thresholds of replication in finite populations mutation frequencies and the onset of Muller's ratchet

Distribution and Evolutionary History of the Mobile Genetic Element s2m in Coronaviruses

The structure of a rigorously conserved RNA element within the SARS virus genome

A mobile genetic element with unknown function found in distantly related viruses

Detection and sequence characterization of the 3′-end of coronavirus genomes harboring the highly conserved RNA motif s2m

Genomic characterization and phylogenetic analysis of SARS-COV-2 in Italy

T

Supplementary Figure S8: Phenetic patterns and geographic area of the first 50 genomes for each "early" cluster. Each heatmap displays presence/absence of the 50 high frequency polymorphic sites (AF >0.01, identified in the 11,633 "high quality" complete SARS-CoV-2 genomes) in the first 50 genomes

Pink indicates a reference allele, red an alternative allele for that site

Country and date of collection are reported on the rows

Supplementary Figure S9: Phenetic patterns and geographic area of the first 50 genomes for each "late" cluster. Each heatmap displays presence/absence of the 50 high frequency polymorphic sites (AF >0.01, identified in 11,633 "high quality" complete SARS-CoV-2 genomes) in the first 50 genomes

Pink indicates a reference allele, red an alternative allele for that site. Panels on the left indicate continents: yellow=Asia, blue=Europe, pink=Africa, green=North America, orange=South America, purple=Oceania

We thank ELIXIR Italy for providing the computing and bioinformatics facilities. We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID's EpiFlu™ Database on which this research is based. The list is detailed in Supplementary Table S1