key: cord-0779989-uubndgio
authors: Chiara, Matteo; Horner, David S.; Gissi, Carmela; Pesole, Graziano
title: Comparative genomics suggests limited variability and similar evolutionary patterns between major clades of SARS-CoV-2
date: 2020-04-04
journal: bioRxiv
DOI: 10.1101/2020.03.30.016790
sha: c03fd1659e0b27091c1ff54f539072cfd167ed66
doc_id: 779989
cord_uid: uubndgio

Phylogenomic analysis of SARS-CoV-2 as available from publicly available repositories suggests the presence of 3 prevalent groups of viral episomes (super-clades), which are mostly associated with outbreaks in distinct geographic locations (China, USA and Europe). While levels of genomic variability between SARS-CoV-2 isolates are limited, to our knowledge, it is not clear whether the observed patterns of variability in viral super-clades reflect ongoing adaptation of SARS-CoV-2, or merely genetic drift and founder effects. Here, we analyze more than 1100 complete, high quality SARS-CoV-2 genome sequences, and provide evidence for the absence of distinct evolutionary patterns/signatures in the genomes of the currently known major clades of SARS-CoV-2. Our analyses suggest that the presence of distinct viral episomes at different geographic locations are consistent with founder effects, coupled with the rapid spread of this novel virus. We observe that while cross species adaptation of the virus is associated with hypervariability of specific protein coding regions (including the RDB domain of the spike protein), the more variable genomic regions between extant SARS-CoV-2 episomes correspond with the 3’ and 5’ UTRs, suggesting that at present viral protein coding genes should not be subjected to different adaptive evolutionary pressures in different viral strains. Although this study can not be conclusive, we believe that the evidence presented here is strongly consistent with the notion that the biased geographic distribution of SARS-CoV-2 isolates should not be associated with adaptive evolution of this novel pathogen.

The SARS-CoV-2 pandemic (Poon and Peiris, 2020) poses the greatest global health and socioeconomic threat since the second world war. Complete genomic sequences of viral isolates from diverse geographic sites, have rapidly been made available through dedicated resources (Shu and McCauley, 2017, Goodacre et al,2018) facilitating comparative genomics studies, identification of putative therapeutic targets (Zhou et al, 2020 , Chen et al 2020 , Robson 2020 ) and the development of effective prevention and monitoring strategies (Qiang et al 2020) . Analyses of available genomic sequences, according to GISAID EpiCoV, suggest major viral clades, S, V and G which, collectively, circumscribe more than 69% of the characterised isolates. Strikingly, these clades show a markedly biased prevalence in different areas, with the S clade accounting for more than 72% of the viral isolates characterized in the USA, and the G clade comprising more than 74% of those that have been sequenced in Europe.

Importantly, while the G-clade was initially considered to be composed of viral strains which were not observed in China, availability of additional genomic sequences suggests that this clade as well should be nested within isolates from Shanghai. Although comparative analyses suggest that genomic variability between different isolates of SARS-CoV-2 is generally low (Lu et al, 2020 , Zhang e t al, 2020 , Tang et al, 2020 , the fact that distinct viral episomes show a highly biased geographic distribution is potentially alarming, as, at present, it is not completely clear whether frequent variants reflect the adaptive processes, which result in the emergence of novel, and potentially more virulent strains. Moreover at present it is unclear whether the genomic variability of major clades of SARS-CoV-2 and their biased geographic distribution, could explain -at least in part-apparent differing rates of lethality observed worldwide (Baud et al 2020) .

In the present study, exploiting curated viral genomic sequences, we present analyses of more than 1100 complete SARS-CoV-2 genomes, identified from 5 continents and more than 45 countries. By contrasting evolutionary patterns associated with the most prevalent viral clades, with those observed between closely related viral strains isolated from various species, we provide insights into the evolutionary dynamics and adaptation of SARS-CoV-2 like viruses to different hosts and the evolutionary patterns of the major clades of SARS-CoV-2.

We show that while the majority of the genomic variants that discriminate between major viral clades cause non synonymous substitutions in protein coding genes, including genes implicated in the modulation of the virulence of SARS-CoV-2 such as the spike protein and the RNA dependent polymerase (Weiss and Navas-Martin, 2005) , the major clades of SARS-CoV-2 show nearly identical patterns of genomic variability as well as the absence of signatures that are normally associated with adaptive evolution at protein coding loci. Indeed, the major clades of SARS-CoV-2 are identified only by a limited number of clade-specific genetic variants and show very modest variability. Notably, variable sites are enriched in the 5' and 3' non coding regions of the genome, unlike genomic sites which are hyper-variable between closely related strains with distinct host specificites.

While our observations cannot be considered conclusive, we believe that the available data are broadly consistent with the notion that the biased geographic distribution of SARS-CoV-2 isolates is not associated with adaptive evolution of this novel pathogen, but rather with extensive founder effects coupled with the rapid spread of this pathogen in diverse geographic zones.

A total of 1113 complete, high quality SARS-CoV-2 genomic sequences, as well as of 2 SARS-CoV-2 like viruses isolated from non-human hosts (bats and pangolins (Zhou et al 2020 , Matthew et al 2020 , were retrieved from the GISAID EpiCoV portal on March 24th 2020. Associated metadata (Supplementary Table S1) show that these isolates included in cover more than 45 countries in 5 continents. As expected, the geographic distribution of the data closely reflects geographic prevalence of the pandemic, although notable exceptions include limited public data from Italy, one of the early hotspots of the pandemic.

Genomic sequences obtained from GISAID EpiCoV were aligned to the reference SARS-CoV-2 assembly (Refseq (O'Leary et al, 2016) accession NC_045512.2) , i.e., the presumably ancestral Wuhan isolate (Wu et al, 2020) , and derive a phenetic matrix of presence/absence of the variants.

Substitution patterns of nucleotide residues, as shown in Supplementary Table S2 This notwithstanding we notice the presence of additional genomic variants, which could subtend the presence of smaller sub-clades.

Consistent with this observation, cluster stability metrics, based on the Dunn index, were strongly consistent with the presence of 9 clusters in the phenetic matrix of viral isolates. Clustering of viral isolates, based on the k-means algorithm with 9 clusters (Figure 1 and Supplementary Table S3) demonstrates a neat separation of viral episomes, with cluster number 7 corresponding to the two genome sequences of the SARS-CoV-2 like strains isolated from non human hosts. Cluster number 1 is formed by episomes that have limited variability with respect to the reference genome; cluster 2, 5 and 9 incorporate all the strains included in the European clade G; while clusters 3 and 4 correspond to the GISAID EpicoV clade S, with cluster 3 containing the majority of viral isolates prevalent in the U.S. Cluster 6 and cluster 8, which are composed of 35 and 40 viral strains, respectively, corresponds to a subset of the V clade as described in GISAID EpiCoV. While the majority of isolates included in cluster 8 have been isolated in Europe, cluster 6 does not seem to be associated with a specific geographic location. Notably, (Table 1 and Supplementary Table S4 ) we observe that the majority of polymorphic sites that discriminate between the 9 clusters identified by our analyses, are associated with nonsynonymous amino-acid substitutions, and several of these substitutions occur in viral genes that have been implicated in the modulation of the virulence of SARS and SARS-like associated coronavirus, including the spike protein and the RNA dependent RNA polymerase (nsp12) (Weiss and Navas-Martin, 2005) . In order to test the possibility that these variations could reflect early hints of adaptive evolution of SARS-CoV-2 strains, evolutionary dynamics of viral super-clades were contrasted with patterns of evolution of the two closely related viral strains with different host specificity.

As shown in Figure 2 , comparison of intra-cluster variability, performed using only the 713 polymorphic sites that are associated with a single viral isolate, clearly demonstrates similar levels of variation between all the SARS-CoV-2 clusters identified in this study, with a slight (but not statistically significant, Wilcoxon test p-value 0.129) increase in variability for strains included in cluster 6.

In order to identify regions of the genome which could be subject to distinct evolutionary pressures, plots of local genomic variability along the complete genomic sequence of SARS-CoV-2, were prepared by computing the proportion of polymorphic sites identified in each of the viral Super-clades on sliding genomic windows of 100 bp in size and overlapping by 50 bp. As shown in Figure Table S5 ) indicate that, as expected, the RDB domain of the spike protein seems to be subject to strong diversifying selection between viral isolates with different host specificity (dN/dS > 1). On the contrary, in SARS-CoV-2 isolates we do not observe a dN/dS value > 1of the RDB domain, but we notice that, in all our comparisons, the gene coding for the spike protein shows the highest level of dN/dS among all protein coding genes, suggesting thatprobably to facilitate adaptation to different hosts -this gene could be subject to relaxed selective pressure in coronaviruses.

coronaviruses are consistent with the notion that regions in the genome corresponding with increased variability, should be associated with relaxed purifying selection and/or increased diversifying selection. In this respect it is interesting to note that the 5' and 3' UTRs, which are the most variable regions of the genome between major SARS-CoV-2 strains, are among the least variable genomic regions, when SARS-CoV-2 and SARS-CoV-2 like strains with a different host specificity are compared.

Notwithstanding the limited variability of the genome, which prevents more detailed evolutionary analyses, our analyses provide no evidence for distinct evolutionary constraints acting on the different superclades and clusters of SARS-CoV-2 genomes. Although these results cannot be conclusive, the observation of similar dynamics of genomic variability, and that variable sites are prevalent at the 3' and 5' UTR genomic regions of SARS-CoV-2 indicate that it is unlikely that these differences should be the result of adaptive selection.

However, we notice that in SARS-CoV-2 the spike protein evolves under less constrained evolutionary dynamics compared to other genes. Indeed, in spite of the limited number of variable sites (135/3822) a dN/dS ratio of ~ 0.7 is observed for the spike gene, a value that is well above the value recovered for genes of similar size in the same genome and that would indicate that the spike gene is subject to weaker evolutionary constraints than the other protein coding genes. This possibly reflects a mechanism for the rapid adaptation to a more widespread range of hosts, as for example suggested by Menachery et al (Menachery et al, 2016) .

Notwithstanding some limitations, our comparative analyses are consistent with the hypothesis that the biased geographic distribution, and the allelic differences observed between major viral SARS-CoV-2 clades are not the result of an adaptive evolutionary process, but are more consistent with founder effects on viral populations, coupled with the rapid spread of this novel virus.

Although our analyses do not suggest distinct evolutionary patterns, it remains unclear whether the genetic variants that discriminate between major viral clades could be related with differences in the virulence/pathogenicity of these clades. To address this issue it will be crucial to collect patient metadata, to sequence more genomes, and to enable the execution of retrospective statistical analyses.

The complete collection of high quality, complete SARS-CoV-2 genomes and associated metadata was accessed from the GISAID EpicoV (Shu and McCauley, 2017) Functional effects of genetic variants as identified from genome alignments, were predicted by means of a custom Perl script, based on the annotation of the NC_045512.2 SARS-CoV-2 reference assembly.

Graphical representation of the data, basic statistical analyses and clustering of viral isolates were performed by means of the standard libraries of the R programming language. Similar to previous studies for the classification of SARS-CoV genomes (Chinese SARS Molecular Epidemiology Consortium, Song et al, 2005) , clustering of viral genomes was performed by considering only sites that were polymorphic in at least 2 different genomes. Singleton sites, that is polymorphic sites observed only in one genome, were excluded from this analysis due to considerations regarding the limited information content, but also to the fact that, according to previous reports, these sites could be slightly enriched in sequencing errors (Song et al, 2005) . Analogous considerations prompted us to execute the analysis of substitutions patterns both on the complete collection of polymorphic sites, and only on sites that were polymorphic in at least 2 genomes.

Determination of the optimal clustering solution was performed based on the Dunn Index metrics, as computed by the clValid R package (Brock et al, 2008) . Table 1 . List of polymorphic sites characteristic of each cluster of SARS-CoV-2 episomes. Position: indicates the genomic coordinate of the polymorphic site, Reference: the sequence of the reference genome and Alternative: the alternative allele, respectively. For variants associated with protein coding genes, the "AAresidue" and the "AAchange" columns are used to indicate the affected aminoacidic residue in protein coordinates and the predicted change in amino acid sequence, "-" indicates a silent substitution. The Gene column reports the corresponding geneor functionally annotated genomic element, while "Number of isolates" indicates the number of viral isolates in each cluster that have the polymorphism the comparison of bat RaTG13 SARS-CoV-2 like against the reference SARS-CoV-2 genome; E) the comparison of Pangolin SARS-CoV-2 like against the reference SARS-CoV-2;F) the comparison of bat RaTG13 SARS-CoV-2 like against the pangolin SARS-CoV-2 like. Genomic coordinates are represented on the X axis, number of variable sites per window on the Y axis analysis of all the variable sites ("_AllG") or of only sites variable in at least 2 genomes ("_2G"). Figure S2 . Histogram of the number of variable sites identified in: A) any SARS-CoV-2 genomes included in this study, with respect to the reference SARS-CoV-2 genome; B) SARS -CoV genomes from the late phase of the 2003 epidemic with respect to the SARS-CoV reference genome. Figure S3 . Prevalence of polymorphic sites in viral isolates. Number of viral genomes is reported on the X axis. Log scaled counts of the number of polymorphic sites supported by that number of genomic sequences is indicated on the Y axis Table 1 Figure 3. Plot of genomic variability calculated as the proportion of variable sites identified in overlapping genomic windows of 100 bp in:A) Super-clade I; B) Super-clade II; C) Super-clade III;D) the comparison of bat RaTG13 SARS-CoV-2 like against the reference SARS-CoV-2 genome; E) the comparison of Pangolin SARS-CoV-2 like against the reference SARS-CoV-2;F) the comparison of bat RaTG13 SARS-CoV-2 like against the pangolin SARS-CoV-2 like. Genomic coordinates are represented on the X axis, number of variable sites per window on the Y axis

A) B) C) D) E) F) Figure 3

Real estimates of mortality following COVID-19 infection

clValid:An R Package for Cluster Validation

Prediction of the SARS-CoV-2 (2019-nCoV) 3C-like protease (3CL pro) structure: virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates

Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China

A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection. mSphere

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

MUMmer4: A fast and versatile genome alignment system

Petrosino Evidence of recombination in coronaviruses implicating pangolin origins of

A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus

Preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the SARS-CoV-2 (2019-nCoV, COVID-19) coronavirus

Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human

Global initiative on sharing all influenza data -from vision to reality

On the origin and continuing evolution of SARS-CoV-2

Coronavirus pathogenesis and the emerging pathogen severe acute respiratory syndrome coronavirus

A new coronavirus associated with human respiratory disease in China

Genomic variations of SARS-CoV-2 suggest multiple outbreak sources of transmission

KaKs_Calculator: calculating Ka and Ks through model selection and model averaging

Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2

We thank ELIXIR Italy for providing the computing and bioinformatics facilities and Edward C.Holmes for his expert advice. We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID's EpiFlu™ Database on which this research is based. The list is detailed in Supplementary Table S1 

 Supplementary Table S1 . List of viral isolates included in the study. The table is in the same format as the submission acknowledgment table available from the GISAID EpiCoV website.

Rates of nucleotide substitution as identified from whole genome alignment. For every possible single nucleotide substitution (rows), the table reports the frequency of that substitution for: SARS-CoV-2_All vs SARS-CoV-2: alignments between SARS-CoV-2 genomes; SARS-Cov-2_2G vs SARS-CoV-2: as before , but computed only on polymorphic sites common to at least 2 genomes; SARS 2003L vs SARS 2003 : alignment between all the SARS-CoV isolates from the late phase of the SARS 2003 epidemic (SARS 2003L) with the reference SARS-CoV genome; bat RaTG13 CoV vs SARS-CoV-2: alignment between the SARS-Cov-2 reference genome and the genome of a SARS-Cov-2 like of bat(RaTG13 genome); pangolin CoV vs SARS-CoV-2: alignment between the SARS-Cov-2 reference genome and the genome of a SARS-Cov-2 like pangolin Supplementary Table S3 . Cluster assignment of viral isolates. Accession number of the isolate in GISAID and the relative N° of polymorphic sites with respect to the reference genome, are also reported Supplementary Table S4 . Functional annotation of polymorphic sites. The table lists all the 983 polymorphic sites identified from the comparison of 1113 SARS-Cov-2 genomes with the Refseq assembly. "Pos" reports the genomic coordinates of the polymorphic site, followed by the sequence on the reference assembly (Ref) and the alternative sequence (Alt), with "." indicating insertion/deletions. For variants associated with protein coding genes, "AA pos" and "AA change" are used to indicate the affected amino-acidic residue in protein coordinates, and the predicted change in amino acid sequence, respectively. Type indicates the predicted functional effect: "S" indicates a silent substitution, "NS" a nonsynonymous substitution, "FS" a frameshift, "Non-coding", that the variant is associated with a non protein coding region of the genome, "Na", stands for not applicable. "Functional element" reports the corresponding gene or functionally annotated genomic element, and "N° of genomes" indicates the number of viral isolates in each cluster that have the polymorphic site. The equivalent information for every cluster (number of isolates that have the variant) is reported in the columns "Cluster 1" to "Cluster 9", for clusters from 1 to 9 respectively. Table S5 . dN/dS ratio. The table reports the dN/dS ratio, computed by the means of the KaKs_Calculator for all the protein coding genes longer than 100 aa residues and with more than 50 variable sites associated with their CDS. Proteins are indicated by their gene symbol on the rows. "spike_rbd" indicates the spike recognition binding domain, "concat_no_spike", indicates a supergene formed by all the protein coding genes in the genome with the exclusion of the spike protein. "Na" stands for not applicable. bat RaTG13 CoV vs SARS-CoV-2: dN/dS between SARS-CoV-2 and the bat RaTG13 CoV; pangolin CoV vs SARS-CoV-2: dN/dS between SARS-CoV-2 and the SARS-CoV-2 like CoV isolated from pangolin specimens; SARS-CoV-2: dN/dS for SARS-CoV-2. Figure S1 . Heatmap of nucleotide substitution frequencies. The heatmap displays nucleotide substitution frequencies, as derived from whole genome alignment as reported in Table S1 . Frequencies are reported in each cell of the heatmap. A gray (low) to blue (high) gradient of color is used. The suffix "_AllG" and "_2G" are used to indicate substitution profiles derived from the Figure 2 . Boxplot of genomic variability for the 8 viral clusters of human-specific SARS-CoV-2 episomes, based on singleton polymorphic sites. Cluster color codes are as in Figure 1 .