key: cord-0967084-qw05apnf
authors: Gómez-Carballa, Alberto; Bello, Xabier; Pardo-Seco, Jacobo; Martinón-Torres, Federico; Salas, Antonio
title: The impact of super-spreaders in COVID-19: mapping genome variation worldwide
date: 2020-05-21
journal: bioRxiv
DOI: 10.1101/2020.05.19.097410
sha: 2097cb84513c604bd6214fcbb6c73654e0b37b7e
doc_id: 967084
cord_uid: qw05apnf

The human pathogen severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the major pandemic of the 21st century. We analyzed >4,700 SARS-CoV-2 genomes and associated meta-data retrieved from public repositories. SARS-CoV-2 sequences have a high sequence identity (>99.9%), which drops to >96% when compared to bat coronavirus. We built a mutation-annotated reference SARS-CoV-2 phylogeny with two main macro-haplogroups, A and B, both of Asian origin, and >160 sub-branches representing virus strains of variable geographical origins worldwide, revealing a uniform mutation occurrence along branches that could complicate the design of future vaccines. The root of SARS-CoV-2 genomes locates at the Chinese haplogroup B1, with a TMRCA dating to 12 November 2019 - thus matching epidemiological records. Sub-haplogroup A2a originates in China and represents the major non-Asian outbreak. Multiple bottleneck episodes, most likely associated with super-spreader hosts, explain COVID-19 pandemic to a large extent.

The human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first detected in late 2019 in patients from the city of Wuhan (Hubei province, China) suffering from respiratory illnesses and leading to a disease that has been popularized as coronavirus disease or COVID-19. The disease was declared an International Public Health Emergency on 30 January 2020, and a few weeks later, on 11 March 2020, it was declared a pandemic by the World Health Organization ((WHO) 2020). Even though it has not possible to trace an index case, there is a large amount of epidemiological information that has been very useful for tracking the pandemic spread of SARS-CoV-2. The first report of February 2020), USA (20 February 2020) , Europe (Italy and Spain, 31 January 2020), etc. Wu et al. (2020) reported the first genome sequence of SARS-CoV-2 (29,903 bp length), from a worker at the Wuhan market admitted to the Central Hospital of Wuhan on 26 December 2019; this patient experienced severe respiratory syndrome. The authors identified a new RNA virus strain belonging to the family Coronaviridae that was subsequently designated as 'WH-Hum 1 coronavirus' (and also '2019-nCoV'). According to Coutard et al. (2020) the nearest bat precursor would be RaTG13 with a genome identity to SARS-CoV-2 of 98%. Phylogenetic studies supported the theory of a natural origin for SARS-CoV-2 (Andersen et al. 2020 ).

Since the beginning of the COVID-19 pandemic, there has been a growing interest in exploring genetic variation in the genome of severe acute respiratory syndrome coronavirus (SARS-CoV-2). Identifying patterns of genomic variation can help understand the origin and spread of the pandemic and facilitate the development of future vaccines. The amount of genome data deposited in public repositories in a such a reduced timeframe offers a unique opportunity for a detailed phylogenetical characterization of SARS-CoV-2, as well as the geographic mapping of the different clades spreading worldwide, and of the impact of the outbreaks on the genome variability of the virus. Initial analyses so far used a limited number of SARS-CoV-2 genomes only, and focused mostly on various evolutionary aspects of the SARS-CoV-2 genomes (Andersen et al. 2020; Forster et al. 2020; Li et al. 2020; Shen et al. 2020) . The Global Initiative on Sharing Avian Influenza Data GISAID (Shu and McCauley 2017) provides public access to the most complete collection of genetic sequences of several viruses, with special emphasis on influenza viruses. In 2020, the Global Initiative on Sharing Avian Influenza Data GISAID (Shu and McCauley 2017) started to compile sequence data from the virus causing COVID-19 disease, and now it makes thousands of genomic sequences of SARS-CoV-2 available. The China National Center for Bioinformation keeps an updated resource on COVID -19 (https://bigd.big.ac.cn/ncov/tool/annotation; ) and provides different analytical tools to study SARS-CoV-2 variation. The open-source project Nextstrain (https://nextstrain.org; (Hadfield et al. 2018) provides an interactive web portal that allows navigating SARS-CoV-2 genome variation and helps track the spread of disease outbreaks.

In the present study, we built a solid phylogenetic skeleton of SARS-CoV-2 genomes that allows to investigate sequence variation in a large number of genomes (>4,700; Supplementary Material) deposited in GISAID, explore sitespecific mutational instability, investigate phylogeographic patterns of variation worldwide, and clarify the role of super-spreader hosts in the pandemic.

Human SARS-CoV-2 genomes have a within sequence identity of 99.98% (Table   1) ; and are much more identical to bat coronavirus than to pangolin coronavirus, although the values vary substantially depending on the specimen, 93.44%-5 96.17% (Table 1) . When compared to pangolin coronavirus, the range of genome identities drops to 85.24%-92.35%.

Between 1,699 and 3,727 substitution variants separate the pangolin coronavirus genomes from the SARS-CoV-2 reference sequence, and this range drops to 1,105 to 1,369 (Table 1 ) when compared to bat coronavirus. The bat #412976 coronavirus genome is conflictive because it has an unusual amount of mutational differences with respect to the SARS-CoV-2 reference and has an abnormally low sequence identity with human coronavirus (76.87%), comparable to pangolin coronavirus. This genome is problematic in the sequence alignment and should be avoided in future comparative analyses.

An inter-specific Maximum Likelihood (ML) tree was built using pangolin, SARS, and bat coronavirus genomes as outgroups to investigate their phylogenetic relationships with SARS-CoV-2 (Supplementary material). The tree depicts the SARS coronavirus genome occupying the most external branch. Next, all the pan genomes cluster separately from bat and human coronavirus, which also group separately. In line with its very low identity with SARS-CoV-2, bat-412976 behaves as an outlier in the tree. Overall, the clustering pattern in the tree is in very good agreement with sequence identity values ( Table 1) .

We next focused our attention on the root for all existing SARS-CoV-2 genomes, assuming the bat coronavirus as its closest coronavirus relative. We built a new ML tree including all SARS-CoV-2 genomes sequenced up to 29 February 2020 (n = 621); almost all of them are of Asian origin and this group should contain the Most Recent Common Ancestor (MRCA), as it is evident from phylogenetics and epidemiology that the origin of the pandemic is in China and more particularly within haplogroup B (see below and Supplementary Material).

The ML tree unequivocally reveals that the root of SARS-CoV-2 is located in the basal B1 haplogroup (B1 genomes that do not belong to derived B1 sub-clades; SARS-CoV-2 mutation rate, as inferred from the ML tree, is 5.42´10 -4 (Bootstrap 2.5% -75% confidence interval: 4.29´10 -4 -8.02´10 -4 ) according to an uncorrelated relaxed-clock method; a slightly higher mutation rate of 6.05´10 -4 (Bootstrap 2.5% -75% confidence interval: 4.46´10 -4 -8.22´10 -4 ) was obtained assuming and strick-clock model.

According to a relaxed-clock model mutation rate the TMRCA for all SARS-CoV-2 genomes dates to 12 th November 2019 (Bootstrap 2.5% -75% confidence interval: 7 th August 2019 to 8 December 2019), fully matching epidemiological dates; estimates using an strick clock mutation rate varied very little: 7 th November 2019 (Bootstrap 2.5% -75% confidence interval: 18 August 2019 to 2 December 2019).

The most parsimonious tree (fully developed in Supplementary Material Figure S4 ; see also skeleton in Figure a) shows that the two very stable transitions C8782T and T28144C (3 and 1 total occurrences in the phylogeny, respectively) separate SARS-CoV-2 variation into two main clades, A and B, both originating in China. Sub-haplogroups emerging from these main clades are mainly supported by single mutations, most of them being very solid along the phylogeny (Table S1), and therefore granting the robustness of the different clades. It is notable that the structure of the branches in the parsimonious tree fully agrees with the skeleton shown in the ML tree. is mainly found in Asia, and especially in the Middle East (81.25%); its sub-clade A3a is also highly prevalent in the same region (31.94%) but shows even higher frequency in Oceania (44.44%). Other minor clades are found in more restricted areas; for instance, A4a (n = 39; 1.44% of A) is only found in Wales (Europe), while A5, A7 and A9b appear only in Asia.

Phylogeographic information allows reconstructing dynamics of (sub)haplogroups worldwide (Figure 2) . The main clades emerged in Asia (mainly in China), while some minor ones appeared outside Asia (next section;

The number of sequences belonging to clade A and its main sub-clades increased exponentially during the outbreak occurring outside Asia at the end of 

It is remarkable that a few haplotypes are disproportionally represented in continental regions or in particular countries (Supplementary Material; Figure   S10 ), appearing abruptly in a few days' period. This pattern is compatible with super-spreaders arriving to certain geographic locations and giving rise to severe founder effects ( Figure 3A ). Haplotypes #H1, #H2, #H3, and #H4 (ID's as in Table S8 There are additional examples of SARS-CoV-2 super-spreaders (Table   S8 ) appearing in restricted geographical areas. For instance, H8 (n = 33; A3) appears at high frequency in Japan (28/33; 84.85%). In Iceland, founder haplotypes represent a large proportion of all existing haplotypes on the island e.g. H7 exists only in Iceland (n = 37), and together with H2 (n = 34 in Iceland) and other four haplotypes, makes up 39.18% of all the haplotypes in this country.

In USA, H3 occurs 126 times, and H2 45 times; together with other five haplotypes, they make up 31.75% of all genomes in this country. In the UK, eight haplotypes make-up 28.95% of the total haplotypes. H9 (n = 26; B3a) and H14 Figure 3B) , followed by exponential growth from 20 th January 2020, coinciding with the Asian outbreak. The peak is reached on 30 th January 2020, matching the Asian lockdown. Consequently, Ne drops remarkably for the next couple of weeks, but starts to grow progressively again from 12 th February 2020, coinciding with the beginning of the non-Asian outbreak.

By overlapping COVID-19 incidence (officially reported cases per day worldwide; https://ourworldindata.org) with the EBPS plot, we observed comparable shape distributions, but with a remarkable 14-15 days' delay in reported cases per day worldwide with respect to the EBPS distribution ( Figure   3B ).

We have undertaken a large-scale study on SARS-CoV-2 genomes considering a sample that is more than an order of magnitude higher than those of previous analyses. By focusing on high-quality (HQ) genomes, we devoted great effort to elucidate the most parsimonious phylogeny of SARS-CoV-2. This effort has allowed us to present novel phylogeographic inferences on the origin and dynamics of CoV-2 strains. In particular, we discovered a few dozen genomes (representing > 1/3 of the total database) that played a fundamental role as superspreaders of COVID-19 disease. These SARS-CoV-2 strains (belonging to different haplogroups), occur with remarkable frequency in the dataset and became founders in restricted regions or countries in short time periods (of a few days).

SARS-CoV-2 genomes show very high identity among themselves (>99%) and lower to bat coronaviruses (>96%; BatCoV RaTG13); these values are very identical to earlier estimates based on a limited number of SARS-CoV-2 genomes (Ceraolo and Giorgi 2020) . The pangolin coronavirus genome, initially proposed as the original host of SARS-CoV-2, shows significantly lower identity. The high identity observed between SARS-CoV-2 genomes and other betacoronaviruses adds support to its zoonotic origin from a bat relative (Ceraolo and Giorgi 2020) .

The differences found between SARS-CoV-2 and their most related coronaviruses in horseshoe bat indicate that large number of mutational jumps were needed to generate these differences from a common ancestor which could have existed in a time frame between 1948 -1982 (Boni et al. 2020 . Divergent genomes could have been incubated in animal reservoirs before the zoonotic jump to humans in the shape of a B1 genome, in a process similar to that observed for palm civet as intermediary in other SARS coronavirus (Hu et al. 2017 ). These new coronaviruses would be able to use human ACE2 receptor to infect patients. Patterns of variation observed in SARS-CoV-2 could be explained assuming a unique index case, which would already contain the very specific and well-conserved PFCS insertion. This original B1 genome would then start to diverge very soon in Wuham in two directions of the phylogeny, giving rise to its most frequent sub-lineage B1a1 and, almost simultaneously, to other B lineages and the large haplogroup A (timeline in figure 3C ).

According to our inferences, the TMRCA for all SARS-CoV-2 genomes would be 12 th November 2019. Assuming a maximum incubation time in humans of up to 24 days (Guan et al. 2020) , the virus could have been infecting the first citizens from Hubei in a silent mode of transmission until the end of November 2019; and started to be noticed by Chinese health authorities in early to mid-December. The EBSP distribution suggests that the Ne of SARS-CoV-2 could have started to grow significantly from 30 th December 2019, i.e. only two-three weeks after the initial cases reported and probably favored by super-spreaders (e.g. genomes like the reference sequence played a special role in the beginning of the Asiatic epidemic). Subsequently, it followed an exponential growth that marked the beginning of the Asian outbreak on the 20 th January 2020 and lasted until the end of this month. Next, Ne experienced a notable drop coinciding with human intervention and quarantine implemented in Asia on 30 th January 2020.

Finally, the beginning of a second wave of expansion outside Asia starting around 12 th -27 th February 2020 is also well-recorded on the SARS-CoV-2 genomes ( Figure 3C) .

The two-week delay between the dates suggested by the EBSP distribution and the official documented incidence of COVID-19 in Asia could be due to the mean incubation time of the disease, but also to the number of cases officially declared being well below the real incidence.

With the data available in GISAID, we were not able to detect association between main haplogroups and age and sex of carriers. Further research is needed to investigate the possible differential effect of strains (haplogroups) with the disease outcome.

Evidence of natural selection acting on SARS-CoV-2 genomes needs further investigation (Supplementary Material), although the data suggest purifying selection acting on most of the SARS-CoV-2 genes when explored at an inter-specific level, and weaker intra-specific purifying selection. In agreement with this latter observation is the recent report indicating a 81 deletion at gene ORF7a that would convert the coronavirus in a less virulent pathogen with reduced short-term selective advantage (Holland et al. 2020) . None of the HQ genomes investigated in our report carry this deletion.

In contrast to the weak (or null) action of positive selection on COVID-19 spread, there is strong evidence pointing to the role of genetic drift occurring in many continental regions and restricted locations, especially outside China.

Phylogeographic analysis allowed us to investigate pandemic dynamics worldwide. The high incidence of a few lineages outside Asia was more probably due to drift and not selective advantages. The main non-Asian sub-clade A2a was probably among the first ones to leave Asia before this region established a severe population lockdown. In good agreement with epidemiological data, we observed multiple worldwide introductions of SARS-CoV-2 coming from Asia.

Super-spreaders were probably the main responsible for genetic drift episodes.

We detected >48 haplotypes in our dataset that most likely represent genomes transmitted by super-spreaders. These haplotypes have three differential features: they reached high to moderate frequencies in the population, they are characteristic of specific continental regions or even individual countries, and they appeared in a very short time period of only a few days. The data suggest that these genomes have played a fundamental role in COVID-19 spreading; they alone represent 34.61% of the total genomes in the database. The role of superspreaders is well reported in previous pandemics, including SARS, MERS and Ebola (Stein 2011; Wong et al. 2015) .

We found the 12bp polybasic furin cleavage site (PFCS) in all SARS-CoV-2 genomes with only two substitutions in two different genomes (belonging to different haplogroups; Supplementary Material). This segment is therefore highly mutationally stable. A BLAST search (https://blast.ncbi.nlm.nih.gov/) of the PFCS indicates that this sequence segment is absolutely specific of SARS-CoV-2. The fact that the PFCS has been found universally in all SARS-CoV-2 suggests that this insertion was acquired before the zoonotic event and not after (Andersen et al. 2020 ). The virulence conferred by this deletion to the coronavirus constitutes the focus of several studies (Lau et al. 2020 ).

The origin of SARS-CoV-2 has become a very popular question. The results of the present study (TMRCA dating of SARS-CoV-2, EBSP plot, and phylogeny) are compatible with an index case living in Wuhan-China, belonging to basal haplogroup B1, and most likely existing not before the beginning of November 2019. Subsequently, the coronavirus was transmitted from a living animal to a human host and then it started to spread from human to human. By analyzing stored biological samples from cases occurring at the beginning of the epidemy in Wuhan, it would be possible to narrow the search for patient zero among those belonging to the root of B1. The phylogeny built in the present study would be compatible with a single patient zero initiating the epidemic. Identifying the index case would help better understand how and when the spread of the pandemic begun, a lesson that would be useful in future pandemics. In agreement with previous studies (Andersen et al. 2020) , the theory of SARS-CoV-2 originating artificially in a lab finds no support in the results of the present study, in the sense that variation (within and between other species), and the step-wise mutational evolution observed at SARS-CoV-2 genomes is as expected for a RNA virus in nature.

This study warrants further expansion to clarify the role of super-spreaders in COVID-19 by investigating epidemiological data locally. Detecting and analyzing the genome of super-spreaders might shed light on the specific host genetic background contributing to their increased propensity to transmit the pathogen, as well as to understand the mechanisms of infection and transmission of the pathogen. Moreover, the phylogenic precision to which we classified SARS-CoV-2 genomes will also serve disease studies aimed at understanding the potential role of different pathogen strains in disease outcomes, and how these correlate to, and interact with, host genomic susceptibility.

We downloaded 4,721 complete genomes from the GISAID database (https://www.epicov.org/epi3/frontend) on 6 April 2020; 3,481 out of these 4,721 sequences were noted as high quality (HQ; > 29Kb, high cover only) based on the information provided by GISAID on 8 April 2020. Unless otherwise indicated, our analyses were carried out using the HQ SARS-CoV-2 genomes (see results).

Although SARS-CoV-2 is an RNA virus, the data deposited in GISAID are in DNA format. Metadata for these genomes was downloaded from the Nextrain repository (https://github.com/nextstrain/ncov/tree/master/data) on 7 April 2020; this contains information on geographic location of the sample (city, country, and continental region) date of submission, age, gender, etc. This data allowed the marking of each genome by date when the virus was isolated ("Location date") and geographic origin, among others. We also downloaded coronavirus genomes from nine pangolins and bat genomes analyzed in the Guangdong province The SARS-CoV-2 genomes were aligned against the reference sequence used by e.g. Nextstrain and many authors, with GenBank acc. nº MN908947.3 (submitted on 5 January 2020; GISAID ID #402125). This was the first SARS-CoV-2 genome released on GenBank.

Alignment of SARS-CoV-2 genomes against the reference sequence was carried out using MUSCLE v3.8.31 program (Edgar 2004) .

Apart from the discarded low-quality (LQ) sequences, and in order to ensure comparability between genomes, we trimmed the 5' and 3' untranslated regions; retaining a consensus sequence of 29,607 bp that runs from position 169 to position 29,776.

We built a maximum likelihood (ML) tree in order to investigate interspecific phylogenetic relations between SARS-CoV-2 genomes and nine pangolins, three bats, and the reference SARS genome (GenBank acc. nº: NC_004718.3).

Interspecific alignment was carried out using MAFFT program (Katoh et al. 2002) with default parameters. Genetic distances (F84) were computed using dnadist and default parameters, and the tree was built using dnapars; both programs are included in the software Phylip-3.697. With the SARS-CoV-2 genes aligned for all genomes with nucleotides in frame, and the ML tree, we used PAML 4 (Yang 2007 ) to compute the statistics w = Ka/Ks (also known as dN/dS), where Ka is the number of non-synonymous substitutions per non-synonymous site, and Ks refers to synonymous substitutions per synonymous site. This ratio allows to measure the strength and mode of natural selection acting on the protein genes.

There have been several attempts at reconstructing the phylogeny of SARS-CoV-2 genomes. At present there is no consensus phylogeny identifying the mutational changes characterizing main clades and sub-clades. Many of these earlier attempts used only a reduced number of genomes (Forster et al. 2020 ).

The interactive web-phylogeny presented by Nextstrain is probably the most elaborate attempt carried out to date; this resource defines two main branches, A and B, and 9 sub-branches that cluster a variable number of sequences ranging from only 3 (haplogroup A7) or 52 sequences (haplogroup A6) to 2,279 (A2a) (28 April 2020). GISAID identifies three large clades according to changes located in the ORF8 gene and other sequence variants: a) Clade S: change L84S, with S referring to the SARS-CoV-2 spike S-glycoprotein located on the surface of the viral envelope, and sequence variant T28144C; b) Clade G: change D614G and sequence variant A23404G; and c) Clade V: NSP3-G251V and sequence variant G26144T. In addition, GISAID refers to clade 'Other', which is in reality a paraphyletic clade. Note that GISAID uses the reference sequence of a SARSassociated coronavirus sequenced in 2003 (Marra et al. 2003) .

Here we used different strategies to build the phylogeny of SARS-CoV-2.

First, a phylogeny based on ML was carried out to find the phylogenetic root of all SARS-CoV-2 using the most similar pangolin and bat coronavirus to the SARS-CoV-2 genome as ancestral relatives (GISAID IDs [omitting prefix "EPI-ISL-"]): #410721, #402131 respectively). In the particular case of SARS-CoV-2 and taking into account epidemiological evidence, we know for sure that its root is among the first genomes sequenced in China (most likely in its Hubei province).

Therefore, we used only genomes available until the end of February to build the ML tree in order to reduce the noise generated by an unnecessarily large number of genomes that were sequenced later and spread outside China.

Second, the most parsimonious strategy allows to determine main and secondary sub-clades of the SARS-CoV-2 tree, identifying the characteristic mutations of clades. This phylogenetic procedure also allows to count the occurrences of mutations along branches, which serves as a good proxy for mutation-specific stability (Weissensteiner et al. 2016) . We followed the HQ standards used to generate the most robust molecular phylogeny based on maximum parsimony, the human mtDNA (van Oven and Kayser 2009), although the novelty of SARS-CoV-2 genomes and the use of a variety of NGS techniques (Bandelt and Salas 2012) , prevents the filtering out of sequencing errors as efficiently as in other well-known haplotypic-based phylogenies (Salas et al. 2005 ). Thus, we took the following steps with regards to data filtering:

1. We used only genomes labeled as 'high-coverage only' and 'complete' in GISAID.

2. We collapsed the sequences to the common sequence segment. NGS procedures generate artifacts at the 5' and 3' ends of the genome sequences, which generally consist of deletions. Before eliminating the extremes of the sequences, we checked in the complete genomes available if there were any variant that could be phylogenetically informative.

3. A solid phylogeny should be based on stable mutational variants; thus, only branches supported by at least five genomes were considered.

4. Since many sub-haplogroups are supported by single mutations, we only considered those having mutational stability.

We observed an excess of reversions at the tips of a few phylogenetic branches (e.g. C14805T reversion in a few A1a sequences). This phylogenetic noise could be due to the high mutational rate of the mutations involved, recombination (which is not unusual between coronaviruses (Rehman et al. 2020) ), or sequencing errors. For this reason, we decided to not resolve these branches further, while we await new evidence based on higher sequence quality genomes.

By simple counting of the mutational hits along the branches and at the tips of the phylogeny, it is possible to infer the relative mutation stability of diagnostic sites. The mutations at the tips of the phylogeny were counted only once within each terminal branch. Table S1 reports the number of occurrences in both the tree branches and in the tips of the phylogeny.

Finally, in order to guarantee the robustness of the phylogeny, we checked that inferences on the root and phylogeny estimated from the ML tree were in good agreement with the pattern of mutations observed in the most parsimonious tree.

The average number of nucleotide differences per site between DNA sequences or nucleotide diversity (p) (Nei and Li 1979) , sequence/haplotype diversity (HD) and Tajima's D statistics (Tajima 1989) were computed for the main continental regions, haplogroups and gene partitions. Tajima's D is a test for neutrality in the context of infinite-sites model of sequence evolution and it is negligibly affected by S, sample size, and recombination (Ramirez-Soriano et al. 2008) .

Multidimensional Scaling (MDS) was undertaken to identify clusters of genetic variation by examining (a) all the variation observed in the SARS-CoV-2 genomes and (b) the phylogenetic diagnostic variants of the SARS-CoV-2 tree inferred from parsimony. For this, we used the function cmdscale (library stats) from the statistical software R Project for Statistical Computing v. 3.3.1 (https://www.r-project.org/; (Team 2012) ).

The geographic representation of haplogroup frequencies in world maps was carried out using SAGA v. 7.6.2 (http://www.saga-gis.org/) (Conrad et al. 2015) , and the ordinary Kriging method. We used only sampling points with 10 samples or more; to avoid unnecessary loss of sampling points, a few of them were collapsed into nearest points to represent local areas, whenever possible, in order to reach the minimum sampling required.

From the sequence alignments and annotated files, we summarized information on mutational patterns in the SARS-CoV-2 genomes (Table S2 and   S3 ). We used a Fisher's exact test to check if there are differences in the transition-to-transversion ratio (ts/tv), and the synonymous/non-synonymous changes.

We carried out the ML phylogeny and Extended Bayesian Skyline plot analysis using the sequences collected before 29 February 2020 (n = 621). We constructed the ML tree using RAxML-HPC v.8 (Stamatakis 2014) and molecular rates, fitting a molecular clock to the phylogeny through a fast relaxed-clock method based on a Gamma-Poisson mixture model of substitution rates, and using the R package treedater (Volz and Frost 2017) . We selected the best molecular clock model by testing if the relaxed-clock offers better fit to data, and also identified and removed tip outlier lineages that fit poorly the substitution model in order to obtain a tree that could fit better the molecular clock. We estimated confidence intervals for rates and dates using a parametric bootstrap approach.

The demography of SARS-CoV-2 sequences was inferred using the Extended Bayesian Skyline Plot method (EBSP) (Heled and Drummond 2008) implemented in BEAST v2.6.2 (Drummond and Rambaut 2007) . EBSPs allow the inference of effective population size (Ne) through time and also estimate the number of demographic changes from the data. We used strict-clock and a rate of evolution of 0.80 x 10 -3 [0.14 x 10 -3 - In-house R and Perl (http://www.perl.org) scripts were used to display results obtained from the different software packages used. 

WHO Director-General's opening remarks at the media briefing on COVID-19 -11

The proximal origin of SARS-CoV-2

Current next generation sequencing technology may not meet forensic standards

Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic

Genomic variance of the 2019-nCoV coronavirus

System for automated geoscientific analyses (SAGA) v. 2.1. 4. Geoscientific Model Development

The spike glycoprotein of the new coronavirus 2019-nCoV contains a furinlike cleavage site absent in CoV of the same clade

BEAST: Bayesian evolutionary analysis by sampling trees

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Phylogenetic network analysis of SARS-CoV-2 genomes

Clinical Characteristics of Coronavirus Disease 2019 in China

Nextstrain: real-time tracking of pathogen evolution

Bayesian inference of population size history from multiple loci

An 81 nucleotide deletion in SARS-CoV-2 ORF7a identified from sentinel surveillance in Arizona

Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Attenuated SARS-CoV-2 variants with deletions at the S1/S2 junction

Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2

The Genome sequence of the SARS-associated coronavirus

Mathematical model for studying genetic variation in terms of restriction endonucleases

Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination

Evolutionary Trajectory for the Emergence of Novel Coronavirus SARS-CoV-2

A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics

Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients

GISAID: Global initiative on sharing all influenza data -from vision to reality

RAxML version 8: a tool for phylogenetic analysis and postanalysis of large phylogenies

Super-spreaders in infectious diseases

Statistical method for testing the neutral mutation hypothesis by DNA polymorphism

R: A Language and Environment for Statistical Computing. (ed. RCT The)

Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation

Scalable relaxed clock phylogenetic dating

HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing

The Role of Super-Spreaders in Infectious Disease

A new coronavirus associated with human respiratory disease in China

PAML 4: phylogenetic analysis by maximum likelihood

The 2019 novel coronavirus resource

This study received support from the Instituto de Salud Carlos III: project GePEM We gratefully acknowledge GISAID and contributing laboratories for giving us access to the SARS-CoV-2 genomes used in the present study.