key: cord-1049033-ew2kjq5s
authors: Castells, Matías; Lopez-Tort, Fernando; Colina, Rodney; Cristina, Juan
title: Evidence of Increasing Diversification of Emerging SARS-CoV-2 Strains
date: 2020-05-15
journal: J Med Virol
DOI: 10.1002/jmv.26018
sha: c7a0e968b16fbd14fd8bf135f1404ffbe054831b
doc_id: 1049033
cord_uid: ew2kjq5s

BACKGROUND: On January 30th, 2020, an outbreak of atypical pneumonia caused by a novel Betacoronavirus (βCoV), named SARS-CoV-2, was declared a public health emergency of international concern by the World Health Organization. For this reason, a detailed evolutionary analysis of SARS-CoV-2 strains currently circulating in different geographic regions of the world was performed.

METHODS: A compositional analysis as well as a Bayesian coalescent analysis of complete genome sequences of SARS-CoV-2 strains recently isolated in Europe, North America, South America and Asia was performed.

RESULTS: The results of these studies revealed a diversification of SARS-CoV-2 strains into three different genetic clades. Co-circulation of different clades in different countries, as well as different genetic lineages within different clades, was observed. The time of the most recent common ancestor (tMRCA) was established to be around November 1, 2019. A mean rate of evolution of 6.57 × 10⁻⁴ substitutions per site per year was found. A significant migration rate per genetic lineage per year from Europe to South America was also observed.

CONCLUSION: The results of these studies revealed an increasing diversification of SARS-CoV-2 strains. High evolutionary rates and fast population growth characterize the population dynamics of SARS-CoV-2 strains.

This article is protected by copyright. All rights reserved.

zoonotic transmission at a market in Wuhan where animals and meat were sold.
[5] The World Health Organization declared this outbreak a public health emergency of international concern on January 30th, 2020 [6], and the disease caused by this virus has recently been designated COVID-19 (Coronavirus Disease 2019) [7]. The Coronavirus Study Group of the International Committee on Taxonomy of Viruses (ICTV) formally recognized this virus as a relative of the severe acute respiratory syndrome coronaviruses (SARS-CoVs) and designated it severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [8]. As of April 12th, 2020, there have been more than 1.5 million confirmed cases, and global deaths from SARS-CoV-2 disease have surpassed 100,000 [6]. In order to gain insight into the emergence, spread, and evolution of SARS-CoV-2 populations, a Bayesian coalescent Markov Chain Monte Carlo analysis of complete genome sequences of SARS-CoV-2 strains recently isolated in different regions of the world (Europe, North America, South America and South East Asia) was performed.

Available complete genome sequences of 64 SARS-CoV-2 strains recently isolated from December 30th, 2019 to March 9th

The base composition of the 64 SARS-CoV-2 genomes was calculated using the MEGA-X program [9]. The relationship between compositional variables and samples was obtained using multivariate statistical analyses. Principal component analysis (PCA) is a type of multivariate analysis that allows dimensionality reduction. The Singular Value Decomposition (SVD) method was used to calculate the PCA, with unit variance as the scaling method. This means that all variables are scaled so that they are equally important (variance = 1) when finding the components. As a result, a difference of 1 means that the values are one standard deviation away from each other. The PCA was done using the ClustVis program.
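The scaling and decomposition steps described above can be sketched in a few lines. This is a minimal illustration that assumes NumPy is available; it is not the ClustVis implementation, and the toy sequences and the function names `base_composition` and `pca_unit_variance` are hypothetical, introduced only for this example.

```python
import numpy as np

def base_composition(seq):
    """Return the percentages of A, C, G and U in a sequence (T is counted as U)."""
    seq = seq.upper().replace("T", "U")
    counts = np.array([seq.count(base) for base in "ACGU"], dtype=float)
    return 100.0 * counts / counts.sum()

def pca_unit_variance(X, n_components=2):
    """PCA via SVD after centering each column and scaling it to unit variance."""
    centered = X - X.mean(axis=0)
    scaled = centered / centered.std(axis=0, ddof=1)  # every variable now has variance 1
    U, S, Vt = np.linalg.svd(scaled, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # sample coordinates on the PCs
    explained = (S ** 2) / np.sum(S ** 2)             # fraction of variance per PC
    return scores, explained[:n_components]

# Hypothetical toy sequences, NOT the 64 genomes analyzed in the study.
toy_genomes = [
    "ATGGCTAAAGTTCCATTA",
    "ATGGCCAAAGTGCCGTTG",
    "ATGACTAAAATTTCATTA",
    "ATGGGTCAAGTTCCACTA",
    "ATGGCTAAGGTACCATTT",
]
X = np.array([base_composition(g) for g in toy_genomes])
scores, explained = pca_unit_variance(X)
print("PC1/PC2 scores:")
print(scores)
print("Fraction of variance explained:", explained)
```

Because each column is divided by its own standard deviation, a base with a small absolute spread contributes as much to the components as one with a large spread, which is the behaviour the unit variance scaling is meant to achieve.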
[10] To investigate the patterns of evolution of SARS-CoV-2 strains recently isolated in Europe, North America, South America and South East Asia, a Bayesian Markov Chain Monte Carlo (MCMC) approach was used as implemented in the BEAST package v2.5.2 [11]. First, sequences were aligned using the MAFFT version 7 program [12]. Then, the evolutionary model that best fit the sequence dataset was determined using the MEGA-X program [13]. The Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), and the log-likelihood (LnL) indicated that the HKY model was the most suitable model. Recent studies have demonstrated that the choice of tree prior can upwardly bias the inferred clock rate in Bayesian phylogenetic analysis [14]. These studies also revealed that tree priors allowing for population structure lead to better estimates of the evolution of emerging virus populations [14]. For these reasons, we used a structured population model, the multi-type birth-death model, in these studies. Statistical uncertainty in the data was reflected by the 95% highest posterior density (HPD) values. Results were assessed using the TRACER program v1.6 [15]. One hundred million generations were used after a burn-in of 10 million steps, which was enough to acquire a suitable sample of the posterior, as assessed by effective sample sizes (ESS) over 200. The results were visualized using the DensiTree program [16]. DensiTree draws all the trees in the dataset simultaneously, but uses transparent rather than opaque lines. For this reason, in areas where many of the trees agree on the topology and branch length, many lines are drawn and the screen shows a densely colored area.
[17] In order to gain insight into the composition and genetic heterogeneity among the 64 complete genomes of SARS-CoV-2 strains isolated all over the world, the nucleotide frequencies were determined for all of them. Mean values of 32.10%, 18.37%, 29.86% and 19.65% were found for U, C, A and G, respectively. Then, PCA was performed on the nucleotide composition frequencies of all strains included in this analysis. The results of this study are shown in Figure 1. The positions of the strains in the plane defined by PC1 and PC2 revealed that SARS-CoV-2 strains cluster separately. These results suggest a different genome composition among the strains included in this analysis (see Fig. 1). In fact, PC1 tended to separate the red and blue clades (see Fig. 1). This result also revealed a degree of heterogeneity in the genomic composition of SARS-CoV-2 strains.

To address the degree of genetic variability and mode of evolution of the SARS-CoV-2 strains recently isolated in four different geographic regions of the world, a Bayesian MCMC approach was employed [11]. The results are shown in Table 1. When a mean incubation period of 5 days and a recovery period of 14 days were considered [7], 95% HPD credible intervals for R0 of 0.88 to 1.83, 0.89 to 1.45, 0.42 to 1.84 and 0.99 to 1.33 were obtained for Europe, North America, South America and South East Asia, respectively (Table 1). Comparison between the sampled population size marginal posterior distributions for the populations studied revealed no significant differences in R0 among the four regions (see Fig. 2). The upper 95% HPD values range from 1.33 to 1.83, revealing a mean R0 of 1.58. The recovery time for a patient with SARS-CoV-2 was estimated at a mean of 23.48 days for all of the regions studied (Table 1).
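The 95% HPD intervals reported above are the narrowest intervals that contain 95% of the sampled posterior values. The following sketch shows one common way to compute such an interval from MCMC samples. It assumes NumPy is available, it is not the TRACER implementation, and the function name `hpd_interval` and the simulated sample are hypothetical.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Narrowest interval that contains a fraction `mass` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))  # number of samples that must fall inside the interval
    if n < 2 or k < 2:
        raise ValueError("need at least two samples inside the interval")
    # Width of every window of k consecutive sorted samples; keep the narrowest.
    widths = x[k - 1:] - x[:n - k + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + k - 1]

# Hypothetical right-skewed "posterior" sample, NOT output from the study.
rng = np.random.default_rng(0)
posterior = rng.gamma(shape=2.0, scale=1.0, size=10_000)
low, high = hpd_interval(posterior)
print(f"95% HPD: {low:.2f} to {high:.2f}")
```

For a skewed posterior the HPD interval differs from the equal-tailed interval (2.5th to 97.5th percentile): it shifts toward the mode and is never wider.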
The phylogenetic relationships among SARS-CoV-2 strains recently isolated in the four geographic regions studied were explored and are summarized in Figure 3. When the complete genome sequences of SARS-CoV-2 were analyzed, three distinct genetic clades were found (see Fig. 3). This result revealed a significant degree of genetic diversification of SARS-CoV-2 strains. Moreover, co-circulation of strains from different genetic clades was observed in different countries (Fig. 3). To study the circulation of virus lineages among the different geographic regions studied, the migration rate per genetic lineage per year was calculated for all regions (Table 1). As can be seen, a significant rate of migration from Europe to South America was observed (Table 1).

To gain insight into the degree of genetic variation among the SARS-CoV-2 genetic clades observed, a detailed analysis of the substitutions found throughout the SARS-CoV-2 complete genome was performed. The results of these studies are shown in Table 2. The clade 1 strains share the same substitutions in the 5' non-coding region and in the 1a, 1b and S genes, while clade 2 strains share the same substitutions in the 1a and 8 genes, and clade 3 strains share the same substitutions in the 1a and 3a genes (see Fig. 3 and Table 2). While some substitutions are synonymous, others result in amino acid changes (Table 2). Several other substitutions were observed in strains circulating in a particular country, and co-circulation of different variants in the same country was observed. Some of these particular substitutions were present in European and South American strains, suggesting a close genetic relationship among them (see Table 2).

On January 30th, 2020, the World Health Organization declared the current SARS-CoV-2 outbreak a public health emergency of international concern.
[7] The rapid availability of research data on internet platforms such as GISAID made it possible to perform a detailed phylogenetic reconstruction of the origin, spread and evolution of SARS-CoV-2. The results of this work revealed that SARS-CoV-2 viruses evolved from ancestors circulating around November 1, 2019, several weeks before the first cases were diagnosed (Table 1). This is in agreement with recent results establishing that the pandemic originated between October and November of 2019 [17]. As many early cases of COVID-19 were linked to the Huanan market in Wuhan [18], it is possible that an animal source was present at this location. This is also in agreement with very recent estimations establishing the MRCA on November 9, 2019 [19] and is consistent with the earliest retrospectively confirmed cases [20]. Taken together, these studies revealed a period of unrecognized transmission in humans following the initial zoonotic event [21]. More studies will be needed in order to determine the extent of prior human exposure to SARS-CoV-2 [21].

The evolutionary rate of the SARS-CoV-2 strains included in these studies was estimated to be 6.57 × 10⁻⁴ substitutions/site/year (s/s/y) (Table 1). This is in agreement with recent estimations at the beginning of the pandemic of 7.8 × 10⁻⁴ s/s/y [17]. Previous estimations by the WHO at the initial stage of the pandemic revealed a reproduction number (R0) of 1.4 to 2.5 [22]. Li and colleagues have estimated slightly higher values ranging from 1.4 to 3.9 [23]. Very recent studies, assuming that SARS-CoV-2 would cause more mild-to-moderate cases than the ones produced by SARS values found for all regions studied cover the 95% HPD values of previous estimations. Moreover, no significant differences in R0 among the four regions studied were found (Fig. 2). The upper 95% HPD values revealed a mean of 1.58.
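To make the evolutionary rate discussed above more tangible, a per-site rate can be converted into an expected number of substitutions per genome. The sketch below is a worked arithmetic example only. The genome length of 29,903 nt is an assumption taken from the Wuhan-Hu-1 reference genome and is not stated in the text, and the function name `substitutions_per_genome` is hypothetical.

```python
RATE_PER_SITE_PER_YEAR = 6.57e-4  # mean rate reported in the text (Table 1)
GENOME_LENGTH_NT = 29_903         # ASSUMPTION: Wuhan-Hu-1 reference length, not from the text

def substitutions_per_genome(rate, genome_length, years=1.0):
    """Expected substitutions accumulated along one lineage over the given time."""
    return rate * genome_length * years

per_year = substitutions_per_genome(RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT)
per_month = substitutions_per_genome(RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT, years=1 / 12)
print(f"Expected substitutions per genome per year:  {per_year:.2f}")
print(f"Expected substitutions per genome per month: {per_month:.2f}")
```

Under these assumptions the reported rate corresponds to roughly 20 substitutions per genome per year, or between one and two per month along a single lineage.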
Recent studies revealed that the majority of scenarios with an R0 of 1.5 were controllable with less than 50% of contacts successfully traced [24].

Recent studies have provided evidence of the genetic diversity and rapid evolution of SARS-CoV-2 strains [25], and others have identified clades sharing particular amino acid substitutions, such as clade S (Orf 8, L84S), clade G (Orf S, D614G) and clade V (Orf3a, G251V) [26]. On the other hand, many other strains were not assigned to specific clades [26]. In these studies, three clades were observed, and co-circulation of different clades was observed in different countries (see Fig. 3 and Table 2). In particular, several substitutions were shared by strains isolated in Europe and South America, revealing a close genetic relationship among them, and this is also consistent with the rate of migration of genetic lineages from Europe to South America (see Table 1). In addition, several substitutions, although synonymous, can be useful for monitoring the spread of SARS-CoV-2 genetic lineages in different regions of the world (see Table 2). Although the three clades observed in these studies are in agreement with recent studies that assigned several strains to clades S, G and V [26], several other substitutions have been observed (Table 2). We hope the substitutions observed in SARS-CoV-2 strains will serve as a useful reference for the development of treatments against SARS-CoV-2 disease and for public health agencies.
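The distinction drawn above between synonymous substitutions and those that change an amino acid can be illustrated with a short sketch based on the standard genetic code. The codon pairs below are illustrative choices made for this example and are not taken from Table 2, and the function name `classify_substitution` is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_substitution(ref_codon, alt_codon):
    """Return (kind, ref_aa, alt_aa) for a codon change; U is accepted in place of T."""
    ref = ref_codon.upper().replace("U", "T")
    alt = alt_codon.upper().replace("U", "T")
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa

# Illustrative codon pairs chosen for this example, NOT taken from Table 2.
for ref, alt in [("TTA", "TCA"), ("TTA", "TTG"), ("GAT", "GGT")]:
    kind, ref_aa, alt_aa = classify_substitution(ref, alt)
    print(f"{ref}>{alt}: {ref_aa}->{alt_aa} ({kind})")
```

A change in the second codon position, such as TTA to TCA, replaces leucine with serine, whereas a third-position change such as TTA to TTG leaves the amino acid unchanged and is therefore only useful as a lineage marker.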
The results of these studies revealed the diversification of SARS-CoV-2 population

1. Emerging coronaviruses: genome structure, replication, and pathogenesis
2. Epidemiology, genetic recombination
3. Severe acute respiratory syndrome
4. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
5. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
6. Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV)
7. World Health Organization
8. Severe acute respiratory syndrome-related coronavirus: The species and its viruses - a statement of the Coronavirus Study Group
9. MEGA-X: Molecular evolutionary genetics analysis across computing platforms
10. ClustVis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap
11. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis
12. MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization
13. MEGA-X: Molecular evolutionary genetics analysis across computing platforms
14. Impact of the tree prior on estimating clock rates during epidemic outbreaks
15. Posterior summarization in Bayesian phylogenetics using Tracer 1.7
16. DensiTree: making sense of phylogenetic trees
17. Early phylogenetic estimate of the effective reproduction number of SARS-CoV-2
18. Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan
19. Potential of large "first generation" human-to-human transmission of 2019-nCoV
20. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
21. The proximal origin of SARS-CoV-2
22. An updated estimation of the risk of transmission of the novel coronavirus (2019-nCov)
23. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia
24. Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts
25. Genetic diversity and evolution of SARS-CoV-2

a No. refers to the number of strains carrying that substitution in the alignment.
b Clade assignment is indicated when substitution is present in more than four or more strains in the alignment.
c S, G and V clade names as assigned by GISAID, according to amino acid substitutions found in Orf 8, S and 3a, respectively.
d A synonymous substitution is shown by a dotted line (---).