key: cord-1040340-se35sjyv authors: Bukin, Yu.S.; Bondaryuk, A.N.; Kulakova, N.V.; Balakhonov, S.V.; Dzhioev, Y.P.; Zlobin, V.I. title: Phylogenetic reconstruction of the initial stages of the spread of the SARS-CoV-2 virus in the Eurasian and American continents by analysing genomic data date: 2021-08-26 journal: Virus Res DOI: 10.1016/j.virusres.2021.198551 sha: 972b8c7c121a5f2d769713e60a439c06c90a5fad doc_id: 1040340 cord_uid: se35sjyv Samples from complete genomes of SARS-CoV-2 isolated during the first wave (December 2019 - July 2020) of the global COVID-19 pandemic from 21 countries (Asia, Europe, Middle East and America) around the world, were analysed using the phylogenetic method with molecular clock dating. Results showed that the first cases of COVID-19 in the human population appeared in the period between July and November 2019 in China. The spread of the virus into other countries of the world began in the autumn of 2019. In mid-February 2020, the virus appeared in all the countries we analysed. During this time, the global population of SARS-CoV-2 was characterized by low levels of the genetic polymorphism, making it difficult to accurately assess the pathways of infection. The rate of evolution of the coding region of the SARS-CoV-2 genome equal to 7.3 × 10(−4) (5.95 × 10(−4) - 8.68 × 10(−4)) nucleotide substitutions per site per year is comparable to those of other human RNA viruses (Measles morbillivirus, Rubella virus, Enterovirus C). SARS-CoV-2 was separated from its known close relative, the bat coronavirus RaTG13 of the genus Betacoronavirus, approximately 15-43 years ago (the end of the 20th century). March 2020 it was declared as a worldwide pandemic. The problems of epidemiological studies of COVID-19 lie in the mass spread of asymptomatic and mild forms of the disease (Zhao et al 2020; Gao et al 2020) . Therefore, the early stages of the infection spread in the world, or in any country, could be hidden from epidemiologists. Epidemiological data from the beginning of 2021 showed that globally, since March 2020, there have been two waves (two peaks in incidence) of COVID-19: the spring peak of 2020 and the autumn-winter peak of 2020-2021 (Cacciapaglia et al, 2020) . Little epidemiological data was available from December 2019 to February 2020. Detailed analysis of the initial period of COVID-19 spread in the world would allow us to develop a strategy for preventing global pandemics of new viruses in the future. Genomic phylogeny methods using Bayesian (Larget and Simon, 1999) and other approaches allow researchers to reconstruct the evolutionary history of virus pandemics using the concept of a molecular clock, dating the tree based on the date of strain isolation. It is generally assumed that the molecular clock method can give very approximate results, ranging in millions of years, but viruses mutate at a high rate. Mutations in the genomes of RNA viruses occur millions of times more often than in vertebrate genomes (Pybus et al., 2009 ). Therefore, a high accuracy molecular dating of evolutionary events up to several months and even days could be achieved. In fact, the use of Bayesian phylogenetic methods make it possible to identify patterns of virus evolution and distribution that cannot be obtained from epidemiological data Our study aimed to reconstruct the SARS-CoV-2 phylogenetic tree at the initial period of the COVID-19 pandemic by the Bayesian phylogenetic method with a molecular clock to identify the date of the first cases in the human population, to determine the origin (country) of the infection, and the time of the appearance and spread of infection in the world during the first wave of COVID-19; to detect the data of separation SARS-CoV-2 with close relatives of the genus Betacoronavirus, and estimate the mutation rate and genetic polymorphism of virus genomes. 80189 genome records on SARS-CoV-2 from the GISAID database accessed on 8 December 2020 (the first wave of the COVID-19 pandemic) were used in the analysis. From the total data set, sequences were selected according to the following: length greater than 29,000 nucleotides, approximately corresponding to the total length of the virus genome; less than 0.25% of unidentified nucleotides; strains isolated from diseased human patients; known information on strain isolation; and no stop codons within the open reading frame of viral proteins. From the selected genomes the protein coding regions were used for further analysis. The non-coding part of the genomes was not analyzed because of short length (about 592 nucleotides) compared to the length of the coding part of the genome and a large proportion of unknown nucleotides in the data. For each sample, a separate analysis was performed to determine the sample size effect on the results of phylogenetic reconstructions. For each of the three differently sized samples, the uncalibrated phylogenetic tree was reconstructed by the maximum likelihood (ML) method in the IQTREE program (Nguyen 2015) . The 1+2 and 3 codon positions were treated differently in the analysis with the HKY+I+G nucleotide evolutionary model recommended for this phylogenetic reconstruction (recommended for RNA viruses) (Shapiro et al 2006) . A IQTREE preliminary analysis using the ModelFinder algorithm (Kalyaanamoorthy et al., 2017) showed that according to the value of the Bayesian information criterion (BIC), the reconstruction of the tree with differently treated codon positions has a significant advantage over the reconstruction without it. Statistical estimates of the reliability of the tree topology were performed in the IQTREE program using ultrafast bootstrap (1000 replicas) (Minh et al, 2013) and SH-aLRT (Guindon et al, 2010) analysis. The maximum likelihood tree was rooted using data on the strain isolation time by the "residual-mean-squared" method in the TempEst v1.5.3 program (Rambaut et al, 2016) . Phylogenetic analysis of the smallest sample of 252 genomes with molecular clock and calibrating the tree to the time of SARS-CoV-2 strain isolation was performed by the Bayesian phylogenetic method in the BEAST v. 2.6.2 software package (Bouckaert et al, 2014) . The 1+2 and 3rd codon positions and the HKY+I+G DNA evolution model were tested with the following models: 1) constant population size, strict clock and dated tree; 2) constant population size, strict clock and not dated tree; 3) constant population size, relaxed clock and tip-dating tree -dating of the tree by the time of isolating strains; 4) constant population size, relaxed clock and not dated tree; 5) exponential growth population size, strict clock, tip-dating tree; 6) exponential growth population size, strict clock, not dated tree; 7) Exponential growth population size, relaxed clock, tip-dating tree; 8) exponential growth population size, relaxed clock and not dated tree. The uncorrelated lognormal relaxed clock model was used in the tree reconstruction with relaxed clock. Testing of the best tree reconstruction model by comparison of the marginal likelihood estimators was provided in BEAST-2 with the "Path sampling" analysis using the "Modelselection" package. This analysis was aimed to answer the following questions: 1) does reconstruction of a dated tree based on the virus isolation time have an advantage over For the datasets consisting of 513 and 777 genomes, the molecular clock analysis was performed in BEAST v. 2.6.2 using the best-fit reconstruction model of evolution chosen for 252 genomes dataset by the "Path sampling" analysis. This was due to the fact that the "Path sampling" analysis of 252 genomes required a lot of computational resources and time. Therefore, analyzing the datasets consisting of 513 and 777 genomes would be even more laborintensive. We suppose that if the "Path sampling" analysis finds the applicability of dating of the phylogenetic tree by the time of virus strain isolation for a smaller sample size (252 genomes) then such dating is suitable for a dataset with a large sample size. For all three datasets, samples from 1,000 ultrafast bootstrap IQ-TREE ML trees were combined with topology of Bayesian consensus tree (BEAST). This allowed us to estimate the occurrence of nodes of the consensus of the Bayesian tree among the ultrafast bootstrap ML trees. Thus, the ultrafast bootstrap support was calculated for the Bayesian consensus tree. Comparison of trees and calculation of bootstrap support were carried out with the APE package All supplementary materials such as the genomes with GISAID and GenBank numbers used in the study, the xml files for the BEAST v. 2.6.2 program, and original reconstructed phylogenetic trees are available at https://doi.org/10.6084/m9.figshare.14830899. The rooted maximum likelihood tree for 252 SARS-CoV-2 genomes dataset is shown in Figure 1 . The tree is divided into two large clades A_ML and B_ML in its basal part. For both clades ( Fig. 1) , the ultrafast bootstrap support values were < 95%, and the SH-aLRT support values were < 60%. These values (ultrafast bootstrap < 95%, SH-aLRT support < 80%) ( Genomic data can be used to track a path of the virus from one country to another based on the topology of a phylogenetic tree and time estimates inferred from the date of virus isolation. To track the virus pathway, high support of the tree topology is required. For the dataset of 252 SARS-CoV-2 genomes, Figure 4 shows the correlation between the bootstrap/probability value on the Bayesian phylogenetic tree and the time estimator at the node ( Fig. 4 a -ultrafast bootstrap support, Fig. 4 b -Bayesian posterior probability) . Analysis of this correlation shows that the percentage of ultrafast bootstrap support and Bayesian posterior probability values ≥ 95% increases from the root of the tree to the end of branches. Only slightly more than 10% of the nodes in the tree exceed the 95% confidence threshold (Fig. 4 a, b) . Table 3 For all data sets, average values of mutation rates and confidence interval measurements differ less than 11%. The maximal variability of values (the largest confidence interval) was observed in the analysis of the dataset consisting of 777 SARS-CoV-2 genomes (Table 3 ). An enlargement of the sample size more than threefold (from 252 to 777 genomes) does Reconstructions using two evolutionary models (constant population size and exponential growth population size) show intersection of confidence intervals of dating of nodes on the tree. Notably, in the analysis that included closely related strains (Fig. 5) , the left border of the confidence interval for dating the divergence of all SARS-CoV-2 strains from their common ancestor (the date of existence of the closest common ancestor of the SARS-CoV-2 strains circulating in the first wave of the pandemic) moved back by two-three months (to the first half of 2019). As in the analysis without closely related strains, an increase in the sample size of SARS-CoV-2 genomes by more than three times did not lead to an increase in the accuracy of the analysis (narrowing of the confidence intervals for dating of nodes). Our results are consistent with the initial conclusions of some experts ( Our estimates for the first wave of the COVID-19 pandemic showed that there was a low level of genetic polymorphism in complete SARS-CoV-2 genomes. If the number of parsimonyinformative sites is less than the number of sequences applied in a phylogenetic analysis, then we will always face low topology support and ambiguous clustering results. In our dataset, we face the problem of unreliable estimates of the topology of the reconstructed phylogenetic tree. This Our analysis of SARS-CoV-2 genomic data allows us to draw the following conclusions: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The work was supported by the Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty Serologic testing of US blood donations to identify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-reactive antibodies The global spread of 2019-nCoV: a molecular evolutionary analysis. Pathogens and global health Unequal evolutionary rates in the human immunodeficiency virus type 1 (HIV-1) pandemic: the evolutionary rate of HIV-1 slows down when the epidemic rate increases BEAST 2: a software platform for Bayesian evolutionary analysis & Nelson M., 2020. Transmission of hepatitis C virus in HIV-positive and PrEP-using MSM in Journal of viral hepatitis Second wave COVID-19 pandemics in Europe: a temporal playbook Evidence of early circulation of SARS-CoV-2 in France: findings from the population-based The 2019 novel coronavirus disease (COVID-19) pandemic: A review of the current evidence Bayesian Evaluation of Temporal Signal in Measurably Evolving Populations Phylogenomics and phylodynamics of SARS-CoV-2 genomes retrieved from India Origin of measles virus: divergence from rinderpest virus between the 11 th and 12 th centuries A systematic review of asymptomatic infections with COVID-19 Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal and explainable AI models The first two cases of 2019-nCoV in Italy: Where they come from? New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic biology Analysis of multimerization of the SARS coronavirus nucleocapsid protein. Biochemical and biophysical research communications A novel SARS-CoV-2 related coronavirus in bats from Cambodia ModelFinder: fast model selection for accurate phylogenetic estimates Variant analysis of SARS-CoV-2 genomes Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees Rapid genomic characterization of SARS-CoV-2 viruses from clinical specimens using nanopore sequencing Population Genetics of SARS-CoV-2: Disentangling Effects of Sampling Bias and Infection Clusters Ultrafast approximation for phylogenetic bootstrap A new method for inferring timetrees from temporally sampled molecular sequences Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies Re-visiting the evolution, dispersal and epidemiology of Zika virus in Evolutionary analysis of the dynamics of viral infectious disease A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus evolution Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences GISAID: Global initiative on sharing all influenza data-from vision to reality HIV-2 genetic evolution in patients with advanced disease is faster than that in matched HIV-1 patients Recombination in the evolution of enterovirus C species sub-group that contains types CVA-21, CVA-24, EV-C95, EV-C96 and EV-C99 International public health responses to COVID-19 outbreak: a rapid review On the origin and continuing evolution of SARS-CoV-2 Genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans The COVID-19 epidemic. Tropical medicine & international health Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants SARS-CoV-2 501Y.V2 escapes neutralization by South African COVID-19 donor plasma Patient 0'HIV-1 genomes illuminate early HIV/AIDS history in North America A new coronavirus associated with human respiratory disease in China Novel henipalike virus, Mojiang paramyxovirus, in rats The evolutionary rates of HCV estimated with subtype 1a and 1b sequences over the ORF length and in different genomic regions Spike mutation T403R allows bat coronavirus RaTG13 to use human ACE2 COVID-19: asymptomatic carrier transmission is an underestimated problem Emergence and continuous evolution of genotype 1E rubella viruses in China