key: cord-0996504-zknmfgsh authors: Li, Xingguang; Zai, Junjie; Zhao, Qiang; Nie, Qing; Li, Yi; Foley, Brian T.; Chaillon, Antoine title: Evolutionary history, potential intermediate animal host, and cross‐species analyses of SARS‐CoV‐2 date: 2020-03-11 journal: J Med Virol DOI: 10.1002/jmv.25731 sha: dd87b73244c421e71b387676c12b3ac5b0108591 doc_id: 996504 cord_uid: zknmfgsh To investigate the evolutionary history of the recent outbreak of severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) in China, a total of 70 genomes of virus strains from China and elsewhere with sampling dates between 24 December 2019 and 3 February 2020 were analyzed. To explore the potential intermediate animal host of the SARS‐CoV‐2 virus, we reanalyzed virome data sets from pangolins and representative SARS‐related coronaviruses isolates from bats, with particular attention paid to the spike glycoprotein gene. We performed phylogenetic, split network, transmission network, likelihood‐mapping, and comparative analyses of the genomes. Based on Bayesian time‐scaled phylogenetic analysis using the tip‐dating method, we estimated the time to the most recent common ancestor and evolutionary rate of SARS‐CoV‐2, which ranged from 22 to 24 November 2019 and 1.19 to 1.31 × 10(−3) substitutions per site per year, respectively. Our results also revealed that the BetaCoV/bat/Yunnan/RaTG13/2013 virus was more similar to the SARS‐CoV‐2 virus than the coronavirus obtained from the two pangolin samples (SRR10168377 and SRR10168378). We also identified a unique peptide (PRRA) insertion in the human SARS‐CoV‐2 virus, which may be involved in the proteolytic cleavage of the spike protein by cellular proteases, and thus could impact host range and transmissibility. Interestingly, the coronavirus carried by pangolins did not have the RRAR motif. Therefore, we concluded that the human SARS‐CoV‐2 virus, which is responsible for the recent outbreak of COVID‐19, did not come directly from pangolins. December 2019 in the Chinese city of Wuhan, causes a respiratory illness called COVID-19, which can spread from person to person. 1 cases and 858 deaths in 27 countries during the 2012 MERS outbreak. 7, 8 Coronaviruses are known to circulate in mammals and birds. Previous studies revealed that both SARS-CoV and MERS-CoV are zoonotic in origin, originally coming from bats, [9] [10] [11] [12] with SARS-CoV spreading from bats to palm civets to humans, [13] [14] [15] and MERS-CoV spreading from bats to camels to humans. 16, 17 Recent research has also reported that the SARS-CoV-2 virus likely originated in bats, a proposal based on the similarity of its genetic sequence to that of other known coronaviruses. 18 However, like SARS-CoV, MERS-CoV, and many other coronaviruses, the SARS-CoV-2 virus may have been transmitted to humans by an intermediate animal host. 19 Therefore, the identity of the animal source of SARS-CoV-2 remains a key and urgent question. Furthermore, to stem future outbreaks of this type and preventing the transmission of zoonotic diseases to humans should be a top research priority. The existence of an intermediate animal host of SARS-CoV-2 between a probable bat reservoir and humans is still under investigation. The discovery of a virus closely related to the newly emerged SARS-CoV-2 in a data set from pangolins sampled more than a year ago illustrates that the sampling of other mammals handled or consumed by humans could uncover even more closely related viruses. 20 During a press conference on 7 February 2020, two researchers (Shen Yongyi and Xiao Lihua) from the South China Agricultural University in Guangzhou identified the pangolin as a potential source of the SARS-CoV-2 virus based on genetic comparison of coronaviruses taken from pangolins and from humans infected during the recent outbreak. By analyzing more than 1 000 metagenomic samples and using molecular biology testing, they found that the positive rate of β coronavirus in pangolins was 70% and that the genome sequence of an isolated virus strain was 99% similar to that of the SARS-CoV-2 virus. Thus, whether pangolins acted as a direct intermediate animal host of the SARS-CoV-2 virus is worth further investigation. In the present study, we performed analyses of the transmission dynamics and evolutionary history of the virus based on 70 genomes of SARS-CoV-2 strains sampled from Australia (n = 4), Belgium (n = 1), China (Hubei Province, n = 19; Guangdong Province, n = 16; Zhejiang Province, n = 4; Taiwan, n = 1), Finland (n = 1), France (n = 4), Germany (n = 1), Japan (n = 1), Korea (n = 1), Singapore (n = 3), Thailand (n = 2), United Kingdom (n = 2), and United States (n = 10) with sampling dates between 24 December 2019 and 3 February 2020. We reanalyzed two of the 21 pangolin metagenome samples from previously published data 20 (Table S1 ). To investigate the potential intermediate hosts of SARS-CoV-2 (between originating animal and human hosts), two samples (SRR10168377 and SRR10168378) obtained from previously reported Malayan pangolin (Manis javanica) viral metagenomic sequencing data (Bio Project PRJNA573298) were downloaded from the NCBI SRA public database. 20 After assembly, the SRR10168377 and SRR10168378 genomes were 16 999 and 6 392 bp in length, respectively. We defined another data set ("dataset_6") composed of six genome sequences of coronavirus strains. BetaCoV/Wuhan-Hu-1/2019 (EPI_ISL_402125) was grouped as "Clade A," one (BetaCoV/bat/Yunnan/RaTG13/2013; EPI_ISL_402131) and two (bat-SL-CoVZC45; MG772933 and bat-SL-CoVZXC21; MG772934) SARS-related coronaviruses were grouped as "Clade B" and "Clade D," respectively. The two assembled genomes from SRR10168377 and SRR10168377 were grouped into "Clade C." The two data sets ("dataset_70" and "dataset_6") were aligned using MAFFT v7.222 22 and then manually curated using BioEdit v7.2.5. 23 To assess the recombination of "dataset_70," we employed the pairwise homoplasy index (PHI) to measure the similarity between closely linked sites using SplitsTree v4.15.1. 24 The best-fit nucleotide substitution models for the two data sets were identified according to the Bayesian information criterion (BIC) method with 3 (24 candidate models) or 11 (88 candidate models) substitution schemes in jModelTest v2.1.10. 25 To evaluate the phylogenetic signals of "dataset_70" and "dataset_6," we performed likelihood-mapping analysis 26 using TREE-PUZZLE v5.3, 27 with 25 000 to 175 000 randomly chosen quartets for the two data sets. For "dataset_70," split network analysis was performed using Kishino-Yano-85 distance transformation with the NeighborNet method, which can be loosely thought of as a "hybrid" between the neighbor-joining (NJ) and split decomposition methods, implemented in TREE-PUZZLE v5.3. For "dataset_70," NJ 28 phylogenetic trees were constructed using the Kimura 2-parameter method 29 implemented in MEGA v7.0.26. 30 For "dataset_6," NJ 28 phylogenetic trees were constructed using the Maximum Composite likelihood (MCL) method, 31 TempEst v1.5. 35 We also estimated the evolutionary rate and time to the most recent common ancestor (TMRCA) for "dataset_70" using ML dating in the TreeTime package. 36 The HIV TRAnsmission Cluster Engine (HIV-TRACE; www.hivtrace.org) 44 was employed to infer transmission network clusters for SARS-CoV-2 "dataset_70." All pairwise distances were calculated and the putative linkages between each pair of genomes were considered whenever their divergence was ≤0.0001 (0.01%) or ≤0.00001 (0.001%) substitutions/site (TN93 substitution model). Multiple linkages were then combined into putative transmission clusters. Clusters that comprised of only two linked nodes were identified as dyads. This approach detects transmission clusters in which the clustering strains are genetically similar, implying a direct or indirect epidemiological connection. To investigate the putative parents of SARS-CoV-2, we performed similarity plot analysis based on the Kimura two-parameter method 29 with a window size of 200 bp and step size of 20 bp using SimPlot v.3.5.14. 45 We divided "dataset_6" into four clades (ie, Clade A, Clade B, Clade C, and Clade D), with Clade A designated as the query group. For "dataset_70" and "dataset_6," HKY and GTR + G were the models of best-fit, respectively, across the two different substitution schemes (ie, 24 and 88 candidate models) according to the BIC method, and were thus used in subsequent likelihood-mapping and phylogenetic analyses for the two data sets. The PHI tests of "dataset_70" did not find statistically significant evidence of recombination (P = 1.0). Likelihood-mapping analysis of "dataset_70" revealed that 69.7% of LI ET AL. | 3 the quartets were distributed in the center of the triangle, indicating a strong star-like topology signal reflecting a novel virus, which may be due to exponential epidemic spread ( Figure 1A) . Likewise, 25.9% of the quartets from "dataset_6" were distributed in the center of the triangle, indicating a strong phylogenetic signal ( Figure 1B) . The split network generated for "dataset_70" using the NeighborNet method was highly unresolved, and the phylogenetic relationship of "data-set_70" was probably best represented by a network rather than a tree ( Figure 1C ). The existence of polytomies indicated-in contrast to that expected in a strictly bifurcating tree-an explosive, star-like evolution of SARS-CoV-2. Both the NJ and ML phylogenetic analyses of SARS-CoV-2 "dataset_70" also showed star-like topologies, in accordance with the likelihood-mapping results (Figures 2 and S1 ). The ML phylogenetic tree showed greater star-like topology than the NJ phylogenetic tree, indicating that the ML method was more reasonable for "dataset_70." Root-to-tip regression analyses between genetic divergence and sampling date using the best-fitting root showed that (Table 1) . Furthermore, based on Bayesian time-scaled phylogenetic analysis using the tip-dating method, the estimated TMRCA dates and evolutionary rates from "dataset_70" ranged from (Table 1) . Thus, the estimated TMRCA dates and evolutionary rates for SARS-CoV-2 from "dataset_70" were consistent among the different clock models (strict and relaxed) but were distinct among the different dating methods (constrained-dating and tip-dating). The estimated TMRCA dates and evolutionary rates for SARS-CoV-2 from "data-set_70" using the tip-dating method exhibited much narrower 95% BCIs than the constrained-dating method. In addition, the estimated TMRCA dates and evolutionary rates for SARS-CoV-2 from "data-set_70" were consistent between the different coalescent tree models (ie, constant and exponential) when using the tip-dating method but were distinct when using the constrained-dating method. For each F I G U R E 1 Likelihood-mapping and split network analyses of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Likelihoods of three tree topologies for each possible quartet (or for a random sample of quartets) are denoted by data points in an equilateral triangle. Distribution of points in seven areas of the triangle reflects tree-likeness of data. Specifically, three corners represent fully resolved tree topologies; center represents an unresolved (star) phylogeny; and sides represent support for conflicting tree topologies. Results of likelihoodmapping analyses of two data sets ("dataset_70," A; and "dataset_6," B) and split network analyses of "dataset_70" (C) are shown We considered individuals as genetically linked when the genetic distance between SARS-CoV-2 strains was <0.01% substitutions/site. Based on this, we identified one large transmission cluster that included 66 of 70 (94.29%) genomes, thus suggesting low genetic divergence for "dataset_70" (Figure S3 ). We also considered individuals as genetically linked when the genetic distance between SARS-CoV-2 strains was <0.001% substitutions/site. Based on this, we identified 6 transmission clusters that included 37 of 70 (52.86%) genomes for "dataset_70" (Figure 5 ). Clusters ranged in size from 2 to 23 genomes. The NJ and ML phylogenetic topologies of "dataset_6" were consistent with each other (Figure S4 ), indicating that the use of a small number of sequences could show similar topological results. Homology plot analysis of "dataset_6" also revealed that BetaCoV/bat/Yunnan/ RaTG13/2013 was more similar to the SARS-CoV-2 virus than the coronavirus obtained from the two pangolin samples (SRR10168377 and SRR10168378), consistent with phylogenetic analysis ( Figure S5 ). Of note, "Clade D" (bat-SL-CoVZC45 and bat-SL-CoVZXC21) had higher similarity to the SARS-CoV-2 virus in the first 12 000 bp region of the full alignment than to the pangolin coronavirus ( Figure S5 ). We also found that a unique peptide (PRRA) insertion region in the spike protein at the junction of S1 and S2 junction in the human SARS-CoV-2 virus ("Clade A") induced a furin cleavage motif (RRAR), which could be a predicted polybasic cleavage site, and thus a unique feature of SARS-CoV-2, in comparison with the other three clades ("Clade B," "Clade C," and "Clade D") ( Figure S6 ). On the basis of "dataset_70," our likelihood-mapping analysis confirmed additional tree-like signals over time compared with our previous results. 46, 47 This result implies increasing genetic divergence of SARS-CoV-2 in human hosts ( Figure 1A) , consistent with the findings of our earlier studies. 46, 47 Split network analysis for SARS-CoV-2 "dataset_70" human-to-human transmission ( Figure 1C ). These results are consistent with the ML phylogenetic analyses, which showed polytomy topology from "dataset_70" (Figure 2 ). However, NJ phylogenetic analyses showed a more bifurcating tree topology compared with the ML phylogenetic analyses ( Figure S1 ). This is a good example showing the differences between NJ and ML phylogenetic construction methods. "Dataset_70" had a minor strong positive temporal signal based on root-to-tip regression and ML dating analyses (Figures 3 and S2) , with the estimated TMRCA dates and evolutionary rates for SARS-CoV-2 found to be nearly identical using both analyses ( Table 1 ). The estimated TMRCA dates and evolutionary rates for SARS-CoV-2 were very similar across different clock models and coalescent tree priors using the tip-dating method. The estimated TMRCA dates and evolutionary rates for SARS-CoV-2 were also very similar across different clock models using the constraineddating method, but highly distinct across the different coalescent tree priors ( Table 1 ). The TMRCA estimated by the tip-dating method was relatively narrower than that determined by the constrained-dating method, consistent with our previous studies. 46, 47 Bayesian analyses with the tip-dating method using a strict clock as well as constant size coalescent tree prior indicated that SARS-CoV-2 is evolving at a rate of 1.24 × 10 −3 substitutions per site per year (Table 1 ), in accordance with our prior research 46, 47 and similar to that found for other human F I G U R E 4 Estimated maximum-clade-credibility tree of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) using tip-dating method. Colors indicate different sampling locations. Nodes are labeled with posterior probability values. Estimated maximum-clade credibility (MCC) tree of "dataset_70" are shown F I G U R E 5 Transmission clusters of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Structure of inferred SARS-CoV-2 transmission clusters from "dataset_70" using genetic distances of <0.001% substitutions/site is shown. Nodes (circles) represent connected individuals in overall network, and putative transmission linkages are represented by edges (lines). Nodes are color-coded by sampling locations coronaviruses. 41 Our results also suggest that the virus originated on 24 November 2019, which is in further agreement with our earlier studies. 46, 47 We identified eight phylogenetic clusters (number of sequences 2 to 7) with posterior probabilities between 0.99 and 1.0 using Bayesian inference ( Figure 4) . We also identified six transmission clusters (number of sequences 2 to 23) when the genetic distance between the SARS-CoV-2 strains was <0.001% substitutions/site ( Figure 5) . However, our conclusions should be considered preliminary and explained with caution due to the limited number of SARS-CoV-2 genome sequences presented in this study. As more genome sequences become available, there may be stronger among-lineage rate variation over time as to warrant using a relaxed clock model, but we anticipate that the evolutionary rates and TMRCA dates will be broadly similar to those estimated here. As the number of substitutions is still small, it is tempting to speculate that sequencing errors could have a considerable impact on the evolutionary rate and TMRCA date estimates. We removed three SARS-CoV-2 genome sequences (ie, BetaCoV/Wuhan/IPBCAMS-WH-02/2019, EPI_ISL_403931; BetaCoV/Shenzhen/SZTH-001/2020, EPI_ISL_406592; BetaCoV/Shenzhen/SZTH-004/2020, EPI_ISL_406595) with potential sequencing errors, but these may have less impact on the above estimates when more substitutions of SARS-CoV-2 are accumulated over time. We also expect that as more SARS-CoV-2 genome sequences become available, the estimated 95% BCIs of the evolutionary rates and TMRCA dates will be narrower. We found that the Pangolin-CoV virus from the two pangolin samples was clustered with the SARS-CoV-2 virus with 100% bootstrap support; however, BetaCoV/bat/Yunnan/RaTG13/2013 was more similar to the SARS-CoV-2 virus than to the pangolin coronavirus and the human SARS-CoV-2 virus ("Clade A") showed a unique peptide (PRRA) insertion not found in the other three clades ("Clade B," "Clade C," "Clade D"). This insertion constitutes an RRAR motif in the spike protein at the junction of S1 and S2 junction in the human SARS-CoV-2 virus, after considering the next amino acid (R) of the unique peptide (PRRA) ( Figure S6 ). Of note, the highly favored motifs for furin cleavage are Arg-X-(Arg/Lys)-Arg (RXRR or RXKR), and the minimal motifs for furin cleavage can be RXXR. 48 We also note that some of the other coronaviruses have a furin motif in almost the same location in their spike proteins. 49, 50 Lentiviruses have an RKXR (R, arginine; K, lysine; X, any amino acid) site between gp120 and gp41, cleaved by furin to convert gp160 into subunits. [51] [52] [53] [54] Therefore, it is tempting to speculate that cleavage or lack of cleavage of the spike protein at this site could significantly impact host range and transmissibility. Taken together, the pangolin coronavirus samples (SRR10168377 and SRR10168378) were less similar to the SARS-CoV-2 virus than to the BetaCoV/bat/Yunnan/ RaTG13/2013 virus and did not have the RRAR motif. Therefore, we concluded that the human SARS-CoV-2 virus, which is responsible for the current outbreak of COVID-19, did not come directly from pangolins. However, due to the limited viral metagenomic data obtained from pangolins, we cannot exclude that other pangolins from China may contain coronaviruses that exhibit greater similarity to the SARS-CoV-2 virus. In conclusion, our results emphasize the importance of further epidemiological investigations, genomic data surveillance, and experimental studies of the role of the unique furin cleavage motif (RRAR) of SARS-CoV-2 in the spike protein at the junctions of S1 and S2. Such work could positively impact public health in terms of guiding prevention efforts to reduce SARS-CoV-2 transmission in real time, and to stem future outbreaks of zoonotic diseases. This study was supported by a grant from the National Natural Science Foundation of China (No. 31470268) to Prof. Yi Li. This study was sponsored by the K.C. Wong Magna Fund in Ningbo University. We gratefully acknowledge the authors and Originating and Submitting Laboratories for their sequences and meta-data shared through GISAID, 21 on which this study is based. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-toperson transmission: a study of a family cluster Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia Epidemiology, genetic recombination, and pathogenesis of coronaviruses Identification of a novel coronavirus in patients with severe acute respiratory syndrome A novel coronavirus associated with severe acute respiratory syndrome Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People's Republic of China Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group Ecoepidemiology and complete genome comparison of different strains of severe acute respiratory syndrome-related Rhinolophus bat coronavirus in China reveal bats as a reservoir for acute, self-limiting infection that allows recombination events Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats Bats are natural reservoirs of SARS-like coronaviruses Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China SARS-CoV infection in a restaurant from palm civet MERS coronavirus neutralizing antibodies in camels MERS coronaviruses in dromedary camels Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Viral metagenomics revealed sendai virus and coronavirus infection of Malayan pangolins (Manis javanica) disease and diplomacy: GISAID's innovative contribution to global health MAFFT multiple sequence alignment software version 7: improvements in performance and usability BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT Application of phylogenetic networks in evolutionary studies jModelTest 2: more models, new heuristics and parallel computing Maximum-likelihood analysis using TREE-PUZZLE TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing The neighbor-joining method: a new method for reconstructing phylogenetic trees A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets Prospects for inferring very large phylogenies by using the neighbor-joining method New algorithms and methods to estimate maximumlikelihood phylogenies: assessing the performance of PhyML 3.0 A new method for calculating evolutionary substitution rates Confidence limits on phylogenies: an approach using the bootstrap Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen) TreeTime: maximum-likelihood phylodynamic analysis Bayesian phylogenetics with BEAUti and the BEAST 1.7 Many-core algorithms for statistical phylogenetics Moderate mutation rate in the SARS coronavirus genome and its implications Transmission and evolution of the Middle East respiratory syndrome coronavirus in Saudi Arabia: a descriptive genomic study Spread, circulation, and evolution of the Middle East respiratory syndrome coronavirus Relaxed phylogenetics and dating with confidence Posterior summarization in Bayesian phylogenetics using tracer 1.7 HIV-TRACE (TRAnsmission Cluster Engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination Transmission dynamics and evolutionary history of 2019-nCoV Potential of large 'first generation' human-tohuman transmission of 2019-nCoV A single point mutation creating a furin cleavage site in the spike protein renders porcine epidemic diarrhea coronavirus trypsin independent for cell entry and fusion A fluorogenic peptide cleavage assay to screen for proteolytic activity: applications for coronavirus spike protein activation Functional analysis of potential cleavage sites in the MERS-coronavirus spike protein Structural investigation of the HIV-1 envelope glycoprotein gp160 cleavage site 3: role of sitespecific mutations Maturation of HIV envelope glycoprotein precursors by cellular endoproteases Processing and routage of HIV glycoproteins by furin to the cell surface The convertases furin and PC1 can both cleave the human immunodeficiency virus (HIV)-1 envelope glycoprotein gp160 into gp120 (HIV-1 SU) and gp41 (HIV-I TM) Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2 The authors declare that there are no conflict of interests. XL conceived and designed the study and drafted the manuscript. XL, BF, and AC analyzed the data. XL, QN, JZ, QZ, YL, BF, and AC interpreted the data and provided critical comments. All authors reviewed and approved the final manuscript. http://orcid.org/0000-0002-3470-2196