key: cord-0852122-9aeacbyn authors: Shen, Libing; Zhang, Zhao; He, Funan title: The phylogenetic relationship within SARS-CoV-2s: an expanding basal clade date: 2020-11-24 journal: Mol Phylogenet Evol DOI: 10.1016/j.ympev.2020.107017 sha: f700209a7e3994e083edf2dcf6c0227f4c5fe79a doc_id: 852122 cord_uid: 9aeacbyn The COVID-19 pandemic is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) whose origin is still shed in mystery. In this study, we developed a method to search the basal SARS-CoV-2 clade among collected SARS-CoV-2 genome sequences. We first identified the mutation sites in the SARS-CoV-2 whole genome sequence alignment. Then by the pairwise comparison of the numbers of mutation sites among all SARS-CoV-2s, the least mutated clade was identified, which is the basal clade under parsimony principle. In our first analysis, we used 168 SARS-CoV-2 sequences (GISAID dataset till 2020/03/04) to identify the basal clade which contains 33 identical viral sequences from seven countries. To our surprise, in our second analysis with 367 SARS-CoV-2 sequences (GISAID dataset till 2020/03/17), the basal clade has 51 viral sequences, 18 more sequences added. The much larger NCBI dataset shows that this clade has expanded with 85 unique sequences by 2020/04/04. The expanding basal clade tells a chilling fact that the least mutated SARS-CoV-2 sequence was replicating and spreading for at least four months. It is known that coronaviruses have the RNA proofreading capability to ensure their genome replication fidelity. Interestingly, we found that the SARS-CoV-2 without its nonstructural proteins 13 to 16 (Nsp13-Nsp16) exhibits an unusually high mutation rate. Our result suggests that SARS-CoV-2 has an unprecedented RNA proofreading capability which can intactly preserve its genome even after a long period of transmission. Our selection analyses also indicate that the positive selection event enabling SARS-CoV-2 to cross species and adapt to human hosts might have been achieved before its outbreak. The coronavirus disease 2019 has been reported in over 200 countries (WHO, 2020) . It is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which is clustered with the SARS-CoVs of bat in a clade and regarded as a SARS-like virus . Although it has a lower mortality rate than SARS-CoV, SARS-CoV-2 exhibits a high contagious transmission pattern with low detectability, making this virus more threatening than any coronavirus before. Now COVID-19 becomes a global hazard instead of a regional crisis. The COVID-19 epidemic first broke out in early December, 2019, in Wuhan City, China. The news reported that SARS-CoV-2 was associated with a sea food market in Wuhan, but its origin is still shed in mystery. Even the evolutionary relationship within SARS-CoV-2s is very obscure. The bat RaTG13 coronavirus is usually used as the outgroup to root the SARS-CoV-2's phylogenetic tree in a lot of practices (Forster et al., 2020) . However, the bat RaTG13 coronavirus was discovered in 2013 and the great divergence between the bat virus and SARS-CoV-2s created a problem for phylogenetic inference, called long-branch attraction (Gribaldo and Philippe, 2002) . That is the fast evolving SARS-CoV-2 branches which would be wrongfully placed close to the bat RaTG13 coronavirus as if they were the ancestors to the other SARS-CoV-2s. In this work, we used a parsimony method to determine which SARS-CoV-2 clade is the most basal clade to the others. Our method does not need any outgroup and thus circumvents the long-branch attraction trap. To ours surprise, we identified an expanding basal clade in this work, which suggests that SARS-CoV-2 has an extraordinary RNA proofreading capability for protecting its genome from mutation. Additionally, we investigated the selection process for all SARS-CoV-2 coding sequences in this work and did not observe any significant positive selection event, which proposes that the positive selection enabling SARS-CoV-2 to cross species and adapt to human hosts has already achieved before its outbreak in Wuhan in early December 2019. By comparing SARS-CoV-2's nonstructural proteins with their SARS-CoV's counterparts, we found the fast evolving SARS-CoV virus (406592_Shenzhen_2020) harbors an premature termination codon at nonstructural protein 12 (Nsp12). Truncated nonstructural proteins of SARS-CoV-2 might affect its replication fidelity and susceptibility to antiviral drugs, such as Remdesivir (Agostini et al., 2018) . Finally, the much larger NCBI dataset shows that the least mutated SARS-CoV-2 sequence was replicating and spreading for at least four months. 430 SARS-CoV-2 whole genome sequences were downloaded from GISAID. Due to the continuous incoming SARS-CoV-2 genome data on GISAID, we downloaded the dataset in two batches. The first batch of 184 SARS-CoV-2 genome data was obtained till 2020/03/04 and the second batch of 246 SARS-CoV-2 genome data was obtained till 2020/03/17. After removing the low-coverage (shorter than 29000 base pairs) and low-quality sequences (with more than 100 ambiguous nucleotides), 367 SARS-CoV-2 sequences were kept for the next step analysis. For a much large dataset, we retrieved the SARS-CoV-2 genome data from Coronavirus genomes of NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/). We only downloaded the sequence of human host. The downloaded dataset has 7007 complete SARS-CoV-2 genome sequences. After removing the low-coverage (shorter than 29000 base pairs) and low-quality sequences (with more than 100 ambiguous nucleotides), 4678 SARS-CoV-2 genome sequences were kept for the further analysis. The collection dates for GISAID data were retrieved from GISAID cov2020 acknowledgement table (from 2019/12/30 to 2020/03/13) and the collection dates for NCBI data were from GBFF file (2019/12/30 to 2020/07/07). We used MUSCLE 3.8.31 to perform the alignment for all SARS-CoV-2 sequences with default parameters (Edgar, 2004) . The alignment results were trimmed with an inhouse Perl script to remove any gap or ambiguous site. In our first analysis (GISAID dataset till 2020/03/04), the final alignment of SARS-CoV-2s has 168 sequences and 29347 nucleotide sites. In our second analysis (GISAID dataset till 2020/03/17), the final alignment of SARS-CoV-2s has 367 sequences and 29188 nucleotide sites. For the NCBI SARS-CoV-2 dataset, the final alignment of 4678 SARS-CoV-2 genome sequences has 28683 nucleotide sites. In our first analysis, there are 211 mutation sites found in the final SARS-CoV-2 alignment. In our second analysis, there are 390 mutation sites in the final SARS-CoV-2 alignment. Each SARS-CoV-2 sequence was compared to the other SARS-CoV-2 sequences from the final alignment in pair-wise fashion. By doing so, the number of point mutations was calculated for each SARS-CoV-2 sequence. Under parsimony principle, the SARS-CoV-2 sequence that has the least number of point mutations are regarded as the ancestors (Figure 1 ). Substitution matrices for SARS-CoV-2s were computed with MEGA X (Kumar et al., 2018) . MEGA X (Hyphy model) and Datamonkey (branch-site model) were used to detect the positive selection sites among SARS-CoV-2s (Weaver et al., 2018) . The SARS-CoV-2's phylogenetic tree was constructed with FastTree 2.1.7 using the GTR model and gamma distribution (Price et al., 2010) . RAxML v8.2.12 and MrBayes 3.2.6 were also used to construct the phylogenetic trees of SARS-CoV-2s (Ronquist et al., 2012; Stamatakis, 2014) . In our first analysis (GISAID dataset till 2020/03/04), using the genomic DNA alignments of 168 SARS-CoV-2s, we calculated the transition-transversion biases for SARS-CoV-2s (Table 1 ). Our first analysis shows that the transition-transversion bias is 3.058 in SARS-CoV-2s. More C to T and T to C substitutions are observed in SARS-CoV-2s. In our second analysis (GISAID dataset till 2020/03/17), the genomic DNA alignment of 367 SARS-CoV-2s was used to calculated the transition-transversion bias. There are more A to G or G to A transition in the larger dataset and the transitiontransversion bias increases to 3.955 (Table 2 ). Using the number of point mutations in the least mutated clade, we calculate the mutation rate for SARS-CoV-2 as follows. We assume that there is no mutation in the least mutated clade and 365 total point mutations are shared by the rest 97 SARS-CoV-2 clades (Supplemental data 2). There are average 3.76 point mutations in each SARS-CoV-2. Our final alignment has 29188 nucleotide sites. So the mutation rate is 1.29×10 -4 per nucleotide. The SARS-CoV-2 epidemic in Wuhan started on the early December of 2019 and the final collection date of our second dataset is 2020/03/13 (almost four months) (Wu et al., 2020) . Thus, the mutation rate 3.87 ×10 -4 per nucleotide per year for the alignment length of 29188 nucleotide sites. For a complete SARS-CoV-2 genome with 29903 nucleotides (Wu et al., 2020) , the mutation rate is 3.95 ×10 -4 per nucleotide per year, which is almost 8 times lower than the mutation rate of SARS-CoV and 3 times lower than that of MERS-CoV (Table 3 ). The mutation rates and the transition-transversion bias for SARS-CoV and MERS-CoV are collected from literatures (2004; Cotten et al., 2014; Pavlovic-Lazetic et al., 2005; Zhang et al., 2016) . The transition-transversion bias also shows that SARS-CoV-2 has a much stable genome than SARS-CoV and MERS-CoV (Table 3) . In our first analysis (GISAID dataset till 2020/03/04), the final alignment of 168 SARS-CoV-2 sequences has 29347 nucleotides and only contains 211 non-conserved sites (Supplemental data 1). There are four recurrent point mutations which appear in more than 10 sequences. Using the virus sequence from a seafood market's worker as the reference (GISAID ID: EPI_ISL_402125), we annotated these four recurrent point mutations. They are C to T mutation at reference position 8782 (alignment position 8466, AGC to AGT, synonymous substitution, serine, ORFlab protein), G to T mutation at reference position 26144 (alignment position 25818, GGT to GTT, non-synonymous substitution, glycine to valine, ORF3a protein), T to C mutation at reference position 28144 (alignment position 27817, TTA to TCA, non-synonymous substitution, leucine to serine, ORF8 protein), and C to T mutation at reference position 29095 (alignment position 28768, TTC to TTT, synonymous substitution, phenylalanine, N protein). The similarity among SARS-CoV-2 sequences is quite high. 168 SARS-CoV-2 sequences can be further divided into 98 clades according to their sequence identities SARS-CoV-2 sequences are most likely to be neutral. Notably, 406592_Shenzhen_2020 has the largest number of mutations in both our analyses. In our first analysis (GISAID dataset till 2020/03/04), we constructed a maximum The T to C mutation at reference position 28144 that causes a leucine to serine change in ORF8 protein can be found in all SARS-CoV-2s in Clade 2, 3 and two special sequences marked with squares in Figure 2 (413017_Sourth_Korea_2020 and 406592_Shenzhen_2020). Since the basal clade has T at reference position 28144, leucine is the ancestral state, not serine. 413017_South_Korea_2020 also has the G to T mutation at reference position 26144. 406592_Shenzhen_2020 also has the C to T mutation at reference position 29095. In our second analysis (GISAID dataset till 2020/03/17), we constructed a maximum likelihood tree with 367 SARS-CoV-2 sequences and rooted it with the expanded least mutated clade. Due to no positive selection detected among SARS-CoV-2s, we did not annotate or classify any mutation in our second maximum likelihood tree. The added 199 sequences in our second analysis are mostly incorporated in our second tree's clades, instead of forming a single clade of their own. Because SARS-CoV-2s have a very slow evolutionary rate, only a few informative sites so far can be used for their phylogenetic analysis, which render the trees only suitable for the probable estimation of SARS-CoV-2s' evolutionary relationship and classification. Thus, both RAxML and MrBayes trees show a very poor performance in resolving SARS-CoV-2's phylogeny (Supplemental Figure 3 and 4) . These two methods require a substantial amount of informative sites to reconstruct a reasonably accurate phylogenetic tree. Such difficulty has also been stated by RAxML's authors (Benoit Morel, 2020). We further constructed a maximum likelihood tree for the much larger NCBI SARS-CoV-2 dataset. There are total 4678 SARS-CoV-2 sequences in the phylogenetic tree of the NCBI dataset. For such a large dataset, we identified the basal clade using the seafood market worker's SARS-CoV-2 (EPI_ISL_402125) whose NCBI accession ID is NC_045512. It is the first SARS-CoV-2 genome sequence obtained in the Wuhan COVID-19 epidemic (Wu et al., 2020) . We found that the basal clade had expanded with 85 sequences (Supplemental data 5). Two more countries, France and Pakistan, were added to the basal clade. The SARS-CoV-2 sequence of this clade was lastly found in Milwaukee county of Wisconsin, USA by 2020/04/04 (NCBI accession ID: MT706448). Using the seafood market worker's SARS-CoV-2 sequence (EPI_ISL_402125 and NC_045512) as the reference, we found nine major recurrent mutations in the NCBI dataset (Table 4 ). These mutations can classify most SARS-CoV-2s in NCBI dataset into six groups ( Figure 5 ). There are 470 SARS-CoV-2s in group one, 213 SARS-CoV-2s in group two, 783 SARS-CoV-2s in group three, 952 SARS-CoV-2s in group four, 660 SARS-CoV-2s in group five, and 1504 SARS-CoV-2s in group six, respectively. Frustratingly, multiple geographic locations could be found in each group due to SARS-CoV-2's low mutability. It looks like that the SARS-CoV-2s have already been everywhere. One SARS-CoV-2 sequence (406592_Shenzhen_2020) is especially notable in our point mutation comparison. It has the fastest evolutionary rate among all SARS-CoV-2s and a transition/transversion rate of 0.93. We compared it with the seafood market worker's SARS-CoV-2 (EPI_ISL_402125) and found that it at least accumulated 27 point mutations. 20 out of 27 mutations are non-synonymous ones which result in amino acid changes (Table 4 ). 13 out of 20 non-synonymous mutations occurred in the orf1ab replicase gene which is responsible for RNA proofreading during virus replication in coronaviruses (Denison et al., 2011) . After being processed by viral proteinases, orf1ab replicase gene produces 16 mature nonstructural proteins (Nsp1 to Nsp16) (Denison et al., 2011) . Each nsp has its own unique function. Using SARS-CoV sequence with NCBI accession number NC_004718 as a template, we annotated the non-synonymous mutations in orf1ab replicase gene according to their positions in nsp genes ( Table 5 ). The nsp2 gene is not required for viral replication (Graham et al., 2006) . The nsp8, nsp12, and nsp15 genes encode for a hexadecamer with putative processivity activities, RNA-dependent RNA polymerase (RdRp), and endoribonuclease (EndoU), respectively (Denison et al., 2011) . Interestingly, due to a premature stop codon mutation in its nsp12 gene, this fast evolving SARS-CoV-2 lacks nsp13 (RNA helicase-ATPase), nsp14 (exoribonuclease and methyltransferase), nsp15 (endoribonuclease), and nsp16 (RNA 2'-O-methyltransferase). Especially, nsp14 is essential for replication fidelity in coronavirus (Eckerle et al., 2010; Eckerle et al., 2007) . This result partly explains why 406592_Shenzhen_2020 accumulates a large number of mutations in its genome. This work is designed to resolve the evolutionary relationship within SARS-CoV-2s after its outbreak. In our first analysis (GISAID dataset till 2020/03/04), we found a basal clade with 33 sequences which are from seven countries. In our second analysis (GISAID dataset till 2020/03/17), the basal clade expanded into ten countries with 51 sequences. The much larger NCBI dataset shows that this clade has expanded with 85 sequences. For the basal SARS-CoV-2s, the earliest collection date is 2019/12/26 and the latest collection date is 2020/04/04, so they were immune to mutations for at least three months. Due to their average four-day incubation time, SARS-CoV-2s could spread before we could detect it (Linton et al., 2020) . Their extraordinary RNA proofreading capability further blurred the evolutionary relationship among SARS-CoV-2s. Thus, the origin of SARS-CoV-2 remains unknown. The best way to answer this question is to find the SARS-CoV-2's natural host. about 4 days (Linton et al., 2020) . In an extreme case, an incubation period of 27 days has been reported. The asymptomatic SARS-CoV-2 carriers could transmit this virus to others (Bai et al., 2020) . That's why it is really difficult to detect SARS-CoV-2 in the initial stage of its transmission. Our first analysis (GISAID dataset till 2020/03/04) shows that the basal clade has 33 identical SARS-CoV-2 sequences, about one fifth of total sequences used in our first analysis, gathered from all over the Pan-Pacific region. Our second analysis (GISAID dataset till 2020/03/17) shows that the basal clade expands to a total of 51 identical SARS-CoV-2s and covers ten countries. The very SARS-CoV-2 found in Wuhan's COVID-19 outbreak appeared in Europe afterwards. The basal SARS-CoV-2 must have infected tens of thousands of people, if not hundreds of thousands. For the number of people it infected and the geographic region it covered, the basal SARS-CoV-2 is a super virus for its high contagiousness and low detectability. We used three different methods to construct SARS-CoV-2's phylogenetic trees based on three batches of data. None of them yields a satisfactory result. Two taxa with faraway geographic locations are often clustered together. Thus, the phylogenetic analysis for SARS-CoV-2 is more suitable for its classification rather than its transmission route. So far, the origin of SARS-CoV-2 is still unknown. The phylogenetic tree of the NCBI dataset shows that at least its classification is reliable and the SARS-CoV-2 with the same major mutations are usually clustered together. The problem is that the SARS-CoV-2s from the same group always cover several continents (Supplemental data 3). It sends the alarming message from SARS-CoV-2: when you found one of them, they have been already everywhere. The transition-transversion bias for SARS-CoV-2 is 3.058 in our first analysis and 3.955 in our second analysis. The increasing transition-transversion bias suggests that SARS-CoV-2's genome becomes more stable during its global transmission. We did not detect any significant positive selection signal in its genome, which further proposes that the handful mutations we observed in SARS-CoV-2s are mostly neutral. For the SARS-CoV-2's evolution, we have to ask such a question: why does this virus evolve so slowly? The answer might lie in its orf1ab protein which encodes replicase polyproteins responsible for SARS-CoV-2's RNA proofreading capability and replication fidelity. The SARS-CoV-2 lost its nsp13 to nsp16 and has a very fast evolutionary rate. It has been proved that nsp14 is required for replication fidelity and mediates the antiviral effect of Remdesivir in coronaviruses (Agostini et al., 2018; Denison et al., 2011) . So far, we still do not understand the biological properties of SARS-CoV-2's nsp genes. They may hold the key to combat this virus. In conclusion, SARS-CoV-2 is a highly stable and contagious virus with low detectability. If COVID-19's patients does not get a proper medical care, the virus will unleash its decimating power, a mortality rate much higher than influenza. Its debut was an ultra-spreading event and COVID-19 has already become a global crisis now. Every harsh measure should be taken in order to contain it. Yellow hollow circle ○ indicates group 1 with no major recurrent mutation. Magenta hollow circle ○ indicates group 2 with C to T mutation at reference position 8782. Dark-olive-green hollow circle ○ indicates group 3 with C to T mutation at reference position 8782, C to T mutation at reference position 17747, A to G mutation at reference position 17858, and C to T mutation at reference position 18060. Dark-blue hollow circle ○ indicates group 4 with C to T mutation at reference position 3037, C to T mutation at reference position 14408, and A to G mutation at reference position 23403. Light-blue hollow circle ○ indicates group 5 with C to T mutation at reference position 3037, C to T mutation at reference position 14408, A to G mutation at reference position 23403, and G to T mutation at reference position 25563. Green hollow circle ○ indicates group 6 with C to T mutation at reference position 1059, C to T mutation at reference position 3037, C to T mutation at reference position 14408, A to G mutation at reference position 23403, and G to T mutation at reference position 25563. Supplemental Figure 1 . The maximum likelihood tree of 168 SARS-CoV-2s with statistical support values (SH-like local supports). Red filled circle • indicates clade 1 (basal clade). Red hollow circle ○ indicates the intermediate type between the basal clade and the other four clades. Light-blue triangle ▲ indicates clade 2 (C to T at 8782). Magenta triangle ▲ indicates clade 3 (C to T at 8782 and C to T at 29095). Dark-blue triangle ▲ indicates the basal clade 4 (G to T at 26144). Green filled circle • indicates clade 4 (no dominant mutation). Colored squares ■ and ■ indicate two specific sequences. Khaki branch indicates T to C substitution at 27817 (leucine to serine change). The nucleotide frequencies are 29.92% (A), 32.18% (T/U), 18.34% (C), and 19.56% (G). The transition/transversion rate ratios are k 1 = 3.297 (purines) and k 2 = 9.681 (pyrimidines). The nucleotide frequencies are 29.96% (A), 32.19% (T/U), 18.30% (C), and 19.55% (G). The transition/transversion rate ratios are k 1 = 4.95 (purines) and k 2 = 11.863 (pyrimidines). Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China Coronavirus Susceptibility to the Antiviral Remdesivir (GS-5734) Is Mediated by the Viral Polymerase and the Proofreading Exoribonuclease Presumed Asymptomatic Carrier Transmission of COVID-19 Phylogenetic analysis of SARS-CoV-2 data is difficult Coronaviruses: an RNA proofreading machine regulates replication fidelity and diversity Infidelity of SARS-CoV Nsp14-exonuclease mutant virus replication is revealed by complete genome sequencing High fidelity of murine hepatitis virus replication is decreased in nsp14 exoribonuclease mutants MUSCLE: multiple sequence alignment with high accuracy and high throughput Phylogenetic network analysis of SARS-CoV-2 genomes The nsp2 proteins of mouse hepatitis virus and SARS coronavirus are dispensable for viral replication Ancient phylogenetic relationships MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data SARS-CoV genome polymorphism: a bioinformatics study FastTree 2--approximately maximum-likelihood trees for large alignments MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies Datamonkey 2.0: A Modern Web Application for Characterizing Selective and Other Evolutionary Processes WHO, 2020. WHO Coronavirus Disease (COVID-19) Dashboard A new coronavirus associated with human respiratory disease in China Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission Evolutionary Dynamics of MERS-CoV: Potential Recombination, Positive Selection and Transmission SARS-CoV-2 with superior RNA proofreading capability has an expanding basal clade All authors declare no potential competing interest. This work used the online data and ethical approval is not applicable. LBS, ZZ and FNH collected and primarily processed data. LBS, ZZ and FNH analyzed the data and produced the figures. LBS and FNH wrote the first draft of this manuscript.LBS, ZZ and FNH revised the manuscript. We thank scientific communities all over the world for their selfless effort in this pandemic. Special gratitude to GISAID and NCBI for the SARS-CoV-2 data they provide.The first identified SARS-CoV-2 genome sequence (EPI_ISL_402125 and NC_045512) is used as the reference. Authors' contributions LBS, ZZ and FNH collected and primarily processed data. LBS, ZZ and FNH analyzed the data and produced the figures. LBS and FNH wrote the first draft of this manuscript.LBS, ZZ and FNH revised the manuscript. All authors consented the right for publication. All authors declare no potential competing interest.