key: cord-015503-j99cgsjt authors: Tang, Xiaolu; Wu, Changcheng; Li, Xiang; Song, Yuhe; Yao, Xinmin; Wu, Xinkai; Duan, Yuange; Zhang, Hong; Wang, Yirong; Qian, Zhaohui; Cui, Jie; Lu, Jian title: On the origin and continuing evolution of SARS-CoV-2 date: 2020-03-03 journal: Natl Sci Rev DOI: 10.1093/nsr/nwaa036 sha: doc_id: 15503 cord_uid: j99cgsjt The SARS-CoV-2 epidemic started in late December 2019 in Wuhan, China, and has since impacted a large portion of China and raised major global concern. Herein, we investigated the extent of molecular divergence between SARS-CoV-2 and other related coronaviruses. Although we found only 4% variability in genomic nucleotides between SARS-CoV-2 and a bat SARS-related coronavirus (SARSr-CoV; RaTG13), the difference at neutral sites was 17%, suggesting the divergence between the two viruses is much larger than previously estimated. Our results suggest that the development of new variations in functional sites in the receptor-binding domain (RBD) of the spike seen in SARS-CoV-2 and viruses from pangolin SARSr-CoVs are likely caused by mutations and natural selection besides recombination. Population genetic analyses of 103 SARS-CoV-2 genomes indicated that these viruses evolved into two major types (designated L and S), that are well defined by two different SNPs that show nearly complete linkage across the viral strains sequenced to date. Although the L type (∼70%) is more prevalent than the S type (∼30%), the S type was found to be the ancestral version. Whereas the L type was more prevalent in the early stages of the outbreak in Wuhan, the frequency of the L type decreased after early January 2020. Human intervention may have placed more severe selective pressure on the L type, which might be more aggressive and spread more quickly. On the other hand, the S type, which is evolutionarily older and less aggressive, might have increased in relative frequency due to relatively weaker selective pressure. 
These findings strongly support an urgent need for further immediate, comprehensive studies that combine genomic data, epidemiological data, and chart records of the clinical symptoms of patients with coronavirus disease 2019 (COVID-19).

The coronavirus disease 2019 (COVID-19) epidemic started in late December 2019 in Wuhan, the capital of Central China's Hubei Province. Since then, it has rapidly spread across China and to other countries, raising major global concerns. The etiological agent is a novel coronavirus, SARS-CoV-2, named for the similarity of its symptoms to those induced by severe acute respiratory syndrome. As of February 28, 2020, 78,959 cases of SARS-CoV-2 infection have been confirmed in China, with 2,791 deaths. Worryingly, there have also been more than 3,664 confirmed cases outside of China in 46 countries and areas (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/), raising significant doubts about the likelihood of successful containment. Further, the genomic sequences of SARS-CoV-2 viruses isolated from a number of patients share sequence identity higher than 99.9%, suggesting a very recent host shift into humans [1] [2] [3]. Coronaviruses are naturally hosted and evolutionarily shaped by bats [4, 5]. Indeed, it has been postulated that most of the coronaviruses in humans are derived from the bat reservoir [6, 7]. Unsurprisingly, several teams have recently confirmed the genetic similarity between SARS-CoV-2 and a bat betacoronavirus of the sub-genus Sarbecovirus [8] [9] [10] [11] [12] [13]. The whole-genome sequence of the novel virus is 96.2% identical to that of a bat SARS-related coronavirus (SARSr-CoV; RaTG13) collected in Yunnan province, China [2, 14], but is not very similar to the genomes of SARS-CoV (about 79%) or MERS-CoV (about 50%) [1, 15]. It has also been confirmed that SARS-CoV-2 uses the same receptor, angiotensin-converting enzyme II (ACE2), as SARS-CoV [11].
Although the specific route of transmission from natural reservoirs to humans remains unclear [5, 13], several studies have shown that pangolins may have provided a partial spike gene to SARS-CoV-2; the critical functional sites in the spike protein of SARS-CoV-2 are nearly identical to those identified in a virus isolated from a pangolin [16] [17] [18]. Despite these recent discoveries, several fundamental issues related to the evolutionary patterns and driving forces behind this outbreak of SARS-CoV-2 remain unexplored [19]. Herein, we investigated the extent of molecular divergence between SARS-CoV-2 and other related coronaviruses and carried out population genetic analyses of 103 sequenced genomes of SARS-CoV-2. This work provides new insights into the factors driving the evolution of SARS-CoV-2 and its pattern of spread through the human population.

For each annotated ORF in the reference genome of SARS-CoV-2 (NC_045512), we extracted the orthologous sequences in human SARS-CoV, four bat SARS-related coronaviruses (SARSr-CoV: RaTG13, ZXC21, ZC45, and BM48-31), one Pangolin SARSr-CoV from Guangdong (GD) [17], and six Pangolin SARSr-CoV genomes from Guangxi (GX) [18] (Table S1). We aligned the coding sequences (CDSs) based on the protein alignments (see Materials and Methods). Most ORFs annotated from SARS-CoV-2 were found to be conserved in other viruses, except for ORF8 and ORF10 (Table 1). The protein sequence of SARS-CoV-2 ORF8 shared very low similarity with sequences in SARS-CoV and BM48-31, and ORF10 had a premature stop codon in both SARS-CoV and BM48-31 (Fig. S1). A one-base deletion caused a frame-shift mutation in ORF10 of ZXC21 (Fig. S1). To investigate the phylogenetic relationships between these viruses at the genomic scale, we concatenated the CDSs of the nine conserved ORFs (orf1ab, E, M, N, S, ORF3a, ORF6, ORF7a, and ORF7b) and reconstructed the phylogenetic tree using the synonymous sites (Fig. 1A).
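According to the Materials and Methods, the tree was built by neighbor-joining in MEGA-X under the Kimura 2-parameter (K2P) model, using only the third positions of codons. The distance underlying such a tree can be illustrated with a minimal Python sketch. This is not the MEGA-X implementation: the function names and the toy sequences are hypothetical, and columns containing gaps or ambiguous bases are simply skipped, which is an assumption of this sketch.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def third_positions(cds: str) -> str:
    """Return every third nucleotide of an in-frame coding sequence."""
    return cds[2::3]


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences.

    P is the proportion of transitions, Q the proportion of transversions:
        d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    Columns with gaps or ambiguous bases are skipped (assumption).
    """
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy, hypothetical codon-aligned sequences (not taken from any virus genome).
cds_a = "ATGGCTAAAGGTCTG"
cds_b = "ATGGCCAAAGGACTG"
distance = k2p_distance(third_positions(cds_a), third_positions(cds_b))
print(f"K2P distance at third codon positions: {distance:.4f}")
```

Restricting the comparison to third codon positions is a common shortcut for focusing on mostly synonymous changes; it is coarser than the dS estimates reported in Table 1.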
We also used CODEML in the PAML [20] package to calculate the dN, dS, and ω values for each branch (Fig. 1A). In parallel, we also calculated the pairwise dN, dS, and ω values between SARS-CoV-2 and another virus (Table 1). The genome-wide phylogenetic tree indicated that SARS-CoV-2 was closest to RaTG13, followed by GD Pangolin SARSr-CoV, then by GX Pangolin SARSr-CoVs, then by ZC45 and ZXC21, then by human SARS-CoV, and finally by BM48-31 (Fig. 1A). Notably, we found that the nucleotide divergence at synonymous sites between SARS-CoV-2 and other viruses was much higher than previously anticipated. For example, although the genomic nucleotides differ by ~4% overall between SARS-CoV-2 and RaTG13, the genomic average dS was 0.17, which means the divergence at the neutral sites is 17% between these two viruses (Table 1). This is because the nonsynonymous sites are usually under stronger negative selection than synonymous sites, and calculating sequence differences without separating these two classes of sites may underestimate the extent of molecular divergence several-fold. Notably, the dS value varied considerably across genes in SARS-CoV-2 and the other viruses analyzed. In particular, the spike gene (S) consistently exhibited larger dS values than other genes (Table 1). This pattern became clear when we calculated the dS value for each branch in Fig. 1A for the spike gene versus the concatenated sequences of the remaining genes (Fig. S2). In each branch, the dS of spike was 2.22 ± 1.35 (mean ± SD) times as large as that of the other genes. This extremely elevated dS value of spike could be caused either by a high mutation rate or by natural selection that favors synonymous substitutions. Synonymous substitutions may serve as another layer of genetic regulation, guiding the efficiency of mRNA translation by changing codon usage [21].
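The contrast between a ~4% overall difference and a dS of 0.17 rests on treating synonymous and nonsynonymous changes as separate classes. The following Python sketch shows only the first step of that idea: classifying codon differences between two codon-aligned sequences with the standard genetic code. It is a deliberately crude illustration, not the YN00 or CODEML method used in the paper, since it neither counts synonymous and nonsynonymous sites nor corrects for multiple hits. The function name and the example codons are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_differences(cds1: str, cds2: str):
    """Count differing codons as (synonymous, nonsynonymous, multi_hit).

    Only codons differing at exactly one nucleotide are classified;
    codons differing at two or three positions are tallied as multi_hit.
    Codons with gaps or ambiguous bases are ignored.
    """
    syn = nonsyn = multi = 0
    for i in range(0, min(len(cds1), len(cds2)) - 2, 3):
        c1, c2 = cds1[i:i + 3].upper(), cds2[i:i + 3].upper()
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if sum(x != y for x, y in zip(c1, c2)) > 1:
            multi += 1
        elif CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn, multi


# Illustrative codons only: AGT->AGC keeps serine (synonymous),
# TCA->TTA turns serine into leucine (nonsynonymous).
counts = classify_codon_differences("AGTTCA", "AGCTTA")
print(counts)
```

A full dS estimate additionally divides the synonymous differences by the number of synonymous sites and applies a substitution model, which is what YN00 and CODEML do.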
If positive selection is the driving force for the higher synonymous substitution rate seen in spike, we expect the frequency of optimal codons (FOP) of spike to be different from that of other genes. However, our codon usage bias analysis (Table S2) suggests the FOP of spike was only slightly higher than the genomic average (0.717 versus 0.698, see Materials and Methods). Thus, we believe that the elevated synonymous substitution rate measured in spike is more likely caused by higher mutational rates; however, the underlying molecular mechanism remains unclear.

Both SARS-CoV and SARS-CoV-2 bind to ACE2 through the RBD of spike protein in order to initiate membrane fusion and enter human cells [1, 2, [22] [23] [24] [25] [26]. Five out of the six critical amino acid (AA) residues in RBD were different between SARS-CoV-2 and SARS-CoV (Fig. 1B), and a 3D structural analysis indicated that the spike of SARS-CoV-2 has a higher binding affinity to ACE2 than SARS-CoV [23]. Intriguingly, these same six critical AAs are identical between GD Pangolin-CoV and SARS-CoV-2 [16]. In contrast, although the genomes of SARS-CoV-2 and RaTG13 are more similar overall, only one out of the six functional sites is identical between the two viruses (Fig. 1B). It has been proposed that the SARS-CoV-2 RBD region of the spike protein might have resulted from recent recombination events in pangolins [16] [17] [18]. Although several ancient recombination events have been described in spike [27, 28], it also seems likely that the identical functional sites in SARS-CoV-2 and GD Pangolin-CoV may actually be the result of coincidental convergent evolution [18]. If the functional AA residues in the SARS-CoV-2 RBD region were acquired from GD Pangolin-CoV in a very recent recombination event, we would expect the nucleotide sequences of this region to be nearly identical between the two viruses.
However, for the CDS sequences that span five critical AA sites in the SARS-CoV-2 spike (ranging from codon 484 to 507, covering five adjacent functional sites: F486, Q493, S494, N501, and Y505; Fig. S3 originated from the GD Pangolin-CoV due to a very recent recombination event. Alternatively, it seems more likely that a high mutation rate in spike, coupled with strong natural selection, has shaped the identical functional AA residues between these two viruses, as proposed previously [18] . Although these sites are maintained in SARS-CoV-2 and GD Pangolin-CoV, mutations may have changed the residues in the RaTG13 lineage after it diverged from SARS-CoV-2 (the blue arrow in Fig. 1A ). In summary, it seems that the shared identity of critical AA sites between SARS-CoV-2 and GD Pangolin-CoV might be due to random mutations coupled with natural selection, and not necessarily recombination. The genome-wide ω value between SARS-CoV-2 and other viruses ranged from 0.044 to 0.124 (Table 1) We downloaded 103 publicly available SARS-CoV-2 genomes, aligned the sequences, and identified the genetic variants. For ease of visualization, we marked each virus strain based on the location and date the virus was isolated with the format of "Location_Date" throughout this study (see Table S1 for details; Each ID did not contain information of the patient's race or ethnicity). Although SARS-CoV-2 is an RNA virus, for simplicity, we presented our results based on DNA sequencing results throughout this study (i.e., the nucleotide T (70/83) of nonsynonymous mutations), indicating either a recent origin [30] or population growth [31] . In general, the derived alleles of synonymous mutations were significantly skewed towards higher frequencies than those of nonsynonymous ones (P < 0.01, Wilcoxon rank-sum test; Fig. 2 ), suggesting the nonsynonymous mutations tended to be selected against. 
However, 16.3% (7 out of 43) of the synonymous mutations and one nonsynonymous mutation (ORF8 (S84L, 28,144)) had a derived frequency of ≥ 70% across the SARS-CoV-2 strains. The nonsynonymous mutations that had derived alleles in at least two SARS-CoV-2 strains affected six proteins: orf1ab (A117T, I1607V, L3606F, I6075T), S (H49Y, V367F), ORF3a (G251V), ORF7a (P34S), ORF8 (V62L, S84L), and N (S194L, S202N, P344S). To detect possible recombination among SARS-CoV-2 viruses, we used Haploview [32] to analyze and visualize the patterns of linkage disequilibrium (LD) between variants with minor alleles in at least two SARS-CoV-2 strains (Fig. 3A). Since most mutations were at very low frequencies, it is not surprising that many pairs had a very low r² or LOD value (Fig. 3B-C). Consistent with another recent report [31], we did not find evidence of recombination between the SARS-CoV-2 strains. However, we found that SNPs at location 8,782 (orf1ab: T8517C, synonymous) and 28,144 (ORF8: C251T, S84L) showed significant linkage, with an r² value of 0.954 (Fig. 3B, red) and a LOD value of 50.13 (Fig. 3C, red). Among the 103 SARS-CoV-2 virus strains, 101 exhibited complete linkage between the two SNPs: 72 strains exhibited a "CT" haplotype (defined as "L" type because T28,144 is in the codon of Leucine) and 29 strains exhibited a "TC" haplotype (defined as "S" type because C28,144 is in the codon of Serine) at these two sites. Thus, we categorized the SARS-CoV-2 viruses into two major types, with L being the major type (~70%) and S being the minor type (~30%). Although we defined the L and S types based on two tightly linked SNPs, strikingly, the separation between the L (blue) and S (red) types was maintained when we reconstructed the haplotype networks using all the SNPs in the SARS-CoV-2 genomes (Fig. 4A; the number of mutations between two neighboring haplotypes was inferred parsimoniously).
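The typing rule and the linkage statistic described above can be illustrated with a short Python sketch. It assigns a strain to the L or S type from its nucleotides at sites 8,782 and 28,144 and computes the standard r² between two biallelic sites from haplotype counts. The function names are hypothetical. The counts of 72 "CT" and 29 "TC" strains come from the text, and the single "CC" strain is the Shenzhen isolate mentioned in the Discussion; the haplotype of the remaining strain is not given here, so it is omitted. The result is therefore only illustrative and is not expected to reproduce the Haploview output exactly.

```python
from collections import Counter


def classify_type(haplotype: str) -> str:
    """Map the two-site haplotype (site 8,782 then site 28,144) to a type."""
    return {"CT": "L", "TC": "S"}.get(haplotype.upper(), "unclassified")


def r_squared(haplotypes):
    """r^2 between two biallelic sites, given two-character haplotypes.

    r^2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)), with D = pAB - pA * pB.
    The alleles of the first haplotype are used as the reference alleles;
    for biallelic sites the choice does not change r^2.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]
    p_a = sum(c for h, c in counts.items() if h[0] == a) / n
    p_b = sum(c for h, c in counts.items() if h[1] == b) / n
    p_ab = counts[a + b] / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one site is monomorphic")
    return d * d / denom


# 72 L-type, 29 S-type, and one "CC" strain; the 103rd strain is omitted
# because its haplotype is not stated in the text (assumption of this sketch).
sample = ["CT"] * 72 + ["TC"] * 29 + ["CC"]
type_counts = Counter(classify_type(h) for h in sample)
r2 = r_squared(sample)
print(type_counts, f"r^2 = {r2:.3f}")
```

With these inputs r² is about 0.953, which shows how two nearly perfectly associated sites yield a value close to 1.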
This analysis further supports the idea that the two linked SNPs at sites 8,782 and 28,144 adequately define the L and S types of SARS-CoV-2. To determine whether L or S type is ancestral, we examined the genomic alignments of SARS-CoV-2 and other highly related viruses. Strikingly, nucleotides of the S type at sites 8,782 and 28,144 were identical to the orthologous sites in the most closely related viruses ( Fig. 4B) . Remarkably, both sites were highly conserved in other viruses as well. Hence, although the L type (~70%) was more prevalent than the S type (~30%) in the SARS-CoV-2 viruses we examined, the S type is actually the ancestral version of SARS-CoV-2. To further examine the relationship among the strains in the L and S types, we reconstructed a phylogenetic tree of all the 103 SARS-CoV-2 viruses based on their whole-genome sequences. Our phylogenetic tree also clearly shows the separation of the two types (Fig. 5) . Viruses of the L type (blue) first clustered together, and likewise, viruses of the S type (red) were also more closely related to each other. Therefore, our whole-genome comparisons further confirm the separation of the L and S types. Thus far, we found that, although the L type is derived from the S type, L (~70%) is more prevalent than S (~30%) among the sequenced SARS-CoV-2 genomes we examined. This pattern suggests that L has a higher transmission rate than the S type. Furthermore, our mutational load analysis indicated that the L type had accumulated a significantly higher number of derived mutations than S type (P < 0.0001, Wilcoxon rank-sum test; Fig. S5 ). We propose that, although the L type newly evolved from the ancient S type, it transmits faster or replicates faster in human populations, causing it to accumulate more mutations than the S type. Thus, our results suggest the L might be more aggressive than the S type due to the potentially higher transmission and/or replication rates. 
To test whether the two types of SARS-CoV-2 had differences in temporal and spatial distributions, we stratified the viruses based on the locations and dates they were isolated (Fig. 6 and Table S3). If the L type is more aggressive than the S type, why did the relative frequency of the L type decrease compared to the S type in other places after the initial outbreak in Wuhan? One possible explanation is that, since January 2020, the Chinese central and local governments have taken rapid and comprehensive prevention and control measures. These human intervention efforts might have caused severe selective pressure against the L type, which might be more aggressive and spread more quickly. The S type, on the other hand, might have experienced weaker selective pressure from human intervention, leading to an increase in its relative abundance among the SARS-CoV-2 viruses. Thus, we hypothesized that the two types of SARS-CoV-2 viruses might have experienced different selective pressures due to different epidemiological features. Of note, the above analyses were based on very patchy SARS-CoV-2 genomes that were collected from different locations and time points. More comprehensive genomic data are required for further testing of our hypothesis. It is currently unclear how the L type specifically evolved from the S type during the development of SARS-CoV-2. However, we found that the sequence of viruses isolated from

To further investigate the heteroplasmy of SARS-CoV-2 viruses in patients, we searched 12 deep-sequencing libraries of SARS-CoV-2 genomes that were deposited in the Sequence Read Archive (SRA) (Table S4, Materials and Methods). We found 17 genomic sites that showed evidence of heteroplasmy of SARS-CoV-2 virus in five patients, but we did not find any other instances of the co-existence of L and S types in any patient (Table 2). These findings evince the developing complexity of the evolution of SARS-CoV-2 infections.
Further studies investigating how the different alleles of SARS-CoV-2 viruses compete with each other will be of significant value. In this study, we investigated the patterns of molecular divergence between SARS-CoV-2 and other related coronaviruses. Although the genomic analyses suggested that SARS-CoV-2 was closest to RaTG13, their difference at neutral sites was much higher than previously realized. Our results provide novel insights into tracing the intermediate natural host of SARS-CoV-2. With population genetic analyses of 103 genomes of SARS-CoV-2, we found that SARS-CoV-2 viruses evolved into two major types (L and S types), and the two types were well defined by just two SNPs that show nearly complete linkage across SARS-CoV-2 strains. Although the L type (~70%) was more prevalent than the S type (~30%) in the SARS-CoV-2 viruses we examined, our evolutionary analyses suggested the S type was most likely the more ancient version of SARS-CoV-2. Our results also support the idea that the L type is more aggressive than the S type. Since nonsynonymous sites are usually under stronger negative selection than synonymous sites, calculating sequence differences without separating these two classes of sites could lead to a potentially significant underestimate of the degree of molecular divergence. For example, although the overall nucleotides only differed by ~4% between SARS-CoV-2 and RaTG13, the genomic average dS value, which is usually a neutral proxy, was 0.17 between these two viruses ( Table 1) . Of note, the genome-wide dS value is 0.012 between humans and chimpanzees [33] , and 0.08 between humans and rhesus macaques [34] . Thus, the neutral molecular divergence between SARS-CoV-2 and RaTG13 is 14 times larger than that between humans and chimpanzees, and twice as large as that between humans and macaques. 
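The fold differences quoted above follow directly from the dS values given in the text. A small Python check of the arithmetic, using only those quoted numbers (the dictionary keys and the helper name are hypothetical labels):

```python
# dS values as quoted in the text; OVERALL_DIFF is the ~4% overall
# nucleotide difference between SARS-CoV-2 and RaTG13.
DS = {
    "SARS-CoV-2 vs RaTG13": 0.17,
    "human vs chimpanzee": 0.012,
    "human vs rhesus macaque": 0.08,
}
OVERALL_DIFF = 0.04


def fold(numerator: str, denominator: str) -> float:
    """Ratio of two dS values from the DS table."""
    return DS[numerator] / DS[denominator]


vs_chimp = fold("SARS-CoV-2 vs RaTG13", "human vs chimpanzee")
vs_macaque = fold("SARS-CoV-2 vs RaTG13", "human vs rhesus macaque")
underestimate = DS["SARS-CoV-2 vs RaTG13"] / OVERALL_DIFF

print(f"vs human-chimpanzee: {vs_chimp:.1f}x")
print(f"vs human-macaque: {vs_macaque:.1f}x")
print(f"dS relative to overall difference: {underestimate:.2f}x")
```

The ratios are about 14 and about 2, matching the statements in the text, and the neutral divergence is roughly four times the overall nucleotide difference.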
The genomic average dS value between SARS-CoV-2 and GD Pangolin-CoV is 0.475, which is comparable to that between humans and mice (0.5) [35], and the dS value between

Our analyses of molecular evolution and population genetics suggested that some amino acid changes might be favored by natural selection during the evolution of SARS-CoV-2 and other related viruses. However, negative selection appears to be the predominant force acting on these viruses. Interestingly, the virus isolated from one patient in Shenzhen on January 13, 2020 (SZ_2020/01/13.a, GISAID ID: EPI_ISL_406592) had C at both positions 8,782 and 28,144 in the genome, belonging to neither the L nor the S type (Fig. 4A and 5). Notably, this strain had one stop-gain mutation in orf1ab and had accumulated 20 silent and 5 nonsynonymous mutations after diverging from the ancestor haplotype (Fig. 4A). Thus, it is possible that functional constraints on the genomic sequence were weakened after the disruption of orf1ab in this strain. Notably, the virus isolated from a patient living in South Korea (Skorea_2020/01.a, GISAID: EPI_ISL_411929) had acquired six nonsynonymous mutations that differed from the most recent common ancestor of SARS-CoV-2: orf1ab (M902I and T6891M), S (S221W), ORF3a (W128L and G251V), and E (L37H). If these changes are not due to sequencing errors, it would be interesting to test whether and how these mutations affect the transmission and pathogenesis of SARS-CoV-2.

In this work, we propose that SARS-CoV-2 can be divided into two major types (L and S types): the S type is ancestral, and the L type evolved from the S type. Intriguingly, the S and L types can be clearly defined by just two tightly linked SNPs at positions 8,782 (orf1ab: T8517C, synonymous) and 28,144 (ORF8: C251T, S84L). However, it is currently unclear whether the L type evolved from the S type in humans or in the intermediate hosts. It is also unclear whether the L type is more virulent than the S type.
orf1ab, which encodes replicase/transcriptase, is required for viral genome replication and might also be important for viral pathogenesis [36] . Although the T8517C mutation in orf1ab does not change the protein sequence (it changes the codon AGT (Ser) to AGC (Ser)), we hypothesized this mutation might affect orf1ab translation since AGT is preferred while AGC is unpreferred (Table S2 ). ORF8 promotes the expression of ATF6, the ER unfolded protein response factor, in human cells [37] . Thus, it will be interesting to investigate the function of the S84L AA change in ORF8, as well as the combinatory effect of these two mutations in SARS-CoV-2 pathogenesis. In summary, our analyses of 103 sequenced SARS-CoV-2 genomes suggest that the L type is more aggressive than the S type and that human interference may have shifted the relative abundance of L and S type soon after the SARS-CoV-2 outbreak. As previously noted [19] , the data examined in this study are still very limited, and follow-up analyses of a larger set of data are needed to have a better understanding of the evolution and epidemiology of SARS-CoV-2. There is a strong need for further immediate, comprehensive studies that combine genomic data, epidemiological data, and chart records of the clinical symptoms of patients with SARS-CoV-2. The set of 103 complete genome sequences were downloaded from GISAID (Global Initiative on Sharing All Influenza Data; https://www.gisaid.org/) with acknowledgment, GenBank (https://www.ncbi.nlm.nih.gov/genbank), and NMDC (http://nmdc.cn/#/nCoV). Sequences and annotations of the reference genome of SARS-CoV-2 (NC_045512) and other related viruses were downloaded from GenBank or GISAID (Table S1 ). The genomic sequences of SARS-CoV-2 were aligned using MUSCLE v3.8.31 [38] . The annotated CDSs of other viruses were downloaded from GenBank. 
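The preferred and unpreferred codon classes referred to in this study are defined in the Materials and Methods through RSCU (observed codon frequency divided by the frequency expected under equal usage of the codons for the same amino acid) and FOP (preferred codons divided by preferred plus unpreferred codons). The Python sketch below illustrates those definitions. It is a minimal reading of the stated formulas, not the authors' script: the function names are hypothetical, stop codons are excluded, and the toy sequences are constructed so that AGT comes out preferred and AGC unpreferred, mimicking the pattern described for Table S2 rather than using real genome data.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds: str) -> Counter:
    """Count sense codons in an in-frame coding sequence."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")


def rscu(reference_counts: Counter) -> dict:
    """RSCU per codon: observed count / (family total / family size)."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(reference_counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the reference set
        expected = total / len(codons)
        for c in codons:
            values[c] = reference_counts[c] / expected
    return values


def fop(gene_cds: str, rscu_values: dict) -> float:
    """Preferred codons (RSCU > 1) over preferred plus unpreferred (RSCU < 1)."""
    preferred = unpreferred = 0
    for codon, n in codon_counts(gene_cds).items():
        value = rscu_values.get(codon)
        if value is None:
            continue
        if value > 1:
            preferred += n
        elif value < 1:
            unpreferred += n
    total = preferred + unpreferred
    return preferred / total if total else float("nan")


# Toy reference: serine encoded seven times by AGT and once by AGC.
toy_rscu = rscu(codon_counts("AGT" * 7 + "AGC"))
toy_fop = fop("AGTAGCAGC", toy_rscu)
print(toy_rscu["AGT"], toy_rscu["AGC"], toy_fop)
```

Codons with an RSCU of exactly 1, including those of the single-codon amino acids Met and Trp, fall into neither class and therefore do not enter the FOP in this sketch.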
To avoid missing annotations in other viruses, we also annotated the ORFs with Exonerate (--model protein2genome:bestfit --score 5 -g y) [39], based on the CDSs annotated in SARS-CoV-2. The protein sequences of SARS-CoV-2 and other related viruses were aligned with MUSCLE v3.8.31 [38], and the codon alignments were made based on the protein alignment with RevTrans [40]. The codon alignments of the conserved ORFs were further concatenated for downstream evolutionary analysis. The phylogenetic tree was constructed by the neighbor-joining method in MEGA-X [41] using the Kimura 2-parameter model, and only the third positions of codons were considered. YN00 from PAML v4.9a [20] was used to calculate the pairwise divergence between SARS-CoV-2 and other viruses for each individual gene or for the concatenated sequences. The free-ratio model in CODEML in the PAML [20] package was used to calculate the dN, dS, and ω values for each branch. Positive selection was detected using EasyCodeML [42], a recently published wrapper of CODEML [20]. The M7 and M8 models were compared. In the M7 model, ω follows a beta distribution such that 0 ≤ ω ≤ 1, and in the M8 model, a proportion p0 of sites have ω drawn from the beta distribution, and the remaining sites, with proportion p1, are positively selected and have ω1 > 1. The LRTs between the M7 and M8 models were conducted by comparing twice the difference in log-likelihood values (2Δℓ) against a χ² distribution (df = 2). The positively selected sites were identified as those with a Bayes Empirical Bayes (BEB) score larger than 0.95. DnaSP v6.12.03 [43] was used to generate multi-sequence aligned haplotype data, and PopART v1.7 [44] was used to draw haplotype networks based on the haplotypes generated by DnaSP. RAxML v8.2.12 [45] was used to build the maximum likelihood phylogenetic tree of 103 aligned SARS-CoV-2 genomes with the parameters "-p 1234 -m GTRCAT".
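The likelihood ratio test between M7 and M8 described above reduces to a short calculation once the two log-likelihoods are available. The Python sketch below uses the fact that, for a χ² distribution with 2 degrees of freedom, the upper-tail probability is exactly exp(-x/2), so no statistics library is needed. The function name and the log-likelihood values are hypothetical; in practice the values would be read from the CODEML output.

```python
import math


def lrt_m7_vs_m8(lnl_m7: float, lnl_m8: float):
    """Likelihood ratio test of M8 (selection) against M7 (null), df = 2.

    Returns (statistic, p_value), where statistic = 2 * (lnL_M8 - lnL_M7).
    For a chi-square distribution with 2 degrees of freedom the survival
    function is exp(-x / 2), which gives the p-value directly.
    """
    statistic = max(2.0 * (lnl_m8 - lnl_m7), 0.0)  # guard against tiny negatives
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_m7_vs_m8(lnl_m7=-12345.67, lnl_m8=-12340.12)
print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4f}")

# Critical value at the 5% level for df = 2.
critical_5pct = -2.0 * math.log(0.05)
print(f"critical value (alpha = 0.05): {critical_5pct:.3f}")
```

A statistic above the critical value of about 5.99 rejects M7 in favor of M8 at the 5% level; individual sites are then assessed with the BEB posterior probabilities.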
We downloaded 12 SARS-CoV-2 metagenomic sequencing libraries (Table S4), and mapped the NGS reads to the reference genome of SARS-CoV-2 (NC_045512) using BWA (0.7.17-r1188) [46] with the default parameters. SNP calling was done using bcftools mpileup (bcftools 1.9) [47]. We calculated the RSCU (Relative Synonymous Codon Usage) value of each codon in the SARS-CoV-2 reference genome (NC_045512). The RSCU value for each codon was the observed frequency of this codon divided by its expected frequency under equal usage among the codons for the same amino acid [48]. The codons with RSCU > 1 were defined as preferred codons, and those with RSCU < 1 were defined as unpreferred codons. The FOP (frequency of optimal codons) value of each gene was calculated as the number of preferred codons divided by the total number of preferred and unpreferred codons.

The authors declare that they have no conflicts of interest.

For each gene, the dN and dS values between SARS-CoV-2 and another virus are given, and the dN/dS (ω) ratio is given in parentheses.

[1] Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
[2] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[3] Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study
[4] Origin and evolution of pathogenic coronaviruses
[5] Bat origin of a new human coronavirus: there and back again. Science China Life Sciences
[6] Bats are natural reservoirs of SARS-like coronaviruses
[7] Detection of group 1 coronaviruses in bats in North America
[8] Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China
[9] Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission. Sci China Life Sci
[10] The 2019-new coronavirus epidemic: Evidence for virus evolution
[11] Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin
[12] Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan
[13] Evolutionary Perspectives on Novel Coronaviruses Identified in Pneumonia Cases in China
[14] Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event
[15] Return of the Coronavirus: 2019-nCoV
[16] Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019
[17] Isolation and Characterization of 2019-nCoV-like Coronavirus from Malayan Pangolins
[18] Identification of 2019-nCoV related coronaviruses in Malayan pangolins in southern China
[19] Moral imperative for the immediate release of 2019-nCoV sequence data. National Science Review
[20] PAML 4: phylogenetic analysis by maximum likelihood
[21] Codon optimality, bias and usage in translation and mRNA decay
[22] Receptor recognition by novel coronavirus from Wuhan: An analysis based on decade-long structural studies of SARS
[23] Cryo-EM Structure of the 2019-nCoV Spike in the Prefusion Conformation
[24] Characterization of spike glycoprotein of 2019-nCoV on virus entry and its immune cross-reactivity with spike glycoprotein of SARS-CoV
[25] Identification of Two Critical Amino Acid Residues of the Severe Acute Respiratory Syndrome Coronavirus Spike Protein for Its Variation in Zoonotic Tropism Transition via a Double Substitution Strategy
[26] Difference in Receptor Usage between Severe Acute Respiratory Syndrome (SARS) Coronavirus and SARS-Like Coronavirus of Bat Origin
[27] A new coronavirus associated with human respiratory disease in China
[28] Homologous recombination within the spike glycoprotein of the newly identified coronavirus may boost cross-species transmission from snake to human
[29] Moderate mutation rate in the SARS coronavirus genome and its implications
[30] Origin time and epidemic dynamics of the 2019 novel coronavirus
[31] Decoding evolution and transmissions of novel pneumonia coronavirus using the whole genomic data
[32] Haploview: analysis and visualization of LD and haplotype maps
[33] The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome
[34] Evolutionary and Biomedical Insights from the Rhesus Macaque Genome
[35] Initial sequencing and comparative analysis of the mouse genome
[36] SARS coronavirus replicase proteins in pathogenesis
[37] Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus
[38] MUSCLE: multiple sequence alignment with high accuracy and high throughput
[39] Automated generation of heuristics for biological sequence comparison
[40] RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences
[41] Molecular Evolutionary Genetics Analysis across Computing Platforms
[42] EasyCodeML: A visual tool for analysis of selection using CodeML
[43] DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets
[44] popart: full-feature software for haplotype network construction
[45] RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
[46] Fast and accurate short read alignment with Burrows-Wheeler transform
[47] The Sequence Alignment/Map format and SAMtools
[48] Codon usage in regulatory genes in Escherichia coli does not reflect selection for 'rare' codons

The authors thank the researchers who generated and shared the sequencing data from

Note the derived alleles of synonymous mutations are skewed towards higher frequencies than those of nonsynonymous mutations.

A. LD plot of any two SNP pairs among the 29 sites that have minor alleles in at least two strains. The number near slashes at the top of the image shows the coordinate of sites in the genome. Color in the square is given by standard (D'/LOD), and the number in the square is the r² value. B.
The r² of each pair of SNPs (y-axis) against the genomic distance between that pair (x-axis). C. The LOD of each pair of SNPs (y-axis) against the genomic distance between that pair (x-axis). Note that in both B and C, the red point represents the LD between SNPs at 8,782 and 28,144.

A. The haplotype networks of SARS-CoV-2 viruses. Blue represents the L type, and red is the S type. The orange arrow indicates that the L type evolved from the S type. Note that in this study, we marked each sample with a unique ID that starts with the geographical location, followed by the date the virus was isolated (see Table S1 for details). Each ID does not contain information about the patient's race or ethnicity. B. Evolution of the L and S types of SARS-CoV-2 viruses. Genome sequence alignments with the seven most closely related viruses indicated that the S type was most likely the ancient version of SARS-CoV-2. ".", the nucleotide sequence is identical; "-", gap.

In our recent publication (https://doi.org/10.1093/nsr/nwaa036), we showed that among circulating SARS-CoV-2 (with 103 genomes analyzed) two different viral genomes co-exist. We identified them as lineages L and S. The amino acid we used to define the L and S lineages is located in ORF8 (open reading frame 8), which plays an as yet undefined role in the viral life cycle. Based on the finding that the L lineage has a higher frequency than the S lineage, we described the L lineage as aggressive. We now recognize that within the context of our study the term "aggressive" is misleading and should be replaced by the more precise term "a higher frequency". In short, while we have shown that the two lineages naturally co-exist, we provided no evidence supporting any epidemiological conclusion regarding the virulence or pathogenicity of SARS-CoV-2. Accordingly, corrections will be made in the print version of this paper to avoid being misleading.