key: cord-0962327-70g2ve88 authors: Liu, Qi; Zhao, Shilei; Shi, Cheng-Min; Song, Shuhui; Zhu, Sihui; Su, Yankai; Zhao, Wenming; Li, Mingkun; Bao, Yiming; Xue, Yongbiao; Chen, Hua title: Population Genetics of SARS-CoV-2: Disentangling Effects of Sampling Bias and Infection Clusters date: 2020-07-12 journal: Genomics Proteomics Bioinformatics DOI: 10.1016/j.gpb.2020.06.001 sha: 2bd2c9d7edfe4e6e532b5a565dc1d64d869d2b33 doc_id: 962327 cord_uid: 70g2ve88 A novel RNA virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is responsible for the ongoing outbreak of coronavirus disease 2019 (COVID-19). Population genetic analysis could be useful for investigating the origin and evolutionary dynamics of COVID-19. However, due to extensive sampling bias and existence of infection clusters during the epidemic spread, direct applications of existing approaches can lead to biased parameter estimations and data misinterpretation. In this study, we first present robust estimator for the time to the most recent common ancestor (TMRCA) and the mutation rate, and then apply the approach to analyze 12,909 genomic sequences of SARS-CoV-2. The mutation rate is inferred to be 8.69 × 10(-4) per site per year with a 95% confidence interval (CI) of [8.61 × 10(-4), 8.77 × 10(-4)], and the TMRCA of the samples inferred to be Nov 28, 2019 with a 95% CI of [Oct 20, 2019, Dec 9, 2019]. The results indicate that COVID-19 might originate earlier than and outside of Wuhan Seafood Market. We further demonstrate that genetic polymorphism patterns, including the enrichment of specific haplotypes and the temporal allele frequency trajectories generated from infection clusters, are similar to those caused by evolutionary forces such as natural selection. Our results show that population genetic methods need to be developed to efficiently detangle the effects of sampling bias and infection clusters to gain insights into the evolutionary mechanism of SARS-CoV-2. Software for implementing VirusMuT can be downloaded at https://bigd.big.ac.cn/biocode/tools/BT007081. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a novel RNA virus of the Coronaviridae family, caused an outbreak of coronavirus disease 2019 in China in late December 2019, and has been rapidly spreading to more than 214 countries and areas since then [1, 2] . COVID-19 is the third pandemic caused by coronavirus in the last 20 years; and it has resulted in more than 4, 993, 470 infections and claimed nearly 327,738 lives as of May 22, 2020 Among the extensive studies conducted on COVID-19, one essential question is to trace the origin and transmission between humans, shedding light on the molecular mechanism underlying epidemiological and pathological characteristics of the virus. Population genetic methods are often used to reconstruct evolutionary history of viral infectious diseases, which supplements our knowledge of epidemic or pandemic dynamics [3−6] . High evolutionary rates, which are typical of RNA viruses (10 -4 −10 -3 nucleotide substitutions per year) [7] , and large genome size of betacoronaviruses (~30 kb) leave sufficient amount of genomic polymorphisms within the time frame of epidemic outbreaks [8] . By interrogating the genomic data sampled at different time points of outbreaks, it is possible to estimate fundamental parameters of the evolutionary process, including evolutionary rate, population expansion rate, and the time when all sampled virus strains shared the most recent common ancestor (MRCA), and to test the different hypotheses of evolutionary mechanism. There are limitations on directly applying existing population genetic approaches to estimate the viral evolutionary history. First, virus samples are often collected by different agencies during the process of infectious outbreak, which may incur spatial and temporal sampling biases. Second, transmission of infectious diseases is commonly seen to happen in infection clusters or outbreak clusters, i.e., a sudden burst of infected cases in the same place around the same time. One example of an infection cluster is the COVID-19 outbreak in the Diamond Princess cruise [9] . Both sampling bias and presence of infection clusters cause genomic polymorphism patterns similar to those generated by evolutionary effects, such as natural selection [10, 11] . Most population genetic methods are under the assumption that samples are collected uniformly and randomly from one or multiple populations. A direct application of existing population genetic approaches without taking into account of sampling bias and presence of infection clusters could lead to biased parameter estimations of s and data misinterpretations. In this study, we illustrate extensive sampling biases in the SARS-CoV-2 genomic sequence data. We further investigate two other prominent polymorphism patterns found in the genomic data: a highly homozygous haplotype group or an over-sized node in the haplotype network graph, and a substantial allele frequency difference between two spatial and temporal samples (e.g., the Wuhan and the non-Wuhan samples). Such data patterns are widely considered in population genetic studies as evidence of natural selection [10, 11] . Nevertheless, we propose that such patterns in the SARS-CoV-2 genomic data should result from sampling bias and presence of infection clusters during epidemic and pandemic spreads. To reduce the estimation bias, we present a robust estimator of the mutation rate and the time to MRCA (TMRCA), which disentangles the effect of sampling bias and presence of infection clusters. The performance of our proposed estimator is compared with the results from a Bayesian evolutionary analysis package, BEAST [12] , via simulation studies. We subsequently apply the method to analyze 12,909 genomic sequences collected before May 4, 2020. The dataset used for this study includes 12 A phylogeny of the 756 SARS-CoV-2 sequences is constructed using neighbor-joining approach (Figure 1) . Two lineage clades, named S and L lineages, were identified in previous studies [13] . colored in red in Figures 1 and 2 ). Except the branch splitting the S and L lineages discussed above, many parts of the SARS-CoV-2 genealogy are highly uncertain with low support due to lack of mutations. The genomic sequences of SARS-CoV-2 were collected by multiple institutes or medical agencies at different time points of the pandemic (Figure 1 and Figure 3 ). Such sampling is not compliant with the population genetic assumptions on sequence samples simultaneously and randomly collected from the contemporary populations. There are at least three sources of non-ignorable sampling biases in the samples collected before March 1, 2020, which are detailed below. Third, before Dec 24, 2019, no sequence was collected in the databases during this early stage. The first identified patient as far as we know was reported on Dec 01, 2019 by Huang and colleagues [15] , without presenting the virus genomic sequences. The genealogical relationship of the 756 genomic sequences is illustrated by a haplotype network ( Figure 2 ). Notably, we observe multiple large nodes in the We evaluate the performance of VirusMuT with simulated data, by checking its robustness to the sampling bias and presence of infection clusters, and compare the performance of VirusMuT with that of the commonly used method, BEAST [12] . In Furthermore, in simulation 3, one filtering step prior to analysis was adopted by removing identical sequences from the infection cluster. As we can see from Figure S1 , both BEAST and VirusMuT showed deviation from the true values in the presence of infection clusters ( Figure S2C and 2D) . However, the filtering step that down-weights sequences from the infection clusters can correct the bias in some degree ( Figure S2E and 2F). VirusMuT overall performs better, and provides nearly unbiased inference of TMRCA and mutation rate for all three simulations, indicating its robustness to sampling bias and presence of infection clusters. We apply VirusMuT to 12,909 genomic sequences collected before May 4, 2020 ( [15] ). The estimated TMRCA and the associated 95% CIs are both earlier than the outbreak time of COVID-19 in WSM. It is likely that the L and S lineages have been existing for quite a while before the outbreak. Another important question on virus evolutionary dynamics is to identify whether the virus genomes are under rapid adaptive evolution with accumulation of abundant mutantations during its transmission. In population genetic analysis, a haplotype group at a high frequency and relatively low heterozygosity will exhibit an over-sized node in the network graph of a uniformly collected sample, which is often considered as a strong evidence of positive selection on the haplotype. The prominent over-sized H1 node in the haplotype network graph was identified as a signal of higher infectivity of the H1 haplotype by multiple studies [17, 18] . However, as we discussed above, over-sized nodes in the haplotype network are patterns commonly observed during virus spread due to the presence of infection clusters or severe sampling bias. Infection clusters occur mostly in places like transport hubs, nursing facilities, and schools, and usually unrelated to the infectivity or pathogenicity differences among virus lineages [19] . The other informative statistic often used to detect selection is the allele frequency difference between samples from different places or stages. We observe remarkable changes in the prevalence of L lineage during the COVID-19 pandemic ( Figure S3A ). For example, 35 out of 38 specimen collected before Jan 10, 2020 were from Wuhan, Sequence alignment, quality control, and population genetics summary including the number of haplotypes, gene diversity, nucleotide diversity, Tajima's D [22] , and Fu's Fs [23] , were calculated using Arlequin v3.5 [24] . Neighbor-joining phylogenetic tree of the 756 genome sequences was constructed using MEGA 10.1.8 with default arguments [25] . Phylogenetic relationships and mutations occurring among unique genomes were further inspected from 756 genomes through median-joining networks [26] using the Network 10 We used forward simulations to test the performance of BEAST v. 1.10.4 [12] and VirusMuT on estimating TMRCA and mutation rate. The genomic length was chosen to be 30,000 bp. The generation time is five days. The mutation rate is 0.001 per year per nucleotide, that is, a mean number of 0.4110 mutations occur on the genome per generation. The reproductive number (R) is 1.7. We used the Wight-Fisher model to simulate forward in time and assumed no recombination among virus strains. In each generation, the number of decedents of every virus strain was generated from a Poisson distribution with the mean R value of 1.7, and the number of mutation was also generated from a Poisson distribution with the mean value of 0.4110. Three simulation data sets were generated for testing the methods. In simulation 1, we simulated the transmissions of the virus strains for 20 generations, and randomly collected time-series samples with the total size of 300 from generations 13 to 20. The sub-sample sizes of generations 13 to 20 were set to be 10, 10, 20, 20, 40, 40, 80 , and 80, respectively. The sub-sample sizes increasing with generations in the simulations is to mimic the real data sets, of which more sequences were collected over time. In simulation 2, in addition to the same procedures performed in simulation 1, we randomly chose one strain during generation 10 as the founder genome, and simulated an additional "infection cluster" population from generations 10 to 20 using the same parameter settings. We then collect additional 200 sequences from the cluster (50 samples from generations 18 and 19, and 100 samples from generation 20). The final data set includes 300 sequences from simulation 1 and 200 sequences from the "infection cluster" population. The procedures of simulation 3 is identical to simulation 2, except that we include an additional filtering step to remove multiple sequences of the same sampling date in the samples from "infection cluster" population. All the three simulations were repeated for 100 times. BEAST [12] and VirusMuT were applied to the simulated sequences. The inferred TMRCA and mutation rates were presented as boxplots in Figure S2 . Table 1 Inferred TMRCA and The dash line indicates the sample size on Mar 1, 2020. Table S1 Summary of SARS-CoV-2 sequences collected before Mar 1, 2020 A new coronavirus associated with human respiratory disease in China A pneumonia outbreak associated with a new coronavirus of probable bat origin Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic Reconstructing the initial global spread of a human influenza pandemic: a Bayesian spatial-temporal model for the global spread of H1N1pdm Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Transmission and evolution of the Middle East respiratory syndrome coronavirus in Saudi Arabia: a descriptive genomic study Rates of evolutionary change in viruses: patterns and determinants Evolutionary history and phylogeography of human viruses Clinical characteristics of COVID-19 in 104 people with SARS-CoV-2 infection on the Diamond Princess cruise ship: a retrospective analysis Detecting natural selection in genomic data Recent human adaptation: genomic approaches, interpretation and insights Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10 On the origin and continuing evolution of SARS-CoV-2 Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China Inference of viral evolutionary rates from molecular sequences Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2 / HCoV-19) using whole genomic data Viral and host factors related to the clinical outcome of COVID-19 A dictionary of epidemiology Population genetic studies in the genomic sequencing era MUSCLE: multiple sequence alignment with high accuracy and high throughput The effect of change in population size on DNA polymorphism Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows MEGA X: molecular evolutionary genetics analysis across computing platforms Median-joining networks for inferring intraspecific phylogenies HC and YX designed the study, QL and Shilei Zhao developed the method, QL, Shilei Zhao, CMS, SS, Sihui Zhu, and YS performed the analysis. HC and YX wrote the manuscript. WZ, ML, and YB provided genome sequences and sample information.All authors were involved in discussion, read and approved the final manuscript. The authors have declared no competing interests.