key: cord-0818301-fnj57b0l authors: Fang, Bin; Liu, Linlin; Yu, Xiao; Li, Xiang; Ye, Guojun; Xu, Juan; Zhang, Ling; Zhan, Faxian; Liu, Guiming; Pan, Tao; Shu, Yilin; Jiang, Yongzhong title: Genome-wide data inferring the evolution and population demography of the novel pneumonia coronavirus (SARS-CoV-2) date: 2020-03-07 journal: bioRxiv DOI: 10.1101/2020.03.04.976662 sha: dfb2f07149617b0c2911899ac61563e808e2cee1 doc_id: 818301 cord_uid: fnj57b0l Since December 2019, coronavirus disease 2019 (COVID-19) emerged in Wuhan, Central China and rapidly spread throughout China. Up to March 3, 2020, SARS-CoV-2 has infected more than 89,000 people in China and other 66 countries across six continents. In this study, we used 10 new sequenced genomes of SARS-CoV-2 and combined 136 genomes from GISAID database to investigate the genetic variation and population demography through different analysis approaches (e.g. Network, EBSP, Mismatch, and neutrality tests). The results showed that 80 haplotypes had 183 substitution sites, including 27 parsimony-informative and 156 singletons. Sliding window analyses of genetic diversity suggested a certain mutations abundance in the genomes of SARS-CoV-2, which may be explaining the existing widespread and high adaptation of the deadly virus. Phylogenetic analysis showed that the view, pangolin acted as an intermediate host, may be controversial. The network indicated that, in the original haplotype (H14), one patient sample lived near the Huanan seafood market (approximate 2 km), indicating high possibility of the patient having a history of unconscious contact with this market. However, based on this clue, we cannot accurately concluded that whether this market was the origin center of SARS-CoV-2. Additionally, 16 genomes, collected from this market, assigned to 10 haplotypes, indicated a circulating infection within the market in a short term and then leading to the outbreak of SARS-CoV-2 in Wuhan and other areas. The EBSP results showed that the first estimated expansion date of SARS-CoV-2 began from 7 December 2019, which may indicated that the transmission could have begun from person to person in mid to late November. On this analysis technique, to precisely decode the evolutionary history of 1 5 9 SARS-CoV-2, the genome EPI ISL 402131 (bat-RaTG13-CoV) from GISAID was 1 6 0 also included as the outgroup following the previous study [25] , because it is the 1 6 1 closest sister betacoronavirus to SARS-CoV-2.3. The 136 complete genome 1 6 2 sequences of SARS-CoV-2 and an outgroup (bat-RaTG13-CoV) were aligned using 1 6 3 MAFFT then the alignment was manually checked using Geneious. For probing the 1 6 4 haplotypes relationships among localities, the Network v.4.6.1 (Fluxus Technology, 1 6 5 Suffolk, UK) was used to construct the minimum-spanning network based on the full 1 6 6 median-joining algorithm [26] . During the alignment, 10 genomes containing 1 6 7 ambiguous sites or "N" bases or more degenerated bases were excluded in this study 1 6 8 (Table S1 ). In addition, 4 genome sequences (EPI ISL 409067, EPI ISL 412981, EPI 1 6 9 ISL 407071 and EPI ISL 408489) with one degenerated base were included, which 1 7 0 were divided into two sequences without degenerated base. For ensuring the accuracy 1 7 1 of this analysis, the 5'UTR and 3'UTR contain missing and ambiguous sites of both 1 7 2 regions were excluded in the following alignment analyses. DnaSP v.5.10 [27] was 1 7 3 used to convert the relevant format. Population genetic indices in was estimated in 1 7 4 DnaSP, including nucleotide diversity (π) and haplotype diversity (Hd). F ST between 1 7 5 haplotypes was calculated in Mega 6.0 based on haplotype frequency differences with 1 7 6 10,000 permutations [28] . Additionally, a sliding window of 500 bp and steps of 50 1 7 7 bp were used to show nucleotide diversity (π) for the entire alignment this data. Nucleotide diversity for the entire alignment was plotted against midpoint positions of 1 7 9 author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint each window. To indicate the relative position of the mutations in the genome, we 1 8 0 selected the EPI ISL 402124 sequence as the reference genome. The alignment was then imported into DnaSP for haplotype analyses. Population BEAST using a Bayesian Markov chain Monte Carlo (MCMC) method with a strict 1 9 5 clock. In this study, the substitution rate was set as 0.92×10 -3 (95% CI, 1 9 6 0.33×10 -3 -1.46×10 -3 ) substitution/site/year based on the most recent estimation for 1 9 7 SARS-CoV-2 [25]. The other parameters were set as follows: extended Bayesian 1 9 8 skyline process, 10 million MCMC generations, sampling every 1,000 th iteration, the 1 9 9 initial 25% burn-in. Tracer was used to check the convergence of the MCMC analyses 2 0 0 (effective sample size [ESS] values >200). Convergence of the two independent 2 0 1 MCMC runs was assessed in Tracer, as was convergence of model parameter values 2 0 2 (ESS) to ensure ESS values >200. 2 0 3 author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint 2 4 2 ( Fig. 3 and S1 ). Based on this finding, , the study suggested that pangolin may not be 2 4 3 the intermediate host, as had been reiterated by other previous study [33] . As the 2 4 4 Huanan seafood market was closed on 1 January 2020, this created many challenges Therefore, more sampling and analysis of the sample from this area may suggest 2 4 7 clearer results for more concrete conclusions in future studies. The evolutionary network of 80 haplotypes of SARS-CoV-2, with bat-RaTG13-CoV 2 5 0 as the outgroup, is shown in Figure 4 . The network analysis showed typical star 2 5 1 network, with several core haplotype node (H3, H14, H15) and edge haplotype nodes. In the network, fifty-five satellite haplotypes and H14 connected to the H3 haplotype; 2 5 3 author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint eighteen satellite haplotypes and H15+H3 connected to the H14 haplotype; and five 2 5 4 satellite haplotypes and H14 connected to H15. Most of these haplogroups were 2 5 5 separated by one to six mutations, except two haplotypes (H19 and H80, Fig. 4) . The 2 5 6 evolutionary network showed that bat-RaTG13-CoV to be connected through a 2 5 7 hypothesized haplotype (mv6) to the H15 and H31 haplotypes by single mutations 2 5 8 (Fig. 4) . However, the mutations between mv6 and bat-RaTG13-CoV was more than 2 5 9 1,000, which indicated that SARS-CoV-2 still has a relative distant kinship with the 2 6 0 outgroup (bat-RaTG13-CoV). As the most abundant haplotype, H3 included 28 samples, while 55 satellite 2 6 2 haplotypes are directly derived from the H3 haplotype ( Fig. 4 and Table S1 ). Moreover, many haplotypes from other countries should also be derived from the H3 2 6 4 haplotype. The current samplings showed that the H3 haplotype has been found to be 2 6 5 in 28 samples, but 12 of 28 samples were collected from Wuhan in Hubei Province. Fifteen of the 55 satellite haplotypes were also collected from Hubei Province (Fig. 4 ). 2 6 7 One possible explanation was that a common haplotype from the Huanan seafood 2 6 8 market (Fig. 4) was rapidly circulated at an early stage of human-to-human 2 6 9 transmissions. It is worth noting that in the H14, there are two samples (EPI ISL 2 7 0 406801 and EPI ISL 412979) from Wuhan. Although these two hosts didn't directly 2 7 1 link with the Huanan market, one of the host (EPI ISL 412979) having lived in a 2 7 2 residential area about 2 kilometers from the Huanan seafood market. This also 2 7 3 provides the possibility of them having indirect possible contact with the Huanan 2 7 4 seafood market. Additionally, the networks of 24 th December, 2019 -30 th December, 2019 and 2 7 6 24 th December, 2019 -6 th January, 2020 also suggested that the most frequent 2 7 7 haplotype (H3) occurred. All the haplotypes were collected from Wuhan, which 2 7 8 author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint indicated that Wuhan may have acted as the important origin center (Fig. 4) . On the 2 7 9 other hand, in this study, 16 sequences of Huanan seafood market (4 new sequenced 2 8 0 and 12 GISAID data) were collected. Fifteen of out of the sixteen samples were 2 8 1 collected before 1 st Januray, 2020 and one samples was collected on 8 th Januray, 2020. 2 8 2 These samples were represented ten haplotypes (H2, H3, H4, H5, H6, H7, H8, H10, 2 8 3 H11, H16), which indicated the high proportion of haplotype diversity in Huanan 2 8 4 seafood market. Among those haplotypes, the H3 haplotype, the most abundant, was 2 8 5 present in 6 samples from Huanan seafood market, while the other haplotypes were 2 8 6 directly derived from the H3 haplotype. All the samples from the Huanan seafood 2 8 7 market had the H3 haplotype and other 9 derived haplotypes (Fig. 4) No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the (Table S3) , which also supported 3 3 6 the EBSP results. In addition, with the change of time, multiple evidence (i.e. 3 3 7 haplotype number, Tajima's D, Fu's FS) indicated a relatively highest population 3 3 8 growth at around 21 st -27 th of January (Table S3 ). The demographic expansion of 3 3 9 SARS-CoV-2 during this period was consistent with the patients first diagnosed on 8 th 3 4 0 December, 2019 [4, 19] and the previously suggested most recent common ancestor 3 4 1 (TMRCA) dates for SARS-CoV-2 at 6 th December, 2019) [17] . Additionally, 3 4 2 considering that the virus has longest reported incubation period of up to 24 days [6], 3 4 3 the virus may had first infected humans in mid to late November, which is basically 3 4 4 consistent with the results of previous studies [25] . Therefore, in this study, EBSP 3 4 5 revealed that SARS-CoV-2 experienced an effective population size expansion since 3 4 6 7 th December 2019, which was also supported by a star-like network, EBSP, the 3 4 7 neutral tests (Fu's and Tajima's D test) and mismatch analysis. Yilin Shu, Tao Pan, and Yongzhong Jiang conceived the research, analyzed the data, We declare no competing interests. 3 5 7 Acknowledgements 3 5 8We thank scientists and researchers for depositing the whole genomic sequences of 3 5 9Novel Pneumonia Coronavirus (SARS-CoV-2) in the Global Initiative on Sharing All 3 6 0 Influenza Data (GISAID) EpiFlu™; and the GISAID database for allowing us access 3 6 1 to sequence for non-commercial scientific research. We also are grateful to Dr. and identification of traditional Chinese medicine as potential origin of 3 8 9 zoonotic coronaviruses. Lett Appl Microbiol. 2020; DOI: 10.1111/lam.13285. 3 9 0 author/funder. All rights reserved. No reuse allowed without permission.The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint the novel pneumonia coronavirus (SARS-CoV-2) using whole genomic data. BEAUti and the BEAST 1.7. Mol Biol Evol. 2012; 29(8):1969 -1973 The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint values. 5 0 0 5 0 1 author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.03.04.976662 doi: bioRxiv preprint