key: cord-0827695-80ydp0bu
authors: Zhou, Zhong-Yin; Liu, Hang; Zhang, Yue-Dong; Wu, Yin-Qiao; Peng, Min-Sheng; Li, Aimin; Irwin, David M.; Li, Haipeng; Lu, Jian; Bao, Yiming; Lu, Xuemei; Liu, Di; Zhang, Ya-Ping
title: Worldwide tracing of mutations and the evolutionary dynamics of SARS-CoV-2
date: 2020-08-10
journal: bioRxiv
DOI: 10.1101/2020.08.07.242263
sha: c2b3c31392e800a22960d10f3d74421b498d8f8f
doc_id: 827695
cord_uid: 80ydp0bu

Understanding the mutational and evolutionary dynamics of SARS-CoV-2 is essential for treating COVID-19 and for developing a vaccine. Here, we analyzed 15,818 publicly available assembled SARS-CoV-2 genome sequences, along with 2,350 raw sequence datasets sampled worldwide. We investigated the distribution of inter-host single nucleotide polymorphisms (inter-host SNPs) and intra-host single nucleotide variations (iSNVs). Mutations have been observed at 35.6% (10,649/29,903) of the bases in the genome. The substitution rate in some protein-coding regions is higher than the average for SARS-CoV-2, and the high substitution rate in some regions might be driven by diversifying selection to escape immune recognition. Both recurrent mutations and human-to-human transmission are mechanisms that generate fitness advantageous mutations. Furthermore, the frequency of three mutations (S protein, F400L; ORF3a protein, T164I; and ORF1a protein, Q6383H) has gradually increased over time on lineages, which provides new clues for the early detection of fitness advantageous mutations. Our study provides theoretical support for vaccine development and the optimization of treatment for COVID-19. We call on researchers to submit raw sequence data to public databases.

This disease has spread worldwide and, as of June 26, 2020, has infected more than nine million people 1. The whole-genome sequence of SARS-CoV-2 is 96.2% similar to that of a bat SARS-related coronavirus (RaTG13) and 79% similar to that of human SARS-CoV 2.
Recent studies have indicated that SARS-CoV-2 is more easily transmitted from person to person than SARS-CoV 3,4. SARS-CoV-2 has also gradually accumulated new mutations that may make it better suited to the human host 5,6.

With the development of next-generation sequencing, it has become possible to conduct large-scale studies that detect inter-host single nucleotide polymorphisms (inter-host SNPs) and intra-host single nucleotide variations (iSNVs) in infectious agents such as Ebola virus, which can uncover essential information concerning their transmission and evolution 7,8. Thus, tracing inter-host SNPs and iSNVs, and revealing the evolutionary dynamics of SARS-CoV-2 worldwide, is a priority.

To trace mutations in SARS-CoV-2, we first identified similarities and differences in SNPs from assembled genome sequences and publicly available raw reads. We collected 15,818 high-quality SARS-CoV-2 genome sequences from the publicly accessible GISAID, CNGBdb, GenBank, GWH 9 and NMDC databases on May 27, 2020. All of these sequences were aligned against the reference genome (Wuhan-Hu-1, GenBank NC_045512.2) 10 using MUSCLE 11, revealing 7,700 inter-host SNP sites (4,672 nonsynonymous, 2,570 synonymous, 98 stop gain, 7 stop loss, and 433 noncoding) (Extended Data Table 1). Publicly available raw reads were analyzed, the number of iSNV is much underestimated.

We then investigated the distribution of inter-host SNPs and iSNVs in each open reading frame (ORF) and found that inter-host SNPs and iSNVs show similar trends, both in raw SNP numbers and in SNP numbers normalized by ORF length (Fig. 1, a-d). We then focused on the S-protein receptor binding domain (RBD), as it is being used for monoclonal antibody isolation and vaccine design 13,14. Several mutations were found in this region, including 68 inter-host nonsynonymous and 65 intra-host nonsynonymous mutations (Fig. 1, a and c).
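The per-ORF comparison described above (raw counts versus counts normalized by length) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The ORF coordinates are a partial list assumed to match the NC_045512.2 annotation and should be verified against GenBank, the variant positions are invented toy values, and the function name `snp_density_per_orf` is hypothetical.

```python
from bisect import bisect_left, bisect_right

# Partial ORF coordinates (1-based, inclusive). Assumed to match the
# NC_045512.2 annotation; verify against GenBank before any real use.
ORFS = {
    "ORF1ab": (266, 21555),
    "S": (21563, 25384),
    "N": (28274, 29533),
}


def snp_density_per_orf(positions, orfs=ORFS):
    """Return {orf: (site_count, sites_per_kb)} for distinct variant positions."""
    sites = sorted(set(positions))  # a site is counted once, however often it recurs
    result = {}
    for name, (start, end) in orfs.items():
        # Number of distinct sites with start <= position <= end.
        count = bisect_right(sites, end) - bisect_left(sites, start)
        length = end - start + 1
        result[name] = (count, 1000.0 * count / length)
    return result


if __name__ == "__main__":
    # Toy positions for illustration only; these are not data from the study.
    toy_positions = [300, 5000, 15000, 21563, 23000, 25384, 28500, 28500, 29700]
    for orf, (count, per_kb) in snp_density_per_orf(toy_positions).items():
        print(f"{orf}: {count} sites, {per_kb:.3f} per kb")
```

Normalizing by length matters because ORF1ab alone covers more than two thirds of the genome, so raw counts would overstate its variability relative to shorter ORFs.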
An excess of C-to-T, A-to-G and T-to-C mutations was observed in both the inter-host SNPs and iSNVs, which may be the

Table 4). When the sequences were tested for evidence of recombination using the

The frequency of iSNVs can also be used to reveal selection signals 8,19. We

Table 4; genome position: 24,300-27,800). Diversifying selection on these two regions might also be driven by intra-host escape from antibody recognition.

To study the mechanisms for the generation and transmission of fitness advantage mutations, we first investigated the number of iSNVs over time and noticed that the number of iSNVs was generally steady with few fluctuations (Fig. 3a). Thus, we concluded that there was no increase in diversity within intra-host genomes over

Population genetics theory predicts that an allele with a fitness advantage will increase in frequency faster than alleles without a fitness advantage. Thus, if a mutation has a selective advantage, its frequency will gradually increase over time in a transmission chain (Fig. 4a). To detect possible fitness

with isolate sampling times for the above iSNVs for each lineage. We found two shared iSNVs (mutation type) in the red cluster and two shared iSNVs in the black cluster that had gradual frequency increases over time (Fig. 4c and Extended Data

are not yet highly prevalent in COVID-19 patients. We urge researchers to focus on these types of mutations and to upload raw data of SARS-CoV-2 genomes to public databases for the monitoring of iSNV frequencies.

In summary, we found that, as of May 27, at least 35.6% of the bases of the SARS-CoV-2 genome harbor one or more mutations and that the evolution of mutations in the spike protein gene might have been driven by diversifying selection that increases diversity to escape immune system recognition.
Furthermore, we found that recurring mutations and human-to-human transmission are the reasons for the prevalence of the fitness advantage mutations in SARS-CoV-2. Through an analysis of shared iSNVs in SARS-CoV-2 genomes from patients in Shanghai, we found three alleles that might be fitness advantage mutations for SARS-CoV-2. This study provides theoretical support for vaccine development and optimizing treatment. We also strongly recommend that researchers submit raw sequence data and epidemiological information of SARS-CoV-2 isolates to public databases, which will improve the detection of fitness advantage mutations of SARS-CoV-2.

To prevent a mapping bias caused by polymorphisms, the iSNVs in the SARS-CoV-2 reference genome were replaced with 'N' for each sample. We used BWA mem 0.7.17 with default parameters to re-map the reads 23. Following the haplotype network and phylogenetic tree in the Shanghai samples,

1. WHO Coronavirus disease (COVID-19) Situation Report
2. A pneumonia outbreak associated with a new coronavirus of probable bat origin
3. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
4. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
5. Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
6. On the origin and continuing evolution of SARS-CoV-2. National Science Review
7. Ebola Virus Epidemiology, Transmission, and Evolution during
8. Intra-host dynamics of Ebola virus during 2014
9. The 2019 novel coronavirus resource
10. A new coronavirus associated with human respiratory disease in
11. MUSCLE: multiple sequence alignment with high accuracy and high throughput
12. Viral and host factors related to the clinical outcome of COVID-19
13. A noncompeting pair of human neutralizing antibodies block COVID-19 virus binding to its receptor ACE2. Science
14. A Universal Design of Betacoronavirus Vaccines against COVID-19
15. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution 4, vey016
16. No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2. bioRxiv
17. RDP4: Detection and analysis of recombination patterns in virus genomes
18. SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data
19. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants
20. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
21. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
22. Trimmomatic: a flexible trimmer for Illumina sequence data
23. Fast and accurate short read alignment with Burrows-Wheeler transform
24. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data
25. A framework for variation discovery and genotyping using next-generation DNA sequencing data
26. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
27. Genetic diversity and evolutionary dynamics of Ebola virus in

This work was supported by the Chinese Academy of Sciences Science Foundation of China (91331101), and the National Key Research and Development Project (No. 2020YFC0847000)

W drew the figures. Y.B. provided the SARS-CoV-2 sequence alignment

The authors declare that they have no competing interests.