key: cord-0906981-tnfgbl05 authors: YANG, Xuemei; Dong, Ning; CHAN, Wai-Chi; CHEN, Sheng title: Identification of super-transmitters of SARS-CoV-2 date: 2020-04-22 journal: nan DOI: 10.1101/2020.04.19.20071399 sha: 1e62a454b436adc90ea12c5e186a4252f9787a83 doc_id: 906981 cord_uid: tnfgbl05
A newly emerged coronavirus, SARS-CoV-2, caused severe outbreaks of pneumonia in China in December 2019 and has since spread to various countries around the world. To probe the origin and transmission dynamics of this virus, we performed phylodynamic analysis of 247 high-quality genomic sequences of viruses available in the GISAID platform as of March 05, 2020. A substantial number of the earliest sequences reported in Wuhan in December 2019, including those of viruses recovered from the Huanan Seafood Market (HNSM), the site of the initial outbreak, were genetically diverse, suggesting that viruses from multiple sources were involved in the original outbreak. The viruses were subsequently disseminated to different parts of China and other countries, with diverse mutational profiles being recorded in strains recovered subsequently. Interestingly, four genetic clusters, defined as super-transmitters (STs), became dominant and were responsible for the major outbreaks in various countries. Among the four clusters, ST1 was widely disseminated in Asia and the US and was mainly responsible for outbreaks in the states of Washington and California in the US, as well as those in South Korea at the end of February and early March, whereas ST4 contributed to the pandemic in Europe. Each ST cluster carried a signature mutation profile, which allowed us to trace the origin and transmission patterns of specific viruses in different parts of the world. Using the signature mutations as markers of STs, we further analysed 1539 genome sequences reported after February 29, 2020.
We found that around 90% of these genomes belonged to STs, with ST4 being the dominant one, and their contributions to the pandemic on different continents are also depicted. The identification of these super-transmitters provides insight into the control of further transmission of SARS-CoV-2.
The genomic characteristics of SARS-CoV-2 have been elucidated using phylogenetic, structural and mutational analyses by scientists across the globe [4]. High-throughput sequencing revealed that SARS-CoV-2 was a novel betacoronavirus which resembled SARS-CoV at around 79.5% sequence identity [5, 6]. A recent study indicated that SARS-CoV-2 was 96% identical to a bat coronavirus RaTG13 (accession: MN996532) at the genomic level, suggesting that bats might be a natural host of SARS-CoV-2 [7]. GISAID is a platform for sharing genetic data of influenza. Currently, a rapidly increasing number of SARS-CoV-2 genomic sequences are being deposited into this database from laboratories around the world [8]. On the other hand, some recent studies also suspected that Malayan pangolins (Manis
Sequence analysis, alignment and mutation identification
A total of 343 full-length SARS-CoV-2 genomes available in the GISAID platform (https://platform.gisaid.org/) as of March 5, 2020 were downloaded [8]. A total of 247 sequences with high sequence quality, as noted in the GISAID database, were included for further analysis after removing sequences that contained little temporal signal and were thus unsuitable for inference using phylogenetic molecular clock models. Information regarding the date and country of isolation was also retrieved from the GISAID platform. The annotated reference genome sequence of the SARS-CoV-2 isolate Wuhan-Hu-1 (accession: NC_045512.2) was downloaded from the NCBI GenBank database. All genomes were annotated by GATU Genome Annotator [11] using the SARS-CoV-2 isolate Wuhan-Hu-1 (NC_045512.2) as reference [12].
Nucleotide and amino acid mutations of all genomes and individual proteins were analyzed with BLAST (https://blast.ncbi.nlm.nih.gov/) using the sequence of strain Wuhan-Hu-1 as reference.
Global genomic surveillance of SARS-CoV-2 was implemented by means of an automated phylogenetic analysis pipeline using Nextstrain, which generates an interactive visualization integrating a phylogeny with sample metadata such as geographic location or host age [13]. The pipeline involved the sequence alignment module with MAFFT [14], phylogenetic analysis with IQ-TREE [15], maximum-likelihood phylodynamic analysis with TreeTime [16],
Phylodynamics analysis of genome sequences of SARS-CoV-2 strains collected worldwide
To trace the evolutionary process and identify the common ancestor of the 247 strains of SARS-CoV-2 collected worldwide, a root-to-tip regression was performed on all SARS-CoV-2 genomes, yielding an R² of 0.23 and suggesting that these 247 viral sequences shared a recent common ancestor (Fig 1a). The date of the most recent common ancestor (tMRCA) of all reported SARS-CoV-2 viruses was 2019-Nov-12, suggesting that this virus emerged recently (Fig 1a). A total of 379 nucleotide mutations were identified among these 247 sequences based on the sequence alignment, among which G11083T (n=5), T3G (n=3), G29864A (n=3), C29870A (n=3), A1T (n=2), A4T (n=2), T4402C (n=2), G5062T (n=2), T18603C (n=2) and G22661T (n=2) were the most homoplasic mutations (Fig 2, (n=3, P34S, Q62* and H73Q), ORF10 (n=2, P10S and I13M) and E protein (n=1, S6L) (Fig 1b). Identification of single amino acid substitutions in SARS-CoV-2 isolates consistently showed that these isolates shared a recent common ancestor but followed diverse evolutionary paths.
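The root-to-tip regression mentioned above can be illustrated with a minimal sketch. This is plain ordinary least squares written for illustration, not the TreeTime implementation used in the study; the function name, the use of decimal years, and the toy values at the bottom are all hypothetical. The slope of the fit estimates the substitution rate, R² measures the strength of the temporal signal, and the x-intercept gives a rough estimate of the root date.

```python
def root_to_tip_regression(sampling_times, divergences):
    """Least-squares fit of root-to-tip divergence against sampling time.

    sampling_times: sampling dates as decimal years (e.g. 2020.05)
    divergences:    root-to-tip distances in substitutions per site
    Returns (slope, r_squared, root_date).
    """
    n = len(sampling_times)
    mean_t = sum(sampling_times) / n
    mean_d = sum(divergences) / n
    # Sums of squares and cross-products around the means
    s_tt = sum((t - mean_t) ** 2 for t in sampling_times)
    s_dd = sum((d - mean_d) ** 2 for d in divergences)
    s_td = sum((t - mean_t) * (d - mean_d)
               for t, d in zip(sampling_times, divergences))
    slope = s_td / s_tt                      # subs/site/year
    intercept = mean_d - slope * mean_t
    r_squared = s_td ** 2 / (s_tt * s_dd)    # squared Pearson correlation
    root_date = -intercept / slope           # time at which divergence is zero
    return slope, r_squared, root_date


# Hypothetical toy values, for illustration only (not data from the study)
toy_times = [2019.98, 2020.02, 2020.06, 2020.10, 2020.14]
toy_divergences = [0.00004, 0.00012, 0.00010, 0.00019, 0.00021]
rate, r2, root = root_to_tip_regression(toy_times, toy_divergences)
print(f"rate={rate:.2e} subs/site/year, R^2={r2:.2f}, root date={root:.2f}")
```

A low R², such as the 0.23 reported here, indicates that sampling time explains only part of the variation in divergence, which is common when sequences are collected over a short period.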
The estimated substitution rate of SARS-CoV-2 was 8.90e-04 subs/site/year, similar to that of other RNA viruses, including SARS-CoV, Ebola virus and Zika virus, which is approximately 1e-3 subs/site/year (http://virological.org/t/phylodynamic-analysis-93-genomes-15-feb-2020/356). Based on this mutation rate, a genome of 29kb of SARS-CoV-2 will
All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
isolates obtained from HNSM. Compared to these six viral genomes, others displayed various mutation profiles which comprised 1 to 6 mutations in the genomes. We therefore set these six (Table S1). could be identified in all offspring (Fig 2). The first cluster contained two mutations, C8782T and T28144C; the second cluster contained the mutation G26144T; the third cluster contained the mutation G11083T; the fourth cluster contained three mutations, C241T, C3037T and A23403G. Tracing the changes in mutation profiles of these viral genomes over time allowed us to visualize the transmission and evolution dynamics of SARS-CoV-2. Since viruses of all of these four clusters exhibited very high potential to undergo global transmission, we define viruses in these four clusters as super-transmitter cluster 1 (ST1), 2 (ST2), 3 (ST3) and 4 (ST4). Table 2). The viruses in ST1 were also found to be able to rapidly mutate along the transmission paths. Three genome sequences that were reported in Australia, Vietnam and the USA on Feb. 28, 24 and Mar.
03, 2020 respectively, were found to harbour a total of 11 mutations. An additional nine mutations were acquired by the parental virus within 50 days (from Jan 05 to Feb. 24, 2020), corresponding to a mutation rate of 2.3e-3 subs/site/year (29kb genome size), which was much higher than the predicted mutation rate of SARS-CoV-2 (4.057e-4 subs/site/year) and of other coronaviruses such as SARS-CoV-1 and MERS virus. Among viral genomes in this cluster, 43 out of 85 genomes exhibited five or more mutations (Table 2). Detailed analysis of the mutation profiles of the genome sequences in ST1 enabled us to trace the evolution routes of these viruses in specific regions. In Washington State, USA, a genome sequence with three mutations, C18060T, C8782T and T28144C, was reported on Jan 25, 2020. A
on Feb. 29, 2020, in which 9, 10 and 11 mutations were identified respectively (Table 3). Our data showed that as many as eight additional mutations were acquired by the parental virus within 30 days (from Jan 28 to Feb. 29, 2020), representing a mutation rate of 3.3e-3 subs/site/year (29kb genome size), which was much higher than the predicted mutation rate of from Italy, and contributed to the current explosive increase in incidence of COVID-19 in Europe (Table 5)
To better understand the temporal and spatial distribution of these super-transmitters, we plotted the variation in the types of genome sequences recovered from different continents against time.
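The per-site rates quoted above for ST1 follow from simple arithmetic: the number of newly acquired mutations divided by the genome size and by the elapsed time in years. The sketch below reproduces that calculation under stated assumptions: a 29,000-nucleotide genome (the "29kb" in the text), a 365-day year, and the day counts as given in the text. The function name is hypothetical.

```python
def mutation_rate(n_mutations, days, genome_size=29_000):
    """Substitutions per site per year for mutations acquired over `days`.

    genome_size defaults to 29,000 nt, matching the 29kb figure in the text.
    A 365-day year is assumed.
    """
    years = days / 365
    return n_mutations / genome_size / years


# Nine additional mutations within 50 days (Jan 05 to Feb. 24, 2020)
rate_50_days = mutation_rate(9, 50)
# Eight additional mutations within 30 days, as stated for Washington State
rate_30_days = mutation_rate(8, 30)

print(f"{rate_50_days:.2e} subs/site/year")
print(f"{rate_30_days:.2e} subs/site/year")
```

Both values fall within rounding distance of the 2.3e-3 and 3.3e-3 subs/site/year reported in the text, and both exceed the predicted rate of 4.057e-4 subs/site/year several-fold.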
The original viruses were transmitted in the first week, before the emergence of these super-transmitters. ST1 was the first cluster to emerge, and its dissemination continued throughout the study period. Other STs emerged at different time points and their transmission also peaked at different dates. Transmission of ST2 and ST3 mainly occurred between mid-January and mid-February. Transmission of ST4 viruses mainly began at the end of February. Viruses of the four clusters exhibited a much higher mutation rate than those that exhibited diverse genetic profiles and could not be allocated to a specific genetic cluster when compared to the original genome (Fig 3a). ST1 viruses were those which were disseminated extensively in China, in particular in the later stage of the outbreak (Fig 3b). ST1, ST2 and ST3 were prevalent in other Asian countries (Fig 3c). All four clusters were involved in the outbreaks in Europe in the early stage, but ST4 was the cluster that eventually transformed the outbreaks in Europe to the pandemic level (Fig 3d). In Oceania, ST1 was involved mainly in the early stage of the outbreak, yet ST2 became dominant at the later stage (Fig 3e). ST1 and other types of viruses were the major transmitters in the US. ST1 was shown to be transmitted mainly in the states of Washington and California, whereas the other types were mainly transmitted in other states (Fig 3f, Table S1).
Upon finishing our manuscript, we checked the available genome sequences in the database and found a rapid increase in the number of sequences.
A total of 1539 genome sequences reported after February 29, 2020 were included for a quick analysis to identify the type of these most recent genomes. As shown in Table 6, most of the genomes were reported from USA (968 / 63%) and Europe (441 / 29%), where the pandemic was the most severe. Genomes were also reported from Africa (20 / 1%) and South America (23 / 2%), regions from which few genomes had been reported before March 01, 2020. Among these genomes, 89% belonged to ST1-4, with ST4 being the most dominant (56%), while the original derivatives accounted for only 11% and were mainly reported in the UK and the Netherlands. In Africa, ST4 (18/20, 90%) was the major type, with some of the cases showing travel history to Europe; in Asia, the major types became ST3 (17/33, 52%) and ST4 (16/33, 48%); in Europe, all types were present, with ST4 being the dominant one (668/968, 69%); all the types were reported in the US, with ST1 (282/441, 62%) and ST4 (137/441, 14%) being dominant; in Canada, all types except for ST1 were present; in Oceania, all four STs were present, with ST3 being dominant; in South America, ST1 and ST4 were the most dominant types (Table 6).
Discussion
We conducted detailed and comprehensive analyses of sequences of SARS-CoV-2 reported from December 2019 to March 05, 2020 and deposited in the GISAID database. The detailed analysis of 247 high-quality genome sequences of SARS-CoV-2 provides insight into the evolution and transmission of this novel virus (Fig 5). The ancestor of SARS-CoV-2 could
dominant, yet by the end of February and early March, these four super-transmitters became
Lastly, our data also provided insight into the major transmitting viruses in current pandemic areas in the world. For example, in Italy, ST2 and ST3 were reported at the end of January, while ST4 was reported in February and early March. Similar trends were seen in other countries, with the exception that a higher proportion of the original viral genomes were reported in the Netherlands. In the US, the original viruses were reported in other states, while ST1 was the major virus that caused outbreaks in the states of Washington and California. Other ST genomes were also sporadically reported in the US. Although data from Iran are not available, two genomes reported from Australia with travel history from Iran were shown to belong to ST3, suggesting that this cluster was responsible for the pandemic in Iran. In Australia, all genome types except ST4 were reported.
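The signature mutations that define the four ST clusters (ST1: C8782T and T28144C; ST2: G26144T; ST3: G11083T; ST4: C241T, C3037T and A23403G) lend themselves to a simple lookup when typing a genome from its mutation list. The sketch below is illustrative only and is not the authors' pipeline. Two points are assumptions: that a genome must carry every signature mutation of a cluster to be assigned to it, and the labels "other" and "ambiguous", which are hypothetical names for genomes matching no cluster or more than one.

```python
# Signature mutations of each super-transmitter cluster, as listed in the text
SIGNATURES = {
    "ST1": {"C8782T", "T28144C"},
    "ST2": {"G26144T"},
    "ST3": {"G11083T"},
    "ST4": {"C241T", "C3037T", "A23403G"},
}


def classify_genome(mutations):
    """Assign a genome to an ST cluster from its mutations relative to Wuhan-Hu-1.

    Assumption: all signature mutations of a cluster must be present.
    Returns "other" when no cluster matches and "ambiguous" when several do.
    """
    observed = set(mutations)
    matches = [st for st, signature in SIGNATURES.items()
               if signature <= observed]  # subset test
    if not matches:
        return "other"
    if len(matches) > 1:
        return "ambiguous"
    return matches[0]


# The Washington State genome reported on Jan 25, 2020 carried these mutations
print(classify_genome(["C18060T", "C8782T", "T28144C"]))
```

Requiring the full signature is the stricter choice; a looser rule that accepts a partial match would assign more genomes but would also be more sensitive to homoplasic sites such as G11083T.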
of each dot represents the region of isolation of the corresponding isolate. Ancestral state reconstruction and branch length timing were performed with TreeTime [16].
(STs) were identified. Each ST was found to exhibit a signature mutation profile.
US but was less prevalent in other parts of the world. ST2 and ST3 were transmitted mainly in Asian countries other than China, as well as in Europe, from mid-January to mid-February. ST4 was transmitted mainly in Europe in the beginning and then spread all over the world.
References
1. Origin and evolution of pathogenic coronaviruses
2. Severe acute respiratory syndrome-related coronavirus-The species and its viruses, a statement of the Coronavirus Study Group
3. Clinical characteristics of 2019 novel coronavirus infection in China
4. Comparative analyses of SAR-CoV2 genomes from different geographical locations and other coronavirus family genomes reveals unique features potentially consequential to host-virus interaction and pathogenesis
5. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet
6. A novel coronavirus from patients with pneumonia in China
7. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
8. GISAID: Global initiative on sharing all influenza data-from vision to reality
9. A pneumonia outbreak associated with a new coronavirus of probable bat origin
10. On the origin and continuing evolution of SARS-CoV-2
11. Genome Annotation Transfer Utility (GATU): rapid annotation of viral genomes using a closely related reference genome
12. A new coronavirus associated with human respiratory disease in China
13. Nextstrain: real-time tracking of pathogen evolution
14. Multiple alignment of DNA sequences with MAFFT. In: Bioinformatics for DNA sequence analysis
15. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
16. TreeTime: Maximum-likelihood phylodynamic analysis. Virus evolution
17. Inkscape: guide to a vector drawing program: prentice hall press
18. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
19. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees