key: cord-1056644-m2z9ki28 authors: Chen, Yan; Li, Shiyong; Wu, Wei; Geng, Shuaipeng; Mao, Mao title: Distinct mutations and lineages of SARS‐CoV‐2 virus in the early phase of COVID‐19 pandemic and subsequent 1‐year global expansion date: 2022-01-18 journal: J Med Virol DOI: 10.1002/jmv.27580 sha: 4f090e5ff9d29b795ac3cb43abd7413f1b9f068c doc_id: 1056644 cord_uid: m2z9ki28 A novel coronavirus, SARS‐CoV‐2, has caused over 274 million cases and over 5.3 million deaths worldwide since it occurred in December 2019 in Wuhan, China. Here we conceptualized the temporospatial evolutionary and expansion dynamics of SARS‐CoV‐2 by taking a series of the cross‐sectional view of viral genomes from early outbreak in January 2020 in Wuhan to the early phase of global ignition in early April, and finally to the subsequent global expansion by late December 2020. Based on the phylogenetic analysis of the early patients in Wuhan, Wuhan/WH04/2020 is supposed to be a more appropriate reference genome of SARS‐CoV‐2, instead of the first sequenced genome Wuhan‐Hu‐1. By scrutinizing the cases from the very early outbreak, we found a viral genotype from the Seafood Market in Wuhan featured with two concurrent mutations (i.e., M type) had become the overwhelmingly dominant genotype (95.3%) of the pandemic 1 year later. By analyzing 4013 SARS‐CoV‐2 genomes from different continents by early April, we were able to interrogate the viral genomic composition dynamics of the initial phase of global ignition over a time span of 14 weeks. Eleven major viral genotypes with unique geographic distributions were also identified. WE1 type, a descendant of M and predominantly witnessed in western Europe, consisted of half of all the cases (50.2%) at the time. The mutations of major genotypes at the same hierarchical level were mutually exclusive, which implies that various genotypes bearing the specific mutations were propagated during human‐to‐human transmission, not by accumulating hot‐spot mutations during the replication of individual viral genomes. As the pandemic was unfolding, we also used the same approach to analyze 261 323 SARS‐CoV‐2 genomes from the world since the outbreak in Wuhan (i.e., including all the publicly available viral genomes) to recapitulate our findings over 1‐year time span. By December 25, 2020, 95.3% of global cases were M type and 93.0% of M‐type cases were WE1. In fact, at present all the five variants of concern (VOC) are the descendants of WE1 type. This study demonstrates that viral genotypes can be utilized as molecular barcodes in combination with epidemiologic data to monitor the spreading routes of the pandemic and evaluate the effectiveness of control measures. Moreover, the dynamics of viral mutational spectrum in the study may help the early identification of new strains in patients to reduce further spread of infection, guide the development of molecular diagnosis and vaccines against COVID‐19, and help assess their accuracy and efficacy in real world at real time. measures. Moreover, the dynamics of viral mutational spectrum in the study may help the early identification of new strains in patients to reduce further spread of infection, guide the development of molecular diagnosis and vaccines against COVID-19, and help assess their accuracy and efficacy in real world at real time. Yunnan province, southwest China. 4, 5 Phylogenetic analysis also demonstrates that SARS-CoV-2 is similar to the two bat-derived SARS-like coronaviruses but distinct from SARS-CoV and MERS-CoV. 3, 5 Thus, it is speculated that bats might be the original host of SARS-CoV-2, and other non-bat mammals such as pangolins might have been the intermediate reservoir. 6 Moreover, the first reported patient cluster of COVID-19 was epidemiologically linked to a seafood wholesale market in Wuhan, China; so the market has been assumed as the origin of the outbreak by representing an intermediate reservoir of SARS-CoV-2. 7 However, epidemiological evidence doubted the market was the birthplace of SARS-CoV-2. 8, 9 During the 2014−2015 Ebola outbreak, full-length EBOV genome sequences from different severely stricken countries/districts in West Africa have helped us to better understand the viral evolution and transmission dynamics of the outbreak. [10] [11] [12] Likewise, genomic studies of SARS-CoV-2 viral sequences may provide key insights into the transmission and evolution dynamics of the ongoing COVID-19 pandemic. In this study, a series of cross-sectional analyses of viral genomes by five critical time points from early outbreak to the subsequent 1-year global expansion of the pandemic was carried out: (1) November 17, 2019 (when the patient zero presumably appeared); (2) January 1, 2020 (when the Wuhan Market was shut down); (3) Partial SARS-CoV-2 genome sequences or gene-level-only sequences were filtered out. Viral genome sequences from nonhuman hosts were also filtered out. Redundant sequences included in both databases, multiple samplings from the same patient, and resubmission of the identical sequences were excluded. Sequences with N for more than 3% of the total nucleotides (except 5′ and 3′ ends) were filtered out. After filtering, all the remaining sequences were mapped to the reference genome by a dual alignment software MAFFT (v7.450) which takes into consideration of both amino acid and nucleotide sequences. As the evolutionary rate of SARS-CoV-2 was estimated to be 27.1 subs per year (see more details in Section 3.4 of Results), and viral genomes by April 7, 2020 had only been evolved for less than half a year since the outbreak in late 2019, they were supposed to harbor mutations far less than 20 (theoretically no more than 14 mutations). Therefore, a cutoff of 20 was chosen to rigorously remove samples with low sequencing quality. After the genome sequences with >20 mismatches to the reference genome were further filtered out, a total of 4013 viral genome sequences (4002 from GISAID and 11 from NGDC) were included in this study for further analysis. Similarly, we retrieved 290 005 FASTA sequences of SARS-CoV-2 genomes from the GISAID database as of December 25, 2020, the second cutoff point of this study. Noteworthily, viral genomes by December 25, 2020 were supposed to have acquired more than 27.1 mutations since they had been evolved for over a year, and a stringent cutoff of 45 was used to discard samples with low sequencing quality. Therefore, by following the same filtering process as applied in the first cutoff point (except the genome sequences with >45 mismatches to the reference genome filtered out here), a total of 261 323 viral genome sequences were included in this study for further analysis. A previously published software named Augur was used to perform sequence data filtering and multiple sequence alignment. 13 Specifically, for each viral genome sequence, the first 150 bases at 5′ end and 80 bases at the 3′ end were omitted, and the ambiguous bases were ignored. Next, multiple sequence alignment with reference genome sequence MN908947.3 was performed for each genome. If a particular base of a genome was different from the one of reference genome at the same genomic position, it was defined as a mutation. After mutation detection, the mutation profiles of each sample were first converted into a matrix of 0 or 1 at each genomic position, where 1 meant a mutation was detected. Mutation matrices for all samples were used to perform the hierarchical clustering analysis via the Pheatmap (v1.0.12) package of R. This method was chosen over phylogenetics because the latter was heavily dependent on reliable rooting, which is challenging for SARS-CoV-2 given a relatively short time period of evolution. On the contrary, the principle of hierarchical clustering analysis is to put together samples with identical mutations to minimize the total branch length, and therefore it is hardly interfered with by rooting and is very straightforward for analyzing the genomes of etiologic pathogens from a new outbreak like COVID-19. After hierarchical clustering analysis, a sanity check for each of the major clusters was performed by extracting the decisive concomitant mutations and masking the nondecisive random mutations. Finally, 11 major genotypes were selected from clustering analysis and defined in the Pedigree chart ( Figure 4B ). Samples with mutation profiles matching to any of the 11 defined genotypes were classified into the corresponding types, whereas samples with mutation profiles not fitting into any defined genotypes were assigned as Others. Samples with no mutations under the aforementioned mutation calling methods were defined as ancestral type. Given limited sampling of viruses from the Market, we acknowledge that samples with concurrent T8782 and C28144 genotypes from the Market might have been underrepresented. However, we are confident that a significant portion of samples from the Market was derived from an ancestral genotype, generating a distinctive genotype defined by two concurrent mutations, which we named as M type (T8782C/C28144T) hereinafter. Based on the early samples available in the study, M type might have represented an overwhelming majority of all COVID-19 samples since the initial phase of the global pandemic although it should be drawn with caution that other viral types might have been underrepresented due to the potential sampling bias ( Figure S1 ). All the 16 samples collected before January 1, 2020 have the M-type mutations that coincide with the fact that market contact history was one of the diagnostic criteria of COVID-19 at the period of time (Table S1 ). 9 3.2 | Viral genotypic composition of early cases reported from Wuhan were already diversified Early cases reported from Wuhan were extremely critical to answer how the outbreak took place at the very beginning. In this study, we were able to collate 34 viral genomes sampled from Wuhan between 3.4 | The mutation spectrum and dynamics of SARS-CoV-2 genome in the early phase of COVID-19 As described above, sequencing errors were very unlikely to confound the mutation analyses of 4013 genomes. A total of 2954 unique nucleotide substitutions were identified from the 4013 SARS-CoV-2 genomes with relatively even distribution across the viral genome (Table S1) There were 17 mutations that occurred in more than 10% samples ( Figure 3A and Table S1 ). Like M-type T8782C/C28144T mutations, concurrent mutations were also observed from the other 15 most frequent single nucleotide mutations. A symmetric matrix plot by clustering analysis was generated from the 17 most frequent mutations to highlight the most common concurrent mutations ( Figure 3B ). T8782C/C28144T were concurrent in 81% samples, Nine hundred and fifty-two mutations (32.2%) spread at least once as they were detected in more than one patient sample, and distinct genotypes can be characterized based on the prevalence of mutations (Table S1) To trace the temporospatial transmission and regional expansion of the pandemic, we conducted mutation-based unsupervised clustering of all the samples. As shown in Figure 4A omes. This also reflects the high quality of sequencing data applied in the study (after filtering out low-quality sequence data) and the randomness of the mutations occurring across the viral genome. By taking into account all of the well-defined 11 major genotypes from Levels 1 and 2 in the study, we developed an algorithm, SOO, to match a particular SARS-CoV-2 viral genome to the known genotypes based on its mutation profile. The concordance of SOO was estimated in comparison with mutation clustering by assigning each of the 4013 samples included in the study to the corresponding genotype ( Figure 5 ). The overall concordance of genotypes assigned by SOO with those assigned by mutation clustering was 89.8%. Within Level 1 genotypes, the concordance ranged from 84.9% to 100.0% with an overall concordance of 86.5%. All the Level 2 genotypes represented major subtypes of M type and the overall concordance with clustering results at this level was 90.5%. The most abundant genotype at Level 2, WE1 showed 93.4% concordance. Thus, SOO represents a more accurate approach to define genotypes as it only takes into consideration the specific mutations of the particular genotypes with little influence from the rest random mutations. (Table S4 ). The three most prevalent clades GR (n = 1726, 33.6%), G (n = 1252, 24.4%) and GH (n = 977, 19.0%) were descendent from WE1 (n = 3955, 77.0%), with GR referred to WE1.1, GH referred to WE1.2, and G referred to WE1 others by SOO classification. Moreover, V (n = 122, 2.4%) was referred to SG/WE2 by SOO. Two other clades cannot be directly referred to any SOO genotypes although L can be vaguely inferred as ancestral type and others, S be inferred as a mixture of non-M types including SEA, ES, AU2, and GD. On the other hand, the O (n = 500, 9.7%) clade cannot be equivocally inferred as any SOO genotypes. It is plausible since it was not presented as a unique branch as other clades but scattered all over other branches of the phylogeny, implying it was not a welldefined unique clade. We analyzed all the available SARS-CoV-2 viral genomes in the GISAID database as of December 25, 2020, the second cutoff date of the study. A total of 10 392 unique nucleotide substitutions were identified from the 261 323 SARS-CoV-2 genomes (Table S5) , which indicates roughly one out of three nucleotides in the viral genomes has mutated during the 12-month time span of the viral evolution. A pedigree chart of the 100 most abundant mutations was generated to highlight the lineages of the most common concurrent mutations during the 12-month time window of the unfolding pandemic ( Figure 6 and Table S6 ). A very tiny proportion (92 genomes, less than 0.04%) of viral genomes were ancestral type. Fifty-nine (64.1%) of them were reported from China between January and March 2020, among which seven were sampled from Wuhan in January 2020 (Table S7) . Despite the overwhelming dominance of M type (95.3%), other major genotypes at Level 1 hierarchy in the early phase gradually faded out as the pandemic unfolds ( Figures 3B, 6 and Table S6 ). For example, SEA type was one of the most common viral genotypes in early April, accounting for 11.8% of the total samples. However, the percentage of SEA type drastically dropped to only 1.0% by the end of December ( Figure 6 and Table S6) that originated in India. 18 In this study, a total of 4130 viral genomes harbored N501Y by December 25, 2020. Interestingly, they were generally categorized into two strains, with 3931 genomes as a subtype of WE1.1, and 188 genomes as a subtype of WE1.2 under the SOO algorithm ( Figure 8A) , with the former mainly reported from the United Kingdom (98.5%) and the latter mostly reported from South Africa (96.3%). It was consistent with a previous report from WHO that the UK variant N501Y V1 (i.e., B.1.1.7) was a different virus variant from the one from South Africa N501Y V2 (i.e., B.1.351) by phylogenetic analysis. 19 Interestingly the first genome of V1 in F I G U R E 7 Snapshots of genotypic compositions of SARS-CoV-2 at five critical time points. Genotypic compositions of genomes sampled by the dates of the first case ever reported, market shutdown, Wuhan lockdown, the first data lock, and the second data lock of this study were analyzed. SARS-CoV-2, severe acute respiratory syndrome coronavirus-2 GISAID was from Victoria, Australia on June 3, 2020. A total of 31 V1 genomes were identified in June 2020, 30 of which were from Australia. This gave rise to the first wave of V1 in June. It was followed by a huge spike beginning in November 2020 which was attributed to the wide spread of V1 in the United Kingdom at that time ( Figure 8B ). The first genome of V2 was actually from New York City, USA on April 21, 2020, and V2 was later widely spread in South Africa as evidenced by a wave of V2 in November ( Figure 8B ). In contrast, neither Gamma nor Delta variants were found in the 261 323 viral genomes by December 25, 2020 in this study, which is plausible since they emerged later than the Alpha and Beta variants. Next, to bridge the latest consequential situation of VOCs to what have been recapitulated in the 1-year pandemic in the study, two post hoc analyses of the VOCs under the SOO algorithm were performed. We found that all the VOCs were subtypes of WE1 ( Figure 8C,D) . As of July 15, 2021, the Alpha variant was the most prevailing (42.5%) VOC and a subtype of WE1.1 in parallel with Gamma which was much less prevalent (2.2%). The Beta variant, on the other hand, was a subtype of WE1.2.1, with the least prevalence (1.2%) ( Figure 8C ). Interestingly, a recent study identified that a selectively convergent 501Y meta-signature may have granted Alpha, Beta, and Gamma fitness advantage. 20 The Delta variant was derived from WE1, a subtype not characterized by the SOO algorithm, which was understandable given that it emerged much later than the second cutoff date of the study. Although it accounted for a moderate amount (8.1%), it has been worrisome since its first emergence because it has been spreading faster than the other variants. 21, 22 Moreover, featured with amino acid changes at the N-terminal domain and the receptor binding domain (RBD) of spike protein, Delta variant has shown increased immune evasion potential. 22 Its ability to spread their viruses to a second person by early April 2020 so that they harbored very unique "singleton" viral genotypes. We doubt that those singleton types had different transmissibility from M type due to their unique mutation profiles because in the early phase of the pandemic, few mutations were epidemiologically significant and the evolutionary dynamics of the virus were predominantly characterized by a mutational pattern of slow and selectively neutral random genetic drift. 20 The predominance of a particular viral genotype such as M type at the beginning of a pandemic is more likely to be attributable to the "founder's effect" than the fitness of the virus. The sequential increment of concurrent mutations from early lineages to descendant lineages as the pandemic unfolds still remains as an enigma. This phenomenon can be exemplified with M type. It initiated with two concurrent mutations followed by acquiring four concurrent mutations to become WE1, and further obtained three concurrent mutations to become WE1.1. Although it is roughly consistent with the estimated evolutionary rate (~22 subs per year according to Nextstrain by December 2020), the underlying mechanism of those sequential increments of concurrent mutations is yet to be carefully unveiled. Interestingly, we found that all five VOCs were subtypes of WE1. Although the Alpha variant, a subtype of WE1.1, was the most prevailing VOC several months ago, the Delta variant, a subtype of WE1, are not yet available to provide definitive evidence, resulting in uncertainty about viral behavior, susceptibility to natural and vaccinemediated immunity as well as the efficacy of current treatment strategies to patients with Omicron. 23 As many Emergency Use Authorized (EUA) real-time reverse transcription polymerase chain reaction (RT-PCR) diagnostic tests for SARS-CoV-2 have been widely used all over the world to screen for infected COVID-19 patients, various genomic regions were chosen by different agencies and manufacturers to design primers for the tests. For example, the three target regions of the diagnostic kit developed by US CDC are within the N region, whereas the test that China CDC developed for the initial investigation in Wuhan targeted ORF1ab as well as the N region, which is similar to the test used in Singapore. 3, 30 On the other hand, many manufacturers' tests chose to target the S gene. For example, the Thermo Fisher Scientific and Applied DNA Sciences tests target the S gene. Thermo's test also targets the N and ORF1ab genes, while Applied DNA's test targets two regions in the S gene. 31 Since genetic variants of SARS-CoV-2 arise regularly, those tests may give rise to potential false-negative results due to the mutations in the viral genome. A few tests have been reported with false-negative issues like S-gene dropout or reduced sensitivity with the S-gene target in detecting variants with N501Y mutation. 31 Not to mention tests detecting a single target in the viral genome which may generate far more variable and equivocal results. Although tests with multiple genetic targets to determine a final result are less likely to be impacted by increased prevalence of genetic variants, ongoing analyses of viral genomes in a real-time fashion may help with early identification of new stains in patients to reduce further spread of infection, guide the development and assess the efficacy of SARS-CoV-2 vaccines. 32 Based on the viral mutation spectrum and evolutionary rate estimation in the study, it is evident that the common mutation loci should be avoided as targets when designing RT and PCR primers for SARS-CoV-2 tests. Similarly, when developing nucleotide-based vaccines of SARS-CoV-2, researchers should take into consideration the mutation frequency in selecting viral genomic regions encoding antigen epitopes. Finally, it is imperative to reassure the vaccines can generate equivalent immunity against different genetic variants (especially the VOCs with increased capacity to overcome vaccination-induced immunity) before inoculating in a large population. [21] [22] [23] [24] 32 This study also demonstrates the genotypes of SARS-CoV-2 are unique identifiers that can be used as molecular barcodes to trace the virus transmission retrospectively and to reveal its expansion prospectively at the molecular level. The SOO algorithm can match any particular SARS-CoV-2 viral genome to known genotypes with high accuracy based on its mutation profile. With the pandemic still ongoing, novel genotypes other than what we have characterized in this study may surface. Thus, we anticipate incorporating those newly emerging genotypes into the current algorithm may improve the performance of SOO in the future. Consortium with £20 million funding in March 2020, aiming to investigate how coronavirus is spreading in the United Kingdom, to help guide treatments in the future, and to anticipate the impact of mitigating measures. 33 As a result, the United Kingdom has contributed the most viral genome sequencing data to the GISAID database compared with other countries and regions, and some countries have been following the same strategy as the United Kingdom. Thanks to those SARS-CoV-2 sequencing initiatives, we were able to access and analyze tremendous viral genomes from all over the world in the study, which serves as a proof of concept to demonstrate the utility of large-scale viral genome sequencing during a novel pathogen outbreak. Ramping up sampling in a real-time manner like the UK viral genome sequencing consortium may generate high-resolution maps of who-infected-whom transmission at the community level and reveal the subsequent expansion patterns which are especially crucial for the most severely stricken countries and regions to promptly develop tailored mitigation plans. 34 We first would like to express our gratitude to the GISAID database and the NGDC database for collecting and sharing the SARS-CoV-2 sequence data. We also thank all the medical and research institutions from all over the world that promptly submitted the SARS-CoV-2 sequences to the aforementioned databases. We thank Dr. Shuyu Li for helpful comments on the manuscript. WHO COVID-19 Dashboard A new coronavirus associated with human respiratory disease in China A novel coronavirus from patients with pneumonia in China A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein A pneumonia outbreak associated with a new coronavirus of probable bat origin Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Clinical features of patients infected with 2019 novel coronavirus in Wuhan Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia Temporal and spatial analysis of the 2014−2015 Ebola virus outbreak in West Africa The evolution of Ebola virus: insights from the 2013−2016 epidemic Distinct lineages of Ebola virus in Guinea during the 2014 West African epidemic Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Investigation of a COVID-19 outbreak in Germany resulting from a single travelassociated primary case: a case series South China Morning Post. Coronavirus: China's first confirmed Covid-19 case traced back to WHO Tracking SARS-CoV-2 variants WHO SARS-CoV-2 variants The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages Reduced neutralization of SARS-CoV-2 B.1.617 by vaccine and convalescent serum Reduced sensitivity of SARS-CoV-2 variant Delta to antibody neutralization Omicron SARS-CoV-2 variant: a new chapter in the COVID-19 pandemic Omicron SARS-CoV-2 variant: unique features and their impact on pre-existing antibodies Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States Introductions and early spread of SARS-CoV-2 in the New York City area Spread of SARS-CoV-2 in the Icelandic population Genetic cluster analysis of SARS-CoV-2 and the identification of those responsible for the major outbreaks in various countries FDA issues alert about potential false negative test results from SARS-CoV-2 mutations Neutralization of N501Y mutant SARS-CoV-2 by BNT162b2 vaccine-elicited sera UK COVID-19 Sequencing Consortium launches with £20M in government, Wellcome Trust funding 1126/science.abf2946 SUPPORTING INFORMATION Additional supporting information may be found in the online version of the article at the publisher's website. How to cite this article The authors declare that there are no conflict of interests.