key: cord-0822775-ry0kgjxu authors: Cao, Canhui; He, Liang; Tian, Yuan; Qin, Yu; Sun, Haiyin; Ding, Wencheng; Gui, Lingli; Wu, Peng title: Molecular epidemiology analysis of early variants of SARS-CoV-2 reveals the potential impact of mutations P504L and Y541C (NSP13) in the clinical COVID-19 outcomes date: 2021-03-31 journal: Infect Genet Evol DOI: 10.1016/j.meegid.2021.104831 sha: 0aa1dbe7a13b12b6f393c4772118965789c80634 doc_id: 822775 cord_uid: ry0kgjxu Since severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused global pandemic with alarming speed, comprehensively analyzing the mutation and evolution of early SARS-CoV-2 strains contributes to detect and prevent such virus. Here, we explored 1962 high-quality genomes of early SARS-CoV-2 strains obtained from 42 countries before April 2020. The changing trends of genetic variations in SARS-CoV-2 strains over time and country were subsequently identified. In addition, viral genotype mapping and phylogenetic analysis were performed to identify the variation features of SARS-CoV-2. Results showed that 57.89% of genetic variations involved in ORF1ab, most of which (68.85%) were nonsynonymous. Haplotype maps and phylogenetic tree analysis showed that amino acid variations in ORF1ab (p.5828P > L and p.5865Y > C, also NSP13: P504L and NSP13: Y541C) were the important characteristics of such clade. Furthermore, these variants showed more significant aggregation in the United States (P = 2.92E-66, 95%) than in Australia or Canada, especially in strains from Washington State (P = 1.56E-23, 77.65%). Further analysis demonstrated that the report date of the variants was associated with the date of increased infections and the date of recovery and fatality rate change in the United States. More importantly, the fatality rate in Washington State was higher (4.13%) and showed poorer outcomes (P = 4.12E-21 in fatality rate, P = 3.64E-29 in death and recovered cases) than found in other states containing a small proportion of strains with such variants. Using sequence alignment, we found that variations at the 504 and 541 sites had functional effects on NSP13. In this study, we comprehensively analyzed genetic variations in SARS-CoV-2, gaining insights into amino acid variations in ORF1ab and COVID-19 outcomes. Notably, the United States has recorded 28.5 million confirmed infections and 0.5 million deaths due to COVID-19, accounting for 25.08% of global infections and 20.36% of global deaths, thus making it one of the most severely affected countries. However, the reasons for the rapid spread of this novel virus and the pathogenesis of infections remain to be determined. Recent studies focused on the mutation of Spike protein and showed that variants carrying D614G have become the most prevalent worldwide, suggesting the fitness advantage for SARS- CoV-2 (Canhui et al., 2020; Korber et al., 2020) . However, variations in other genes beyond the Spike protein might be important for the evolution of this virus. SARS-CoV-2 was indicated to evolve into at least three phylogenetic groups, characterized by positive selection from ORF1ab, ORF3a, and ORF8. Of note, for the first time, Velazquez-Salinas et al identified the potential relevance of amino acid Y5865C in ORF1ab, showing that this residue is experimenting directional selection. Also, the increased evolutionary rate of ORF10 was identified by Velazquez-Salinas et al (Velazquez-Salinas et al., 2020) . Moreover, the exoribonuclease (ExoN) of NSP14 knockout mutant assays showed the replication 5 / 34 lockdown in the United Kingdom (du Plessis et al., 2021) . At the phylogeny of SARS-CoV-2, strains were identified as lineage A, B, and C according to variants, and the variant carrying the Y5865C mutation was associated with the lineage S or 19 B (Velazquez-Salinas et al., 2020) . In the correlation analysis, variants of S 614G was associated with case fatality rates (CRF) and median CFR in 12 countries (Becerra-Flores and Cardozo, 2020) , and variants ORF1ab 4715L and S 614G were reported correlated with fatality rates in 28 countries and 17 states of the United States (Toyoshima et al., 2020) . It is still unclear whether the different mortality rates or transmission rates observed in different regions may be the consequences of the differences in clade virulence (Mercatelli and Giorgi, 2020) . ORF1 accounts for about two-thirds of the whole genome and encodes two polyproteins, i.e., pp1a (approximately 486 kD) and pp1ab (approximately 790 kD) (Hilgenfeld and Peiris, 2013) , which are processed by two viral cysteine proteases, i.e., papain-like protease (PLpro, Nsp3 domain) and main protease (Mpro or 3CLpro, Nsp5), into 15 or 16 NSPs. Most of these Nsps are involved in the transcription or replication of viral genomes (Sawicki et al., 2007) . SARS-CoV-2. Based on the infection, fatality, and recovery rates, as well as dynamic curves for the emergence of genome variations, in different countries, we identified amino acid variations in ORF1ab at the 5828 and 5865 loci (NSP13: P504L and NSP13: Y541C) and gained insight into COVID-19 outcomes in the United States that contained different proportions of strains with these ORF1ab variations. Genetic variations in SARS-CoV-2 strains were determined based on data obtained from 2019nCoVR (v2.1) (https://bigd.big.ac.cn/ncov) Genetic variation heatmaps of SARS-CoV-2 strains were generated by the online tool (Spatiotemporal dynamics) in 2019nCoVR (v2.1) (https://bigd.big.ac.cn/ncov) . By calculating the frequency of mutation sites of SARS-CoV-2 strains at each time point, the dynamic trend over time is displayed in the form of a genetic variation heatmap. By calculating the frequency of mutation sites in each country, the dynamic trend in different countries was displayed in the form of a genetic variation heatmap. Amino acid variations in SARS-CoV-2 strains corresponded to genetic variation data. . The full SARS-CoV-2 proteome was based on the NCBI reference sequence (NC_045512), with GenBank entry MN908947. The mutation density of each virus region was calculated by dividing the mutation number of strains by the length of each region (bp). Genetic variations in ORF1ab: 17858 (A->G) and ORF1ab: 17747 (C->T) J o u r n a l P r e -p r o o f Journal Pre-proof corresponded to amino acid variations in ORF1ab: p.5865Y>C and ORF1ab: p.5828P>L, or NSP13: P504L and NSP13: Y541C. In the protein sequence, the amino acid of ORF1ab at site 5828 P (proline) mutated into L (leucine) and at site 5865 Y (tyrosine) mutated into C (cystine). Viral genotyping maps were used to perform haplotype analysis across countries based on 2019nCoVR (v2.1) (https://bigd.big.ac.cn/ncov) . We used 1 962 viral strains for genetic variation analysis and 1 211 strains for haplotype maps (Fig. S1 ). Strain numbers used for haplotype and genetic variation analyses were different due to the asynchronous operation process of these two results in the dataset. We used the latest 2 250 strains of the virus from 13 December 2019 to 26 March 2020, identifying 1 210 haplotypes in total. Haplotype network maps were used to demonstrate the genetic distance and evolutionary relationships among different haplotypes. Root and leaf nodes indicated the direction of SARS-CoV-2 variants. The phylogenetic tree was built using 2019nCoVR with default settings. The phylogenetic tree was analyzed using the Molecular Evolutionary Genetics Analysis (MEGA) tool, with a scale of 0.0001. Two phylogenetic treemaps based on ORF1ab: p.5828P>L and ORF1ab: p.5865Y>C were download from Nextstrain (https://nextstrain.org/ncov) (Hadfield et al., 2018) . The COVID-19 infection, mortality, and recovery rates in the United States were collected from the Centers for Disease Control and Prevention (CDC) (https://www.cdc.gov/) and virusncov (https://virusncov.com/covid-statistics/usa), and included confirmed and presumptive positive cases reported to the CDC since 22 January 2020, not including cases repatriated to the United States from Wuhan (China) or Japan. COVID-19 cases, mortality rates, and recovery rates of other countries were collected from the World Health Organization (WHO) (https://www.who.int/). The α-helix, β-sheet, and β-turn structures were used to represent protein secondary structures. Protein secondary structure prediction of ORF1ab (QHD43415.1) and the ORF1ab variant (p.5865Y>C and p.5828P>L, also NSP13: P504L and NSP13: J o u r n a l P r e -p r o o f Journal Pre-proof Y541C) was performed based on the Chou-Fasman algorithm (Chou and Fasman, 1974) . We used the amino acid sequence of ORF1ab (QHD43415.1) and the ORF1ab variant (QHD43415.1: p.5865Y>C and p.5828P>L) to compare differences between the two ORF1ab proteins. Polymorphism Phenotyping v2 (PolyPhen-2) is a tool for predicting the possible impact of the amino acid substitution or indel on the structure and function of a protein (Adzhubei et al., 2010) . Protein Variation Effect Analyzer (PROVEAN) is also a tool for predicting the possible impact of an amino acid substitution or indel on the biological function of a protein (Choi et al., 2012; Choi and Chan, 2015) . PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/index.shtml) and PROVEAN (v1.1) (http://provean.jcvi.org/index.php) were used to predict whether the variation in NSP13: P504L and NSP13: Y541C affected the protein function of NSP13. We used the amino acid sequence of the variants for the query. In PolyPhen-2, the HumDiv value was used to evaluate rare alleles and dense mapping of regions and to analyze natural selection, whereas the HumVar value was used to help diagnose Mendelian disease. The closer the value (HumDiv or HumVar) is to 1.0, the greater the effect on 2019nCoVR dataset. We first performed variation frequency integration for whole genomes of SARS-CoV-2 strains over time. We found 1 660 sites with genetic variations, including several regions with a significant number of variations, i.e., ORF1ab (961) and S (162). Notably, variation frequency was markedly enriched from 27 February 2020 (Fig. 1A) . Variation heatmap of countries demonstrated the composition of several significant genetic variation sites, including variation in We next conducted variation annotation of all genome variations according to the NCBI reference sequence NC_045512, with GenBank entry MN908947. In general, we found 1 458 genome variations in coding regions of SARS-CoV-2, 69.48% of which were nonsynonymous. In addition, 66.05% (963/1458) of genetic variations were distributed in ORF1ab, of which 68.85% (663/963) were found to be nonsynonymous (Fig. 1C, Supplementary Table S1 ). As the length of ORF1ab occupied more than two-thirds of the whole genome, the mutation rate of each gene region should be determined in consideration of gene length. We calculated the mutation density of each gene via dividing the mutation number of each gene region by the length of each SARS-CoV-2 gene and found that the ORF10 region had a higher mutation rate (Fig. S3 ). We also found that the average variation frequency of coding regions was 5.48 (ranging from 1 to 597) and of noncoding regions was 6.96 (ranging from 1 to 592). To identify changes in genetic variations over time and by country, we constructed variation dynamic curves. Results showed that genetic variations in United States strains were not only responsible for the top nonsynonymous variations in the ORF1ab, S, and ORF8 regions but also were responsible for the top synonymous variations in the ORF1ab: 8782 and ORF1ab: 3037 sites. Furthermore, the occurrence date of the genetic variations in the United States strains contributed to the temporal trend of variations observed on 27 February 2020 (Fig. 2B ). Due to their extremely high mutation rates, short generation time, and large populations, viruses can rapidly develop viral quasispecies in diverse intra-host J o u r n a l P r e -p r o o f Journal Pre-proof populations (Domingo et al., 2012) , NGS data was used to produce thousands to millions of reads from a mixed sample and to estimate viral quasispecies (Giallonardo et al., 2014) . Here, the haplotype maps of SARS-CoV-2 strains across countries were based on genetic variations from high-quality genome sequencing data. From the 2 250 viral strains, we identified 1 210 haplotypes in total. Importantly, we observed that haplotype (H (140)) occurred in more strains than any other haplotype. Furthermore, 97.14% (136/140) of these strains were from the United States, with the remaining 2.85% (3/140) from Canada, thus showing significant differences between the two countries (P = 7.90E-14; Fig. 3A ). These results indicate that this virus strain from the United States exhibited the aggregation in the population than any other strain. As different viruses follow different patterns of variation, phylogenetic trees can be used to investigate their variation features (Holmes, 2008). Here, we constructed a phylogenetic tree of SARS-CoV-2 strains from the 2019nCoVR dataset using default settings . Results identified a clade enriched with strains from the United States (enriched clade, Fig. 3B ). Furthermore, 236 strains fell within the enriched cluster, while 140 strains belong to Haplotype H (140) and the Tacoma (Fig. 4E) . To further analyze the variation in ORF1ab (p.5828P>L and p.5865Y>C, NSP13: P504L and NSP13: Y541C) in SARS-CoV-2, we explored variation frequency over time. Results demonstrated that variation frequency increased on 27 February 2020 and peaked on 15-16 March 2020 (Fig. 5A) . We then calculated the infection, fatality, and recovery rates of COVID-19 in the United States. On 15 and 16 March 2020, infections increased by more than 1 000 cases, totaling 2 234 and 3 487 cases, respectively. In addition, the recovery rate dropped on 27 February 2020 and the fatality rate increased on 28 February 2020 (Fig. 5B) . These date links between ORF1ab variation and COVID-19 condition were not found in Iceland or Australia (Fig. S4A, B) In particular, the ORF1ab 4715L and S protein 614G variants were correlated with a higher fatality rate. Although the NSP13: P505L and NSP13: Y541C variants were identified in Cluster 3, mainly in the United States, Australia, and Canada, the fatality rate of Cluster 3 showed no statistical significance compared with Cluster 1 and Cluster 2 (Toyoshima et al., 2020) . Using the genomic diversity of mutations in early SARS-CoV-2 strains, scientists have identified two distinct mutations, i.e., S or L type, with the S type considered more aggressive and faster spreading (Tang et al., 2020) . Moreover, in silico structure modeling of Nsp13 and Nsp14, potential dual-target inhibitors of SARS-CoV-2 with high binding affinity were identified (Gurung, 2020) . The data was identified and integrated from the 2019nCoVR database, with the J o u r n a l P r e -p r o o f Journal Pre-proof non-redundant and duplicated biases . Factors associated with the fatality rate of COVID-19 fall into objective factors, such as the quality and capacity of the healthcare system, and subjective factors related to individual patients, such as age and history of chronic respiratory disease, hypertension, diabetes, and coronary heart disease (Zhou et al., 2020a) . In Washington State, although the number of infections ranks tenth, the fatality rate ranks second (as of 2 April 2020), with a higher fatality rate than even New York. However, we found that for mortality rate in Washington State was high, with 69.05% of strains containing ORF1ab variants (NSP13: P504L and NSP13: Y541C), whereas other states (e.g., Minnesota, Utah, Wisconsin, and California) that only contained a small proportion of strains with the ORF1ab variants, showed better disease outcomes. However, outcomes in hosts infected with strains containing whether ORF1ab variations (NSP13: P504L and NSP13: Y541C) or not in the same States were missing, making it difficult to directly build the correlation between them. The only way to know whether a nonsynonymous variation of a virus affects its function is to study it in cell assays or animal models to clarify entrance and transmission processes (Muth et al., 2018) . As Velazquez-Salinas's study identified that residue 5865 (NSP13: Y541C) was under directional selection, which is also shown by the datamonkey evolutionary server, the variants (NSP13: P504L and NSP13: Y541C) may be under experimenting positive selection (Velazquez-Salinas et al., 2020) . Besides, we only predicted secondary structures of proteins (α-helix, J o u r n a l P r e -p r o o f Journal Pre-proof β-sheet, and β-turn) and functional effects based on computational methods, without the application of cell cultures or animal models to demonstrate the consequences of the variations, which need further study. Furthermore, virus strains carrying the two key variations were also found in Iceland and Australia but were not dominant in these two countries. The mutation frequencies of these two loci were high in Mexico, Canada, and the United States (Fig. S7) , which are geographically close to each other. And genetic mutation heatmaps of regions showed that the mutation type of Iceland, Australia, and the United States were different (Fig. S8 ). In this research, we explored 1 962 high-quality genomes of early SARS-CoV-2 to identify the changing trends in genetic variations over time and by country. Haplotype mapping and phylogenetic tree analysis showed that amino acid variations in ORF1ab Zhou, P., Yang, X.L., Wang, X.G., Hu, B., Zhang, L., Zhang, W., et al. (2020b) . A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579 (7798), 270-273. doi: 10.1038 /s41586-020-2012 -7. Zhu, N., Zhang, D., Wang, W., Li, X., Yang, B., Song, J., et al. (2020 A method and server for predicting damaging missense mutations The proximal origin of SARS-CoV-2 SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate The SWISS-MODEL Repository-new features and functionality Predicting the functional effect of amino acid substitutions and indels Prediction of protein conformation Viral quasispecies evolution Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK Phylogeography and evolutionary history of reassortant H9N2 viruses with potential human health implications Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations In silico structure modelling of SARS-CoV-2 Nsp13 helicase and Nsp14 and repurposing of FDA approved antiviral drugs as dual inhibitors Nextstrain: real-time tracking of pathogen evolution From SARS to MERS: 10 years of research on highly pathogenic human coronaviruses Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats Delicate structural coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP hydrolysis Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Geographic and Genomic Distribution of SARS-CoV-2 Mutations Novel coronavirus 2019-nCoV: prevalence, biological and clinical characteristics comparison with SARS-CoV and MERS-CoV Attenuation of replication by a 29 nucleotide deletion in SARS-coronavirus acquired during the early stages of human-to-human transmission The Enzymatic Activity of the nsp14 Exoribonuclease Is Critical for Replication of MERS-CoV and SARS-CoV-2 A contemporary view of coronavirus transcription ProMod3-A versatile homology modelling toolbox The Global Landscape of SARS-CoV-2 Genomes, Variants, and Haplotypes SARS-CoV-2 genomic variations associated with mortality rate of COVID-19 Positive Selection of ORF1ab, ORF3a, and ORF8 Genes Drives the Early Evolutionary Trends of SARS-CoV-2 During the SWISS-MODEL: homology modelling of protein structures and complexes A new coronavirus associated with human respiratory disease in China The 2019 novel coronavirus resource