key: cord-0695086-xhj29hsu authors: Fang, Shuyi; Liu, Sheng; Shen, Jikui; Lu, Alex Z; Wang, Audrey K. Y.; Zhang, Yucheng; Li, Kailing; Liu, Juli; Yang, Lei; Hu, Chang‐Deng; Wan, Jun title: Updated SARS‐CoV‐2 single nucleotide variants and mortality association date: 2021-07-16 journal: J Med Virol DOI: 10.1002/jmv.27191 sha: 42784f53d25503b1269c84bc58c80b613602a184 doc_id: 695086 cord_uid: xhj29hsu By analyzing newly collected SARS‐CoV‐2 genomes and comparing them with our previous study about SARS‐CoV‐2 single nucleotide variants (SNVs) before June 2020, we found that the SNV clustering had changed remarkably since June 2020. Apart from that the group of SNVs became dominant, which is represented by two nonsynonymous mutations A23403G (S:D614G) and C14408T (ORF1ab:P4715L), a few emerging groups of SNVs were recognized with sharply increased monthly incidence ratios of up to 70% in November 2020. Further investigation revealed sets of SNVs specific to patients' ages and/or gender, or strongly associated with mortality. Our logistic regression model explored features contributing to mortality status, including three critical SNVs, G25088T(S:V1176F), T27484C (ORF7a:L31L), and T25A (upstream of ORF1ab), ages above 40 years old, and the male gender. The protein structure analysis indicated that the emerging subgroups of nonsynonymous SNVs and the mortality‐related ones were located on the protein surface area. The clashes in protein structure introduced by these mutations might in turn affect the viral pathogenesis through the alteration of protein conformation, leading to a difference in transmission and virulence. Particularly, we explored the fact that nonsynonymous SNVs tended to occur in intrinsic disordered regions of Spike and ORF1ab to significantly increase hydrophobicity, suggesting a potential role in the change of protein folding related to immune evasion. with sharply increased monthly occurrence ratios in November 2020. All of these individual SNVs could be traced back to February or March of 2020 when they were identified for the first time, suggesting a potential incubation period of the collectivity of special groups of SNVs. • 114 age-specific SNVs were identified in one or across multiple age groups. • 42 SNVs showed significantly high rates in either males or females. • 41 and 30 SNVs were observed with at least twofold higher incidence rates in the death and the nondeath group, respectively. • A logistic regression model demonstrated that three critical SNVs, G25088T (S:V1176F), T27484C (ORF7a:L31L), and T25A (upstream of ORF1ab), ages above 40 years old, and the male group contribute to a relatively higher mortality. • The emerging subgroups of nonsynonymous SNVs and the mortality-related ones were located on the protein surface area. Nonsynonymous SNVs tended to occur in intrinsically disordered regions of Spike and ORF1ab. Coronaviridae) have been observed, 2,3 including insertions, deletions, and single nucleotide variants (SNVs), 4,5 which led to viral protein structure changes that potentially affect the transmission and virulence. SNVs have been extensively detected by massive and daily updated whole-genome sequencing. SARS-CoV-2 SNVs presented clustering characteristics 4 in terms of concurrence [5] [6] [7] [8] that are likely linked to the complex mechanism of epistatic gene interactions in SARS-CoV-2 viral evolution. Therefore, timely updates of SARS-CoV-2 mutations, especially critical SARS-CoV-2 mutations that are associated with patients' clinical information, including age, gender, and mortality status, has become a necessary and important step in the fight against COVID-19. However, such investigations remain scarce due to limited clinical information along with whole-genome sequences. [9] [10] [11] Following our previous report, 5 we analyzed a total of 146,045 SARS-CoV-2 high-quality complete genomes downloaded from GISAID with collection dates after June 1, 2020. 12 The majority of SARS-CoV-2 genomes have evolved with a dominant SNV cluster represented by nonsynonymous mutations A23403G (S:D614G) and C14408T (ORF1ab:P4715L), in addition to C241T at the upstream of ORF1ab and another synonymous mutation C3037T on ORF1ab. According to two-way clustering analysis on SNVs harbored by over 1% of SARS-CoV-2 genomes, additional SNVs were uncovered with increasing occurrence ratios over recent months. Even though two dominant amino acid (AA) changes, ORF1ab:P4715L and S:D614G variants, were reported to be strongly correlated with mortality, 9 their death rates have been relatively stable. Or at least they were not strongly correlated with occurrence ratios of these two mutations which were carried by over 99% of patients now. Therefore, we did a systematic analysis of geographical distributions of SNVs and their corresponding mortality rates. By performing enrichment analysis on 6845 SARS-CoV-2 genomes with clinical information, we identified multiple SNVs that were significantly over-represented in different ages, or gender, or specifically related to COVID-19-associated mortality. In the meantime, the protein structure analysis on nonsynonymous SNVs showed clashes caused by mutations, which, in turn, might contribute to viral transmission and mortality. Some SNVs were also found with the tendency to occur in intrinsically disordered regions (IDRs) of Spike (S) protein and ORF1ab. Our findings may be helpful for a better understanding of the pathogenesis of SARS-CoV-2 at the genetic level, possibly providing insights into therapeutic intervention and vaccine design in the future. SARS-CoV-2 genome sequencing data and available clinical information were downloaded from GISAID with the collection dates between December 2019 and November 20, 2020. A total of 146,045 high-quality complete genomes were used for SNV analysis. It turned out that 40,172 and 36,844 samples had gender and age information, separately, while 6845 genomes came with clear death and deceased information after excluding samples marked with the ambiguous status of "unknown" or "cryptic." 2.2 | Enrichment analysis on age/gender/mortality associated SNVs The samples with age information were divided by 20-year bins into five groups, for example, "Under 20," "20-39," "40-59," "60-79," "At Using the hypergeometric model, 13 we evaluated the statistical significance of the SNV appearing at least S i times in the ith group. The p values were then modified based on false-discovery rate (FDR) adjusted multiple test correction. The SNVs harbored by more than 100 samples with FE > 1.1 and FDR < 0.05 in a specific age group were defined as age-specific SNVs for corresponding age. Similar enrichment analysis was conducted on samples with gender information (two groups only), but with two-tails for overrepresentation in either the male or female group. The SNV carried by at least 100 samples was determined as a gender-specific SNV if FDR < 0.05 and its FE > 1.1 in either the male or female group. Candidate SNVs related to mortality were estimated based on their occurrence rates in the death and nondeath group. Enrichment analysis was conducted for each SNV in the same way that we did for the gender association study as above. The FE was calculated as the ratio of the rate in the death group vs that in the nondeath group for a specific SNV, indicating either over-representation in the death group (FE > 2) or in the nondeath group (FE < 0.5). The statistical significance (p value) was evaluated based on the hypergeometric distribution model as well. 14 To explore the correlation between different features and mortality status, the logistic regression model was adopted for feature analysis on clinical information and SNVs by using the R glm function. [15] [16] [17] [18] [19] We selected age, gender, and SNVs related to mortality as the independent variables while the patient mortality status was chosen as the dependent variable in the logistic regression model. The significance cutoff was set as p < 0.001 to identify the variable-related features with coefficients from the logistic regression model. We selected PyMOL [20] [21] [22] to analyze and visualize protein structures for WT (Wuhan-Hu-1) and mutated proteins with identified nonsynonymous SNVs. Mutagenesis tools in PyMOL were utilized to detect if a clash was generated upon the mutation. Properties of AAs and corresponding hydrophobicity scales were retrieved from the " 3 | RESULTS Previously, we discovered four major SNV groups according to the analysis of SNVs in genomes collected before June 2020. 5 Here we analyzed 81,042 more SARS-CoV-2 complete genomes from the GISAID database following the exclusion of low-coverage genomes. Wuhan-Hu-1 (NCBI Reference Sequence: NC_045512.2) was used as the reference genome to keep consistency. 5 A total of 20,477 nucleotide sites were identified to carry SNVs in at least one genome. Among them, 52 SNVs occurring in greater than 3% of genomes were used for two-way clustering in the study. The clustering pattern of SNVs after June 2020 ( Figure 1A ) was distinct from the one before June 2020. 5 Figure 1A were identified for the first time before June 2020 (black boxes in Figure 1B ), but their occurrence ratios gradually increased to higher levels after several months (blue-filled boxes in Figure 1B ). This phenomenon might suggest potential incubation periods of SNVs during SARS-CoV-2 evolution. groups, the occurrence ratio of A.E3 soared up quickly since August 2020 ( Figure 1C ). They were carried by about 48% of new SARS-CoV-2 genomes collected in November. Although over 98% of these SNVs were found in England, we also observed 0.6% in Denmark. and 20 SNVs over-represented in male and female ( Figure 2D ), respectively, with FDR < 0.05 ( Figure 2E) . Surprisingly, 35 out of 42 SNVs were age-specific as well. To investigate the association between SARS-CoV-2 SNVs and the mortality rate, we analyzed a total of 6845 genomes with patient death/deceased status retrieved from GISAID. Among them, 665 samples (9.7%) were defined as death group with keywords "death"/ "deceased"/"Hospitalized, deceased" in the clinical information, while the nondeath group contains 6180 genomes with all the other patient statuses. The monthly mortality rate kept an increasing tendency from December 2019 to April 2020 with a peak death ratio of 18.5% and mainly decreased in the following months ( Figure 3A ). The mortality rates seemed independent of the total number of collected patient samples. To avoid bias from countries/areas with fewer cases of death, we analyzed a total of 2765 samples from countries with at least 10 death samples. By comparing SNV occurrence frequencies between the death and nondeath groups, we identified 71 mortalityrelated SNVs. Forty and thirty SNVs were identified from at least 10 dead patients and at least twofold significantly (p < 0.01) enriched in the death group ( Figure 3B ) and the nondeath group ( Figure 3C ), respectively. It is interesting to see remarkable overlap between three groups of SNVs, which are specific to age, gender, or significantly associated with mortality ( Figure 3D ) Figure 3E ). IDR in a protein represents an unfixed or disordered threedimensional structure due to the lack of sufficient hydrophobic AAs. 35, 36 The hydrophobic effect is a driving force of protein folding. Figure 4H ). Significantly higher SNV occurrence ratios were observed in IDRs of S (79.0%, p = 1.7 × 10 −15 ) and ORF1ab (77.1%, p = 9.9 × 10 −6 ), but not in IDR of ORF3a (97.4%, p = 0.12). Furthermore, these nonsynonymous mutations significantly increased hydrophobicity 37 and 50%, respectively, during the same time period. However, these SNVs were not new at all because they were observed for the first time during February and March in 2020, several months before when their incidences soared up. This suggests a potential incubation period for SARS-CoV-2 dominant SNVs during the evolution. The results also indicate that these SNV groups might make the virus more contiguous. Our systematic study revealed that apart from the A.E1-3, more SNV groups occurred more than 3% of the SARS-CoV-2 genomes with distinct temporal patterns, even though some of them did not clearly show an increasing temporal trend until November 2020 (Figure 1 ). Caution should be taken to monitor these SNVs with collectively dynamic changes. In December 2020, a set of 23 changes or mutations (VUI-202012/01) were found to possibly drive infections in the UK. 40, 41 The set of signature variants includes 8 changes from S protein: (T24506G), D1118H (G24914C). Viruses with these mutations were reported to be up to 70% more transmissible than previous strains, although there was "considerable uncertainty" and "no evidence" that these variants were more lethal or could render vaccines and treatments useless. 42 These SNVs occurred independently before November 2020 without causing significantly higher viral transmission. This might suggest that the collective mutations from these SNVs may speed up COVID-19 transmission. With the integration of the clinical information, 114 SNVs were identified to be specific to age groups. Except 27 out of them enriched across two groups of different ages, the majority was specific to one The release of original data must be permitted by GISAID based on the agreements. However, analyzed results can be found and downloaded from the website, https://shiny.ph.iu.edu/GESS/ Emergence of genomic diversity and recurrent mutations in SARS-CoV-2 Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity Analysis of genomic distributions of SARS-CoV-2 reveals a dominant strain type with strong allelic associations Genetic Spectrum and Distinct Evolution Patterns of SARS-CoV-2 An updated analysis of variations in SARS-CoV-2 genome Genomic characterization of a novel SARS-CoV-2 GESS: a database of global evaluation of SARS-CoV-2/hCoV-19 sequences SARS-CoV-2 genomic variations associated with mortality rate of COVID-19 ORF3a mutation associated with higher mortality rate in SARS-CoV-2 infection Two mutations in the SARS-CoV-2 spike protein and RNA polymerase complex are associated with COVID-19 mortality risk Global initiative on sharing all influenza data-from vision to reality Enrichment or depletion of a GO category within a class of genes: which test? Univariate Discrete Distributions An Introduction to Generalized Linear Models Generalized linear models Generalized Linear Models. London: Chapman and Hall Modern Applied Statistics with S R: A Language and Environment for Statistical Computing The AxPyMOL Molecular Graphics Plugin for Microsoft PowerPoint, Version 1.8 The PyMOL Molecular Graphics System, Version 1.8 The JyMOL Molecular Graphics Development Component, Version 1.8 Announcing the worldwide Protein Data Bank Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein Structure of the RNA-dependent RNA polymerase from COVID-19 virus The I-TASSER Suite: protein structure and function prediction De novo design of protein peptides to block association of the SARS-CoV-2 spike protein with human ACE2 Protein structure and sequence reanalysis of 2019-nCoV genome refutes snakes as its intermediate host and the unique similarity between its spike protein insertions and HIV-1 SARS-CoV-2 Spike protein variant D614G increases infectivity and retains sensitivity to antibodies that target the receptor binding domain The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity Could the D614G substitution in the SARS-CoV-2 spike (S) protein be associated with higher COVID-19 mortality? The impact of mutations in SARS-CoV-2 Spike on viral infectivity and antigenicity The Vienna RNA websuite ViennaRNA Package 2.0 DisProt: intrinsic protein disorder annotation in 2020 DisProt 7.0: a major update of the database of disordered proteins A simple method for displaying the hydropathic character of a protein Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation Site-specific glycan analysis of the SARS-CoV-2 spike Covid-19: New coronavirus variant is identified in UK Investigation of novel SARS-COV-2 variant Covid: WHO in 'close contact' with UK over new virus variant Updated SARS-CoV-2 Single Nucleotide Variants and Mortality Association ACKNOWLEDGEMENTS Special thanks to researchers for depositing whole genomic sequences of Novel Pneumonia Coronavirus (SARS-CoV-2/hCoV-19/ 2019-nCoV) at the Global Initiative on Sharing All Influenza Data (GISAID) EpiFluTM. We are thankful for the technical support from The authors declare that there are no conflict of interests.