key: cord-0974169-7tyb1x36 authors: Hahn, Georg; Wu, Chloe M.; Lee, Sanghun; Hecker, Julian; Lutz, Sharon M.; Haneuse, Sebastien; Qiao, Dandi; Cho, Michael H.; Randolph, Adrienne; Laird, Nan M.; Weiss, Scott T.; Silverman, Edwin K.; Ribbeck, Katharina; Lange, Christoph title: Mutations in SARS-CoV-2 spike protein and RNA polymerase complex are associated with COVID-19 mortality risk date: 2020-11-24 journal: bioRxiv DOI: 10.1101/2020.11.17.386714 sha: e5c2a4cc5874e13ff4d7985f9a51b87447560540 doc_id: 974169 cord_uid: 7tyb1x36 SARS-CoV-2 mortality has been extensively studied in relationship to a patient’s predisposition to the disease. However, how sequence variations in the SARS-CoV-2 genome affect mortality is not understood. To address this issue, we used a whole-genome sequencing (WGS) association study to directly link death of SARS-CoV-2 patients with sequence variation in the viral genome. Specifically, we analyzed 3,626 single stranded RNA-genomes of SARS-CoV-2 patients in the GISAID database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) with reported patient’s health status from COVID-19, i.e. deceased versus non-deceased. In total, evaluating 28,492 loci of the viral genome for association with patient/host mortality, two loci, 12,053bp and 25,088bp, achieved genome-wide significance (p-values of 1.24e-12, and 1.24e-26, respectively). Mutations at 25,088bp occur in the S2 subunit of the SARS-CoV-2 spike protein, which plays a key role in viral entry of target host cells. Additionally, mutations at 12,053bp are within the ORF1ab gene, in a region encoding for the protein nsp7, which is necessary to form the RNA polymerase complex responsible for viral replication and transcription. Both mutations altered amino acid coding sequences, potentially imposing structural changes that could enhance viral infectivity and symptom severity, and may be important to consider as targets for therapeutic development. prognosis and treatment development. To identify potential links between viral mutations and mortality, we utilized the GISAID database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) , which currently contains data on 3,626 COVID-19 patients from 68 countries, for whom full metadata is available, i.e. age, sex, location and patient status, and whose viral genomes have been sequenced (see Table 1 ) and probed each locus of the single stranded RNA of the SARS-CoV-2 virus for direct association with host/patient mortality. The variable patient status indicates if the patient was alive or deceased at the time the virus sample was submitted to GISAID; we use it as a surrogate for mortality in our analysis. For the analysis, we repurposed the methodology of genome-wide association studies (GWAS) (Manolio, 2010) . This approach is widely used in human genetics and can test thousands of genetic loci for association in datasets such as GISAID. To identify potential confounding geographic factors in the sequencing data, we first conducted principal component analysis of the Jaccard similarity matrix (Figure 1 ) that was computed for the 3,626 viral genomes available for our analysis. We utilized the Jaccard similarity matrix because its computation does not require estimates of the mutation frequency for each locus in the SARS-CoV-2 genome, in contrast to other similarity matrices such as the variance/covariance matrix (Prokopenko et al., 2016) . We found that the virus genomes clustered in distinctive branches that correspond to the geographic regions from where their data was submitted to GISAID (Forster et al., 2020) (Figure 1 ). The geographical clustering of the viral genomes can cause bias in the association analysis if unaccounted for. Hence, we generated additional eigenvector plots to investigate the number of eigenvectors needed to eliminate bias caused by such clustering. Based on visual inspection of these plots, we selected the first 5 eigenvectors of the Jaccard matrix as covariates for the following logistic regression analyses. In the whole-genome sequencing analysis of the 3,626 SARS-CoV-2 viruses, we tested each locus (presence/absence of mutation) of the viral genome individually for association with the status indicator variable "deceased/ non-deceased" of the host/patient at submission to GISAID. In the logistic regression, we also adjusted for sex, age, and the first 5 eigenvectors of the Jaccard matrix. The qq-plot of the p-values for the association tests are displayed in Figure 2 . The significance level of 5% is controlled for multiple testing using Bonferroni correction, resulting in the corrected threshold of 0.05/28,492=1.75e-6. Based upon this correction, two loci of the SARS-CoV-2 genome achieved genome-wide significance: one at position 12,053bp on the SARS-CoV-2 reference genome having a p-value of 1.24e-12, and one at 25,088bp with a pvalue of 1.24e-26. (Figure 2 ). To investigate the robustness of the highly significant association signals, we examined the dataset at the individual patient and locus level. Our findings were enabled by two features specific to the data: 1.) the Brazilian centers enrolled much larger numbers of deceased patients than the other centers world-wide. At enrollment, 45% of the Brazilian patients were deceased in contrast to only 9.9% in the entire dataset. 2.) We also noticed that all genomes that carry at least one of the mutations either at 12,053bp or 25,088bp are located predominantly in those branches of the eigenvector plot (see Figure 1 ) that correspond to the PAHO/South America region, to Asia, or to Europe. We conducted three different types of sensitivity analysis to minimize potential confounding (Table 2): 1. "Geographic region", as indicated in Table 1 , was added as a categorical variable to the logistic regression analysis. 2. Our data set was restricted to genomes that were matched based proximity in the eigenvector plots (see Methods for details). 3. As further examination of the deceased indictor variable revealed that all "deceased" carrier genomes came from Brazil, our analysis was restricted to genomes that were submitted from Brazil. In all three secondary analyses, both loci remained significant (Table 2) . For both loci, 3 the effect size estimates of the mutations showed risk increases for mortality of 5-fold and higher (Table 2) . Furthermore, as the association signals for both loci stem from the Brazilian Covid-19 patients, we also obtained Fisher-exact tests for the data from Brazil overall and from Sao Paulo alone (Table 3) . Sao Paulo is the only Brazilian center/city for that sufficient numbers were available to compute the Fisher-exact test. For both loci and for both subgroup analyses, the Fisher-exact test detected significant associations between the presence of a mutation and the deceased indicator variable. In sum all results of the secondary analyses (Tables 1-4) support the genome-wide significant association between the mutation 25,088bp and mortality. The locus at 12,053bp did not achieve formally genome-wide significance in the three secondary analyses, but nonetheless remains a viable candidate locus. The large effect estimates for both mutations in all analyses (Table 2) , are substantial in support of the associations. Since the criteria for selection into the study likely varies by country, and may be related to the deceased indicator, the odds ratio estimate from the Brazil sample alone may be most interpretable. Among the samples from Brazil, 19.1% of the patients whose viral genome did not carry any mutation at either loci were deceased at enrollment, 83.7% of the patients whose viral genomes carried the mutation at 25,088bp, and 84.2% of the patients whose viral genomes carried both mutations. We did not observe any viral genomes that carried the mutation at 12,053bp, but not 25,088bp. Given the large effect estimates for mutations in all analyses (Table 2) , it is difficult to imagine un-accounted confounding mechanism that would affect mutations at just two out of almost thirty-thousand loci (12,053bp and 25,088bp) and that would be strong enough to cause such profound association signals, as the ones we observed in our analysis. Table 1 also provides a regional breakdown of the "deceased-at-enrollment" rates and the mutation frequencies for both loci. The rarity of the mutations outside of Brazil means that there is virtually no power to detect any association (if they exist). Single mutations in viruses can confer enhanced virulence associated with patient mortality (Bae et al., 2018; Brault et al., 2007) . In our analysis of SARS-CoV-2, the mutation at 25,088bp occurs in the spike glycoprotein, which mediates viral attachment and cellular entry. The spike protein consists of two functional subunits: S1, which contains the receptor-binding domain, and S2, which contains the machinery needed to fuse the viral membrane to the host cellular membrane. The mutation at 25,088 is in the S2 subunit, and specifically occurs within the S2' site, which is cleaved by host proteases to activate membrane fusion (Figure 3a ) In many 4 viruses, membrane fusion is activated by proteolytic cleavage, an event which has been closely linked to infectivity-for instance, a multibasic cleavage site is a signature of highly pathogenic viruses including avian influenza (Walls et al., 2020) . In coronaviruses, membrane fusion is known to depend on proteolytic cleavage at multiple sites, including the S1/S2 site, located at the interface between the S1 and S2 domains, and the S2' site located within the S2 domain. These cleavage events can impact infection-in fact, a distinct furin cleavage site present in the SARS-CoV-2 S1/S2 site is not found in SARS-CoV (Vankadari, 2020) , and it is thought to increase infectivity through enhanced membrane fusion activity (Walls et al., 2020; Vankadari, 2020; Xia et al., 2020) . Consequently, mutations at these sites can alter virulence-for instance, a recent study reported that mutations disrupting the multibasic nature of the S1/S2 site affect SARS-CoV-2 membrane fusion and entry into human lung cells (Hoffmann et al., 2020) . Several studies have also found that SARS-CoV mutants with an added furin recognition site at S2' had increased membrane fusion activity (Belouzard et al., 2009; Watanabe et al., 2008) . While enhanced infectivity does not always cause higher fatality rate, more infectious viruses can lead to a higher viral load, which can impact symptom severity and mortality (Pujadas et al., 2020) . The majority of carriers for 25,088bp (115 out of 132) exhibit a G to T missense mutation (Table 4 ), which changes the encoded amino acid from valine to phenylalanine. Of the other carriers, 13 had a G to A mutation corresponding to a change to isoleucine, and 4 had a G to C mutation leading to an encoded leucine. While valine, leucine, and isoleucine are branched-chain amino acids with similar biochemical and structural properties, phenylalanine has a bulkier aromatic structure. Such a substitution may impose local structural constraints, stabilize particular secondary structures (Makwana and Mahalakshmi, 2015) , or introduce specific interactions which lead to preferential binding. Therefore, a mutation in the S2' domain which promotes proteolytic cleavage could theoretically enhance viral infectivity ( Figure 3b ) and consequently, patient mortality. While many current therapies primarily target the receptor binding domain within the S1 subunit of the SARS-CoV-2 spike protein, our findings suggest that the S2 domain may be an important additional target for therapeutic development. The mutation at 12,053bp occurs within the ORF1ab gene, which expresses a polyprotein comprised of 16 nonstructural proteins (Yoshimoto, 2020) . Specifically, 12,053bp occurs in NSP7, which dimerizes with NSP8 to form a heterodimer that complexes with NSP12, ultimately forming the RNA polymerase complex essential for genome replication and transcription. 5 Mutations causing enhanced viral polymerase activity have been linked to increased pathogenicity of influenza viruses. The majority of carriers for 12,053bp (53 out of 63) exhibit a C to T missense mutation, which causes leucine to be substituted for phenylalanine (Table 4) . Such a mutation may confer structural rigidity which could potentially alter interactions with other components of replication and transcription machinery, but experimental analysis is needed to test these hypotheses. Collectively, these results suggest that genetic variation in the viral genome sequence may contribute to the increased COVID-19 mortality. The analysis presented in this article is based on nucleotide sequences with accession numbers EPI_ISL_402124 to EPI_ISL_541337, downloaded from the GISAID database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) as a file in "fasta" format on 22 September 2020. Only patients with additional metadata (age, sex, and hospitalization status as plain text comments) were selected on GISAID, resulting in 7,151 samples. We filtered the 7,151 samples for complete nucleotide sequences, and aligned them using the anchor sequence given in Hahn et al. (2020a) . After this step, 4,385 samples remained. Using the location tag in the fasta file, we grouped all samples according to the WHO regional offices for Africa (AFRO), for the Eastern Mediterranean (EMRO), for Europe Finally, we matched these 4,378 samples to the metadata information (age, sex, clinical outcome) available on GISAID. Filtering for those samples having complete metadata information resulted in n=3,626 samples. After aligning all sequences, we established a window of length 28,492bp in which all viral sequences have reads. We compared all trimmed and aligned sequences entry-wise to the SARS-CoV-2 reference sequence published on GISAID (accession number EPI_ISL_412026), and denoted in a matrix X with an entry X ij =1 that sequence i deviated from the reference sequence at position j. All other entries of X are zero. We used the R-package "locStra" (Hahn et al., 2020c,d) to calculate the Jaccard similarity matrix (Jaccard, 1901; Tan et al., 2005; Prokopenko et al., 2016; Schlauch et al., 2017) for the n viral genomes based on the matrix X. The Jaccard matrix J(X) has n rows and n columns, and each entry (i,j) is the Jaccard similarity index between the ith and jth SARS-CoV-2 genome in our dataset. Computing the first 5 eigenvectors of the Jaccard similarity matrix J(X) allows us to visualize the geographic clustering of the viral genomes, as well as to guard the logistic regression analysis against such confounding by including the first eigenvectors in the regression analysis as covariates. For the association analysis of the entire viral genome, we defined the response to be a binary indicator for the clinical outcome, where we only distinguish between all those patients/hosts whose hospitalization status tag at enrollment into the GISAID database was listed as "deceased" (outcome of 1) versus the remaining samples as non-deceased (outcome of 0). At this point, no other information regarding clinical outcome was available in GISAID. We then performed p=28,492 logistic regressions of the binary outcome variable on the following covariates: the column vector X ·i encoding the mismatches of each sample at the i'th location on the SARS-CoV-2 nucleotide sequence, patient's age, sex and the first 5 eigenvectors of the Jaccard matrix. The logistic regression was carried out in R using the default "glm" command, where the parameter "family" was set to "family=binomial(link="logit")". We tested 7 the i'th locus/location of the viral genome for association with mortality by testing whether the regression coefficient for column X ·i is equal to zero. Finally, we assessed the significance of each tested locus by comparing the individual p-values to a Bonferroni corrected threshold of 0.05/p=1.75e-06. We observed in Figure 1 that the viral genomes that carry least one mutation at 12,053bp or 25,088bp are located in distinct branches that correspond to geographic regions. Therefore, we performed two additional logistic regression analyses to take the observed clustering into account: one in which we also included "region" as a categorical covariate in the regression model, and one in which we matched the viral genomes based on their positions in the eigenvector plot (Figure 1 ). For the matched analysis, we identified the 359 viral genomes whose patient indicator variable is set as "deceased" (outcome of 1). For each of the 359 deceased samples, we identified the "non-deceased" viral genome that is closest in terms of Euclidian distance in the eigenvector plot (Figure 1) , where each deceased sample is matched to a different "non-deceased" sample. Finally, we also report results from the Fisher exact test that is applied to contingency tables. To this end, for a certain subgroup of the population (e.g., location tag "Brazil"), we determined the number of deceased and non-deceased samples which are carriers or non-carriers of the mutations at 12,053bp or 25,088bp, respectively. The resulting 2x2 contingency table was tested in R with the default "fisher.test" command, and the p-value of the Fisher test was reported. the Supplementary Information. Sequence data that support the findings of this study are deposited in the GISAID database with accession numbers in the range of EPI_ISL_402124 to EPI_ISL_541337 (https://www.gisaid.org/). 8 Figure 1 : Geographic distribution of 3,626 SARS-CoV-2 genomes. Genomes are depicted according to their first two eigenvectors of the Jaccard matrix and colored by geographic region. The eigenvector plot shows distinct grouping of SARS-CoV-2 genomes according to their geographic origin. Furthermore, genomes that carry of at least one mutation at 12,053bp or for 25,088bp (depicted by triangles) are predominantly located in a subbranch whose samples come predominantly from Pan America/South America. 9 Figure 2 : Using logistic regression, we tested the 28,492 loci of SARS-CoV-2 genome that passed quality control for association with patient/host mortality. The QQ-plot shows the expected pvalue distribution plotted against the observed p-values obtained from the logistic regression analysis. The p-value distribution we observe for the GISAID dataset follows the reference line, confirming that the analysis is conservative. The grey horizontal line depicts the threshold for genome-wide significance at the corrected Bonferroni threshold 0.05/28,492=1.75e-6. We observe that only 2 p-values exceed the threshold for genome-wide significance, precise the loci at 12,053bp and 25,088bp. SARS-CoV-2 spike protein is colored by region (blue-S1, green-S2, magenta-S2'). The S2' site is cleaved by host proteases, facilitating membrane fusion and viral entry into host cells. A mutation in this region, depicted in yellow, could theoretically increase proteolytic activity and membrane fusion, thereby causing greater infectivity. Coronavirus Susceptibility to the Antiviral Remdesivir Is Mediated by the Viral Polymerase and the Proofreading Exoribonuclease A Single Amino Acid in the Polymerase Acidic Protein Determines the Pathogenicity of Influenza B Viruses SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate Activation of the SARS coronavirus spike protein via sequential proteolytic cleavage at two distinct sites A single positively selected West Nile viral mutation confers increased virogenesis in American crows Could the D614G substitution in the SARS-CoV-2 spike (S) protein be associated with higher COVID-19 mortality? Data, disease and diplomacy: GISAID's innovative contribution to global health Phylogenetic network analysis of SARS-CoV-2 genomes Structure of the RNA-dependent RNA polymerase from COVID-19 virus The phylogenomics of evolving virus virulence ) cases in Beijing are from a genetic subgroup that consists of mostly European and South(east) Asian samples, of which the latter are the most recent Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus locstra: Fast analysis of regional/global stratification in whole genome sequencing (wgs) studies. Genetic Epidemiology locStra: Fast Implementation of A Multibasic Cleavage Site in the Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells Étude comparative de la distribution florale dans une portion des Alpes et des Jura Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein Molecular Architecture of Early Dissemination and Massive Second Wave of the SARS-CoV-2 Virus in a Major Metropolitan Area Implications of aromatic-aromatic interactions: From protein structures to peptide models Genomewide Association Studies and Assessment of the Risk of Disease NS1 Protein Amino Acid Changes D189N and V194I Affect Interferon Responses, Thermosensitivity, and Virulence of Circulating H3N2 Human Influenza A Viruses Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant FastTree 2 -Approximately Maximum-Likelihood Trees for Large Alignments Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project SARS-CoV-2 viral load predicts COVID-19 mortality Using Linkage Genome Scans to Improve Power of Association in Genome Scans Identification of genetic outliers due to substructure and cryptic relationships GISAID: Global initiative on sharing all influenza data --from vision to reality Introduction to Data Mining SARS-CoV-2 genomic variations associated with mortality rate of COVID-19 Structure of Furin Protease Binding to SARS-CoV-2 Spike Glycoprotein and Implications for Potential Targets and Virulence Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein Entry from cell surface of SARS coronavirus with cleaved S protein as revealed by pseudotype virus bearing cleaved S protein Factors associated with COVID-19-related death using OpenSAFELY Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir The Proteins of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS CoV-2 or n-COV19), the Cause of COVID-19 The authors gratefully acknowledge the contributors, originating and submitting laboratories of the sequences from GISAID's EpiCoV TM Database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) on which this research is based. A detailed list of contributors is available in