key: cord-1021731-a3x7kebl authors: Bæk, Kristoffer T.; Mehra, Rukmankesh; Kepp, Kasper P. title: Stability and expression of SARS-CoV-2 spike-protein mutations date: 2022-03-21 journal: bioRxiv DOI: 10.1101/2022.03.21.485157 sha: 33f695d364bd09e3cc745d1caa86b66cba6b164f doc_id: 1021731 cord_uid: a3x7kebl Protein fold stability likely plays a role in SARS-CoV-2 S-protein evolution, together with ACE2 binding and antibody evasion. While few thermodynamic stability data are available for S-protein mutants, many systematic experimental data exist for their expression. In this paper, we explore whether such expression levels relate to the thermodynamic stability of the mutants. We studied mutation-induced SARS-CoV-2 S-protein fold stability, as computed by three very distinct methods and eight different protein structures to account for method- and structure-dependencies. For all methods and structures used (24 comparisons), computed stability changes correlate significantly (99% confidence level) with experimental yeast expression from the literature, such that higher expression is associated with relatively higher fold stability. Also significant, albeit weaker, correlations were seen for ACE2 binding. The effect of thermodynamic fold stability may be direct or a correlate of amino acid or site properties, notably the solvent exposure of the site. Correlation between computed stability and experimental expression and ACE2 binding suggests that functional properties of the SARS-CoV-2 S-protein mutant space are largely determined by a few simple features, due to underlying correlations. Our study lends promise to the development of computational tools that may ideally aid in understanding and predicting SARS-CoV-2 S-protein evolution. The pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1] [2] [3] has led to intensive research into its spike-protein (S-protein, Figure 1a) , whose evolution may lead to changes in its surface that reduce the efficiency of vaccine-induced antibodies. [4- 6] The S-protein is a glycosylated homo-trimer on the surface of coronaviruses responsible for their characteristic shape. [7, 8] During entry of the virus particle into human host cells, the Sprotein binds to the cell-surface receptor angiotensin-converting enzyme 2 (ACE2), Figure 1b . [9] [10] [11] Due to the importance of S-protein epitopes for immune recognition, presentation of the S-protein is the rationale behind a range of important vaccines. [12, 13] The presence of prominent antibodies in a population may lead to selection pressure to change the S-protein surface to evade the antibodies learned from vaccination or infection with earlier variants. [14] Such antigenic drift may lead to variants capable of escaping vaccine-induced immunity, a problem that is likely to persist for many years. The omicron variant has been a hallmark example of such evolution, leading to a very large number of breakthrough infections in late 2021 and early 2022. [15] Understanding such evolution effects requires protein structural data. [16] [17] [18] [19] Since function is structure-dependent, the structure is the platform on which the amino acid evolution occurs, with evolution rates typically depending on the structural context of the amino acid site. [20] [21] [22] The recent technical breakthroughs in cryo-electron (cryo-EM) microscopy of large molecules such as proteins [23] [24] [25] [26] [27] are an excellent example of basic science providing essential innovation enormous utility, enabling publication of hundreds of SARS-CoV-2 Sprotein structures during the last two years, both of the protein alone in the apo-state (apo-Sprotein), in various conformation states, and in complex with a large range of antibodies and ACE2 at resolutions typically at 2-4 Å. [28] Among the selection pressures acting on a protein beyond its direct function is the need to maintain the overall fold stability and translational fidelity, and thus evolution of new mutations often occurs with a tradeoff not to undermine these properties. [17, [29] [30] [31] [32] Before fusion with a host cell, the S-protein is in a metastable conformation state under selection to evade antibodies and enhance ACE2 binding, which requires appropriate conversion to a conformation state with the receptor binding domain (RBD) in an up-wards conformation state. [33] [34] [35] [36] [37] Understanding the importance of fold stability-function tradeoffs within the S-protein could be important for estimating the tolerability of new mutations and the likelihood and fitness of new emerging variants, which will remain an important topic even in the postpandemic period. [39] To understand this, we can compute the stability effects of the many mutations possible in the RBD. Many programs exist based on machine-learning or energybased force-fields that can compute changes in protein stability upon mutation, [21, [40] [41] [42] [43] [44] [45] [46] [47] with variable accuracy and issues relating to systematic errors (biases) and results being dependent on input structures used. [42, [48] [49] [50] [51] [52] The few experimental S-protein stability data from timeconsuming differential scanning calorimetry or denaturation experiments prevent statistically meaningful analysis. Instead, we hypothesized that expression levels, which are available for very many SARS-CoV-2 S-protein mutations, may be a proxy of fold stability, as argued previously [53] and explored further below. Data for the effect of S-protein RBD single-point mutations on yeast expression and ACE2binding were published by Starr et al. [53] These mutation sites are marked in red color in Figure 1a . The effect of mutations on expression is reported as the difference in log-mean fluorescence intensity (MFI) relative to wild-type (ΔlogMFI = logMFIvariant − logMFIwild-type), such that a positive value indicates higher RBD expression. [53] The effect of mutations on ACE2-binding (Figure 1b) is calculated from the apparent dissociation constants (KD, app) and shown as the difference in log10(KD,app) relative to wild-type (Δlog10(KD,app) = log10(KD,app)wild-type -log10(KD,app)variant), such that a positive value indicates higher variant ACE2 binding. [53] There were two independent measurements (one from each of two independent libraries) for each mutation, providing an important way to assess the quality and reproducibility of the individual data points. There is an overall good correlation between the results obtained from the two independent libraries ( Figure S1a ). We removed outliers by filtering out observations where the two replicates are different (for binding or expression), defined as residuals > 1, or where data for one replicate is missing. The rationale is that data points not similar in the two experimental replicates cannot be considered reproduced and may erroneously affect analysis. We furthermore removed data points with effects on binding in either replicate < -4.5 to avoid values near the detection limit (see Figure S1a ). The correlation between the replicates after curation is shown in Figure S1b . Removing outliers and data points at the detection limit also excluded stop codon mutations ( Figure S2 ). The described curation removed 685 of the original 4221 data points. Further analyses were performed using the remaining 3536 data points and the average of the binding and expression data from the two independent libraries, with detailed data collected in the supplementary file Table_S1.csv. For each mutation in the dataset, the change in protein free energy of folding (ΔΔG, kcal/mol) was computed using three different methods: DeepDDG [54] , mCSM Protein Stability [45] (mCSM), and SimBa-IB [55] . DeepDDG and mCSM were accessed via their respective web servers (http://protein.org.cn/ddg.html, http://biosig.unimelb.edu.au/mcsm/stability), and Simba-IB was run from the command-line (http://github.com/kasperplaneta/SimBa2). We have previously shown that analysis of functional properties can depend on input structure used, [56] and such heterogeneity is also seen in published cryo-EM structures of the S-protein. [28] Accordingly, we used eight experimental cryo-EM structures of the apo-S protein from the Protein Data Bank (PDB): 6VXX, [35] 6X6P, [57] 6X79, [58] 6Z97, [59] 6ZB4, [60] 7CAB, [61] 7DDD, [62] and 7DF3, [38] to account for such heterogeneity. These structures are shown in structural overlay in Figure 1c . The relative solvent accessible surface area (RSA) of the mutated sites were calculated using SimBa-IB [55] , which uses FreeSASA for this task. [63] Because the eight PDB structures represent homo-trimers, the ΔΔG and RSA values reported in this study are average ΔΔG and RSA values for the three chains (A, B, and C) of each structure. It is important to note that the cryo-EM structures discussed may not reflect very precisely the real conformations of the S-protein at physiological temperature (37 °C): Cryo-EM structures are typically obtained with samples deposited with vitrified ice and rapidly cooled using a cryo-agent. [24, 25, 64] The freezing may remove some conformational dynamics. [65] [66] [67] [68] [69] Conformational changes of the S-protein may also be temperature-dependent. [70] In addition, the physiologically relevant state of the S-protein tends to be heavily glycosylated, which none of the experimental assays studying mutation impacts so far has addressed. Our goal is to identify whether experimental apo-S-protein data can be approximated by computed data, without translating these to the physiological significance of these data. Our main interest was to investigate whether experimentally measured expression and ACE2 binding of S-protein RBD mutants [53] can be related to the thermodynamic stability of the mutants. Amino acid substitutions often change the stability of a protein with a tendency for the distribution of such effects to be skewed towards destabilization. [71] Mutations in RBD affect expression of the S-protein and binding to ACE2 differently, but as discussed before [53] there is a relatively strong correlation (R=0.59) between the effect of mutations on expression and ACE2 binding ( Figure S3a ). This relationship is not intuitive but could relate to underlying effects such as ACE2 recognition and stability both relating to the mutating site's solvent accessibility, or correlations between codon use [72] which affects replication efficiency [73] and amino acid properties. Expression levels could also affect apparent binding constants even at the same specific ACE2 affinity, even if this is apparently accounted for, due to complex (e.g., tertiary) interactions. The large number of mutations having a small or "nearly neutral" impact, as noted by the authors [53] and reasonably expected, may also spuriously affect the relationships. However, if the data are grouped to adjust for the skewed distribution, the correlation between expression and binding becomes even larger and quite remarkable (R=0.88) (Figure S3b) . To assess the correlation between the effects of mutations on expression, ACE2 binding and protein stability, we computationally predicted the stability changes (ΔΔG) caused by the mutations in RBD using three state-of-the-art methods (DeepDDG, mCSM, and SimBa-IB) and compared with the experimentally observed changes in expression and binding. Figure 2a and Overall, there is less similarity between ACE2 binding and the computed stability changes as perhaps expected. Figure 3 shows the effect of mutation at each site on expression, ACE2 binding and predicted protein stability mapped onto the RBD structure. The effect at each site is calculated as the average of the absolute effect (without sign) of all 19 mutations as a measure of the tolerance to mutation at each site. Mutations that affect expression are mainly located in the core RBD subdomain, and in particular in the central beta sheet and its flanking alpha helices (Figure 3a) . Mutations that affect binding are located in the ACE2 binding subdomain or, similarly to mutations that affect expression, in the central beta sheet, while large parts of the RBD domain are tolerant to mutations with regard to ACE2 binding (Figure 3b) . Mutations that affect protein stability are mainly located in the structured core of the RBD subdomain, similarly to the effect of mutations on expression (Figure 3c-3e) . When evaluating the computational methods in this way, their ability to identify the most tolerant sites for mutation (or equally, the sites more likely to be neutral from an evolutionary perspective) becomes more evident than in the heat maps of Figure 2 , while the better agreement with expression remains clear. As discussed above, mutations in the RBD affect expression, ACE2 binding and protein stability differently, but with some overlap in site effects especially for protein stability and expression. To quantify this relationship, we plotted the predicted ΔΔG values for all mutations against the observed changes in RBD expression (Figure 4a ) and ACE2 binding (Figure 4b) , respectively, for each experimental structure used as input for the three methods (24 comparisons for all studied RBD mutations for expression and 24 comparisons for ACE2 binding). Correlations between RBD expression and ΔΔG for individual PDB structures are consistently observed, but their magnitude depends on the prediction method, with correlation coefficients ranging from 0.40 to 0.48 for DeepDDG, 0.27 to 0.34 for mCSM, and 0.12 to 0.22 for SimBa-IB (Figure 4a) . The observed ACE2 binding and the predicted ΔΔG also correlate, depending on the prediction method (Figure 4b) , which is in line with the correlation between experimental measures of expression and ACE2 binding ( Figure S3) . However, the correlations are weaker than for expression with correlation coefficients ranging from 0.29 to 0.37 for DeepDDG, 0. Both the binding and expression effects are highly skewed, with an overrepresentation of data points between 0 and -1 ( Figure S2) . In order to adjust for this, we grouped the data into bins of 0.5-width and 0.25-width and calculated the mean expression and binding effect in each bin, and the mean predicted ΔΔG value for each bin. Figure 6 shows these data as averages of all PDBs (data for individual structures in Figures S4-S7) . The correlation increases substantially upon binning, with each data point more well-determined, whereas the p-values decrease substantially due to the few aggregate data points after averaging. Remarkably, the computational data correlate extremely well with the binned experimental data, especially for DeepDDG, much more than normally seen. [74] Considering that the models were developed to predict fold stability effects, not expression (which also depends on effects at the nucleic acid level) and that we used the full S-protein structures whereas the experiments express RBD on the yeast surface with expected modifications, this result is very surprising. One interpretation of this result is that broader functional properties of the mutant space of the Sprotein are in fact largely determined by a few simple features, due to underlying correlations. To understand the correlations on a per-site basis, Figure 7 shows the relationships between the experimental and computed data averaged over sites, as an indicator of the site's tolerance to mutations. As this removes some amino-acid specific variations between the three computer models used, the correlations now become more similar between the methods. In all cases, there is a significant correlation (99% confidence level, p-values of linear regression), such that sites more neutral to expression and ACE2 binding effects experimentally are also more neutral towards computed stability effects. Mutations in residues that are buried in the core of a protein tend to have larger effect on protein stability, which is also the case for the RBD domain, where mutations in the core subdomain are less tolerated (Figure 3c-3e) than mutations near the surface. To quantify if surface exposure of residues in RBD correlated with the effect of the mutations on expression and binding, we plotted these variables against the RSA for each site, and we see a moderate correlation (R = 0.29 for binding, and R = 0.43 for expression) with a notable overall tolerance to mutation at sites with high solvent exposure ( Figure S8 ). As expected, there is also a good correlation between surface exposure and predicted stability changes, with correlation coefficients ranging from 0.27 (for SimBa-IB) to 0.60 (for DeepDDG; Figure S9 ). The SARS-CoV-2 S-protein fusion with human ACE2 is a prerequisite for host cell entry. [9, 10] However, S-protein antibodies from previous infection or vaccines in the population bind the S-protein and neutralize some virus particles, thus reducing infectivity. During evolution of SARS-CoV-2, these two effects need to be balanced upon changes in the S-protein. [28] Most SARS-CoV-2 evolution until the omicron variant involved positive selection for better ACE2 binding [75] , a feature maintained with omicron despite substantial evasion of existing antibodies via its many mutations in the S-protein. [76] In addition to these two, protein fold stability is commonly important during the evolution of new functionality, as it constrains the options available (function-stability tradeoffs) [29, 30, 77, 78] and has been speculated to play a role in SARS-CoV-2 S-protein evolution as well. [28] In order to understand and possibly predict SARS-CoV-2 evolution, an important health challenge, [28, 39] we evaluated such contributions of protein fold stability to the S-protein RBD mutation space. Our work shows that computed protein stability effects correlate significantly for all 48 comparisons of data sets (8 structures, 3 methods, and two properties) with expression levels and to lesser extent ACE2 binding observed in experiments. [53] Thus, experimental mutant properties can partly be predicted, and the phenotypes may correlate, possibly due to underlying correlators between e.g. codon use, amino acid chemical properties, and site solvent exposure. Such correlations could impact the mutability of the SARS-CoV-2 S-protein and affect phenotype tradeoffs, and thus ultimately the virus evolution, given that S-protein fold stability in the prefusion state is an important property of SARS-CoV-2 mutants, [33] which suggests including protein stability into a fitness function of SARS-CoV-2. [28] To the extent that protein stability maintenance is an important selection pressure, it will affect the future SARS-CoV-2 S-protein evolution, including constraints on its antigenic drift. Although single amino acid changes as analyzed above and in assays [79, 80] are likely to be not additive in real variants with multiple substitutions, due to amino acid correlation effects (within-protein epistasis), and possibly epistasis with other virus genes. [81] [82] [83] [84] These epistasis effects haunt the protein evolution field and are not easily accounted for, but recent work suggests that substantial parts of the epistasis is already utilized, [85] although this remains to be studied in future work, and does not include potential inter-gene epistasis such as e.g. processing of S-protein RNA by non-structural proteins during virus replication within the host cell. Still, the maintenance of S-protein fold stability in the lipid surface of the virion is likely to be an important constraint on ACE2 binding and antigenic drift making the heat maps studied here of interest both as a proxy of expression and S-protein stability but possibly also as a model contribution to the computational estimates of the overall fitness function of SARS-CoV-2. The supporting information pdf file contains additional data and analysis of relevance. The data set file Table_S1.csv contains the curated experimental data sets used for the study. The full curated data used in this study are available as a csv file uploaded as supplementary information (Table_S1.csv), and scripts for analyzing these data are available at the web page https://github.com/ktbaek/Spike-ACE2. A new coronavirus associated with human respiratory disease in China A novel coronavirus from patients with pneumonia in China Virology, transmission, and pathogenesis of SARS-CoV-2 The antigenic anatomy of SARS-CoV-2 receptor binding domain SARS-CoV-2 evolution during treatment of chronic infection SARS-CoV-2 variants, spike mutations and immune escape Human SARS CoV-2 spike protein mutations Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses Coronaviruses: an overview of their replication and pathogenesis Structural and functional basis of SARS-CoV-2 entry by using human ACE2 Research and development on therapeutic agents and vaccines for COVID-19 and related human coronavirus diseases COVID-19 vaccines: where we stand and challenges ahead Structural and functional ramifications of antigenic drift in recent SARS-CoV-2 variants Striking antibody evasion manifested by the Omicron variant of SARS-CoV-2 Positively Selected Sites in Cetacean Myoglobins Contribute to Protein Stability The interface of protein structure, protein biophysics, and molecular evolution Evolution and the tertiary structure of proteins A biophysical protein folding model accounts for most mutational fitness effects in viruses Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution Prediction of the stability of protein mutants based on structural environment-dependent amino acid substitution and propensity tables Structural and functional constraints in the evolution of protein families The resolution revolution in X-ray diffraction, Cryo-EM and other Technologies Cryo-electron microscopy for structural analysis of dynamic biological macromolecules Unravelling biological macromolecules with cryo-electron microscopy Cryo-electron microscopy methodology: current aspects and future directions Cryogenic electron microscopy and single-particle analysis Structure and Mutations of SARS-CoV-2 Spike Protein: A Focused Overview How protein stability and new functions trade off The evolution and evolutionary consequences of marginal thermostability in proteins Survival of the cheapest: how proteome cost minimization drives evolution Intense neutral drifts yield robust and evolvable consensus proteins The SARS-CoV-2 spike protein: balancing stability and infectivity Distinct conformational states of SARS-CoV-2 spike protein Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein Controlling the SARS-CoV-2 spike glycoprotein conformation Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation Conformational dynamics of SARS-CoV-2 trimeric spike glycoprotein in complex with receptor ACE2 revealed by cryo-EM Predicting the mutational drivers of future SARS-CoV-2 variants of concern Computational approaches for predicting mutant protein stability On the biases in predictions of protein stability changes upon variations: the INPS test case Quantification of biases in predictions of protein stability changes upon mutations Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details SDM--a server for predicting effects of mutations on protein stability and malfunction MCSM: Predicting the effects of mutations in proteins using graph-based signatures Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0 I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure Accurate stabilities of laccase mutants predicted with a modified FoldX protocol Computing stability effects of mutations in human superoxide dismutase 1 Towards a "Golden Standard" for computing globin stability: Stability and structure sensitivity of myoglobin mutants Assessing predictors of changes in protein stability upon mutation using self-consistency Systematic Investigation of the Data Set Dependency of Protein Stability Predictors Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding DeepDDG: predicting the stability change of protein point mutations using neural networks Kepp KP Data set and fitting dependencies when estimating protein mutant stability: Toward simple, balanced, and interpretable models A base measure of precision for protein stability predictors: structural sensitivity Characterization of the SARS-CoV-2 S Protein: Biophysical, Biochemical, Structural, and Antigenic Analysis Structure-guided covalent stabilization of coronavirus spike glycoprotein trimers in the closed conformation Neutralization of SARS-CoV-2 by Destruction of the Prefusion Spike Free fatty acid binding pocket in the locked structure of SARS-CoV-2 spike protein Structural basis for neutralization of SARS-CoV-2 and SARS-CoV by a potent therapeutic antibody Development and structural basis of a two-MAb cocktail for treating SARS-CoV-2 infections FreeSASA: An open source C library for solvent accessible surface area calculations Origin and temperature dependence of radiation damage in biological samples at cryogenic temperatures Cryo-temperature effects on membrane protein structure and dynamics Cryogenic temperature effects and resolution upon slow cooling of protein preparations in solid state NMR Effects of temperature on protein structure and dynamics: X-ray crystallographic studies of the protein ribonuclease-A at nine different temperatures from 98 to 320K Thermal properties of water in myoglobin crystals and solutions at subzero temperatures Pros and cons of cryocrystallography: should we also collect a room-temperature data set? Cold sensitivity of the SARS-CoV-2 spike ectodomain The stability effects of protein mutations appear to be universally distributed From SARS and MERS CoVs to SARS-CoV-2: Moving toward more biased codon usage in viral structural and nonstructural genes Codon optimality, bias and usage in translation and mRNA decay Artificial intelligence challenges for predicting the impact of mutations on protein stability SARS-CoV-2 B. 1.1.7 and B. 1.351 Spike variants bind human ACE2 with increased affinity SARS-CoV-2 Omicron variant: Antibody evasion and cryo-EM structure of spike protein\&\#x2013 Protein stability imposes limits on organism complexity and speed of molecular evolution The Influence of Selection for Protein Stability on dN/dS Estimations Complete mapping of mutations to the SARS-CoV-2 spike receptor-binding domain that escape antibody recognition Complete map of SARS-CoV-2 RBD mutations that escape the monoclonal antibody LY-CoV555 and its cocktail with LY-CoV016 The importance of additive and non-additive mutational effects in protein engineering Epistasis-the essential role of gene interactions in the structure and evolution of genetic systems Epistasis as the primary factor in molecular evolution Mutation effects predicted from sequence co-variation Epistasis at the SARS-CoV-2 Receptor-Binding Domain Interface and the Propitiously Boring Implications for Vaccine Escape Baek 1 , Rukmankesh Mehra 2 , and Kasper P E-mail: * kpj@kemi E-mail: rukmankesh@iitbhilai.ac Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding The Danish Council for Independent Research (Grant case: 8022-00041B) is gratefully acknowledged for supporting this work.