key: cord-0694393-rpaervlg
authors: Chen, Chen; Boorla, Veda Sheersh; Banerjee, Deepro; Chowdhury, Ratul; Cavener, Victoria S; Nissly, Ruth H; Gontu, Abhinay; Boyle, Nina R; Vandergrift, Kurt; Nair, Meera Surendran; Kuchipudi, Suresh V; Maranas, Costas D.
title: Computational prediction of the effect of amino acid changes on the binding affinity between SARS-CoV-2 spike protein and the human ACE2 receptor
date: 2021-03-25
journal: bioRxiv
DOI: 10.1101/2021.03.24.436885
sha: 93ffc5769512d7cac79b95e7cf2b9a5e145f4d27
doc_id: 694393
cord_uid: rpaervlg

The association of the receptor binding domain (RBD) of SARS-CoV-2 viral spike with human angiotensin converting enzyme (hACE2) represents the first required step for viral entry. Amino acid changes in the RBD have been implicated with increased infectivity and potential for immune evasion. Reliably predicting the effect of amino acid changes in the ability of the RBD to interact more strongly with the hACE2 receptor can help assess the public health implications and the potential for spillover and adaptation into other animals. Here, we introduce a two-step framework that first relies on 48 independent 4-ns molecular dynamics (MD) trajectories of RBD-hACE2 variants to collect binding energy terms decomposed into Coulombic, covalent, van der Waals, lipophilic, generalized Born electrostatic solvation, hydrogen-bonding, π-π packing and self-contact correction terms. The second step implements a neural network to classify and quantitatively predict binding affinity using the decomposed energy terms as descriptors. The computational base achieves an accuracy of 82.2% in terms of correctly classifying single amino-acid substitution variants of the RBD as worsening or improving binding affinity for hACE2 and a correlation coefficient r of 0.69 between predicted and experimentally calculated binding affinities. Both metrics are calculated using a 5-fold cross validation test. Our method thus sets up a framework for effectively screening binding affinity change with unknown single and multiple amino-acid changes. This can be a very valuable tool to predict host adaptation and zoonotic spillover of current and future SARS-CoV-2 variants.

The ongoing coronavirus disease 2019 pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) continues to be a major global challenge to public health and has caused unprecedented losses to the global economy 1 , and Brazil 6 (P.1 lineage) with increasing prevalence worldwide. 7 A few non-synonymous mutations leading to amino-acid changes in the spike protein of these variants are associated with enhanced infectivity. The enhanced infectivity of the variants is possibly through increased binding affinity of the receptor binding domain (RBD) of the spike protein with the human angiotensin converting enzyme-2 (hACE2) receptor 8 and through changes in the conformational dynamics of the spike protein. 9 Although a recent preliminary report suggests that the current vaccines can still effectively protect people from SARS-CoV-2 variants 10 , another report showed that plasma from recipients of Moderna (mRNA-1273) or Pfizer-BioNTech (BNT162b2) vaccines was significantly less effective in neutralizing SARS-CoV-2 variants encoding E484K or N501Y or the K417N:E484K:N501Y combination. 11 In addition, a significant decrease in neutralizing titers against the B.1.351 but not the B.1.1.7 UK variant, with plasma from mRNA-1273 vaccinated humans and non-human primates has been observed. 12 Hence, continued surveillance and methods to accurately predict affinity gains of the RBD-hACE2 binding event due to amino acid changes in the RBD are extremely important.

SARS-CoV-2 is an enveloped virus with a single-stranded RNA genome of approximately 30 kb size 13 . The mutation rates of RNA viruses upon replication are generally higher than DNA viruses, which could be as high as 10 −4 to 10 −3 per nucleotide incorporated 14 SARS-CoV-2 has a mutation rate of on average 7.23 mutations per sample 15 which is significantly lower than HIV and Influenza-A viruses. 1617 The simultaneous incorporation of multiple (i.e., [15] [16] [17] [18] [19] [20] amino-acid changes in a few emerging strains such as ΔFVI (Danish mink), B.1.1.7 (UK), B.1.1.54 (SA) are a cause of concern as it suggests further adaptation of the virus and fitness gains in humans and other animals. 18, 19, 20 Several of these variants involve amino acid changes in the spike protein suspected to increase transmissibility 21 , alter infectivity 22, 23 and/or escape neutralizing antibodies. 22, 24 The viral spike protein binding to the hACE2 protein is the first and crucial step in viral entry. [25] [26] [27] [28] [29] [30] [31] The spike protein makes contact with hACE2 using 16 residues of the 223 amino acid long receptor binding domain (RBD) forming multiple polar and hydrophobic interactions. 32 The binding strength between RBD and hACE2 thus directly affects infection dynamics and potentially disease progression. Starr et al. 33 exhaustively assessed the impact of single amino acid changes in the RBD of the SARS-CoV-2 quantifying the effects on RBD expression and hACE2 binding. It was revealed that most amino acid changes (i.e., 84.5%) are detrimental for RBD expression and hACE2 binding, around 7.5% of mutations are neutral but about 8% enhance hACE2 binding. The corresponding amino acid changes in RBD that lead to enhancements in binding with hACE2 can potentially become additive in their contribution to receptor affinity. Even though the RBD accounts for only 2% of the amino acid changes observed in the entire spike protein 34 , it is the target for more than 90% of the neutralization antibodies generated by humoral response 35 . Therefore, RBD is the most susceptible target to antigenic escape by amino acid changes. Consequently, amino acid changes in RBD that can increase binding affinity with hACE2 and/or adversely affect antibody neutralization have been extensively mapped by high-throughput mutational studies. 36, 37 For example, the amino acid change Y453F in the RBD present in the ΔFVI (Danish mink) variant increases the binding affinity to hACE2 by four fold 38 while also managing to partially evade the monoclonal antibody REGN10933 present in the Regeneron antibody cocktail. 39 These studies highlight the importance of monitoring single and multiple amino acid changes in the spike RBD variants and their potential for increased binding affinity with ACE2 and/or immune escape. It is important to note that binding of the viral RBD with the hACE2 receptor is a necessary step but is not sufficient to cause a productive viral infection. Proteolytic cleavage of S1/S2 and S2' sites is also needed to expose the fusion peptide enabling membrane fusion followed by viral entry at the surface or upon endocytosis. 40 Furthermore, the host cellular environment must be permissive to viral RNA genome replication, translation to proteins and assembly into new virions. 41 Computational methods can help assess the mechanistic role of the amino acid changes occurring in circulating viral variants and also predict potentially problematic amino acid changes that have not been identified so far. In a recent study, Chowdhury et al. 32 biophysically characterized the binding interactions of human ACE2 with SARS-CoV-2 and SARS-CoV, uncovering the molecular details associated with the increased infectivity of CoV-2, relative to CoV. In another effort, Mohammad et al. 42 calculated that the D614G variant has a higher computational binding interaction energy with furin. This was later experimentally corroborated revealing that the D614G change both increases RBD accessibility to binding with hACE2 43 and enhances the efficiency of furin cleavage. 44 Zhou et al. 24 performed molecular dynamics (MD) simulations and molecular mechanics/Poisson-Boltzmann surface area (MM-PBSA) analysis on the N439K variant suggesting a higher binding affinity to hACE2 and resistance to the antibody REGN10987. These findings were supported by experimental evidence for the N439K variant escaping several neutralizing antibodies including REGN10987 24 . Several studies focus on testing the effect of one or several key single mutations, but systematic methods to predict and analyze a wider multimutational landscape are still lacking. It is worth noting that Chen 45 et al. used an algebraic topology-based machine learning model to quantify the binding free energy changes of RBD from several existing CoV-2 variants. However, the performance of the method used was tested on the general SKEMPI-2.0 46 dataset and not on a SARS-CoV-2 specific dataset.

Several computational approaches have been developed to predict the effect of amino-acid substitutions on protein-protein binding affinity. Some of them directly use energies from molecular-mechanics based empirical force-fields like FoldX 47 and Rosetta 48, 49 or energies from molecular mechanicsgeneralized Born surface area (MM-GBSA) analysis of ensembles obtained from MD simulations 50 . Other methods such as SAAMBE 51 and BindProfX 52 , use a combination of physical energies and residue-level structural properties or sequence-based conservation profiles, respectively. Purely statistical potentials like BeAtMuSiC 53 and Contact potentials 54 have also been explored. Updating the weights of energy terms using experimentally determined Gbind defined as the change in the free energy of binding upon amino-acid changes (i.e., Gbind = Gvariant-GWT) has been shown to improve the prediction performance 55,56 of molecular mechanics-based force-fields such as Rosetta 49 . The recently introduced machine learning (ML) based method TopNetTree achieves a better correlation coefficient over several existing methods on two benchmark datasets AB-Bind and SKEMPI. 57 In spite of the existence of many different Gbind prediction methods, as reviewed recently 58 , performance is not always robust on unseen datasets not part of training data. The major limiting factor contributing to test set prediction inaccuracies is the paucity of experimental datasets (on Gbind) with good coverage of both types and locations of amino acid changes. 58 In this work, we first tested the predictive power of both parameterized force fields (i.e., Rosetta 49, 59 ) and detailed MM-GBSA 60-62 analysis of MD simulation trajectories with explicit water molecules treatment 63,64 (see Table 1 ) in reproducing experimental RBD-hACE2 binding affinity data reported by Starr et al. 33 Predictions using the MM-GBSA binding energies provided only partial agreement with experimental data (i.e., r = 0.33). Therefore, we next used experimental RBD-hACE2 binding energy terms to train a neural network (NN) regression model (NN_MM-GBSA) using the decomposed MM-GBSA energy terms as features and the experimental dissociation constants (KD,app) ratio between mutated variants and the wild-type as the regression target. Figure 3 pictorially illustrates the computational pipeline employed for building the model. Agreement between experiment and the NN_MM-GBSA model predictions were significantly better than raw MM-GBSA energies reaching a correlation coefficient of r = 0.69 and an accuracy of 82.24% in recovering of the effect of amino-acid changes (i.e., in classifying as improving or worsening the binding affinity). The NN_MM-GBSA model also provided good predictions on binding affinities of the RBD from currently circulating SARS-CoV-2 variants (see Table 2 ). The achieved accuracy of prediction makes this model a useful tool for the computational assessment of current/emerging CoV-2 variants. The MM-GBSA energies and the NN_MM-GBSA model are available in Github at https://github.com/maranasgroup/NN_MM-GBSA_CoV2.

The three-dimensional coordinates of the SARS-CoV-2 receptor binding domain in complex with human ACE2 were obtained from PDB entry 6LZG 65 . There exist 20 RBD residues contacting directly with hACE2 and making strong interactions at the binding interface. 33 This gives a total of 380 possible single amino-acid variants by changing each one of the 20 RBD residues into the remaining 19 amino-acids. Of these, we chose all 27 variants with an increased binding affinity and 54 variants with lower binding affinity compared to the WT. The dataset was balanced by adding another 27 variants that exhibited binding enhancement though not in direct contact with hACE2 (see Supp. Table 1 for details). The selected 108 variants formed the training dataset in this study (see Figure 1 ). All RBD variants in the dataset were computationally modeled using Rosetta 48, 49 and analyzed for changes in binding affinity with hACE2 compared to the WT RBD. Experimental data on variant binding affinities were obtained from the deep mutagenesis study by Starr et. al. 33 The study reported apparent dissociation constant KD,app ratios for all possible variants with single-amino acid changes at every RBD position. A KD,app ratio (i.e., KD,app,WT/KD,app,variant) for a variant greater than one implies stronger binding compared to WT, whereas a value less than one implies weaker binding (see also Figure 1C ). KD,app ratios can be related to changes in the free energy of binding (i.e., Gbind) as (KD,app,variant)/(KD,app,WT) = exp(-Gbind/RT). This enables direct comparison of experimental measurements with estimates in changes of free energy of binding from MM-GBSA and other computational methods (see Methods for details).

For each RBD variant, we first performed MD simulation of the hACE2-RBD complex followed by MM-GBSA analysis on frames derived from the simulation to calculate binding energies. For each variant, 48 independent initial configurations of the complex are generated by Monte Carlo minimization 66 (see Methods). Starting from each configuration, a 4ns unconstrained MD simulation was carried out which generated a total of 192ns of MD trajectory for each variant. We used a sequence of short simulations, starting from several independent configurations instead of one long simulation trajectory -as it led to faster convergence, as also reported previously. 67 3D coordinates of the complex were extracted from each trajectory upon removing the solvent molecules after every 0.1ns, generating 1,920 different frames for each variant. MM-GBSA energies were calculated for all frames (see Methods) and subsequently averaged in 240 bins chosen randomly to obtain an ensemble of eight binding energy predictions for each variant. The median value of the ensemble of eight predictions was then chosen as the predicted binding energy Gvariant of the variant. The binding energy change for each variant Gbind is then obtained by subtracting the binding energy of the WT RBD-hACE2 complex GWT. A negative Gbind value (corresponds to KD,app ratio > 1) indicates improved binding affinity with hACE2, whereas a positive Gbind value (corresponding to KD,app ratio < 1) implies lower binding affinity. Using this computational workflow, we calculated the Gbind for the balanced dataset of 108 RBD variants. We score classification predictions using the percent recovery of correct variant classification (%VC) in terms of the direction of change in the binding affinity compared to WT. Note that a balanced training set was maintained, so as to alleviate the risk of biased predictions due to having more variants with worsening or improving binding affinities. We score quantitative binding affinity prediction using the Pearson correlation coefficient r (see Methods) between predicted and experimental Gbind values. We find that (see also Table 1 ) Rosetta slightly outperforms MM-GBSA in both prediction of the direction of change (i.e., %VC) and r value. This could be because of the scaling of the energy function within MM-GBSA not being in line with values obtained from experimental measurements leading to some outliers having very large predicted values (see Figure 2 ), causing a lower r value (i.e., r=0.33) than Rosetta (i.e., 0.47). In addition to the energy function from Rosetta 49 , three other computational servers were tested for the prediction; mCSM-PPI2 68 utilizing graph-based signatures, the random forest model MutaBind2 69 trained with Molecular Mechanics energies 70 and evolutionary scores 71 , and SAAMBE-3d 72 which uses a machine learning model trained on structural features. Using MutaBind2 and mCSM-PPI2, the performance in both %VC and r-value was worse than that of MM-GBSA and Rosetta. The predictions from SAAMBE-3d had a good correlation value r but were very poor in %VC (=0.53), almost the same as random prediction. This may be due to the fact that Rosetta and MM-GBSA attain a higher fidelity in the description of the underlying biophysics by using a detailed fully atomistic description of interactions and hence are better at distinguishing improving vs. worsening variants. Note that, since the numerical values Gbind for variants improving the binding affinity are quite small (maximum of ~ -0.3 kcal/mol) compared to those worsening the binding affinity (maximum of ~ +2.5kcal/mol), both metrics need to be simultaneously high to indicate robust prediction. Nevertheless, prediction metrics %VC and r calculated for MM-GBSA (or Rosetta) do not attain values that reflect reliable quantitative prediction. We next focused on improving prediction fidelity. This was accomplished by not merely using various energy terms in an additive fashion to assemble the overall binding energy, but instead by relying on a neural network to construct a nonlinear reassortment of these energy terms. Table 1 . Comparison of prediction performance of Rosetta and MM-GBSA with the regression models trained on MM-GBSA energies. *The standard deviation obtained for the five repetitions of the five-fold cross validation and training is shown within the parentheses. The overall computational workflow is summarized in Figure 3 .

During training and assessment of the model, a five-fold cross validation procedure was followed.

During each cross-validation cycle, the 108 variants are randomly assigned to five groupings of approximately equal size. Four of these subsets are used as the training sets whereas the fifth becomes the testing set. This approach was chosen so that the testing set used to assess the prediction performance of the NN model is never used to train the predicting NN model. This 5fold cross-validation was repeated 5 times using random assignments for the testing set. This led to the construction of 5x5 = 25 independently trained NN models with an average value of r=0.69 (obtained across the 25 models) and a standard deviation of only 0.03 implying both robust and accurate prediction (see Figure 4) . Notably, the correlation coefficient of prediction improved by more than 2-fold compared to the MM-GBSA method (r=0.33) indicating that a higher order nonlinear structure, relevant to Gbind prediction embedded in the energy terms, was captured by the NN model. Correct variant classification (i.e., %VC) was also improved from 64% to 82% (see also Table 1 ). Shown in blue are variants that were correctly classified as improving (or worsening) binding affinity with hACE2 and in red are the ones that were misclassified.

As a methodological check, we also explored whether the nonlinear nature of the NN model is needed to reach the gains in prediction or whether a linear regression model could simply re-weight the energy terms in a linear fashion and achieve similar performance. We found that a linear regression model improved the correlation coefficient r from 0.33 to 0.53 in comparison with the MM-GBSA prediction method but lowered %VC from 61% to 59%. This implies that the higherorder nonlinear re-assortment of energy terms is required for reaching improved prediction fidelity. As a follow up, we also explored if the energy terms from Rosetta 49,59 could be used instead of the ones from MM-GBSA to construct a NN model of equivalent predictive ability. Note that the Rosetta energy function captures solvation effects implicitly without an explicit treatment of water molecules. We found that the gains in r and %VC for a NN model trained on the Rosetta energy terms were less than those seen when trained on MM-GBSA energies (i.e., r = 0.57, %VC=74.63). Both these analyses suggest that (1) explicit water treatment in MD simulations is important for correctly describing water-mediated hydrogen bonding and other electrostatic contacts at the interface while also enabling the sampling of a larger conformational landscape, which is necessary for capturing binding affinity changes due to nonlocal structural changes, 73 and (2) the higher-order nonlinear regression afforded by a NN_MM-GBSA model are required to reach high prediction performance.

As a further demonstration that the NN_MM-GBSA model captures variant-specific information and does not simply carry-out numerical fitting, we performed a data scrambling test. Specifically, we re-assigned the variant definition (i.e., corresponding amino-acid change) to randomly chosen input energy terms, thereby destroying any variant-specific correspondence with the input features. We gradually increased the fraction of data scrambled and re-evaluated NN_MM-GBSA model performance. We found that as the fraction of scrambled data increases, the performance of the NN_MM-GBSA model declines. The %VC drops from the original 82.24% to 49.74% (almost completely random). This test reaffirms that the NN_MM-GBSA model indeed captured variantspecific information.

Several amino-acid changes have been identified in the spike protein of several of the currently circulating SARS-CoV-2 variants. 74 Some of the underlying single amino acid mutations were part of our balanced training set (i.e., N501Y, Y453F) whereas others (i.e., K417T, K417N, E484K, S477N, L452R,S494P) were not. The predicted KD,app ratios along with experimental values (when available) and lineage names of variants that contain the corresponding amino acid changes are tabulated in Table 2 . In all cases, amino acid changes were correctly classified as improving or worsening (i.e., %VC=100) with a good quantitative agreement (see Table 2 ). Notably, variant B.1.351 has the amino acid change N501Y first seen in B.1.1.7 along with additional changes E484K and K417N in the spike. The E484K change has been shown to be responsible for evasion of neutralization by several antibodies 75, 76 whereas the N501Y change has been associated with increased binding affinity to hACE2 33 and increased transmission. 77 Both amino acid changes were correctly classified by NN_MM-GBSA as improving binding affinity (see Table 2 ). In Table 2 , we also include predictions for multiple simultaneous amino acid changes present in some circulating variants, alas experimental values are not available to this date which prevents any direct comparison. Nevertheless, significantly higher binding affinities were predicted for all tested double and triple amino acid variants present in currently circulating isolates (i.e., KD,app ratios greater than 1.26 in all four cases). In addition, we included novel predicted variants V503W+E406F and V503W+Y505W, which we arrived at by maximizing binding affinity based on an exhaustive evaluation using the less accurate but more computationally tractable Rosetta energy function. 49 Upon re-scoring using NN_MM-GBSA, we recovered high KD,app ratios (i.e., 1.28 and 1.23, see Table 2 ). It is also worth noting that the single amino acid changes E406F and V503W by themselves cause a decline in binding affinity (experimental KD,app ratio of 0.13 and 0.81) but when applied in combination lead to binding energy gains. This is probably due to a cooperative hydrophobic effect formed at the binding interface. 

NN_MM-GBSA is a two-step procedure that uses the energy terms calculated for SARS-CoV-2 RBD variants from MM-GBSA to train a neural network model for predicting the qualitative and quantitative effects of amino acid changes in the RBD of the spike protein on the binding affinity with hACE2. Using a balanced training set of 108 variants, the method achieves a Pearson correlation coefficient of 0.69 between predicted and experimental values for the KD,app ratios. In addition, the recovery of the correct effect of an amino acid change (i.e., improving or worsening binding) was 82%. We find that prediction is quite robust in terms of selection of training/testing sets with a standard deviation for the prediction or r of only 0.03. Notably, Starr et. al 33 exhaustively assessed a total of ~4,000 variants for their binding affinity whereas, in this study, we used only a small fraction of the dataset (108 variants). As we continued to add additional members to the training dataset of 108 variants, no clear improving (or worsening) trendline was observed. Clearly, prediction fidelity is better for amino acid changes in positions that include at least one variant in the balanced training dataset. We plan to continue expanding upon the list; however, most of the remaining variants involve single amino acid changes far away from the RBD-hACE2 binding interface and thus contribute progressively less to binding affinity. The true value of NN_MM-GBSA is not the assessment of variants with single amino acid changes, but the surveillance for multiple amino acid change variants. We predicted the change in binding affinity upon the amino acid changes E484K+N501Y+K417N present in B1.351 and E484K+N501Y+K417T present in P.1. We found that N_MM-GBSA predicts for both, a significantly increased affinity for hACE2 (see Table 2 ) suggesting that the effect of amino-acid changes E484K and N501Y dominate the effect of K417N or K417T which are both known to decrease the binding affinity by themselves (see Table2) .

In addition, we explored the possibility of double amino acid variants with even higher gains in binding affinity. Two such candidates (V503W+E406F and V503W+Y505W) were identified through exhaustive enumeration, using the Rosetta energy scoring function. 49 For both of them, method NN_MM-GBSA confirmed high KD,app ratios. These variants introduce large hydrophobic amino acids at the interface to enhance binding affinity with hACE2. The importance of the hydrophobic effect in protein-protein binding has been long recognized. 82 Specifically, it has also been shown that the RBD-hACE2 interface is dominated by networks of hydrophobic contacts forming strong anchors for binding. 83 A drawback of NN_MM-GBSA is that it requires a priori MD simulation of the variant under evaluation and collection of all energy terms using MM-GBSA analysis. This is computationally costly as a single calculation requires on average a total of 48 GPU-hours and 3 CPU-hours.

Ideally, one could simply use existing energy terms generated from the balanced training set of 108 variants to make predictions for novel variants. However, this would require training a neural network model with more than just energy terms as descriptors. The use of sequence and/or structural features could provide a tractable path forward in this direction.

In principle, NN_MM-GBSA can also be used to assess infection potential for other non-human hosts by SARS-CoV-2 by assessing the binding energies of the spike RBD with the animal ACE2 receptors. The structures of non-human ACE2 are generally unavailable (except for bats 84 and felines 85 ) therefore, the first step would require homology modeling of the ACE2 receptor for the examined species. Efforts along this direction have been carried out for livestock and companion animals 32 and assessments for high-risk animals are urgently needed. Assuming that training of NN_MM-GBSA using hACE2 data is robust, it could in principle be used to prospectively assess, the relative affinity of the RBD of circulating variants for various animal ACE2s. Crucially, this may provide the first indication that the viral variants have been established in non-human animal reservoir population(s). These variants can include both those already circulating and ones de novo predicted.

The 3D coordinates of SARS-CoV-2 viral spike RBD in complex with hACE2 were extracted from the crystal structure with PDB-id 6LZG. 65 The obtained WT model was first pre-processed by removing all solvent molecules and all non-amino acid residues. For each of the 108 RBD variants with single point amino acid changes, 3D coordinates were generated using Rosettascripts 86 . First, the PackRotamers mover was used to build the variants with amino-acid changes and repack the rotamers. Then, for each variant, 48 independent configurations for MD simulations were generated using the Relax 66 energy minimization protocol.

For each variant, the 3D coordinates of 48 independent configurations obtained using Rosetta (as described above) were prepared using protein preparation wizard 87 protocol of Maestro in Schrödinger suite (v2019.4). Each configuration was then solvated with water using the tip3p 64 model in an orthorhombic box with 10 Å buffer distance in each dimension. The residual charges were neutralized by adding Na + and Cl − ions at a salt concentration of 0.15 M. The solvated systems were minimized and pre-equilibrated using the default relaxation protocol of Desmond 88 followed by a 4-ns production run using the amber99sb-ildn 63 force field at 300 K and 1 atm. The NPT ensemble with periodic boundary conditions using particle mesh Ewald 89 for long-range interactions. A time step of 2.0 fs was used and a cutoff distance of 9.0 Å was chosen for nonbonded interactions.

For each variant, the 4-ns trajectory for each of the 48 configurations was sampled at an interval of 0.1 ns, generating 1920 snapshots in total. For each snapshot, the Prime/MM-GBSA analysis 62 was performed using thermal_mmgbsa.py script from the Schrödinger suite. The MM-GBSA analysis produces the binding energy and its constituent eight individual energy terms i.e., Coulomb energy, Covalent binding energy, van der Waals energy, Lipophilic energy, Generalized Born electrostatic solvation energy, Prime energy, Hydrogen-bonding correction, π-π packing correction, and Self-contact correction. Another set of values for these nine terms have also been calculated by not accounting for receptor and ligand conformational changes needed to form the complex. Due to a high degree of variation in the energies, we averaged data from 240 snapshots to produce a single set of averaged energy terms in the dataset. Thus, a total of 1920 snapshots generate eight sets of averaged energy terms for each variant. In total, these 18 energy values were utilized as the input features for NN construction.

A. Dataset Generation: MM-GBSA analysis was used to generate 18 energy components (as described above) which are fed as the input features for the NN_MM-GBSA model. Each input energy term across the whole data set was scaled independently to have zero mean and a variance of one. The output target was set to the experimental apparent dissociation constant KD,app ratios, (KD,app)variant/(KD,app)WT. The experimental data for the 108 RBD variants (see Supp. Table 1 for list) was obtained from Starr et. al. 33 B. Model Architecture: The neural network has a single input layer, a single output layer and four fully connected hidden layers with 144 nodes per layer, forming 18-(144-144-144-144)-1 structure. The rectified linear unit was used as the activation function for all of the hidden layers and the dropout regularization method was applied to hidden layers, with a dropout rate of 0.5 for the first and last hidden layers, and 0.75 for the rest.

C. Model Training: The model was trained through backpropagation to minimize the mean squared error between predicted KD,app and target KD,app values. Adam 90 optimizer was used to perform the backpropagation with a learning rate of 0.001 and weight decay of 0.005. The training was performed for 2,000 epochs including the entire training data in each batch.

D. Model Evaluation: We used a five-fold cross validation protocol to evaluate the model, by splitting the whole dataset into 5 subsets. In one complete evaluation cycle, each of the 5 subsets was used as a test set once and the rest constituted the training set and a total of 5 such cycles were performed. Two metrics were employed to quantify the performance of NN_MM-GBSA model predictions: % correct variant classification (%VC), and the Pearson correlation coefficient (r). %VC is the percentage of instances in which a variant is classified correctly as increasing or decreasing the binding affinity compared to WT. The Pearson's correlation coefficient is defined as: F. Implementation: All codes were developed in Python using the PyTorch library.

The complexes for 108 RBD variants were subject to Relax 66 with harmonic constraints to prevent the structure from deviating significantly from the crystal structure. During Relax, rotamers of amino acid residues within 8 Å of the mutated amino acid were only allowed to repack (local packing). All default parameters were used for Relax with the ref2015 energy function. 49 . At the end of Relax, a gradient minimization is performed using lbfgs_armijo algorithm for 2000 steps after which the relevant metrics of binding were calculated using InterfaceAnalyzer. 91 The binding energy, Gvariant of each variant is calculated as the average of dG_separated scores obtained from 30 independent Relax simulations. For each variant, a WT binding energy, GWT is calculated using the same protocol by making a dummy amino acid change (change amino acid to itself).

Finally, the change in binding energy Gbind is calculated as Gbind = Gvariant -GWT.

CC, VSB, and CDM conceived, and designed the study. CC performed the MD simulations and built the NN models. VSB performed Rosetta calculations. DB built the linear regression models. All the authors wrote and approved the study.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

If the world fails to protect the economy, COVID-19 will damage health not just now but also in the future

Ecology and economics for pandemic prevention: Investments to prevent tropical deforestation and to limit wildlife trade will protect against future zoonosis outbreaks

Presented to SAGE on 21/1/21 Authors -Peter Horby

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations -SARS-CoV-2 coronavirus / nCoV-2019 Genomic Epidemiology -Virological

Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa

Genomic characterisation of an emergent SARS-CoV-2 lineage in Manaus: preliminary findings

Fast-spreading COVID variant can elude immune responses

SARS-CoV-2 D614G spike mutation increases entry efficiency with enhanced ACE2-binding affinity

D614G Spike Mutation Increases SARS CoV-2 Susceptibility to Neutralization

Neutralizing Activity of BNT162b2-Elicited Serum -Preliminary Report

mRNA vaccine-elicited antibodies to SARS-CoV-2 and circulating variants 2 3 4. bioRxiv

mRNA-1273 vaccine induces neutralizing antibodies against spike mutants from global SARS-CoV-2 variants

The Architecture of SARS-CoV-2 Transcriptome

Principles of Molecular Virology: Sixth Edition

Geographic and Genomic Distribution of SARS-CoV-2 Mutations

The coronavirus is mutating -does it matter?

Comparative Review of SARS-CoV-2, SARS-CoV, MERS-CoV, and Influenza A Respiratory Viruses

Infection and Rapid Transmission of SARS-CoV-2 in Ferrets

SARS-CoV-2 and the human-animal interface: outbreaks on mink farms

The B1.351 and P.1 variants extend SARS-CoV-2 host range to mice

Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England

The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity

SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity

N439K variant in spike protein may alter the infection efficiency and antigenicity of SARS-CoV-2 based on molecular dynamics simulation

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein

Receptor recognition by novel coronavirus from Wuhan: An analysis based on decade-long structural studies of SARS

MRNA destabilization by BTG1 and BTG2 maintains T cell quiescence. Science (80-. )

Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses

Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus

SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor

Computational biophysical characterization of the SARS-CoV-2 spike protein binding with the ACE2 receptor and implications for infectivity

Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding

Human SARS CoV-2 spike protein mutations

Mapping Neutralizing and Immunodominant Sites on the SARS-CoV-2 Spike Receptor-Binding Domain by Structure-Guided High-Resolution Serology

Prospective mapping of viral mutations that escape antibodies used to treat COVID-19

Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition

The SARS-CoV-2 Y453F mink variant displays a pronounced increase in ACE-2 affinity but does not challenge antibody neutralization

REGN-COV2 antibodies prevent and treat SARS-CoV-2 infection in rhesus macaques and hamsters. Science (80-. )

Coronavirus membrane fusion mechanism offers a potential target for antiviral development

An overview of their replication and pathogenesis

Higher binding affinity of furin for SARS-CoV-2 spike (S) protein D614G mutant could be associated with higher SARS-CoV-2 infectivity

D614G Spike Mutation Increases SARS CoV-2 Susceptibility to Neutralization

D614G Mutation Alters SARS-CoV-2 Spike Conformation and Enhances Protease Cleavage at the S1/S2 Junction

Mutations Strengthened SARS-CoV-2 Infectivity

SKEMPI 2.0: An updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation

Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations

Rosetta3: An object-oriented software suite for the simulation and design of macromolecules

The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design

Applying Physics-Based Scoring to Calculate Free Energies of Binding for Single Amino Acid Mutations in Protein-Protein Complexes

SAAMBE: Webserver to predict the charge of binding free energy caused by amino acids mutations

Assessing Mutation-Induced Binding Affinity Change by Protein Interface Profiles with Pseudo-Counts

BeAtMuSiC: Prediction of changes in protein-protein binding affinity on mutations

Intermolecular contact potentials for protein-protein interactions extracted from binding free energy changes upon mutation

Flex ddG: Rosetta Ensemble-Based Estimation of Changes in Protein-Protein Binding Affinity upon Mutation

Rosetta custom score functions accurately predict ΔΔ: G of mutations at protein-protein interfaces using machine learning

A topology-based network tree for the prediction of protein-protein binding affinity changes following mutation

Finding the ΔΔ G spot: Are predictors of binding affinity changes upon mutations in protein-protein interactions ready for it?

Rosetta3: An object-oriented software suite for the simulation and design of macromolecules

A hierarchical approach to all-atom protein loop prediction

On the role of the crystal environment in determining protein side-chain conformations

The VSGB 2.0 model: A next generation energy model for high resolution protein structure modeling

Improved side-chain torsion potentials for the Amber ff99SB protein force field

Comparison of simple potential functions for simulating liquid water

Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2

Relaxation of backbone bond geometry improves protein energy landscape modeling

End-Point Binding Free Energy Calculation with MM/PBSA and MM/GBSA: Strategies and Applications in Drug Design

MCSM-PPI2: predicting the effects of mutations on protein-protein interactions

MutaBind2: Predicting the Impacts of Single and Multiple Mutations on Protein-Protein Interactions

CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data

Predicting the Functional Effect of Amino Acid Substitutions and Indels

SAAMBE-3D: Predicting effect of mutations on protein-protein interactions

End-Point Binding Free Energy Calculation with MM/PBSA and MM/GBSA: Strategies and Applications in Drug Design

The effect of the D614G substitution on the structure of the spike glycoprotein of SARS-CoV-2

Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science (80-. )

Molecular determinants and mechanism for antibody cocktail preventing SARS-CoV-2 escape

Quantifying the transmission advantage associated with N501Y substitution of SARS-CoV-2 in the UK: an early data-driven analysis

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Emergence of a Novel SARS-CoV-2 Variant in Southern California

SARS-CoV-2 infection in farmed minks, the Netherlands

SARS-CoV-2 lineage B.1.526 emerging in the New York region detected by software utility created to query the spike mutational landscape

Principles of protein-protein interactions: What are the preferred ways for proteins to interact?

Enhanced receptor binding of SARS-CoV-2 through networks of hydrogen-bonding and hydrophobic interactions

Cross-species recognition of SARS-CoV-2 to bat ACE2

Broad host range of SARS-CoV-2 and the molecular basis for SARS-CoV-2 binding to cat ACE2

RosettaScripts: A Scripting Language Interface to the Rosetta Macromolecular Modeling Suite

Protein and ligand preparation: Parameters, protocols, and influence on virtual screening enrichments

Scalable algorithms for molecular dynamics simulations on commodity clusters

Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems

Adam: A method for stochastic optimization

A comparison of successful and failed protein interface designs highlights the challenges of designing buried hydrogen bonds

The authors declare no competing financial interests.

Supplementary data to this article are given in supp_file.pdf