key: cord-0900811-d25d4uex authors: Sorokina, Marija; Belapure, Jaydeep; Tüting, Christian; Paschke, Reinhard; Papasotiriou, Ioannis; Pglm Rodrigues, João; Kastritis, Panagiotis L. title: An electrostatically-steered conformational selection mechanism promotes SARS-CoV-2 Spike protein variation date: 2022-05-17 journal: J Mol Biol DOI: 10.1016/j.jmb.2022.167637 sha: f7f8fde808036ca1b4a4f402490fa01184923d86 doc_id: 900811 cord_uid: d25d4uex After two years since the outbreak, the COVID-19 pandemic remains a global public health emergency. SARS-CoV-2 variants with substitutions on the spike (S) protein emerge increasing the risk of immune evasion and cross-species transmission. Here, we analyzed the evolution of the S protein as recorded in 276,712 samples collected before the start of vaccination efforts. Our analysis shows that most variants destabilize the S protein trimer, increase its conformational heterogeneity and improve the odds of the recognition by the host cell receptor. Most frequent substitutions promote overall hydrophobicity by replacing charged amino acids, reducing stabilizing local interactions in the unbound S protein trimer. Moreover, our results identify “forbidden” regions that rarely show any sequence variation, and which are related to conformational changes occurring upon fusion. These results are significant for understanding the structure and function of SARS-CoV-2 related proteins which is a critical step in vaccine development and epidemiological surveillance. The COVID-19 pandemic triggered a worldwide health crisis, claiming over 3.3 million lives within 16 months and causing substantial damage to global economy (https://www.imf.org/en/Publications/WEO/Issues/2021/03/23/world-economic-outlookapril-2021). Although preventive methods such as social distancing, face covering, testing, and tracing can mitigate spread of the virus, only the achievement of herd immunity through vaccination can put an end to the pandemic in a timely way [1, 2] . Development of vaccines started as early as January 2020, enabled by rapid breakthroughs in genome sequencing and structural biology, leading to the first mass vaccination programs starting in (https://www.nytimes.com/interactive/2020/science/coronavirus-vaccine-tracker.html), which mediates entry into the host cell [3] through diverse mechanisms [4] . Although SARS-CoV-2 sequence diversity is very low [5] , natural selection can lead to the appearance of favorable S protein mutations that confer viral fitness advantages and pose a threat to vaccine effectiveness [6] . The S protein is a highly glycosylated homotrimeric transmembrane protein [7] that enables binding and fusion of the virus with the host cell [4] . Each S protein monomer consists of nearly 1300 residues divided in two furin-cleavable subunits, S1 and S2 [8] . Additionally, SARS-CoV-2 S possesses a second cleavage site -S2' -for the TMPRSS2 protease, which is needed to mediate membrane fusion [9] . The S1 subunit is responsible for binding to the host cell receptor, the angiotensin converting enzyme 2 (ACE2) [10] [11] [12] , a zinc-dependent peptidyl dipeptide hydrolase [13] . The recognition happens via the receptor binding motif (RBM) [14] , which is part of the receptor binding domain (RBD) [15] . In its metastable prefusion state, the S protein exists predominantly in two distinct conformations, "closed" and "open", that have been structurally resolved through cryo-EM [16] . In the "closed" conformation, the S protein cannot bind to the ACE2 receptor. Thus, to enable the binding, S1 must undergo conformational changes that expose the RBD and bring the protein into the "open" conformation. The S2 is responsible for the fusion of the viral and host cell membranes, a mechanism underlined by a complex series of events [17] . In summary, the S protein evolved to compartmentalize functional aspects across several the annotated structural domains, including the receptor binding domain, cleavage sites and fusion-related regions (Figure 1 (ae)) [18] . Slight changes in the ACE2 sequence occurring within humans [19] or across species [20] , can not only have influence on disease severity, but also on viral infectivity: Largescale structural modeling efforts targeting ACE2 variants aim to illuminate slight changes in ACE2-S protein interface and comprehend critical interface residues and biophysical properties involved in the recognition [21] [22] [23] [24] . Obviously, not only variations of the ACE2 protein, the cell receptor, but also variations across SARS-CoV-2 proteins, especially the S protein [25] [26] [27] [28] can have a vast effect on viral infectivity [6, 25, 27] and severity [29, 30] . Although ongoing efforts are connecting structure-function relationships of S protein variants as compared to the Wuhan strain [26, [31] [32] [33] [34] [35] , a systematic, large-scale statistical, biophysical, and structural analysis of S protein variants is missing. This is further complicated by the different conformations that S protein acquires before, during and after recognition by the ACE2 receptor. Here, we systematically analyzed all S protein substitutions at the amino acid residue level. S protein variants were extracted from SARS-CoV-2 sequencing projects deposited in the public database GISAID [36] (Global Initiative on Sharing All Influenza Data) until the 2 nd of January 2021. The end date was chosen to select substitutions that occurred before ongoing vaccination campaigns that started in December 2020. Our results show "forbidden" regions for residue substitutions and a common biophysical denominator for their selection, underlined by a "steered" conformational selection mechanism for subsequent biomolecular recognition [37, 38] . In this type of recognition, unbound structural ensembles shift their energy distributions towards less favorable energies upon an external trigger to achieve and facilitate protein-protein recognition, lowering energy barriers [38] . Our results have a direct impact on understanding S protein variation as well as on current and future efforts for deriving vaccines and therapeutics. Three distinct sequence regions within S2 are strictly conserved. We downloaded all 311,255 SARS-CoV-2 S protein sequences deposited in the public database GISAID before the 2 nd of January 2021 and considered those with more than 95% sequence coverage (N=276,712). These sequences account for a total of 505,403 amino acid changes, of which nearly half (N=257,552) corresponded to the D614G "mutation", a substitution that became dominant after July 2020 [6] . For each substitution present in the dataset, we calculated its percentage of occurrence (p.occ.) considering all possible substitutions (N=24,187) ( Figure S1 ). This analysis revealed that almost every residue across the S protein has been substituted at least once ( Figure S1 ), except for 44 specific residues C738, S746, C749, G757, F759, and F800 do not exhibit variation and are all localized within the UH, which shields the postfusion helical trimer [39] . Interestingly, the role of F800 is unclear and is not resolved in the postfusion structure [39] . HR1 and Heptad Repeat 2 (HR2) mediate membrane fusion [39] and are known to be conserved within SARS-CoV-2 lineages and other CoVs [18, 40] . However, we clearly observe that HR2 is undergoing selection pressure, indicated by the absence of fully conserved residues (Figure 1(e) ). This contrasts with the HR1 domain, where various residues were not observed to be substituted (Figure 1(e) , Figure S2 (a)). The adjacent CH domain has 4 residues that were never observed to be substituted, i.e., E988, R995, A1025, K1028. These residues are embedded in the interface of CH and UH, in very close proximity to the UH conserved residues and co-localize at a specific lateral position across the postfusion conformation ( Figure S2(b) ). Lastly, CD also includes residues that remain conserved, which are again found near UH and CH, pointing to a critical stabilizing interaction located at the inter-chain interface ( Figure S2(b) ). Interestingly, C1082 and C1126, both present in the CD, form a cysteine bridge and were never observed to be substituted ( Figure S2(b-d) , Figure S3 (a)). Rare to none substitution events occur for the rest of the cysteine bridges ( Figure S3(a) ). Glycosylation sites are also very conserved, with the exception of T323 exhibiting slightly higher p.occ. (Figure S3(b) ). Functionally, the localization of the UH, HR1/CH and CD regions also correlates with crucial conformation changes connected to (a) the opening of the RDB; and (b) the prefusion to postfusion extensive reorganization of the S2. For (a), molecular dynamics simulations of the closed and open S protein conformations showed that a cavity appears in the prefusion state, formed by HR1/CH and CD regions [41] . This cavity formed by domains of the S2 subunit was observed to be more rigid than regions of the S1 subunit [41] and is suggested to allosterically modulate the opening of the RBD domain through a "bouncing spring" mechanism [41] . For (b), conserved residues are initially in proximity to each other in a condensed coiled coil formed by CH-HR1, but after fusion, the coiled-coil is extended in coronaviruses ( Figure S2(b) ) [42] , which implies their active role in the conformational rearrangement. These findings are now further corroborated by our conservation analysis, particularly since mutation rates of RNA viruses are up to a million times higher than that of their hosts [43] . Since the virus can escape neutralizing antibodies by just a single amino acid substitution [44, 45] , identifying protein regions with extensive, localized conservation is extremely important for designing drugs against SARS-CoV-2 and for vaccine development. substitutions. Although these highly conserved sites are critical for drug design, frequent substitutions are also equally important since they enable viral immune and drug escape [46, 47] or increase viral fitness. Viral genetic diversity has to be steadily monitored to swiftly adapt vaccine and therapeutic drug development to emerging CoV-2 variants which appear through accumulation of nucleotide variations [48] . Our frequency analysis shows that most of the substitutions are not persisting, having very low percentages of occurrence (p.occ.), most being below 0.5% (Figure 2(a) ), probably partly due to CoVs proofreading function [49] , and partly because many of the substitutions do not confer substantial fitness advantages and therefore do not persist. Only 23 residues have p.occ. ≥ 0.5% and are annotated in Figure 2 (a). Interestingly, 18 of these residues are located on the S1 subunit, formed by residues M1-R685 with 4 residues (N439, Y453, S477, N501) located in the receptor binding motif (RBM) of the RBD (Figure 1 (e), Figure 2 (b)). We also observe residue clustering (N=9) in the NTD (Figure 1 (e), Figure 2 (g)). Another mutation of interest, which is placed at the NTD, is A222V. This mutation first appeared in July 2020 [50] and was present in over half of the sequences by November 2020 (Figure 2(c) ). Despite becoming widespread, it remains mysterious for its structural or functional impact, although 2 reports suggested that it does not affect antigenicity [51] and has minimal impact on viral entry [50, 51] . The high frequency substitutions present in S1 should affect the unbound and bound states of the S protein and not the postfusion state. Residues with p.occ. ≥ 0.5% (N=23) are not only participating in characterized substitutions across lineages, but also in variants not extensively described to date, which comprise a large portion of the retrieved data (Figure 2 (c-e), Figure S3 (c), Figure S4 ). Multiple substitutions per residue are also observed ( Figure S3(c) , Table S1 ). Many identified substitutions influence monoclonal antibody (mAb) escape or alter binding affinity for ACE2, especially those located on the RBD and include: (a) S477N. S477 is located within a flexible loop (residues A475-G485) [52] and exhibiting the highest local flexibility [52] . Most often S477 substitutes to asparagine (S477N) [50] and its relative population peaked in July 2020 (p.occ. ~27.5%), remaining still high (p.occ. ~5.2%) by the end of 2020 (Figure 2(c) ). This substitution was shown to strengthen binding of S protein with the human ACE2 receptor [52, 53] and to be resistant against multiple mAbs [54] . Additionally, S477 was observed to mutate to arginine (S477R) and isoleucine (S477I) (Figure 2(e) ). As such, S477 substitutions may promote viral transmission [55] and possibly increase affinity with ACE2 [56, 57] , although the biophysical basis of these effects is unknown and warrants further structural characterization. (b) Y453F and (c) N439K. Y453F [58] is known as "cluster five", which arose among farmed minks in Denmark and had its frequency peak in November 2020 (p.occ. ~1.4%, Figure 2 (d)). Y453F is located in the interaction interface with ACE2 and increases binding affinity by 4-fold but does not alter inhibition potency in convalescent sera [53, 58] . Both Y453F and N439K may potentially be escape substitutions [59] . (d) N501Y . The most prominent substitution of N501 is N501Y (Figure 2(c) [60] . Another frequent substitution is N501T (Figure 2(e) ) which was identified as one of the dominating mutations in minks in the USA [61] . Hydrophobicity as a driving force for S protein variation at the primary structure level. We hypothesized that variation in amino acid substitutions across the S protein sequence may be driven by certain physical-chemical properties, since certain physicalchemical rules exist for rationalizing binding affinity of protein-protein interactions [62, 63] . To discover possible global effects, we first compared hydrophobicity between the Wuhan strain and 148 amino acid substitutions with a spectrum of measured frequencies. These substitutions were selected based on their Jan-Dec 2020 time series profiles as shown in Figure 2 (c-e) and Figure S4 . The overall distribution of hydrophobicity across the S protein shows 3 prominent sets, at low, near-neutral and high hydrophobicity values (Figure S5(a) ). These 3 distinct sets are recapitulated for a subset of the residues from the Wuhan strain that corresponds to the wild-type residues for the selected 148 substitutions (Figure 3(a) ). In short, an obvious shift towards hydrophobicity is detectable upon mutation (Figure 3(a-c) ). The shift is underlined by a disappearance of set 1, decrease of set 2, prominent increase of set 3 (Figure 3(a-c) ) and is substantial when either all 148 selected residues are considered (Figure 3(a) ) or relatively low frequency substitutions (p.occ. < 0.5%) (Figure 3(b) ). For the group of high frequency substitutions (p.occ. ≥ 0.5%) a similar trend is observed, with the prominent appearance of set 3, but shift was not substantial (Figure 3(c) ). Certain percentages are calculated for the S protein sequence of the Wuhan strain when considering the type of amino acid (Figure S5(b) ). Percentages are also derived in a similar range for the Wuhan strain residues of the selected residues that are substituted in the different groups ( Figure 3 (d-f)). Their mutated equivalents exhibit a substantial increase of non-polar amino acids, reducing the proportion of charged and polar residues. This result holds true for both low-and high-frequency groups of the substitutions (Figure 3 (e-f)). Overall, there is a clear shift toward hydrophobicity for all considered groups of substitutions that points to a possible denominating biophysical effect of specific origin. Structure-based analysis of S protein variation reveals suboptimal scoring across conformational states for the Wuhan strain. Protein variation and selection is underlined by a plethora of factors [64] , some of which can be derived from structural data [65] . Therefore (N=148) and postfusion [39] (N=31) (Figure 1(a-d) ). As a control, the Wuhan strain S protein was included. We refined all structural models via short molecular dynamics simulations using the HADDOCK server [69] and HADDOCK scores, along with its components, were retrieved Figure S6 ). This result shows that an introduced variation in the S protein sequence is not expected to majorly impact the S protein structure, but rather finely regulate interface non-covalent interactions related to HADDOCK scoring components. Interestingly, calculated scores of Wuhan strain S protein states most of the times falls within the average values of the calculated distributions (Figure 4) , but in two cases it considerably shifts towards lower or higher values (Figure 4(b-c) ). One might argue that slight score shifts are close to noise, but, within these unit shifts, HADDOCK scoring can recapitulate binding affinities of protein-protein interactions within experimental inaccuracies [62, 63, 70] . This observation for the Wuhan strain S protein indicates that its initial sequence was suboptimal in terms of either stability or conformation if scores are to be interpreted as stability or affinity proxies; This means that additional mutations show better scores, overall stabilizing the unbound states, binding to the human ACE2 host receptor and also its fusion. Our observations rely on the HADDOCK score and its components that are not absolute energetic values, but rather indicative of effects of different scoring components that each corresponds to a physical (e.g., van der Waals, electrostatics, desolvation energy) or structural (e.g., buried surface area, BSA (Å 2 )) contribution. To corroborate our analysis, we implemented similar approach using Rosetta [71] . Energy scores for each HADDOCK refined model with single substitution were re-calculated with Rosetta and its own scoring function. As the scores from these two methods are in different scales, rather than comparing the absolute values 1-to-1 we can only compare if they both follow a similar trend (positive, negative or zero slope). A Bayesian-Ridge linear regression model is fitted for every pair of similar type of scores from the two methods. Correlation coefficient R 2 and p-value (for null hypothesis that the slope is zero, using Wald Test with t-distribution) are calculated and quoted in respective subplots ( Figure S7 ). The results show the strongest correlation between the scores from the two methods for buried surface area (BSA, Å 2 ), followed by the electrostatics score for closed and open conformational states ( Figure S7(a,b) ). However, for ACE2-bound and postfusion states, the strongest correlation is seen for BSA and desolvation ( Figure S7(c,d) ). The results from this assignment indicate that both methods robustly capture the electrostatics, desolvation, and BSA scores in closed and open states corroborating that performed HADDOCK scoring is not biased towards its own scoring. To disentangle possible scoring components that may be important for the studied substitutions, structures were modelled by introducing corresponding substitutions, followed by a refinement procedure as detailed in the Materials and Methods. All models were ranked corresponding to their increasing probability of occurrence (p.occ.), and split into two groups, low-and high-frequency, using a variable p.occ. value. The p.occ. value was discretely changed between a range of 0.04% ≤ p.occ. ≤ 2.0% in incremental step size of 0.02%. At every step, the two groups thus formed, were compared with each other, and statistically treated by applying (a) a t-test (Figures S8, S9 ) and (b) Bayesian methods ( Figure S10 ). For the Bayesian analysis the groups were formed at slightly broader step size of 0.1% in the range of 0.1%≤ p.occ. ≤ 2.0%. Number of mutations per step is reported in Table S7 . Interestingly, the HADDOCK score, which is a combination of reported non-covalent forces (van der Waals, desolvation, electrostatics) shows substantial decreased values with increased p.occ. thresholds towards lower p-values, specifically for unbound closed and open states ( Figure S8 ). This behavior is derived from the electrostatics scoring component. states (Figure 6 (a), Figure S8 ) exhibit substantial decrease and not those calculated for the bound ( Figure S8 ) and postfusion states ( Figure S9 , Figure S11(a) ). In addition, the difference in average electrostatics scoring components between consecutive groups is negative. This translates to higher electrostatics scores in the high-frequency groups, pointing to, possibly, a global, destabilizing effect underlying selection of the substitutions ( Figure 6(b) ). This effect is not observed when the bound state or the postfusion conformation are taken into account, although increasing destabilizing effects are observed for the bound state ( Figure S11(b) ). Electrostatics scoring components may capture long-range charge interactions in a protein complex [62, 72] . Therefore, we calculated all distances between Cα atoms of residues in the models and grouped the substitutions according to their inter-chain proximity. Results show that residue variation is not localized in the interface but is well-distributed across the S protein, namely inside interfaces and rim regions (d Cα-Cα ≤ 15.0 Å) and non-interacting surfaces Statistical analysis for each of the classes (N interface , N non-interface ) as function of increased frequency (as performed before, e.g., for Figure S10 -S11) shows that substitutions proximal to the interfaces are involved (N interface ), (Figure 6 (c), Figure S12 ), and not the residues localizing further ( Figure S13 ). It is of note that, again, electrostatics is the single reliable scoring component for the group of residues proximal to the inter-trimeric S protein interface with increasing p.occ. thresholds (Figure 6 (c)). To critically assess our observation from the above p-value trends, we applied a thorough Bayesian parameter estimation method to calculate the effect size/s as described in Materials and Methods. Based on the Bayesian analysis, we see that in the closed state the electrostatics score predominantly shows a significant effect size irrespective of the p.occ value. In other words, Electrostatics is a significant contributor but only in the closed conformational state of the S protein ( Figure 6 (df)) and is calculated not to dependent on interface proximity ( Figure S10 ). Given the fact that in the high-frequency group (p.occ. ≥ 0.5%) only around a quarter of the residues (27%) (Figure 3 To further look into the scoring data, substitutions were categorized into two distinct classes, namely those with low (p.occ. < 1.0%) and high (p.occ. To understand the structural effects of electrostatics in closed and open unbound states, substitutions were visualized after refinement and compared to the Wuhan strain S protein trimer (Figure 7(a-d) ). As expected, our calculations directly show the loss of an intertrimeric interface salt bridge when D614 is mutated to a G (Figure 7(a) ). In addition, K854 forms a salt bridge with D614 of the neighboring chain [39] . This salt bridge is abolished through the most common D614G mutation which is known to increase infectivity of SARS-CoV-2 [6] . K854, in the fusion peptide proximal region (FPPR), supports the closed conformation [39] , further indicating importance of the FPPR in regulating the opening and closing of the RBD. Cryo-EM data have shown that such an effect leads to a destabilization of the closed state [73] as also shown by the presented calculations (Figure 6 (b), 6(e)). In addition, calculations show a drastic effect on the electrostatics scoring component, being one of the most destabilizing substitutions together with D1118 ( Figure 5) . Interestingly, D1118H removes a negative charge from the side chain, therefore removing a formed salt bridge with R1091, and replaces it with a protonatable group (Figure 7(b) ). The newly introduced H1118 still interacts with R1091, but the polar interaction formed is weaker due to possible protonation effects of His and, therefore, introduction of similar charge that would lead to severe repulsion. The A222V variant remains however, structurally unclear and subtle localized electrostatic effects are influenced. This is because when closed or open states of S protein are investigated, the localized interaction network of A222 does not considerably change. However, due to the marginally longer side-chain of V and its higher hydrophobicity, hydrophobic atoms across the localized region tend to come closer as indicated by the presented atom-atom distances of side chains which upon mutation, overall, decrease (Figure 7(c) ). This effect may well simulate a localized atom-atom hydrophobic attraction that is translated into decreased polarity, and therefore, electrostatics score in the corresponding calculations ( Figure 5 ). An interesting outlier is the A570D variant, which provides contradictory results as (Figure 8(b) ). These observations indicate that the FPPR domain which contains the N856 residue is very flexible and is displaced upon opening of the RBD. Interestingly, each distribution for the open conformation, where RBD of chain A is turned upwards, is relatively narrow and has a clear maximum (Figure 8(a-b) ), whereas all distributions of the closed state have many peaks. Therefore, we can assume, based on the calculations and derived distributions that there is a preferred, most populated state in the open conformation but not in the closed one. Next, we performed molecular dynamics simulations of the A570D S protein variant in closed and open conformational states in triplicates (Figure 9(a,b) , Figure S18(a,b) ). These In this work we analyzed all SARS-CoV-2 genomes deposited in GISAID [36] up to January 2021. This date is critical to investigate only substitutions representing vaccineunhindered virus adaptation. The S protein sequence was specifically investigated and its naturally occurring sequence variation. We have first observed that there are "hot" and "cold" spots for variation and showed that 44 positions across the S protein sequence have never undergone selection pressure during this timeline. Such "cold" spots, which are independent of viral adaptation, are mainly clustered in 3 regions (UH, HR1/CH and CD) of the S2 subunit and are correlated to important structure-based regions for biomolecular folding, function, and recognition. On the contrary, the most frequently substituted residues, the "hot" spots, are mainly located in the S1 subunit and especially in the NTD region. Therefore, the "cold" regions are of uttermost importance for development of therapeutics which might cover a broader spectrum of SARS-CoV-2 variants, whereas monitoring of "hot" spots and their structural characterization is of uttermost importance for vaccine adaptation. Overall, our primary structure-based analysis showed that the hydrophobicity is a driving force for a selection of low-frequency substitutions. The same trend for selection of more hydrophobic residues was observed for the high-frequency substitutions. This tendency may affect structural features of the S protein such as e.g., promotion of the S protein RBD We hypothesize that such destabilization may raise the free energy of the unbound state, and therefore, a conformational ensemble that could recognize the ACE2 receptor is easier sampled. This mechanism is a reminiscent of a "steered" conformational selection mechanism for biomolecular recognition [37, 38] . It is rather surprising that the "steered" conformational selection is majorly and globally underlined by electrostatics and is not strongly influenced by any other type of non-covalent interactions. Globally, when looking into the structural data in PDB, just few structurally resolved S protein variants are resolved; most of the variants included in our study are not experimentally structurally determined (compare N=476 vs few experimental data in Figure S14 ) Initial structures for prefusion S protein conformation in closed and open states were downloaded from the CHARMM-GUI [78] COVID-19 Archive [79] . Out of eight existing structural models for each of the both states, which were built as described by Woo et al. [80] , the "1_1_1" models were chosen. As an initial structure for the S protein postfusion conformation, the partially resolved cryo-EM structure (PDB ID: 6XRA) [39] was considered. (residues V320-A520) was remodeled with MODELLER v.9.24 using the x-ray structure of ACE2 receptor bound to RBD solved at 2.45 Å resolution (PDB ID: 6M0J) [14] as a template. The ZN+2 atom was kept in the catalytic center of ACE2 protein since the ion is placed in the vicinity of the interface (~20-25 Å) and may have influence on ACE2 folding. Postfusion conformation. The postfusion conformation was optimized using a cryo-EM structure of SARS-CoV-2 S2 in the postfusion conformation solved at 3.00 Å average resolution (PDB ID: 6XRA) [39] . The partly resolved segments of each chain (residues A1174-E1202) were modelled using the x-ray structure solved at 2.90 Å resolution (PDB ID: 6LXT) [15] as a template. The final structure contains residues N703-I770 and T912-E1202. The (Tables S8-S10) , prepared in accordance with HADDOCK submission requirements using pdb-tools [82] and subjected to HADDOCK refinement. Thereafter, if there were multiple structures of the same variant ( [83] or (F817P, A892P, A899P, A942P, K968P, V969P) [84] (Tables S8-S10). One should be aware of molecular docking algorithms as well as molecular dynamics simulations strengths as well as weaknesses. Molecular docking is a combination of a conformational sampling algorithm and a scoring function [85] . Conformational sampling algorithms are developed and trained on limited number of samples in a test set, thus depending on the system of study and may vary in performance [85] . There are many different categories of sampling algorithms based on molecular dynamic (MD) simulations or stochastic methods such as Monte Carlo and genetic algorithms [86] . MD simulations are simulating atomic motions using simple approximations based on Newtonian physics [87] . The forces which arise form bonded and non-bonded interactions are described in the force fields, which are limited by two principal challenges: force fields which require further refinements [88] as well as limitations in computational power. It is known that, post-docking refinements provide higher hit rates and higher correlation with experimental data [89] . For example, for HADDDOCK it was shown that solvated docking/refinement do not always improve the docking results but improve the scoring [90] [91] [92] , which is highly important for our study. Water refinement procedure of models via HADDOCK is described in the following section. HADDOCK refinement. HADDOCK (High Ambiguity Driven protein-protein DOCKing) is a semiflexible docking method for biomolecular research which is based on bioinformatical predictions and experimental knowledge of biochemical/biophysical interactions. HADDOCK approach is using python scripts derived from ARIA [93] . For structure calculations it uses CNS [94] (Crystallographic and NMR system). The method can be used not only for docking of biomolecules but also for structure-refinement purposes [69] . HADDOCK uses nonbonded electrostatic and van der Waals energy terms of the OPLS force field with the cutoff distance of 8.5 Å from a modified version of the parallhdg5. 22 .pro parameter file [95] for the evaluation of inter-and intramolecular energies. The usual docking protocol consists of three following stages: a rigid-body energy minimization, three steps of a semi-flexible simulated annealing refinements in torsion angle space and a final refinement in Cartesian space with explicit solvent or DMSO (in our case water refinement was applied). The refinement interface, which was utilized in our study, includes only the last stage [69] . 100 K with 500 MD steps at each temperature [96] . The user can decide how many times the initial structure will be refined, in this work we run 20 refinements per model. After each of the refinement a HADDOCK score (HS water ) is derived as a weighted sum of van der Waals (E vdW ), electrostatics (E elec ) and desolvation (E desolv ) scoring components. The resulting value for each scoring component is calculated as an average for the top four best-scoring models with the smallest weighted sum value [69] (https://alcazar.science.uu.nl/services/HADDOCK2.2/). For energetic calculations with Rosetta, the latest weekly release (Rosetta 2021.38) [71] was used. The best ranked docking solution of HADDOCK was used as input model, taking only models of variants with single substitutions. Zinc entries in the .pdb files of ACE2-bound were modified: "ZN+2" was renamed "ZN" and "ZN2" was renamed "ZN". The altered .pdb file was scored according to the Rosetta tutorial "Analyzing Interface Quality" (https://www.rosettacommons.org/demos/latest/public/analyzing_interface_quality/README, accessed 10. March 2022). To assess any correlation between the resulting scoring values from Rosetta and HADDOCK, scoring components of HADDOCK output (Electrostatics, vdW, Desolvation, BSA) and Rosetta scores (fa_elec, fa_atr, fa_sol, dSASA_int) are paired up (x,y) ( Figure S7 ). For each pair (e.g., an attribute from Rosetta vs that corresponding from HADDOCK) a Bayesian ridge linear regression model is fitted using Scikit-Learn library (https://scikitlearn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#). The R 2 score is calculated per pair (as quoted in the subplots) and is defined as (1-u/v), where 'u' is the residual sum of squares ∑((y_true -y_pred) 2 ) and 'v' is the total sum of squares ∑((y_true -mean(y_true)) 2 , where, y_true is the actual data and y_pred is the linear regression model value. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). Moreover, the p-value (for null hypothesis that the slope is zero, using Wald Test with t-distribution) was calculated and quoted in the subplots (Figure S7 ). For this work we queried the public database GISAID [36] for SARS-CoV-2 S protein sequences uploaded over period of time starting from July 2020 with the last query on the 2 nd of January 2021. From the total number of 311,255 S protein sequences, sequences with more than 95% sequence coverage (276,712) were considered for the subsequent analysis. These sequences were aligned against the reference SARS-CoV-2 spike protein sequence (UniProt [97] P0DTC2) using MAFFT [98] v7.471. Frequency of mutations for each aligned site was calculated ignoring undefined residues (X) and gaps (-). In this work, percentage of occurrence (p.occ.) of all appearing substitutions was continuously monitored starting from July 2020 until the 2 nd of January 2021 and substitutions appearing at the ectodomain of S protein were carefully selected (Table S12) . Initially, in the dataset all substitutions appearing in July with p.occ. ≥ 0.2% (e.g., T76I, M153I, V213L, S254F, V382L, T778I, T1117I, G1124V, V1176F) were included. After July, considering the exponential increase in deposited data, substitutions appearing with p.occ. ≥ 0.5% at any considered month were immediately selected. Also, substitutions of the same residues (e.g., G181R, N501T, A1020V) or of the residues next or in immediate vicinity from the residue with substitution p.occ. ≥ 0.5% (e.g., S254F, T572I, G842V) were selected starting from the p.occ. is very close to cleavage site and substitution is Pro) were selected even if they appeared at any of the considered months with p.occ. ~ 0.1%. Furthermore, substitutions which were observed to appear continuously over the period of at least three months with p.occ. ≥ 0.1% were selected (e.g., S12F , P25S, P26L, T29I, T95I, H146Y, M153I, R190M, D198G, Q675R, V687L, V772I, R847K, T859I, A879S, P1069S, Q1071L, A1078S, V1104L, V1122L) . Additionally, substitutions which periodically appeared with p.occ. ≥ 0.1% (T307I, A845S, T859I) and/or were described in the literature (T572I [99] and A706V [100] ) were also included. Also, we considered three lineages which are named according to Pangolin [101] nomenclature Figure S4 and in detail in Table S12 . For the rest of the performed statistics the percentage of occurrence (p.occ.) for each of the substitutions was calculated considering all S protein sequences and are shown in Figure 2(a,b) , Figure S1 -S4. Introduction of substitutions and model refinement. ∆H69/∆V70, T95I, G142D, ∆V143/∆Y144/∆Y145, ∆N211, L212I, ins214EPE, G339D, S371L, S373P, S375F, K417N, N440K, G446S, S477N, T478K, E484A, Q493K, G496S, Q498R, N501Y, Y505H, T547K, D614G, H655Y, N679K, P681H, N764K, D796Y, N856K, Q954H, N969K, L981F) were created. All created models were submitted to water refinement using HADDOCK webserver v2.2 [69] as previously described [62, 92] . Sequence-based hydropathy calculations were performed using the Kyte-Doolittle hydrophobicity scale [105] . Distribution of the Kyte-Doolittle hydrophobicity scores (K-Dhs) was visualized for the complete ectodomain sequence (M1-P1213) of the Wuhan strain ( Figure S5) , for the "wild type" residues and corresponding substitutions for the set of all considered substitutions (N=148), substitutions with p.occ. < 0.5% (N=126), and substitutions with p.occ. ≥ 0.5% (N=22) (Figure 3) . and charged (R, D, E, K) (Figure 3(d-f) ). Scores (HD (HADDOCK) score, electrostatics, van der Waals, desolvation, BSA (buried surface area)) derived from structure-refinement calculations for each of the with 0.02% step, number of mutations per step is reported in Table S7 ). To identify statistically significant contribution of the calculated scoring components, p-value for each step was calculated via unpaired t-test (Figure S8, S9, Figure S12, 13) . To observe destabilization profiles of the electrostatics scoring component, the only significant contributor (Figure 6(a,c) , Figure S8 , S9), a mean scoring value for each of the groups (0.04% ≤ p.occ. ≤ 2.0% with 0.02% step) was calculated, and difference in the electrostatics scoring component (ΔElectrostatics) between the group of substitutions with higher and lower p.occ. was derived for each step (Figure 6(b) , Figure S11(b) ). Since p-values provide evidence for rejecting the underlying H 0 , we consider groups of (a) strong (pvalue < 0.01, e.g., in the case of hydrophobicity), (b) moderate (p-value < 0.05), (c) weak evidence or trend (p-value < 0.1) and (d) no evidence (p-value ≥ 0.1). We use PyMC3 [106] , a probabilistic programming package in Python, that fits Bayesian models using notably Markov chain Monte Carlo (MCMC) methods. To quantitatively assess how different any given two groups of data are from the other, we perform a rigorous Bayesian parameter estimation, using the module -Bayesian Estimation Supersedes the Ttest (BEST) under PyMC3 based on Kruschke, 2013 [107] . Driven by Bayesian probability, this is a comprehensive and more solid approach than the testing approaches that involve expressing a null hypothesis. Moreover, we estimate the uncertainty associated with the estimated parameter that accounts for our lack of knowledge of the model parameters. For a given (group 1, group 2) data, we calculate two parameters, namely, (a) the effect size, and (b) a high-density probability interval around the effect size. Farther away from 0 the effect size (and the 95% HDI) is, the better. Technical details of the procedure: A student t-distribution is used to describe the attributes of each group. A t-distribution is a robust choice for our data, since it is less sensitive to outliers compared to a normal distribution. A t-distribution is represented by three parameters, mean (μ), standard deviation (σ), and degrees-of-normality parameter (ν) (which is assumed to be same for both groups). The prior distribution corresponding to each model parameter is set to be a very broad Uniform distribution, of width equal to 10 times the standard derivation. The prior distribution for parameter ν is assumed to be a very wide exponential distribution, of mean of 30. These wide For every attribute (e.g., HADDOCK score, HADDOCK components (Electrostatics score, vdW score, desolvation, BSA etc.)), the mutations are divided in to two groups (p.occ., %, below and above a threshold), while changing the threshold value incrementally from 0.1% to 2%. For each pair of groups, we calculate a distribution of effect size and plot the mean effect size along with the 95% HDI as a function of p.occ. threshold (%). The process is repeated for closed, open, and bound structures. The MD simulations of the S protein in "closed" and "open" conformational states were performed using NVIDIA CUDA version 9010 acceleration modules of NAMD 2.13 [108] for Linux-x86_64-multicore-CUDA employing CHARMM36 [109] force field. The long-range electrostatics were treated using the particle-mesh Ewald approach [110] . The S protein included M1-L1141 residues of each chain of HADDOCK refined model of A570D variant. The system was solvated in a water box of TIP3P water (with a water layer of 10 Å) and neutralized with 150 mM NaCl, eventually comprising 523,769 atoms in "closed" and 552,368 atoms in "open" conformation. System preparation as well as MD analysis were performed using VMD 1.9.3 [111] . Note that lipids and sugars were not included in the simulation because the MD simulations were performed using "hard-restrained" approach where only atoms of protein 15 Å around protein residues D570, N856 and K964 in all three chains and water-ion environment were allowed to move without harmonical restrains. On all other atoms of the system was applied harmonic force constant of 5 kcal mol −1 Å −2 . Each approach was repeated three times for 10 ns per replicate after observing system equilibration. Prior to each of the MD simulations, two all-atom minimization-relaxation cycles were performed. At the beginning of each cycle, the system was minimized for 10000 steps. During the first cycle, the water-ion environment and hydrogens were allowed to relax, keeping protein harmonically restrained. Subsequently, the temperature was incrementally changed from 0 to 310 K, relaxing the system for 1 ps per increment of 10 K with a final relaxation for 0.1 ns at 310 K. In the second cycle, side chains within 15 Å around (and including) D570, N856, K964 residues in all three S protein chains, as well all hydrogens, water molecules and ions were set free while all other atoms were kept restrained. The system was incrementally heated as described for the first cycle with a final relaxation for 0.1 ns at 310 K. Finally, the system with unrestricted atoms 15 Å around (and including) D570, N856, K964 residues in all three S protein chains was incrementally heated as previously described and the MD simulation was performed for 10 ns at 310 K, 1.01325 bar. During MD simulation an integration time step was set to 2 fs applying SHAKE algorithm on all bonds involving hydrogen atoms. Structure files and associated data of S protein variants in closed, open, ACE2-bound and postfusion conformational states generated in this work have been deposited in SBGrid [112] . Four directories are shared: "S_closed", "S_open", "S_ACE2" and "S_postfus". In each folder an initial .pdb file for each of the conformational states and two subdirectories ("param_index" and "structures") are placed. The "param_index" directory includes two files for each of the considered variants: the results file after the HADDOCK2.2 refinement [69] (.html file) and the parameter file that was used for the structure calculation (.web). The user can reproduce any run by uploading the .web file using the HADDOCK webserver (https://alcazar.science.uu.nl/services/HADDOCK2.2/haddockserver-file.html). The "structures" directory contains the top scoring refined structure file (.pdb) for each of the variants. The nomenclature of each file in subdirectories "param_index" and "structures" corresponds to DIRNAME_R1NUMR2. DIRNAME stands for the name of the main directory ("S_closed", "S_open", "S_ACE2" and "S_postfus"), R1 stands for the one-letter residue code of the "wild type" residue of the S protein, NUM stands for the residue number according to the Uniprot sequence of SARS-CoV-2 Spike protein and R2 stands for the one-letter residue code of the variant to which the "wild type" residue R1 was changed. Results of the HADDOCK score and its components calculations performed with HADDOCK2.2 for each generated variant are summarized in Table S2 -S6. All unpublished code and scripts used in this study are available upon request. and open (N=64) conformations separated into low-and high-frequency groups as function of threshold (0.04 % ≤ p.occ. ≤ 2.0 % with 0.02% step). Asterisk in (a-c) panels indicate position starting from which the calculated p-values are indicative of a consistent trend (p-value < 0.1). (e-g) Calculated effect size (and 95% high probability density interval, HDI) using Bayesian parameter estimation for closed, open and bound states. The variants were separated into low-and high-frequency groups as function of threshold (0.1 % ≤ p.occ. ≤ 2.0 % with 0.1% step). The farther away from 0 the effect size (and the 95% HDI) is, the better. The effect sizes (and 95% HDI) for the electrostatics scores for only the closed conformation show significant differences. Identifying airborne transmission as the dominant route for the spread of COVID-19 Toward Achieving a Vaccine-Derived Herd Immunity Threshold for COVID-19 in the U.S. Front Public Health Structure, Function, and Evolution of Coronavirus Spike Proteins Cell entry mechanisms of SARS-CoV-2 Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus Structural insights into coronavirus entry A Multibasic Cleavage Site in the Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells The furin cleavage site in the SARS-CoV-2 spike protein is required for transmission in ferrets Angiotensinconverting enzyme 2 is a functional receptor for the SARS coronavirus Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2 Hydrolysis of biological peptides by human angiotensin-converting enzyme-related carboxypeptidase Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor Domains and Functions of Spike Protein in Sars-Cov-2 in the Context of Vaccine Design ACE2 gene variants may underlie interindividual variability and susceptibility to COVID-19 in the Italian population Origin and cross-species transmission of bat coronaviruses in China Structural models of human ACE2 variants with SARS-CoV-2 Spike protein for structure-based drug design Insights on cross-species transmission of SARS-CoV-2 from structural modeling SARS-CoV-2 spike protein predicted to form complexes with host receptor protein orthologues from a broad range of mammals Identifying the Zoonotic Origin of SARS-CoV-2 by Modeling the Binding Affinity between the Spike Receptor-Binding Domain and Host ACE2 The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity Cryo-electron microscopy structures of the N501Y SARS-CoV-2 spike protein in complex with ACE2 and 2 potent neutralizing antibodies Effect of natural mutations of SARS-CoV-2 on spike structure, conformation The D614G mutations in the SARS-CoV-2 spike protein: Implications for viral infectivity, disease severity and vaccine design Spike protein D614G and RdRp P323L: the SARS-CoV-2 mutations associated with severity of COVID-19 D614G Mutation Alters SARS-CoV-2 Spike Conformation and Enhances Protease Cleavage at the S1/S2 Junction Structural impact on SARS-CoV-2 spike protein by D614G substitution Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity Computational Mutagenesis at the SARS-CoV-2 Spike Protein/Angiotensin-Converting Enzyme 2 Binding Interface: Comparison with Experimental Evidence GISAID: Global initiative on sharing all influenza data -from vision to reality Multiple conformational selection and induced fit events take place in allosteric propagation The role of dynamic conformational ensembles in biomolecular recognition Distinct conformational states of SARS-CoV-2 spike protein Unlocking COVID therapeutic targets: A structure-based rationale against SARS-CoV-2, SARS-CoV and MERS-CoV Spike Highly Conserved Homotrimer Cavity Formed by the SARS-CoV-2 Spike Glycoprotein: A Novel Binding Site Tectonic conformational changes of a coronavirus spike glycoprotein promote membrane fusion Why are RNA virus mutation rates so damn high? A single amino acid substitution in the S1 and S2 Spike protein domains determines the neutralization escape phenotype of SARS-CoV A single amino acid substitution (R441A) in the receptorbinding domain of SARS coronavirus spike protein disrupts the antigenic structure and binding activity Permissive secondary mutations enable the evolution of influenza oseltamivir resistance Predominance of positive epistasis among drug resistance-associated mutations in HIV-1 protease SARS-CoV-2 variant evolution in the United States: High accumulation of viral mutations over time likely through serial Founder Events and mutational bursts Coronavirus RNA Proofreading: Molecular Basis and Therapeutic Targeting Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020 N-terminal domain antigenic mapping reveals a site of vulnerability for SARS-CoV-2 Serine 477 plays a crucial role in the interaction of the SARS-CoV-2 spike protein with the human receptor ACE2 Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding Identification of SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization Mutations Strengthened SARS-CoV-2 Infectivity Vaccine-escape and fast-growing mutations in the United Kingdom, the United States Adaptive Evolution of Peptide Inhibitors for Mutating SARS-CoV-2 The SARS-CoV-2 Y453F mink variant displays a pronounced increase in ACE-2 affinity but does not challenge antibody neutralization Prospective mapping of viral mutations that escape antibodies used to treat COVID-19 Modelling conformational state dynamics and its role on infection for SARS-CoV-2 Spike protein variants SARS-CoV2 spike protein gene variants with N501T and G142D mutation-dominated infections in mink in the United States Proteins feel more than they see: fine-tuning of binding affinity by properties of the non-interacting surface On the binding affinity of macromolecular interactions: daring to ask why proteins interact Types and effects of protein variations Predicting protein function from sequence and structure Conformational transition of SARS-CoV-2 spike glycoprotein between its closed and open states Conformational Changes of the Receptor Binding Domain of SARS-CoV-2 Spike Protein and Prediction of a B-Cell Antigenic Epitope Using Structural Data Real-Time Conformational Dynamics of SARS-CoV-2 Spikes on Virus Particles The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular Complexes Exploring the interplay between experimental methods and the performance of predictors of binding affinity change upon mutations in protein complexes The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design On the role of electrostatics in protein-protein interactions The effect of the D614G substitution on the structure of the spike glycoprotein of SARS-CoV-2 Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant In situ structural analysis of SARS-CoV-2 spike reveals flexibility mediated by three hinges Structures and distributions of SARS-CoV-2 spike proteins on intact virions SARS-CoV-2 Spike receptor-binding domain with a G485R mutation in complex with human ACE2 CHARMM-GUI: a web-based graphical user interface for CHARMM Receptor Binding, and Antibody Binding of the Fully Glycosylated Full-Length SARS-CoV-2 Spike Protein in a Viral Membrane Spike Protein Model in a Viral Membrane Comparative protein modelling by satisfaction of spatial restraints pdb-tools: a swiss army knife for molecular structures Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation Structure-based design of prefusionstabilized SARS-CoV-2 spikes On evaluating moleculardocking methods for pose prediction and enrichment factors Molecular docking: a powerful approach for structure-based drug discovery Molecular dynamics simulations and drug discovery Current status of protein force fields for molecular dynamics simulations Refinement and Rescoring of Virtual Screening Results Explicit treatment of water molecules in data-driven protein-protein docking: the solvated HADDOCKing approach Solvated proteinprotein docking using Kyte-Doolittle-based water preferences Are scoring functions in protein-protein docking ready to predict interactomes? Clues from a novel binding affinity benchmark Automated assignment of ambiguous nuclear overhauser effects with ARIA Crystallography & NMR system: A new software suite for macromolecular structure determination Influence of non-bonded parameters on the quality of NMR structures: a new force field for NMR structure calculation HADDOCK: a protein-protein docking approach based on biochemical or biophysical information UniProt: a worldwide hub of protein knowledge The EMBL-EBI search and sequence analysis tools APIs in 2019 Stabilizing the closed SARS-CoV-2 spike trimer Variations in SARS-CoV-2 Spike Protein Cell Epitopes and Glycosylation Profiles During Global Transmission Course of COVID-19 A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Emergence and rapid spread of a new severe acute respiratory syndrome Genotype to Phenotype A simple method for displaying the hydropathic character of a protein Probabilistic programming in Python using PyMC3 Bayesian estimation supersedes the t test Scalable molecular dynamics with NAMD CHARMM36 all-atom additive protein force field: validation based on comparison to NMR data A smooth particle mesh Ewald method VMD: visual molecular dynamics Varinats of SARS-CoV-2 Spike protein in closed, open, ACE2-bound and postfusion conformational state Analysis of S protein structural states delineates vaccine-unhindered adaptation 2. We identified three strictly conserved S2 regions, critical for drug design 3. Frequent substitutions increase overall hydrophobicity replacing charged amino acids Writing -Review & Editing Jaydeep Belapure: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing -Review & Editing Christian Tüting: Data curation, Formal analysis, Methodology, Validation, Writing -Review & Editing Reinhard Paschke: Funding acquisition, Project administration, Resources Ioannis Papasotiriou: Conceptualization, Funding acquisition, Project administration, Resources, Supervision João PGLM Rodrigues: Conceptualization, Data curation Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing Original Draft, Writing -Review & Editing The authors thank the members of the Kastritis Laboratory, RGCC and BioSolutions for valuable discussions. This work was supported by the Federal Ministry for Education and Research (BMBF, ZIK program) (Grant nos. 03Z22HI2, 03Z22HN23 and 03COV04 to PLK), the European Regional Development The authors (Marija Sorokina, Jaydeep Belapure, Christian Tüting, Reinhard Paschke, Ioannis Papasotiriou, João PGLM Rodrigues and Panagiotis L Kastritis) declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.