key: cord-0277744-gjzql71h authors: Manrique, P.D.; Chakraborty, S.; Nguyen, K.; Mansbach, R.; Korber, B.; Gnanakaran, S. title: Network analysis outlines strengths and weaknesses of emerging SARS-CoV-2 Spike variants date: 2021-09-04 journal: bioRxiv DOI: 10.1101/2021.09.03.458946 sha: 21f9fe62d5b610b3aecf4efe0570c036245ec9d4 doc_id: 277744 cord_uid: gjzql71h The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has triggered myriad efforts to dissect and understand the structure and dynamics of this complex pathogen. The Spike glycoprotein of SARS-CoV-2 has received special attention as it is the means by which the virus enters the human host cells. The N-terminal domain (NTD) is one of the targeted regions of the Spike protein for therapeutics and neutralizing antibodies against COVID-19. Though its function is not well-understood, the NTD is reported to acquire mutations and deletions that can accelerate the evolutionary adaptation of the virus driving antibody escape. Cellular processes are known to be regulated by complex interactions at the molecular level, which can be characterized by means of a graph representation facilitating the identification of key residues and critical communication pathways within the molecular complex. From extensive all-atom molecular dynamics simulations of the entire Spike for the wild-type and the dominant variant, we derive a weighted graph representation of the protein in two dominant conformations of the receptor-binding-domain; all-down and one-up. We implement graph theory techniques to characterize the relevance of specific residues at facilitating roles of communication and control, while uncovering key implications for fitness and adaptation. We find that many of the reported high-frequency mutations tend to occur away from the critical residues highlighted by our graph theory analysis, implying that these mutations tend to avoid targeting residues that are most critical for protein allosteric communication. We propose that these critical residues could be candidate targets for novel antibody therapeutics. In addition, our analysis provides quantitative insights of the critical role of the NTD and furin cleavage site and their wide-reaching influence over the protein at large. Many of our conclusions are supported by empirical evidence while others point the way towards crucial simulation-guided experiments. The emergence and subsequent worldwide spread of the severe acute respiratory syndrome coronavirus 30 2 (SARS-CoV-2) causing COVID-19 [1, 2] is a global health emergency that, according to the World 31 Health Organization, has taken more than 4 million lives as of July 16, 2021 [3] . COVID-19 is a highly 32 contagious respiratory illness initiated by viral entry into host cells prior to infection and symptoms. 33 The first event in viral entry is contingent upon the binding of the Spike glycoprotein, located at the 34 surface of the SARS-CoV-2 pathogen, with the human host receptor angiotensin-converting enzyme 2 35 (ACE2) [4] [5] [6] . 36 37 The Spike is a homotrimeric class I viral fusion protein able to adopt different conformations according 38 to the state of the receptor-binding-domain (RBD) of each of its protomers. It has been determined by 39 cryo-electron microscopy at the atomic level [7] that the Spike adopts two predominant conformations 40 in which either all three protomers are in a closed state (all-down conformation), or one protomer is illustrate the graph and graph representations, respectively, of the Spike protein divided into pre-assigned protomer domains (colored regions/nodes). In the top right panel, we provide the full domain names, the domain abbreviations, and the residues that comprise them. From its discovery in November 2019 and throughout 2020, COVID-19, like other coronaviruses, has positioned and hierarchically connected to exert wide-reaching control of the protein at large; and (5) 105 a specific examination of the NTD residues that are altered in the Delta variant, reveals that these 106 residues are more efficient at impacting the full protein than the residues of the NTD supersite. form than a similar intervention performed on the G form, since a network with a more uniform distri-177 bution of betweenness centrality among its nodes tends to be more resilient to external interventions 178 aimed at network disruption. This finding is in agreement with experimental and structural studies 179 that found that the Spike protein of the D form is less stable with regards to S1 dissociation than that 180 of the G form [15, 16] . The G form promotes rapid allosteric pathways to the RBD at the cost of immune vulnerability 183 We explore the effects of the D614G variant at altering specific pathways relevant for viral infectivity 184 through a path length analysis. Up to date, variants of concern are characterized by a combination 185 of high transmissibility and/or immune response escape [23, 24, 26] . For example, the Delta variant, 186 characterized by eleven modifications (two of them in the RBD) including two deletions, has shown 187 resistance to monoclonal antibodies (mAbs) casirivimab and imdevimab, which are otherwise effective 188 binders to the RBD able to inhibit human receptor binding [34] . Furthermore, RBD opening transition 189 and changes in FCS are known to affect each other [18, 35] . Thus, variations in the communication 190 properties to the RBD becomes a key question to address in order to establish the impact of mutations 191 in the transmissibility and antibody resistance of the virus. To this end, we analyze the optimal paths between the furin cleavage region and the RBM [ Fig. 3 In Figure 3 (b), we depict the results of assessing optimal pathways between the different source/target versely, the D form shows smaller differences between the path lengths in these two conformations, 219 and the all-down path lengths in the D form are generally longer than those of the G form [ Fig. S4 . For all cases, the target region is specified at the protomer U, and the source is either protomer: from U to U (UU), from L to U (LU), and from R to U (RU). including the D614G shift, that have been shown to affect the gating mechanism of the RBD, must in 244 some way communicate via the hinges [17, 18] . To understand how communication occurs between the stalk to the Spike, we explore the role of H 1 247 and H 2 in connecting residue 614 to the subset of residues of the RBD that binds to the ACE2, RBM (residues 438-508), by computing the shortest network pathways. We quantify the involvement of 249 each of the hinge residues in these pathways in two complementary manners. In method (i) we count, 250 for all RBM residues, the number of times a particular hinge residue is part of a shortest pathway 251 between that residue and 614. In method (ii), we count the number of hinge residues that belong to 252 either the group of residues comprising the shortest paths or an additional group comprising the top-50 253 suboptimal residues. Adding the top-50 suboptimal residues accounts for neighboring residues to the 254 shortest pathway that could play a key role by assisting optimal residues in their communication tasks 255 by providing alternatives routes to the RBM. These suboptimal residues are identified by comparing 256 the direct (i.e., optimal) path from 614 and RBM with an indirect (i.e., suboptimal) pathway that 257 goes through an additional third node, which does not belong to the optimal path. The difference in 258 the length of these pathways provides a measure of how well this third node connects 614 and RBM 259 through a suboptimal path; the smaller the path length difference, the higher the third node ranks 260 within the suboptimal group. Thus, method (ii) identifies and quantifies the number of hinge residues 261 that take part of optimal pathways and their surroundings (i.e., suboptimal pathways), and method 262 (i) quantifies the frequency with which these residues are traversed in the optimal pathways. Since the up conformation of the RBD is capable of binding ACE2, the event that leads to viral entry 265 and later infection, we focus on the paths from residue 614 to RBM in the U-protomer [ Fig. 4 The one-up conformation is characterized by an increase in the usage of H 2 residues when compared 277 to that in the all-down case. We find that the G form is able to use both hinges almost equally and at the virus is expected to mutate at regions where antibodies bind, for immune escape [26] . But if an 307 antibody binds to the NTD, even if it is not exactly binding to the highest centrality residues, it will 308 affect the NTD dynamics, which are highly correlated with the RBD dynamics, and therefore the residues (i.e., edge weights) for the different protein systems. We find that the average pair correlation 320 between contact residues for the all-down conformation in both, the D-and the G-form, is always 321 higher than that of the one-up conformation, which in turn, makes the average path length shorter in 322 the all-down conformation [ Table S1 ]. Moreover, we find that the G-form has shorter path lengths, on 323 average, than the D-form, which makes the Spike of the D614G variant more efficient at establishing 324 effective allosteric communication pathways than that of the wild type. centrality, which computes the influence of a particular node based on the influence of its neighbors. Thus, the centrality score of a given node j takes into account the score of its neighbors, which in turn, so on. Therefore, this centrality intrinsically includes interactions with higher degree of neighbors, and 339 hence quantifies the global influence of a given node. Mathematically, the eigenvector centrality of the 340 node j is the j-th entry of the adjacency matrix eigenvector associated with the largest eigenvalue. In contrast to optimal paths analysis, in which edge weights indicate path lengths and therefore the 342 smaller the weight the higher the importance of the edge, in influence centrality measures, the edge 343 weights directly indicate importance. Therefore we employ an edge weighting for influence centrality 344 that monotonically increases with the correlation between two residues (see Materials and Methods) 345 and which is the inverse of the weight used for optimal communication paths calculations [39] . 2), which originated in India [43] , has received special attention 368 due its enhanced infectivity, its ability to bypass antibodies [44] , and to quickly become the dominant Our analysis provides a quantitative argument for why these critical sites are being modified/deleted 405 in the Delta variant: changing residues with low betweenness does little to hinder the overall allosteric 406 dynamics of the protein, but changing residues with high closeness could impact antibody binding. Modifications in the sites G142 and R158 in specific have already been shown to be directly associated 408 to mAbs neutralization escape [27] . Thus, our results suggest that the closeness centrality may be 409 used as a quantitative measure to determine critical residues in the Spike protein with the potential 410 of serving as effective epitopes to neutralize the protein. 16] , the greater vulnerability to RBD-binding antibodies [17, 51] , and 427 the increased stability of the Spike protein, as a result of the D614G mutation [15, 16] . of open Spike conformation [52] and enhance protease cleavage at FCS [53] , but the exact mechanism 432 of such information transfer has not been determined. Our computational graph approach has elu- Table 453 S6], suggesting that substitution/deletion events tend to avoid sites that are relevant for protein func- Their actual centrality values equal 98% of the highest-ranking node in closeness of each conformation. In the wild-type, the values are 96.5% and 97.8% of the highest ranked residue in the all-down and 482 one-up conformations, respectively. The ranking of residue 417 in closeness is lower than residue 50 483 but still reaches scores within the 70% mark for all four networks. These residues rank below the 1.3% 484 and 0.002% marks in betweenness and eigenvector centrality, respectively. built with a data-driven homology-modeling approach. Missing residues in the RBD specifically were 504 built from an ACE2-bound RBD substructure (PDB ID 6M0J). The G-form was created from the 505 D-form through manual mutation of residue 614, but otherwise the initial structures were identical. Atomistic molecular dynamics simulations were performed using the Gromacs software suite with the 507 CHARMM36m force field. We ran five 1.2-µs simulations for each of the four systems. The final 508 1000ns of each trajectory was considered as the production set, after equilibration, for a total of 20 509 µs of simulation in neutral solution, using the Berendsen barostat, the particle mesh Ewald method 510 for electrostatics and temperature coupling at 310K with a Langevin thermostat. Graph construction 513 We use the contact matrix in combination with the cross-correlation matrix to define the edges and 514 weights of our networks, respectively. An edge between two residues is defined when heavy atoms of 515 each residue were at a Euclidean distance of 6Å or less for at least 75% of the analyzed simulation 516 frames. The cross-correlation matrix is defined as: where ∆ r j (t) are the fluctuations of atom j with respect to its average coordinates. The weights for 518 the communication graphs edges are defined as w ij = − log |C ij | [30] . We use this definition of weights 519 for calculations based on shortest paths analysis (e.g., betweenness and closeness centralities), while 520 its inverse for control/influence analysis [39] (e.g. degree and eigenvector centralities). In graph theory, centrality measures quantify the importance of a particular node within the network 524 at large [62] . Since the importance can be interpreted in several different ways, there are different 525 types of centrality measures according to specific criteria. In undirected networks, the most employed 526 measures are degree, eigenvector, closeness, and betweenness centrality. Degree centrality quantifies the number of edges a particular node has. It is interpreted as a measure 529 of popularity and influence a node has at the local level. In weighted networks, each edge is leaden 530 by a specific weight and the degree of a node is the sum of the weights of the associated edges and 531 it is hence also known as vertex strength. Mathematically, the degree centrality of node j (k j ) can be 532 computed as: where A ij is the adjacency matrix defining the connections of the graph (A ij = 1, if i is in contact 534 distance with j and A ij = 0, otherwise), and w ij is the weight matrix defining the strength in the 535 connection between any two nodes. Eigenvector centrality measures the influence a node has at a wider scale than degree. It quantifies 538 the influence of a node given that of its neighbors. For example, a given node j has many edges and Closeness centrality measures how far a particular node is from all the other nodes in the network 549 according to their associated pathlength, which quantifies the distance between any two nodes in a 550 network. For weighted networks, the pathlength between node i and node j (d ij ), is the sum of the 551 weights associated to the edges that comprise the network path between node i and node j. Therefore, 552 nodes holding higher values of closeness centrality are located, on average, at a shorter distance from 553 all the rest of the nodes and hence are able spread information more efficiently throughout the graph. For a network with n number of nodes, the closeness centrality of the j-th node is defined as the 555 inverse of the average distance ( j ) between a particular node and the rest of the nodes. Finally, betweenness centrality quantifies the number of shortest pathways (also called geodesics) a 557 particular node takes part on that connects any two other nodes. Hence, nodes with high betweenness 558 are critical at establishing efficient bridges between distant regions of the network. The removal of 559 these high-betweenness nodes can ultimately lead to the disruption of the network, and therefore their 560 role is key for the stability and resiliency of it. Formally, betweenness centrality of node j (b j ) in an 561 arbitrary network can be expressed as: where σ st is the total number of geodesics connecting nodes s and t, while σ st,j is the number of 563 geodesics connecting node s to node t that pass through node j. while the inter-region connection with the CH is weakened. By the same token, in the one-up conformation 51 we find that the connections between CT0 and CT2, and between S2S2' with FPR and with CH, are all 52 weakened, while the connection between S2S2' with HR1, is strengthened. 53 54 The results presented in the Table S1 were average values over the whole protein. Figure S3 where k w i is the vertex strength of node i, k i is the unweighted degree, w ij is the edge weight connecting 61 nodes i and j, and a ij is the adjacency matrix element associated to nodes i and j. Table S2 details the number of residues belonging to specific regions (see Fig. 1 residues ranked by betweenness centrality for the four networks is provided in the Table S3 . We find that in the all-down configuration there is a marked preference for using H 1 over H 2 , and using 90 routes that do not pass through the hinges. This finding is more prominent in the D form than the G form. The one-up configuration is characterized by the use of both hinges at comparable amounts with a small 92 preference towards H 1 , and a highly more frequent use of the hinge residues in the G form than in the D shows the ranking of these residues (horizontal grid lines) within the whole protein, and a table that details specific residue information (rank, residue identity, centrality score, and protein region where the residue belongs to), for the all-down (left panel) and the one-up (right panel). Colors in the gid lines of the centrality plots and the tables are assigned by the protein region each residue belongs to: pink for RBD, green for NTD supersite, blue for FCS, and white for other regions. [ Fig. S5(b) ]. 18, 20, 26, 138, 190 . Comparing the average closeness centrality of these deleted residues with that 147 of the NTD supersite, we find that the score of the mutations associated to the Gamma and Beta variants 148 rank higher [ Fig. S7(b) ]. The scores of the residues deleted by the Alpha variant and those of the NTD 149 supersite are very similar to each other, though the former is slightly larger than the latter. The ranking 150 of these residues within the whole protein is shown in Fig. S7 (c) revealing that these key residues occupy 151 dominant position in the protein ranking in closeness centrality. These residues are very efficient at initiating 152 communication pathways capable to reach the whole protein. A computation of the betweenness and eigenvector centralities of the residues deleted by the Alpha and Beta 155 variants, shows that these residues rank very low in these metrics within the entire protein [ Fig. S7 ]. For 156 betweenness, we find that the impact of these residues often changes depending on which conformation we (c) Figure S7 : (a) Spike protein in the network representation highlighting the location of modified/deleted residues in the Alpha (blue), Beta (orange), and Gamma (purple) variants of the SARS-CoV-2 virus. (b) Comparison of the average closeness centrality of residues deleted by the Alpha (blue), Beta (orange), and Gamma (purple), and that of the NTD supersite (light orange). (c) Ranking of the closeness centrality for these deleted residues (horizontal grid lines) within the whole protein, and a table that details specific residue information (rank, residue identity, and centrality score times 10 4 ), for the all-down (left panel) and the one-up (right panel). Colors in the gid lines of the centrality plots and the tables are assigned by the variant: blue for Alpha, and orange for Beta. Each panel shows the ranking of these residues (horizontal grid lines) within the whole protein, and a table that details specific residue information (rank, residue identity, and centrality score), for the all-down (left panel) and the one-up (right panel). nearly seven thousand pairs of nodes. Therefore, deletion of these two residues is likely to affect severely only 163 one of the conformations. On the other hand, the deletion of residues L242 A, A243, and L244 (all three in 164 Beta), is expected to impact the protein the most because they hold large values of betweenness centralities Conservation entropy measures the mutation frequency associated to each residue. Table S6 shows the 172 entropy for the highest mutable residues of the Spike protein alongside with their respective betweenness where, up to August 2021, no mutations have been found. We identify region 1 by looking into residues with 182 low-betweenness (< 2.15% of the highest), low-eigenvector (< 9% of the highest), mid-closeness (between 71 183 and 92% of the highest), and high-exposure (> 16% of exposure). Total exposure of each residue is obtained 184 from theoretical surface area calculations of Gly-X-Gly tripeptides [7] . This criteria identfies 80.4% of the 185 residues belonging to the NTD super site, as well as, 63.1% of the NTD residues altered by the Alpha, 186 Beta, Gamma, and Delta variants. On the other hand, region 2 is characterized by sole condition of having 187 eigenvector centrality scores higher than 12% of the highest. This single condition carries some interesting 188 implications for the other centrality measures. For example, we find that residues in region 2 hold a very 189 uniform closeness centrality values that lie between 91 and 94% of the highest. Similarly, we find that these 190 residues hold vertex strength values higher than 25% of the highest. These findings become the first step 191 towards predicting where the next mutations could take place and hence talk about pandemic preparedness. Structural impact on SARS-CoV-2 spike protein by 611 D614G substitution SARS-CoV-614 2 spike-protein D614G mutation increases virion spike density and infectivity D614G Spike Mutation Increases SARS CoV-2 Susceptibility to SARS-CoV-2 and bat RaTG13 spike glycoprotein structures 622 inform on virus evolution and furin-cleavage effects D614G mutation alters SARS-CoV-2 spike 625 conformation and enhances protease cleavage at the S1/S2 junction The furin cleavage site in the SARS-CoV-2 spike protein 628 is required for transmission in ferrets SARS-CoV-2 spike P681R mutation, a hallmark of the Delta variant, enhances viral fusogenicity 631 and pathogenicity Spike protein cleavage-633 activation mediated by the SARS-CoV-2 P681R mutation: a case-study from its first appear-634 ance in variant of interest (VOI) A.23.1 identified in Uganda Genetic variants of SARS-CoV-2 -What do they mean The emerging plasticity of SARS-CoV-2 A neutralizing human antibody binds to the N-terminal domain of the Spike 643 protein of SARS-Cov-2 Recurrenti deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape N-terminal domain antigenic mapping reveals a site of vulnerability 649 for SARS-CoV-2 Neutralizing and protective human 658 monoclonal antibodies recognizing the N-terminal domain of the SARS-CoV-2 spike protein Dynamical networks in tRNA:protein com-661 plexes Conformational dynamics of SARS-CoV-2 trimeric spike glyco-663 protein in complex with receptor ACE2 revealed by cryo-EM The effect of the D614G substitution on the structure of the spike glycoprotein of 666 SARS-CoV-2 Structures and dynamics 668 of the novel S1/S2 protease cleavage site loop of the SARS-CoV-2 spike glycoprotein SARS-CoV-2 S protein:ACE2 interaction 675 reveals novel allosteric targets. eLife, 10, e636446 Natural Polymorphisms Are Present in the Furin Cleavage Site 677 of the SARS-CoV-2 Spike Glycoprotein Molecular charac-679 terization of interactions between the D614G variant of SARS-CoV-2 S-protein and neutralizing 680 antibodies: A computational approach The architecture of complex 682 weighted networks Statistical analysis of weighted networks. Discrete Dynamics in 684 Detection of a SARS-CoV-2 variant of concern in 686 South Africa Emergence of a 688 Novel SARS-CoV-2 Variant in Southern California Spatiotemporal pattern of COVID-19 690 spread in Brazil Convergent evolution of SARS-CoV-2 spike mutations, L452R, E484Q and 693 P681R, in the second wave of COVID-19 in Maharashtra Reduced sensitivity of SARS-CoV-2 variant Delta to 696 antibody neutralization Variants distribution of cases SARS-CoV-2 immune evasion by 703 the B.1.427/B.1.429 variant of concern A mechanistic understanding of 706 allosteric immune escape pathways in the HIV-1 envelope glycoprotein Residue geometry networks: a rigidity-based 709 approach to the amino acid network and evolutionary rate analysis Generalized correlation-based 712 dynamical network analysis: a new high-performance approach for identifying allosteric com-713 munications in molecular dynamics trajectories Cross-neutralization of SARS-CoV-2 by a human 716 monoclonal SARS-CoV antibody Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant D614G Mutation Alters SARS-CoV-2 Spike 724 Conformation and Enhances Protease Cleavage at the S1/S2 Junction Structural basis of SARS-CoV-2 spike protein induced by 726 ACE2 Surveying the Side-Chain Network Approach to Protein Structure and Dynamics: The SARS CoV-2 Spike Protein as an Illustrative Case Dynamic Network Modeling of Allosteric Interactions and 731 Communication Pathways in the SARS-CoV-2 Spike Trimer Mutants: Differential Modulation of 732 Conformational Landscapes and Signal Transmission via Cascades of Regulatory Switches. The 733 Exploring dynamics and network analysis of spike glycopro-735 tein of SARS-COV-2 Point mutations in the S protein connect the 737 sialic acid binding activity with the enteropathogenicity of transmissible Structural and functional analysis of the surface protein of human coron-739 avirus OC43 Bat-to-human: spike features determining 'host jump' of coronaviruses 741 SARS-CoV, MERS-CoV, and beyond Structure and binding properties of Pangolin-CoV spike glycoprotein inform the 744 evolution of SARS-CoV-2 Networks: An Introduction Top 60 highest mutable residues according to their conservation entropy score. Residues below this ranking have entropy scores below the 1% of the highest ranked residue in conservation entropy. Data from Los Alamos National Laboratory Sequence Entropy Data server cov.lanl.gov database, as of The architecture of complex weighted 196 networks The emerging plasticity of SARS-CoV-2 Recurrenti deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape Emergence of a Novel 203 SARS-CoV-2 Variant in Southern California Detection of a SARS-CoV-2 variant of concern in South 205 Spatiotemporal pattern of COVID-19 207 spread in Brazil Maximum allowed solvent accessibilites 209 of residues in proteins