key: cord-0180872-3wmkux58
authors: Chiang, Yuan; Hui, Wei-Han; Chang, Shu-Wei
title: Encoding protein dynamic information in graph representation for functional residue identification
date: 2021-12-15
journal: nan
DOI: nan
sha: 461b93762943737b317a1f979699ee126a482b62
doc_id: 180872
cord_uid: 3wmkux58

Recent advances in protein function prediction exploit graph-based deep learning approaches to correlate the structural and topological features of proteins with their molecular functions. However, proteins in vivo are not static but dynamic molecules that alter conformation for functional purposes. Here we apply normal mode analysis to native protein conformations and augment protein graphs by connecting edges between dynamically correlated residue pairs. In the multilabel function classification task, our method demonstrates a remarkable performance gain based on this dynamics-informed representation. The proposed graph neural network, ProDAR, increases the interpretability and generalizability of residue-level annotations and robustly reflects structural nuance in proteins. We elucidate the importance of dynamic information in graph representation by comparing class activation maps for the hMTH1, nitrophorin, and SARS-CoV-2 receptor binding domain. Our model successfully learns the dynamic fingerprints of proteins and provides molecular insights into protein functions, with vast untapped potential for broad biotechnology and pharmaceutical applications.

Proteins are molecular machines that carry out a variety of functions in biological processes. From 1D sequences to 3D structures, the genetic codes determine the sequence of amino acids and the way proteins fold into 3D structures (Fig. 1a) . With great experimental efforts, e.g. with X-ray crystallography (XRC) and cryogenic electron microscopy (cryo-EM), many protein structures have been determined at high resolutions, typically spanning from 1 to 4Å [1] [2] [3] [4] . To date, over 170,000 entries of 3D protein structures can be found in the Protein Data Bank (PDB) 5 , the repository of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies. However, the number of available protein structures falls far short of the available sequence data. For example, UniProt Knowledge-base (UniProtKB) 6 , the database of protein sequence and function, contains over 200 million annotated sequences.

To fill the gap between sequence and structure data and hence explore the unknown functions of newly discovered proteins, the determination of protein structures by experimental and computational techniques has been a long-standing issue in structural biology. Homology-modeling techniques have driven advances in the comparative construction and evaluation of protein structures 7, 8 . Comparative construction means such homology-modeling techniques need a priori structures as templates to build the models, which inevitably limits the generalizability toward the large proportion of completely unknown structures. The recent development of deep learning methods takes advantage of an increasing amount of sequence data to predict 3D protein structure. In the 14th biennial Critical Assessment of Structure Prediction (CASP14), deep learning models AlphaFold2 9 and RoseTTAFold 10 achieved supremacy in the accuracy of blind structure prediction tests. At the frontier of deep learning approaches, both models took input from multiple sequence alignments (MSA) and applied attention-based message passing schemes on residue pairs.

In particular, their integration with roto-translationally equivalent SE(3)-transformer and invariant point attention 11, 12 plays a crucial role in the fine adjustment of torsional angles in backbones and side chains.

With the aid of the high-throughput structures predicted by these models, we can now tackle the next challenge in protein function prediction by data-driven approaches.

In recent decades, there has been an emerging paradigm that, in proteins, sequence-encodes-structureencodes-dynamics-encodes-function 13 (Fig. 1a) . With the structures determined, the dynamics in turn regulate the behaviors and functions of proteins. For example, human MutT homolog 1 (hMTH1), as shown in Fig. 1b , utilizes different conformations of the same substrate binding pocket to recognize different types of oxidized nucleotides (8-oxo-dGTP and 2-oxo-dATP) and to conduct nucleotide saniti-zation 14 . In many cases, protein functions are intertwined with dynamic properties ranging from local residues to global collective motions. More essentially, the functional prediction of proteins is a great challenge, and the problems are multifold. Traditional machine learning classifiers, such as support vector machines, random forests, and gradient boosted decision trees, have been extensively used to predict sitespecific or protein-level functions 15, 16 . Many of the recent deep learning models focus on the variations in sequence embedding and neural network architecture [17] [18] [19] [20] . One of the most natural choices for learning protein features is the graph neural network (GNN). GNNs can aggregate local residue features in a high-dimensional space and learn appropriate representations for local and global inferences 19 . Although previous works demonstrate the utility of GNNs for protein function prediction, the dynamic properties of proteins have not been considered in most of these models. Even a tiny point mutation in the protein sequence or excess thermal overflow can distort the backbone structure, alter the dynamic behavior, and eventually nullify certain functions as a whole. It is therefore expected that encoding dnamic information of proteins into GNNs would increase the discriminatory capacity of neural networks for either regular function prediction or anomaly (for example, mutation) detection.

In this work, we propose a framework to incorporate dynamic information into graph-based deep neural networks. By processing protein structures into three feature pipelines, including spatial, topological, and dynamic features, we successfully increase the discriminatory power of neural networks and enhance the classification performance on protein function prediction. We retrieve 3D structures of proteins from PDB 5,21-24 and perform multilabel classification on molecular function (MF) gene ontology (GO) annotations obtained from Structure Integration with Function, Taxonomy, and Sequence (SIFTS) database 21, 25, 26 . We also use the gradient-weighted class activation map (Grad-CAM) 27 to identify the dynamically activated residues (DARs) by comparing the activation maps with and without dynamic information. Our model, protein dynamically activated residue (ProDAR), determines the residues that are critical to protein dynamic behaviors and that have fundamental importance for protein and drug engineering. We also demonstrate several examples of DARs identified by neural networks and discuss their effects on the dynamic characteristics of protein functions. Fig. 1 Protein dynamics in relation to molecular functions and graph encoding of dynamic information. a Protein hierarchy from DNA sequence, to polypeptide chain, to folding structure, to molecular functions. b Conformational difference between hMTH1 complexes with 8-oxo-dGTP and 2-oxo-dATP. The binding pockets in the two complexes have different hydrogen bonding patterns with substrates. c Equilibrium fluctuation about the native conformation and the energy induced by conformational change. d Twenty mode fluctuations and the cross-correlations between residue pairs. The correlation map is calculated by the trace of cross-correlation (inner product of the fluctuation vectors) as well as normalization against the diagonal components. e Graph construction scheme used in this work, merging the contact map and correlation map into an adjacency matrix for graph representation. One-hot vectors of 21 dimensions are used to encode different types of amino acids.

Proteins in vivo are neither static nor rigid bodies but dynamic and deformable molecules undergoing conformational changes due to thermal fluctuations. Via exploration by thermodynamical process, particular atomistic configurations are sampled in large conformational space to obtain the desired behaviors. To screen thousands of conformational changes in a high-throughput manner, we apply normal mode analysis (NMA) to the anisotropic network model (ANM), a surrogate model where molecular structures are represented as bead-spring networks. As illustrated in Fig. 1c , C α atoms are extracted from the polypeptide chain as beads, and any two beads within a prescribed cutoff distance r c are connected with each other through a spring. The formed bead-spring network has zero potential energy at the native conformation. The energy incurred by conformation change is then the sum of spring harmonic potential (see Methods). We collect the first 20 mode fluctuations (mode shapes) obtained from the eigenanalysis of the potential Hessian matrix and compute the cross-correlations between all residue pairs (Fig. 1d) . The cross-correlations underscore the residue pairs that have a high degree of synergy across 20 mode fluctuations. The pairs that mostly move toward the same direction have high correlation, while that mostly move toward the opposite directions have high anticorrelation. For each protein, we calculate the correlation map by taking the trace of the cross-correlation matrix and normalizing along diagonal components. The correlation map thus identifies the residue pairs that are important for the collective dynamics of protein accessible at the native conformation.

To encode structural and dynamic information into the protein graph representation, we merge the contact map and correlation map into a single adjacency matrix. The residue pairs in contact are first connected together with contact edges. Aside from the residue pairs with contact edges, other residue pairs with absolute correlation values no less than 0.5 (c i j ≥ 0.5) are connected with correlation edges.

Each node is assigned with a one-hot feature vector of 21 dimensions according to the type of amino acid it holds. The constructed protein graphs therefore encompass sequential, structural, and dynamic information of proteins in an unified representation. Through our graph construction framework, a considerable number of correlation edges are added to the protein graphs, as shown in Fig. 2c . Although a large number of the PDB entries have few correlation edges, the number of correlation edges in some PDB entries can be nearly eightfold the number of contact edges. Our work has benefited from these correlation edges, which fundamentally change the way GNNs draw inferences and reason down to the residue level. Fig. 2d shows the distributions of contact and correlation edges with respect to ten MF-GO terms in our dataset. Note that the number of correlation edges for aspartic-type endopeptidase activity (GO:0004190) is markedly smaller than those of the other terms. 6 Moreover, zinc ion binding (GO:0008270) has few PDB entries with extremely large numbers of correlation edges (approximately over 120,000). These correlation edges alter protein graph representations and increase the amount of dynamic information encoded into the graphs. The distributions of contact edges are similar across different terms, but those of correlation edges are remarkably dissimilar from each other, indicating that the dynamic information encoded in correlation edges plays an important role in MF-GO terms and that our graph representations contain more information, better representing and differentiating proteins in high-dimensional space. Dynamics-informed representation increases the discriminatory power for function classification The correlation edges in our graph construction establish connections between residues far apart in space but dynamically correlated in terms of functional motions. These distant connections ameliorate the information flow in graphs by permitting message to pass over a long distance without the limitation imposed by the depth of graph convolutional layers. This largely alleviates the restriction of the depth of GNNs in consideration of the computing efficiency. To examine the effectiveness of correlation edges for protein function classification, we construct two sets of graphs, one with only contact edges and another with both contact and correlation edges (Fig. 3a) . In addition to graphs, we calculate 1D and 2D persistence diagrams of C α atoms in each protein, the coarsened topological features measured by persistent homology 20, 28 , and embed the diagrams into persistence images via Gaussian kernels and pixelated integrals 29 (see Methods). Our model takes input from three feature pipelines: (1) spatial information from the contact map where residue pairs are closer than cutoff distance r c such as 8Å and 12Å, (2) topological information from the persistence image, which is preprocessed as vectors of 625 dimensions, and (3) dynamic information based on the correlation map such that residue pairs with absolute correlation no less than 0.5 are connected with correlation edges if they are not in contact. As shown in Fig. 3b , the graphs are fed into five graph convolutional layers in sequence. Each layer of graph convolution takes the adjacency matrix and node features as input and outputs the aggregated node features to the next layer. In this work, we adopt the graph convolutional network (GCN) by Kipf et al. 30 To examine the utility and importance of structural, topological, and dynamic features in predicting protein functions, we design four model architectures: Contact, ContactPers, ContactCorr, and Contact-CorrPers, with each model having different combinations of feature pipelines (Table 1) . We also construct two datasets with different cutoff distances r c = 8Å and r c = 12Å to observe the effect of the cutoff on the classification performance. The choices of cutoff distance for graph-based protein feature extraction vary from approximately 6 to 15Å in the literature 17, 20, 32 . Here we follow the cutoff 8Å used by PersGNN 20 for a fair comparison. Furthermore, to avoid the influence of imbalanced positive and negative samples, we calculate the balanced accuracy (arithmetic mean of the true positive rate and true negative rate) of the test set as the immediate metric of model performance. Due to the usage of weighted binary cross-entropy loss (Methods), the models tend to have higher final recall at the cost of precision. We regard this phenomenon as the desired outcome because we expect models to capture as many functions as possible. and PersGNN 20 . We evaluate the trained classifiers by leveraging the threshold at the final sigmoid activation. The area under the precision-recall curve (AUPR) and the maximum F1 score (F1-max) are calculated to compare the model performance. We find that GCNs in general outperform GraphSAGEs with maximum aggregators, despite a small margin of improvement. This agrees with the previous experimental and theoretical study showing that the mean aggregator essentially retains more discriminatory power than max aggregators since mean aggregator, although more simplistic than the sum aggregator, learns the distribution of the node features such that it can be as powerful as the sum aggregator if the feature distribution is diverse or the recovery to the sum is possible 33 . Our result also shows that Con-tactPers and ContactCorr comparably enhance the representational capacity of the Contact model, with the ContactCorr model outperforming ContactPers by up to 0.005 in AUPR and 0.004 in F1-max score when GraphSAGE and the r c = 12Å dataset are used. However, it is worth noting that by combining persistence and correlation together, the ContactCorrPers model has a notable performance gain that is greater than the direct sum of individual improvements by ContactPers and ContactCorr. The most robust classifier of the above models is ContactCorrPers with GCN convolutional layers for both the 8Å and 12Å datasets. This model outperforms the state-of-the-art models in terms of AUPR and F1-max scores by significant margins, as presented in Fig. 4 and Table 2 . Fig. 4c illustrates two-dimensional projec- DARs are critical functional residues with dynamic importance

Proteins are complex molecular machines whose functions are closely correlated to key functional residues. The experimental approaches usually include indirect methods, e.g. ,point mutation of specific residues, to identify the important collections of residues and their underlying functional mechanism. Examples abound in active sites of an enzyme, ligand-binding sites, ionic gating, protein-protein interaction (PPI), and mechanical conformation changes involving soft and stiff secondary structures 17 can thus be expressed as

where w l (k−1)d N + j is the equivalent linear weight from the GMP layer to the graph embedding l, and w m l is the equivalent linear weight from graph embedding to the output layer (see the red blocks in Fig. 5a ). d N and d G are node and graph embedding dimensions, respectively. K is the total depth of graph convolutions.

1 is the binary indicator function which equals 1 if the true condition is satisfied. For each PDB chain, we record the outputs activated by five graph convolutional layers (red arrow in Fig. 5a ) and calculate the sum of the outputs weighted by the partial derivatives of GO annotations with respect to the feature channels in the convolutional layers. As the outputs forwardly propagate through two fully connected layers with layer normalization and ReLU activation, the corresponding weights are retrieved during calculations. Fig. 5b shows the Grad-CAMs of hMTH1 (PDB 5GHI) predicted by the Contact and ContactCorr network with a cutoff of 12Å. Both activation maps successfully identify hydrogen bonding aspartate residues (Asp119 and Asp120), while only the ContactCor network activates L-A, the loop surrounding the binding pocket ( Fig. 5c-d) . As shown in Fig. 5b ,e, the difference in the saliency maps (∆SALS, see Methods) emphasizes residues from Phe113 to Asp120 and loop L-A, reflecting the dynamic significance of their roles in substrate binding 14 . The binding pocket changes conformation to recognize different substrates, as shown previously in Fig. 1b 14 . Notably, the peaks of the activation map by the ContactCorr network are generally larger than those by the Contact network. This is attributed to the contribution of the correlation edges in the ContactCorr network. The correlation edges reinforce the activation of residues and provide more confident functional inferences in graphs. These results show that with dynamic information in graph representation, ProDAR is able to identify the residues that are overlooked by the networks with mere input from the contact map.

13 Fig. 5 DARs and functional residue identification. a Schematic derivation of the importance weight α m i at channel i of the node embeddings for function m. b Grad-CAMs of the Contact and ContactCorr networks for PDB 4GHI and the saliency difference (∆SALS) between the Grad-CAM obtained by the ContactCorr network and the Grad-CAM obtained by the Contact network. c-d PDB structure annotations of activation maps predicted by c Contact and d ContactCorr networks. All residues are colored with a marine-magenta gradient according to the Grad-CAM profiles after minmax normalization, with more salient residues highlighted in magenta and less salient residues colored in marine blue. e PDB structure annotation of the saliency difference. All residues are colored with blue-white-red gradient. The residues colored in red and blue represent reinforcement and suppression, respectively, while the residues colored in white are comparably salient in both Grad-CAMs. in that residues 30-34 and 129-130 are inside A-B and G-H loops, respectively, and that residues [39] [40] [41] [42] [43] are located at the β -sheet in connection with the A-B loop. Moreover, residues 120-124 are pertinent to hydrogen bonds with water in the distal pocket 37 . In contrast, the residues near Val25 and Tyr105, the β -sheets with commonplace structural motifs, are suppressed. This implies that the ContactCorr network, unlike the Contact network, has a stronger denoising capability and does not activate trivial residues in protein.

In the structure of the D129A/L130A mutant of NP4, the 129-130 peptide bond is shifted by ≈ 2Å, and Ala130 is rotated away from the heme, leading to an enlarged distal pocket for additional water molecules.

This nullification of abilities for NO binding and release is precisely reflected in the activation map by the ContactCorr network. By comparing the Grad-CAMs of the ContactCorr network in Fig. 6a and b, we note that Asp30, Ala129, and Ala130 are deactivated in the D129A/L130A structure. Moreover, the Grad-CAMs of the ContactCorr network are relatively more stable in the T121V and D129A/L130A structures.

The two activation profiles exhibit similar activated sites. However, despite the ability to consistently identify some parts of functional residues (such as residues around 40 and 121), the Grad-CAMs of the Contact network show incoherent activation profiles for the other residues. In particular, it is unable to properly respond to the D129A/L130A mutations, leading to erroneous overactivation around Asp147 and Tyr134. Additionally, note that N-terminus, which forms hydrogen bonds with Asp 129 in the T121V structure and with Ala130 in the D129A/L130A structure, has larger activation values in the ContactCorr profiles than those in Contact profiles. This reveals that ProDAR has better sensitivity to structural nuance and stronger confidence for functional residue identification. Grad-CAMs and ∆SALS on T121V mutant of nitrophorin 4 (NP4) with nitric oxide, ferric heme, and phosphate ion (PDB 1SY1). b Grad-CAMs and ∆SALS on D129A/L130A mutant of NP4 with ferric heme and ammonium ion (PDB 1SY2). In Grad-CAMs, all residues are colored with marine-magenta gradient according to the Grad-CAM profiles after min-max normalization, with more salient residues highlighted in magenta and less salient residues colored in marine blue. In ∆SALS, all residues are colored with blue-white-red gradient. The residues colored in red and blue represent reinforcement and suppression, respectively, while the residues colored in white are comparably salient in both Grad-CAMs.

At the forefront for the fight against the highly pathogenic severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), identifying residues of dynamic importance is urgently needed for structure-based vaccine design 38, 39 . Here, we apply ProDAR to the structure of the SARS-CoV-2 receptor binding domain (RBD) to determine whether our model is able to identify the important residues in the unseen and fragmented structure. Fig. 7a shows the complex of the SARS-CoV-2 RBD and cell receptor angiotensin converting enzyme 2 (ACE2) in cats 40 . Previous cryo-EM studies of the SARS-CoV spike protein and its interaction with ACE2 have shown that ACE2 receptor binding is a critical step for SARS-CoV to enter target cells 41 . However, the interaction between the SARS-CoV-2 RBD and ACE2 at the atomic level is not well understood. At the hydrophilic interface between the SARS-CoV-2 RBD and human ACE2 (hACE2), 13 hydrogen bonds and 2 salt bridges have been found previously 39 . Similar interactions have been found at the interface between the SARS-CoV-2 RBD and cat ACE2 (cACE2) 40 . In the complex of SARS-CoV-2 RBD and cACE2 40 (Fig. 7a) , seven residues (Leu455, Phe456, Tyr473, Ser477, Phe486, Asn487, and Tyr489) on the β 1 /β 2 loop of SARS-CoV-2 RBD have been found in contact with cACE2, including 2 hydrogen bonds (Ser477 with Gln18 of cACE2 and Asn487 with Tyr83 of cACE2 40 ). As shown in Fig. 7b and c, ProDAR successfully activates the aforementioned residues in the Grad-CAM of the ContactCorr network. Thr500, Asn501, and Tyr505, the residues contributing to the contacts of SARS-CoV-2 RBD with both hACE2 and cACE2 40 , are also activated. Although neither the complex nor the individual atomic structure of SARS-CoV-2 RBD and cACE2 is in the training dataset and cACE2 is not present in the inference phase, the model yet exhibits strong predictive power and makes cogent inference on residues.

In the Grad-CAM of the ContactCorr network, we find that there is a second largest peak around Ser373, which is not widely discussed in the literature. Molecular dynamics (MD) simulation is performed to confirm whether the residues around Ser373 are indeed crucial for protein functions. We simulate the full atomistic model of SARS-CoV-2 spike glycoprotein and find that Ser373 forms hydrogen bonds with other SARS-CoV-2 RBDs in the trimeric unit. In Fig. 7d , we show the hydrogen-bonding patterns between the SARS-CoV-2 RBDs in the entire homotrimeric spike glycoprotein 38 . Our results indicate that in the closed state, Ser373 tends to form hydrogen bonds with Asp405, Glu406, and Arg408 and stabilize the trimer of RBDs. At 0.77 ns, two and one hydrogen bonds form at the A-B and B-C interfaces, respectively, while no hydrogen bonds are found at the C-A interface. Fig. 7e shows the number of hydrogen bonds between SARS-CoV-2 RBDs. Three Ser373-Asp405 hydrogen bonds form at the same time, while Arg408-Ser373 pairs can form at most one hydrogen bond simultaneously. The occupancy analysis also shows that S373:OG-D406:OE1 is the most stable hydrogen bond, with occupancy close to 80%, and that S373:OG-D405:OD1 and R408:NH2-S373:O have occupancy values of approximately 60%, whereas S373:OG-D405:CG has only 13.86% occupancy (Fig. 7h) . The results again reveal that ProDAR can pinpoint the critical residues that have important functionality and molecular characteristics, even though the structure is not included in the training dataset. Different approaches for encoding dynamic information into graphs are of interest in future studies.

The straightforward merging of contact and correlation edges can be replaced by more sophisticated graph architectures, e.g., dual graphs, directed graphs, and graphs with multiple types of edges. The values obtained from the correlation map can also be applied as the weights for edges, resembling the motifs 20 of attention-based networks (for example, Graph ATtention network). While this work focuses on the utility of correlation edges and their implications for functional residue identification, we demonstrate that ProDAR has remarkable robustness against noise and has unprecedented interpretability for both regular and mutated PDB structures. Our model also shows strong generalizability to unseen PDB structures.

Without a priori knowledge of ligands and hydrogen bonds, ProDAR can still provide reliable inferences regarding PDB structures at the residue level.

Herein, we describe a method to encode the dynamic information of proteins in graph representation.

This work is the first to incorporate the NMA of proteins into a deep learning framework for functional prediction. In addition to the contact map commonly used by the previous graph-based deep learning models 17, 19, 20 , we compute the correlation map and add edges between the residue pairs that are highly correlated in the first 20 modes. These correlation edges connect the dynamically correlated residues distant in space and allow message to pass over long spatial distances. The model therefore alters the graph inference regime and better captures the dynamic information intrinsically residing in the native conformation of proteins. Such dynamics-informed graph representation achieves supreme performance in protein function classification and increases the discriminatory power of neural networks. The models with input from dynamic information identify DARs that are both structurally and functionally meaningful to protein function.

To summarize, we close the information gap between structure and function by introducing dynamic analysis in the graph construction process. Our method bridging structure and function with dynamics provides the missing part of the deep learning framework for protein function prediction and sketches a comprehensive picture from sequence to function, with great potential to advance science in mutation detection and function engineering through rapid screening over large sequential and structural spaces.

Given a surrogate physical system for a protein represented by N interaction particles (C α atoms), the potential energy can be expressed as the Taylor expansions near the native conformation q o . By omitting higher order term to the second order, the required energy to transit from q o to q becomes the sum of the pairwise potentials

where H is the 3N × 3N Hessian matrix of the second derivatives of the potential with respect to particle coordinates at equilibrium

. By considering only the internal interactions under classical regime, the equation of motion of particles can be written as

where M is the diagonal matrix of particle masses. The solution to equation (4) is a 3N-dimensional harmonic oscillator, which can be further generalized as

where a (k) is a complex vector containing the amplitude and phase factor, and ω k is the frequency of the k-th solution. Substituting the solution (equation (5)) into equation (4) gives the generalized eigenvalue equation

. By stacking a complete set of solution vectors into a single matrix U and arranging the corresponding squared frequencies as a diagonal matrix Λ, the equation (5) can be rewritten as

. The equation (7) can be solved by transforming it into a standard eigenvalue equation through mass-

. Thus the massed-weighted eigenvalue equation becomes

. The equation (10) has 3N − 6 non-zero eigenvalues, reflecting six degenerate eigenstates of three translational and three rotational degrees of freedom. The mass-weighed HessianH remains real, symmetric, and positive semi-definite as the original Hessian H by construction. Its eigenvectorsũ (k) (column vectors ofŨ, k = 1, 2, · · · , 3N ) form an orthonormal basis and construct normal modes of the system in the mass-weighed coordinates. The original normal modes in Cartesian coordinates can hence be obtained by

SinceŨ is a unitary matrix, the orthogonality is again satisfied by

The energy change associated with a given mode k is proportional to the square of its mode frequency ω k , as seen in equation (2):

Fluctuations along the high-frequency modes are therefore energetically more expensive than those along the low-frequency modes. According to the equipartition theorem, the vibrational energy is prone to be equally partitioned among all the modes, such that the vibrational amplitudes are scaled by 1/ω 2 k .

NMA is based on the assumption of symmetric and positive semi-definite Hessian matrix. Strictly speaking, energy minimization needs to be carried out before performing NMA on protein crystal structure to ensure the local energy minimum. However, energy minimization is computationally intensive.

An alternative method is adopting elastic network model (ENM) that accept the initial structure (crystal structure from PDB) to be the conformation at local energy minimum.

Anisotropic network model (ANM) is the most broadly used ENM 13,46-48 . We first extracted C α atoms of amino acids as nodes and connected them with springs if the separating distance lies within the prescribed cutoff distance r c . The potential energy of ANM is given by the sum of pairwise harmonic potentials:

where γ αβ is the spring force constant between i, j pair of atoms; r αβ = q β − q α and r o αβ = q o β − q o α are the instantaneous distance and the equilibrium distance between two atoms, respectively. Here we use

greek letters as indices to avoid confusion with the indices of individual components used in equation (2) .

The components in the Hessian of ANM H can be obtained from the second derivatives with respect to the components of α and β atoms:

where x, y are free indices interchangeable with respect to the three dimensions of coordinates. Evaluating equation (14) at equilibrium conformation q o α , q o β yields the off-diagonal Hessian submatrices

and the diagonal Hessian submatrices

. The spring force constants γ αβ are set as 1 in this work.

To elucidate the dynamic behavior of proteins about the equilibrium conformation and find out the underlying cooperative motions, it is of interest to know the cross-correlations between residue fluctuations given by the mode shapes, i.e. the eigenvectors of the Hessian, within and across different normal modes.

To this end, we can calcualte the ensemble average of cross-correlations between certain components of eigenvectors

where Z is the configurational integral

, and H −1 i j is the i, j-th element in the inverse of Hessian. The integration performs over the entire configurational space following Boltzmann distribution and therefore favors low energy modes with high probability over high energy modes. By equation (10) , the inverse of Hessian can be expressed as

. Since there are six degenerate modes with zero eigenvalues, the inverse of Hessian is replaced by the pseudoinverse, which takes the sum of non-degenerate modes

. Therewith we have the cross-correlation matrix that is the product of inverse eigenvalues and eigenvectors 25 by transforming it into the Cartesian coordinates

. It should be noted that C has size of 3N × 3N and gives the cross-correlation between the individual components of atom coordinates. We therefore reduce it into N × N matrix c by taking the trace of the 3 × 3 submatrices of C, which is equivalent to the sum of the inner products of eigenvectors between i, j residues:

Note that all elements of cross-correlation matrix c are linearly proportional to k B T and thus can be canceled out after normalization with respect to each residues. We therefore reach our final normalized cross-correlation matrix (correlation map)c

such that all of the diagonal elements are equal to 1 and off-diagonal elements represent the normalized correlation between residues.

Persistent homology (PH) is an algebraic topology method for measuring topological features of shapes, data, and functions 28, 49 . It provides a stable tool for a myriad of applications, including biomolecules classification, geometric modeling, and network analysis.

Consider the union of identical ε-balls around C α atoms. The simplicial complex (points, lines, triangles, and terahdra) of C α atoms start to emerge and then disappear (birth and death of k-simplices) as the radii of the balls increase. The nested sequence of this simplicial complex is the filtration filtered by the radius of the ball. Each birth-death pair is then transformed into the persistence diagram (PD) by mapping to birth-persistence coordinate. The vertical distance between diagonal line and point is exactly the persistence of the topological feature. We used alpha complex to compute the one-and two-dimensional persistent homology group H 1 , H 2 . For a given α, the alpha complex includes all the simplices in Delaunay triangulation which have an empty circumscribing ball with squared radius equal or smaller than α.

The filtration of alpha complex therefore allows us to extract the topological features of loops and cavities in protein structures.

Let D be a PD in birth-death coordinates and T : R 2 → R 2 be the linear transformation T (x 1 ,

x 1 ) from birth-death coordinates to birth-persistence coordinates. At each transformed point u = (u 1 ,

we placed a Gaussian kernel function

where σ 2 is the variance of the gaussian kernel. With an appropriate weighting function w : R 2 → R, we transform PD into the persistence surface

. We designate σ 2 = 1 and w as the square of persistence

. The function intensifies the feature with larger persistence and satisfies stability requirements being zero along z 1 , continuous, and piecewise differentiable 28 . By taking the pixel-wise integral of the discretized persistence surface, we obtain the persistence image (PI) consisting of a vector of integral value for each pixel p

. to select the optimal ones. However, previous experimental studies have indicated that the classification accuracy in machine learning framework is robust against the choice of image resolution 28, 50 and has low sensitivity to the variance of Gaussian kernel (equation (23)). The implementation of persistent homology and persistence image representation were carried out using GUDHI library with CGAL as the backend for alpha complex calculation 51, 52 . To optimize the multi-label classification prediction and take the imbalanced samples into account, all models were trained to minimize the weighted cross-entropy loss function that induces models to act if the balanced samples are given:

w j y i j log ŷ i j + (1 − y i j ) log 1 −ŷ i j (27) where N is the total number of training samples, L is the number of MF-GO classes, y i j is the true binary indicator for protein i to have MF-GO function j, andŷ i j is the predicted probability that protein i is annotated with MF-GO function j; w j is the added weight for the positive class j, which is assigned as the ratio between negative and positive samples

For inference prediction, the protein i is said to have function j annotated with associated GO terms if the predicted probabilityŷ i j ≥ 0.5. True positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) are calculated by varying the threshold value from 0 to 1 to obtain the precision-recall curve.

We randomly separated our dataset into train and test set by 90% and 10%. We used ADAM optimizer 57 with learning rate LR = 5 × 10 −5 and exponential decay rate β 1 = 0.9, β 2 = 0.999. To avoid overfitting, we employed weight decay 1 × 10 −5 58 and further applied dropout regularization p = 0.1 and layer normalization after each linear layer 59 . Each model is trained for 300 epochs with batch size of 64 on NVIDIA V100 16/32G GPU. The model training is implemented using PyTorch deep learning library 60 .

To locate the residues crucial for protein function, we devised a method inspired by Gradient-weighted Class Activation Map (Grad-CAM) 17, 27 to find the residues with highest contribution at inference stage. Motivated by its recent success in image classification and residue-level annotations for proteins, we used Grad-CAM to identify the residues that are important for function prediction. The method produces visual explanation from neural network and highlights the features that contribute more to the activation in the downstream neurons.

We used PyTorch hook to extract the residue embeddings after each graph convolutional layer H (k) ∈ R V ×D k . To compute the contribution of each residue i to the prediction of function j, we computed derivatives of the model output layerŷ j with respect to residue embeddings h (k)

i ∈ R D k :

where α j measures the importance of each feature in residue embeddings for predicting function j. Instead of summing the derivatives from individual residue for measuring the importance of a specific feature map 29 f c ∈ R V (i.e. c channel of residue embeddings along residue space H (k) = [f 1 , · · · , f D k ] = [h 1 , · · · , h V ] ) as implemented by Gligorijević et al. 17 , here we maintained the vector form of feature map and took its inner product with the importance vector for each residue to obtain the funtion-specific heatmap in the residue space:

where ReLU function ensures that only positive contributions to the function prediction are preserved.

The difference of saliency map is given by:

where I j is the intensity of activation signal at residue j, ∀ j ∈ V . The saliency map accentuates the region that is noticeable among its neighborhood, providing a self-referential measure of feature importance such that the comparison between the Grad-CAM from ContactCorr network and the vanilla one from Contact network becomes meaningful. The difference of saliency map ∆SALS therefore identifies the residues that are confidently activated or suppressed by the ContactCorr network compared with Contact network.

Principles of protein X-ray crystallography

How cryo-em is revolutionizing structural biology

Cryo-electron microscopy methodology: current aspects and future directions

Extraction of protein dynamics information from cryo-em maps using deep learning

Announcing the worldwide protein data bank

Uniprot: the universal protein knowledgebase in 2021

Comparative protein structure modeling using modeller

Swiss-model: 31 homology modelling of protein structures and complexes

Highly accurate protein structure prediction with alphafold

Accurate prediction of protein structures and interactions using a three-track neural network

Se (3)-transformers: 3d rototranslation equivariant attention networks

Generative models for graphbased protein design

Normal mode analysis of biomolecular structures: functional mechanisms of membrane proteins

Structural and kinetic studies of the human nudix hydrolase mth1 reveal the mechanism for its broad substrate specificity

Towards region-specific propagation of protein functions

Cath functional families predict functional sites in proteins

based protein function prediction using graph convolutional networks

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Proteingcn: Protein model quality assessment using graph convolutional networks. bioRxiv

Persgnn: Applying topological data analysis and geometric deep learning to structure-based protein function prediction

Pdbe: towards reusable data delivery infrastructure at protein data bank in europe

Rcsb protein data bank: powerful new tools for exploring 3d structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences

Protein data bank japan (pdbj): updated user interfaces, resource description framework, analysis tools for large structures

New tools and functions in dataout activities at protein data bank japan (pdbj)

Sifts: structure integration with function, taxonomy and sequences resource

Sifts: updated structure integration with function, taxonomy and sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins

Grad-cam: Visual explanations from deep networks via gradient-based localization

Persistent homology-a survey

Persistence images: A stable vector representation of persistent homology

Semi-supervised classification with graph convolutional networks

Inductive representation learning on large graphs

Protein function prediction via graph kernels

How powerful are graph neural networks?

High precision protein functional site detection using 3d convolutional neural networks

Predicting protein function from domain content

Learning deep features for discriminative localization

Role of binding site loops in controlling nitric oxide release: structure and kinetics of mutant forms of nitrophorin 4

Structure, function, and antigenicity of the sars-cov-2 spike glycoprotein

Structure of the sars-cov-2 spike receptor-binding domain bound to the ace2 receptor

Broad host range of sars-cov-2 and the molecular basis for sars-cov-2 binding to cat ace2

Cryo-em structures of mers-cov and sars-cov spike glycoproteins reveal the dynamic receptor binding domains

A new set of amino acid descriptors and its application in peptide qsars

Where did the blosum62 alignment score matrix come from?

Learning protein sequence embeddings using information from structure

Amino acid encoding for deep learning applications

Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Physical review letters

Anisotropy of fluctuation dynamics of proteins with an elastic network model

Anisotropic network model: systematic evaluation and a new web interface

Computational topology: an introduction

Topological descriptors for 3d surface analysis

The gudhi library: Simplicial complexes and persistent homology

3D alpha shapes

Prody: protein dynamics inferred from theory and experiments

Biopython: freely available python tools for computational molecular biology and bioinformatics

The PyMOL molecular graphics system, version 1.8

Vmd: visual molecular dynamics

Adam: A method for stochastic optimization

Automatic differentiation in pytorch

The authors appreciate the financial support from the Ministry of Science and Technology, Taiwan [109- 2224-E-007-003, 110-2636-E-002-013]. We thank National Center for High-performance Computing (NCHC) for providing computational and storage resources. 

The authors declare no competing interests.

The code implementation associated with this paper is publicly available on GitHub (https://github.com/Chiang-Yuan/ProDAR)