key: cord-0743984-qay58ug1 authors: Lu, Shuai; Li, Yuguang; Nan, Xiaofei; Zhang, Shoutao title: A Structure-based B-cell Epitope Prediction Model Through Combining Local and Global Features date: 2021-07-14 journal: bioRxiv DOI: 10.1101/2021.07.13.452188 sha: 40097fc1be4c98a8b079161db1794dd9fb678da5 doc_id: 743984 cord_uid: qay58ug1 B-cell epitopes (BCEs) are a set of specific sites on the surface of an antigen that bind to an antibody produced by B-cells. The recognition of epitopes is a major challenge for drug design and vaccine development. Compared with experimental methods, computational approaches have strong potential for epitope prediction at much lower cost. Moreover, most current methods use only local information around the target amino acid residue for BCE prediction, without taking the global information of the whole antigen sequence into consideration. We propose a novel deep learning method that combines local and global features for BCE prediction. In our model, two parallel modules are built to extract local and global features from the antigen separately. For local features, we use graph convolutional networks to capture information from the spatial neighbors of a target amino acid residue. For global features, Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM) networks are applied to extract information from the whole antigen sequence. The local and global features are then combined to predict BCEs. Experiments show that the proposed method achieves superior performance over state-of-the-art BCE prediction methods on benchmark datasets. We also compare model performance with and without global features; the results show that global features play an important role in BCE prediction.
The humoral immune system protects the body from foreign objects such as bacteria and viruses by developing B-cells and producing antibodies [1]. Antibodies play a crucial role in the immune response by recognizing and binding disease-causing agents, called antigens. B-cell epitopes (BCEs) are certain regions on the antigen surface that are bound by an antibody [2]. BCEs of protein antigens can be roughly classified into two categories, linear and conformational [3].
Linear BCEs consist of residues that are contiguous in the antigen primary sequence, while conformational BCEs comprise residues that are not contiguous in sequence but fold together in three-dimensional space. About 10% of BCEs are linear and about 90% are conformational [4]. In our study, we focus on conformational BCEs of protein antigens. The localization and identification of epitopes is of great importance for the development of vaccines and the design of therapeutic antibodies [5], [6]. However, traditional experimental methods to identify BCEs are still expensive and time-consuming [7]. Therefore, many computational approaches based on machine learning algorithms have been developed to predict BCEs. These approaches can be divided into two categories: sequence-based and structure-based methods. As the names imply, sequence-based approaches predict BCEs from the antigen sequence alone, while structure-based approaches also consider structural features. Various structure-based predictors have been developed to predict and analyze BCEs, including BeTop [8], Bpredictor [9], DiscoTope 2.0 [10], CE-KEG [11], CeePre [12], EpiPred [13], ASE Pred [14] and PECAN [15]. Some of these methods improve performance by introducing novel features, such as statistical features in BeTop, thick surface patches in Bpredictor, a new spatial neighborhood definition and half-sphere exposure in DiscoTope 2.0, knowledge-based energy and geometrical neighboring residue contents in CE-KEG, B factors in CeePre, and surface patches in ASE Pred. Besides novel features, antibody structure information and more powerful models have also improved BCE prediction results. EpiPred utilizes antibody structure information to annotate the epitope region and improves global docking results. PECAN represents the antigen or antibody structure as a graph and employs graph convolution operations to aggregate information from spatially neighboring residues.
An additional attention layer is used in PECAN to encode the context of the partner antibody. However, all current structure-based BCE prediction methods use only local information around the target amino acid residue, without considering the global information of the whole antigen sequence. Global features have been shown to be effective in several biological sequence analysis models, such as the protein-protein interaction site prediction model DeepPPISP [16] and the protein phosphorylation site prediction model DeepPSP [17]. However, the effectiveness of global features depends on the model used to process them. DeepPPISP utilizes TextCNNs to process the whole protein sequence for protein-protein interaction site prediction. DeepPSP employs SENet blocks and Bi-LSTM blocks to extract global features for protein phosphorylation sites. In our study, we take advantage of Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM) networks, which were first introduced for relation classification in the field of natural language processing (NLP) [18]. Att-BLSTM networks have also been employed for chemical and biomedical text processing tasks, including chemical named entity recognition [19] and biomedical event extraction [20]. Given the excellent performance of Att-BLSTM, we combine it with Graph Convolutional Networks (GCNs) [21] for BCE prediction.

In this study, we propose a structure-based BCE prediction model utilizing both antigen local features and global features. By combining Att-BLSTM networks and GCNs, both local and global features are used in our model to improve its prediction performance. We evaluate our model on public datasets, and the results show that global features provide useful information for BCE prediction.

We use the same antibody-antigen complexes as PECAN [15]. These complexes come from two separate datasets: one from EpiPred [13], and the other from the Docking Benchmarking Dataset (DBD) v5 [22].
There are 148 antibody-antigen complexes from EpiPred [13], 118 for training and 30 for testing, and all of them share no more than 90% pairwise sequence identity. To construct a separate validation set, the authors of [15] filtered the antibody-antigen complexes in DBD v5 [22] to guarantee that all antigens in the validation set share no more than 25% pairwise sequence identity with those in the testing set. In total, 162 complexes were selected: 103 for training, 29 for validation and 30 for testing. Following [13], [15], residues are labeled as part of the epitope if any of their heavy atoms is within 4.5 Å of any heavy atom of the antibody. The sizes of the datasets and the numbers of epitopes are shown in Table 1.

For global features, we represent the input antigen sequence as an ordered set of amino acid residues S = [r_1, r_2, ..., r_l], where each residue is represented as a vector r_i ∈ R^d corresponding to the i-th residue in the antigen sequence, l is the antigen sequence length, and d is the residue feature dimension. For local features, each antigen structure is represented as a graph, as in the related studies [15] and [23]. Each amino acid residue is a node in the protein graph, with features representing its properties. For residue r_i, the local environment consists of its k spatial neighboring residues, which form its receptive field N_i = {r_n1, ..., r_nk}; these neighbors define the operation field of the graph convolution. In this study, the residue node feature is a 128-dimensional vector encoding important properties, as in our earlier work [24]. These residue node features can be divided into two classes: sequence-based and structure-based. Sequence-based features consist of the one-hot encoding of the amino acid residue type, seven physicochemical parameters [25], and a conservation profile returned by running PSI-BLAST [26] against the nr database [27].
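The two data preparation steps above, labeling epitope residues by the 4.5 Å heavy-atom criterion and building each residue's k-nearest-neighbor receptive field, can be sketched as follows. This is a minimal NumPy illustration; the function names and array layouts are our own, and the paper's actual pipeline relies on tools such as DSSP, MSMS and Biopython for the full feature set.

```python
import numpy as np

def epitope_labels(antigen_atoms, antibody_atoms, cutoff=4.5):
    """Label an antigen residue 1 if any of its heavy atoms lies within
    `cutoff` angstroms of any antibody heavy atom. Each element of
    `antigen_atoms` is an (m_i, 3) array of heavy-atom coordinates for one
    residue; `antibody_atoms` is an (n, 3) array for the whole antibody."""
    labels = []
    for atoms in antigen_atoms:
        # pairwise distances between this residue's atoms and all antibody atoms
        d = np.linalg.norm(atoms[:, None, :] - antibody_atoms[None, :, :], axis=-1)
        labels.append(int(d.min() <= cutoff))
    return np.array(labels)

def receptive_fields(ca_coords, k=20):
    """For each residue, return the indices of its k nearest spatial
    neighbors (by C-alpha distance), defining the receptive field N_i."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude the residue itself
    return np.argsort(d, axis=1)[:, :k]
```

In the paper's setting k = 20; the sketch takes it as a parameter so smaller toy examples can be checked by hand.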
The structure-based features are calculated for each antigen structure, isolated from the antibody-antigen complex, by DSSP [28], MSMS [29], PSAIA [30] and Biopython [31]. The edge feature between two residues r_i and r_j is represented as e_ij, which reflects their spatial relationship, including the distance and angle between the residue pair, computed from their C-alpha atoms [23].

Our model solves a binary classification problem: judging whether an antigen residue binds to its partner antibody. As shown in Fig. 1, our model consists of two parallel parts: GCNs and Att-BLSTM networks. The former captures local features of the target antigen residue from its spatial neighbors, and the latter extracts global features from the whole antigen sequence. The outputs are concatenated and fed to two linked fully connected layers to predict the binding probability for each antigen residue.

Each protein is represented as a graph, and each residue is a node in the graph. We perform the graph convolution operation on the receptive field of each residue. In our study, we utilize two graph convolution operators, as in previous work [23]:

z_i = σ( W_t r_i + (1/|N_i|) Σ_{r_j ∈ N_i} W_n r_j + b_n )    (3)

z_i = σ( W_t r_i + (1/|N_i|) Σ_{r_j ∈ N_i} (W_n r_j + W_e e_ij) + b_ne )    (4)

where N_i is the receptive field, i.e. the set of neighbors of target residue r_i, W_t is the weight matrix associated with the target node, W_n is the weight matrix associated with neighboring nodes, σ is a non-linear activation function, and b_n is a bias vector. Formula (3) aggregates the node information in the receptive field. Formula (4) utilizes not only node features but also the edge features between residues, where W_e is the weight matrix associated with the edge features, e_ij represents the edge feature between residues r_i and r_j, and b_ne is a bias vector.

Besides local features, global features are crucial for BCE prediction as well. In our work, Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM) networks are used to capture global sequence information from the input antigen sequence.
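The edge-aware graph convolution operator (Formula (4)) can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's PyTorch implementation: the ReLU activation and the weight shapes are assumptions, and neighbor terms are averaged over the receptive field as the formula describes.

```python
import numpy as np

def graph_conv(R, E, neighbors, Wt, Wn, We, b):
    """One graph convolution step over residue node features: the target
    residue is transformed by Wt, each neighbor's node features by Wn and
    the corresponding edge features by We, neighbor terms are averaged over
    the receptive field, and a ReLU non-linearity is applied.
    R: (l, d) node features; E: (l, l, de) edge features;
    neighbors: (l, k) indices of each residue's receptive field."""
    l, k = neighbors.shape
    Z = np.empty((l, Wt.shape[0]))
    for i in range(l):
        # average of transformed neighbor node and edge features
        agg = sum(Wn @ R[j] + We @ E[i, j] for j in neighbors[i]) / k
        Z[i] = np.maximum(0.0, Wt @ R[i] + agg + b)  # ReLU activation
    return Z
```

Dropping the `We @ E[i, j]` term recovers the node-only operator of Formula (3), which makes the ablation between the two operators straightforward.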
Att-BLSTM was first proposed for relation classification, a natural language processing (NLP) task [18]. It captures the most important semantic information in a sentence and achieved state-of-the-art performance on a public dataset. Att-BLSTM has since been used for processing chemical and biomedical text [19], [20]. However, its advantages have not been exploited in biological sequence analysis tasks such as BCE prediction.

Fig. 2 shows the architecture of Att-BLSTM. The input antigen sequence is represented as a set of residues S = [r_1, r_2, r_3, ..., r_i, ..., r_l]^T, S ∈ R^(l×d). S is then fed into a Long Short-Term Memory (LSTM) network, which learns long-range dependencies in a sequence [32]. Typically, an LSTM unit at each time step t is computed by the following formulas:

i_t = σ(W_i r_t + U_i h_{t-1} + b_i)
f_t = σ(W_f r_t + U_f h_{t-1} + b_f)
o_t = σ(W_o r_t + U_o h_{t-1} + b_o)
g_t = tanh(W_g r_t + U_g h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where tanh is the element-wise hyperbolic tangent, σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, r_t, h_{t-1} and c_{t-1} are inputs, and h_t and c_t are outputs. There are three gates: the input gate i_t, the forget gate f_t and the output gate o_t.

A bidirectional LSTM (BiLSTM) can learn both forward and backward information of the input sequence. As shown in Fig. 2, the network contains two sub-networks for the left and right sequence contexts. For the i-th residue in the input antigen sequence, we combine the forward pass output →h_i and backward pass output ←h_i by concatenating them:

h_i = [→h_i ; ←h_i]

The output of the BiLSTM layer is the matrix H consisting of all output vectors of the input antigen residues: H = [h_1, h_2, h_3, ..., h_i, ..., h_l]^T, H ∈ R^(l×2d), where l is the input antigen sequence length and d is the residue feature dimension. The attention layer of the Att-BLSTM network employs an attention mechanism, which has been used in many biological tasks, including compound-protein interaction prediction [33], paratope prediction [15] and protein structure prediction [34].
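The LSTM gate equations above can be sketched as a single NumPy time step. This is the standard formulation only; stacking the four gates' parameters into one matrix is our own convention, and a real model would use a framework implementation such as `torch.nn.LSTM`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(r_t, h_prev, c_prev, W, U, b):
    """One LSTM time step: input (i), forget (f) and output (o) gates plus
    a candidate cell state g. W, U, b hold the four gates' parameters
    stacked along axis 0, in the order i, f, o, g."""
    d = h_prev.shape[0]
    z = W @ r_t + U @ h_prev + b   # (4d,) gate pre-activations
    i = sigmoid(z[:d])             # input gate
    f = sigmoid(z[d:2 * d])        # forget gate
    o = sigmoid(z[2 * d:3 * d])    # output gate
    g = np.tanh(z[3 * d:])         # candidate cell state
    c_t = f * c_prev + i * g       # new cell state
    h_t = o * np.tanh(c_t)         # new hidden state
    return h_t, c_t
```

A BiLSTM simply runs this step forward and backward over the sequence and concatenates the two hidden states at each position, giving the (l × 2d) output matrix H.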
In the attention layer of Att-BLSTM, a sequence-level representation s and the output y_g for the input antigen sequence are formed from a weighted sum of the output vectors in H:

M = tanh(H)
α = softmax(M w)
s = Σ_i α_i h_i
y_g = tanh(s)

where w is a learned attention vector and α contains the attention weights over sequence positions.

As shown in Figs. 1 and 2, the local features z_i extracted by the GCNs and the global features y_g derived from the Att-BLSTM network are concatenated, and then fed to two linked fully-connected layers. The probability y_i that an input antigen residue belongs to an epitope is computed as:

y_i = σ( W_2 f( W_1 [z_i ; y_g] + b_1 ) + b_2 )

where f is a non-linear activation and σ is the logistic sigmoid function.

Most structure-based methods use precision and recall as evaluation metrics. Our model predicts a probability for each antigen residue, and precision and recall are calculated using a probability threshold of 0.5. The structure-based BCE predictors DiscoTope 2.0 [10] and EpiPred [13] report precision and recall for each antigen in the testing set, and PECAN [15] reports averaged precision, recall and area under the precision-recall curve (AUC PR) over all antigens in the testing set. To compare with these state-of-the-art structure-based BCE methods, we use averaged precision, recall and AUC PR over all antigens in the testing set, as well as the precision and recall values of each testing antigen. To evaluate the effects of local and global features in our method, we use the area under the receiver operating characteristic curve (AUC ROC) and AUC PR. AUC PR is more sensitive than AUC ROC on imbalanced data, and the datasets used for BCE prediction are roughly 90% negative class. Therefore, we take AUC PR as the most important metric for model evaluation [35].

We implement our model using PyTorch v1.4. The training details of the neural networks are as follows: optimization: Momentum optimizer with Nesterov accelerated gradients; learning rate: 0.001; batch size: 32; dropout: 0.5; spatial neighbors in the graph: 20; number of LSTM layers in the Att-BLSTM networks: 1, 2 or 3; number of graph convolution layers: 1, 2 or 3.
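The attention-layer weighted sum over BiLSTM outputs can be sketched in NumPy as below. The softmax-over-positions formulation follows the original Att-BLSTM paper; the variable names are ours.

```python
import numpy as np

def attention_pool(H, w):
    """Attention pooling over BiLSTM outputs: M = tanh(H), scores are
    computed against a learned attention vector w, softmax-normalized over
    sequence positions, and the sequence representation is the weighted sum
    of the rows of H. H: (l, 2d) BiLSTM outputs; w: (2d,)."""
    M = np.tanh(H)
    scores = M @ w                        # (l,) one score per residue
    alpha = np.exp(scores - scores.max()) # numerically stable softmax
    alpha /= alpha.sum()
    s = alpha @ H                         # weighted sum of output vectors
    return np.tanh(s), alpha              # global feature y_g and weights
```

Because α sums to one, y_g stays on the same scale as a single BiLSTM output regardless of antigen length, which is what makes it usable as a fixed-size global feature for every residue.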
Training time per epoch varies from roughly 1 to 3 minutes depending on network depth, using a single NVIDIA RTX 2080 GPU. For each combination, networks are trained until the performance on the validation set stops improving, or for a maximum of 250 epochs. The graph convolution networks have the following numbers of filters for 1, 2 and 3 layers, respectively: (256), (256, 512), (256, 256, 512). All weight matrices are initialized as in [23] and biases are set to zero.

In this section, we focus on which network combinations are most effective. The AUC ROC and AUC PR are shown in Figure 3, where the label G denotes the model processing global features and the number after G indicates the model depth; for instance, G1 represents a 1-layer Att-BLSTM network.

First, we train our model with a 1-layer Att-BLSTM and varying GCN depths, with or without residue edge features. From Figures 3A and 3B, we observe that the 2-layer GCNs with residue edge features perform best (AUC ROC = 0.804, AUC PR = 0.376). This agrees with the conclusion of our earlier work on antibody paratope prediction [24]. We also find that residue edge features consistently provide better performance as the GCN depth varies. Similar results were found in the protein interface prediction task using GCNs [23].

Second, 2-layer GCNs are combined with Att-BLSTM networks of different depths. Figures 3C and 3D show the performance evaluated by AUC ROC and AUC PR. The combination of a 1-layer Att-BLSTM network and 2-layer GCNs with residue edge features still gives the best results. In general, the deeper the Att-BLSTM networks, the worse the results. As discussed in DeepPPISP [16], global features may capture relationships among residues over longer distances; however, those relationships might be weakened when the Att-BLSTM networks grow deeper.
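The stopping rule used during training, halt when validation performance stops improving, with a hard cap of 250 epochs, can be sketched with a small helper. The patience value here is an assumption; the paper does not state one.

```python
class EarlyStopper:
    """Track a validation metric (e.g. AUC PR) and signal when to stop
    training: after `patience` epochs without improvement, or once the
    `max_epochs` cap is reached."""

    def __init__(self, patience=10, max_epochs=250):
        self.patience = patience
        self.max_epochs = max_epochs
        self.best = -float("inf")
        self.bad_epochs = 0
        self.epoch = 0

    def should_stop(self, val_metric):
        """Call once per epoch with the current validation metric."""
        self.epoch += 1
        if val_metric > self.best:
            self.best = val_metric   # new best: reset the patience counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.epoch >= self.max_epochs or self.bad_epochs >= self.patience
```

The training loop would call `should_stop(val_auc_pr)` after each epoch and keep the checkpoint corresponding to `best`.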
In summary, our model with 2-layer GCNs and a 1-layer Att-BLSTM network performs best; this is the proposed model in this paper and is used for comparison with competing methods in the following sections.

Global features have been shown to improve the performance of protein-protein interaction site prediction in DeepPPISP [16] and protein phosphorylation site prediction in DeepPSP [17]. To verify whether global features are also effective for BCE prediction, we remove the Att-BLSTM networks from our model for comparison. As shown in Figures 3C and 3D, the label G0 means there are only GCNs in our model and no global features are used. Without global features, the AUC ROC is 0.787, which is lower than that of the proposed model G1NE2 (also lower than G2NE2, but slightly better than G3NE2), and the AUC PR is 0.335, which is significantly worse than that of the proposed model (but slightly better than G2NE2 and G3NE2). The model without global features performs worse than our proposed model on both AUC ROC and AUC PR. Therefore, global features improve the performance of our model for BCE prediction. However, models with global features do not always outperform the model without global features in our experiments. A similar observation was made in DeepPSP [17], while DeepPPISP [16] reached the contrary conclusion. This discrepancy might be caused by the different models used to process global features: DeepPPISP uses a simple fully-connected network, while DeepPSP uses SENet blocks and Bi-LSTM blocks.

Different types of input features (sequence- and structure-based) play different roles in our model. The input features consist of four types: residue type one-hot encoding (a), profile features (b), seven physicochemical parameters (c) and structural features (d). To discover what role each feature type plays in our method, we delete each input feature type in turn and compare the resulting performance of our proposed model (G1NE2, i.e.
1-layer Att-BLSTM network and 2-layer GCNs with residue edge features). Figure 4 shows the experimental results. As Figure 4A shows, the AUC ROC without feature type b is 0.754, significantly lower than the best performance of 0.804. The AUC PR without feature type b shows the largest drop, from 0.376 (all features) to 0.289. This indicates that the profile features (feature type b) are the most important in our model for BCE prediction. The model using all features still performs best on both AUC ROC and AUC PR.

To evaluate the performance of our method for BCE prediction, we compare our proposed model with three competing structure-based BCE prediction methods: DiscoTope 2.0 [10], EpiPred [13] and PECAN [15]. Note that these methods all use local features but do not consider global features. Table 2 shows the averaged AUC PR, precision and recall of our proposed model on the testing set; the results for the three competing methods are taken from [15]. Although our model obtains lower recall than PECAN, it surpasses all competing methods on precision and makes a distinct improvement on AUC PR. We also compare the results for each antigen in the testing set with DiscoTope 2.0 and EpiPred. The results for DiscoTope 2.0 and EpiPred in Table 3 are taken from [13]; values in bold indicate the best prediction result. Our model achieves the best precision on 26 and the best recall on 13 of the 30 antigens in the testing set. We also observe that our model produces usable predictions even for long antigen targets.

Accurate prediction of BCEs is helpful for understanding the basis of immune interactions and is beneficial for therapeutic design. In this work, we propose a novel deep learning framework combining local and global features, extracted from the antigen sequence and structure, to predict BCEs. GCNs are used to capture the local features of a target residue.
Att-BLSTM networks are used to extract global features, which capture the relationship between a target residue and the whole antigen. We evaluate our model on a popular public dataset, and the results show improved BCE prediction. Moreover, our results indicate that global features are useful for improving the prediction of BCEs. Though our method outperforms competing computational methods for BCE prediction, it also has some disadvantages. The first is that our predictor requires the antigen structure, as it takes structure-based residue features as input. The second is that it requires long computation times, because PSI-BLAST [26] must be run at the feature extraction stage. In this study, we show that combining local and global features can be useful for BCE prediction, and we would seek to further improve BCE prediction in future work.

References
[1] The Chemistry and Mechanism of Antibody Binding to Protein Antigens.
[2] Submitting antibodies to binding arbitration.
[3] Continuous and discontinuous protein antigenic determinants.
[4] Hybrid methods for B-cell epitope prediction: approaches to the development and utilization of computational tools for practical applications.
[5] Epitope mapping: The first step in developing epitope-based vaccines.
[6] Therapeutic antibodies for autoimmunity and inflammation.
[7] Current approaches to fine mapping of antigen-antibody interactions.
[8] B-cell epitope prediction through a graph model.
[9] Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature.
[10] Reliable B Cell Epitope Predictions: Impacts of Method Development and Improved Benchmarking.
[11] Prediction of conformational epitopes with the use of a knowledge-based energy function and geometrically related neighboring residue characteristics.
[12] Tertiary structure-based prediction of conformational B-cell epitopes through B factors.
[13] Improving B-cell epitope prediction and its application to global antibody-antigen docking.
[14] Antibody Specific B-Cell Epitope Predictions: Leveraging Information From Antibody-Antigen Protein Complexes.
[15] Learning context-aware structural representations to predict antigen and antibody binding interfaces.
[16] Protein-protein interaction site prediction through combining local and global features with deep neural networks.
[17] DeepPSP: A Global-Local Information-Based Deep Neural Network for the Prediction of Protein Phosphorylation Sites.
[18] Attention-based bidirectional long short-term memory networks for relation classification.
[19] An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition.
[20] Biomedical event extraction based on GRU integrating attention mechanism.
[21] Semi-supervised classification with graph convolutional networks.
[22] Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2.
[23] Protein interface prediction using graph convolutional networks.
[24] Leveraging Sequential and Spatial Neighbors Information by Using CNNs Linked With GCNs for Paratope Prediction.
[25] Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks.
[26] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
[27] BLAST: At the core of a powerful and diverse set of sequence analysis tools.
[28] Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features.
[29] Reduced surface: An efficient way to compute molecular surfaces.
[30] PSAIA - Protein structure and interaction analyzer.
[31] Biopython: Freely available Python tools for computational molecular biology and bioinformatics.
[32] Long Short-Term Memory.
[33] TransformerCPI: Improving compound-protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments.
[34] Transformer Protein Language Models Are Unsupervised Structure Learners.
[35] The Relationship Between Precision-Recall and ROC Curves.