key: cord-0104262-8od27nw8 authors: Bai, Yunsheng; Gu, Ken; Sun, Yizhou; Wang, Wei title: Bi-Level Graph Neural Networks for Drug-Drug Interaction Prediction date: 2020-06-11 journal: nan DOI: nan sha: a4d6d3a753ca97e8ba81df06825009e0cedefcb0 doc_id: 104262 cord_uid: 8od27nw8 We introduce Bi-GNN for modeling biological link prediction tasks such as drug-drug interaction (DDI) and protein-protein interaction (PPI). Taking drug-drug interaction as an example, existing methods using machine learning either only utilize the link structure between drugs without using the graph representation of each drug molecule, or only leverage the individual drug compound structures without using graph structure for the higher-level DDI graph. The key idea of our method is to fundamentally view the data as a bi-level graph, where the highest level graph represents the interaction between biological entities (interaction graph), and each biological entity itself is further expanded to its intrinsic graph representation (representation graphs), where the graph is either flat like a drug compound or hierarchical like a protein with amino acid level graph, secondary structure, tertiary structure, etc. Our model not only allows the usage of information from both the high-level interaction graph and the low-level representation graphs, but also offers a baseline for future research opportunities to address the bi-level nature of the data. We introduce BI-GNN (Bi-Level Graph Neural Networks) for modeling biological link prediction tasks such as drug-drug interaction (DDI) (Vilar et al., 2012) and protein-protein interaction (PPI) (Keskin et al., 2008).
Taking drug-drug interaction as an example, existing methods using machine learning either only utilize the link structure between drugs without using the graph representation of each drug molecule (Zitnik et al., 2018; Ma et al., 2018), or only leverage the individual drug compound structures without using graph structure for the higher-level DDI graph (Deac et al., 2019). Readers are referred to a recent survey (Sun et al., 2019) for more details. We demonstrate our model using the drug-drug interaction prediction task as an example. Drug-drug interactions occur when the presence of one drug changes the effect of another drug, producing an observed side effect (Wishart et al., 2018). Specifically, we consider the transductive setting of drug repurposing (Pushpakom et al., 2019), in which predicting interactions between existing drugs along with leveraging interaction data can help infer similar physiological effects for other existing drugs (Zhou et al., 2015). This is especially important with the recent COVID-19 pandemic, as drug repurposing techniques aim to find existing drugs that may be helpful in treating COVID-19 (Hamilton, 2020). Our framework is also extendable to other biological link prediction tasks with different interacting biological entities, e.g. proteins (Fout et al., 2017; Huan et al., 2005). The key idea is to fundamentally view the data as a bi-level graph, in which the highest level is a graph representing the interaction between biological entities (interaction graph), and each biological entity itself is further expanded to its intrinsic graph representation (representation graphs), in which the graph is either flat like a drug compound (Kearnes et al., 2016) or hierarchical like a protein with an amino acid level graph, secondary structure, tertiary structure, etc. (Wikipedia contributors, 2020).
Our BI-GNN not only allows the usage of information from both the high-level interaction graph and the low-level representation graphs, but also offers a baseline for future research opportunities to address the bi-level nature of biological interaction networks.

*Equal contribution. 1 Department of Computer Science, University of California, Los Angeles, USA. Correspondence to: Yunsheng Bai, Ken Gu.

Figure 1. A bi-level graph view of the drug-drug interaction data, in which the number of levels is 2. Existing GNN methods operate only on either the representation graphs or the single interaction graph without utilizing both under the GNN framework. Node colors in the representation graphs denote molecular-level element types. Edge colors in the interaction graph denote drug interaction types.

Definitions. We denote an undirected, unweighted graph $G = (V, E)$ with $N = |V|$ nodes. In addition, we let $G^h = (V^h, E^h)$ denote the higher level interaction network with node set $V^h$ and edge set $E^h$. Moreover, each $e^h_i \in E^h$ can be of different types; let $R$ denote the set of all edge types. Each $v_i \in V^h$ is also a graph, which we denote as $G^l_i = (V_i, E_i)$. The set of all graphs in the lower level is $\{G^l_1, \ldots, G^l_N\}$. Node features for the $i$-th representation graph are summarized in an $N_i \times D_l$ matrix $H^l_i$, where $N_i = |V_i|$. Likewise, node features for the interaction graph are denoted by an $N \times D_h$ matrix $H^h$. Taking DDI as an example, in which the representation graphs are molecular structures of drugs, $H^l_i$ would represent atom features of $G^l_i$ such as the atomic number, whether an atom is aromatic, its hybridization, the number of hydrogen atoms attached to the atom, etc. Currently, we do not consider edge features in the lower level graph, but this can be easily extended in future work.
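As a concrete illustration of these definitions, the bi-level structure can be held in a plain data container: one representation graph per entity, plus typed interaction edges. The sketch below is ours, not the authors' code; the class and field names are hypothetical.

```python
# Minimal sketch of the bi-level graph: a higher-level interaction graph G^h
# whose nodes each carry a lower-level representation graph G^l_i with
# per-node feature vectors H^l_i.
from dataclasses import dataclass, field

@dataclass
class RepresentationGraph:
    """Lower-level graph G^l_i, e.g. a drug's molecular structure."""
    edges: list            # list of (u, v) atom-index pairs
    node_features: list    # H^l_i: one feature vector (length D_l) per atom

@dataclass
class BiLevelGraph:
    """Higher-level interaction graph G^h with typed edges."""
    rep_graphs: dict = field(default_factory=dict)   # entity id -> RepresentationGraph
    typed_edges: list = field(default_factory=list)  # (v_i, v_j, edge_type)

    def num_entities(self):
        return len(self.rep_graphs)

# Toy example: two "drugs", each a 3-atom chain, linked by one typed DDI edge.
g = BiLevelGraph()
g.rep_graphs["drug_a"] = RepresentationGraph([(0, 1), (1, 2)], [[1, 0]] * 3)
g.rep_graphs["drug_b"] = RepresentationGraph([(0, 1), (1, 2)], [[0, 1]] * 3)
g.typed_edges.append(("drug_a", "drug_b", "synergistic"))
```

For proteins, `RepresentationGraph` would itself be replaced by a hierarchy of graphs (amino acid level, secondary structure, etc.).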
Table 1. Architectures of BI-GNN and the baseline models.

Model | Lower Level | Higher Level | Drug Features
MHCADDI (Deac et al., 2019) | {GAT-GMN}^l × 3 | N/A | Molecular Structure
DECAGON-R (Zitnik et al., 2018) | N/A | GAT^h × 3 | Random
DECAGON-FP (Zitnik et al., 2018) | N/A | GAT^h × 3 | Molecular Structure
DECAGON-OH (Zitnik et al., 2018) | N/A | GAT^h × 3 | Learned Transductively
BI-GNN (this paper) | GAT^l × 5 | GAT^h × 3 | Molecular Structure

BI-GNN consists of two stages: 1) the Lower Level Representation Graph Neural Network generates vector representations for each representation graph; 2) the Higher Level Interaction Graph Neural Network further propagates information from the lower level graph embeddings to neighboring nodes in the interaction graph, and provides the final graph representations to a fully connected network to obtain a final link prediction score. In the protein setting, the representation graph may contain more than one level, one at the amino acid level and another at the secondary structure level. In Stage I, we generate the representation graphs' embeddings. In general, these can be obtained through any graph embedding method. For example, one can use node embedding models such as Graph Convolutional Networks (Kipf & Welling, 2017), Graph Attention Networks (Veličković et al., 2018), or Graph Isomorphism Networks (Xu et al., 2019) followed by a READOUT function, a function that takes the embeddings of nodes as input and outputs a single embedding for the graph. Additionally, one can use hierarchical graph representation models such as DIFFPOOL (Ying et al., 2018) or GRAPH-U-NET (Gao & Ji, 2019). In this work, we use Graph Attention Networks (GAT) with multi-scale READOUT of the updated node embeddings. Formally, the $l$-th layer of GAT is defined as:

$x_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} W^{(l)} x_j^{(l)} \Big)$, (1)

$\alpha_{i,j} = \mathrm{softmax}_j\Big( \mathrm{LeakyReLU}\big( {a^{(l)}}^\top \big[ W^{(l)} x_i^{(l)} \,\|\, W^{(l)} x_j^{(l)} \big] \big) \Big)$. (2)

Here, $x_i^{(0)} = \tilde{x}_i$ is the transformed node embedding from the initial feature $x_i$:

$\tilde{x}_i = W_0 x_i$, (3)

where $W_0$ is a learnable input transformation; $\mathcal{N}(i)$ is the set of all first-order neighbors of node $i$ plus node $i$ itself; $W^{(l)} \in \mathbb{R}^{D_l \times D_{l+1}}$ is the weight matrix associated with the $l$-th GAT layer; $\alpha_{i,j}$ is a scalar attention weight that node $i$ gives to node $j$; $a^{(l)} \in \mathbb{R}^{2D_l}$ is the attention weight vector; and $\|$ is vector concatenation.
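The GAT propagation of Eqs. (1)-(2) can be illustrated with a minimal, dependency-free layer. This is an illustrative sketch rather than the authors' implementation (which uses standard GAT components); `gat_layer` and its argument names are ours, and $\sigma$ is taken to be ReLU here.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, x):
    """Multiply a D_out x D_in matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gat_layer(X, neighbors, W, a):
    """One GAT layer per Eqs. (1)-(2). X: node feature vectors;
    neighbors[i]: first-order neighbors of i including i itself;
    W: D_out x D_in weights; a: length-2*D_out attention vector."""
    Z = [matvec(W, x) for x in X]                 # W x_j for every node
    out = []
    for i in range(len(X)):
        # e_ij = LeakyReLU(a^T [W x_i || W x_j]) for every neighbor j
        scores = [leaky_relu(sum(av * zv for av, zv in zip(a, Z[i] + Z[j])))
                  for j in neighbors[i]]
        m = max(scores)                           # softmax over the neighborhood
        exps = [math.exp(s - m) for s in scores]
        alphas = [e / sum(exps) for e in exps]
        # x_i' = sigma(sum_j alpha_ij * W x_j); sigma = ReLU
        h = [max(0.0, sum(al * Z[j][d] for al, j in zip(alphas, neighbors[i])))
             for d in range(len(Z[0]))]
        out.append(h)
    return out

# Toy usage: 3 nodes, 2-dimensional features, a path graph with self-loops.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1], [1, 0, 2], [2, 1]]
W = [[0.5, -0.2], [0.1, 0.3]]
a = [0.1, -0.1, 0.2, 0.05]
out = gat_layer(X, neighbors, W, a)
```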
Multi-Scale READOUT. Inspired by GIN (Xu et al., 2019), we concatenate the node representations across all layers of the GAT. This allows the model to consider all structural information at various levels of granularity. For K GAT layers, the representation graph embedding $x_G$ can be expressed as:

$x_G = \big\Vert_{k=0}^{K} \mathrm{READOUT}\big( \{ x_i^{(k)} \mid i \in V \} \big)$. (4)

In Stage II, we use the graph embeddings from Stage I as initial node features $H^h \in \mathbb{R}^{N \times D_h}$ for the interaction graph. This is motivated by the intuition that the lower level network provides a useful initial representation, which the higher level network further refines for the task. Similar to Stage I, many different node embedding methods may be used to refine the entity representations. We continue to use a different set of GAT layers for the higher level node embedding. Because only the final node representation in the interaction graph is needed, there is no multi-scale READOUT. Additionally, in the case of multiple edge types, the GAT propagation becomes:

$x_i^{(t+1)} = \sigma\Big( \sum_{r \in R} \sum_{j \in \mathcal{N}_r(i)} \alpha^r_{i,j} W_r^{(t)} x_j^{(t)} \Big)$, (5)

where $\mathcal{N}_r(i)$ is the set of neighbors connected to node $i$ by an edge of type $r$, and $W_r^{(t)}$ and $\alpha^r_{i,j}$ are the type-specific weight matrix and attention weights. For link prediction, we concatenate the representations of a pair of entities after T upper level GAT layers and feed them into fully connected layers to predict a link prediction score:

$\hat{y}_{i,j} = \mathrm{MLP}\big( \big[ x_i^{(T)} \,\|\, x_j^{(T)} \big] \big)$. (6)

We choose this decoder over simpler alternatives such as the dot product between drug embeddings because, empirically, we find it yields better performance. For multiclass link prediction (multiple edge types), the fully connected layers output C scores for C classes. We use cross entropy loss with logits as the loss function. We train BI-GNN on two DDI datasets, DRUGBANK (Marinka Zitnik & Leskovec, 2018; Wishart et al., 2018) and DRUGCOMBO (Liu et al., 2020), whose details can be found in the supplementary material. To fairly compare the uses of different levels of information, we aim to use architectures similar to BI-GNN, summarized in Table 1 and detailed in the supplementary material. Representation Graph Models. For models that only use the lower level representation graph, we focus on three variants.
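The multi-scale READOUT and the concatenation-plus-MLP decoder can be sketched in a few lines, assuming the mean aggregator the paper reports using as its READOUT; the function names here are ours, and the "MLP" in the toy usage is a stand-in callable, not a trained network.

```python
# Sketch of Eq. (4) and Eq. (6): mean-pool node embeddings at every GAT layer,
# concatenate across layers, then score a pair of graph embeddings.
def mean_readout(node_embs):
    """Average a list of equal-length node embedding vectors."""
    n = len(node_embs)
    return [sum(x[d] for x in node_embs) / n for d in range(len(node_embs[0]))]

def multi_scale_embedding(per_layer_node_embs):
    """per_layer_node_embs[k] holds all node embeddings after layer k;
    the graph embedding concatenates the readout of every layer (Eq. 4)."""
    x_G = []
    for layer in per_layer_node_embs:
        x_G.extend(mean_readout(layer))
    return x_G

def pair_score(x_i, x_j, mlp):
    """Link score: feed the concatenated pair embedding to a decoder (Eq. 6)."""
    return mlp(x_i + x_j)   # '+' concatenates Python lists

# Toy usage with two layers of node embeddings for a 2-node graph.
layers = [[[1.0, 3.0], [3.0, 1.0]],   # node embeddings after layer 0
          [[0.0, 2.0], [2.0, 0.0]]]   # node embeddings after layer 1
x_G = multi_scale_embedding(layers)   # -> [2.0, 2.0, 1.0, 1.0]
```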
The first, which we call FP-PRED, feeds extended connectivity fingerprint (ECFP) (Rogers & Hahn, 2010) representations directly into the prediction layers. ECFPs are representations of chemical structures which capture relevant molecular features and molecular structure. The second is LL-GNN, which is the lower level model outlined in Section 1.1. The last model is MHCADDI (Deac et al., 2019), which uses intra-graph message passing and inter-graph co-attention. We implement a similar model that uses Graph Matching Networks (GMN) for inter-graph attention. Among models that only use the interaction graph, we consider DECAGON (Zitnik et al., 2018), which uses protein-protein and protein-drug networks on top of the DDI network as the input graph. We adapt DECAGON to the DDI network only. More specifically, we focus on three different initializations of $H^h \in \mathbb{R}^{N \times D_h}$, the input drug features: (1) one-hot initialization, which we call DECAGON-OH; (2) random initialization, denoted DECAGON-R; and (3) ECFP (Rogers & Hahn, 2010) initialization, which we refer to as DECAGON-FP. From Table 2 we observe the effectiveness of different levels of network information. For models that use only the lower level representation graphs, inter-graph communication provides better performance (LL-GNN vs. MHCADDI), and utilizing GNNs allows better learning of graph representations suited to the link prediction task (LL-GNN vs. FP-PRED). However, the DECAGON baselines, which only use the higher level interaction network, significantly outperform the lower level models. Finally, utilizing both levels of information with BI-GNN is better than DECAGON-FP and DECAGON-R, and is competitive with DECAGON-OH. Nevertheless, it must be noted that despite the strong performance of one-hot initialization, it is not scalable to larger interaction networks: because $D_h$ grows linearly with the number of nodes, so does the weight matrix.
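The three initializations of $H^h$ can be contrasted in a few lines. Note that the fingerprint variant below is a toy stand-in: real ECFP features fold circular substructure identifiers and require a cheminformatics toolkit such as RDKit, which is not shown here.

```python
# Sketch of the three higher-level input initializations compared in the text.
import random

def one_hot_features(num_drugs):
    """DECAGON-OH: H^h is the N x N identity, so D_h grows with N."""
    return [[1.0 if i == j else 0.0 for j in range(num_drugs)]
            for i in range(num_drugs)]

def random_features(num_drugs, dim=64, seed=0):
    """DECAGON-R: fixed random vectors of dimension D_h = 64."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_drugs)]

def fingerprint_features(smiles_list, dim=64):
    """DECAGON-FP stand-in: hash character bigrams of each SMILES string
    into a binary vector (real ECFP uses circular substructures instead)."""
    feats = []
    for s in smiles_list:
        v = [0.0] * dim
        for i in range(len(s) - 1):
            v[hash(s[i:i + 2]) % dim] = 1.0
        feats.append(v)
    return feats

H_oh = one_hot_features(3)   # 3 x 3: input dimension tied to the number of drugs
H_r = random_features(3)     # 3 x 64: fixed dimension regardless of N
```

The scalability issue noted above is visible directly: `one_hot_features` produces rows whose length equals the number of drugs, while the other two keep a fixed dimension.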
In addition, as described in Section 2.3, one-hot initialization is unable to generalize representations to unseen nodes or nodes with few neighbors. Since the overall performance shows that BI-GNN is either the best or the second best, we conduct additional evaluation to investigate under what circumstances BI-GNN performs better than all the baseline methods. Specifically, we conjecture that the high performance achieved by DECAGON-OH is due to its ability to learn drug representations in a transductive way. From Equation 3, if the initial feature is one-hot, then after multiplying with the learnable weight matrix W, each drug is essentially represented as a randomly initialized but learnable embedding. On one hand, this inherently prohibits applying DECAGON-OH in the inductive setting, where a drug in the test set can be unseen during training and its embedding is therefore unknown. On the other hand, however, this also allows DECAGON-OH to take advantage of the transductive nature of the drug repurposing task and often achieve the best performance. What if the link structure is not available or very sparse? For DECAGON-OH, a sparse neighborhood would limit its ability to learn each drug's representation, and lower performance would be expected. In contrast, our BI-GNN may still leverage the inherent molecular structure of each drug compound. Based on this hypothesis, we split the drugs into "bins" based on node degree. Figure 2 shows the distribution of the number of nodes across bins, where bins with lower indices correspond to sparse regions of the DDI network. We then evaluate the performance of each model in each bin. As shown in Figure 3, BI-GNN indeed outperforms DECAGON-OH when node degree is low. In practice, drugs may come with different amounts of known interacting drugs, and our framework is especially useful for drugs with little to no interaction information available.
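The binning procedure above can be sketched as follows; the bin edges are illustrative, since the paper does not report its exact bin boundaries.

```python
# Compute each drug's degree in the interaction graph and assign it to a bin,
# so per-bin performance can be compared (lower bins = sparser neighborhoods).
from collections import Counter

def degree_bins(edges, bin_edges=(1, 2, 4, 8, 16)):
    """edges: iterable of (u, v) pairs. Returns {node: bin_index}; bin k holds
    nodes whose degree is < bin_edges[k] (the last bin catches the rest)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    bins = {}
    for node, d in deg.items():
        idx = len(bin_edges)               # default: highest (densest) bin
        for k, edge in enumerate(bin_edges):
            if d < edge:
                idx = k
                break
        bins[node] = idx
    return bins

# Toy DDI network: "a" is a hub (degree 3), "d" is sparse (degree 1).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
bins = degree_bins(edges)
```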
To gain a better understanding of what kinds of representations are useful for the link prediction task, we perform the following visualization procedure. For LL-GNN, DECAGON-OH, and BI-GNN, we take the final drug embeddings before the final prediction layer and project them onto a 2D plane using t-SNE (Maaten & Hinton, 2008). We then use DBSCAN (Ester et al., 1996) to cluster the embeddings and color the drug nodes in the interaction network according to the cluster assignment, resulting in Figure 4. It is clear that DECAGON-OH and BI-GNN learn useful drug embeddings which correlate highly with the link structure of the DDI graph, while the embeddings generated by LL-GNN, which relies only on the molecular structure, do not lead to good link prediction performance in the interaction graph. Based on Sections 2.3 and 2.4, we conclude that message passing across the higher level graph contributes the most to the performance of GNN models on the DDI prediction task. We propose a multi-level GNN framework for biological entity link prediction by constructing a bi-level graph, with the higher level representing the interaction graph between biological entities and the lower levels representing the individual biological entities such as drugs, proteins, etc. In the future, we plan to extend our method to the protein-protein interaction task, where each protein is represented as a hierarchical graph with amino acids, secondary structures, etc. Additionally, to improve the accuracy of BI-GNN, we will explore the integration of graph matching and the introduction of additional input features into our framework. We run experiments on two real-world drug-drug interaction datasets. For each drug, we obtain its SMILES string, from which we can derive the molecular structure. We one-hot encode the lower level representation graph's node features.
These include the atomic number, the number of attached hydrogen atoms, the hybridization, whether it is an acceptor or donor, and whether it is aromatic. DRUGBANK is a dataset of drug-drug interactions from the DRUGBANK database containing detailed drug data and interaction information. We take the pairs mined by (Marinka Zitnik & Leskovec, 2018) and keep the pairs given as DRUGBANK IDs, from which we can find the drugs' molecular structure information. After preprocessing, there are 1309 drug representation graphs and 41072 drug interactions. For training and evaluation, we perform negative sampling on the positive drug pairs. DRUGCOMBO is a database containing drug-drug combinations obtained from various sources, including external databases, manual curation, and experimental results. We take the pairs of drug combinations that have been classified as exhibiting synergistic or antagonistic effects. We cross-reference PubChem, a database of chemical molecules, for the molecular structures. Similar to before, we keep the pairs for which molecular structure information is found. In total, there are 3242 drug representation graphs, 34335 drug pairs classified as having synergistic effects, and 15057 pairs classified as having antagonistic effects. We also treat the different classes as edge features in the interaction graph and use them in our higher level propagation. Similar to DRUGBANK, we perform negative sampling for each known effect. Each lower level and higher level GAT layer has a hidden dimension of 64. The input dimensions for DECAGON-FP and DECAGON-R are 64, while the input dimension for DECAGON-OH is the number of drugs in the DDI network. We use 3 higher level GAT layers for the DECAGON models. For LL-GNN we use 5 lower level GAT layers, while for MHCADDI we use 3 blocks of GAT followed by GMN co-attention. Experiments are run on an Intel i7-6800K CPU and an Nvidia Titan GPU.
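The negative sampling step can be sketched as below; the exact corruption scheme is an assumption, since the paper does not specify it, and the function name is ours.

```python
# For each known (positive) drug pair, draw one corrupted pair that is not a
# known interaction, by replacing the second drug with a random drug.
import random

def sample_negatives(positive_pairs, drugs, seed=0):
    """Return one negative pair per positive pair, rejecting known positives
    and self-pairs."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positive_pairs}
    negatives = []
    for u, _ in positive_pairs:
        while True:
            w = rng.choice(drugs)
            if w != u and frozenset((u, w)) not in known:
                negatives.append((u, w))
                break
    return negatives

# Toy usage: 4 drugs with 2 known interactions.
drugs = ["a", "b", "c", "d"]
pos = [("a", "b"), ("c", "d")]
neg = sample_negatives(pos, drugs)
```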
We split the dataset into training, validation, and testing sets based on the known pairs. For training, for all models except MHCADDI, we use a batch size of 64 graphs and select all adjacent pairs in the training set. For MHCADDI, we use a batch size of 128 pairs, as each graph depends only on its paired graph for its own representation. Additionally, we use the Adam optimizer (Kingma & Ba, 2015) with the learning rate set to 0.001 and the number of iterations set to 10000. We use the mean aggregator as our READOUT function. We evaluate on the validation set every 100 iterations and employ early stopping if performance does not improve over a window of size 15. Finally, all experiments were implemented with the PyTorch and PyTorch Geometric libraries (Fey & Lenssen, 2019).

In this work, we mention the use of inter-graph attention in MHCADDI and with GMNs (Li et al., 2019). For two graphs $G^l_1, G^l_2$ and node $x_i \in G^l_1$, the model described in (Deac et al., 2019) computes the inter-graph representation as

$m_i = \sum_{x_j \in G^l_2} \mathrm{softmax}_j\big( \langle W x_i, W x_j \rangle \big)\, W x_j$.

Here, $\langle \cdot, \cdot \rangle$ is the dot product. In our model, we utilize a similar attention mechanism adopted from Graph Matching Networks (Li et al., 2019). Specifically, we compute the inter-graph message as follows:

$\mu_i = \sum_{x_j \in G^l_2} \mathrm{softmax}_j\big( \langle x_i, x_j \rangle \big)\, (x_i - x_j)$.

In both cases, MHCADDI and GMN update the next-step node representation, $x_i^{(k+1)}$, using a combination of the representations from intra-graph message passing and inter-graph message passing.

In the main text, we compare with three versions of DECAGON (Zitnik et al., 2018), but here we would like to point out that the original work on DECAGON uses additional input features for the drugs as well as a heterogeneous interaction network including both drugs and proteins, whereas our three versions use the same amount of input features as BI-GNN for a fair comparison. In addition, drug-drug similarity has also been explored for DDI prediction in the literature (Ma et al., 2018). In the future, we plan to incorporate this additional information to further improve prediction accuracy.

References
Deac et al., 2019. Drug-drug adverse effect prediction with graph co-attention.
Ester et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.
Fey & Lenssen, 2019. Fast graph representation learning with PyTorch Geometric.
Fout et al., 2017. Protein interface prediction using graph convolutional networks.
Hamilton, 2020. Graph methods for COVID-19 response.
Huan et al., 2005. Comparing graph representations of protein structure for mining family-specific residue-based packing motifs.
Kearnes et al., 2016. Molecular graph convolutions: moving beyond fingerprints.
Keskin et al., 2008. PRISM: protein-protein interaction prediction by structural matching.
Kingma & Ba, 2015. Adam: A method for stochastic optimization.
Kipf & Welling, 2017. Semi-supervised classification with graph convolutional networks.
Li et al., 2019. Graph matching networks for learning the similarity of graph structured objects.
Liu et al., 2020. DrugCombDB: a comprehensive database of drug combinations toward the discovery of combinatorial therapy.
Ma et al., 2018. Drug similarity integration through attentive multi-view graph auto-encoders.
Maaten & Hinton, 2008. Visualizing data using t-SNE.
Pushpakom et al., 2019. Drug repurposing: progress, challenges and recommendations.
Rogers & Hahn, 2010. Extended-connectivity fingerprints.
Sun et al., 2019. Graph convolutional networks for computational drug development and discovery.
Veličković et al., 2018. Graph attention networks.
Vilar et al., 2012. Drug-drug interaction through molecular structure similarity analysis.
Zitnik & Leskovec, 2018. Stanford biomedical network dataset collection.