key: cord-0475450-0x3j7lyg authors: Guo, Jinjiang; Li, Jie; Leng, Dawei; Pan, Lurong title: Heterogeneous Graph based Deep Learning for Biomedical Network Link Prediction date: 2021-01-28 journal: nan DOI: nan sha: 0ccba814d5a4801b3ce410ec04104dd5dd668797 doc_id: 475450 cord_uid: 0x3j7lyg Multi-scale biomedical knowledge networks are expanding with emerging experimental technologies that generates multi-scale biomedical big data. Link prediction is increasingly used especially in bipartite biomedical networks to identify hidden biological interactions and relationshipts between key entities such as compounds, targets, gene and diseases. We propose a Graph Neural Networks (GNN) method, namely Graph Pair based Link Prediction model (GPLP), for predicting biomedical network links simply based on their topological interaction information. In GPLP, 1-hop subgraphs extracted from known network interaction matrix is learnt to predict missing links. To evaluate our method, three heterogeneous biomedical networks were used, i.e. Drug-Target Interaction network (DTI), Compound-Protein Interaction network (CPI) from NIH Tox21, and Compound-Virus Inhibition network (CVI). Our proposed GPLP method significantly outperforms over the state-of-the-art baselines. In addition, different network incompleteness is analysed with our devised protocol, and we also design an effective approach to improve the model robustness towards incomplete networks. Our method demonstrates the potential applications in other biomedical networks. Network science is well applied in both medicine [1] and pharmacology [2] areas to better understand the disease interactome and system-level drug response. In the past decade, increasing multi-scale biomedicine networks [9, 10, 11] have been constructed, connecting multi-level entities such like drugs, targets, activities, diseases, side-effects and omics. In these biomedicine networks, link prediction [3, 4] is one of the important network-based computing and modelling approaches for studying biomedicine relationships, which has great potency in drug repurposing [5] , target identification [6] and biomarker discovery [7] . To our knowledge, researchers usually refer to recommendation algorithms [8] to fulfil such link prediction tasks. There exist two main branches of recommendation algorithms: content-based recommendation methods [12] and collaborative filtering models [13] . Content-based recommendation methods take account of attacker (user) and target (item) side information as their attributes, find optimal projections between attackers and targets, and to predict future behavior of attackers or probable ratings of targets, i.e., interactions between attackers and targets. Collaborative filtering models utilize collective attacker-target links to predict missing/future interactions in the bipartite attacker-target network. In this work, we convert collective attacker-target links from biomedical network into partially observed matrix, and treat the Matrix Completion (MC) task as link prediction on graphs. Attacker-target interactions, such as Drug-Target Interaction (DTI), Compound-Virus Inhibition (CVI), etc., can be represented as bipartite attacker-target network in which edges denote observed links. In our proposed Graph Pair based Link Prediction (GPLP) method, we extract 1-hop subgraph around each node (attacker or target) on the bipartite network. Such 1-hop subgraphs contain local graph pattern which is deterministic for inferring missing links, and independent on attacker and target side information as well [14] . Exhaustively around each attacker and target node pair, we construct their 1-hop subgraphs for GNN model to extract subgraphs latent features as representations of each attacker and target node. Such representations arXiv:2102.01649v4 [cs.SI] 24 Feb 2022 heterogeneous graph based deep learning for biomedical network link prediction are used to infer the potential links via Multi-Layer Perceptron (MLP) classifier (or regressor). Figure 1 illustrates the framework of our method. The main contributions of our work are: • We explored a novel end-to-end GNN method for predicting potential/missing links in biomedical network. • Our method is independent on side information of attacker and target, simply based on their observed connections in the network. • We curated an antiviral compound-phenotype network for researchers to fight against the COVID-19 pandemic. • We invented a protocol to analyse the robustness of network based models towards real biomedical scenarios. This paper is organized as follows: Section 2 introduces the related methods used in complex systems. Section 3 gives specific description of our GPLP framework. Methods comparison results on different biomedical networks are shown in Section 4. Robustness analysis of GPLP and future work are discussed in Section 5. In this section, we briefly introduce theories of graph neural networks (GNN) and matrix completion (MC) methods used in complex systems, such as recommendation systems and biomedical networks. Also, we give a discussion on motivations of using GNN methods to fulfil MC tasks. Matrix Completion (MC) is a kind of task formulation [15] commonly used in recommender systems, which converts heterogeneous network datasets into observed matrix M, and aims to fill the missing entries of M. In the matrix M, the rows and columns represent users (attackers) and items (targets) respectively, each element in M denotes the interaction between corresponding (attacker, target) pair coordinate. To fulfil the MC tasks, well-known Matrix Factorization (MF) methods decompose the low-rank M into the product of two lower dimensionality of rectangular matrices [16, 17] . Later on, Singular Value Decomposition (SVD) is adopted in MF families [18, 19, 20] , which decompose M into the inner product of three matrices X, Y and P (i.e., M = XP Y T ), where X and Y contain the latent features of attackers and targets respectively, P is the association matrix. The idea is to find the projections from X to Y by determining the optimal P from M. SVD based MF methods can integrate latent features of attackers and targets, and therefore partially solve the cold-start problem [18] . Also, SVD basd MF methods successfully solved many biomedical problems, such as Drug-Target Interaction (DTI) prediction [8] , Gene-Disease Association prediction. [21] , etc. In the work [8] , the low-dimensional latent features of drug (attacker) and target is obtained by integrating heterogeneous data, such like drug-drug interaction, drugside-effect, drug-disease association, drug similarities for attacker, and protein-protein interaction, protein-disease association, protein similarities for target. The integrated features describe topological properties for each attacker and target respectively. Similarly, for predicting Gene-Disease Association [21] , microarray measurements of gene expression, gene-phenotype associations of other species and HumanHet [22] features (incorporating mRNA expression, protein-protein interactions, protein complex data, and comparative genomics) serve as latent gene (attacker) features, while disease similarities from MimMiner [23] and features collected from OMIM diseases [24] are used as latent disease (target) features. As we can see, the performance of MF methods is highly dependent upon the integrated latent features as representations of attackers and targets. Usually, the feature construction procedure are manually designed, and separate from the optimization of association matrix P in MF procedure as well. Recently, GNN based methods have demonstrated breakthroughs in many tasks regrading to network datasets [25] , such like protein structure, knowledge graph, physical systems [26, 27] , etc. There are mainly two reasons to use GNN methods in network datasets: (1) most of the complex systems datasets are in the form of graph structure. (2) GNN methods are featured at processing topological connections among nodes, and learning graph representations [25] . Existing GNN methods can be categorized into two types: spectral based methods and spatial based methods [25] . For spectral based GNNs, graphs are projected into Fourier domain where the convolution operation is conducted by computing Laplacian eigendecomposition [28, 29] . Due to the high computational complexity of Laplacian eigendecomposition, the Chebyshev polynomials are adopted as an approximation [28, 29] . Spatial based GNN methods imitate convolutional neural networks (CNN) by aggregating and updating neighborhood message from central node [30, 31, 32, 33, 34, 35] , and construct the whole graph representation through read-out function. General operations of spatial based GNN can be expressed as below: v is the node feature of center node v at kth layer, e uv is the edge feature between v and its neighbor node u. N (v) denotes the neighborhood of node v, usually 1-hop neighbors. Aggregating function A(·) aggregates node features over neighborhood of v, while updating function C(·) integrates features both from center node v and its neighboring features aggregated by A(·). Various mathematical operations have been adopted as A(·) and C(·). For instances, in work [30] , mean-pooling and attention mechanisms are used as A(·), whereas GRU [36] , concatenation [30] and summation are usually applied as C(·). For obtaining whole graph representation, a pooling function, namely read-out, is used at the last Kth layer: (2) h G means the whole representation of input graph G. R(·) represents the read-out function. Spectral based GNN methods require Fourier transform on graph, meaning that all the input graph samples should be static (i.e., fixed topological structure) [29] . In contrast, spatial based GNN methods have no such restriction, and are able to extract features on graphs with varied structures. Hence, spatial based GNNs are suitable for the 1-hop subgraph pairs from bipartite networks in our work. In this part, we introduce our end-to-end Graph Pair based Link Prediction (GPLP) framework for predicting the potential links in biomedical networks. For a given bipartite biomedical network G, we constructed the observed interaction matrix M ⊂ R M ×N , where M is the attacker number and N is the target number, each row index (a ∈ Z M ) and column index (t ∈ Z N ) denote the sequential identical numbers of attackers (A M ) and targets (T N ) respectively, and each entry m a,t in M represents whole possible interaction of (a, t) pair. In our case, m a,t ∈ [0, 1, x], 1 denotes observed positive label meaning active interaction between a and t, whereas 0 is the observed negative label meaning the inactive interaction. Here x means the unobserved (unknown) interaction waiting for prediction. For each attacker a and target t in the bipartite network G, we respectively exact their 1-hop subgraphs G M a→t and G N t→a , in which the edges E M a→t and E N t→a mean the links (active interactions) of neighbouring target nodes (t ∈ G M a→t ) around center node a, and neighbouring attacker nodes (â ∈ G N t→a ) around center node t respectively. Then for each (a, t) , we pair G M a→t and G N t→a as a regrouped graphĜ M ×N a,t , and labelĜ M ×N with (a, t) corresponding m a,t in M. We feed the paired subgraph dataset to the GNN model for training or prediction. Note that after pairing the subgraphs, if m a,t is positive, the (a, t) edge and corresponding nodes should be removed from both G M a→t and G N t→a . It is to make sure that GNN model cannot see any link for prediction in the training subgraphs pair. In Figure 1, steps (a), (b) and (c) show GPLP procedures that convert the bipartite network into GNN training dataset. The GNN backbone we selected is the HAG-Net [35] developed by Global Health Drug Discovery Institute (GHDDI). HAG-Net is a variant of spatial based graph neural networks, which aggregates features by taking both summation and maxima of neighbouring features, and thus enhances the message propagation from shallow layers to deep layers. Specifically, for each node v in graphĜ M ×N a,t , its features h k v out of HAG-Net kth layer is obtained by: where φ(·) is the MLP function, and concat(·) concatenates features of node v from (k − 1)th layer and kth layer, u denotes a neighbouring node in node v's neighbourhood N (v). Then we use a mean pooling read-out function to obtain final representation hĜ a,t of graphĜ M ×N After getting the final representation hĜ a,t of graphĜ M ×N a,t , we apply a 3-layer MLP Φ(·) as functional head to predict interaction (link) probabilitym a,t :m Since the link prediction in our case is a binary classification task, i.e. predicting active or inactive interaction, we adopt cross-entropy [37] as our loss function, which can be described as: where m a,t is the observed interaction, andm a,t is the predicted link probability. For optimization on model parameters, we tried SGD [38] , Adam [39] and Adabelief [40] as optimizers, and find that Adabelief gives the best optimized model with highest evaluation performance and most stable convergence during training and testing. We conducted experiments on three heterogeneous biomedical networks: Drug-Target Interaction (DTI) dataset from DTINet [8] , National Toxicology Program (NIH) Tox21 challenge dataset [41] , and GHDDI constructed Compound-Virus Inhibition (CVI39) Dataset. We trained our GPLP models on each dataset respectively, and tuned each model's parameters with different protocols. In details, we took a 80-20 split on datasets of DTI and CVI39 respectively, where 80% of dataset was used for training and the remaining 20% was used for testing. For model training on NIH Tox21 dataset, we followed the 5-cross validation protocol from the baseline work [42] . We took a random shuffle on each dataset for ensuring each input data sample give an independent change on the model in each training batch. Also, we resmapled the training positive and negative data samples, balancing their number ratio to 1:1, it avoids the model's skewness towards the class with majority number during training. For the hyperparameters configuration, we followed the default settings of HAG-Net and selected Adabelief optimizer with learning rate η = 0.001, ξ = 10 −16 , (β 0 , β 1 ) = (0.9, 0.999) and weight decay λ = 0.0001. The training batch size and epoch number were fixed to 256 and 1,000 for all datasets. This work is publicly available at our website COVID-19: GHDDI Info Sharing Portal: https://ghddi-ailab.github.io/Targeting2019-nCoV We selected the heterogeneous network constructed in the work [8] . The network includes 12,015 nodes and 1,895,445 edges in total, for predicting missing drug-target interactions. It incorporates 4 types of nodes (i.e., drugs, proteins, diseases and side-effects) and 6 types of edges (i.e., drug-protein interactions, drug-drug interactions, drug-disease associations, drug-side-effect associations, protein-disease associations and protein-protein interactions). Since our GPLP framework is independent on node's side information, only the topological information of the network is used for our method. Comparison with other counterparts, our GPLP model constantly outperforms in terms of AUROC and AUPR, reaching up to 95% and 93% respectively. The AUROC and AUPR we computed are the harmonic average between positive and negative performances. Our GPLP model performs nearly 5% higher than DTINet in terms of AUROC, and achieves an approximate AUPR. GPLP also beats other Random Walk based methods (e.g., HNM [11] ), see details in Table 1 . Figure 2 : Different methods comparison on NIH Tox21 dataset. Our GPLP outperforms other state-of-the-art methods for toxicity prediction in terms of AUROC. For NIH Tox21 Challenge Dataset [41] , a dataset with 12,707 chemical compounds, which consisted of a training dataset of 11,764, a leaderboard set of 296, and a test set of 647 compounds. For the training dataset, the chemical structures and assay measurements for 12 different toxic effects were fully available at the beginning of the challenge, so were the chemical structures of the leaderboard set.To fulfil the challenge, the common methods, no matter what kind: descriptor based [46, 47, 48] or deep learning methods [41, 42] , focus on extracting effective representations of chemical compounds under certain toxicity task. In contrast, our prospective takes an insight into the interactions (toxic or non-toxic) between compound (attacker) and toxicity task (target), regardless of compound chemical structures and task properties. In the experimental results, we find that our network based GPLP model significantly outperforms against the chemical structure based deep models (e.g., DeepTox [41] , ProtentialNet [42] ), achieving AUROC of 94.2% and AUPR of 89.6%. In addition, GHDDI also designed a chemical structure based GNN model for multi-task prediction, namely STID-Net [49] , which can compete with other chemical structure based deep methods. Hence, we added performance of STID-Net into comparison (see Figure 2 ). During COVID-19 pandemic broke-out in the year 2020, GHDDI constructed an antiviral compound-phenotype network, which collected 9,193 drugs (attacker), 39 5 Discussion and Future Work As in practical biomedical problems, usually the observed links in the network are rather limited and incomplete, and thus it causes a sparse observed matrix [8] . It implies that the observed links only cover a very small portion of real biomedical network (i.e., the observed matrix is of very low-rank ). This fact results in a problem that local heterogeneous graph based deep learning for biomedical network link prediction graph pattern leant from training data cannot fully reveal the reality. Unfortunately, the network that covers the whole connections can never be obtained in real world as our experimental dataset. To analyse our GPLP model's robustness towards real scenarios, we assume that each network we have covers the whole connections, and we randomly knock-out connections in the network to simulate the partially observed network in reality (see Figure 3 ). Then, we trained GPLP model on these partial network datasets. To implement the knocking-out protocol, we firstly convert the whole network into subgraph pairs G M a→t and G N t→a , and then randomly knock-out the edges of G M a→t and G N t→a respectively, obtaining G M a→t and G N t→a . However, in the biomedical dataset, the degree distribution of all the G M a→t (or G N t→a ) usually appears as long-tail. For instance, Figure 4 shows G M a→t degree distribution of DTI dataset, where horizontal axis denotes the attacker (drug) ID and vertical axis means the degree of G M a→t . As we can see, most of attackers have 1 or even 0 interaction with any target. In order to equally knock each subgraph, we invented a knocking-out method by random sampling a knocking-out portion according to the mixture distributions of G M a→t edges, which is conducted as: where E M a→t denotes the edges after knocking-out on original edges E M a→t of G M a→t . ρ ω is a probabilistic function which determines the knocking-out portion of E M a→t , sign means the random edge removal according to ρ ω , Ω is the degree of G M a→t , while C ω Ω means the ω-combinations over Ω. The knocking-out operation on G N t→a follows the same way as G M a→t . As was expected, the performances of GPLP models trained on knocked-out datasets declined w.r.t. AUROC and AUPR, since the topological information was constantly lost with varying degrees during training. Specifically, with this protocol, GPLP achieved AUPR of 82% (10% decreased ) and AUROC 96% (approximate the same) on DTI dataset. We find that the precision of positive samples was constantly low (nearly 45%) while their recall maintained a high level (about 80%), it is probably because the knocking-out operates on the links that are actually positive interactions. Whereas, on NIH Tox21 dataset, the GPLP model got AUROC of 88% ( 6% decreased) and AUPR of 82%( droping 8%). The low precision (nearly 60%) and decreasing recall (convergent to 40%) of positive samples appeared during training on NIH Tox21 dataset. Even though the performance of GPLP models modestly dropped on partially knocked-out networks, our method still gives promising results, and demonstrates the robustness in different real world scenarios. In this work, we have introduced our Graph Pair based Link Prediction (GPLP) framework for predicting potential/unobserved links in heterogeneous biomedical networks. GPLP model infers the linking probability between attacker and target based on the local graph patterns learnt from connectivity information of the network. The model does not rely on the side information (chemical structures, gene sequences, etc.) of attacker or target. Our method has demonstrated an out-performance compared to state-of-the-art baselines. However, it is well-known that such network based methods suffer from cold-start problem [18] , especially when a totally new attacker or target (e.g., SARS-CoV-2) emerges, meaning that no links between the new node and known nodes in the observed network. Hence, effective and efficient integration of side information into our framework become the key work in future. Nevertheless, GPLP method shows a novel perspective for revealing the underlying interaction mechanisms of complex biomedical systems. Network medicine: a network-based approach to human disease Network pharmacology: the next paradigm in drug discovery Graph embedding on biomedical networks: methods, applications and evaluations Computational prediction of drug-target interactions using chemogenomic approaches: an empirical survey Artificial intelligence in covid-19 drug repurposing. The Lancet Digital Health Architecture of the human interactome defines protein communities and disease networks Networkbased identification of micrornas as potential pharmacogenomic biomarkers for anticancer drugs A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information Drug target identification using side-effect similarity Autodock4 and autodocktools4: Automated docking with selective receptor flexibility Drug repositioning by integrating target information through a heterogeneous network model Content-based recommendation systems Using collaborative filtering to weave an information tapestry Inductive matrix completion based on graph neural networks Exact matrix completion via convex optimization Regression-based latent factor models What recommenders recommend-an analysis of accuracy, popularity, and sales diversity effects Evaluating recommender behavior for new users A group-specific recommender system Understanding and improving relational matrix factorization in recommender systems Inductive matrix completion for predicting gene-disease associations Prioritizing candidate disease genes by network-based boosting of genome-wide association data A text-mining analysis of the human phenome Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders Graph neural networks in recommender systems: A survey Interaction networks for learning about objects, relations and physics Graph networks as learnable physics engines for inference and control Convolutional neural networks on graphs with fast localized spectral filtering Semi-supervised classification with graph convolutional networks Inductive representation learning on large graphs Strategies for pre-training graph neural networks Weisfeiler and leman go neural: Higher-order graph neural networks Asap: Adaptive structure aware pooling for learning hierarchical graph representations Enhance information propagation for graph neural network by heterogeneous aggregations Gated graph sequence neural networks Machine learning: a probabilistic perspective Convergence and efficiency of subgradient methods for quasiconvex minimization Adam: A method for stochastic optimization Adabelief optimizer: Adapting stepsizes by the belief in observed gradients Deeptox: toxicity prediction using deep learning Potentialnet for molecular property prediction Drug-target interaction prediction by learning from local information and neighbors Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces Collaborative matrix factorization with multiple similarities for predicting drug-target interactions Kernel k-nearest neighbor algorithm as a flexible sar modeling tool Support vector machines: development of qsar models for predicting anti-hiv-1 activity of tibo derivatives A new qsar model, for angiotensin i-converting enzyme inhibitory oligopeptides A deep learning method incorporating ligand structure and task id for multi-task prediction