PanRep: Universal node embeddings for heterogeneous graphs
Vassilis N. Ioannidis, Da Zheng, George Karypis
2020-07-20

Abstract. Learning unsupervised node embeddings facilitates several downstream tasks such as node classification and link prediction. A node embedding is universal if it is designed to be used by and benefit various downstream tasks. This work introduces PanRep, a graph neural network (GNN) model for unsupervised learning of universal node representations for heterogeneous graphs. PanRep consists of a GNN encoder that obtains node embeddings and four decoders, each capturing different topological and node feature properties. By modeling these properties, the novel unsupervised framework learns universal embeddings applicable to different downstream tasks. PanRep can be further fine-tuned to account for possibly limited labels. In this operational setting, PanRep can be considered a pretrained model for extracting node embeddings of heterogeneous graph data. PanRep outperforms all unsupervised and certain supervised methods in node classification and link prediction, especially when the labeled data for the supervised methods is small. PanRep-FT (with fine-tuning) outperforms all other supervised approaches, which corroborates the merits of pretraining models. Finally, we apply PanRep-FT to discovering novel drugs for Covid-19. We showcase the advantage of universal embeddings in drug repurposing and identify several drugs used in clinical trials as possible drug candidates.

Learning node representations from heterogeneous graph data powers the success of many downstream machine learning tasks such as node classification [29] and link prediction [47]. Graph neural networks (GNNs) learn node embeddings by applying a sequence of nonlinear operations parametrized by the graph adjacency matrix and achieve state-of-the-art performance in the aforementioned downstream tasks. The era of big data provides an opportunity for machine learning methods to harness large datasets [49]. Nevertheless, the labels in these datasets are typically scarce, due either to lack of information or to increased labeling costs [9]. The lack of labeled data hinders the performance of supervised algorithms, which may not generalize well to unseen data, and motivates unsupervised learning.

Unsupervised node embeddings may be used for downstream learning tasks, while the specific tasks are typically not known a priori. For example, node representations of the Amazon book graph can be employed for recommending new books as well as for classifying a book's genre. This work aspires to provide universal node embeddings that can be applied to multiple downstream tasks and achieve performance comparable to their supervised counterparts. Although unsupervised learning has numerous applications, limited labels for the downstream task may be available. Refining the unsupervised universal representations with these labels could further increase the representation power of the embeddings. This can be achieved by fine-tuning the unsupervised model. Natural language processing methods have achieved state-of-the-art performance by applying such a fine-tuning framework [12].
Fine-tuning pretrained models is beneficial compared to end-to-end supervised learning, since the former typically generalizes better, especially when labeled data are limited [16], and decreases the inference time, since a few fine-tuning iterations typically suffice for the model to converge [12].

This work introduces a framework for unsupervised learning of universal node representations on heterogeneous graphs termed PanRep. It consists of a GNN encoder that maps the heterogeneous graph data to node embeddings and four decoders, each capturing different topological and node feature properties. The cluster and recover (CR) decoder exploits a clustering prior on the node attributes. The motif (Mot) decoder captures structural node properties that are encoded in the network motifs. The metapath random walk (MRW) decoder promotes embedding similarity among nodes participating in a MRW and hence captures intermediate neighborhood structure. Finally, the heterogeneous information maximization (HIM) decoder aims at maximizing the mutual information between local and global node representations per node type. These decoders model general properties of the graph data related to node homophily [19, 30] or node structural similarity [37, 15]. PanRep is solely supervised by the decoders and has no knowledge of the labels of the downstream task. The universal embeddings learned by PanRep are employed as features by models such as SVM [42] or DistMult [52] that are trained for the downstream tasks. To further accommodate the case where limited labels are available for some downstream tasks, we propose fine-tuning PanRep (PanRep-FT). In this operational setting, PanRep-FT is optimized according to a task-specific loss, and PanRep can be considered a pretrained model for extracting node embeddings of heterogeneous graph data. Figure 1 illustrates the two novel models. The contributions of this work are as follows.

C1. We introduce a novel problem formulation of universal unsupervised learning and design a tailored learning framework termed PanRep. We identify the following general properties of the heterogeneous graph data: (i) the clustering of local node features, (ii) structural similarity among nodes, (iii) the local and intermediate neighborhood structure, and (iv) the mutual information among same-type nodes. We develop four novel decoders to model the aforementioned properties.

C2. We adjust the unsupervised universal learning framework to account for possibly limited labels of the downstream task. PanRep-FT refines the universal embeddings and increases the model's generalization capability.

C3. We compare the proposed models to state-of-the-art supervised and unsupervised methods for node classification and link prediction. PanRep outperforms all unsupervised and certain supervised methods in node classification, especially when the labeled data for the supervised methods is small. PanRep-FT outperforms even supervised approaches in node classification and link prediction, which corroborates the merits of pretraining models. Finally, we apply our method on the drug-repurposing knowledge graph to discover drugs for Covid-19 and identify several drugs used in clinical trials as possible drug candidates.

Unsupervised learning. Representation learning amounts to mapping nodes into an embedding space where the graph topological information and structure are preserved [22]. Typically, representation learning methods follow the encoder-decoder framework advocated by PanRep.
Nevertheless, the decoder is typically attuned to a single task based on, e.g., matrix factorization [43, 4, 8, 10, 33], random walks [21, 34], or kernels on graphs [41]. Recently, methods relying on GNNs have become increasingly popular for representation learning tasks [50]. GNNs typically rely on random walk-based objectives [21, 22] or on maximizing the mutual information among node representations [45]. Relational GNN methods extend representation learning to heterogeneous graphs [14, 40, 39]. Relative to these contemporary works, PanRep introduces multiple decoders to learn universal embeddings for heterogeneous graph data capturing the clustering of local node features, structural similarity among nodes, the local and intermediate neighborhood structure, and the mutual information among same-type nodes.

Supervised learning. Node classification is typically formulated as a semi-supervised learning (SSL) task over graphs, where the labels for a subset of nodes are available for training [7]. GNNs achieve state-of-the-art performance in SSL by utilizing regular graph convolution [29] or graph attention [44], and these models have been extended to heterogeneous graphs [38, 17, 48]. Similarly, another prominent supervised downstream learning task is link prediction, with numerous applications in recommendation systems [47] and drug discovery [56, 26]. Knowledge-graph (KG) embedding models rely on mapping the nodes and edges of the KG to a vector space by maximizing a score function for existing KG edges [47, 52, 55]. RGCN models [38] have been successful in link prediction and, contrary to KG embedding models, can further utilize node features. The universal embeddings extracted from PanRep without labeled supervision offer a strong competitor to these supervised approaches for both node classification and link prediction tasks.

Pretraining. Pretraining models provides a significant performance boost compared to traditional approaches in natural language processing [12, 35, 36, 32] and computer vision [13, 18]. Pretraining offers increased generalization capability, especially when the labeled data is scarce, and increased inference speed relative to end-to-end training [12]. Recently, [25] introduced a framework for pretraining GNNs for graph classification. Different from [25], which focuses on graph representations, PanRep aims at node prediction tasks and obtains node representations by capturing properties related to node homophily [19, 30] or node structural similarity [37]. PanRep is a novel pretrained model for node classification and link prediction that requires significantly fewer labeled points to reach the performance of its fully supervised counterparts.

A heterogeneous graph with $T$ node types and $R$ relation types is defined as $\mathcal{G} := \{\{\mathcal{V}_t\}_{t=1}^{T}, \{\mathcal{E}_r\}_{r=1}^{R}\}$. The node types represent the different entities and the relation types represent how these entities are semantically associated with each other. For example, in the IMDB network, the node types correspond to actors, directors, movies, etc., whereas the relation types correspond to directed-by and played-in relations. The number of nodes of type $t$ is denoted by $N_t$ and its associated nodal set by $\mathcal{V}_t := \{n_t\}_{n=1}^{N_t}$. The total number of nodes in $\mathcal{G}$ is $N := \sum_{t=1}^{T} N_t$. The $r$th relation type, $\mathcal{E}_r := \{(n_t, n_{t'}) \in \mathcal{V}_t \times \mathcal{V}_{t'}\}$, holds all interactions of a certain type between $\mathcal{V}_t$ and $\mathcal{V}_{t'}$ and may represent, for example, that a movie is directed-by a director. Heterogeneous graphs are typically used to represent knowledge graphs [47].
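For concreteness, the following is a minimal sketch of how such a heterogeneous graph could be instantiated in DGL, the library used in the paper's implementation; the node types, relation types, and toy edge indices below are illustrative, not the actual IMDB data.

```python
import torch
import dgl

# Toy heterogeneous graph in the spirit of the IMDB example: each
# (src-type, relation, dst-type) triple defines one edge set E_r via
# parallel source/destination index tensors.
g = dgl.heterograph({
    ("movie", "directed-by", "director"): (torch.tensor([0, 1, 2]), torch.tensor([0, 0, 1])),
    ("actor", "played-in", "movie"): (torch.tensor([0, 1, 1]), torch.tensor([0, 1, 2])),
})
print(g.ntypes)                  # the node types V_t
print(g.num_nodes("movie"))      # N_t for t = movie
print(g.num_edges("played-in"))  # |E_r| for r = played-in
```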
Each node $n_t$ is also associated with an $F \times 1$ feature vector $\mathbf{x}_{n_t}$. This feature may be, for example, a natural language embedding of the title of a movie. The nodal features are collected in an $N \times F$ matrix $\mathbf{X}$. Note that certain node types may not have features, and for these we use an embedding layer to represent their features.

Unsupervised learning. Given $\mathcal{G}$ and $\mathbf{X}$, the goal of representation learning is to estimate a function $g$ such that $\mathbf{H} := g(\mathbf{X}, \mathcal{G})$, where $\mathbf{H} \in \mathbb{R}^{N \times D}$ represents the node embeddings and $D$ is the size of the embedding space. Note that in estimating $g$, no labeled information is available.

Universal representation learning. The universal representations $\mathbf{H}$ should perform well on different downstream tasks. Different node classification and link prediction tasks may arise by considering different numbers of training nodes and links and different label types, e.g., occupation labels or education level labels. Considering $I$ downstream tasks, for the universal representations $\mathbf{H}$ it should hold that
$$\mathcal{L}^{(i)}\big(f^{(i)}(\mathbf{H}); \mathcal{T}^{(i)}\big) \le \epsilon, \quad i = 1, \dots, I, \tag{1}$$
where $\mathcal{L}^{(i)}$, $f^{(i)}$, and $\mathcal{T}^{(i)}$ represent the loss function, learned classifier, and training set (node labels or links) for task $i$, respectively, and $\epsilon$ is the largest error across all tasks. The goal of unsupervised universal representation learning is to learn $\mathbf{H}$ such that $\epsilon$ is small. While learning $\mathbf{H}$, PanRep does not have knowledge of $\{\mathcal{L}^{(i)}, f^{(i)}, \mathcal{T}^{(i)}\}_i$. Nevertheless, by utilizing the novel decoder scheme, PanRep achieves superior performance across tasks, even compared to supervised approaches.

Our universal representation learning framework aims at embedding nodes in a low-dimensional space such that the representations are discriminative for node classification and link prediction. Methods for learning over graphs typically rely on modeling homophily of nodes, which postulates that neighboring vertices have similar attributes [41, 53, 19, 30], or on structural similarity among nodes [37], where vertices involved in similar graph structural patterns possess related attributes [15]. Motivated by these methods, we identify related properties encoded in the graph data. Clustering nodes based on their attributes provides a strong signal for node homophily [31]. Network motifs reveal the local structure information for nodes in the graph [5]. Metapaths encode the heterogeneous graph neighborhood and indicate the local connectivity [14]. Finally, maximizing the mutual information among embeddings declusters node representations and provides further discriminative information [45]. In order to capture the aforementioned properties, we develop four novel universal decoders.

Cluster and recover supervision. Node attributes may reveal interesting properties of nodes, such as clusters of customers based on their buying power and age. This is important in recommendation systems, where traditional matrix factorization approaches [31] rely on revealing clusters of similar buyers. To capitalize on such information, we propose to supervise the universal embeddings by such cluster representations. Specifically, we cluster the node attributes via K-means [27] and then design a model that decodes $\mathbf{H}$ to recover the original clusters. The CR-decoder is modeled as a two-layer MLP and is supervised by the cross-entropy loss
$$\mathcal{L}_{\text{CR}} := -\sum_{n=1}^{N} \sum_{k=1}^{K} C_{nk} \log \hat{C}_{nk}, \tag{2}$$
where the cluster assignment $C_{nk}$ is 1 if node $n$ belongs to cluster $k$ and 0 otherwise, and the predicted cluster assignment $\hat{C}_{nk}$ is the output of the CR-decoder. Such a supervision signal enriches the universal embeddings $\mathbf{H}$ with information based on the clustering of local node features.
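A minimal sketch of the CR supervision, assuming K-means from scikit-learn provides the cluster targets and a two-layer MLP decodes the embeddings; the shapes and decoder widths are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

N, F, D, K = 1000, 64, 300, 10             # nodes, feature dim, embedding dim, clusters
X = torch.randn(N, F)                      # stand-in node attributes
H = torch.randn(N, D, requires_grad=True)  # stand-in universal embeddings from the encoder

# Cluster the node attributes once; the cluster ids become classification targets.
labels = torch.as_tensor(KMeans(n_clusters=K, n_init=10).fit_predict(X.numpy())).long()

# Two-layer MLP CR-decoder; cross-entropy recovers the assignments, as in eq. (2).
cr_decoder = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, K))
loss_cr = nn.functional.cross_entropy(cr_decoder(H), labels)
loss_cr.backward()                         # gradients reach the encoder through H
```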
Motif supervision. Network motifs are sub-graphs in which the nodes have specific connectivity patterns. Typical size-3 motifs, for example, are the triangle and the star. Each of these sub-graphs is identified by a particular pattern of interactions among nodes and reveals important properties for the participating nodes. In gene regulatory networks, for example, motifs are associated with certain biological properties [6]. The work in [5] develops efficient parallel implementations for extracting network motifs. We aspire to capture structural similarity among nodes by predicting their motif information. The motivation is that nodes that may be distant in the graph can have similar structural properties as described by their motifs. Using the method in [5], we extract a frequency vector $\boldsymbol{\mu}_n$ per node that shows how many times $n$ participates in graph motifs of size up to 4. This information reveals the structural role of nodes, such as star-center, star-edge, or bridge nodes [37, 23]. The motif decoder predicts this vector for all nodes using the universal representation $\mathbf{H}$. This allows for information sharing among nodes that are far away in the graph but have similar motif frequency vectors. The novel motif decoder is modeled as a two-layer MLP and is supervised by the loss function
$$\mathcal{L}_{\text{Mot}} := \sum_{n=1}^{N} \big\| \boldsymbol{\mu}_n - \hat{\boldsymbol{\mu}}_n \big\|_2^2, \tag{3}$$
where $\hat{\boldsymbol{\mu}}_n$ is the output of the Mot-decoder for the $n$th node. Using the Mot-decoder, PanRep enhances the universal embeddings with structural information encoded in the node motifs.

Metapath RW supervision. Metapaths are sequences of edges of possibly different types that connect nodes in a KG [14]. A metapath random walk (MRW) is a specialized RW that follows different edge types; see, e.g., [14]. We aspire to capture local connectivity patterns by promoting nodes participating in a MRW to have similar embeddings. Considering all node pairs $(n_t, n_{t'})$ participating in a MRW, the following criterion maximizes the similarity among these nodes:
$$\mathcal{L}_{\text{MRW}} := -\sum_{(n_t, n_{t'})} \log \sigma\big( y\, \mathbf{h}_{n_t}^{\top} \operatorname{diag}(\mathbf{r}_{t,t'})\, \mathbf{h}_{n_{t'}} \big), \tag{4}$$
where $\mathbf{h}_{n_t}$ and $\mathbf{h}_{n_{t'}}$ are the universal embeddings for nodes $n_t$ and $n_{t'}$, respectively, $\mathbf{r}_{t,t'}$ is an embedding parametrized on the pair of node types, and $y$ is 1 if $n_t$ and $n_{t'}$ co-occur in the MRW and -1 otherwise. Negative examples are generated by randomly selecting tail nodes for a fixed head node, at a ratio of 5 negatives per positive example. Link prediction is indeed a special case of MRW supervision that considers MRWs of length 1. However, metapaths convey more information than regular links, since the former can be defined to promote certain prior knowledge. For example, in predicting the movie genre in IMDB, the metapath configured by the edge types (played-by, played-in) among node types (movie, actor, movie) will potentially connect movies with the same genre and hence is desirable. The embedding per node-type pair $\mathbf{r}_{t,t'}$ allows the MRW-decoder to weight the similarity among node embeddings from different node types accordingly. The length of the MRW controls the radius of the graph neighborhood considered in equation (4), and it can vary from local to intermediate.
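A sketch of the MRW criterion in equation (4), under our reading of the score as a DistMult-style bilinear form with the per-type-pair vector $\mathbf{r}_{t,t'}$; the batch of pairs, the negative ratio, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

D = 300
h_head = torch.randn(6, D, requires_grad=True)   # embeddings h_{n_t} of head nodes
h_tail = torch.randn(6, D, requires_grad=True)   # embeddings h_{n_t'} of tail nodes
r_tt = torch.randn(D, requires_grad=True)        # relation embedding r_{t,t'} for this type pair
y = torch.tensor([1., -1., -1., -1., -1., -1.])  # one positive, five sampled negatives

# score = h_head^T diag(r_tt) h_tail per pair; softplus(-y*s) equals -log sigma(y*s).
scores = (h_head * r_tt * h_tail).sum(dim=1)
loss_mrw = F.softplus(-y * scores).mean()
loss_mrw.backward()
```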
The aforementioned supervision signals capture clustering affinity, structural similarity, and the local and intermediate neighborhood of the nodes. Nevertheless, further information can be extracted for the representations by maximizing the mutual information among node representations. Such an approach for homogeneous graphs is detailed in [45], where the mutual information between node representations and global graph summaries is maximized [24]. Towards further refining the universal embeddings, we propose an adaptation of [45] for heterogeneous graphs. We consider a global summary vector per node type $t$, $\mathbf{s}_t := \frac{1}{N_t} \sum_{n_t=1}^{N_t} \mathbf{h}_{n_t}$, which captures the average node representation of type $t$. We aspire to maximize the mutual information between $\mathbf{s}_t$ and the corresponding nodes in $\mathcal{V}_t$. The proposed HIM decoder is supervised by the contrastive loss function
$$\mathcal{L}_{\text{HIM}} := -\sum_{t=1}^{T} \sum_{n_t=1}^{N_t} \Big[ \log \sigma\big(\mathbf{h}_{n_t}^{\top} \mathbf{W} \mathbf{s}_t\big) + \log\big(1 - \sigma\big(\tilde{\mathbf{h}}_{n_t}^{\top} \mathbf{W} \mathbf{s}_t\big)\big) \Big], \tag{5}$$
where the bilinear scoring function [52] $\sigma(\mathbf{h}_{n_t}^{\top} \mathbf{W} \mathbf{s}_t)$ captures how close $\mathbf{h}_{n_t}$ is to the global summary, $\mathbf{W}$ is a learnable matrix, and $\tilde{\mathbf{h}}_{n_t}$ represents the negative example used to facilitate training. Designing negative examples is a cornerstone of training contrastive models [45]. We generate the negative examples in (5) by shuffling node attributes among nodes of the same type. The HIM decoder maximizes the mutual information across nodes and complements the former decoders.

Putting everything together. PanRep's overall loss function is the linear unweighted combination of (2)-(5) and can be considered in the framework of deep multitask learning [54], since the GNN encoder is shared across the multiple supervision tasks and PanRep makes multiple inferences in one forward pass. Such networks are not only scalable, but the shared features within these networks can also induce more robust regularization and possibly boost performance [11]. A future direction for PanRep is to introduce adaptive weights per decoder to control its learning rate [11].

Although the PanRep framework can utilize any GNN model as an encoder [50], in this paper PanRep uses a relational (R)GCN encoder [38]. RGCNs extend the graph convolution operation [29] to heterogeneous graphs. An RGCN model is composed of a sequence of RGCN layers. The $l$th layer computes the $n$th node representation $\mathbf{h}_n^{(l+1)}$ as
$$\mathbf{h}_n^{(l+1)} := \sigma\Big( \sum_{r=1}^{R} \sum_{n' \in \mathcal{N}_n^r} \frac{1}{|\mathcal{N}_n^r|}\, \mathbf{W}_r^{(l)} \mathbf{h}_{n'}^{(l)} \Big), \tag{6}$$
where $\mathcal{N}_n^r$ is the neighborhood of node $n$ under relation $r$, $\sigma$ is the rectified linear unit nonlinearity, and $\mathbf{W}_r^{(l)}$ is a learnable matrix associated with the $r$th relation. Essentially, the output of the RGCN layer for node $n$ is a nonlinear combination of the hidden representations of neighboring nodes, weighted based on the relation type. Several works consider designing possibly more general GNN encoders that utilize attention mechanisms [48, 44] or graph isomorphism networks [51]. This paper proposes novel supervision signals for unsupervised learning that capture general properties of the graph data. Designing a universal encoder based on these contemporary GNN models is an interesting future direction for PanRep.

In certain cases, a very small subset of labels may be known a priori for the downstream task. In such cases, it is beneficial to fine-tune PanRep's model to obtain refined node representations. In this context, PanRep can be considered a pretrained model, and a downstream task-specific loss may be applied to supervise it. BERT models in natural language processing have reported state-of-the-art results by considering such a pretrain-and-fine-tune framework [12]. PanRep-FT can be considered a counterpart of BERT for extracting information from heterogeneous graph data. PanRep-FT combines the benefits of universal unsupervised learning and task-specific information and achieves greater generalization capacity, especially when labeled data are scarce [16].
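To make the pretrain-then-fine-tune workflow concrete, here is a hedged sketch in PyTorch; `encoder`, the `decoders` list with a `loss(H)` method, and `task_head` are hypothetical interfaces standing in for the actual PanRep modules, not the authors' code.

```python
import torch
import torch.nn.functional as F

def pretrain_step(encoder, decoders, graph, feats, optimizer):
    """One PanRep step: unweighted sum of the decoder losses (2)-(5), shared encoder."""
    optimizer.zero_grad()
    H = encoder(graph, feats)                    # one forward pass of the RGCN encoder
    loss = sum(dec.loss(H) for dec in decoders)  # CR + Mot + MRW + HIM
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(encoder, task_head, graph, feats, labels, mask, optimizer):
    """PanRep-FT: refine the pretrained encoder with a task-specific loss."""
    optimizer.zero_grad()
    H = encoder(graph, feats)
    loss = F.cross_entropy(task_head(H[mask]), labels[mask])  # e.g., node classification
    loss.backward()
    optimizer.step()
    return loss.item()
```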
The proposed universal representation learning techniques are compared with state-of-the-art methods. The models are developed in the efficient deep graph learning library [46]. For node classification, the following contemporary methods are considered: RGCN [38], HAN [48], MAGNN [17], node2vec [21], meta2vec [14], and an adaptation of the work in [45] for heterogeneous graphs, termed HIM. For link prediction, the baseline model is RGCN [38] with DistMult supervision [52], which uses the same encoder as PanRep. The Mot-decoder and CR-decoder employ a 2-layer MLP. For the MRW-decoder we use length-2 MRWs. The parameters for all methods considered are optimized using the performance on the validation set.

Datasets. We consider a subset of the IMDB dataset [1] containing 11,616 nodes of 3 node types and 17,106 edges of 6 edge types. Each movie is associated with a label representing its genre and with a feature vector capturing its keywords. We also use a subset of the OAG dataset [2] with 23,696 nodes of 4 node types (authors, affiliations, papers, venues) and 90,183 edges of 14 edge types. In OAG we did not use Mot supervision, since [5] is not applicable. Each paper is associated with a label denoting its scientific area and with an embedding of the paper's text. Finally, we utilize the drug-repurposing knowledge graph (DRKG) constructed in [26]. DRKG contains 5,874,261 biological interactions belonging to 107 edge types among 97,238 biological entities of 13 entity types. For further details on the datasets and the configuration of the methods see the supplementary material.

We split the labeled nodes into 10% training, 5% validation, and 85% testing sets. In this experiment we compare supervised and unsupervised methods for classification. First, the methods learn embeddings for the labeled nodes with or without labeled supervision. We then obtain the embeddings corresponding to the 85% testing nodes as calculated by the unsupervised and supervised methods, further split these nodes into training and testing sets, and train a linear SVM. This evaluation setting allows us to directly compare the different supervised and unsupervised approaches.
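A sketch of this evaluation protocol with scikit-learn, assuming the frozen embeddings of the held-out nodes are already computed; the arrays and split fractions below are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

H_test = np.random.randn(600, 300)          # stand-in embeddings of the 85% test nodes
y_test = np.random.randint(0, 3, size=600)  # stand-in class labels

for frac in (0.1, 0.3, 0.5, 0.7):           # illustrative SVM training percentages
    Xtr, Xte, ytr, yte = train_test_split(
        H_test, y_test, train_size=frac, stratify=y_test, random_state=0)
    pred = LinearSVC().fit(Xtr, ytr).predict(Xte)
    print(frac, f1_score(yte, pred, average="macro"), f1_score(yte, pred, average="micro"))
```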
We report the Macro and Micro F1 accuracy for different training percentages of the 85% nodes fed to the SVM classifier in Table 1. It is observed that PanRep significantly outperforms the other unsupervised approaches as well as some supervised approaches. In certain splits, PanRep outperforms its supervised counterpart RGCN, which uses node labels for supervision. Metapath2vec [14] reports competitive performance on OAG in Macro-F1 score but underperforms in Micro-F1. This result is also in line with Table 3, where the strongest signal for PanRep is given by the MRW decoder. PanRep-FT significantly outperforms RGCN, which uses the same encoder; this is a testament to the power of pretraining models. Finally, PanRep-FT matches, and in certain splits outperforms, the state-of-the-art MAGNN, which uses a more expressive encoder. PanRep's universal decoders enhance the embeddings with additional discriminative power, which results in improved performance on the downstream tasks.

Table 2 reports the accuracy of PanRep-FT and of the encoder RGCN trained only for the semi-supervised learning task, for different splits of the training labels used to obtain the encodings. PanRep outperforms the supervised RGCN embeddings for 5% training labels. PanRep-FT consistently outperforms RGCN across most SVM splits. This demonstrates the importance of using PanRep as a pretraining method. RGCN reports similar performance across SVM splits, whereas PanRep-FT's performance increases with more supervision. These results suggest that PanRep-FT's embeddings have higher generalization capacity.

Table 3 reports an ablation study using different decoder subsets. PanRep, which uses all signals, obtains the best performance. The HIM and MRW decoders, and their combination, exhibit the best performance after PanRep. Nevertheless, the full supervision in PanRep leads to the superior performance.

Our universal embedding framework is further evaluated for link prediction using the IMDB and OAG datasets. The MRW decoder is used to evaluate the performance of PanRep in link prediction. Figure 2 reports the MRR and Hit@10 scores of the baseline methods along with the PanRep and PanRep-FT methods. We report the performance of the methods for different percentages of links used for training. Observe that PanRep-FT consistently outperforms the competing methods, and the performance gain increases as the percentage of training links decreases. This corroborates the advantage of pretraining GNNs for link prediction. Note that PanRep reports similar performance to RGCN trained solely on link prediction. This result confirms the success of the universal embeddings in link prediction.

Drug repurposing aims at discovering the most effective existing drugs to treat a certain disease. Using the Drug Repurposing Knowledge Graph (DRKG) [26], we compare the drug repurposing results for Covid-19 between PanRep-FT, fine-tuned for link prediction, and the baseline RGCN [38]. We employ $L = 1$ hidden layer with $D = 600$ and train both networks for 800 epochs. Drug repurposing can be formulated as predicting direct links in the DRKG. Here, we attempt to predict whether a drug inhibits a certain gene that is related to the target disease. We identify 442 genes that are related to the Covid-19 disease [20, 56]. We select 8,104 FDA-approved drugs in the DRKG as candidates; see also [26]. To validate our predictions, we use 32 Covid-19 clinical trial drugs from [3]. For each gene node we calculate, with RGCN and PanRep-FT, an inhibit-link score associated with every drug. Next, we score all 'drug-inhibits-gene' triples and rank them per target gene. We obtain in this way 442 ranked lists of drugs, one per gene node. Finally, to assess whether our predictions are on par with the drugs used for treatment, we check the overlap between the top-100 predicted drugs and the drugs used in clinical trials per gene.

Table 4 lists the clinical drugs included in the top-100 predicted drugs across all the genes, with their corresponding number of hits for RGCN and PanRep-FT. We retain the top-5 drugs based on their number of hits for each method; the method labels of the four columns were lost in extraction:
Losartan 232 | Thalidomide 41 | Chloroquine 69 | Tofacitinib 33
Chloroquine 198 | Hydroxychloroquine 19 | Colchicine 41 | Ribavirin 32
Deferoxamine 104 | Tetrandrine 13 | Tetrandrine 40 | Methylprednisolone 30
Ribavirin 101 | Eculizumab 10 | Oseltamivir 37 | Deferoxamine 30
Methylprednisolone 44 | Tocilizumab 9 | Azithromycin 36 | Thalidomide 25
Note that a random classifier would result in approximately 5.3 hits per drug, which suggests that the reported predictions are significantly better than random. It can be observed that several of the widely used drugs in clinical trials appear high on the predicted lists of both methods. Furthermore, PanRep-FT reports a higher hit rate than RGCN, which corroborates the benefit of using the universal pretraining decoders.
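A sketch of the ranking procedure just described, using a DistMult-style score for the 'drug-inhibits-gene' triples; the embeddings, relation vector, and clinical-trial drug ids are random stand-ins, not the trained DRKG model.

```python
import torch

D = 600
drug_emb = torch.randn(8104, D)   # candidate FDA-approved drugs
gene_emb = torch.randn(442, D)    # Covid-19-related genes
r_inhibit = torch.randn(D)        # 'inhibits' relation embedding
clinical = set(torch.randint(0, 8104, (32,)).tolist())  # stand-in clinical-trial drug ids

hits = {}
for g in range(gene_emb.shape[0]):
    scores = (drug_emb * r_inhibit * gene_emb[g]).sum(dim=1)  # score every drug for gene g
    for d in torch.topk(scores, k=100).indices.tolist():      # top-100 drugs for this gene
        if d in clinical:
            hits[d] = hits.get(d, 0) + 1                      # hit counts, as in Table 4

top5 = sorted(hits.items(), key=lambda kv: -kv[1])[:5]        # top-5 drugs by hits
```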
The universal representation endows PanRep with increased generalization power that allows for accurate link prediction when training data are extremely scarce, as is the case for Covid-19. While this study does not recommend specific drugs, it demonstrates a powerful deep learning methodology for prioritizing existing drugs for further investigation, which holds the potential of accelerating therapeutic development for Covid-19.

This paper develops a novel framework for unsupervised learning of universal node representations on heterogeneous graphs termed PanRep. PanRep supervises the GNN encoder by decoders attuned to model the clustering of local node features, structural similarity among nodes, the local and intermediate neighborhood structure, and the mutual information among same-type nodes. To further facilitate cases where limited labels are available, we implement PanRep-FT. Experiments in node classification and link prediction corroborate the competitive performance of the learned universal node representations compared to unsupervised and supervised methods. Experiments on the DRKG showcase the advantage of the universal embeddings in drug repurposing.

Broader impact. This work contributes to improving the performance of machine learning approaches on large linked datasets (known as learning-on-graphs ML tasks), which are used to enable applications in various domains. Examples of such applications include recommendations, link prediction, node/entity classification, clustering, entity resolution, drug-protein binding prediction, personalized treatment, and inter-atomic potential estimation. Examples of such domains include (e-)commerce; life, physical, and social sciences; engineering and manufacturing; law enforcement and (information) security; healthcare; and finance. The source of this paper's improved performance is that it reduces the amount of labeled data required to achieve a certain level of performance. When it operates as an unsupervised representation learning approach (PanRep), it requires no labeled data; whereas when it operates in its fine-tuning mode (PanRep-FT), it can achieve considerable improvements with a limited amount of labeled data.

This paper's potential positive impact on society stems from the fact that it can lower the effort required to achieve the desired performance in learning-on-graphs applications and consequently make them available to larger segments of society. This will allow us to recommend the appropriate course to a student, show an enjoyable movie to a customer, find a drug candidate with no adverse side effects, detect money laundering transactions, or decrease the time and hence the energy required by a supercomputing-based materials science numerical simulation. Note that the lower effort is a direct consequence of needing no or fewer labeled data. At the same time, this work can also have some negative consequences. It can result in job losses by automating tasks that are currently done by people. It can make it easier for bad actors or undemocratic regimes to infer protected/sensitive/private information by leveraging learning-on-graphs approaches and smaller labeled sets. The unsupervised nature of PanRep can fail to compute high-quality representations for infrequently occurring nodes, which, depending on the dataset, can potentially discriminate against some protected groups if those groups are not well represented. Nevertheless, some of these adverse consequences can be averted by adjusting PanRep with appropriate regularization or constraints accounting for privacy or fairness concerns.

The methods presented in this paper are implemented in the efficient deep graph learning (DGL) library [46]. PanRep is implemented using the mini-batch training framework, which facilitates training on very large graphs even with limited computational resources. The competing methods RGCN, MAGNN, and HAN are also implemented using DGL. PanRep experiments are executed on AWS p3.8xlarge instances with 8 GPUs, each having 16GB of memory. The competing methods include RGCN [38], HAN [48], MAGNN [17], node2vec [21], meta2vec [14], and an adaptation of the work in [45] for heterogeneous graphs, termed HIM.
For link prediction, the baseline model is RGCN [38] with DistMult supervision [52], which uses the same encoder as PanRep. The Mot-decoder and CR-decoder employ a 2-layer MLP. For the MRW-decoder we use length-2 MRWs. The parameters for all methods considered are optimized using the performance on the validation set. For the majority of the experiments, PanRep uses a hidden dimension of 300, 1 hidden layer, 800 epochs of model training, 100 epochs of fine-tuning, and an ADAM optimizer [28] with a learning rate of 0.001. For link prediction fine-tuning, PanRep uses a DistMult model [52], whereas for node classification it uses a logistic loss.

The Drug Repurposing Knowledge Graph (DRKG) contains 97,055 entities belonging to 13 entity types [26]. The type-wise distribution of the entities is shown in Table 6. DRKG contains a total of 5,869,294 triplets belonging to 107 edge types. Table 7 shows the number of triplets between different entity-type pairs for DRKG and the various data sources. The DRKG is publicly available.

IMDB [1] is a movie database including information about the cast, production crew, and plot summaries. A subset of IMDB is used after the data preprocessing in Table 5. Movies are labeled as one of three classes (Action, Comedy, and Drama) based on their genre information. Each movie is also described by a bag-of-words representation of its plot keywords. OAG [2] is a bibliography website. We preprocess the data and retain the subgraph in Table 5. The papers are divided into 6 research areas. Each paper is described by a BERT embedding of the paper's title.

References
[4] Distributed large-scale natural graph factorization.
[5] Graphlet decomposition: Framework, algorithms, and applications.
[6] Structure and evolution of transcriptional regulatory networks.
[7] Regularization and semi-supervised learning on large graphs.
[8] Laplacian eigenmaps and spectral techniques for embedding and clustering.
[9] Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.
[10] Grarep: Learning graph representations with global structural information.
[11] Gradient normalization for adaptive loss balancing in deep multitask networks.
[12] Pre-training of deep bidirectional transformers for language understanding.
[13] Decaf: A deep convolutional activation feature for generic visual recognition.
[14] metapath2vec: Scalable representation learning for heterogeneous networks.
[15] Learning structural node embeddings via diffusion wavelets.
[16] Why does unsupervised pre-training help deep learning?
[17] Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding.
[18] Rich feature hierarchies for accurate object detection and semantic segmentation.
[19] Pagerank beyond the web.
[20] A sars-cov-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing.
[21] node2vec: Scalable feature learning for networks.
[22] Representation learning on graphs: Methods and applications.
[23] Rolx: Structural role extraction & mining in large graphs.
[24] Learning deep representations by mutual information estimation and maximization.
[25] Strategies for pre-training graph neural networks.
[26] Drkg: Drug repurposing knowledge graph for covid-19.
[27] An efficient k-means clustering algorithm: Analysis and implementation.
[28] Adam: A method for stochastic optimization.
[29] Semi-supervised classification with graph convolutional networks.
[30] Heat kernel based community detection.
[31] Matrix factorization techniques for recommender systems.
[32] Albert: A lite bert for self-supervised learning of language representations.
[33] Asymmetric transitivity preserving graph embedding.
[34] Deepwalk: Online learning of social representations.
[35] Deep contextualized word representations.
[36] Language models are unsupervised multitask learners.
[37] Role discovery in networks.
[38] Modeling relational data with graph convolutional networks.
[39] Meta-path guided embedding for similarity search in large-scale heterogeneous information networks.
[40] Heterogeneous information network embedding for recommendation.
[41] Kernels and regularization on graphs.
[42] Least squares support vector machine classifiers. Neural Processing Letters.
[43] Line: Large-scale information network embedding.
[44] Graph attention networks.
[45] Deep graph infomax.
[46] Deep graph library: Towards efficient and scalable deep learning on graphs.
[47] Knowledge graph embedding: A survey of approaches and applications.
[48] Heterogeneous graph attention network.
[49] Data mining with big data.
[50] A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
[51] How powerful are graph neural networks?
[52] Embedding entities and relations for learning and inference in knowledge bases.
[53] Exploiting sentiment homophily for link prediction.
[54] Facial landmark detection by deep multi-task learning.
[55] Dgl-ke: Training knowledge graph embeddings at scale.
[56] Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2.