key: cord-0481852-40ah0a27 authors: Baek, Jinheon; Lee, Dong Bok; Hwang, Sung Ju title: Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Prediction date: 2020-06-11 journal: nan DOI: nan sha: 225d571f97f9097be436626f225cff572007aaf4 doc_id: 481852 cord_uid: 40ah0a27

Many practical graph problems, such as knowledge graph construction and drug-to-drug interaction prediction, require handling multi-relational graphs. However, handling real-world multi-relational graphs with Graph Neural Networks (GNNs) is often challenging due to their evolving nature, in which new entities (nodes) emerge over time. Moreover, newly emerged entities often have only a few links, which makes learning even more difficult. Motivated by this challenge, we introduce the realistic problem of few-shot out-of-graph link prediction, where we predict not only the links between seen and unseen nodes, as in conventional out-of-knowledge link prediction, but also the links between unseen nodes, with only a few edges per node. We tackle this problem with a novel transductive meta-learning framework which we refer to as Graph Extrapolation Networks (GEN). GEN meta-learns both a node embedding network for inductive inference (seen-to-unseen) and a link prediction network for transductive inference (unseen-to-unseen). For transductive link prediction, we further propose a stochastic embedding layer to model the uncertainty in link prediction between unseen entities. We validate our model on multiple benchmark datasets for knowledge graph link prediction and drug-to-drug interaction prediction. The results show that our model significantly outperforms relevant baselines on out-of-graph link prediction tasks.

Graphs have strong expressive power to represent structured data, as they model data as a set of nodes (objects) and edges (relations). To exploit graph-structured data, which lies in a non-Euclidean domain, several recent works propose graph-based neural architectures, referred to as Graph Neural Networks (GNNs) [9, 22]. While early work mostly deals with simple graphs with unlabeled edges, recently proposed relation-aware GNNs [38, 39] consider multi-relational graphs with labels and directions on the edges. These multi-relational graphs expand the application of GNNs to more real-world domains, such as social network modeling [18], natural language understanding [25], protein structure modeling [13], and drug-to-drug interaction prediction [60]. Among multi-relational graphs, Knowledge Graphs (KGs), which represent knowledge bases (KBs) such as Freebase [3] and WordNet [27], receive the most attention. They represent entities as nodes and relations among the entities as edges, in the form of a triplet: (head entity, relation, tail entity), e.g., (Louvre museum, is located in, Paris). Although knowledge graphs generally contain a huge number of triplets, they are well known to be highly incomplete [28]. Therefore, automatically completing knowledge graphs, known as the link prediction task, is a practically important problem for KGs. Prior work tackles this problem, i.e., inferring missing triplets, by learning embeddings of entities and relations from existing triplets, achieving impressive performance [5, 57, 10, 31, 30].

Figure 1: Concept (Left): An illustration of out-of-graph link prediction for emerging entities.
Blue dotted arrows denote inferred relationships between seen and unseen entities, and red dotted arrows denote inferred relationships between unseen entities. (Center): An illustration of our meta-learning framework for the out-of-graph link prediction task. Orange arrows denote the support (training) set S and green dotted arrows denote the query (test) set Q. (Right): Visualizations of the learned embeddings. Our transductive GEN embeds the unseen entities on the manifold of seen entities, while the baseline [49] embeds them off-manifold.

Despite this success, link prediction for KGs in real-world scenarios remains challenging for a couple of reasons. First, knowledge graphs dynamically evolve over time, rather than staying static. Shi and Weninger [40] report that around 200 new entities emerge every day. Predicting links on these emerging entities poses a new challenge, especially when predicting the links between the emerging entities themselves. Moreover, real-world KGs generally exhibit a long-tail distribution, where a large portion of the entities have only a few triplets to train on (see Figure 2). Embedding-based methods, however, usually assume that a sufficient amount of associated triplets is available for training, and cannot embed unseen entities. Thus they are highly suboptimal for learning and inference on evolving real-world graphs.

Motivated by the limitations of existing approaches, we introduce the realistic problem of few-shot Out-Of-Graph (OOG) link prediction for emerging entities. In this task, we predict not only the links between seen and unseen entities but also the links between the unseen entities themselves (Figure 1, left). To this end, we propose a novel meta-learning framework for OOG link prediction, which we refer to as Graph Extrapolation Networks (GENs). GENs are meta-learned to extrapolate the knowledge from seen to unseen entities, and to transfer knowledge from entities with many links to entities with few links. Specifically, given embeddings of the seen entities in a multi-relational graph, we meta-train two GNNs to predict the links between seen-to-unseen and unseen-to-unseen entities. The first GNN, the inductive GEN, learns to embed unseen entities that were not observed during training, and predicts the links between seen and unseen entities. The second GNN, the transductive GEN, learns to predict the links between the unseen entities themselves. This transductive inference is possible since our meta-learning framework can simulate unseen entities during meta-training, whereas they are unobservable in conventional learning frameworks. Also, since link prediction for unseen entities is inherently unreliable, and becomes more so when only a few triplets are available per entity, we learn a distribution over unseen entity representations for stochastic embedding, to account for uncertainty. Moreover, we apply a transfer learning strategy to model the long-tail distribution. These lead GEN to learn embeddings for unseen entities that are well aligned with those of seen entities (see Figure 1, right).

We validate GENs on their OOG link prediction performance on two benchmark knowledge graph completion datasets, namely FB15k-237 [44] and NELL-995 [55]. We also validate GENs on the OOG drug-to-drug interaction prediction task with DeepDDI [36] and BIOSNAP-sub [26]. The experimental results on the four datasets show that our model obtains significantly superior performance over the relevant baselines.
Further analysis of each component shows that both the inductive and transductive layers of GEN contribute to accurate link prediction for out-of-graph entities. In sum, our main contributions are as follows:

• We tackle a realistic problem setting of few-shot out-of-graph link prediction, aiming to perform link prediction between unseen (emerging) entities for multi-relational graphs that exhibit long-tail distributions, where each entity has only a few associated triplets.

• To tackle this problem, we propose a novel meta-learning framework, Graph Extrapolation Network (GEN), which meta-learns the node embeddings for unseen entities to obtain low link prediction error for both seen-to-unseen (inductive) and unseen-to-unseen (transductive) cases.

• We validate GEN on few-shot OOG link prediction tasks on four benchmark datasets for knowledge graph completion and drug-to-drug interaction prediction, on which it significantly outperforms relevant baselines.

Graph Neural Networks. Existing GNNs encode nodes by aggregating features from neighboring nodes, using recurrent neural networks [16, 37], mean pooling with a layer-wise propagation rule [22], or a learnable attention-weighted combination of neighbor features [47], to name a few. While most existing models work with simple undirected graphs, some recent work tackles multi-relational graphs due to their practical importance. Directed-GCN [25] and Weighted-GCN [39] consider edge direction and relation types, respectively, while R-GCN [38] considers both simultaneously. Recently, Vashishth et al. [46] proposed to jointly embed nodes and relations in a multi-relational graph. Since our GEN is a general framework for OOG link prediction rather than a specific GNN, it is compatible with any GNN implementation for multi-relational graphs.

Meta-Learning. Meta-learning, whose objective is to generalize over a distribution of tasks, is an essential ingredient of our few-shot OOG link prediction framework, in which we simulate unseen test nodes with a subset of training nodes. To mention a few approaches, metric-based methods [48, 41] learn a shared metric space that brings instance embeddings close to the embeddings of their correct classes, while gradient-based methods [12, 32] learn a shared parameter initialization that generalizes over diverse tasks in a bi-level optimization framework. Relatively few works consider meta-learning with GNNs: Meta-GNN [59] uses meta-learning for few-shot node classification, and Meta-Graph [6] tackles few-shot link prediction over graphs constructed with only a small sample of known edges.

Multi-relational Graphs. A popular application of multi-relational graphs is knowledge graph completion. Previous methods for this problem can be broadly classified into translational distance based [5, 51], semantic matching based [33, 57, 45], and deep neural network based [31, 10, 38, 30] methods. While these methods require a large number of training instances, many real-world graphs exhibit a long-tail distribution. Few-shot relational learning methods tackle this issue by learning relations of seen entities from few examples [56, 7]. Nonetheless, the problem becomes more difficult as knowledge graphs evolve with newly emerging entities. Several models [54, 52] tackle this problem by utilizing extra information about the entities, such as their textual descriptions. Some recent methods [17, 49, 1] propose to handle unseen entities in an inductive manner.
However, since these methods cannot simulate unseen entities during the training phase, they have fundamental limitations in generalizing to actual unseen entities. In contrast, our method tackles both seen-to-unseen and unseen-to-unseen link prediction under a transductive meta-learning framework.

Drug-to-drug interaction (DDI) prediction is another important real-world application of multi-relational graphs, where the problem is to predict the interactions between drugs. Recently, Zitnik et al. [60] and Ma et al. [24] proposed end-to-end GNNs to tackle this problem, which demonstrate comparatively better performance over non-GNN methods [2, 34, 58].

Our goal is to perform link prediction for emerging entities of a multi-relational graph, in which a large portion of the entities have only a few associated triplets. We begin with formal definitions of the multi-relational graph and the link prediction task.

Definition 3.1 (Multi-relational Graph). Let E and R be two sets of entities and relations, respectively. Then a relation is defined as a triplet (e_h, r, e_t), where e_h, e_t ∈ E are the head and the tail entity, and r ∈ R is a specific type of relation between them. A multi-relational graph G is represented as a collection of triplets, that is, G = {(e_h, r, e_t)} ⊆ E × R × E.

Definition 3.2 (Link Prediction). Link prediction refers to the task of predicting an unknown item of a triplet, given the two other items. We consider both the entity prediction and relation prediction tasks. Entity prediction refers to the problem of predicting an entity e ∈ E given the other entity and the relation: (e_h, r, ?) or (?, r, e_t). Relation prediction refers to the problem of predicting r ∈ R given the head and tail entities: (e_h, ?, e_t).

Link prediction for multi-relational graphs. Link prediction is essentially the problem of assigning high scores to true triplets. Therefore, many existing methods use a score function s(e_h, r, e_t) to measure the score of a given triplet, whose inputs are the respective embeddings of the triplet elements (see Table 1). The objective of link prediction is then to find the representations of the triplet elements, and the function parameters in the parametric-model case, that maximize the scores of the true triplets. Which embedding method to use depends on the specific application domain.
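To make the scoring interface concrete, the following is a minimal sketch of a DistMult-style score function [57] in PyTorch. The function name and tensor shapes are our own illustrative choices; Table 1 lists the full set of score functions considered in the paper.

```python
import torch

def distmult_score(head: torch.Tensor, rel: torch.Tensor, tail: torch.Tensor) -> torch.Tensor:
    """DistMult-style bilinear score s(e_h, r, e_t) = <e_h, r, e_t>:
    the element-wise product of the three embeddings, summed over dimensions.

    head, rel, tail: (batch, d) embedding tensors. Higher score = more plausible.
    """
    return (head * rel * tail).sum(dim=-1)
```

Other score functions in Table 1 follow the same interface, differing only in how the three embeddings are combined (e.g., translation for TransE, complex rotation for RotatE).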
However, existing work mostly tackles link prediction between seen entities that already exist in the given multi-relational graph. In this work, we tackle the task of few-shot Out-Of-Graph (OOG) link prediction: predicting the missing item of a triplet that contains at least one unseen entity e' ∈ E', where E' is a set of unseen entities disjoint from the seen entities E, and the other entity in the triplet may itself be seen or unseen: e ∈ (E ∪ E'). We further assume that each unseen entity e' is associated with only K triplets: |{(e', r, e) or (e, r, e')}| ≤ K with e ∈ (E ∪ E'), where K is a small number (e.g., 1 or 3). While a few existing works [17, 49] tackle entity prediction between seen and unseen entities, in real-world settings unseen entities do not emerge one by one, but may emerge simultaneously as a set, with only a few triplets available for each entity. Thus, existing methods are highly suboptimal for handling such real-world scenarios.

We now introduce Graph Extrapolation Networks (GENs) for the out-of-graph link prediction task. Since most previous methods assume that every entity in the test set is seen during training, they cannot handle emerging entities, which are unobserved during training. While a few existing works [17, 49] train for seen-to-seen link prediction in the hope that the model generalizes to seen-to-unseen cases, they are suboptimal in handling unseen entities. Therefore, we use a meta-learning framework to handle the OOG link prediction problem, whose goal is to train a model over a distribution of tasks such that the model generalizes to unseen tasks. Figure 1 illustrates our learning framework. In essence, we meta-train GEN to perform both inductive and transductive inference on various simulated test sets of OOG entities, such that it extrapolates the knowledge of an existing graph to arbitrary unseen entities. We describe the framework in detail in the next few paragraphs.

Learning Objective. Suppose that we are given a multi-relational graph G ⊆ E × R × E, which consists of seen entities e ∈ E and relations r ∈ R. We aim to represent unseen entities e' ∈ E' drawn from a distribution p(E'), by extrapolating the knowledge in the given graph G, to predict the links between seen and unseen entities, (e, r, e') or (e', r, e), and even between unseen entities themselves, (e', r, e''). Toward this goal, we maximize the score s(e_h, r, e_t) of every triplet that contains any unseen entity e', with respect to the embedding and score function parameters θ:

max_θ E_{p(E')} [ Σ_{(e_h, r, e_t): e_h ∈ E' or e_t ∈ E'} s_θ(e_h, r, e_t) ]   (1)

While this is a seemingly impossible goal, as it involves generalization to unseen entities, we can tackle it with meta-learning, which we describe next.

Meta-Learning Framework. While conventional learning frameworks cannot handle unseen entities during the training phase, with meta-learning we can formulate a set of tasks such that the model learns to generalize over unseen entities, which are simulated using seen entities. To cast the OOG link prediction problem as a meta-learning problem, we first randomly split the entities of a given graph into meta-training and meta-test sets. Then we generate a task by sampling a set of (simulated) unseen entities for meta-training (see Figure 1). Formally, each task T corresponds to a set of unseen entities E_T ⊂ E' with a predefined number of instances |E_T| = N. We then divide the triplets associated with each entity e'_i ∈ E_T into a support set S_i and a query set Q_i, such that |S_i| = K and |Q_i| = M_i − K, where K is the few-shot size and M_i is the number of triplets associated with the unseen entity e'_i. Our meta-objective is then to learn to represent the unseen entities as φ using the support set S, so as to maximize the triplet scores on the query set Q:

max_θ E_{p(T)} [ Σ_{e'_i ∈ E_T} Σ_{(e_h, r, e_t) ∈ Q_i} s_θ(e_h, r, e_t | S_i) ]   (2)

We refer to this specific setting as K-shot OOG link prediction throughout this paper. Once the model is trained on the meta-training tasks T_train, we can apply it to unseen meta-test tasks T_test, whose set of entities is disjoint from that of T_train, as shown in Figure 1.

Algorithm 1 Meta-training of GENs
Require: Distribution over training tasks p(T_train)
Require: Learning rate for meta-update α
1: Initialize parameters Θ = {θ, θ_μ, θ_σ}
2: while not done do
3:   Sample a task T ∼ p(T_train)
4:   for all e'_i ∈ T do
5:     Sample the support and query sets {S_i, Q_i} corresponding to e'_i
6:     Inductively generate using (3): φ_i = f_θ(S_i)
7:   end for
8:   for all e'_i ∈ T do
9:     Transductively generate using (4): μ_i = g_{θ_μ}(S_i, φ) and σ_i = g_{θ_σ}(S_i, φ)
10:    Sample the stochastic embedding φ'_i as in (5)
11:  end for
12:  Update Θ with learning rate α using the loss in (6)
13: end while

Figure 3: The overall framework of our model for each task. We extrapolate knowledge using a support set S with inductive and transductive learning, and then predict links with the output embedding φ'.
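To illustrate how a meta-training task can be simulated from a static graph, here is a minimal sketch of the episodic task construction described above. All names are our own, and dataset-specific filtering (e.g., keeping only entities with enough triplets) is omitted; this is a sketch under those assumptions, not the paper's exact data pipeline.

```python
import random
from collections import defaultdict

def sample_task(triplets, candidate_entities, num_unseen=500, k_shot=3):
    """Simulate one meta-training task: sampled entities are treated as
    'unseen', and each entity's associated triplets are split into a K-shot
    support set and a query set (cf. |S_i| = K, |Q_i| = M_i - K)."""
    unseen = set(random.sample(candidate_entities, num_unseen))
    by_entity = defaultdict(list)
    for (h, r, t) in triplets:
        if h in unseen:
            by_entity[h].append((h, r, t))
        if t in unseen:
            by_entity[t].append((h, r, t))
    task = {}
    for e in unseen:
        assoc = by_entity[e]
        random.shuffle(assoc)
        support, query = assoc[:k_shot], assoc[k_shot:]
        if len(support) == k_shot and query:  # need K support and >= 1 query triplet
            task[e] = (support, query)
    return task
```

At meta-test time, the same support/query structure is applied to genuinely unseen entities, which is what lets the meta-trained model transfer directly.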
Graph Extrapolation Networks. To extrapolate the knowledge of a given graph G to an unseen entity e'_i through its support set S_i, we propose a GNN-based meta-learner that outputs the representation of unseen entities. We formulate our meta-learner f_θ(·) as follows (Figure 3):

φ_i = σ( (1/K) Σ_{(r,e) ∈ n(S_i)} W_r C_{r,e} )   (3)

where we denote the set of neighboring relation-entity pairs as n(S_i) = {(r, e) | (e'_i, r, e) or (e, r, e'_i) ∈ S_i}, K is the size of n(S_i), W_r ∈ R^{d×2d} is a relation-specific transformation matrix that is meta-learned, C_{r,e} ∈ R^{2d} is the concatenation of the feature representations of the relation-entity pair, and σ is the ReLU function. Since GEN is essentially a framework for OOG link prediction, it is compatible with any GNN.

Transductive Meta-Learning of GENs. The inductive GEN described above constructs the representation of each unseen entity e'_i through its support set S_i, as described in (3), and performs link prediction on the query set Q_i independently for each entity. A major drawback of this inductive scheme is that it does not consider the relationships between unseen entities. However, to tackle unseen entities simultaneously as a set, one should consider not only the relationships between seen and unseen entities, as the inductive GEN does, but also the relationships among the unseen entities themselves. To tackle this issue, we extend the inductive GEN to further perform transductive inference, which allows knowledge to propagate between unseen entities. More specifically, we add one more GEN layer g_θ(·), similar to the inductive meta-learner f_θ(·), to consider the inter-relationships between unseen entities (Figure 3):

φ'_i = σ( W_o φ_i + (1/K) Σ_{(r,e) ∈ n(S_i)} W'_r C_{r,e} )   (4)

where W_o is a weight matrix for the self-connection, which retains the embedding φ_i computed by the preceding inductive layer f_θ(S_i). To leverage the knowledge of neighboring unseen entities, our transductive layer g_θ(·) aggregates the representations across all associated neighbors with the transductive weight matrix W'_r ∈ R^{d×2d}, where the neighbors may now include the representations of unseen entities φ, rather than treating them as noise.

Stochastic Inference. A naive transductive GEN generalizes to unseen entities by simulating them with seen entities during meta-training. However, due to the intrinsic unreliability of few-shot OOG link prediction, in which each entity has only a few triplets, there can be high uncertainty in the representations of unseen entities. To model such uncertainty, we stochastically embed the unseen entities by learning a distribution over the unseen entity embedding φ'_i. To this end, we first assume that the true posterior distribution has the form p(φ'_i | S_i, φ). Since computing the true posterior is intractable, we approximate it with q(φ'_i | S_i, φ) = N(μ_i, diag(σ_i²)), and compute the mean and variance via two individual transductive GEN layers, μ_i = g_{θ_μ}(S_i, φ) and σ_i = g_{θ_σ}(S_i, φ), which adapts the variational graph autoencoder [21] to our setting. The score to maximize is then defined via Monte Carlo (MC) approximation:

s(e_h, r, e_t | S, φ) ≈ (1/L) Σ_{l=1}^{L} s(e_h, r, e_t; φ'^{(l)}),  with φ'^{(l)} ∼ q(φ' | S, φ)   (5)

where we set the MC sample size to L = 1 during meta-training for computational efficiency, and perform the MC approximation with a sufficiently large sample size (e.g., L = 10) at meta-test time. We make the approximate posterior the same as the prior to keep the pipeline consistent between training and testing (see Sohn et al. [42]). We also model the uncertainty of the output embedding of an unseen entity from the transductive GEN layer via Monte Carlo dropout [14].
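The following is a minimal PyTorch sketch of the two layers in (3) and (4) together with the reparameterized stochastic embedding in (5). This is our own simplified reading, not the released implementation: for brevity we keep one full weight matrix per relation instead of the basis decomposition used in the paper (Appendix A.3), and we process a single entity's support neighborhood at a time.

```python
import torch
import torch.nn as nn

class InductiveGEN(nn.Module):
    """Sketch of the inductive layer in (3): average the relation-transformed
    neighbor features of a support set to embed one unseen entity."""
    def __init__(self, num_rels: int, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_rels, dim, 2 * dim) * 0.01)

    def forward(self, rel_ids: torch.Tensor, nbr_feats: torch.Tensor) -> torch.Tensor:
        # rel_ids: (K,) relation indices of the support triplets
        # nbr_feats: (K, 2d) concatenated [relation; neighbor-entity] features C_{r,e}
        msgs = torch.einsum('kde,ke->kd', self.W[rel_ids], nbr_feats)  # W_r C_{r,e}
        return torch.relu(msgs.mean(dim=0))  # phi_i, as in (3)

class StochasticTransductiveGEN(nn.Module):
    """Sketch of the transductive layer in (4) plus the stochastic embedding in
    (5): two transductive heads produce a mean and a log-variance, and phi'_i
    is sampled with the reparameterization trick."""
    def __init__(self, num_rels: int, dim: int):
        super().__init__()
        self.mu_head = InductiveGEN(num_rels, dim)
        self.logvar_head = InductiveGEN(num_rels, dim)
        self.W_o = nn.Linear(dim, dim, bias=False)  # self-connection on phi_i

    def forward(self, phi_i, rel_ids, nbr_feats):
        # nbr_feats may now include the inductively estimated unseen-entity embeddings
        mu = self.mu_head(rel_ids, nbr_feats) + self.W_o(phi_i)
        logvar = self.logvar_head(rel_ids, nbr_feats)
        eps = torch.randn_like(mu)  # one MC sample (L = 1, as in meta-training)
        return mu + eps * torch.exp(0.5 * logvar)
```

At meta-test time, the forward pass of the stochastic layer would simply be repeated L times (e.g., L = 10) and the resulting triplet scores averaged, matching the MC approximation in (5).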
Our final GEN is trained with both the inductive and transductive steps, as described in Algorithm 1.

Loss Function. Each task T consists of a support set and a query set: T = {S, Q}. During meta-training, we represent the embeddings of the unseen entities e'_i ∈ E_T using the support set S_i with GENs, and then optimize the GENs on the labeled query set Q_i. Since every query set contains only positive triplets, we perform negative sampling [5, 57] to train the meta-learner to distinguish positive from negative triplets. Specifically, we corrupt an entity of each triplet in the query set:

Q_i^- = {(e_h, r, e^-) or (e^-, r, e_t) | (e_h, r, e_t) ∈ Q_i}

where e^- is a corrupted entity and Q_i^- holds the negative samples for the unseen entity e'_i. We use a hinge loss to optimize our model:

L = Σ_{(e_h,r,e_t) ∈ Q_i, (e_h^-,r,e_t^-) ∈ Q_i^-} max(0, γ − s^+ + s^-)   (6)

where γ > 0 is a margin hyper-parameter, s is the output of one of the score functions in Table 1, and s^+ and s^- denote the scores of positive and negative triplets, respectively. For the drug-to-drug interaction prediction task, we follow Ryu et al. [36] and optimize the model with a binary cross-entropy loss computed for each label from a sigmoid output of the score function in Table 1.

Meta-Learning for Long-Tail Tasks. Since many real-world graphs follow a long-tail distribution (see Figure 2), it is beneficial to transfer knowledge from entities with many links to entities with only a few links. To this end, we follow a scheme similar to Wang et al. [50]: we start learning the model with many-shot cases, and then gradually decrease the number of shots to the few-shot case on a logarithmic scale. Further implementation details are given in Section B of the Appendix.
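Before turning to the experiments, here is a minimal sketch of the negative sampling and hinge loss in (6). The helper names and the tail-only corruption are our own simplifications; a full implementation would also corrupt head entities and filter corruptions that collide with known true triplets.

```python
import torch

def hinge_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor, margin: float = 1.0):
    """Margin-based ranking loss as in (6): penalize negative (corrupted)
    triplets whose score comes within `margin` of the paired positive triplet.
    pos_scores, neg_scores: (batch,) scores s+ and s- for aligned pairs."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0).sum()

def corrupt_tails(triplets: torch.Tensor, num_entities: int, num_neg: int = 32):
    """Negative sampling: replace the tail of each (h, r, t) query triplet
    with random entities, yielding `num_neg` corruptions per positive."""
    h, r, t = triplets.t()  # triplets: (batch, 3) long tensor
    neg_t = torch.randint(num_entities, (triplets.size(0), num_neg))
    return h.unsqueeze(1).expand(-1, num_neg), r.unsqueeze(1).expand(-1, num_neg), neg_t
```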
We validate GEN on few-shot OOG link prediction tasks for two different applications of multi-relational graphs: knowledge graph (KG) completion and drug-to-drug interaction (DDI) prediction.

Datasets. For the knowledge graph completion datasets, we consider out-of-graph entity prediction, whose goal is to predict the other entity given an unseen entity and a relation. 1) FB15k-237. This dataset [44] consists of 310,116 triplets over 14,541 entities and 237 relations, collected via crowdsourcing. 2) NELL-995. This dataset [55] consists of 154,213 triplets over 75,492 entities and 200 relations, collected by a lifelong learning system [29]. Since existing benchmark datasets do not target OOG link prediction, but instead assume that all entities given at test time are seen during training, we modify each dataset so that the triplets used for link prediction at test time contain at least one unseen entity (see Appendix A.1 for the detailed setup).

Baselines and our models. 1) TransE and 2) RotatE: translational distance based embedding methods for multi-relational graphs [5, 43]. 3) DistMult and 4) ComplEx: semantic matching based embedding methods [57, 45]. 5) R-GCN: a GNN-based method for modeling relational data [38]. 6) MEAN and 7) LAN: GNN models for the out-of-knowledge-base task, which tackle unseen entities without meta-learning [17, 49]. 8) GMatching: a model that tackles link prediction for unseen relations of seen entities, which we extend to our meta-learning framework [56]. 9) I-GEN: the inductive version of our GEN, which is meta-learned to embed an unseen entity. 10) T-GEN: the transductive version of GEN, with additional stochastic transductive GNN layers to predict the links between unseen entities. We report detailed descriptions in Appendix A.2.

Implementation Details. For both I-GEN and T-GEN, we use DistMult for the initial embeddings of entities and relations and for the score function; models that instead use TransE for the embeddings and score function perform similarly, as we report in Appendix C. Following Xiong et al. [56], we train the seen-to-seen link prediction baselines with the support triplets of the meta-validation and meta-test sets included in the training set, since these baselines assume that only seen entities appear at test time and cannot otherwise handle a triplet with a completely unseen entity. Our methods, in contrast, are trained only on the meta-training set, where we generate OOG entities using episodic training. Detailed experimental setups for both datasets are described in Appendix A.3.

Evaluation Metrics. For evaluation, we use the ranking procedure of Bordes et al. [4]. For a triplet with an unseen head entity, we replace its tail entity with candidate entities from the entity dictionary to construct corrupted triplets. We then rank all triplets, the correct one and the corrupted ones, by a scoring measure, to obtain the rank of the correct triplet. We report the mean reciprocal rank (MRR) and Hits at n (H@n): MRR is the average of the multiplicative inverse of the rank of the correct triplet, and H@n is the fraction of correct triplets ranked at or below n. Moreover, as in previous work [5, 31, 38], we measure ranks in the filtered setting, where corrupted triplets that appear in the training, validation, or test sets are excluded from the ranking.
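For reference, the filtered ranking metrics described above can be computed per test triplet as follows. This is a self-contained sketch with our own function names; in practice the procedure is run for both head and tail corruptions and the resulting metrics are averaged.

```python
import torch

def filtered_metrics(scores: torch.Tensor, true_idx: int, known_mask: torch.Tensor,
                     ns=(1, 3, 10)):
    """Filtered ranking for one test triplet. `scores` holds the model score of
    every candidate corruption; `known_mask` flags candidates that form triplets
    already present in train/valid/test, which are excluded from the ranking
    (except the true entity itself)."""
    s = scores.clone()
    mask = known_mask.clone()
    mask[true_idx] = False          # never filter out the correct triplet
    s[mask] = float('-inf')
    rank = int((s > s[true_idx]).sum().item()) + 1  # 1 + #strictly-better candidates
    return {'MRR': 1.0 / rank, **{f'H@{n}': float(rank <= n) for n in ns}}
```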
Results. Table 2 shows that our I-GEN and T-GEN outperform all baselines by impressive margins on all evaluation metrics in both the 1-shot and 3-shot settings. The baseline models perform poorly on emerging entities that come with only a few training triplets, even when they have seen those entities during training (with Support Set in Table 2). In contrast, under our meta-learning framework, GENs show superior performance over the baselines even with a single training triplet per unseen entity. Moreover, while GMatching [56], which handles unseen entities by searching for the closest entity pair, achieves decent performance when trained within our meta-learning framework, it still significantly underperforms our models. We also observe that T-GEN outperforms I-GEN on both datasets in all evaluation metrics. To see where this improvement comes from, we further examine the link prediction results for the seen-to-unseen and unseen-to-unseen cases. Figure 4 shows that T-GEN obtains a significant performance gain on unseen-to-unseen link prediction, whereas I-GEN mostly cannot handle this case, as it does not consider the relationships between unseen nodes. Also, while we mainly target long-tail graphs in which the majority of entities have few links, our method works well in many-shot cases as well (Figure 5), where I-GEN and T-GEN still largely outperform the baselines, even though R-GCN sees the unseen entities during training. We further experiment with varying numbers of triplets, considering 1-, 3-, 5-, and random-shot (between 1 and 5) settings during meta-training and meta-testing. Table 3 shows that a mismatch between the numbers of shots used for training and testing does not significantly affect performance, demonstrating the robustness of GENs to a varying number of triplets at test time. Moreover, our model trained in the 1-shot setting obtains even better performance in the 3-shot setting.

Furthermore, we conduct an ablation study of T-GEN on the seen-to-unseen (S/U) and unseen-to-unseen (U/U) cases. Table 4 shows that stochastic modeling in the transductive inference layer significantly improves the unseen-to-unseen link prediction performance. Moreover, the meta-learning strategy of first learning on entities with many links and then progressing to entities with few links performs well. Finally, we observe that using the pre-trained embeddings of a seen graph leads to better performance. We also visualize the output representations of unseen entities together with seen entities: Figure 1 (right) shows that the embeddings of unseen entities are well aligned with those of seen entities. For concrete examples of link prediction on NELL-995, see Section D of the Appendix.

Datasets. We further validate our GENs on the OOG relation prediction task using two public drug-to-drug interaction (DDI) datasets.

Baselines and our models. 1) MLP: the feed-forward neural network used for the DDI task [36]. 2) MPNN: a GNN that uses edge-conditioned convolutions [15]. 3) R-GCN: the same model used for entity prediction on the KG completion task [38]. 4) I-GEN: the inductive GEN, which uses only the feature representation of an entity e_k, instead of a relation-entity pair (r_k, e_k), since the relation is the prediction target in the DDI task. 5) T-GEN: the transductive GEN, with additional stochastic transductive layers for unseen-to-unseen relation prediction. For both I-GEN and T-GEN, we use the preprocessed drug features of Ryu et al. [36] for the initial embedding of entities, with the linear score function in Table 1. To train the baselines, we use the same training scheme as in the KG completion task, where the support triplets of the meta-validation and meta-test sets are included in the training set. Detailed experimental settings are described in Appendix A.3. For evaluation, we use the area under the receiver operating characteristic curve (ROC), the area under the precision-recall curve (PR), and the classification accuracy (Acc).

Results. Table 5 shows the DDI prediction performance of the baselines and GENs. Note that the performances on BIOSNAP-sub are comparatively lower than those on DeepDDI, due to the use of the preprocessed input features suggested by Ryu et al. [36]. As in the KG completion tasks, both I-GEN and T-GEN outperform all baselines by impressive margins on all evaluation metrics. These results demonstrate that our GENs extend to OOG link prediction in other real-world applications of multi-relational graphs. We also compare the link prediction performance for the seen-to-unseen and unseen-to-unseen cases on the two DDI datasets: the rightmost two columns of Figure 4 show that T-GEN obtains superior performance over I-GEN on unseen-to-unseen link prediction, especially with stochastic modeling.

We proposed the realistic problem of few-shot out-of-graph (OOG) link prediction, which considers link prediction between unseen (or emerging) entities of multi-relational graphs, where each entity comes with only a few associated triplets. To this end, we proposed a novel meta-learning framework for OOG link prediction, which we refer to as Graph Extrapolation Network. Under the defined K-shot learning setting, GENs learn to extrapolate the knowledge of a given graph to unseen entities, with a stochastic transductive layer that further propagates knowledge between the unseen entities and models the uncertainty in link prediction.
We validated the OOG link prediction performance of GENs on four benchmark datasets, on which they largely outperformed the relevant baselines.

Constructing knowledge bases that accurately reflect up-to-date knowledge about entities and the links between them is crucial for their application in real-world scenarios. However, conventional link prediction methods for knowledge base systems mostly consider a static knowledge graph that does not change over time. Yet, as new entities emerge every day [40] (e.g., COVID-19), the ability to dynamically incorporate them into the existing knowledge graph is becoming a significantly important problem, which we mainly tackle in this paper. As a specific example of our approach, the novel coronavirus, COVID-19, is threatening lives around the globe. To eradicate it, we may want to best utilize the accumulated knowledge about existing coronavirus variants [53, 8], by identifying the links between seen entities (SARS and MERS) and unseen entities (COVID-19), or the links between newly emerged unseen entities (COVID-19 and a novel vaccine under study). The following are further use cases of our proposed out-of-graph link prediction system:

• The proposed meta-learning based few-shot out-of-graph link prediction method can infer the relationships between entities describing past coronavirus outbreaks and the current COVID-19 situation.

• Our transductive inference, with stochastic transductive GENs, can find relationships among the novel entities regarding COVID-19 that rapidly emerge over time, which may allow us to discover meaningful links among them.

• Regarding drug-to-drug interaction prediction, our method can be further utilized to analyze the side effects of simultaneously taking a novel antiviral drug for COVID-19 and existing drugs, before clinical trials.

While we describe the impact of our method on one specific but significantly important topic, it can be broadly applied to any real-world application that requires predicting links involving unseen entities. Although our method obtains significantly better performance than existing methods on out-of-graph link prediction, its predictions are far from perfect. Thus, the model should be used as a candidate selection tool (Hits@N) when inferring critical information (e.g., drug-to-drug interaction prediction for COVID-19), and more effort should be made to develop a reliable system.

Since existing benchmark datasets assume that all entities given at test time are seen during training, we modify the datasets to formulate the Out-of-Graph (OOG) link prediction task, in which completely unseen entities appear at test time. The dataset modification process is as follows:

• First, we randomly sample the unseen entities, each of which has a relatively small number of triplets in the dataset. We then divide the sampled unseen entities into meta-training/validation/test sets.

• Second, we select the triplets used for constructing the In-Graph, such that neither the head nor the tail entity of any In-Graph triplet is an unseen entity.

• Finally, we match the unseen entities in the meta-sets with their triplets. Each triplet in the meta-sets contains at least one unseen entity, and no meta-set triplet is included in the In-Graph.

1) FB15k-237. This dataset [44], used for the knowledge graph completion task, consists of 14,541 entities.
We randomly sample 5,000 entities from the 10,938 entities that have between 10 and 100 associated triplets. We split these entities into 2,500/1,000/1,500 unseen (Out-of-Graph) entities with 72,065/6,246/9,867 associated triplets containing unseen entities for meta-training/validation/test, respectively. The remaining triplets that do not contain an unseen entity are used for constructing the In-Graph. As shown in Figure 6, this dataset follows a highly long-tailed distribution.

2) NELL-995. This dataset [55], used for the knowledge graph completion task, consists of 75,492 entities. We randomly sample 3,000 entities from the 5,694 entities that have between 7 and 100 associated triplets. We split these entities into 1,500/600/900 unseen (Out-of-Graph) entities with 22,345/3,676/5,852 associated triplets containing unseen entities for meta-training/validation/test, respectively. The remaining triplets that do not contain an unseen entity are used for constructing the In-Graph. As shown in Figure 6, this dataset also follows a highly long-tailed distribution.

3) DeepDDI. This dataset [36], used for the drug-to-drug interaction prediction task, consists of 1,861 entities. We randomly sample 500 entities from the 1,039 entities that have between 7 and 300 associated triplets. We split these entities into 250/100/150 unseen (Out-of-Graph) entities with 27,726/1,171/2,160 associated triplets containing unseen entities for meta-training/validation/test, respectively. The remaining triplets that do not contain an unseen entity are used for constructing the In-Graph.

4) BIOSNAP-sub. This dataset [26, 24], used for the drug-to-drug interaction prediction task, consists of 637 entities. We randomly sample 150 entities from the 507 entities that have between 7 and 300 associated triplets. We split these entities into 75/30/45 unseen (Out-of-Graph) entities with 7,140/333/643 associated triplets containing unseen entities for meta-training/validation/test, respectively. The remaining triplets that do not contain an unseen entity are used for constructing the In-Graph.

Knowledge Graph Completion. We describe the baseline models and our Graph Extrapolation Networks for few-shot out-of-graph entity prediction on the knowledge graph (KG) completion task. 1) TransE: the translational embedding model for multi-relational data by Bordes et al. [5]. It represents both entities and relations as vectors in the same space, where the relation of a triplet acts as a translation between the head and the tail entity. 2) RotatE: this model represents entities as complex vectors and relations as rotations in a complex vector space [43], extending TransE with a complex-valued operation. 3) DistMult: this model represents the relationship between the head and the tail entity in a bilinear formulation, which captures pairwise interactions between entities [57]. 4) ComplEx: this model extends DistMult by introducing embeddings in a complex space to handle asymmetric relations, where scores differ depending on the order of the entities [45]. 5) R-GCN: a GNN-based method for modeling relational data, which extends the graph convolutional network to multi-relational structures, by Schlichtkrull et al. [38]. 6) MEAN: this model computes the embeddings of entities with a GNN-based neighborhood aggregation scheme, training only for seen-to-seen link prediction in the hope that the model generalizes to seen-to-unseen cases [17].
7) LAN: this model extends MEAN to consider relation- and neighbor-level information using attention mechanisms [49]. 8) GMatching: this model tackles link prediction for unseen relations of seen entities by searching for the closest entity pair; we extend it within our meta-learning framework so that it can handle unseen entities [56]. 9) I-GEN: the inductive version of GEN, which is meta-learned to embed an unseen entity into the embedding space and infer hidden links between seen and unseen entities. 10) T-GEN: the transductive version of GEN, with additional stochastic transductive GNN layers on top of I-GEN, which is meta-learned to predict the links not only between seen and unseen entities but also between unseen entities.

Drug-to-Drug Interaction. We describe the baseline models and our Graph Extrapolation Networks for few-shot out-of-graph relation prediction on the drug-to-drug interaction (DDI) task. 1) MLP: the feed-forward neural network used in DeepDDI [36], which classifies the relation of two drugs from their pairwise features. 2) MPNN: the GNN-based model that incorporates relation-type features via edge-conditioned convolution operations [15]. 3) R-GCN: the same model used for entity prediction on the KG completion task, applied to the DDI task. 4) I-GEN: the inductive GEN, which uses only the feature representation of the entity e_k, instead of the concatenated representation of the relation-entity pair (r_k, e_k), when aggregating neighborhood information. 5) T-GEN: the transductive GEN, with additional stochastic transductive layers for unseen-to-unseen relation prediction.

For every dataset, we set the embedding dimension of entities and relations to 100, and initialize the embeddings of unseen entities to the zero vector. Furthermore, since we consider highly multi-relational graphs, we apply the basis decomposition of Schlichtkrull et al. [38] to the weight matrices W_r and W'_r, to prevent an excessive increase in model size:

W_r = Σ_{b=1}^{B} a_{rb} V_b

where B is the number of bases, a_{rb} is a relation-specific coefficient for each r ∈ R, and V_b ∈ R^{d×2d} is a basis matrix shared across relations. For all experiments, we use the PyTorch [35] and PyTorch Geometric [11] frameworks on a single Titan XP or a single GeForce RTX 2080 Ti GPU, and optimize the proposed GENs using Adam [20].
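The basis decomposition above can be implemented compactly; here is a minimal sketch (class and variable names are our own):

```python
import torch
import torch.nn as nn

class BasisRelationWeights(nn.Module):
    """Basis decomposition W_r = sum_b a_{rb} V_b [38]: every relation-specific
    transformation is a learned mixture of B shared basis matrices, so the
    parameter count grows with B rather than with the number of relations."""
    def __init__(self, num_rels: int, dim: int, num_bases: int):
        super().__init__()
        self.coeff = nn.Parameter(torch.randn(num_rels, num_bases))              # a_{rb}
        self.bases = nn.Parameter(torch.randn(num_bases, dim, 2 * dim) * 0.01)   # V_b

    def forward(self, rel_ids: torch.Tensor) -> torch.Tensor:
        # (K, B) coefficients combined with (B, d, 2d) bases -> (K, d, 2d),
        # i.e., one reconstructed W_r per queried relation
        return torch.einsum('kb,bde->kde', self.coeff[rel_ids], self.bases)
```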
Knowledge Graph Completion. For both I-GEN and T-GEN, we search for the learning rate α over {3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³}, the margin γ over {0.25, 0.5, 1}, and the dropout ratio of every GEN layer over {0.1, 0.2, 0.3}. To select the best model, we use the mean reciprocal rank (MRR) as the evaluation metric. For the FB15k-237 dataset, we set α = 1 × 10⁻³ and γ = 1 with a dropout rate of 0.3. We set the number of basis units to B = 100 for the basis decomposition in each GEN layer, and sample 32 negative triplets for each positive triplet in both I-GEN and T-GEN. In each training episode, we randomly sample 500 unseen entities from the meta-training set. For the NELL-995 dataset, we use the same settings as for FB15k-237, except that we sample 64 negative triplets for each positive triplet. For both datasets, we include inverse relations, as suggested by several recent works on multi-relational graphs [25, 38, 46], so that directed relation information flows in both directions.

Drug-to-Drug Interaction. For both I-GEN and T-GEN, we search for the learning rate α over {5 × 10⁻⁴, 1 × 10⁻³, 5 × 10⁻³} and the dropout ratio of every GEN layer over {0.1, 0.2, 0.3}. As the score function, we use two linear layers with a ReLU activation after the first layer. To select the best model, we use the area under the receiver operating characteristic curve (ROC) as the evaluation metric. For the DeepDDI dataset, we set α = 1 × 10⁻³ with a dropout rate of 0.3, and set the number of basis units to B = 200 for the basis decomposition. In each training episode, we randomly sample 80 unseen entities from the meta-training set. For the BIOSNAP-sub dataset, we set α = 1 × 10⁻³ with a dropout rate of 0.1 for I-GEN and 0.2 for T-GEN, and set the number of basis units to B = 200. In each training episode, we randomly sample 50 unseen entities from the meta-training set. For both datasets, we include inverse relations, as in the knowledge graph completion task, so that directed relation information flows in both directions.

B Meta-learning for Long-tail Tasks

Implementation Details. Many real-world graphs follow a long-tail distribution, where a few entities have many links while the majority have only a few (see Figure 6). For such imbalanced graphs, it is beneficial to transfer knowledge from entities with many links to entities with few links. To this end, we transfer the meta-knowledge of data-rich entities to data-poor entities by simulating data-rich circumstances within the meta-learning framework, motivated by Wang et al. [50]. Specifically, we first meta-train our GENs on many-shot cases (e.g., K = 10), and then gradually decrease the number of shots to the few-shot case (e.g., K = 1 or 3) on a logarithmic scale: K_i = log₂(max-iteration / i) + K, where K_i is the training shot size at iteration i and K is the test shot size. In this way, GENs learn to represent unseen entities using data-rich instances first, and entities in the few-link regime benefit from model parameters that are trained on entities with many links and then fine-tuned on entities with few links.

More Ablation Studies. Since knowledge graphs follow a highly long-tailed distribution (see Figure 6), we provide further experimental results on the transfer strategy for the knowledge graph completion tasks, to demonstrate the effectiveness of the proposed meta-learning scheme on long-tail tasks. Table 6 shows that the transfer strategy outperforms the naive I-GEN and T-GEN on all evaluation metrics, except for two H@1 cases of T-GEN in the 3-shot OOG link prediction setting. We conjecture that the meta-learning scheme is especially effective in the 1-shot case, where data is extremely scarce, rather than in the 3-shot case.

Table 7: Total, seen-to-unseen, and unseen-to-unseen results of 1- and 3-shot OOG link prediction on FB15k-237. * denotes training a model within our meta-learning framework. Bold numbers denote the best results. (The table reports MRR, H@1, H@3, and H@10 under the 1-shot (1-S) and 3-shot (3-S) settings, for the total, seen-to-unseen, and unseen-to-unseen cases, with seen-to-seen baselines starting from TransE [5].)

Effect of Score Function. We also evaluate the proposed GENs on the few-shot OOG link prediction task with another popular score function, TransE [5]. We use the same settings as with the DistMult [57] score function, except that we use TransE for the initial embeddings and the score measurement. Table 7 shows that I-GEN and T-GEN with the TransE score function also outperform all baselines by impressive margins, performing comparably to their DistMult counterparts. These results suggest that our model works regardless of the score function.
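For completeness, here is a small sketch of the shot-size schedule from Section B above. The paper states the schedule as K_i = log₂(max-iteration / i) + K; the flooring to an integer and the function name are our own assumptions.

```python
import math

def shot_size(iteration: int, max_iteration: int, k_test: int) -> int:
    """Shot-size schedule K_i = log2(max-iteration / i) + K: start meta-training
    with many-shot tasks and decay logarithmically toward the K-shot test setting.
    (Flooring to an integer is an assumption, to keep the shot count integral.)"""
    return int(math.floor(math.log2(max_iteration / iteration))) + k_test

# e.g., with max_iteration = 10000 and K = 1:
# shot_size(1, 10000, 1) -> 14, shot_size(5000, 10000, 1) -> 2, shot_size(10000, 10000, 1) -> 1
```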
We further demonstrate the effectiveness of meta-training by randomly initializing the In-Graph, in which case GEN must extrapolate knowledge for an unseen entity without pre-trained entity and relation embeddings. Table 7 shows that, while the results with random initialization are lower than those of the pre-trained models, GENs remain strong on unseen entities compared to the baselines. These results suggest that GENs trained under the meta-learning framework can be applied to more difficult situations, since a pre-trained In-Graph might not be available for few-shot OOG link prediction in real-world scenarios. Table 8 shows concrete examples of OOG link prediction results on the NELL-995 dataset, where rows 7 to 9 show that our T-GEN correctly performs link prediction between two unseen entities.

In this section, we describe in detail the task-level transductive inference and meta-level inductive inference of the proposed transductive GEN (T-GEN) model. Since the transductive GEN must predict links between two unseen test entities, which is impossible to handle with conventional link prediction approaches, the problem is indeed transductive. At the same time, the inference of unseen-to-unseen links can also be considered inductive at the meta-level, where we inductively learn the parameters of GEN across a batch of tasks. Thus, we tackle transductive inference problems by treating them as meta-level inductive problems, while the intrinsic unseen-to-unseen link prediction remains transductive. To illustrate concretely, different sets of unseen entities yield mutually inconsistent predictions, which is a consequence of transduction. Other transductive meta-learning approaches, such as TPN [23] and EGNN [19], tackle their problems with a similar high-level idea, classifying unseen classes by leveraging information from both labeled and unlabeled nodes.

References

[1] Out-of-sample representation learning for multi-relational graphs
[2] Similarity network fusion for aggregating data types on a genomic scale
[3] Freebase: a collaboratively created graph database for structuring human knowledge
[4] Learning structured embeddings of knowledge bases
[5] Translating embeddings for modeling multi-relational data
[6] Meta-graph: Few shot link prediction via meta learning
[7] Meta relational learning for few-shot link prediction in knowledge graphs
[8] Efficacy of hydroxychloroquine in patients with covid-19: results of a randomized clinical trial. medRxiv
[9] Convolutional neural networks on graphs with fast localized spectral filtering
[10] Convolutional 2d knowledge graph embeddings
[11] Fast graph representation learning with PyTorch Geometric
[12] Model-agnostic meta-learning for fast adaptation of deep networks
[13] Protein interface prediction using graph convolutional networks
[14] Dropout as a bayesian approximation: Representing model uncertainty in deep learning
[15] Neural message passing for quantum chemistry
[16] A new model for learning in graph domains
[17] Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach
[18] Inductive representation learning on large graphs
[19] Edge-labeling graph neural network for few-shot learning
[20] Adam: A method for stochastic optimization
[21] Variational graph auto-encoders
[22] Semi-supervised classification with graph convolutional networks
[23] Learning to propagate labels: Transductive propagation network for few-shot learning
[24] GENN: predicting correlated drug-drug interactions with graph energy neural networks
[25] Encoding sentences with graph convolutional networks for semantic role labeling
[26] BioSNAP Datasets: Stanford biomedical network dataset collection
[27] Wordnet: A lexical database for english
[28] Distant supervision for relation extraction with an incomplete knowledge base
[29] Neverending learning
[30] Learning attention-based embeddings for relation prediction in knowledge graphs
[31] A novel embedding model for knowledge base completion based on convolutional neural network
[32] On first-order meta-learning algorithms
[33] A three-way model for collective learning on multi-relational data
[34] Label propagation prediction of drug-drug interaction
[35] Pytorch: An imperative style, high-performance deep learning library
[36] Deep learning improves prediction of drug-drug and drug-food interactions
[37] The graph neural network model
[38] Modeling relational data with graph convolutional networks
[39] End-to-end structure-aware convolutional networks for knowledge base completion
[40] Open-world knowledge graph completion
[41] Prototypical networks for few-shot learning
[42] Learning structured output representation using deep conditional generative models
[43] Rotate: Knowledge graph embedding by relational rotation in complex space
[44] Representing text for joint embedding of text and knowledge bases
[45] Complex embeddings for simple link prediction
[46] Composition-based multi-relational graph convolutional networks
[47] Graph attention networks
[48] Matching networks for one shot learning
[49] Logic attention based neighborhood aggregation for inductive knowledge graph embedding
[50] Learning to model the tail
[51] Knowledge graph embedding by translating on hyperplanes
[52] Tackling long-tailed relations and uncommon entities in knowledge graph completion
[53] Saliva is more sensitive for sars-cov-2 detection in covid-19 patients than nasopharyngeal swabs. medRxiv
[54] Representation learning of knowledge graphs with entity descriptions
[55] Deeppath: A reinforcement learning method for knowledge graph reasoning
[56] One-shot relational learning for knowledge graphs
[57] Embedding entities and relations for learning and inference in knowledge bases
[58] Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data
[59] Metagnn: On few-shot node classification in graph meta-learning
[60] Modeling polypharmacy side effects with graph convolutional networks