title: Improving Knowledge Graph Embedding Using Locally and Globally Attentive Relation Paths
authors: Jia, Ningning; Cheng, Xiang; Su, Sen
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_2

Abstract. The incompleteness of knowledge graphs has motivated many researchers to propose methods that automatically infer missing facts. Knowledge graph embedding has been an active research area for knowledge graph completion, with great improvement from the early TransE to the current state-of-the-art ConvKB. ConvKB considers a knowledge graph as a set of triples, and employs a convolutional neural network to capture global relationships and transitional characteristics between entities and relations in the knowledge graph. However, it only utilizes the triple information, and ignores the rich information contained in relation paths. In fact, a path of one relation describes the relation from some aspect in a fine-grained way. It is therefore beneficial to take relation paths into consideration for knowledge graph embedding. In this paper, we present a novel convolutional neural network-based embedding model, PConvKB, which improves knowledge graph embedding by incorporating relation paths locally and globally. Specifically, we introduce an attention mechanism to measure the local importance of relation paths. Moreover, we propose a simple yet effective measure, DIPF, to compute the global importance of relation paths. Experimental results show that our model achieves substantial improvements over state-of-the-art methods.

Large-scale knowledge graphs such as Freebase [3], DBpedia [1], and Wikidata [38] store real-world facts in the form of triples (head, relation, tail), abbreviated as (h, r, t), where head and tail are entities and relation represents the relationship between them. They are important resources for many intelligent applications such as question answering and web search. Although current knowledge graphs consist of billions of triples, they are still far from complete and miss crucial facts; e.g., 75% of the person entities in Freebase have no known nationality [8], which hampers their usefulness in the aforementioned applications. Various methods have been proposed to address this problem, among which knowledge graph embedding methods have attracted increasing attention in recent years. The main idea of knowledge graph embedding is to embed the entities and relations of a knowledge graph into a continuous vector space, and to predict missing facts by manipulating the entity and relation embeddings involved.

Among knowledge graph embedding methods, the translation-based models are simple and efficient, and also perform well. For example, given a triple (h, r, t), the most well-known translation-based model TransE [5] models the relation r as a translation vector r connecting the embeddings h and t of the two entities, i.e., h + r ≈ t. It performs well on simple relations, i.e., 1-to-1 relations, but poorly on complicated relations, i.e., 1-to-N, N-to-1 and N-to-N relations. To address this issue, TransH [41], TransR [20] and TransD [14] have been proposed. Unfortunately, these models are less simple and efficient than TransE. Nickel et al. [26] present HolE, which uses circular correlation to combine the expressive power of the tensor product with the simplicity and efficiency of TransE.
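To make the translation intuition h + r ≈ t concrete, the following is a minimal NumPy sketch of TransE-style scoring, not code from the paper; the function name, the 100-dimensional toy embeddings, and the choice of norm are illustrative assumptions.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    # Implausibility score ||h + r - t||: lower means more plausible.
    # Both the L1 (norm=1) and L2 (norm=2) variants appear in the literature.
    return np.linalg.norm(h + r - t, ord=norm)

# Toy usage with random 100-dimensional embeddings.
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=100) for _ in range(3))
print(transe_score(h, r, t))
```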
Recently, several convolutional neural network (CNN)-based models [7, 22, 23] have been proposed to learn the embeddings of entities and relations in knowledge graphs. Among them, ConvKB [22] preserves the transitional characteristic of the translation-based models, is comparably simple and efficient, and achieves state-of-the-art performance. However, it only focuses on knowledge triples, ignoring the rich knowledge contained in relation paths. In fact, a path of one entity pair describes the relation connecting the entity pair from some aspect in a fine-grained way, and the importance of each path differs. For example, in Fig. 1, the two paths place of birth → country and friend → nationality of the entity pair (Tom Cruise, America) describe the relation nationality from the location aspect and the social aspect, respectively. Since the path place of birth → country is more essential than friend → nationality for expressing the relation nationality, it is more important from the local view. Moreover, from the global view, the path friend → nationality also occurs in the entity pair (Tom Cruise, England), which is connected by the relation travel, so it is less important than the path place of birth → country for expressing the relation nationality.

Fig. 1. An illustration that a path of one relation describes the relation from some aspect in a fine-grained way, and that the importance of each path differs.

In this paper, we present a path-augmented CNN-based model, which incorporates relation paths for knowledge graph embedding. Specifically, we first introduce an attention mechanism to automatically measure the local importance of each path for the given entity pair; then, inspired by inverse document frequency, we propose degree-guided inverse path frequency to compute the global importance of each path. Finally, we improve knowledge graph embedding by incorporating locally and globally attentive relation paths. Our contributions in this paper are summarized as follows:

- We present a path-augmented CNN-based knowledge graph embedding model, which improves the embedding model by incorporating relation paths locally and globally.
- We introduce an attention mechanism to model the local importance of relation paths for knowledge graph embedding.
- We propose a simple yet effective measure, degree-guided inverse path frequency, to compute the global importance of relation paths for knowledge graph embedding.
- In addition, we apply three pooling operations to aggregate convolutional feature maps, which greatly reduces the number of parameters.
- Experimental results on four benchmark datasets show that our model achieves state-of-the-art performance.

A knowledge graph G is a collection of valid factual triples (h, r, t), where h, t ∈ E and r ∈ R; here, E is the entity set and R is the relation set. In knowledge graph completion, embedding methods aim to define a score function f that gives an implausibility score for each triple (h, r, t) such that valid triples receive lower scores than invalid triples.

In this section, we briefly describe the state-of-the-art CNN-based model ConvKB, which we choose as the base of our model. For each triple (h, r, t), ConvKB denotes the dimensionality of embeddings by k, so that each embedding triple (v_h, v_r, v_t) can be viewed as a matrix A = [v_h, v_r, v_t] ∈ R^{k×3}. A filter ω ∈ R^{1×3} is operated on each row of A to produce a feature map v = [v_1, v_2, ..., v_k] ∈ R^k, in which

v_i = g(ω · A_{i,:} + b),

where · denotes a dot product, A_{i,:} is the i-th row of A, b is a bias term, and g is the non-linear activation function ReLU. In particular, if ω = [1, 1, −1], b = 0, and g(x) = |x| or g(x) = x^2, ConvKB reduces to the plain TransE.
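The convolution just described is simple enough to spell out. Below is a minimal NumPy sketch of ConvKB-style scoring, assuming filters of shape (n, 3) and a learned weight vector w of length n·k; the function name and the filter-wise concatenation order are illustrative assumptions, not the authors' code.

```python
import numpy as np

def convkb_score(v_h, v_r, v_t, filters, w, b=0.0):
    # A = [v_h, v_r, v_t] has shape (k, 3); each 1x3 filter omega is applied
    # to every row of A, giving one k-dimensional feature map per filter.
    A = np.stack([v_h, v_r, v_t], axis=1)
    feature_maps = np.maximum(A @ filters.T + b, 0.0)  # ReLU(omega . A[i,:] + b)
    v = feature_maps.T.reshape(-1)  # concatenate the n feature maps into one vector
    return float(v @ w)             # dot product with the shared weight vector

# Sanity check of the reduction noted above: with omega = [1, 1, -1], b = 0,
# and g = |x| in place of ReLU, each feature map entry would be
# |h_i + r_i - t_i|, i.e., the plain TransE score per dimension.
```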
Hence, from a certain point of view, ConvKB is an extension of TransE that models triples more globally and comprehensively. The overview of ConvKB is shown in Fig. 2. Let Ω and n denote the set of filters and the number of filters, respectively. ConvKB uses n filters to generate n feature maps. These feature maps are concatenated into a single vector, which is then combined with a weight vector w ∈ R^{nk×1} via a dot product to give an implausibility score for the triple (h, r, t). Formally, the score function of ConvKB is defined as follows:

f(h, r, t) = concat(g([v_h, v_r, v_t] ∗ Ω)) · w,

where Ω and w are shared parameters, independent of h, r and t, ∗ denotes the convolution operator, and concat denotes the concatenation operator. It is obvious that ConvKB only learns from triples, ignoring the rich knowledge contained in relation paths, which can lead to poor performance.

In this section, we present our model PConvKB, which learns the embeddings by taking relation paths into consideration. Moreover, we also take into account the local and global importance of the relation paths. The architecture of our model is shown in Fig. 3. We denote the relation paths between the head entity h and the tail entity t as P(h, t) = {p_1, p_2, ..., p_N}, where a relation path p = (r_1, ..., r_m) is a series of interconnected relations between the entities, i.e., h −r_1→ ... −r_m→ t. Similar to ConvKB, for each triple (h, r, t), the score function of our model PConvKB is defined as follows:

f(h, r, t) = ψ(σ([v_h, v_r + Σ_{i=1}^{N} Φ_{G_i} Φ_{L_i} p_i, v_t] ∗ Ω)) · w,

where σ denotes the non-linear function, i.e., sigmoid, ψ denotes the average pooling operation, Φ_{G_i} denotes the global importance of the i-th path, Φ_{L_i} denotes the local importance of the i-th path, p_i is the embedding of the i-th path, computed as p_i = Σ_{j=1}^{m} v_{r_j}, and Ω and w are shared parameters. The computation of the local and global importance is detailed in Sects. 3.2 and 3.3, respectively.

The attention mechanism [2] was designed to improve the performance of the encoder-decoder model on machine translation; it assigns different weights to different data, allowing the model to focus on important data. In recent years, the attention mechanism has been widely used in several research topics, such as question answering [18] and image captioning [40]. In this paper, we apply the attention mechanism to measure the local importance of relation paths for knowledge graph embedding. Given a triple (h, r, t) and its set of relation paths P(h, t) = {p_1, p_2, ..., p_N}, we compute the local importance of each path as

Φ_{L_i} = exp(v_r^T W_L p_i) / Σ_{j=1}^{N} exp(v_r^T W_L p_j),

where W_L ∈ R^{k×k} is the parameter matrix. Similar to [19], we set the maximum length of each path to 3.
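As a concrete reading of the attention step, here is a short sketch that scores each path against the direct relation with the bilinear form v_r^T W_L p_i and normalizes over the path set with a softmax; the helper names and the softmax normalization are assumptions consistent with the formula above, not a verbatim reimplementation.

```python
import numpy as np

def path_embedding(relation_embeddings):
    # p = v_{r_1} + ... + v_{r_m} for a path (r_1, ..., r_m) of length <= 3.
    return np.sum(relation_embeddings, axis=0)

def local_path_attention(v_r, paths, W_L):
    # Bilinear score of each path embedding against the direct relation r,
    # normalized over the path set P(h, t) with a softmax.
    scores = np.array([v_r @ W_L @ p for p in paths])
    scores -= scores.max()          # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()  # one local importance weight per path
```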
The attention mechanism only focuses on the set of relation paths P(h, t) of the given entity pair (h, t) connected by the relation r; it does not consider that a path in this set may also occur in other entity pairs connected by other relations. Typically, the more sets of relation paths a path occurs in, the less important the path is. Therefore, inspired by inverse document frequency [10, 16], a weighting function that has been widely used to measure how informative each word is in a set of documents, we propose the Degree-guided Inverse Path Frequency (DIPF) to model the global importance of each path in the set of relation paths.

For each relation r ∈ R in the knowledge graph G, we first find its corresponding entity pairs (h_r, t_r)_i, i = 1, 2, ..., n_r, where n_r is the number of entity pairs connected by the relation r. Then, we choose the entity pair (h_r, t_r)_b that has the biggest node degree, computed as

(h_r, t_r)_b = max[NodeDegree((h_r, t_r)_i)], i = 1, 2, ..., n_r,

where NodeDegree(·) is the function to compute the node degree of an entity pair, max[·] is the maximum function, and deg(·) is the node degree of an entity, computed as the number of edges connected with the entity. Next, we count the set of relation paths of the entity pair (h_r, t_r)_b, denoted as P((h_r, t_r)_b) = {p_1, p_2, ..., p_{m_r}}, where m_r is the number of paths of the entity pair (h_r, t_r)_b. As in the local importance computation, we set the maximum length of each path to 3. Finally, the global importance of each path in the set of relation paths P(h, t) = {p_1, p_2, ..., p_N} of the given triple (h, r, t) is computed as

Φ_{G_i} = log(|R| / pt_i),

where |R| is the cardinality of R (i.e., the total number of relations in R), and pt_i is the number of times the path p_i occurs in the set {P((h_r, t_r)_b), r ∈ R}.
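A small sketch of DIPF as described: for each relation we take the path set of its highest-degree entity pair, count how often a query path appears across these representative sets, and apply the classic IDF form. The deg(h) + deg(t) pair degree and the guard against unseen paths are assumptions consistent with the text; paths are represented as tuples of relation names.

```python
import math
from collections import Counter

def node_degree(deg, pair):
    # Node degree of an entity pair, assumed here to be deg(h) + deg(t).
    h, t = pair
    return deg[h] + deg[t]

def dipf_weights(query_paths, representative_path_sets):
    # representative_path_sets: one path set P((h_r, t_r)_b) per relation r in R,
    # each counted on the entity pair with the biggest node degree.
    num_relations = len(representative_path_sets)
    counts = Counter()
    for path_set in representative_path_sets:
        counts.update(set(path_set))  # count each path at most once per set
    # Global importance log(|R| / pt_i); max(..., 1) guards unseen paths.
    return [math.log(num_relations / max(counts[p], 1)) for p in query_paths]
```

For instance, a path that appears in the representative path sets of almost every relation receives a weight near log(1) = 0, matching the intuition that widely shared paths carry little relation-specific signal.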
As mentioned in Sect. 2.1, ConvKB uses the concatenation operation to aggregate feature maps. However, previous works [30, 35] demonstrate that pooling operations can aggregate feature maps better than simple concatenation, and greatly reduce the number of parameters. In this paper, we adopt three pooling operations to replace the concatenation operation; the average pooling operation is finally chosen due to its superior performance in the experiments.

The objective is to ensure that a triple in the golden set G has a lower implausibility score than a triple in the corrupted triple set G′. Similar to [22], we adopt the Adam optimizer [17] to train PConvKB, and minimize the loss function with L_2 regularization on the weight vector w as follows:

L = Σ_{(h,r,t) ∈ G ∪ G′} log(1 + exp(l_{(h,r,t)} · f(h, r, t))) + (λ/2)‖w‖_2^2,

where l_{(h,r,t)} = 1 for (h, r, t) ∈ G and l_{(h,r,t)} = −1 for (h, r, t) ∈ G′.

For a fair comparison, we evaluate our model on two tasks: link prediction [5] and triple classification [33]. Both of them evaluate the accuracy of predicting unseen triples, from different viewpoints. We evaluate our model on four benchmark datasets: WN18 [5], FB15k [5], WN18RR [7] and FB15k-237 [36]. WN18 is extracted from WordNet [21], which contains word concepts and lexical relations between the concepts. FB15k is a subset of Freebase constructed by Bordes et al. [5]. As noted by Toutanova and Chen [36], WN18 and FB15k contain problematic reversible triples causing abnormally high results. This is the reason that the refined versions of WN18 and FB15k, i.e., WN18RR and FB15k-237, are widely used in state-of-the-art methods. Table 1 shows the statistics of the datasets used in our experiments.

To demonstrate the effectiveness of our model, we compare PConvKB against a variety of knowledge graph embedding methods developed in recent years.

- TransE [5] is one of the most widely used knowledge graph embedding methods.
- TransH [41] associates each relation with a relation-specific hyperplane to alleviate the complex relations problem.
- TransD [14] considers not only complex relations but also the diversity of entities, by embedding entities and relations into a separate entity space and relation-specific spaces.
- HolE [26] uses circular correlation, a novel compositional operator, to capture rich interactions of embeddings.
- ConvE [7] is the first CNN-based model for knowledge graph embedding.
- ConvKB [22] improves ConvE by taking the transitional characteristic (i.e., one of the most useful intuitions for knowledge graph completion) into consideration.
- CapsE [23] combines a convolutional neural network with a capsule network [29] for knowledge graph embedding.

The link prediction task is to complete a triple (h, r, t) with h or t missing, i.e., to predict the missing h given (r, t) or the missing t given (h, r).

Evaluation Protocol. To evaluate the performance on link prediction, we follow the standard protocol used in [5]. For each test triple (h, r, t), we replace either h or t by each of the other entities in E to create a set of corrupted triples, and calculate implausibility scores on the corrupted triples. Ranking these scores in ascending order, we obtain the rank of the test triple. Since a corrupted triple may exist in the train, validation or test set, we use the Filtered setting protocol [5] to eliminate this misleading effect, i.e., we do not take into account any corrupted triples that appear in the knowledge graph. We employ two common evaluation metrics: mean rank (MR) and Hits@10. MR is the mean of the test triples' ranks. Hits@10 is the percentage of test triples that are ranked within the top 10. Following previous work [41], we use the common Bernoulli trick to generate the head or tail entities when sampling invalid triples.
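To make the Filtered ranking protocol above concrete, here is a minimal sketch of rank computation and the two metrics; `score_fn` and the other names are illustrative assumptions, and real implementations vectorize this loop over all entities at once.

```python
import numpy as np

def filtered_rank(score_fn, test_triple, entities, known_triples, corrupt="tail"):
    # Rank of the test triple among its corruptions, skipping any corrupted
    # triple that already appears in the train/validation/test sets.
    h, r, t = test_triple
    true_score = score_fn(h, r, t)
    rank = 1
    for e in entities:
        cand = (h, r, e) if corrupt == "tail" else (e, r, t)
        if cand == test_triple or cand in known_triples:
            continue
        if score_fn(*cand) < true_score:  # lower score = more plausible
            rank += 1
    return rank

def mr_and_hits_at_10(ranks):
    # Mean rank and Hits@10 (in %) over all test-triple ranks.
    ranks = np.asarray(ranks, dtype=float)
    return ranks.mean(), 100.0 * (ranks <= 10).mean()
```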
Implementation Details. Like ConvKB [22], we use entity and relation embeddings produced by TransE to initialize the entity and relation embeddings in PConvKB. We use the pre-trained 100-dimensional GloVe word embeddings [28] to train the TransE model, and employ the TransE implementation provided by [25]. We select the learning rate in {5e−6, 1e−5, 5e−5, 1e−4} and the number of filters in {50, 100, 200, 400}. We fix the batch size at 128 and set the L_2-regularizer λ at 0.001 in our objective function. We run PConvKB for up to 150 epochs and monitor the Hits@10 score after every 10 training epochs to choose the optimal hyper-parameters. We obtain the highest Hits@10 scores on the validation set with a learning rate of 5e−5 and 400 filters on WN18; a learning rate of 1e−5 and 50 filters on FB15k; a learning rate of 5e−6 and 400 filters on WN18RR; and a learning rate of 1e−5 and 200 filters on FB15k-237. For the comparison methods, we use the codes released by [11], [7] and [22].

Table 2. Experimental results on link prediction. Hits@10 is reported in %. The best score is in bold, and the second best score is underlined. For the comparison methods, the values in black are the results listed in the original publications, except that ConvKB uses the version implemented by [23], which has been reported to perform significantly better than the original one. The values in blue are obtained with implementations from the OpenKE repository.

Results. Table 2 shows the link prediction results of our model and the comparison methods on the four benchmark datasets. From the results, we can observe that:
1. PConvKB obtains the best MR and the highest Hits@10 scores on the four benchmark datasets, demonstrating the effectiveness of incorporating relation paths for knowledge graph embedding.
2. Among PConvKB, PConvKB (local) and PConvKB (global), PConvKB obtains the best performance, which indicates that considering relation paths both locally and globally is beneficial for knowledge graph embedding.
3. PConvKB does better than the closely related model ConvKB on all experimental datasets, especially on FB15k, where PConvKB gains a significant improvement of 275 − 247 = 28 in MR (about 10.1% relative improvement) and an absolute improvement of 59.8% − 54.7% = 5.1% in Hits@10.

The triple classification task is to determine whether a given triple (h, r, t) is correct or not, i.e., binary classification on a triple.

Evaluation Protocol. We follow the same protocol as in [33]. For each triple in the test set and the validation set, we construct one negative triple by switching entities from the test triples and validation triples, respectively. The triple classification decision rule is: for a triple (h, r, t), if its implausibility score is below the relation-specific threshold σ_r, predict positive; otherwise, predict negative. The relation-specific threshold σ_r is determined by maximizing the classification accuracy on the validation set. The triple classification accuracy is the percentage of triples in the test set that are classified correctly.
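The relation-specific threshold can be found by a simple sweep over candidate values on the validation scores, as in this sketch; the exhaustive scan over the observed scores is an assumption about the search procedure, which the paper does not spell out.

```python
def pick_threshold(pos_scores, neg_scores):
    # Choose sigma_r maximizing validation accuracy for one relation r:
    # predict positive iff implausibility score < sigma_r.
    best_acc, best_sigma = -1.0, None
    for sigma in sorted(pos_scores + neg_scores) + [float("inf")]:
        correct = sum(s < sigma for s in pos_scores) \
                + sum(s >= sigma for s in neg_scores)
        acc = correct / (len(pos_scores) + len(neg_scores))
        if acc > best_acc:
            best_acc, best_sigma = acc, sigma
    return best_sigma
```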
Implementation Details. We use TransE to initialize the entity and relation embeddings in PConvKB, and select the learning rate in {5e−6, 1e−5, 5e−5, 1e−4} and the number of filters in {50, 100, 200, 400}. We set the batch size at 128 and the L_2-regularizer λ at 0.001 in our objective function. We run PConvKB for up to 150 epochs and monitor the accuracy after every 10 training epochs to choose the optimal hyper-parameters. We obtain the highest accuracy on the validation set with a learning rate of 5e−5 and 400 filters on WN18; a learning rate of 1e−5 and 50 filters on FB15k; a learning rate of 5e−6 and 400 filters on WN18RR; and a learning rate of 1e−5 and 200 filters on FB15k-237. For the comparison methods, we use the codes released by [11], [7] and [22].

Results. Table 3 shows the triple classification results of our model and the comparison methods on the four benchmark datasets.

Various methods have been proposed for knowledge graph embedding, such as general linear-based models [6], bilinear-based models [13, 27, 34], translation-based models [5, 9, 14, 15, 20, 41, 43], and neural network-based models [4, 7, 22, 23, 31-33]. We refer to [24, 39] for a recent survey. In this section, we focus on the most relevant neural network-based models, and briefly review the other related methods. Socher et al. [33] introduce neural tensor networks for knowledge graph embedding, which allow mediated interaction of entity embeddings via a tensor. Schlichtkrull et al. [31] present relational graph convolutional networks for knowledge graph completion. Shi and Weninger [32] present a shared variable neural network model called ProjE, which fills in missing facts in a knowledge graph by learning joint embeddings of entities and relations. Dettmers et al. [7] present a multi-layer convolutional network model, ConvE, which uses 2D convolutions over embeddings to predict missing links in knowledge graphs. Nguyen et al. [22] present a CNN-based embedding model, ConvKB, which applies a CNN to explore the global relationships among same-dimensional entries in each embedding triple, generalizing the transitional characteristics of the translation-based embedding models. Nguyen et al. [23] present CapsE, which combines a CNN with capsule networks [29] for knowledge graph embedding. All these models treat a knowledge graph as a collection of triples, and disregard the rich information that exists in relation paths. There are several translation-based models [12, 19, 37, 42, 44] that incorporate relation paths to improve the embeddings of entities and relations. However, they fully rely on hand-designed features to measure the importance of each path, which are not differentiable and cannot be adjusted during training. Moreover, they are all based on translation-based models, and are thus not suitable for CNN-based models. To the best of our knowledge, our model PConvKB is the first attempt to incorporate relation paths into a CNN-based embedding model.

In this paper, we present a novel CNN-based embedding model, PConvKB, which improves knowledge graph embedding by incorporating relation paths locally and globally. In particular, we introduce an attention mechanism to measure the local importance of relation paths. Moreover, we propose a simple yet effective measure, DIPF, to compute the global importance of relation paths. We evaluate our model on link prediction and triple classification. Experimental results show that our model achieves substantial improvements over state-of-the-art methods.

References
[1] DBpedia: a nucleus for a web of open data
[2] Neural machine translation by jointly learning to align and translate
[3] Freebase: a collaboratively created graph database for structuring human knowledge
[4] A semantic matching energy function for learning with multi-relational data: application to word-sense disambiguation
[5] Translating embeddings for modeling multi-relational data
[6] Learning structured embeddings of knowledge bases
[7] Convolutional 2D knowledge graph embeddings
[8] Knowledge vault: a web-scale approach to probabilistic knowledge fusion
[9] TorusE: knowledge graph embedding on a Lie group
[10] Class specific TF-IDF boosting for short-text classification: application to short-texts generated during disasters
[11] OpenKE: an open toolkit for knowledge embedding
[12] Improved knowledge base completion by the path-augmented TransR model
[13] A latent factor model for highly multi-relational data
[14] Knowledge graph embedding via dynamic mapping matrix
[15] Knowledge graph completion with adaptive sparse transfer matrix
[16] Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec
[17] Adam: a method for stochastic optimization
[18] Beyond RNNs: positional self-attention with co-attention for video question answering
[19] Modeling relation paths for representation learning of knowledge bases
[20] Learning entity and relation embeddings for knowledge graph completion
[21] WordNet: a lexical database for English
[22] A novel embedding model for knowledge base completion based on convolutional neural network
[23] A capsule network-based embedding model for knowledge graph completion and search personalization
[24] An overview of embedding models of entities and relationships for knowledge base completion
[25] STransE: a novel embedding model of entities and relationships in knowledge bases
[26] Holographic embeddings of knowledge graphs
[27] A three-way model for collective learning on multi-relational data
[28] GloVe: global vectors for word representation
[29] Dynamic routing between capsules
[30] Detail-preserving pooling in deep networks
[31] Modeling relational data with graph convolutional networks
[32] ProjE: embedding projection for knowledge graph completion
[33] Reasoning with neural tensor networks for knowledge base completion
[34] Modelling relational data using Bayesian clustered tensor factorization
[35] Hybrid pooling for enhancement of generalization ability in deep convolutional neural networks
[36] Observed versus latent features for knowledge base and text inference
[37] Compositional learning of embeddings for relation paths in knowledge base and text
[38] Wikidata: a free collaborative knowledge base
[39] Knowledge graph embedding: a survey of approaches and applications
[40] Hierarchical attention network for image captioning
[41] Knowledge graph embedding by translating on hyperplanes
[42] Knowledge graph embedding via relation paths and dynamic mapping matrix
[43] TransGate: knowledge graph embedding with shared gate structure
[44] Discriminative path-based knowledge graph embedding for precise link prediction

Acknowledgments. We thank the anonymous reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China.