title: Probabilistic Entity Representation Model for Reasoning over Knowledge Graphs
authors: Choudhary, Nurendra; Rao, Nikhil; Katariya, Sumeet; Subbian, Karthik; Reddy, Chandan K.
date: 2021-10-26

Logical reasoning over Knowledge Graphs (KGs) is a fundamental technique that can provide an efficient querying mechanism over large and incomplete databases. Current approaches employ spatial geometries such as boxes to learn query representations that encompass the answer entities and model the logical operations of projection and intersection. However, their geometry is restrictive and leads to non-smooth, strict boundaries, which in turn result in ambiguous answer entities. Furthermore, previous works propose transformation tricks to handle unions, but the resulting representations are not closed under the logical operations and thus cannot be chained in a stream. In this paper, we propose a Probabilistic Entity Representation Model (PERM) that encodes each entity as a multivariate Gaussian density whose mean and covariance parameters capture its semantic position and smooth decision boundary, respectively. Additionally, we define the closed logical operations of projection, intersection, and union, which can be aggregated using an end-to-end objective function. On the logical query reasoning problem, we demonstrate that the proposed PERM significantly outperforms the state-of-the-art methods on various public benchmark KG datasets on standard evaluation metrics. We also evaluate PERM on a COVID-19 drug-repurposing case study and show that our proposed model recommends drugs with substantially better F1 than current methods. Finally, we demonstrate PERM's query answering process through a low-dimensional visualization of the Gaussian representations.

Knowledge Graphs (KGs) are structured heterogeneous graphs in which information is organized as triplets consisting of an entity pair and the relation between them. This organization provides a fluid schema with applications in several domains including e-commerce [1], web ontologies [2, 3], and medical research [4, 5]. Chain reasoning is a fundamental problem in KGs, which involves answering a chain of first-order existential (FOE) queries (translation, intersection, and union) using the KGs' relation paths. A myriad of queries can be answered using such a logical formulation (some examples are given in Figure 1). Current approaches [6, 7, 8] in the field rely on mapping the entities and relations onto a latent representational space such that FOE queries reduce to mathematical operations in that space, from which the relevant answer entities can then be retrieved. Euclidean vectors [6, 9] provide a convenient mechanism to encode the semantic position of entities by leveraging their neighborhood relations; answer entities are then retrieved using a fixed threshold over the vector space (such as a k-nearest-neighbor search). However, queries differ in their breadth: certain queries lead to a larger set of answers than others, e.g., the query Canadians will return far more answers than the query Canadian Turing Award winners. To capture this query behavior, spatial embeddings [7, 8, 10, 11] model queries by controlling the volume of space enclosed by the query representations.
However, these spatial embeddings rely on more complex geometries such as boxes [7], which do not have a closed form solution for the union operation, e.g., the union of two boxes is not a box. Thus, further FOE operations cannot be applied to the result of a union. Additionally, their strict borders lead to ambiguity in border cases and a non-smooth distance function, e.g., a point on the border will have a much smaller distance if it is considered to be inside the box than if it is considered to be outside. This challenge also applies to other geometric enclosures such as hyperboloids [8]. Another line of work uses structured geometric regions [12, 7] or density functions [13, 14, 11, 15] instead of vector points for representation learning. While these approaches utilize the representations for modeling individual entities and the relations between them, we aim to provide a closed form solution to logical queries over KGs using Gaussian density functions, which enables chaining the queries together. Another crucial difference in our work is in handling a stream of queries. Previous approaches rely on a Disjunctive Normal Form (DNF) transformation, which requires the entire query as input. In our model, every operation is closed in the Gaussian space and, thus, the operations of a large query can be handled individually and aggregated together for the final answers.

Figure 2: Results of the query Europeans ∪ Canadians. Entities in the darker areas have a higher probability of being answers than those in lighter areas. Panel (c) shows the distance of entities from the query space; for comparability, distances are given relative to entity Bengio's distance to the European query. We can observe from (c) that the non-smooth borders of the box geometry do not encompass the answer Hinton.

To alleviate the drawbacks of operations not being closed under unions and of border ambiguities, we propose the Probabilistic Entity Representation Model (PERM). PERM models entities as a mixture of Gaussian densities. Gaussian densities have previously been used in natural language processing [14] and graphs [15] to enable a more expressive parameterization of decision boundaries. In our case, we utilize a mixture of multivariate Gaussian densities due to their intuitive closed form solution for the translation, intersection, and union operations. In addition, they also enable the use of a smooth distance function, the Mahalanobis distance [16]. Figure 2 provides an example of a case where the non-smooth boundaries of box query embeddings are not able to capture certain answers. We utilize the mean (µ) and covariance (Σ) parameters of multivariate Gaussian densities to encode the semantic position and the spatial query area of an entity, respectively. The closed form solution for the operations allows us to solve complex queries by chaining them in a pipeline. PERM does not need to rely on DNF transformations, since all outputs are closed in the Gaussian space and complex queries can be consolidated in an end-to-end objective function; e.g., in Figure 2b, Europeans ∪ Canadians is a Gaussian mixture and the single objective is to minimize the distance between the mixture and the entity Hinton, whereas in the case of boxes (shown in Figure 2a), we have two independent objectives minimizing the distance from each box in the union query.

Summarizing, the contributions of our work are as follows:

1. We develop the Probabilistic Entity Representation Model (PERM), a method to reason over KGs using (mixtures of) Gaussian densities.
Gaussians provide a closed form solution to intersection and union, as well as a smooth distance function. This enables us to process a chain of complex logical queries in an end-to-end objective function.

2. PERM outperforms the current state-of-the-art baselines on logical query reasoning over standard benchmark datasets. Additionally, it also provides better drug recommendations for COVID-19.

3. PERM is also interpretable, since the Gaussian embeddings can be visualized after each query step to understand the complete query representation.

The rest of the paper is organized as follows: Section 2 presents the current work in the field. In Section 3, we present PERM and define its various operations. Section 4 provides the formulation for building the reasoning chains for complex queries. We provide the experimental setup and results in Section 5. We conclude our paper in Section 6 and present its broader impact in Section 7.

The topic of multi-hop chain reasoning over KGs has gained a lot of attention in recent years [17, 18, 19, 6]. These approaches utilize vector spaces to model query representations and retrieve results using a fixed threshold. While such representations are efficient at encoding semantic information, the fixed thresholds that are typically used in these models do not allow for an expressive (adjustable) boundary and, thus, are not best suited for representing queries. Spatial embeddings [7, 8, 20] enhance the simple vector representations by adding a learnable border parameter that controls the spatial area around a query representation. These methods have strict borders and rely on a non-smooth distance function, which creates ambiguity in border cases. In our model, on the other hand, the variance parameter of the query's Gaussian densities creates soft, smoothly increasing borders in terms of the Mahalanobis distance. Additionally, the previous methods do not provide a closed form solution for unions, which we solve using Gaussian mixture models. Density-based embeddings have seen a recent surge of interest in various domains. Word2Gauss [14] provides a method of learning Gaussian densities for words from their distributional semantic information; the authors further apply this idea to knowledge graphs [13]. Another approach [15] aims to learn Gaussian graph representations from network connections. These methods are, however, focused on learning semantic information and do not easily extend to logical queries over knowledge graphs. PERM primarily focuses on learning spatial Gaussian densities for queries, while also capturing the semantic information. To achieve this, we derive closed form solutions to FOE queries.

Knowledge Graphs (KGs) G: E × R are heterogeneous graphs that store entities (E) and relations (R). Each relation r ∈ R is a Boolean function r: E × E → {True, False} that indicates whether the relation r exists between a pair of entities. Without loss of generality, KGs can also be organized as a set of triples {(e_1, r, e_2)} ⊆ G, defined by the Boolean relation function r(e_1, e_2). In this work, we focus on the following three FOE operations: translation (t), intersection (∩), and union (∪). The translation query q_t returns the set V_t = {v ∈ E : r(e, v) = True} of entities reachable from an anchor entity e through relation r; the intersection query q_∩ returns V_∩, the set intersection of the answer sets of its input queries; and the union query q_∪ returns V_∪, the set union of the answer sets of its input queries. Here q_t, q_∩, and q_∪ are the translation, intersection, and union queries, respectively, and V_t, V_∩, and V_∪ are the corresponding result sets [10].
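To make the semantics of these three operations concrete, the following minimal Python sketch evaluates FOE queries directly against a toy set of triples. The entity and relation names are illustrative assumptions and do not come from the paper's datasets; the sketch only shows the set semantics that the learned representations are meant to approximate.

```python
# Toy KG of (head, relation, tail) triples; names are hypothetical.
triples = {
    ("Hinton", "citizen_of", "Canada"),
    ("Bengio", "citizen_of", "Canada"),
    ("Hinton", "won", "TuringAward"),
    ("Bengio", "won", "TuringAward"),
    ("Knuth",  "won", "TuringAward"),
}

def translate(anchors, relation):
    """Translation q_t: tails reachable from any anchor entity via `relation`."""
    return {t for (h, r, t) in triples if h in anchors and r == relation}

def heads_of(targets, relation):
    """Heads whose `relation` edge lands in `targets` (translation along the inverse edge)."""
    return {h for (h, r, t) in triples if t in targets and r == relation}

print(translate({"Hinton"}, "won"))   # {'TuringAward'}

canadians = heads_of({"Canada"}, "citizen_of")
winners = heads_of({"TuringAward"}, "won")
print(canadians & winners)            # intersection q_∩ -> {'Hinton', 'Bengio'}
print(canadians | winners)            # union q_∪ -> {'Hinton', 'Bengio', 'Knuth'}
```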
As we note above, each entity has a dual nature: one as part of a query and another as a candidate answer to a query. In PERM, we model the query space of an entity e_i ∈ E as a multivariate Gaussian density function, e_i = N(µ_i, Σ_i), where the learnable parameters µ_i (mean) and Σ_i (covariance) indicate the semantic position and the surrounding query density of the entity, respectively. As a candidate, we only consider the µ_i of the entity and ignore its Σ_i. We define the distance of a candidate entity v_i = N(µ_i, Σ_i) from a query Gaussian e_j = N(µ_j, Σ_j) using the Mahalanobis distance [16]:

d(v_i, e_j) = sqrt((µ_i − µ_j)^T Σ_j^{-1} (µ_i − µ_j))    (4)

Additionally, we need to define the FOE operations for the proposed Probabilistic Entity Representation Model. A visual interpretation of the operations (translation, intersection, and union) is shown in Figure 3. The operations are defined as follows.

Translation (t). Each entity e ∈ E and relation r ∈ R are encoded as N(µ_e, Σ_e) and N(µ_r, Σ_r), respectively. We define the translation query representation of an entity e with relation r as q_t, and the distance of a resultant entity v_t ∈ V_t from the query as d_{q_t}, the Mahalanobis distance of Eq. (4) computed with respect to q_t:

q_t = N(µ_e + µ_r, (Σ_e^{-1} + Σ_r^{-1})^{-1})    (5)

Intersection (∩). Intuitively, the intersection of two Gaussian densities implies a random variable that belongs to both densities. Given that the entity densities are independent of each other, we define the intersection of two entity density functions e_1, e_2 as q_∩ = N(µ_3, Σ_3), and the distance of a resultant entity v_∩ ∈ V_∩ from the query as d_{q_∩}, again the Mahalanobis distance with respect to q_∩, where

Σ_3^{-1} = Σ_1^{-1} + Σ_2^{-1},   µ_3 = Σ_3 (Σ_1^{-1} µ_1 + Σ_2^{-1} µ_2)    (6)

We provide a brief sketch of the proof that the intersection of Gaussian density functions is a closed operation; a complete proof is provided in Appendix A. Let us consider two Gaussian PDFs P(θ_1) = N(µ_1, Σ_1) and P(θ_2) = N(µ_2, Σ_2). Their intersection implies a random variable that is distributed as the product P(θ_1)P(θ_2). This product is proportional to a Gaussian P(θ) = N(µ_3, Σ_3) with the parameters given in Eq. (6).

Union (∪). We model the union of multiple entities using Gaussian mixtures. The union of entity density functions e_1, e_2, e_3, ..., e_n is defined as the mixture q_∪, and the distance of a resultant entity v_∪ ∈ V_∪ from the query is d_{q_∪}, where

q_∪ = Σ_{i=1}^{n} φ_i N(µ_{e_i}, Σ_{e_i}),   φ_i = exp(N(µ_{e_i}, Σ_{e_i})) / Σ_{j=1}^{n} exp(N(µ_{e_j}, Σ_{e_j}))    (7)

Here φ_i ∈ Φ are the weights of the Gaussian densities in the mixture, calculated using a self-attention mechanism over the parameters of the Gaussians in the mixture, i.e., µ_{e_i}, Σ_{e_i} for i = 1, ..., n.

We consider the Gaussian density function (the embedding of a single entity) as a special case of a Gaussian mixture with a single component. This ensures that all the operations defined in Section 3 are closed under the Gaussian space, with an output that is either a single-component (for translations and intersections) or a multi-component Gaussian mixture (for unions). Hence, for chaining the queries, we need to define the logical operators for a Gaussian density and a Gaussian mixture input. In this section, we define the different operators (depicted in Figure 3) in the case of a Gaussian mixture input.

Chain Translation. Let us assume that the input query embedding is an n-component mixture p = Σ_{i=1}^{n} φ_i N(µ_i, Σ_i) and we need to translate it with relation r = N(µ_r, Σ_r). Intuitively, we would like to translate every Gaussian in the mixture with the relation. Hence, we model this translation as c_t, obtained by applying the translation of Eq. (5) to each component of the mixture, and measure the distance of entities v_t ∈ V_t from the result as d_{c_t}.
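As a rough illustration of the single-density operations above, the following NumPy sketch implements the Mahalanobis distance of Eq. (4), the translation of Eq. (5), and the product-of-Gaussians intersection of Eq. (6), together with a plain softmax stand-in for the attention-based mixture weights of Eq. (7). It assumes full covariance matrices are materialized, which the paper avoids by storing factored inverse covariances (see the implementation details later in this section), so this is a didactic sketch rather than the authors' implementation.

```python
import numpy as np

def mahalanobis(mu_v, mu_q, sigma_q):
    """Eq. (4): distance of a candidate mean mu_v from a query Gaussian N(mu_q, sigma_q)."""
    diff = mu_v - mu_q
    return float(np.sqrt(diff @ np.linalg.inv(sigma_q) @ diff))

def translate(mu_e, sigma_e, mu_r, sigma_r):
    """Eq. (5): q_t = N(mu_e + mu_r, (sigma_e^-1 + sigma_r^-1)^-1)."""
    sigma_t = np.linalg.inv(np.linalg.inv(sigma_e) + np.linalg.inv(sigma_r))
    return mu_e + mu_r, sigma_t

def intersect(mu_1, sigma_1, mu_2, sigma_2):
    """Eq. (6): product of Gaussians, q_cap = N(mu_3, sigma_3)."""
    p1, p2 = np.linalg.inv(sigma_1), np.linalg.inv(sigma_2)  # precision matrices
    sigma_3 = np.linalg.inv(p1 + p2)
    mu_3 = sigma_3 @ (p1 @ mu_1 + p2 @ mu_2)
    return mu_3, sigma_3

def mixture_weights(scores):
    """Softmax stand-in for the attention-derived weights phi_i of Eq. (7)."""
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

# Toy usage with 2-dimensional densities.
mu_e, sigma_e = np.zeros(2), np.eye(2)
mu_r, sigma_r = np.ones(2), 0.5 * np.eye(2)
mu_t, sigma_t = translate(mu_e, sigma_e, mu_r, sigma_r)
print(mahalanobis(np.array([1.0, 1.0]), mu_t, sigma_t))
print(mixture_weights(np.array([0.2, 1.3, -0.5])))
```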
Chain Intersection. A Gaussian mixture is a union over individual densities. Based on the distributive law of sets, an intersection between a Gaussian mixture p = Σ_{i=1}^{n} φ_i N(µ_i, Σ_i) and an entity e = N(µ_e, Σ_e) implies the union of the intersections between the entity and each Gaussian density in the mixture. Hence, we derive this intersection as c_∩, the mixture obtained by applying the pairwise intersection of Eq. (6) between e and each component of p, and measure the distance of entities v_∩ ∈ V_∩ from the result as d_{c_∩}.

Chain Union. The union of an entity e = N(µ_e, Σ_e) with a Gaussian mixture is the addition of the entity to the mixture. Hence, the union c_∪ is obtained by adding e as a further component of p, and the distance of entities v_∪ ∈ V_∪ from the result is d_{c_∪}.

Implementation Details. To calculate the weights (φ_i ∈ Φ) of the Gaussian mixtures, we use the popular self-attention mechanism [21]. The gradient descent over the Mahalanobis distance (Eq. 4) and the derivation of the product of Gaussians (Eq. 6) are given by [22] and Appendix A, respectively. Another important note is that we never need to compute Σ for the operations; we only need Σ^{-1}. Storing the complete Σ^{-1}, however, requires quadratic memory, i.e., a Gaussian density over d variables requires d × d parameters. So, we only store a decomposed matrix L with Σ^{-1} = LL^T. Thus, for a Gaussian density over d variables, our memory requirement is d × (r + 1) parameters (d for µ and d × r for the factor L of Σ^{-1}). For computing µ_3 for the intersection in Eq. (6), we use a linear solver (torch.solve) for faster computation. All our models are implemented in PyTorch [23] and run on four Quadro RTX 8000 GPUs.

This section describes the experimental setup used to analyze the performance of PERM on various tasks, with a focus on the following research questions:

1. Do PERM's query representations perform better than the state-of-the-art baselines on the task of logical reasoning over standard benchmark knowledge graphs?
2. What is the role of individual components in PERM's overall performance gain?
3. Is PERM able to recommend better therapeutic drugs for COVID-19 from drug re-purposing graph data compared to the current baselines?
4. Are we able to visualize the operations on PERM's query representations in the latent space?

We utilize the following standard benchmark datasets to compare PERM's performance on the task of reasoning over KGs:

• FB15K-237 [24] comprises 149,689 relation triples and textual mentions of Freebase entity pairs, with all trivially invertible relations removed.
• NELL995 [25] consists of 107,982 triples obtained from the 995th iteration of the Never-Ending Language Learning (NELL) system.
• DBPedia is a subset of the Wikipedia snapshot that consists of a multi-level hierarchical taxonomy over 240,942 articles.
• DRKG [26] (Drug Repurposing Knowledge Graph) is used to evaluate the performance of our model on both the logical reasoning and drug recommendation tasks.

More detailed statistics of these datasets are provided in Table 1. For our experiments, we select the following baselines based on (i) their performance on the logical reasoning task and (ii) their ability to extend to all FOE query combinations.

• Graph Query Embedding (GQE) [6] embeds entities and relations as vectors and utilizes TransE [17] to learn the query embeddings. The distance of the answer entities is calculated using the L1-norm.
• Query2Box (Q2B) [7] embeds entities and relations as axis-aligned hyper-rectangles (boxes) and utilizes FOE queries to learn query representations. The distance of answer entities is given by a weighted combination of the answer's distance from the center and from the border of the query box.
• Beta Query Embedding (BQE) [11] utilizes beta distributions to learn query representations from FOE queries, with the novel addition of negation queries. The distance is calculated as the dimension-wise KL divergence between the answer entity and the query beta embedding.
• Complex Query Decomposition (CQD) [10] answers complex queries by reducing them to simpler sub-queries and aggregating the resulting scores with t-norms.

Some of the other baselines [27, 18] focus solely on the multi-hop problem. They could not be intuitively extended to handle all FOE queries, and hence, we did not include them in our study.

To evaluate the efficacy of PERM's query representations, we compare it against the baselines on different FOE query types: (i) single operator: 1t, 2t, 3t, 2∩, 3∩, 2∪ and (ii) compound queries: ∩t, t∩, ∪t. We follow the standard evaluation protocol [7, 11, 8] and utilize the three splits of a KG for training (G_train), validation (G_valid), and evaluation (G_test); details are in Table 1. The models are trained on G_train with validation on G_valid. The final evaluation metrics for comparison are calculated on G_test. For the baselines, we calculate the relevance of the answer entities to the queries based on the distance measures proposed in their respective papers. In PERM, the distance of the answer entity from the query Gaussian density is computed according to the measures discussed in Sections 3 and 4. We use the evaluation metrics of HITS@K and MRR to compare the ranked results obtained from different models. Given the ground truth Ê and the ranked model outputs {e_1, e_2, ..., e_n} ∈ E, HITS@K measures the proportion of ground-truth answers that appear among the top-K ranked outputs, and MRR is the mean of the reciprocal ranks of the ground-truth answers.

From the results provided in Table 2, we observe that PERM outperforms all the current state-of-the-art approaches by 6.2% on average across all FOE queries. Specifically, we see a consistent improvement for union queries: 9.5% and 93% in the case of 2∪ and ∪t, respectively. Comparing the models based on their geometries alone, we notice the clear efficacy of PERM's query representations, with an average improvement of 37.9%, 15.9%, and 37.3% over vectors (GQE), boxes (Q2B), and beta distributions (BQE), respectively. Given these improvements and the ability to handle compound queries in an end-to-end manner, we conclude that Gaussian distributions are better at learning query representations for FOE reasoning over KGs. Additionally, we provide PERM's results on sample queries from different datasets in Table 3 (e.g., the recommended drugs Ribavirin, Dexamethasone, and Hydroxychloroquine).

In this section, we evaluate the need for the different components and their effects on the overall performance of our model. First, we look at the contribution of utilizing different types of queries to the performance of our model. For this, we train our model on different subsets of queries: (i) only 1t queries, (ii) only translation (1t, 2t, 3t) queries, and (iii) only single operator (1t, 2t, 3t, 2∩, 3∩, 2∪) queries. Furthermore, we look at the need for attentive aggregation in the case of unions of Gaussian mixtures. We test other methods of aggregation: (i) vanilla averaging and (ii) an MLP [28]. From Table 4, we notice that utilizing only 1t queries significantly reduces the performance of our model by 22.3%, and even increasing the scope to all translation queries still leaves performance lower by 12.5%. However, we notice that training on all single operator queries results in comparable performance to the final PERM model.
However, given the better overall performance, we utilize all query types in our final model. For union aggregation, we observe that attention has a clear advantage: both vanilla averaging and the MLP lower performance, by 3.33% and 1.28%, respectively. Thus, we adopt self-attention in our final model.

In this experiment, we utilize the expressive power of PERM's query representations to recommend therapeutic drugs for COVID-19 from the DRKG dataset. Drugs in the dataset are already approved for other diseases, and the aim is to utilize the drug-protein-disease networks and employ them towards treating COVID-19. This can potentially reduce both drug development time and cost [29]. For this experiment, we utilize the treatment relation in DRKG and retrieve drugs D satisfying D −treats→ X, where X is a set of SARS diseases related to the COVID-19 virus. Given that we only need this limited set of entity types (only SARS diseases and drugs) and relation types (only treatments), we only consider the DRKG subgraph that contains this necessary set of entities and relations for learning the representations. We compare the recommendations of different models against a set of actual candidates currently in trials for COVID-19. We use the top-10 recommendations with the evaluation metrics of precision, recall, and F1-score for comparison.

Table 5: Performance comparison of various models on the COVID-19 drug recommendation problem using precision (P), recall (R), and F1-score (F1) metrics. The top three drugs recommended by the models are given in the last column. The recommendations given in green and red indicate correct and incorrect predictions, respectively. The last two rows provide the average relative improvement of PERM compared to the state-of-the-art baselines Q2B and CQD.

We can observe from Table 5 that PERM provides the best drug recommendations across all evaluation metrics. Our model outperforms the current methods by at least 3.8%, 3.5%, and 8.2% in precision, recall, and F1, respectively. Also, the top drugs recommended by PERM are more in line with the current drug development candidates, thus showing the better performance of our model's query representations.

To visualize entities and queries in the latent space, we extract representative entity samples from the FB15K-237 dataset and present them in a 2-dimensional space for better comprehension. Figure 4 depicts the different entities and the mechanism through which PERM narrows down to the particular answer set. Notice that we are able to perform an intersection after a union operation due to the closed form nature of our operations; this is currently not possible in state-of-the-art baseline methods. Additionally, it should be noted that unions widen the query space and intersections narrow it down (as expected). Furthermore, the variance parameter acts as a control over the spatial area that an entity should cover: more general entities such as Turing Award and Europe occupy a larger area than their respective sub-categories, namely, winners and Europeans.

In this paper, we present the Probabilistic Entity Representation Model (PERM), a model that learns query representations for chain reasoning over knowledge graphs. We show the representational power of PERM by defining closed form solutions to FOE queries and their chains.
Additionally, we demonstrate its superior performance compared to its state-of-the-art counterparts on the problems of reasoning over KGs and drug recommendation for COVID-19 from the DRKG dataset. Furthermore, we exhibit its interpretability by depicting the representational space through a sample query processing pipeline.

PERM is the first method that models an individual entity in knowledge graphs using a Gaussian density function, making it possible to solve FOE queries using a closed form solution. This enables its application in domains that require chain reasoning. The main idea of the proposed solution can also be extended to any domain that can encode its basic units as Gaussians and combine the units through FOE queries, e.g., in topic modeling, topics can be encoded as Gaussians and documents as unions of topics. However, PERM depends on the integrity of the knowledge graph used for training. Any malicious attacks or errors [30, 31] that lead to incorrect relations could in turn lead to incorrect results and affect the confidence of our model. Furthermore, due to the connected nature of complex queries, such an attack could propagate and affect a larger set of queries. Such incorrect results would be problematic in sensitive areas of research such as drug recommendation and cybersecurity, and thus it is necessary to maintain the integrity of the training data before learning representations and querying with PERM.

The following is the supplementary Appendix for the paper Probabilistic Entity Representation Model for Reasoning over Knowledge Graphs. All references in the following sections are made in the context of the main paper.

The following sections provide the proof for the product of Gaussians for both the univariate and the multivariate case (used in Eqs. (6) and (10)). Notice that we need Σ_3 while calculating µ_3. However, to save computational memory, we only store the inverses of the covariances, i.e., Σ_1^{-1}, Σ_2^{-1}, and Σ_3^{-1}. So, to solve for µ_3 and avoid the computationally expensive process of matrix inversion, we use the linear solver torch.solve on the equation Σ_3^{-1} µ_3 = Σ_1^{-1} µ_1 + Σ_2^{-1} µ_2 (a small code sketch of this trick is given at the end of this appendix).

Algorithm 1 provides an outline of PERM's overall framework to learn representations of entities e ∈ E and relations r ∈ R. The algorithm describes the training from the FOE operations of translation (lines 4-7), intersection (lines 8-11), and union (lines 12-15).

Algorithm 1: PERM training algorithm
Input: Training data D_t, D_∩, D_∪, the sets of all (query Q, result V) pairs for translation, intersection, and union, respectively
Output: Entity (E) and relation (R) Gaussian density functions
1:  Randomly initialize e = N(µ_e, Σ_e) ∈ E and r = N(µ_r, Σ_r) ∈ R
2:  for number of epochs, until convergence do
3:      l = 0                                            # initialize loss
4:      for (e, r, V_t) ∈ D_t do
5:          q_t = N(µ_e + µ_r, (Σ_e^{-1} + Σ_r^{-1})^{-1}) from Eq. (5)
6:          update the loss l for translation queries
7:      end
8:      for intersection queries in D_∩ do
9:          compute q_∩ from Eq. (6)
10:         update the loss l for intersection queries
11:     end
12:     for union queries in D_∪ do
13:         compute q_∪ from Eq. (7)
14:         update the loss l for union queries
15:     end
        # update E and R with backpropagation
16:     E ← E − ∆_E l
17:     R ← R − ∆_R l
18: end
19: return E, R

C. MRR metrics for Reasoning over KGs. Table 6 provides the Mean Reciprocal Rank (MRR) results for the reasoning-over-KGs experiment described in Section 5. Table 7 provides finer-grained results of our ablation study.
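For concreteness, here is a small PyTorch sketch of the memory-saving scheme described above: only a factor L with Σ^{-1} = LLᵀ is stored per density, and µ_3 is recovered with a linear solve instead of an explicit inversion of Σ_3^{-1}. The dimensions, rank, and variable names are illustrative assumptions, not the authors' implementation; the paper cites the older torch.solve API, whereas this sketch uses its current replacement, torch.linalg.solve.

```python
import torch

d, r = 8, 4                     # embedding dimension and factor rank (illustrative)
L1 = torch.randn(d, r)          # factor of Sigma_1^{-1}
L2 = torch.randn(d, r)          # factor of Sigma_2^{-1}
mu1, mu2 = torch.randn(d), torch.randn(d)

P1 = L1 @ L1.T                  # Sigma_1^{-1}; the covariance itself is never formed
P2 = L2 @ L2.T                  # Sigma_2^{-1}
P3 = P1 + P2                    # Sigma_3^{-1}; with 2r >= d it is full rank almost surely

# Solve Sigma_3^{-1} mu_3 = Sigma_1^{-1} mu_1 + Sigma_2^{-1} mu_2 for mu_3,
# avoiding the explicit inversion of P3.
rhs = P1 @ mu1 + P2 @ mu2
mu3 = torch.linalg.solve(P3, rhs.unsqueeze(-1)).squeeze(-1)
print(mu3.shape)                # torch.Size([8])
```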
[1] Autoknow: Self-driving knowledge collection for products of thousands of types
[2] DBpedia - a crystallization point for the web of data
[3] Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371)
[4] Real-world data medical knowledge graph: construction and applications
[5] Semantic health knowledge graph: semantic integration of heterogeneous medical knowledge and services
[6] Embedding logical queries on knowledge graphs
[7] Query2box: Reasoning over knowledge graphs in vector space using box embeddings
[8] Self-supervised hyperboloid representations from logical queries over knowledge graphs
[9] Complex embeddings for simple link prediction
[10] Complex query answering with neural link predictors
[11] Beta embeddings for multi-hop logical reasoning in knowledge graphs
[12] Representing words as regions in vector space
[13] Probabilistic embedding of knowledge graphs with box lattice measures
[14] Word representations via Gaussian embedding
[15] Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking
[16] Mahalanobis distance
[17] Translating embeddings for modeling multi-relational data
[18] A three-way model for collective learning on multi-relational data
[19] Chains of reasoning over entities, relations, and text using recurrent neural networks
[20] Faithful embeddings for knowledge base queries
[21] Attention is all you need
[22] Closed-form training of Mahalanobis distance for supervised clustering
[23] PyTorch: An imperative style, high-performance deep learning library
[24] Representing text for joint embedding of text and knowledge bases
[25] Toward an architecture for never-ending language learning
[26] DRKG - Drug repurposing knowledge graph for COVID-19
[27] Embedding entities and relations for learning and inference in knowledge bases
[28] Multilayer perceptrons for classification and regression
[29] Drug repurposing: progress, challenges and recommendations
[30] Adversarial attacks on neural networks for graph data
[31] Adversarial attack on graph structured data