key: cord-0563703-xtcvaqc3 authors: Wang, Jie; Zhang, Zhanqiu; Shi, Zhihao; Cai, Jianyu; Ji, Shuiwang; Wu, Feng title: Duality-Induced Regularizer for Semantic Matching Knowledge Graph Embeddings date: 2022-03-24 journal: nan DOI: nan sha: 6e25c608be17fc5c7246a486f3cb27cb3d6c0378 doc_id: 563703 cord_uid: xtcvaqc3 Semantic matching models -- which assume that entities with similar semantics have similar embeddings -- have shown great power in knowledge graph embeddings (KGE). Many existing semantic matching models use inner products in embedding spaces to measure the plausibility of triples and quadruples in static and temporal knowledge graphs. However, vectors that have the same inner products with another vector can still be orthogonal to each other, which implies that entities with similar semantics may have dissimilar embeddings. This property of inner products significantly limits the performance of semantic matching models. To address this challenge, we propose a novel regularizer -- namely, DUality-induced RegulArizer (DURA) -- which effectively encourages the entities with similar semantics to have similar embeddings. The major novelty of DURA is based on the observation that, for an existing semantic matching KGE model (primal), there is often another distance based KGE model (dual) closely associated with it, which can be used as effective constraints for entity embeddings. Experiments demonstrate that DURA consistently and significantly improves the performance of state-of-the-art semantic matching models on both static and temporal knowledge graph benchmarks. K NOWLEDGE graphs store human knowledge in a structured way, with nodes being entities and edges being relations. In the past few years, knowledge graphs have made great achievements in many areas, such as natural language processing [1] , question answering [2] , recommendation systems [3] , and computer vision [4] . Some popular knowledge graphs, such as Freebase [5] , Wikidata [6] , and Yago3 [7] , consists of triples (subject, predicate, object). However, in many scenarios, triples cannot well represent knowledge as knowledge may evolve with time. For example, the fact (Donald Trump, president, USA) is only valid at the time interval 2017-2021. Therefore, temporal knowledge graphs that consists of quadruples (subject, predicate, object, timestamp) also attract increasing attention recently. Although commonly used knowledge graphs usually contain billions of triples or quadruples, they still suffer from the incompleteness problem that a lot of factual triples or quadruples are missing. Due to the large scale of knowledge graphs, it is impractical to find all valid triples or quadruples manually. Therefore, knowledge graph completion (KGC)-which aims to predict missing links between entities based on known links automatically-has attracted much attention. Knowledge graph embedding (KGE) is a powerful technique for KGC, which embeds entities, relations, and possible timestamps in KGs into low-dimensional continu-ous embedding spaces. KGE models define a score function for each triple/quadruple in embedding spaces. Valid triples/quadruples are expected to have higher scores than invalid ones. The success of many existing KGE models relies on an assumption that entities with similar semantics should have similar representations. For example, the triples (lions, is, mammals) and (tigers, is, mammals) will have similar scores if the entities lions and tiger have similar embeddings. 
In other words, given a known valid triple (lions, is, mammals), we can infer that the triple (tigers, is, mammals) is also valid with high confidence. Semantic matching (SM) models are an important suite of knowledge graph embedding approaches. They usually model static and temporal knowledge graphs as partially observed third-order and fourth-order binary tensors, respectively. Then, they formulate KGC as a tensor completion problem and try to solve it by tensor factorization. From the perspective of the scoring function, SM models use inner products between embeddings to define the plausibility of given triples and quadruples. Theoretically, these models are highly expressive and can well handle complex relation patterns, such as 1-N, N-1, and N-N relations [8], [9]. However, due to the property of inner products, entities with similar semantics may have dissimilar embeddings. For example, even if the embeddings of lions and tigers are orthogonal, the triples (lions, is, mammals) and (tigers, is, mammals) are likely to have the same scores when the two entity embeddings have proper norms (please refer to Section 4.1 for more detailed explanations). We can observe the same phenomenon in temporal semantic matching models (please refer to Section 5.3). The performance of SM models suffers from this phenomenon, and these models thus cannot achieve state-of-the-art results. To tackle the challenge, we propose a novel regularizer for semantic matching KGE models, namely, the DUality-induced RegulArizer (DURA), which effectively encourages entities with similar semantics to have similar embeddings. Specifically, by noticing that entities with similar semantics often connect to the same entity through the same relation, we propose to use a regularizer to constrain these entities' embeddings. Further, we propose DURA based on an observation called duality: for an existing semantic matching KGE model (primal), there is often another distance based KGE model closely associated with it (dual), which uses Minkowski distances to define score functions. The duality can be derived by expanding the squared score functions of the associated distance based model, where the cross-term in the expansion is exactly a SM model and the squared terms give us a regularizer. In other words, adding DURA to a SM model is equivalent to training the SM model with its associated distance based model acting as the regularizer. Since minimizing the Minkowski distance between two vectors forces them to be the same, DURA indeed encourages entities with similar semantics to have similar embeddings. Moreover, we show that when relation embeddings are diagonal matrices, the minimization of DURA has the form of the tensor nuclear 2-norm [10]. Experiments demonstrate that DURA is widely applicable to various static and temporal semantic matching KGE models and yields consistent and significant improvements on benchmark datasets. An earlier version of this paper was published at NeurIPS 2020 [11]. This journal manuscript significantly extends the initial version in several aspects. First, we propose a temporal version of DURA, named TDURA, which is designed for temporal knowledge graph embeddings. Second, we introduce a general semantic matching temporal KGE model, named TRESCAL, which is an extension of RESCAL to temporal KGs. Third, we rigorously analyze the case of diagonal relation embedding matrices and show that the minimization of TDURA also has the form of the tensor nuclear 2-norm [10].
Finally, we conduct experiments on temporal knowledge graphs and perform more ablation studies to demonstrate the effectiveness of TDURA. We review the backgrounds of knowledge graphs, knowledge graph completion, and knowledge graph embeddings in Sections 2.1, 2.2, and 2.3, respectively. Then, we introduce the notations in Section 2.4. Static Knowledge Graphs. Given a set E of entities and a set R of relations, a knowledge graph K = {(u_i, r_j, v_k)} ⊂ E × R × E is a set of triplets, where u_i and r_j are the i-th entity and the j-th relation, respectively. Temporal Knowledge Graphs. Given a set E of entities, a set R of relations, and a set T of timestamps, a temporal knowledge graph K_T = {(u_i, r_j, v_k, t_l)} ⊂ E × R × E × T is a set of quadruples, where u_i, r_j, and t_l are the i-th entity, the j-th relation, and the l-th timestamp, respectively. Static Knowledge Graph Completion (KGC). The goal of KGC is to predict valid but unobserved triples based on the known triples in K, which is often solved by knowledge graph embedding (KGE) models. KGE models associate each entity u_i ∈ E and relation r_j ∈ R with an embedding (which may be a real or complex vector, matrix, or tensor) u_i and r_j. Generally, they define a score function s : E × R × E → R to associate a score s(u_i, r_j, v_k) with each potential triplet (u_i, r_j, v_k) ∈ E × R × E. The scores measure the plausibility of triples. For a query (u_i, r_j, ?), KGE models first fill the blank with each entity in the knowledge graph and then score the resulting triples. Valid triples are expected to have higher scores than invalid triples. Temporal Knowledge Graph Completion (TKGC). The goal of TKGC is to predict valid but unobserved quadruples based on the known quadruples in K_T. TKGE models also contain two important categories, distance based models and semantic matching models, both of which are temporal knowledge graph embedding (TKGE) methods. TKGE models also associate each entity u_i ∈ E, relation r_j ∈ R, and timestamp t_l ∈ T with an embedding (which may be a real or complex vector, matrix, or tensor) u_i, r_j, and t_l. Generally, they define a score function s : E × R × E × T → R to associate a score s(u_i, r_j, v_k, t_l) with each potential quadruple (u_i, r_j, v_k, t_l) ∈ E × R × E × T. The scores measure the plausibility of quadruples. For a query (u_i, r_j, ?, t_l), TKGE models first fill the blank with each entity in the temporal knowledge graph and then score the resulting quadruples. Valid quadruples are expected to have higher scores than invalid quadruples. Knowledge graph embeddings aim to embed entities, relations, and possible timestamps in static and temporal knowledge graphs into low-dimensional continuous vector spaces. They fall into two important categories, distance based models and semantic matching models. Distance Based (DB) Models for Static KGs. DB models define the score function s with Minkowski distances. That is, the score functions have the formulation s(u_i, r_j, v_k) = -||Γ(u_i, r_j, v_k)||_p, where Γ is a model-specific function. Semantic Matching (SM) Models for Static KGs. SM models regard a knowledge graph as a third-order binary tensor X ∈ {0, 1}^{|E|×|R|×|E|}. Suppose that X_j denotes the j-th frontal slice of X, i.e., the adjacency matrix of the j-th relation. Generally, a SM KGE model factorizes X_j as X_j ≈ Re(U R_j conj(V)^T), where the i-th row of U is the embedding u_i of the i-th head entity, the k-th row of V is the embedding v_k of the k-th tail entity, R_j is a matrix representing relation r_j, and Re(·) and conj(·) are the real part and the conjugate of a complex matrix, respectively.
That is, the score functions are defined as s(u_i, r_j, v_k) = Re(u_i R_j conj(v_k)^T). Note that the real part and the conjugate of a real matrix are the matrix itself. The aim of SM models is to seek matrices U, R_1, ..., R_|R|, V such that Re(U R_j conj(V)^T) can approximate X_j. Let X̂_j = Re(U R_j conj(V)^T) and X̂ be the tensor whose j-th frontal slice is X̂_j. Then the regularized formulation of a semantic matching model can be written as

min_{U, {R_j}, V}  Σ_{j=1}^{|R|} L(X_j, X̂_j) + λ g(X̂_j),        (2)

where L(X_j, X̂_j) measures the discrepancy between X_j and X̂_j, g is the regularization function, and λ > 0 is a fixed parameter. Distance Based (DB) Models for Temporal KGs. Temporal DB models define the score function s with Minkowski distances. That is, the score functions have the formulation s(u_i, r_j, v_k, t_l) = -||Γ(u_i, r_j, v_k, t_l)||_p, where Γ is a model-specific function. Semantic Matching (SM) Models for Temporal KGs. Temporal SM models regard a temporal knowledge graph as a fourth-order binary tensor X ∈ {0, 1}^{|E|×|R|×|E|×|T|}. Suppose that X_jl denotes the (j, l) slice of X, i.e., the adjacency matrix given the j-th relation and the l-th timestamp. Generally, a SM temporal KGE model factorizes X_jl as X_jl ≈ Re(U (R_j ∘ T_l) conj(V)^T), where the i-th row of U is the embedding u_i, the k-th row of V is the embedding v_k, R_j is a matrix representing relation r_j, T_l is a matrix representing timestamp t_l, Re(·) and conj(·) are the real part and the conjugate of a complex matrix, and ∘ is the element-wise multiplication between two matrices. The aim of temporal SM models is to seek matrices U, R_1, ..., R_|R|, V, and T_1, ..., T_|T| such that Re(U (R_j ∘ T_l) conj(V)^T) can approximate X_jl. Let X̂_jl = Re(U (R_j ∘ T_l) conj(V)^T) and X̂ be the tensor whose (j, l) slice is X̂_jl. Then the regularized formulation of a temporal semantic matching model can be written as

min_{U, {R_j}, V, {T_l}}  Σ_{j=1}^{|R|} Σ_{l=1}^{|T|} L(X_jl, X̂_jl) + λ g(X̂_jl),

where λ > 0 is a fixed parameter, L(X_jl, X̂_jl) measures the discrepancy between X_jl and X̂_jl, and g is the regularization function. Notations. We use u_i ∈ E and v_k ∈ E to distinguish head and tail entities. Let ||·||_1, ||·||_2, and ||·||_F denote the L1 norm, the L2 norm, and the Frobenius norm of matrices or vectors. We use <·,·> to represent the inner product of two real or complex vectors. This work is related to knowledge graph embedding models and regularizers for semantic matching knowledge graph embeddings. Popular knowledge graph embeddings include distance based, learning based, and semantic matching models. Distance based models describe relations as relational maps between head and tail entities. Then, they use the Minkowski distance to measure the plausibility of a given triplet or quadruple. Lower distances correspond to higher plausibility of the given triplets or quadruples. For example, TransE [12] and its variants [8], [9] represent relations as translations in vector spaces. They assume that a valid triplet (u_i, r_j, v_k) satisfies u_{i,r_j} + r_j ≈ v_{k,r_j}, where u_{i,r_j} and v_{k,r_j} mean that entity embeddings may be relation-specific. Structured embedding (SE) [13] uses linear maps to represent relations; its score function is defined as s(u_i, r_j, v_k) = -||R_j^1 u_i - R_j^2 v_k||_1. RotatE [14] defines each relation as a rotation in a complex vector space, and its score function is defined as s(u_i, r_j, v_k) = -||u_i ∘ r_j - v_k||_1, where u_i, r_j, v_k are complex vectors and ∘ is the element-wise product. HAKE [15] assumes that R_j^1 is diagonal and R_j^2 is an identity matrix. It shares a similar score function s(u_i, r_j, v_k) = -||u_i ∘ r_j - v_k||_1 with RotatE, but with u_i, r_j, v_k ∈ R^k. Some works extend existing distance based models to temporal KGs. For example, TTransE [16] extends TransE by learning timestamp embeddings. Inspired by TransH [8], HyTE [17] projects entity and relation embeddings onto timestamp-specific hyperplanes.
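To make the two families of score functions concrete, here is a small illustrative sketch (our own, with made-up shapes and names, not code from the paper) of a distance based score in the TransE style and semantic matching scores in the RESCAL and ComplEx styles:

```python
import torch

def score_transe(u, r, v, p=1):
    # Distance based: higher score means smaller Minkowski distance ||u + r - v||_p.
    return -torch.linalg.vector_norm(u + r - v, ord=p, dim=-1)

def score_rescal(u, R, v):
    # Semantic matching with a full relation matrix: u R v^T per example.
    return torch.einsum('bd,bde,be->b', u, R, v)

def score_complex(u, r, v):
    # Semantic matching with diagonal (vector) complex relations: Re(<u * r, conj(v)>).
    return torch.real((u * r * torch.conj(v)).sum(dim=-1))

B, d = 4, 8
u, r, v = torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)
R = torch.randn(B, d, d)
uc, rc, vc = (torch.randn(B, d, dtype=torch.cfloat) for _ in range(3))
print(score_transe(u, r, v), score_rescal(u, R, v), score_complex(uc, rc, vc))
```

The inner-product scores can be computed against all candidate tails with a single matrix multiplication, which is one reason semantic matching models are time-efficient in practice.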
Learning-based models use neural networks-such as convolutional neural networks [18] , capsule networks [19] , and graph neural networks [20] -to model the interaction among triplets or quadruples. Then, they use either inner products or distance functions to measure the plausibility of given triplets or quadruples. Semantic matching models usually formulate the KGC task as a third-order binary tensor completion problem. RESCAL [21] factorizes the j-th frontal slice of X as X j ≈ AR j A . That is, embeddings of head and tail entities are from the same space. As the relation specific matrices contain lots of parameters, RESCAL is prone to be overfitting. DistMult [22] simplifies the matrix R j in RESCAL to be diagonal, while it sacrifices the expressiveness of models and can only handle symmetric relations. In order to model asymmetric relations, ComplEx [23] extends DistMult to complex embeddings. Both DistMult and ComplEx can be regarded as variants of canonical decomposition (CP) [24] , which are in real and complex vector spaces, respectively. Some works also extend semantic mathcing models to temporal KGs. For example, ConT [25] extends TuckER [26] by additionally learning timestamp embeddings. ASALSAN [27] use three-way DEDICOM [28] to express temporal relations, which shares similar ideas with RESCAL [21] . Besides, TComplEx and TNTComplEx [29] reduce memory complexity by extending ComplEx to decomposition of the 4-order tensor. Compared with distance-based models, semantic matching knowledge graph embeddings are more time-efficient, as we can easily parallel the computation of linear transformations and inner products. Moreover, due to the property of inner products, semantic matching is good at modeling one-to-many/many-to-one/many-to-many relations. For example, given the entity embedding u and relation embedding r, we can find different entity embeddings v i such that (u, r, v i ) have the same scores. In order to handle complex relations, distance-based models usually turn to relation-specific projection and further increase the computational burden. Compared with learning-based model, semantic matching knowledge graph embeddings are more time-efficient thanks to their simpler formulations. However, also due to the property of inner products, in semantic matching, entities with similar semantics may have dissimilar embeddings, which will degrade the knowledge graph completion performance. As introduced in Section 4.1, even if the embeddings of Laue and Schottky are orthogonal, the triples (Laue, StudiedIn, HU Berlin) and (Schottky, StudiedIn, HU Berlin) are likely to have the same scores when the two entity embeddings have proper norms. Moreover, distance-based models are better at modeling properties of knowledge graphs than semantic matching. For example, RotatE [14] can handle composition of relations; HAKE [15] and MuRP [30] can model semantic hierarchies. It is worth noting that many general embedding learning methods are closely related to knowledge graph embeddings, all of which aim to optimize similarity between objects. For example, some graph embedding models [31] , [32] , [33] define similarity functions between nodes, and more generally, some feature embedding approaches [34] , [35] , [36] learn discriminative feature representations by optimizing similarity between samples. Different from these works, knowledge graph embeddings need to model complex relations between entities, leading to a more challenging problem. 
Semantic matching (SM) KGE models usually suffer severely from overfitting, which motivates various regularizers. In the original papers of SM models, the authors usually use the squared Frobenius norm (L2 norm) regularizer [21], [22], [23]. This regularizer cannot bring satisfying improvements. Consequently, SM models do not gain performance comparable to distance based models [14], [15]. More recently, [37] propose to use the tensor nuclear 3-norm [10] (N3) as a regularizer, which brings more significant improvements than the squared Frobenius norm regularizer. However, it is designed for CP-like models, such as CP and ComplEx, and is not suitable for more general models such as RESCAL. Moreover, some regularization methods aim to leverage external background knowledge. For example, to model equivalence and inversion axioms, [38] impose a set of model-dependent soft constraints on the predicate embeddings. [39] use non-negativity constraints on entity embeddings and approximate entailment constraints on relation embeddings to impose prior beliefs upon the structure of the embedding space. We introduce a novel regularizer, the DUality-induced RegulArizer (DURA), for semantic matching knowledge graph embeddings. We first introduce the motivations in Section 4.1 and then introduce a basic version of DURA in Section 4.2. Finally, we introduce the formulation of DURA for static KGE in Section 4.3. To perform accurate link prediction, we expect tail entities connected to a head entity through the same relation to have similar embeddings. First, we claim that tail entities connected to a head entity through the same relation should have similar semantics. Suppose that we know a head entity u_i and a relation r_j, and our aim is to predict the tail entity. If r_j is a one-to-many relation, i.e., there exist two entities v_1 and v_2 such that both (u_i, r_j, v_1) and (u_i, r_j, v_2) are valid, then we expect that v_1 and v_2 have similar semantics. For example, if the two triples (Planck, isDoctoralAdvisorOf, Laue) and (Planck, isDoctoralAdvisorOf, Schottky) are valid, then Laue and Schottky should have similar semantics. Further, we expect that entities with similar semantics have similar embeddings. In this way, if we have known that (Laue, StudiedIn, HU Berlin) is valid, then we can predict that (Schottky, StudiedIn, HU Berlin) is likely to be valid, which is consistent with the fact. See Figure 1a for an illustration of the prediction process.

Figure 1: (a) Tail entities connected to a head entity through the same relation should have similar embeddings. (b) SM models without regularization can get the same score even though the embeddings of v_k are dissimilar. (c) With DURA, the embeddings of v_k are encouraged to locate in a small region.

However, SM models fail to achieve the above goal. As shown in Figure 1b, suppose that we have known u_i R_j when the embedding dimension is 2. Then, we can get the same score s(u_i, r_j, v_k) for k = 1, 2, ..., n, so long as v_k lies on the same line perpendicular to u_i R_j. Generally, the entities v_1 and v_2 have similar semantics. However, their embeddings can even be orthogonal, which means that the two embeddings are dissimilar. Therefore, the performance of SM models for knowledge graph completion is usually unsatisfying.
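To make this inner-product issue concrete, here is a small numeric sketch (ours, not taken from the paper) in which two tail embeddings are exactly orthogonal yet receive identical semantic matching scores, because they share the same projection onto u_i R_j:

```python
import numpy as np

# Hypothetical 2-D embeddings: uR plays the role of u_i R_j.
uR = np.array([2.0, 0.0])
v1 = np.array([1.0, 2.0])    # tail embedding of one entity
v2 = np.array([1.0, -0.5])   # tail embedding of another entity

# Identical inner-product scores: both tails have the same projection onto uR.
print(np.dot(uR, v1), np.dot(uR, v2))   # 2.0 2.0

# Yet the two tail embeddings are orthogonal.
print(np.dot(v1, v2))                    # 0.0
```

Any v_k on the vertical line through (1, 0) would receive the same score; this is exactly the degree of freedom that DURA removes.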
Consider the static knowledge graph completion problem (u_i, r_j, ?). That is, we are given the head entity and the relation, aiming to predict the tail entity. Suppose that f_j(i, k) measures the plausibility of a given triplet (u_i, r_j, v_k), i.e., f_j(i, k) = s(u_i, r_j, v_k). Then the score function of a SM model is

f_j(i, k) = Re(<u_i R_j, conj(v_k)>).        (3)

It first maps the entity embedding u_i by a linear transformation R_j, and then uses the real part of an inner product to measure the similarity between u_i R_j and conj(v_k). Notice that another commonly used similarity measure, the Euclidean distance, can replace the dot-product similarity in Equation (3). We can obtain an associated distance based model formulated as

f_j^D(i, k) = -||u_i R_j - conj(v_k)||_2^2.

Note that we use the squared score function, which is equivalent to the form without the square. That is to say, there exists a duality: for an existing semantic matching KGE model (primal), there is often another distance based KGE model (dual) closely associated with it. Specifically, the relationship between the primal and the dual can be formulated as

-||u_i R_j - conj(v_k)||_2^2 = 2 Re(<u_i R_j, conj(v_k)>) - ||u_i R_j||_2^2 - ||v_k||_2^2.        (4)

When we train a distance based model, the aim is to maximize the scores of valid triples, i.e., to make -||u_i R_j - conj(v_k)||_2^2 as large as possible. Suppose that S is the set that contains all valid triples. Then, by Equation (4), maximizing the dual scores on S amounts to maximizing the primal SM scores 2 Re(<u_i R_j, conj(v_k)>) while minimizing the squared terms ||u_i R_j||_2^2 + ||v_k||_2^2. Therefore, the duality induces a regularizer for semantic matching KGE models, i.e.,

Σ_{(u_i, r_j, v_k) ∈ S} ||u_i R_j||_2^2 + ||v_k||_2^2,        (5)

which is called basic DURA. By Equation (4), we know that basic DURA constrains the distance between u_i R_j and conj(v_k). If u_i and R_j are known, then v_k will lie in a small region (see Figure 1c). Thus, tail entities connected to a head entity through the same relation will have similar embeddings, which meets our expectation and is beneficial to the prediction of unknown triples. Basic DURA encourages tail entities with similar semantics to have similar embeddings. However, it cannot handle the case in which head entities have similar semantics. Suppose that (Laue, StudiedIn, HU Berlin) and (Schottky, StudiedIn, HU Berlin) are valid. Similar to the discussion in Section 4.1, we expect that Laue and Schottky have similar semantics and thus have similar embeddings. If we further know that (DU Berlin, hasCitizen, Laue) is valid, we can predict that (DU Berlin, hasCitizen, Schottky) is valid. However, basic DURA cannot handle this case. Let u_1, u_2, v_1, and R_1 be the embeddings of Laue, Schottky, HU Berlin, and StudiedIn, respectively. Then, Re(<u_1 R_1, conj(v_1)>) and Re(<u_2 R_1, conj(v_1)>) can be equal even if u_1 and u_2 are orthogonal, as long as u_1 R_1 = u_2 R_1. To tackle the above issue, notice that

Re(<u_i R_j, conj(v_k)>) = Re(<conj(v_k) conj(R_j)^T, u_i>).

Then, similar to the derivation in Equation (4), the duality induces a regularizer given by

Σ_{(u_i, r_j, v_k) ∈ S} ||conj(v_k) conj(R_j)^T||_2^2 + ||u_i||_2^2.        (6)

When a SM model is incorporated with regularizer (6), head entities with similar semantics will have similar embeddings. Finally, combining the regularizers (6) and (5), DURA has the form of

Σ_{(u_i, r_j, v_k) ∈ S} ||u_i R_j||_2^2 + ||v_k||_2^2 + ||conj(v_k) conj(R_j)^T||_2^2 + ||u_i||_2^2.        (7)

Ignoring the sampling frequency and summing DURA over all possible entities and relations, we can write the regularizer as

Σ_{j=1}^{|R|} (||U R_j||_F^2 + ||conj(V) conj(R_j)^T||_F^2) + |R| (||U||_F^2 + ||V||_F^2),        (8)

where |R| is the number of relations.
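The following is a minimal sketch (not the authors' released code) of how the DURA penalty in Equation (7) could be added to the training loss of a ComplEx-style model, in which relation embeddings are diagonal and therefore stored as vectors; all function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def complex_score(u, r, all_v):
    # Semantic matching score Re(<u * r, conj(v)>) against every candidate tail.
    # u, r: (B, d) complex; all_v: (N, d) complex -> (B, N) real logits.
    return torch.real((u * r) @ torch.conj(all_v).t())

def dura(u, r, v):
    # Equation (7) with diagonal relations: ||u*r||^2 + ||v||^2 + ||v*r||^2 + ||u||^2.
    return ((u * r).abs().pow(2).sum()
            + v.abs().pow(2).sum()
            + (v * r).abs().pow(2).sum()
            + u.abs().pow(2).sum())

# Toy training step on random data (shapes are illustrative).
B, N, num_rel, d, lam = 32, 1000, 50, 200, 5e-2
emb_e = torch.randn(N, d, dtype=torch.cfloat, requires_grad=True)        # entities
emb_r = torch.randn(num_rel, d, dtype=torch.cfloat, requires_grad=True)  # relations

heads = torch.randint(0, N, (B,))
rels = torch.randint(0, num_rel, (B,))
tails = torch.randint(0, N, (B,))

u, r, v = emb_e[heads], emb_r[rels], emb_e[tails]
logits = complex_score(u, r, emb_e)              # rank the true tail among all entities
loss = F.cross_entropy(logits, tails) + lam * dura(u, r, v) / B
loss.backward()
```

In practice the regularization terms are weighted by the sampling frequency of each triple, and the two halves of the penalty can receive separate weights λ1 and λ2, as in Table 1.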
In the rest of this section, we use the same definitions of X̂_j and X̂ as in problem (2). When the relation embedding matrices R_j are diagonal in R or C, as in CP or ComplEx, DURA gives an upper bound on the tensor nuclear 2-norm of X̂, which is an extension of trace norm regularizers in matrix completion. To simplify the notations, we take CP as an example, in which all involved embeddings are real numbers; the conclusion in the complex space can be derived analogously. The nuclear 2-norm of a 3D tensor A ∈ R^{n_1} ⊗ R^{n_2} ⊗ R^{n_3} is

||A||_* = min { Σ_{i=1}^{r} ||u_{1,i}||_2 ||u_{2,i}||_2 ||u_{3,i}||_2 : A = Σ_{i=1}^{r} u_{1,i} ⊗ u_{2,i} ⊗ u_{3,i}, r ∈ N },

where u_{k,i} ∈ R^{n_k} for k = 1, ..., 3, i = 1, ..., r, and ⊗ denotes the outer product. For notational convenience, we define a relation matrix R ∈ R^{|R|×D}, of which the j-th row consists of the diagonal entries of R_j; A_{ij} denotes the entry in the i-th row and j-th column of a matrix A. In the knowledge graph completion problem, the tensor nuclear 2-norm of X̂ is

||X̂||_* = min Σ_{d=1}^{D} ||u_{:d}||_2 ||r_{:d}||_2 ||v_{:d}||_2,

where the minimum is taken over all decompositions X̂ = Σ_{d=1}^{D} u_{:d} ⊗ r_{:d} ⊗ v_{:d}, D is the embedding dimension, and u_{:d}, r_{:d}, and v_{:d} are the d-th columns of U, R, and V. For DURA, we have the following theorem, the proof of which is provided in Appendix A. Theorem 1. Suppose that X̂_j = U R_j V^T for j = 1, 2, ..., |R|, where U, V, R_j are real matrices and R_j is diagonal. Then, minimizing DURA over all such factorizations of X̂ yields the tensor nuclear 2-norm ||X̂||_*. The minimizers satisfy ||u_{:d}||_2 ||r_{:d}||_2 = |R| ||v_{:d}||_2 and ||v_{:d}||_2 ||r_{:d}||_2 = |R| ||u_{:d}||_2 for all d ∈ {1, 2, ..., D}, where u_{:d}, r_{:d}, and v_{:d} are the d-th columns of U, R, and V, respectively. Therefore, the minimization of DURA has the form of the tensor nuclear 2-norm, which is a tensor analogue of the matrix trace norm. Matrix trace norm regularization has shown great power in the matrix completion problem. In the experiments, we will show that DURA reproduces this success in the tensor completion problem. We first introduce a general TKGE model, TRESCAL, which is an extension of RESCAL to temporal KGs, and introduce the corresponding DURA regularizer. Then, we introduce the enhanced version of DURA for TComplEx [29]. Inspired by the classical static KGE model RESCAL, we propose its temporal extension TRESCAL as follows:

s(u_i, r_j, v_k, t_l) = u_i (R_j ∘ T_l) v_k^T,        (9)

where R_j and T_l are trainable real matrices. Note that the element-wise multiplication is commutative, i.e., A ∘ B = B ∘ A for all matrices A and B. In Equation (9), the product of the relation and time matrices, R_j ∘ T_l, can be seen as the representation of a time-dependent relation, leading to |R| × |T| equivalent relations in a KG. By a derivation similar to that in Section 4.3, we propose the DURA regularizer (DURA1) for TRESCAL as follows:

Σ_{(u_i, r_j, v_k, t_l) ∈ S} ||u_i (R_j ∘ T_l)||_2^2 + ||v_k||_2^2 + ||v_k (R_j ∘ T_l)^T||_2^2 + ||u_i||_2^2.        (10)

By introducing representations of time-dependent relations, we can extend ComplEx [23] to temporal KGs. The resulting model, TComplEx [29], is formulated as

s(u_i, r_j, v_k, t_l) = Re(<u_i, r_j, t_l, conj(v_k)>),        (11)

where the four-way inner product multiplies the four complex vectors element-wise before summation. As mentioned by Lacroix et al. [29], timestamp embeddings in temporal semantic matching models can equivalently modulate either the relations or the entities to obtain time-dependent representations. Therefore, we introduce DURA for temporal KGE from two views. First, as the relation embeddings and the timestamp embeddings are diagonal matrices, the score function becomes

s(u_i, r_j, v_k, t_l) = Re(u_i (R_j ∘ T_l) conj(v_k)^T).        (12)

This formulation models time-dependent relation representations. We introduce a form of DURA (DURA1) as

Σ_{(u_i, r_j, v_k, t_l) ∈ S} ||u_i (R_j ∘ T_l)||_2^2 + ||v_k||_2^2 + ||conj(v_k) (conj(R_j) ∘ conj(T_l))^T||_2^2 + ||u_i||_2^2.        (13)

Note that when the relation and time matrices are diagonal, the right-hand side of Equation (12) satisfies the associative law. Thus, we can reformulate the score function so that the timestamp is absorbed into the entities, where u_i T_l and v_k T_l are time-dependent entity representations. Therefore, we derive another formulation of DURA (DURA2), which applies the static regularizer of Equation (7) to these time-dependent entity representations:

Σ_{(u_i, r_j, v_k, t_l) ∈ S} ||(u_i T_l) R_j||_2^2 + ||v_k T_l||_2^2 + ||conj(v_k T_l) conj(R_j)^T||_2^2 + ||u_i T_l||_2^2.        (14)

We empirically evaluate the aforementioned two versions of DURA in Section 6.
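As a minimal sketch of the two temporal variants (again with illustrative names, and assuming the DURA1/DURA2 forms written above), the only difference is whether the timestamp vector is folded into the relation or into the entities before applying the static DURA terms:

```python
import torch

def tcomplex_score(u, r, v, t):
    # Re(<u, r, t, conj(v)>) with complex vectors u, r, v, t of shape (B, d).
    return torch.real((u * r * t * torch.conj(v)).sum(-1))

def dura1(u, r, v, t):
    # Time-dependent relation view: apply static DURA with rt = r * t.
    rt = r * t
    return ((u * rt).abs().pow(2).sum() + v.abs().pow(2).sum()
            + (v * rt).abs().pow(2).sum() + u.abs().pow(2).sum())

def dura2(u, r, v, t):
    # Time-dependent entity view: apply static DURA to ut = u * t and vt = v * t.
    ut, vt = u * t, v * t
    return ((ut * r).abs().pow(2).sum() + vt.abs().pow(2).sum()
            + (vt * r).abs().pow(2).sum() + ut.abs().pow(2).sum())
```

For TRESCAL, where R_j is a full matrix rather than a vector, only the DURA1 view applies, because the matrix product and the element-wise product do not commute.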
To see why DURA1 encourages entities with similar semantics to have similar embeddings, let u_i, r_j, v_k^(1), v_k^(2), T_l^(1), and T_l^(2) be the embeddings of Planck, isDoctoralAdvisorOf, Laue, Schottky, 1902-1903, and 1909-1912, respectively. In temporal KGE, close timestamps usually have similar representations [29], obtained by incorporating a timestamp smoothness regularizer. For convenience of analysis and without loss of generality, we assume that T_l^(1) = T_l^(2) = T_l. As shown in Figure 2b, we can get the same score s(u_i, r_j, v_k, t_l) for any v_k, so long as v_k lies on the same line perpendicular to u_i (R_j ∘ T_l). When we train SM models without DURA1, the embeddings v_k^(1) and v_k^(2) can even be orthogonal, which means that the two embeddings are dissimilar. In contrast, Figure 2c shows that DURA1 encourages v_k^(1) and v_k^(2) to locate in a small region, which means that the two embeddings of Laue and Schottky are similar. Thus, when (Laue, StudiedIn, HU Berlin, 1902-1903) is valid, DURA1 encourages (Schottky, StudiedIn, HU Berlin, 1909-1912) to have a high score. The prediction is reasonable, as two persons advised by the same person in a similar period of time tend to study in the same university. DURA2 encourages time-dependent entities with similar semantics to have similar embeddings, while similar embeddings of time-dependent entities do not imply similar embeddings of the original entities. Similar to the above discussion, DURA2 encourages the time-dependent entities v_k^(1) T_l and v_k^(2) T_l to locate in a small region. However, as shown in Figure 2e, the embeddings of the time-dependent entities v_k T_l may only be projections of v_k onto the subspace spanned by u_i R_j; the embeddings of the original entities v_k^(1) and v_k^(2) can remain dissimilar, and (Schottky, StudiedIn, HU Berlin, 1902-1903) is therefore not guaranteed to have a high score.

Figure 2: (b), (d) SM models without regularization can get the same score even though the embeddings of the original entities v_k and of the time-dependent entities v_k T_l are dissimilar, respectively. (c) With DURA1, the embeddings of v_k are encouraged to locate in a small region. (e) With DURA2, the embeddings of the time-dependent entities v_k T_l are encouraged to locate in a small region, while the corresponding embeddings of the original entities may be dissimilar.

However, as shown in Section 5.4, both DURA1 and DURA2 correspond to the nuclear 2-norm of 4-order tensors. Thus, both of them can improve the performance of semantic matching models, while we expect DURA1 to outperform DURA2. Experiments in Section 6.2.2 confirm our points. We give an analysis for the case in which the relation and time matrices are diagonal (i.e., TComplEx). First, we review the definition of the nuclear 2-norm for 4D tensors. The nuclear 2-norm of a 4D tensor A ∈ R^{n_1} ⊗ R^{n_2} ⊗ R^{n_3} ⊗ R^{n_4} is

||A||_* = min { Σ_{i=1}^{r} ||u_{1,i}||_2 ||u_{2,i}||_2 ||u_{3,i}||_2 ||u_{4,i}||_2 : A = Σ_{i=1}^{r} u_{1,i} ⊗ u_{2,i} ⊗ u_{3,i} ⊗ u_{4,i}, r ∈ N },

where u_{k,i} ∈ R^{n_k} for k = 1, ..., 4, i = 1, ..., r, and ⊗ denotes the outer product. In the temporal knowledge graph completion problem, the tensor nuclear 2-norm of X̂ is

||X̂||_* = min Σ_{d=1}^{D} ||u_{:d}||_2 ||r_{:d}||_2 ||v_{:d}||_2 ||t_{:d}||_2,

where the minimum is taken over all decompositions X̂ = Σ_{d=1}^{D} u_{:d} ⊗ r_{:d} ⊗ v_{:d} ⊗ t_{:d}, and u_{:d}, r_{:d}, v_{:d}, and t_{:d} are the d-th columns of the head entity matrix U, the relation matrix R, the tail entity matrix V, and the time matrix T, respectively. Then, we have the following theorems for DURA, which correspond to the regularizers (13) and (14), respectively. The proofs are provided in Appendices B and C. Theorem 2. Suppose that X̂_jl = U (R_j ∘ T_l) V^T for j = 1, 2, ..., |R| and l = 1, 2, ..., |T|, where U, V, R_j, T_l are real matrices and R_j and T_l are diagonal. Then, minimizing DURA1 over all such factorizations of X̂ yields the tensor nuclear 2-norm ||X̂||_*. An analogous result (Theorem 3) holds for DURA2. Therefore, the minimization of DURA in the temporal knowledge graph case has the form of the tensor nuclear 2-norm of 4-order tensors. Theorem 3 also explains why DURA2 improves the performance of temporal semantic matching KGEs, even though it cannot encourage entities with similar semantics to have similar representations.

TABLE 1: Hyper-parameters found by grid search. k is the embedding size, b is the batch size, λ is the regularization coefficient, and λ1 and λ2 are weights for different parts of the regularizer.

The knowledge graph completion task is a standard benchmark task to evaluate KGE models. To demonstrate that DURA is an effective and widely applicable regularizer, we conduct extensive experiments on both static and temporal knowledge graph completion datasets.
We introduce the experimental settings for static KGC in Section 6.1.1 and demonstrate the effectiveness of DURA on three benchmark datasets in Section 6.1.2. Then, we compare DURA with other regularizers in Section 6.1.3. We consider three public static knowledge graph datasets-WN18RR [40] , FB15k-237 [18] , and YAGO3-10 [7] for the knowledge graph completion task, which have been divided into training, validation, and testing set in previous works. The statistics of these datasets are shown in Appendix D. We use the cross entropy loss function and the "reciprocal" setting that creates a new triplet (v k , r −1 j , u i ) for each triplet (u i , r j , v k ) [37] , [41] . We use Adagrad [42] as the optimizer, and use grid search to find the best hyperparameters based on the performance on the validation datasets. As we regard the link prediction as a multi-class classification problem and use the cross entropy loss, we can assign different weights for different classes (i.e., tail entities) based on their frequency of occurrence in the training dataset. Specifically, suppose that the loss of a given query (u, r, ?) is ((u, r, ?) , v), where v is the true tail entity, then the weighted loss is w(t) ((u, r, ?), v), where w 0 is a fixed number, #v denotes the frequency of occurrence in the training set of the entity v. For all models on WN18RR and RESCAL on YAGO3-10, we choose w 0 = 0.1 and for all the other cases we choose w 0 = 0. Following [12] , we use entity ranking as the evaluation task. For each triplet (u i , r j , v k ) in the test dataset, the model is asked to answer (u i , r j , ?) and (v k , r −1 j , ?). To do this, we fill the positions of missing entities with candidate entities to create a set of candidate triplets, and then rank the triplets in descending order by their scores. Following the "Filtered" setting in [12] , we then filter out all existing triplets known to be true at ranking. We choose Mean Reciprocal Rank (MRR) and Hits at N (H@N) as the evaluation metrics. Higher MRR or H@N indicates better performance. We find the best hyper-parameters by grid search. Specifically, we search learning rates in {0.1, 0.01}, regularization coefficients in {0, 1 × 10 −3 , 5 × 10 −3 , 1 × 10 −2 , 5 × 10 −2 , 1 × 10 −1 , 5 × 10 −1 }. On WN18RR and FB15k-237, we search batch sizes in {100, 500, 1000} and embedding sizes in {500, 1000, 2000}. On YAGO3-10, we search batch sizes in {256, 512, 1024} and embedding sizes in {500, 1000}. Table 1 shows the best hyperparameters for DURA found by grid search. We demonstrate the performance of DURA on several semantic matching KGE models, including CP [24] , RESCAL [21] , and ComplEx [23] . Note that we reimplement CP, ComplEx, and RESCAL under the "reciprocal" setting [37] , [41] , and obtain better results than the reported performance in the original papers. Sun et al. [43] demonstrate that many existing learning-based models use inappropriate evaluation protocols and suffer from inflating performance. Therefore, for ConvE, ConvKB, and KB-GAT, we take their results from [43] for a fair comparison. The baseline models include distance-based models (TransE [12] , TransH [8] , TransD [44] , RotatE [14] , MuRP [30] , HAKE [15] ), learning-based models (ConvE [18] , Con-vKB [45] , KB-GAT [20] , InteractE [46] , and StAR [47] ), and semantic matching models (CP [24] , RESCAL [21] , ComplEx [23] ). TransE is the most representative translational distance model, while failing to deal with 1-to-N, N-to-1, and N-to-N relations. 
TransH introduces relation-specific hyperplanes to overcome these disadvantages. TransD shares a similar idea with TransH. It introduces relation-specific spaces rather than hyperplanes. TransD also decomposes the projection matrix into a product of two vectors for further simplification. RotatE defines each relation as a rotation from the source entity to the target entity in the complex vector space and naturally, it is able to model and infer various relation patterns. MuRP embeds multirelational graph in the Poincaré ball model of hyperbolic space to capture multiple simultaneous hierarchies. HAKE maps entities into the polar coordinate system to model semantic hierarchies, which are common in real-world applications. ConvE introduces a multi-layer convolutional network model to learn more expressive features with less parameters. ConvKB represents triples as 3-column matrixs, and employs a convolutional neural network to capture global relationships and transitional characteristics between entities and relations. KB-GAT proposes an attention-based feature embedding that captures both entity and relation features to cover the complex and hidden information in the local neighborhood surrounding a triple. InteractE improves ConvE by increasing feature interactions. StAR follows the textual encoding paradigm and augments it with graph embedding techniques. 2: Evaluation results on WN18RR, FB15k-237 and YAGO3-10 datasets. We reimplement CP, ComplEx, and RESCAL using the "reciprocal" setting [37] , [41] , which leads to better results than the reported results in the original paper. * indicates that the model use external textual information. Table 2 shows the evaluation results. Overall, DURA significantly improves the performance of all considered SM models. The results demonstrate that without regularization, semantic matching models perform worse than state-of-the-art distance-based models on WN18RR, e.g., RotatE, MuRP, and HAKE. When incorporating with DURA, these semantic matching models achieve competitive performance with distance-based models. Moreover, DURA further improve the performance gain of semantic matching models on FB15k-237 and YAGO3-10. Compared with learning-based models, semantic matching models with DURA outperforms all of them without using external textual information. Notably, semantic matching models with DURA even outperform StAR (Self-Adp) on FB15k-237, which use external textual information to enhance the expressiveness of KGE. Generally, models with more parameters and datasets with smaller sizes imply a larger risk of overfitting. Among the three datasets, WN18RR has the smallest size of only 11 kinds of relations and around 80k training samples. Therefore, the improvements brought by DURA on WN18RR are expected to be larger compared with other datasets, which is consistent with the experiments. As stated in [48] , RESCAL is a more expressive model, but it is prone to overfit on small-and medium-sized datasets because it represents relations with much more parameters. For example, on WN18RR dataset, RESCAL gets an H@10 score of 0.493, which is lower than ComplEx (0.522). The advantage of its expressiveness does not show up at all. Incorporated with DURA, RESCAL gets an 8.4% improvement on H@10 and finally attains 0.577, outperforming all compared models. On larger datasets such as YAGO3-10, overfitting also exists but will be insignificant. 
Nonetheless, DURA still leads to consistent improvements, showing its ability to prevent models from overfitting. We compare DURA to the squared Frobenius norm regularizer and the tensor nuclear 3-norm (N3) regularizer [37]. The squared Frobenius norm regularizer is given by

g(X̂) = Σ_{j=1}^{|R|} (||U||_F^2 + ||R_j||_F^2 + ||V||_F^2).

The N3 regularizer is given by

g(X̂) = Σ_{(u_i, r_j, v_k) ∈ S} (||u_i||_3^3 + ||r_j||_3^3 + ||v_k||_3^3),

where ||·||_3 denotes the L3 norm of vectors. The N3 regularizer is only suitable for models in which R_j is diagonal, such as CP or ComplEx. We implement both the squared Frobenius norm (FRO) and N3 regularizers in the weighted way as stated in [37]. Table 3 shows the performance of the three regularizers on three popular models: CP, ComplEx, and RESCAL. Note that when the considered model is RESCAL, we only compare DURA to the squared Frobenius norm regularization, as N3 does not apply to it. For CP and ComplEx, DURA brings consistent improvements compared to FRO and N3 on all datasets. Specifically, on FB15k-237, compared to CP-N3, CP-DURA gets an improvement of 0.013 in terms of MRR. Even for the previous state-of-the-art SM model ComplEx, DURA brings further improvements over the N3 regularizer. Incorporated with FRO, RESCAL performs worse than the vanilla model, which is consistent with the results in [49]. However, RESCAL-DURA brings significant improvements over RESCAL. All the results demonstrate that DURA is more widely applicable than N3 and more effective than the squared Frobenius norm regularizer. We also conduct extensive experiments on temporal knowledge graph completion datasets. Specifically, we introduce the experimental settings for temporal KGC in Section 6.2.1 and show the effectiveness of DURA on three benchmark datasets in Section 6.2.2. We consider three public temporal knowledge graph datasets, ICEWS14 [50], ICEWS05-15 [50], and YAGO15k [51], for the knowledge graph completion task, which have been divided into training, validation, and testing sets in previous works. The statistics of these datasets are shown in Appendix D. Following [29], we use a temporal regularizer for TComplEx that smooths the timestamp embeddings by penalizing the differences between the embeddings of neighboring timestamps; here diag^{-1}(·) gives the vector made of the diagonal elements of a matrix. For TRESCAL, we propose to use the same temporal regularizer to smooth the timestamp embeddings. Moreover, we find it better to assign different weights to the parts of the regularizer involved with relations. That is, the optimization problem augments the data-fitting loss with the temporal smoothing term and the DURA terms, weighted by fixed hyperparameters λ, λ1, λ2, λ3, λ4 ≥ 0. We denote by DURA2 the configuration that uses only regularizer (14) (i.e., λ3 = λ4 = 0) and by DURA1 the configuration that uses only regularizer (13) (i.e., λ1 = λ2 = 0). We use grid search to find the best hyperparameters based on the performance on the validation datasets. We search λ ∈ {1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4} for all experiments. We search λ1, λ2 ∈ {1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3} for DURA2, and λ3 ∈ {1e2, 3e1, 1e1, 3, 1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4}, λ4 ∈ {1e3, 3e2, 1e2, 3e1, 1e1, 3, 1, 3e-1, 1e-1, 3e-2} for DURA1. Table 4 shows the best hyperparameters found by grid search.

TABLE 4: Hyper-parameters found by grid search. λ is the regularization coefficient, λ1 and λ2 are weights for different parts of the regularizer for DURA2, and λ3 and λ4 are weights for different parts of the regularizer for DURA1 (columns grouped as λ / λ1 / λ2 / λ3 / λ4 for each of the three datasets; "--" means not used).
TComplEx-DURA1: 1e-2 -- -- 1e-3 1e2 | 1e-3 -- -- 3e-2 30 | 1e-2 -- -- 1e-2 1e2
TComplEx-DURA2: 1e-1 1e-1 3e-2 -- -- | 1e-1 1e-2 3e-2 -- -- | 1e-2 3e-2 3e-2 -- --
TRESCAL-DURA1: 1e-2 -- -- 1e1 1e1 | 1e-2 -- -- 1e1 1 | 1e-2 -- -- 1e-1 1e1
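The exact norm and normalization of the temporal smoothing term are not recoverable from the text above, so the following sketch, with an adjustable exponent p, only illustrates the idea of penalizing differences between neighboring timestamp embeddings; it is not the authors' implementation:

```python
import torch

def timestamp_smoothness(T_emb, p=2):
    """T_emb: (|T|, d) matrix whose l-th row is the timestamp embedding
    (for TRESCAL, the vector diag^{-1}(T_l) of diagonal elements).
    Penalizes large jumps between consecutive timestamps so that close
    timestamps obtain similar representations."""
    diffs = T_emb[1:] - T_emb[:-1]
    return diffs.abs().pow(p).sum() / max(T_emb.shape[0] - 1, 1)

# One possible weighting of the full objective:
# total_loss = fit_loss + lam * timestamp_smoothness(T_emb) + weighted DURA terms
```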
We compare DURA with the popular squared Frobenius norm and tensor nuclear 3-norm (N3) regularizers [29] on the state-of-the-art temporal semantic matching KGE model TComplEx and our proposed TRESCAL method. We also include four recent distance based models: TTransE [16], TA-TransE [52], RFTE-HAKE [53], and BoxTE [54]. TTransE [16] proposes a time-aware knowledge graph completion model by using temporal order information among facts. TA-TransE [52] utilizes recurrent neural networks to learn time-aware representations. RFTE [53] is a framework to transplant static KGE models to temporal KGE models; it treats the sequence of graphs as a Markov chain. BoxTE [54] assumes that time is a filter that helps pick out the answers that are correct during certain periods; therefore, it introduces boxes to represent the set of answer entities to a time-agnostic query. Table 5 shows the effectiveness of DURA. Note that DURA2, as shown in formulation (14), is not applicable to TRESCAL, since matrix multiplication and element-wise multiplication are not commutative. Moreover, the N3 regularizer is not applicable to TRESCAL, since this model is not a CP tensor factorization. Table 5 shows that DURA effectively improves the performance of TComplEx and TRESCAL. With DURA, these semantic matching models outperform the state-of-the-art distance based models in terms of MRR. The following experiments are conducted on FB15k-237. As knowledge graphs usually contain billions of entities, the storage of entity embeddings faces severe challenges. Intuitively, if embeddings are sparse, that is, most of the entries are zero, we can store them with less storage. Therefore, the sparsity of the generated entity embeddings becomes crucial for real-world applications. Generally, few entries of entity embeddings are exactly equal to 0 after training, which means that it is hard to obtain sparse entity embeddings directly. However, when we score triples using the trained model, the embedding entries with values close to 0 make only minor contributions to the score of a triplet. If we set the embedding entries close to 0 to be exactly 0, we can transform the embeddings into sparse ones. Therefore, there is a trade-off between the sparsity and the performance decrement. We define the following λ-sparsity to indicate the proportion of entries that are close to zero:

s_λ = ( Σ_{i=1}^{I} Σ_{d=1}^{D} 1_{[-λ, λ]}(E_{id}) ) / (I × D),        (15)

where E ∈ R^{I×D} is the entity embedding matrix, E_{id} is the entry in the i-th row and d-th column of E, I is the number of entities, D is the embedding dimension, and 1_C(x) is the indicator function taking the value 1 if x ∈ C and 0 otherwise. Following Equation (15), we select the entries whose absolute values are less than a threshold λ and set them to 0. Note that for any given s_λ, we can always find a proper threshold λ to approximate it, as the formula is non-decreasing with respect to λ.
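A minimal sketch of how λ-sparsity and the thresholding step can be computed (illustrative code assuming a dense NumPy embedding matrix; SciPy's CSR format is used for the storage comparison discussed below):

```python
import numpy as np
from scipy.sparse import csr_matrix

def lambda_sparsity(E, lam):
    # Equation (15): fraction of entries whose absolute value is at most lam.
    return float(np.mean(np.abs(E) <= lam))

def sparsify(E, target_sparsity):
    # Choose the threshold whose lambda-sparsity approximates the target,
    # zero out the small entries, and store the result in CSR format.
    lam = np.quantile(np.abs(E), target_sparsity)
    E_zeroed = np.where(np.abs(E) <= lam, 0.0, E)
    return csr_matrix(E_zeroed), lam

E = np.random.randn(10000, 200)          # toy entity-embedding matrix
E_csr, lam = sparsify(E, 0.7)
print(lambda_sparsity(E, lam))            # close to 0.7
print(E_csr.data.nbytes < E.nbytes)       # the sparse storage is smaller
```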
Figure 3 shows the effect of the entity embeddings' λ-sparsity on MRR. The results show that DURA causes a much gentler performance decrement as the embedding sparsity increases. In Figure 3a, incorporated with DURA, CP maintains an MRR of 0.366 unchanged even when 60% of the entries are set to 0. More surprisingly, when the sparsity reaches 70%, CP-DURA can still outperform CP-N3 with zero sparsity. For RESCAL, when 80% of the entries are set to 0, RESCAL-DURA still has an MRR of 0.341, which significantly outperforms vanilla RESCAL, whose MRR decreases from 0.352 to 0.286. In a word, incorporated with the DURA regularizer, the performance of CP and RESCAL remains comparable to the state-of-the-art models, even when 70% of the entity embeddings' entries are set to 0. Following [55], we store the sparse embedding matrices using the compressed sparse row (CSR) or compressed sparse column (CSC) format. Experiments show that DURA brings about 65% lower storage costs for entity embeddings when 70% of the entries are set to 0. We conduct ablation studies to show the effectiveness of the combination of (5) and (6). The results are shown in Table 6, where "-I" and "-II" represent the regularizers defined in formulas (5) and (6), respectively.

TABLE 6: Ablation results on the combined and uncombined regularizers. "-I" and "-II" represent the regularizers defined in formulas (5) and (6), respectively.

On WN18RR, the uncombined regularizers DURA-I and DURA-II can also improve the performance of the semantic matching models, but the combination brings further improvements. On FB15k-237, DURA-I and DURA-II even lead to a decrement in terms of MRR. One possible reason is that FB15k-237 has a large number of 1-N and N-1 relations (refer to [8] for the definitions of relation types) and they have different preferences for the uncombined regularizers. Figure 5 shows the performance of RESCAL with different regularizers on different relation types. Compared with vanilla RESCAL, DURA-I only improves the MRR for 1-1 and 1-N relations, while DURA-II only improves the MRR for N-1 relations, which validates our analysis. DURA significantly outperforms both DURA-I and DURA-II on all relation types. We visualize the entity embeddings using T-SNE [56] to show that DURA encourages entities with similar semantics to have similar embeddings. Without loss of generality, we use tail entities for visualization. Suppose that (u_i, r_j) is a query, where u_i and r_j are a head entity and a relation, respectively. An entity v_k is an answer to a query (u_i, r_j) if (u_i, r_j, v_k) is valid. We randomly selected 10 queries in FB15k-237, each of which has more than 50 answers. Then, we use T-SNE to visualize the answers' embeddings generated by CP and CP-DURA. Figure 6 shows the visualization results. Each entity is represented by a 2D point, and points in the same color represent tail entities with the same (u_i, r_j) context (i.e., query). Figure 6 shows that, with DURA, entities with the same (u_i, r_j) contexts are indeed assigned more similar representations, which verifies the claims in Section 4.1.

Figure 6: A point represents a tail entity; points in the same color represent tail entities that have the same (u_i, r_j) context.

(Figure panels: (c) λ4 of DURA1 for TComplEx; (f) λ2 of DURA2 for TComplEx. The x-axis is the base-10 logarithm of each hyper-parameter and the y-axis is the mean reciprocal rank (MRR) on test data.)

For static KGE, the hyper-parameters mainly include the embedding size k, the batch size b, and the regularization coefficients λ, λ_i (i = 1, 2); for temporal KGE, the hyper-parameters mainly include the regularization coefficients λ, λ_i (i = 1, 2, 3, 4). We provide performance curves on different datasets for these hyper-parameters. Note that when exploring the effect of a specific hyper-parameter, we fix the other hyper-parameters at their best values. As shown in Figure 4, the performance is stable when the embedding and batch sizes vary. Figure 8 demonstrates that the performance of static KGE is relatively insensitive to λ_i.
However, when the regularization coefficients λ are too large, the performance drops quickly. This is expected, since overly strong regularization weakens the models' ability to model complex relations. As shown in Figure 7, the performance of temporal KGE is relatively insensitive to λ and λ_i when the KGE model is TComplEx. However, the performance of TRESCAL drops quickly when these regularization coefficients are large. This suggests that we should choose the hyperparameters for TRESCAL carefully. We propose a widely applicable and effective regularizer, namely DURA, for semantic matching knowledge graph embedding models. The formulation of DURA is based on the observation that, for an existing semantic matching KGE model (primal), there is often another distance based KGE model (dual) closely associated with it. Extensive experiments show that DURA yields consistent and significant improvements on both static and temporal knowledge graph embedding benchmark datasets.

In the same manner, we know that the corresponding inequality holds, where the equality holds if (u_{:d}, r_{:d}, v_{:d}, t_{:d}) = (u*_{:d}, r*_{:d}, v*_{:d}, t*_{:d}). Therefore, the minimum of the left side is ||X̂||_*. When R_j is diagonal, u_i R_j = u_i ∘ r_j, and hence

Σ_{j=1}^{|R|} (||U R_j||_F^2 + ||V||_F^2) = Σ_{j=1}^{|R|} Σ_{i=1}^{I} ||u_i ∘ r_j||_2^2 + |R| Σ_{d=1}^{D} ||v_{:d}||_2^2 = |R| Σ_{d=1}^{D} ||v_{:d}||_2^2 + Σ_{j=1}^{|R|} Σ_{i=1}^{I} Σ_{d=1}^{D} u_{id}^2 r_{jd}^2.

Lemma 1. The nuclear p-norm can also be derived as

||A||_{p,*} = min { (1/D) Σ_{i=1}^{r} Σ_{k=1}^{D} ||u_{k,i}||_p^D : A = Σ_{i=1}^{r} u_{1,i} ⊗ ... ⊗ u_{D,i}, r ∈ N },        (16)

where u_{k,i} ∈ R^{n_k} for k = 1, ..., D, i = 1, ..., r, and ⊗ denotes the outer product. Proof. Suppose that the right side of Equation (16) attains its minimum at R and (u_{k,i}) for k = 1, ..., D, i = 1, ..., R; i.e., (u_{k,i}) for k = 1, ..., D, i = 1, ..., R also attains the nuclear p-norm ||A||_{p,*}, where A = Σ_{i=1}^{R} u_{1,i} ⊗ ... ⊗ u_{D,i}. It follows from the inequality of arithmetic and geometric means that the right side is lower-bounded by ||A||_{p,*}. As the left side attains the minimum ||A||_{p,*} at R and (u_{k,i}) for k = 1, ..., D, i = 1, ..., R, the minimum of the left side is ||A||_{p,*}.

For static knowledge graph completion, we consider three public static knowledge graph datasets, WN18RR [40], FB15k-237 [18], and YAGO3-10 [7], which have been divided into training, validation, and testing sets in previous works. The statistics of these datasets are shown in Table 7. For temporal knowledge graph completion, we consider three public temporal knowledge graph datasets, ICEWS14 [50], ICEWS05-15 [50], and YAGO15k [51], which have been divided into training, validation, and testing sets in previous works. ICEWS14 and ICEWS05-15 are extracted from the Integrated Conflict Early Warning System (ICEWS). YAGO15k is a modification of FB15k. The statistics of these datasets are shown in Table 8. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q:

MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i.

The Hits@N is the ratio of ranks that are no larger than N:

H@N = (1/|Q|) Σ_{i=1}^{|Q|} 1_{x ≤ N}(rank_i),

where 1_{x ≤ N}(rank_i) = 1 if rank_i ≤ N and 0 otherwise.
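A small sketch of how the filtered ranks and the two metrics above can be computed (illustrative code, not tied to the authors' evaluation scripts):

```python
import numpy as np

def filtered_rank(scores, true_idx, known_true):
    # scores: (num_entities,) plausibility scores for one query (u, r, ?);
    # known_true: indices of all entities known to answer this query.
    s = scores.astype(float).copy()
    s[[i for i in known_true if i != true_idx]] = -np.inf   # "Filtered" setting
    return int(np.sum(s > s[true_idx])) + 1

def mrr(ranks):
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at(ranks, n):
    return float(np.mean(np.asarray(ranks) <= n))

ranks = [1, 3, 2, 10]
print(mrr(ranks), hits_at(ranks, 3))   # about 0.483, 0.75
```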
In distance based models, the commonly used p is either 1 or 2. When p = 2, DURA takes the form of Equation (7) in the main text. If p = 1, we cannot expand the squared score function of the associated DB models as in Equation (4). Thus, the induced regularizer takes the form of

Σ_{(u_i, r_j, v_k) ∈ S} ||u_i R_j - conj(v_k)||_1 + ||conj(v_k) conj(R_j)^T - u_i||_1.

The above regularizer with p = 1 (Reg p1) does not give an upper bound on the tensor nuclear 2-norm as in Theorem 1. Table 10 shows that DURA significantly outperforms Reg p1 on WN18RR and FB15k-237. Therefore, we choose p = 2. In Table 9, we list the ten queries used in the T-SNE visualization (Section 6.5 in the main text). Note that a query is represented as (u, r, ?), where u denotes the head entity and r denotes the relation.

TABLE 9: Queries used in the T-SNE visualization.
4 (doctoral degree, /education/educational degree/people with this degree./education/education/major field of study, ?)
5 (broccoli, /food/food/nutrients./food/nutrition fact/nutrient, ?)
6 (shooting sport, /olympics/olympic sport/athletes./olympics/olympic athlete affiliation/country, ?)
7 (synthpop, /music/genre/artists, ?)
8 (Italian American, /people/ethnicity/people, ?)
9 (organ, /music/performance role/track performances./music/track contribution/role, ?)
10 (funk, /music/genre/artists, ?)

REFERENCES
ERNIE: Enhanced language representation with informative entities.
Knowledge graph embedding based question answering.
RippleNet: Propagating user preferences on the knowledge graph for recommender systems.
The more you know: Using knowledge graphs for image classification.
Freebase: A collaboratively created graph database for structuring human knowledge.
Wikidata: A free collaborative knowledgebase.
YAGO3: A knowledge base from multilingual Wikipedias.
Knowledge graph embedding by translating on hyperplanes.
Learning entity and relation embeddings for knowledge graph completion.
Nuclear norm of higher-order tensors.
Duality-induced regularizer for tensor factorization based knowledge graph completion.
Translating embeddings for modeling multi-relational data.
Learning structured embeddings of knowledge bases.
RotatE: Knowledge graph embedding by relational rotation in complex space.
Learning hierarchy-aware knowledge graph embeddings for link prediction.
Towards time-aware knowledge graph completion.
HyTE: Hyperplane-based temporally aware knowledge graph embedding.
Convolutional 2D knowledge graph embeddings.
A capsule network-based embedding model for knowledge graph completion and search personalization.
Learning attention-based embeddings for relation prediction in knowledge graphs.
A three-way model for collective learning on multi-relational data.
Embedding entities and relations for learning and inference in knowledge bases.
Complex embeddings for simple link prediction.
The expression of a tensor or a polyadic as a sum of products.
Embedding models for episodic knowledge graphs.
TuckER: Tensor factorization for knowledge graph completion.
Temporal analysis of semantic graphs using ASALSAN.
A model for the analysis of asymmetric data in marketing research.
Tensor decompositions for temporal knowledge base completion.
Multi-relational Poincaré graph embeddings.
node2vec: Scalable feature learning for networks.
DeepWalk: Online learning of social representations.
LINE: Large-scale information network embedding.
Augmentation invariant and instance spreading feature for softmax embedding.
Representation learning with contrastive predictive coding.
A simple framework for contrastive learning of visual representations.
Canonical tensor decomposition for knowledge base completion.
Regularizing knowledge graph embeddings via equivalence and inversion axioms.
Improving knowledge graph embedding using simple constraints.
Observed versus latent features for knowledge base and text inference.
Simple embedding for link prediction in knowledge graphs.
Adaptive subgradient methods for online learning and stochastic optimization.
A re-evaluation of knowledge graph completion methods.
Knowledge graph embedding via dynamic mapping matrix. In Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.
A novel embedding model for knowledge base completion based on convolutional neural network.
InteractE: Improving convolution-based knowledge graph embeddings by increasing feature interactions.
Structure-augmented text representation learning for efficient knowledge graph completion.
Knowledge graph embedding: A survey of approaches and applications.
You can teach an old dog new tricks! On training knowledge graph embeddings.
Learning sequence encoders for temporal knowledge graph completion.
Learning sequence encoders for temporal knowledge graph completion.
RTFE: A recursive temporal fact embedding framework for temporal knowledge graph completion.
Temporal knowledge graph completion using box embeddings.
Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
Visualizing data using t-SNE.

Suppose that ||X̂||_* is attained at D and (u*_{:d}, r*_{:d}, v*_{:d}, t*_{:d}) for d = 1, ..., D; i.e., (u*_{:d}, r*_{:d}, v*_{:d}, t*_{:d}) for d = 1, ..., D also attains the nuclear p-norm ||X̂||_*. It follows from the inequality of arithmetic and geometric means that