key: cord-0117425-qg5m6x66 authors: Jain, Prachi; Rathi, Sushant; Mausam; Chakrabarti, Soumen title: Knowledge Base Completion: Baseline strikes back (Again) date: 2020-05-02 journal: nan DOI: nan sha: db542d728be9c1d112a8e1ca60bb315e61b49dd5 doc_id: 117425 cord_uid: qg5m6x66

Knowledge Base Completion (KBC) has been a very active research area recently, with multiplicative models generally outperforming additive and other deep learning methods such as GNN-, CNN-, and path-based models. Several recent KBC papers propose architectural changes, new training methods, or even a new problem reformulation, and evaluate their methods on the standard benchmark datasets FB15k, FB15k-237, WN18, WN18RR, and Yago3-10. Recently, some papers have discussed how 1-N scoring can speed up training and evaluation. In this paper, we show that simply applying this training regime to a basic model like ComplEx gives near-SOTA performance on all of these datasets; we call this model COMPLEX-V2. We also highlight how various multiplicative methods recently proposed in the literature benefit from this trick and become indistinguishable in terms of performance on most datasets. This paper calls for a reassessment of their individual value in light of these findings.

A Knowledge Base (KB) is a collection of world-knowledge facts in the form of triples in which a subject (s) is related to an object (o) via a relation (r), e.g., ⟨COVID-19, originated-in, Bats⟩. Most KBs are incomplete; the Knowledge Base Completion (KBC) task infers missing facts from the known triples, making KBs more effective for end tasks like search and question answering. Various translation (Bordes et al., 2013; Ji et al., 2015; Sun et al., 2019a), multiplicative (Yang et al., 2015; Trouillon et al., 2016a; Lacroix et al., 2018; Jain et al., 2018; Balažević et al., 2019; Kazemi and Poole, 2018), and deep learning based (Graph Neural Networks (Nathani et al., 2019) and Convolutional Neural Networks (Dettmers et al., 2018; Nguyen et al., 2018; Schlichtkrull et al., 2018)) approaches for KBC have been discussed in the literature. The scoring function of each of these methods takes the embeddings of s, r, and o and generates a score indicating the truthfulness of the fact.

Dettmers et al. (2018) showed that 1-N scoring, i.e., scoring all possible fact variations (s, r, ∗) and (∗, r, o) at the same time, can improve model performance and also speed up convergence. We leverage this idea to compute a full softmax (an exact probability), instead of approximating it with negative sampling (a standard practice in the KBC literature), to lift the performance of older models to match that of recent SOTA models. We also find that this training method is difficult to incorporate into some models, in particular translation models such as TransE and RotatE, due to memory constraints. However, a majority of recently released models, as well as older models such as SimplE and ComplEx, improve their performance significantly when trained with a full softmax (we refer to the improved models as SIMPLE-V2 and COMPLEX-V2 respectively). To our surprise, COMPLEX-V2 outperforms or is very close in performance to all recent state-of-the-art KBC models! In light of these findings we draw two conclusions: (1) there is a need to reassess the value offered by recent KBC models against the older CX model, and (2) any new KBC model must compare against the baseline of COMPLEX-V2 to demonstrate empirical gains.
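To make the 1-N scoring idea concrete, the sketch below scores a batch of (s, r, ?) queries against every entity with a single matrix multiplication, using a ComplEx-style factorization in PyTorch. This is a minimal illustrative sketch, not the released implementation; all tensor names and shapes are our own assumptions.

```python
import torch

def complex_1_to_N_scores(e_s, r, E_all):
    """1-N scoring sketch for ComplEx: score (s, r, ?) against ALL entities at once.

    e_s   : [B, 2D]  real||imag parts of the subject embeddings (batch of queries)
    r     : [B, 2D]  real||imag parts of the relation embeddings
    E_all : [N, 2D]  real||imag parts of every entity embedding in the KB
    returns [B, N] scores, one column per candidate object entity.
    """
    D = e_s.shape[1] // 2
    s_re, s_im = e_s[:, :D], e_s[:, D:]
    r_re, r_im = r[:, :D], r[:, D:]
    o_re, o_im = E_all[:, :D], E_all[:, D:]

    # Re(<e_s, r, conj(e_o)>) expanded into real-valued products,
    # then pushed through a single matmul over all N candidate objects.
    part_re = s_re * r_re - s_im * r_im               # [B, D]
    part_im = s_re * r_im + s_im * r_re               # [B, D]
    return part_re @ o_re.t() + part_im @ o_im.t()    # [B, N]
```

The (∗, r, o) direction can be handled analogously, e.g., by adding reverse relations and issuing (o, r⁻¹, ?) queries, so that every training fact is scored against all |E| candidate entities in both directions.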
Moreover, wherever it scales, all KBC models should use the 1-N scoring training regime for superior performance. We will release an open-source implementation of the models and experiments discussed in this paper for further exploration.

We are given an incomplete KB with entity set E and relation set R. The KB also contains T = {⟨s, r, o⟩}, a set of known valid tuples, each with subject and object entities s, o ∈ E and relation r ∈ R. The goal of a KBC model is to predict the validity of any tuple not present in T. Previous approaches fit continuous representations (loosely, "embeddings") to entities and relations, so that the belief in the veracity of ⟨s, r, o⟩ can be estimated as an algebraic expression (called the scoring function φ) involving those embeddings. The scoring functions of the models considered in this work are outlined in Table 1.

Translation Models: The embeddings of s, o, and r are denoted e_s, e_o, and r respectively, and the three are combined using an additive function. Rows 1 and 2 of Table 1 represent two different types of translation models.

Multiplicative Models: Rows 3, 4, and 5 of Table 1 represent three popular multiplicative models, in which the score of a fact ⟨s, r, o⟩ is obtained by a product of the embeddings. Our work focuses on multiplicative models. ComplEx (Trouillon et al., 2016b), abbreviated CX, is a popular multiplicative model. It embeds s, r, o as vectors of complex elements e_s, r, e_o ∈ C^D. CX defines the score of a fact (s, r, o) as φ(s, r, o) = ℜ(⟨e_s, r, e_o^⋆⟩), where ⟨·, ·, ·⟩ is a 3-way inner product, e_o^⋆ is the complex conjugate of e_o, and ℜ(c) is the real part of c ∈ C. Note that for CX, e_s and e_o come from a single entity embedding space. For SimplE (Kazemi and Poole, 2018), h_s ∈ H and t_o ∈ T, where H and T are separate representations of the entities in E for the head and tail positions respectively. It also learns separate representations for each relation r ∈ R and its reciprocal r^-1 ∈ R_inv. Lacroix et al. (2018) proposed a popular variant of the CX model, ComplEx-N3 (CX-N3). This model uses weighted L3 regularization and a modified training objective that accommodates reciprocal relations.

Deep Learning Models use a neural network to learn how the entity and relation embeddings interact when computing the score of a fact. Row 6 of Table 1 lists one such model.

Loss Functions: In this paper we use two popular loss functions, the Cross Entropy (CE) loss and the Binary Cross Entropy (BCE) loss. The cross-entropy (log-likelihood) loss first computes the probability of predicting a response o for a query (s, r, ?) as

Pr(o|s, r; θ) = exp(φ(s, r, o)) / Σ_{o′} exp(φ(s, r, o′)).

The sum over o′ in the denominator is generally approximated by contrastive (negative) sampling, so the left-hand side is not a formal probability. However, the 1-N scoring trick allows us to use all entities in the denominator, making the computation of a full softmax (exact probability) possible. A similar term is added for Pr(s|r, o). The log-likelihood loss minimizes

−Σ_{⟨s,r,o⟩∈P} [log Pr(o|s, r; θ) + log Pr(s|o, r; θ)],

where the summation is over P, the set of all positive facts. In this paper, when the CE loss uses a full softmax to compute the probability of the response, we call it full-CE, and sampled-CE otherwise.

The BCE loss sums, over facts (s, r, o) ∈ T, the logistic term log(1 + exp(−Y_sro · φ(s, r, o))), where Y_sro is 1 if the fact (s, r, o) is true and −1 otherwise. In the case of sampled-BCE, T is the set of all positive facts along with the negative samples. For full-BCE, T is the set of all possible facts of the form (s, r, ∗) and (∗, r, o). The full-BCE loss can be computed over all entities simultaneously, using the 1-N scoring trick described for the full-CE loss.
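As a minimal sketch (not the authors' exact code) of the two objectives above, the snippet below computes full-CE and full-BCE over the [B, N] score matrix produced by 1-N scoring; the label smoothing, regularization, and reciprocal-relation bookkeeping of specific models are omitted.

```python
import torch
import torch.nn.functional as F

def full_ce_loss(scores, gold_o):
    """Full cross-entropy: softmax over ALL entities (exact probability),
    instead of over a handful of sampled negatives.
    scores : [B, N] 1-N scores for a batch of (s, r, ?) queries
    gold_o : [B]    index of the gold object entity for each query
    """
    return F.cross_entropy(scores, gold_o)

def full_bce_loss(scores, gold_o):
    """Full binary cross-entropy: for each query, the gold entity gets label 1
    and every other entity gets label 0, so all N candidates enter the loss.
    """
    targets = torch.zeros_like(scores)
    targets[torch.arange(scores.size(0)), gold_o] = 1.0
    return F.binary_cross_entropy_with_logits(scores, targets)

# A training step would add the analogous term for the (?, r, o) direction, e.g.
# loss = full_ce_loss(scores_o, gold_o) + full_ce_loss(scores_s, gold_s)
```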
The models: CX was introduced by Trouillon et al. (2016a), and SimplE by Kazemi and Poole (2018). In this paper we train these models using the full-CE loss function, without any special bells or whistles. We call CX trained with full-CE COMPLEX-V2, and SimplE trained with full-CE SIMPLE-V2.

Recently, Kadlec et al. (2017) cast doubt on the claim that the performance improvements of recent KBC models are due to architectural changes rather than hyperparameter tuning or different training objectives. Their work focused on hyperparameter optimization, tuning a number of impactful hyperparameters such as learning rate, batch size, regularization penalty, and embedding dimension. They also showed that the CE loss is better than the max-margin loss for some models. In this work, we focus on a complementary aspect and highlight that using a full softmax in the CE loss, or scoring all possible combinations of (s, r, ∗) and (∗, r, o) at the same time in the BCE loss, significantly improves the performance of multiplicative models.

Datasets: We evaluate on a comprehensive set of five standard KBC datasets: FB15K, WN18, YAGO3-10, FB15K-237 and WN18RR (Bordes et al., 2013; Mahdisoltani et al., 2015; Toutanova et al., 2015; Dettmers et al., 2018). We retain the exact train, dev and test folds used in previous works.

Metrics: Link prediction test queries are of the form (s, r, ?), each with a gold answer o∗. The cases (?, r, o) and (s, r, ?) are symmetric and receive analogous treatment. A KBC model outputs a list of entities o ∈ E ordered by descending score. We report MRR (Mean Reciprocal Rank) and the fraction of queries where o∗ is recalled within rank 1 and rank 10 (HITS@1 and HITS@10). The filtered evaluation (Garcia-Duran et al., 2015) removes other valid train, dev, or test tuples ranking above (s, r, o∗), to avoid unreasonably penalizing the model for predicting another correct answer (a code sketch of this protocol appears after the further implementation details at the end of the paper).

Implementation details: We reused the original implementations and the best hyper-parameters released for RotatE (Sun et al., 2019a), TuckER (Balažević et al., 2019), SimplE (Kazemi and Poole, 2018) and ConvE (Dettmers et al., 2018). We also re-implemented CX (Trouillon et al., 2016a), CX-N3 (Lacroix et al., 2018) and SimplE (Kazemi and Poole, 2018) in PyTorch. AdaGrad was used for fitting model weights, run for up to 1000 epochs for all losses, with early stopping on the dev fold.

Results: Table 2 shows that models trained with full-CE or full-BCE achieve very similar performance, questioning the value of the new variations proposed in the form of model architecture, problem reformulation, or regularization. A basic model trained with full-CE, COMPLEX-V2, shows near-SOTA performance, making it a potential baseline for future KBC models. For training speedups, most previous works (Yang et al., 2015; Kazemi and Poole, 2018; Sun et al., 2019a) randomly sample negative triples to obtain approximate probability scores. However, 1-N scoring enables us to compute the exact probability (using a full softmax) in an efficient manner. We emphasize that computing exact probabilities, and thereby respecting the original distribution of the data during training, improves the performance of the models significantly.

In this paper, we performed an extensive re-examination of recent KBC techniques. We find that multiplicative models can benefit significantly from the full-BCE and full-CE training losses, and that the relative performance gaps between models trained in this manner are small. Moreover, COMPLEX-V2 showed SOTA or near-SOTA performance on all datasets, making it a strong baseline for other models to use.
References:
Tucker: Tensor factorization for knowledge graph completion
Translating embeddings for modeling multi-relational data
Convolutional 2D knowledge graph embeddings
Composing relationships with translations
Type-sensitive knowledge base inference without explicit type supervision
Knowledge graph embedding via dynamic mapping matrix
Knowledge base completion: Baselines strike back
SimplE embedding for link prediction in knowledge graphs
Canonical tensor decomposition for knowledge base completion
YAGO3: A knowledge base from multilingual Wikipedias
Learning attention-based embeddings for relation prediction in knowledge graphs
A novel embedding model for knowledge base completion based on convolutional neural network
Modeling relational data with graph convolutional networks
RotatE: Knowledge graph embedding by relational rotation in complex space
A re-evaluation of knowledge graph completion methods
Representing text for joint embedding of text and knowledge bases
Complex embeddings for simple link prediction
Complex embeddings for simple link prediction
Embedding entities and relations for learning and inference in knowledge bases

Acknowledgments: This work is supported by an IBM AI Horizons Network grant, an IBM SUR award, grants from Google, Bloomberg and 1MG, and a Visvesvaraya faculty award by the Govt. of India. We thank the IIT Delhi HPC facility for compute resources. Soumen Chakrabarti is supported by grants from IBM and Amazon.

Further implementation details: CX, CX-N3, and RotatE used 2000-dimensional complex vectors. For TuckER we used 250-dimensional embeddings, in order to calibrate its size to that of the other models. Note that TuckER learns a large 3-D core tensor of size d^3, where d is the embedding dimension. However, we obtained the best results with 200-dimensional embeddings for TuckER, hence we report those scores in Table 2. For the YAGO3-10 dataset we used 1000-dimensional embeddings for CX, CX-N3, and RotatE, while for TuckER we used 200-dimensional embeddings. Note that training ConvE and SimplE with 2000 dimensions under the originally reported settings did not result in any performance gain, so we report the original results from their papers.
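Finally, as referenced in the Metrics discussion, below is a minimal sketch of the filtered ranking protocol used to compute MRR and HITS@k. The set of all known gold objects for each (s, r) pair, gathered from the train, dev, and test folds, is assumed to be precomputed; names and signatures here are illustrative, not those of our released code.

```python
import torch

def filtered_rank(scores, gold_o, other_golds):
    """Filtered rank of the gold object for one (s, r, ?) query.
    scores      : [N] 1-N scores over all entities
    gold_o      : int, index of the gold object o*
    other_golds : list[int], all OTHER correct objects for (s, r) in train/dev/test
    """
    scores = scores.clone()
    scores[other_golds] = float('-inf')   # filter out competing correct answers
    gold_score = scores[gold_o]
    return int((scores > gold_score).sum().item()) + 1

def mrr_and_hits(ranks, ks=(1, 10)):
    """Aggregate MRR and HITS@k from a list of filtered ranks."""
    ranks = torch.tensor(ranks, dtype=torch.float)
    metrics = {'MRR': (1.0 / ranks).mean().item()}
    for k in ks:
        metrics[f'HITS@{k}'] = (ranks <= k).float().mean().item()
    return metrics
```

The (?, r, o) direction is evaluated the same way after swapping to reverse queries, and the two sets of ranks are pooled before aggregation.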