Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations

Diego Marcheggiani, ILLC, University of Amsterdam, marcheggiani@uva.nl
Ivan Titov, ILLC, University of Amsterdam, titov@uva.nl

Transactions of the Association for Computational Linguistics, vol. 4, pp. 231-244, 2016.

Abstract

We present a method for unsupervised open-domain relation discovery. In contrast to previous (mostly generative and agglomerative clustering) approaches, our model relies on rich contextual features and makes minimal independence assumptions. The model is composed of two parts: a feature-rich relation extractor, which predicts a semantic relation between two entities, and a factorization model, which reconstructs arguments (i.e., the entities) relying on the predicted relation. The two components are estimated jointly so as to minimize errors in recovering arguments. We study factorization models inspired by previous work in relation factorization and selectional preference modeling. Our models substantially outperform the generative and agglomerative-clustering counterparts and achieve state-of-the-art performance.

1 Introduction

The task of Relation Extraction (RE) consists of detecting and classifying the semantic relations present in text. RE has been shown to benefit a wide range of NLP tasks, such as information retrieval (Liu et al., 2014), question answering (Ravichandran and Hovy, 2002) and textual entailment (Szpektor et al., 2004).

Supervised methods for RE have been successful when small restricted sets of relations are considered. However, human annotation is expensive and time-consuming, and consequently these approaches do not scale well to the open-domain setting where a large number of relations need to be detected in a heterogeneous text collection (e.g., the entire Web). Though weakly-supervised approaches, such as distantly supervised methods and bootstrapping (Mintz et al., 2009; Agichtein and Gravano, 2000), reduce the amount of necessary supervision, they still require examples for every relation considered.
These limitations led to the emergence of unsupervised approaches for RE. These methods extract surface or syntactic patterns between two entities and either directly use these patterns as substitutes for semantic relations (Banko et al., 2007; Banko and Etzioni, 2008) or cluster the patterns (sometimes in a context-sensitive way) to form relations (Lin and Pantel, 2001; Yao et al., 2011; Nakashole et al., 2012; Yao et al., 2012). The existing methods, given their generative (or agglomerative clustering) nature, rely on simpler features than their supervised counterparts and also make strong modeling assumptions (e.g., assuming that arguments are conditionally independent of each other given the relation). These shortcomings are likely to harm their performance.

In this work, we tackle the aforementioned challenges and introduce a new model for unsupervised relation extraction. We also describe an efficient estimation algorithm which lets us experiment with large unannotated collections. Our model is composed of two components:

• an encoding component: a feature-rich relation extractor which predicts a semantic relation between two entities in a specific sentence given contextual features;

• a reconstruction component: a factorization model which reconstructs arguments (i.e., the entities) relying on the predicted relation.

The two components are estimated jointly so as to minimize errors in reconstructing arguments. While learning to predict left-out arguments, the inference algorithm will search for latent relations that simplify the argument prediction task as much as possible. Roughly, such an objective will favour inducing relations that maximally constrain the set of admissible argument pairs. Our hypothesis is that relations induced in this way will be interpretable by humans and useful in practical applications. Why is this hypothesis plausible? Primarily because humans typically define relations as an abstraction capturing the essence of the underlying situation. And the underlying situation (rather than surface linguistic details like syntactic functions) is precisely what imposes constraints on admissible argument pairs.

This framework allows us to both exploit rich features (in the encoding component) and capture interdependencies between arguments in a flexible way (both in the reconstruction and encoding components).

The use of a reconstruction-error objective, previously considered primarily in the context of training neural autoencoders (Hinton, 1989; Vincent et al., 2008), gives us an opportunity to borrow ideas from the well-established area of statistical relational learning (Getoor and Taskar, 2007), and, more specifically, relation factorization. In this area, tensor and matrix factorization methods have been shown to be effective for inferring missing facts in knowledge bases (Bordes et al., 2011; Riedel et al., 2013; Chang et al., 2014; Bordes et al., 2014; Sutskever et al., 2009). In our work, we also adopt a fairly standard RESCAL factorization (Nickel et al., 2011) and use it within our reconstruction component.

Though there is a clear analogy between statistical relational learning and our setting, there is also a very significant difference.
In contrast to relational learning, rather than factorizing existing relations (an existing 'database'), our method simultaneously discovers the relational schema (i.e., an inventory of relations) and a mapping from text to the relations (i.e., a relation extractor), and it does it in such a way as to maximize performance on reconstruction (i.e., inference) tasks. This analogy also highlights one important property of our framework: unlike generative models, we explicitly force our semantic representations to be useful for at least the most basic form of semantic inference (i.e., inferring an argument based on the relation and another argument). It is important to note that the model is completely agnostic about the real semantic relation between two arguments, as the relational schema is discovered during learning.

We consider both a factorization method inspired by previous research in knowledge base modeling (as discussed above) and another, even simpler one, based on ideas from previous research on modeling selectional preferences (e.g., Resnik (1997); Ó Séaghdha (2010); Van de Cruys (2010)), plus their combination.

Our models are applied to a version of the New York Times corpus (Sandhaus, 2008). In order to evaluate our approach, we follow Yao et al. (2011) and align named entities in our collection to Freebase (Bollacker et al., 2008), a large collaborative knowledge base. In this way we can evaluate a subset of our induced relations against relations in Freebase. Note that Freebase has not been used during learning, making this a fair evaluation scenario for an unsupervised relation induction method. We also qualitatively evaluate our model by both considering several examples of induced relations (both appearing and not appearing in Freebase) and visualizing embeddings of named entities induced by our model. As expected, the choice of a factorization model affects the model performance. Our best models substantially outperform the state-of-the-art generative Rel-LDA model of Yao et al. (2011): 35.8% F1 and 29.6% F1 for our best model and Rel-LDA, respectively.

The rest of the paper is structured as follows. In the following section, we formally describe the problem. In Section 3, we motivate our approach. In Section 4, we formally describe the method. In Section 5 we describe our experimental setting and discuss the results. We give more background on RE, knowledge base completion and autoencoders in Section 6.

2 Problem Definition

In the most standard form of RE considered in this work, an extractor, given a sentence and a pair of named entities e1 and e2, needs to predict the underlying semantic relation r between these entities. For example, in the sentence

Roger Ebert wrote a review of The Fall

we have two entities e1 = Roger Ebert and e2 = The Fall, and the extractor should predict the semantic relation r = REVIEWED.[1] The standard approach to this task is to either rely on human annotated data (i.e., supervised learning) or use data generated automatically by aligning knowledge bases (e.g., Freebase) with text (called distantly-supervised methods).

[Figure 1: Inducing relations with discrete-state autoencoders. A feature representation of "Ebert is the first ..." is fed to a log-linear, feature-rich relation extractor (encoding); the hidden relation, e.g. AWARDED(e1: Ebert, e2: Pulitzer prize), is then used by a factorization model to predict the entity Pulitzer prize (reconstruction).]
Both classes of approaches assume a predefined inventory of relations and a manually constructed resource.

In contrast, the focus of this paper is on open-domain unsupervised RE (also known as relation discovery) where no fixed inventory of relations is provided to the learner. The methods induce relations from the data itself. Previous work on this task (Banko et al., 2007), as well as on its generalization, called unsupervised semantic parsing (Poon and Domingos, 2009; Titov and Klementiev, 2011), groups patterns between entity pairs (e.g., wrote a review, wrote a critique and reviewed) and uses these clusters as relations. Other approaches (e.g., Shinyama and Sekine (2006); Yao et al. (2011); Yao et al. (2012); de Lacalle and Lapata (2013)), including the one introduced in this paper, perform context-sensitive clustering, that is, they treat relations as latent variables and induce them for each entity-pair occurrence individually. Rather than relying solely on a pattern between entity pairs, the latter class of methods can use additional context to decide that Napoleon reviewed the Old Guard and the above sentence about Roger Ebert should not be labeled with the same relation.

[1] In some of our examples we will use relation names, although our method, as virtually any other latent variable model, will not induce names but only indices.

Unsupervised relation discovery is an important task because existing knowledge bases (e.g., Freebase, Yago (Suchanek et al., 2007), DBpedia (Auer et al., 2007)) do not have perfect coverage even for most standard domains (e.g., music or sports), and, arguably more importantly, because there are many domains not covered by these resources. Though one option is to provide a list of relations with seed examples for each of them and then use bootstrapping (Agichtein and Gravano, 2000), it requires domain knowledge and may thus be problematic. In these cases unsupervised relation discovery is the only non-labour-intensive way to construct a relation extractor. Moreover, unsupervised methods can also aid in building new knowledge bases by providing an initial set of relations which can then be refined.

As is common, in this work we limit ourselves to only considering binary relations between entities occurring in the same sentence. We focus only on extracting semantic relations, assuming that named entities have already been recognized by an external method (Finkel et al., 2005). As in previous work (Yao et al., 2011), we are not trying to detect if there is a relation between two entities or not; our aim is to detect a relation between each pair of entities appearing in a sentence. In principle, heuristics (i.e., based on the syntactic dependency paths connecting arguments) can be used to get rid of unlikely pairs.

3 Our Approach

We approach the problem by introducing a latent variable model which defines the interactions between a latent relation r and the observables: the entity pair (e1, e2) and other features of the sentence x. The idea which underlies much of latent variable modeling is that a good latent representation is the one that helps us to reconstruct the input (i.e., x, including (e1, e2)). In practice, we are not interested in predicting x, as x is observable, but rather in inducing an appropriate latent representation (i.e., r). Thus, it is crucial to design the model in such a way that a good r (the one predictive of x) indeed encodes relations rather than some other form of abstraction.
In our approach, we encode this reconstruction idea very explicitly. As a motivating example, consider the following sentence:

Ebert is the first journalist to win the Pulitzer prize.

As shown in Figure 1, let us assume that we hide one argument, chosen at random: for example, e2 = Pulitzer prize. Now the purpose of the reconstruction component is to reconstruct (i.e., infer) this argument relying on another argument (e1 = Ebert), the latent relation r and nothing else. At learning time, our inference algorithm will search through the space of potential relation clusterings to find the one that makes these reconstruction tasks as simple as possible. For example, if the algorithm clusters expressions is the first journalist to win together with was awarded, the prediction is likely to be successful, assuming that the passage Ebert was awarded the Pulitzer prize has been observed elsewhere in the training data. On the contrary, if the algorithm clustered is the first journalist to win with presented, we are likely to make a wrong inference (i.e., predict Golden Thumb award). Given that we optimize the reconstruction objective, the former clustering is much more likely than the latter. Reconstruction can be seen as a knowledge base factorization approach similar to the ones of Bordes et al. (2014). Notice that the model's final goal is to learn a good relation clustering, and that the reconstruction objective is used as a means to reach this goal. For reasons which will be clear in a moment, we will refer to the model performing the prediction of entities relying on other entities and relations as a decoder (a.k.a. the reconstruction component).

Despite our description of the model as pattern-clustering, it is important to stress that we are inducing clusters in a context-sensitive way. In other words, we are learning an encoder: a feature-rich classifier, which predicts a relation for a specific sentence and an entity pair in this sentence. Clearly, this is a better approach because some of the patterns between entities are ambiguous and require extra features to disambiguate them (recall the example from the previous section), whereas other patterns may not be frequent enough to induce reliable clustering (e.g., is the first journalist to win). The encoding and reconstruction components are learned jointly so as to minimize the prediction error. In this way, the encoder is specialized to the defined reconstruction problem.

4 Reconstruction Error Minimization

In order to implement the desiderata sketched in the previous section, we take inspiration from a framework popular in the neural network community, namely autoencoders (Hinton, 1989). Autoencoders are composed of two components: an encoder which predicts a latent representation y from an input x, and a decoder which relies on the latent representation y to recover the input (x̃). In the learning phase, the parameters of both the encoding and reconstruction part are chosen so as to minimize a reconstruction error (e.g., the Euclidean distance ||x − x̃||²).

Although popular within the neural network community (where y is defined as a real-valued vector), autoencoders have recently been applied to the discrete-state setting (where y is defined as a categorical random variable, a tuple of variables or a graph). For example, such models have been used in the context of dependency parsing (Daumé III, 2009), or in the context of POS tagging and word alignment (Ammar et al., 2014; Lin et al., 2015a).
The most related previous work (Titov and Khoddam, 2015) considers induction of semantic roles of verbal arguments (e.g., an agent, a performer of an action vs. a patient, an affected entity), though no grouping of predicates into relations was considered. We refer to such models as discrete-state autoencoders.

We use different model families for the encoding and reconstruction components. The encoding part is a log-linear feature-rich model, while the reconstruction part is a tensor (or matrix) factorization model which seeks to reconstruct entities, relying on the outcome of the encoding component.

4.1 Encoding component

The encoding component, that is, the actual relation extractor that will be used to process new sentences, is a feature-rich classifier that, given a set of features extracted from the sentence, predicts the corresponding semantic relation r ∈ R. We use a log-linear model ('softmax regression')

$$q(r \mid x, w) = \frac{\exp(w^T g(r, x))}{\sum_{r' \in R} \exp(w^T g(r', x))}, \qquad (1)$$

where g(r, x) is a high-dimensional feature representation and w is the corresponding vector of parameters. In principle, the encoding model can be any model as long as the relation posteriors q(r|x, w) and their gradients can be efficiently computed or approximated. We discuss the features we use in the experimental section (Section 5).

4.2 Reconstruction component

In the reconstruction component (i.e., decoder), we seek to predict an entity ei ∈ E in a specific position i ∈ {1, 2} given the relation r and another entity e−i, where e−i denotes the complement {e1, e2} \ {ei}. Note that this model does not have access to any features of the sentence; this is crucial since in this way we ensure that all the essential information is encoded by the relation variable. This bottleneck forces the learning algorithm to induce informative relations rather than cluster relation occurrences in a random fashion or assign them all to the same relation.

To simplify our notation, let us assume that we predict e1; the model for e2 will be analogous. We write the conditional probability models in the following form

$$p(e_1 \mid e_2, r, \theta) = \frac{\exp(\psi(e_1, e_2, r, \theta))}{\sum_{e' \in E} \exp(\psi(e', e_2, r, \theta))}, \qquad (2)$$

where E is the set of all entities; ψ is a general scoring function which, as we will show, can be instantiated in several ways; θ represents its parameters. The actual set of parameters represented by θ will depend on the choice of scoring function. However, in all the cases we consider in this paper, the parameters will include entity embeddings (u_e ∈ R^d for every e ∈ E). These embeddings will be learned within our model.

In this work we explore three different factorizations ψ for the decoding component: a tensor factorization model inspired by previous work on relation factorization, a simple selectional-preference model which scores each argument independently of the other, and a third model which is a combination of the first two.

4.2.1 ψ_RS: RESCAL

The first reconstruction model we consider is RESCAL, a model very successful in the relational modeling context (Nickel et al., 2011; Chang et al., 2014). It is a restricted version of the classic Tucker tensor decomposition (Tucker, 1966; Kolda and Bader, 2009) and is defined as

$$\psi_{RS}(e_1, e_2, r, \theta) = u_{e_1}^T C_r u_{e_2}, \qquad (3)$$

where u_{e1}, u_{e2} ∈ R^d are the entity embeddings corresponding to the entities e1 and e2. C_r ∈ R^{d×d} is a matrix associated with the latent semantic relation r; it evaluates (i.e., scores) the compatibility between the two arguments of the relation.
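To make the notation concrete, the following minimal numpy sketch (an illustration, not the authors' released implementation) shows the encoder posterior of expression (1) and the RESCAL-based reconstruction of expressions (2)-(3). The factored form of g(r, x), namely a per-relation weight row applied to a shared input feature vector, and all array shapes are assumptions made for brevity.

```python
import numpy as np

def encoder_posterior(W, f_x):
    """Relation posterior q(r | x, w) of expression (1).

    Assumes the joint feature function g(r, x) factors into a per-relation
    weight vector W[r] applied to a shared input feature vector f_x, so that
    w^T g(r, x) = W[r] . f_x.
    W: (num_relations, num_features) array; f_x: (num_features,) array.
    """
    scores = W @ f_x
    scores -= scores.max()          # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def rescal_score(u_e1, C_r, u_e2):
    """Bilinear RESCAL score psi_RS(e1, e2, r) = u_e1^T C_r u_e2 of expression (3)."""
    return u_e1 @ C_r @ u_e2

def reconstruction_posterior(U, C_r, u_e2):
    """p(e1 | e2, r, theta) of expression (2) with the RESCAL scorer: a softmax
    of psi over all candidate first arguments. U: (num_entities, d) embeddings."""
    scores = U @ (C_r @ u_e2)       # psi(e', e2, r) for every candidate e'
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```

The same softmax-over-entities pattern applies to the other scorers introduced below; only the scoring function ψ changes.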
4.2.2 ψ_SP: Selectional preferences

The second factorization ψ_SP scores how well each argument fits the selectional preferences of a given relation r

$$\psi_{SP}(e_1, e_2, r, \theta) = \sum_{i=1}^{2} u_{e_i}^T c_{ir}, \qquad (4)$$

where c_{1r} and c_{2r} ∈ R^d encode selectional preferences for the first and second argument of the relation r, respectively. This factorization is also known as model E in Riedel et al. (2013). In contrast to the previous model, it does not model the interaction between arguments: it is easy to see that p(e1|e2, r, θ) for this model (expression (2)) does not depend on e2 (i.e., on u_{e2} and c_{2r}). Consequently, such a decoder would be more similar to generative models of relations which typically assume that arguments are conditionally independent (Yao et al., 2011). Note however that our joint model can still capture argument interdependencies in the encoding component. Still, this approach does not fully implement the desiderata described in the previous section, so we generally expect this model to be weaker on reasonably-sized collections (this hypothesis will be confirmed in our experimental evaluation).

4.2.3 ψ_HY: Hybrid model

The RESCAL model may be too expressive to be accurately estimated for infrequent relations, whereas the selectional preference model cannot, in turn, capture interdependencies between arguments. Thus it seems natural to hope that their combination ψ_HY will be more accurate overall:

$$\psi_{HY}(e_1, e_2, r, \theta) = u_{e_1}^T C_r u_{e_2} + \sum_{i=1}^{2} u_{e_i}^T c_{ir}. \qquad (5)$$

This model is very similar to the tensor factorization approach proposed in Socher et al. (2013).

4.3 Learning

We first provide an intuition behind the objective we optimize. We derive it more formally in the subsequent section, where we show that it can be regarded as a variational lower bound on pseudolikelihood (Section 4.3.1). As the resulting objective is still computationally expensive to optimize (due to a summation over all potential entities), we introduce further approximations in Section 4.3.2.

The parameters of the encoding and decoding components (i.e., w and θ) are estimated jointly. Our general idea is to optimize the quality of argument prediction while averaging over relations

$$\sum_{i=1}^{2} \sum_{r \in R} q(r \mid x, w) \log p(e_i \mid e_{-i}, r, \theta). \qquad (6)$$

Though this objective seems natural, it has one serious drawback: the induced posteriors q(r|x, w) end up being extremely sharp which, in turn, makes the search algorithm more prone to getting stuck in local minima. As we will see in the experimental results, this version of the objective results in lower average performance. This behaviour can be explained by drawing connections with variational inference. Roughly speaking, direct optimization of the above objective behaves very much like using hard EM for generative latent-variable models.

Intuitively, one solution is, instead of optimizing expression (6), to consider an entropy-regularized version that favours more uniform posterior distributions q(r|x, w)

$$\sum_{i=1}^{2} \sum_{r \in R} q(r \mid x, w) \log p(e_i \mid e_{-i}, r, \theta) + H(q(\cdot \mid x, w)), \qquad (7)$$

where the last term H denotes the entropy of q. The entropy term can be seen as posterior regularization (Ganchev et al., 2010) which pushes the posterior q(r|x, w) to be more uniform. As we will see in a moment, this approach can be formally justified by drawing connections to variational inference (Jaakkola and Jordan, 1996) and, more specifically, to variational autoencoders (Kingma and Welling, 2014).
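The remaining scorers and the entropy-regularized objective of expression (7) are simple enough to spell out directly. The sketch below (again illustrative only, with hypothetical names and shapes) assumes the per-relation log-probabilities log p(e_i | e_-i, r, θ) have already been computed with the softmax of expression (2).

```python
import numpy as np

def sp_score(u_e1, u_e2, c_r1, c_r2):
    """Selectional-preference score psi_SP of expression (4): each argument is
    scored against a per-relation, per-position preference vector."""
    return u_e1 @ c_r1 + u_e2 @ c_r2

def hybrid_score(u_e1, u_e2, C_r, c_r1, c_r2):
    """Hybrid score psi_HY of expression (5): RESCAL term plus selectional preferences."""
    return u_e1 @ C_r @ u_e2 + u_e1 @ c_r1 + u_e2 @ c_r2

def entropy(q):
    """Entropy H(q(.|x, w)) of the encoder posterior."""
    q = np.clip(q, 1e-12, 1.0)
    return -(q * np.log(q)).sum()

def regularized_objective(q, logp_e1, logp_e2):
    """Expression (7) for a single entity pair: the expected log-likelihood of
    reconstructing each argument, averaged over relations under q(r|x, w),
    plus the entropy regularizer.

    q:       (num_relations,) encoder posterior.
    logp_e1: (num_relations,) values of log p(e1 | e2, r, theta).
    logp_e2: (num_relations,) values of log p(e2 | e1, r, theta).
    """
    return q @ (logp_e1 + logp_e2) + entropy(q)
```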
4.3.1 Variational inference

This subsection presents a justification for the objectives (6) and (7); however, a reader not interested in this explanation can safely skip it and proceed directly to Section 4.3.2.

For the moment let us assume that we perform generative modeling, and we consider optimization of the following pseudo-likelihood (Besag, 1975) objective

$$\sum_{i=1}^{2} \log \sum_{r} p(e_i \mid e_{-i}, r, \theta) \, p_u(r), \qquad (8)$$

where p_u(r) is the uniform distribution over relations. Note that currently the encoding model is not part of this objective. The pseudo-likelihood (by Jensen's inequality) can be lower-bounded by the following variational bound

$$\sum_{i=1}^{2} \Big( \sum_{r \in R} q_i(r) \log p(e_i \mid e_{-i}, r, \theta) \, p_u(r) + H(q_i) \Big), \qquad (9)$$

where q_i is an arbitrary distribution over relations. Note that p_u(r) can be dropped from the expression as it corresponds to a constant with respect to the choice of both the variational distributions q_i and the (reconstruction) model parameters θ. In variational inference, the maximization of the original (pseudo-)likelihood objective (8) is replaced with the maximization of expression (9) both with respect to q_i and θ. This is typically achieved with an EM-like step-wise procedure: steps where q_i is selected for a given θ are alternated with steps where the parameters θ are updated while keeping q_i fixed. One idea, introduced by Kingma and Welling (2014) for the continuous case, is to replace the search for an optimal q_i with a predictor (a classifier in our discrete case) trained within the same optimization procedure. Our encoding model q(r|x, w) is exactly such a predictor. With these two modifications (dropping the nuisance term p_u and replacing q_i with q(r|x, w)), we obtain the objective (7).

4.3.2 Approximation

The objective (7) cannot be efficiently optimized in its exact form as the partition function of expression (2) requires the summation over the entire set of possible entities E. In order to deal with this challenge we rely on the negative sampling approach of Mikolov et al. (2013). Specifically, we avoid the softmax in expression (2) and substitute log p(e1|e2, r, θ) in the objective (7) with the following expression

$$\log \sigma(\psi(e_1, e_2, r, \theta)) + \sum_{e_1^{neg} \in S} \log \sigma(-\psi(e_1^{neg}, e_2, r, \theta)),$$

where S is a random sample of n entities from the distribution of entities in the collection and σ is the sigmoid function. Intuitively, this expression pushes up the scores of arguments seen in the text and pushes down the scores of 'negative' arguments. When there are multiple entities e1 which satisfy the relation r with e2 (for example, Natasha Obama and Malia Ann Obama, in relation CHILD OF with Barack Obama) the scores for all such entities will be pushed up. Assuming both daughters are mentioned with a similar frequency, they will get similar scores. Generally, arguments more frequently mentioned in text will get higher scores.

In the end, instead of directly optimizing expression (7), we use the following objective

$$\sum_{i=1}^{2} \mathbb{E}_{q(\cdot \mid x, w)} \Big[ \log \sigma(\psi(e_i, e_{-i}, r, \theta)) + \sum_{e_i^{neg} \in S} \log \sigma(-\psi(e_i^{neg}, e_{-i}, r, \theta)) \Big] + \alpha H(q(\cdot \mid x, w)), \qquad (10)$$

where E_{q(·|x,w)}[...] denotes an expectation computed with respect to the encoder distribution q(r|x, w). Note the non-negative parameter α: after substituting the softmax with the negative sampling term, the entropy term and the expectation are not on the same scale anymore. Though we could try estimating the scaling parameter α, we chose to tune it on the validation set.

The gradients of the above objective can be calculated using backpropagation. With the proposed approximation, the computation of the gradients is quite efficient since the reconstruction model has a fairly simple form (e.g., bilinear) and learning the encoder is no more expensive than learning a supervised classifier. We used AdaGrad (Duchi et al., 2011) as an optimization algorithm.
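As a rough illustration (not the paper's actual training code), the sampled objective (10) for a single entity pair could be computed as follows, assuming the scores of the observed arguments and of n sampled negatives per position have been precomputed; in practice the gradients of this quantity would be obtained by automatic differentiation and the parameters updated with AdaGrad, as described above.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log sigma(x)."""
    return -np.logaddexp(0.0, -x)

def sampled_objective(q, psi_pos, psi_neg, alpha):
    """Negative-sampling surrogate of expression (10) for one entity pair.

    q:        (num_relations,) encoder posterior q(r | x, w).
    psi_pos:  (num_relations, 2) scores psi(e_i, e_-i, r) of the observed arguments.
    psi_neg:  (num_relations, 2, n) scores of n sampled negative arguments per position.
    alpha:    weight of the entropy regularizer, tuned on the validation set.
    """
    pos_term = log_sigmoid(psi_pos).sum(axis=1)         # push observed arguments up
    neg_term = log_sigmoid(-psi_neg).sum(axis=(1, 2))   # push sampled arguments down
    expected = q @ (pos_term + neg_term)                 # expectation under q(r | x, w)
    q_safe = np.clip(q, 1e-12, 1.0)
    ent = -(q_safe * np.log(q_safe)).sum()
    return expected + alpha * ent
```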
5 Experiments

In this section we evaluate how effective our model is in discovering relations between pairs of entities in a sentence. We consider the unsupervised setting, so we use clustering measures for evaluation.

Since we want to directly compare to Rel-LDA (Yao et al., 2011), we use the transductive set-up: we train our model on the entire training set (with labels removed) and we evaluate the estimated model on a subset of the training set. Given that we train the relation classifier (i.e., the encoding model), unlike some of the previous approaches, there is nothing in our approach which prevents us from applying it in an inductive scenario (i.e., to unseen data).

Towards the end of this section we also provide qualitative evaluation of the induced relations and entity embeddings.

5.1 Data and evaluation measures

We tested our model on the New York Times corpus (Sandhaus, 2008) using articles from 2000 to 2007. We use the same filtering and preprocessing steps (POS tagging, NER, and syntactic parsing) as the ones described in Yao et al. (2011). In that way we obtained about 2 million entity pairs (i.e., potential relation realizations).

In order to evaluate our models, we aligned each entity pair with Freebase, and, as in Yao et al. (2012), we discarded unaligned ones from the evaluation. We consider Freebase relations as gold-standard clusterings and evaluated induced relations against them. Note that we use the micro-reading scenario (Nakashole and Mitchell, 2014), that is, we predict a relation on the basis of a single occurrence of an entity pair rather than aggregating information across all the occurrences of the pair in the corpus. Though it is likely to harm our performance when evaluating against Freebase, this is a deliberate choice as we believe extracting relations about less frequent entities (where there is little redundancy in a collection) and modelling content of specific documents is a more challenging and important research direction. Moreover, feature-rich models are likely to be especially beneficial in these scenarios, as for micro-reading the information extraction systems cannot fall back to easier non-ambiguous contexts.

We use the B³ metric (Bagga and Baldwin, 1998) as the scoring function. B³ is a standard measure for evaluating precision and recall of clustering tasks (Yao et al., 2012). As the final evaluation score we use F1, the harmonic mean of precision and recall.
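Concretely, B³ computes a precision and a recall for every relation instance by comparing its induced cluster with its gold Freebase relation and then averages over instances. The following small sketch of the metric is our own illustration, not the evaluation code used in the paper.

```python
from collections import Counter

def b_cubed_f1(predicted, gold):
    """B-cubed precision, recall and F1 (Bagga and Baldwin, 1998).

    predicted[i] is the induced relation for instance i, gold[i] its gold
    (Freebase) relation; both are sequences of equal length.
    """
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    pair_counts = Counter(zip(predicted, gold))

    precision = recall = 0.0
    n = len(predicted)
    for p, g in zip(predicted, gold):
        correct = pair_counts[(p, g)]           # instances sharing both labels
        precision += correct / pred_counts[p]   # fraction of the predicted cluster
        recall += correct / gold_counts[g]      # fraction of the gold class
    precision, recall = precision / n, recall / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example usage: b_cubed_f1([0, 0, 1, 1], ["r1", "r1", "r1", "r2"])
```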
5.2 Features

The crucial characteristic of the learning method we propose is the ability to handle a rich (and overlapping) set of features. With this in mind we adopted the following set of features:

1. bag of words between e1 and e2;
2. the surface form of e1 and e2;
3. the lemma of the 'trigger'[2] (i.e., for the passage Microsoft is based in Redmond, the trigger is based and its lemma is base);
4. the part-of-speech sequence between e1 and e2;
5. the entity type of e1 and e2 (as a pair);
6. the entity type of e1;
7. the entity type of e2;
8. words on the syntactic dependency path between e1 and e2, i.e., the lexicalized path between the entities stripped of dependency labels and their direction.

[2] We define triggers as in Yao et al. (2011), namely "all the words on the dependency path except stop words".

For example, from the sentence Stephen Moore, director of fiscal policy studies at the conservative Cato Institute, we would extract the following features:

1. BOW:director, BOW:of, BOW:fiscal, BOW:policy, BOW:studies, BOW:at, BOW:the;
2. E1:Stephen Moore, E2:Cato Institute;
3. Trigger:director;
4. PoS:NN IN JJ NN NNS IN DT JJ;
5. PairType:PERSON ORGANIZATION;
6. E1Type:PERSON;
7. E2Type:ORGANIZATION;
8. Path:director at.
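As an illustration only, a toy version of this feature extraction might look as follows; the helper arguments (tokens between the entities, their POS tags, entity types, trigger lemma, dependency-path words) are assumed to come from the preprocessing pipeline of Section 5.1, and the function itself is hypothetical.

```python
def extract_features(tokens_between, pos_between, e1, e2, e1_type, e2_type,
                     trigger_lemma, dep_path_words):
    """Build the sparse, overlapping feature set of Section 5.2 for one
    entity-pair occurrence. All arguments are assumed to be precomputed
    by the preprocessing pipeline (tagger, NER, parser)."""
    features = []
    features += ["BOW:" + w for w in tokens_between]           # feature 1
    features += ["E1:" + e1, "E2:" + e2]                        # feature 2
    features.append("Trigger:" + trigger_lemma)                 # feature 3
    features.append("PoS:" + " ".join(pos_between))             # feature 4
    features.append("PairType:" + e1_type + " " + e2_type)      # feature 5
    features.append("E1Type:" + e1_type)                        # feature 6
    features.append("E2Type:" + e2_type)                        # feature 7
    features.append("Path:" + " ".join(dep_path_words))         # feature 8
    return features

# For the Stephen Moore example above this would reproduce the listed features,
# e.g. extract_features(["director", "of", "fiscal", "policy", "studies", "at", "the"],
#                       ["NN", "IN", "JJ", "NN", "NNS", "IN", "DT", "JJ"],
#                       "Stephen Moore", "Cato Institute", "PERSON",
#                       "ORGANIZATION", "director", ["director", "at"])
```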
5.3 Parameters and baselines

All model parameters (w, θ) were initialized randomly. The embedding dimensionality d was set to 30. We induced 100 relations, the same as used for Rel-LDA in Yao et al. (2011). We also set the mini-batch size to 100, the initial learning rate of AdaGrad to 0.1 and the number of negative samples n to 20. The results reported in Table 1 are average results of three runs obtained after 5 iterations over the entire training set. For each model we tuned the weight for the L2 regularization penalty and chose 0.1 as it worked well across all the models. We tuned the α coefficient (i.e., the weight for the entropy term) for each model: we chose 0.25 for RESCAL, 0.01 for the selectional preferences, and 0.1 for the hybrid model. All model selection was performed on a validation set: we selected a random 20% of the entire dataset, and considered all entity pairs aligned to Freebase. The final evaluation was done on the remaining 80%.

In order to compare our models with the state of the art in unsupervised RE, we used as a baseline the Rel-LDA model introduced in Yao et al. (2011). Rel-LDA is an application of the LDA topic model (Blei et al., 2003) to the relation discovery task. In Rel-LDA topics correspond to latent relations, and, instead of relying on words as LDA does, Rel-LDA uses predefined features, including argument words. In a similar fashion to our selectional-preference decoder, it assumes that arguments are conditionally independent given the relation. As another baseline, following Yao et al. (2012), we used hierarchical agglomerative clustering (HAC). This baseline is very similar to the standard unsupervised relation extraction method DIRT (Lin and Pantel, 2001). The HAC cut-off parameter was set to 0.95 based on the development set performance. We used the same feature representation for all the models, including the baselines. We also report results of Rel-LDA using the features from Yao et al. (2012).[3]

[3] Yao et al. (2012) is a follow-up work for Yao et al. (2011).

Table 1: Average F1 results (%), and the standard deviation, across 3 runs of different models on the test set.

    Model                               F1 (%)
    RESCAL                              34.5 ± 1.3
    Selectional Pref.                   33.4 ± 1.1
    Hybrid                              35.8 ± 2.0
    Rel-LDA (our feats)                 29.6 ± 0.9
    Rel-LDA (Yao et al., 2012 feats)    26.3 ± 0.8
    HAC (DIRT)                          28.3

5.4 Results and discussion

The results we report in Table 1 are means and standard deviations across 3 runs with different random initialization of the parameters (except for the deterministic HAC approach). First, we can observe that using richer features is beneficial for the generative baseline. It leads to a substantial improvement in F1 (from 26.3% to 29.6% F1). The HAC baseline is outperformed by Rel-LDA (28.3% vs. 29.6% F1). However, all our proposed models substantially outperform all 3 baselines: the best result is 35.8% F1. The selectional preference model on average performs better than the best baseline (33.4% vs. 29.6% F1). As we predicted in Section 4, compared with the RESCAL model, the selectional preference model has slightly lower performance (34.5% vs. 33.4% F1). This is not surprising as the argument independence assumption is very strong, and the general motivation we provided in Section 2 does not really apply to the selectional preference model. Combining the RESCAL and selectional preference models, as we expected, gives some advantage in terms of performance. The hybrid model is the best performing model with 35.8% F1, and it is, on average, 6.2% more accurate than Rel-LDA.

The introduction of entropy in expression (7) does not only add an extra justification to the objective we optimize, but also helps to improve the models' performance. In fact, as shown in Figure 2 for the Hybrid model, whether or not the entropy term is included makes a big difference, going from 23.9% without regularization to 34.3% F1 with regularization. Note that the method is quite stable within the range α ∈ [0.1, 1], and more fine-grained tuning of α seems only mildly beneficial. However the performance with small values of α (0.01) is more problematic: Hybrid both does not outperform Rel-LDA and has a large variance across runs. Somewhat counter-intuitively, with α = 0 (no entropy regularization) the variance is almost negligible. However, given the low performance in this regime, it probably just means that we get consistently stuck in equally bad local minima.

[Figure 2: Results of the hybrid model on the validation set, with different α (F1 plotted for α = 0, 0.01, 0.1, 1).]

Though it may seem that the need to tune the entropy term weight is an unfortunate side effect of using the non-probabilistic objective from Section 4.3.2, the reality is more subtle. In fact, even for fully probabilistic variational autoencoders with real-valued states y, using the weight of 1, as prescribed by their variational interpretation (see Section 4.3.1), does not lead to stable performance (Bowman et al., 2016). Instead, annealing over α seems necessary. Though annealing is likely to benefit our method as well, we leave it for future work.

Since the proposed model is unsupervised, it is interesting to inspect the relations induced by our best model. In order to do so, we select the most likely relation according to our relation extractor (i.e., encoding model) for every context in the validation set and then, for every relation, we count occurrences of every trigger. The most frequent triggers for three induced relations are presented in Table 2.

Table 2: Relation clusters ordered from left to right by their frequency.

    Relation 66    Relation 62           Relation 19
    president      review                professor
    director       review restaurant     dean
    chairman       review production     graduate
    executive      review book           director
    spokesman      review performance    specialist
    manager        column review         attend
    analyst        review concert        expert
    owner          review revival        professor study
    professor      review rise           chairman

Relation 62 encodes the relation REVIEWED (not present in Freebase), as in Anthony Tommasini reviews Metropolitan Opera's production of Cosi Fan Tutte. Clusters 19 and 66 are examples of coarser relations. Relation 19 represents a broader ACADEMIC relation, as in the passage Dr. Susan Merritt, dean of the School of Computer Science and Information Systems, or as in the passage George Myers graduated from Yale University. Cluster 66 instead groups together expressions such as leads or president (of), so it can vaguely be described as a LEADERSHIP relation, but it also contains the relation triggered by the word professor (in).
In fact, this is the most frequent relation induced by our model. We can check further by looking at the learned embeddings of named entities visualized with the t-SNE algorithm (Van der Maaten and Hinton, 2008). In Figure 3, we can see that entities representing universities and non-academic organizations end up being very close in the embedding space. This artefact is likely to be related to the coarseness of Relations 66 and 19, though it does not provide a good explanation for why this has happened, since the entity embeddings are also induced within our model. However, overlaps in embeddings do not seem to be a general problem: the t-SNE visualization shows that most entities are well clustered into fine-grained types, for example, football teams, nations, and music critics.

[Figure 3: t-SNE visualization of entity embeddings learned during the training process. Labeled clusters include political organizations, universities, and general organizations.]

5.5 Decoder influence

In order to examine the influence of the decoder on the model performance, we performed additional experiments in a more controlled setting. We reduced the dataset to entity pairs participating in Freebase relations, ending up with a total of about 42,000 relation realizations. We randomly split the dataset in two. We used the first half as a test set Te, while we used the second half as a training set Tr. We further randomly split the training set Tr in two parts, Tr1 and Tr2. We use Tr1 as a (distantly) labeled dataset to learn only the decoding part for each proposed model. To make it comparable to our unsupervised models with 100 induced relations, we trained the decoder on the 99 most frequent Freebase relations plus a further OTHER relation, which is a union of the remaining less frequent relations. This approach is similar to the KB factorization adopted in Bordes et al. (2011). With the decoder learned and fixed, we trained the encoder part on unlabeled examples in Tr2, while leveraging the previously trained decoder. In other words, we optimize the objective (10) on Tr2 but update only the encoder parameters w.[4] In this setting the decoder provides a learning signal for the encoder. The better the generalization properties of the decoder are, the better the resulting encoder should be. We expect more expressive decoders (i.e., RESCAL and Hybrid) to be able to capture relations better than the selectional preference model and, thus, yield better encoders. In order to have a lower bound for the semi-supervised models, we also trained our best model from the previous experiments (Hybrid) on Tr2 in a fully unsupervised way. All models are tested on the test set Te.

[4] We also update embeddings of entities not appearing in Tr1.

Table 3: Average F1 results (%) for semi-supervised and unsupervised models, across 3 runs of different models tested on Te.

    Semi-sup RESCAL              62.3
    Semi-sup Selectional Pref.   58.1
    Semi-sup Hybrid              61.5
    Unsup Hybrid                 34.3

As expected, all models with a supervised decoder are much more accurate than the unsupervised model (Table 3). The best results with a supervised decoder are obtained by the RESCAL model with 62.3% F1, while the result of the unsupervised hybrid model is 34.3% F1. As expected the RESCAL and Hybrid models outperform the selectional preference model in this setting (62.3% and 61.5% vs. 58.1% F1 respectively). Somewhat surprisingly, the RESCAL model is slightly more accurate (0.8% F1) than the hybrid model.
These experiments confirm that more accurate decoder models lead to better performing encoders. The results also hint at a potential extension of our approach to a more realistic semi-supervised setting, though we leave any serious investigation of this set-up for future work.

6 Additional Related Work

In this section, we mainly consider lines of related work not discussed in other sections of the paper, and we emphasize their relationship to our approach.

Distant supervision. These methods can be regarded as a half-way point between unsupervised and supervised methods. Distantly supervised models are trained on data generated automatically by aligning knowledge bases (e.g., Freebase and Wikipedia infoboxes) with text (Mintz et al., 2009; Riedel et al., 2010; Surdeanu et al., 2012; Zeng et al., 2015). Similarly to our method, they can use feature-rich models without the need for manually annotated data. However, a relation extractor trained in this way will only be able to predict relations already present in a knowledge base. These methods cannot be used to discover new relations. The framework we propose is completely unsupervised and does not have this shortcoming.

Bootstrapping. Bootstrapping methods for relation extraction (Agichtein and Gravano, 2000; Brin, 1998; Batista et al., 2015) iteratively label new examples by finding the ones which are the most similar, according to some similarity function, to a seed set of labeled examples. The process continues until some convergence criterion is met. Even though this approach is not very labor-intensive (i.e., it requires only a few manually labeled examples for the initial seed set), it requires some domain knowledge from the model designer. In contrast, unsupervised models are domain-agnostic and require only unlabeled text.

Knowledge base factorization. Knowledge base completion via matrix or tensor factorization has received a lot of attention in the past few years (Bordes et al., 2011; Jenatton et al., 2012; Weston et al., 2013; Bordes et al., 2013; Socher et al., 2013; García-Durán et al., 2014; Bordes et al., 2014; Lin et al., 2015b; Chang et al., 2014; Nickel et al., 2011). But in contrast to what we propose here, namely, induction of new relations, these models factorize relations already present in knowledge bases.

Universal schema methods (Riedel et al., 2013) use factorization models to infer facts (e.g., predict missing entities), but they do not attempt to induce relations. In other words, they consider each given context as a relation and induce an embedding for each of them. They do not attempt to induce a clustering over the contexts. Our work can be regarded as an extension of these methods.

Autoencoders with discrete states. Aside from the work cited above (Daumé III, 2009; Ammar et al., 2014; Titov and Khoddam, 2015; Lin et al., 2015a), we are not aware of previous work using autoencoders with discrete states (i.e., a categorical latent variable or a graph). The semisupervised version of variational autoencoders (Kingma et al., 2014) used a combination of a real-valued vector and a categorical variable as its hidden representation and yielded impressive results on the MNIST image classification task. However, their approach cannot be directly applied to unsupervised classification, as there is no reason to believe that latent classes would be captured by the categorical variable rather than in some way represented by the real-valued vector.
The only other application of variational autoencoders to natural language is the very recent work of Bowman et al. (2016). They study language modeling with recurrent language models and consider only real-valued vectors as states.

Generative models with rich features have also been considered in the past (Berg-Kirkpatrick et al., 2010). However, autoencoders are potentially more flexible than generative models as they can use very different encoding and decoding components and can be faster to train.

7 Conclusions and Discussion

We presented a new method for unsupervised relation extraction.[5] The model consists of a feature-rich classifier that predicts relations, and a tensor factorization component that relies on the predicted relations to infer left-out arguments. These models are jointly estimated by optimizing the argument reconstruction objective.

We studied three alternative factorization models building on ideas from knowledge base factorization and selectional preference modeling. We empirically showed that our factorization models yield relation extractors that are more accurate than state-of-the-art generative and agglomerative clustering baselines.

As the proposed modeling framework is quite flexible, the model can be extended in many different ways. Our approach can be regarded as learning semantic representations that are informative for basic inference tasks (in our case, the inference task was recovering individual arguments). More general classes of inference tasks can be considered in future work. Moreover, it would be interesting to evaluate the proposed model on how accurately it infers these facts (rather than only on the quality of the induced latent representations). The work presented in this paper can also be combined with the approach of Titov and Khoddam (2015) to induce both relations and semantic roles (i.e., essentially to induce semantic frames (Fillmore, 1976)). Another potential direction is the use of labeled data: our feature-rich model (namely its discriminative encoding component) is likely to have much better asymptotic performance than its generative counterpart, and, consequently, labeled data should be much more beneficial.

[5] github.com/diegma/relation-autoencoder

Acknowledgments

This work is supported by NWO Vidi Grant 016.153.327, Google Focused Award on Natural Language Understanding and partially supported by ISTI Grant for Young Mobility. The authors thank the action editor and the anonymous reviewers for their valuable suggestions and Limin Yao for answering our questions about data and baselines.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In 5th ACM Conference on Digital Libraries.

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In NIPS.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. Dbpedia: A nucleus for a web of open data. In 6th International Semantic Web Conference (ISWC).

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In LREC.

Michele Banko and Oren Etzioni. 2008. The tradeoffs between open and traditional relation extraction. In ACL.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI.
David S. Batista, Bruno Martins, and Mário J. Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In EMNLP.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In HLT-NAACL.

Julian Besag. 1975. Statistical analysis of non-lattice data. The Statistician, pages 179–195.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD.

Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In AAAI.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Irreflexive and hierarchical relations as translations. In Structured Learning: Inferring Graphs from Structured and Unstructured Inputs (SLG-ICML).

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Journal of Machine Learning, 94(2):233–259.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In ICLR.

Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases Workshop (WebDB).

Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP.

Hal Daumé III. 2009. Unsupervised search-based structured prediction. In ICML.

Oier Lopez de Lacalle and Mirella Lapata. 2013. Unsupervised relation extraction with general domain knowledge. In EMNLP.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Charles J. Fillmore. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences, 280(1):20–32.

Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL.

Kuzman Ganchev, Joao Graca, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049.

Alberto García-Durán, Antoine Bordes, and Nicolas Usunier. 2014. Effective blending of two and three-way interactions for modeling multi-relational data. In European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD).

Lise Getoor and Ben Taskar. 2007. Introduction to statistical relational learning. MIT Press.

Geoffrey E. Hinton. 1989. Connectionist learning procedures. Artificial Intelligence, 40(1-3):185–234.

Tommi S. Jaakkola and Michael I. Jordan. 1996. Computing upper and lower bounds on likelihoods in intractable networks. In 12th Annual Conference on Uncertainty in Artificial Intelligence (UAI).

Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. 2012. A latent factor model for highly multi-relational data. In NIPS.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.
Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In NIPS.

Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In SIGKDD.

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori S. Levin. 2015a. Unsupervised POS induction with word embeddings. In NAACL HLT.

Yankai Lin, Zhiyuan Liu, Huan-Bo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015b. Modeling relation paths for representation learning of knowledge bases. In EMNLP.

Xitong Liu, Fei Chen, Hui Fang, and Min Wang. 2014. Exploiting entity relationship for query expansion in enterprise search. Information Retrieval, 17(3):265–294.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.

Ndapandula Nakashole and Tom M. Mitchell. 2014. Micro reading with priors: Towards second generation machine readers. In AKBC at NIPS.

Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In ACL.

Hoifung Poon and Pedro M. Domingos. 2009. Unsupervised semantic parsing. In EMNLP.

Deepak Ravichandran and Eduard H. Hovy. 2002. Learning surface text patterns for a question answering system. In ACL.

Philip Resnik. 1997. Selectional preference and sense disambiguation. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In ECML-PKDD.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Yusuke Shinyama and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In NAACL HLT.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In WWW.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP-CoNLL.

Ilya Sutskever, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2009. Modelling relational data using Bayesian clustered tensor factorization. In NIPS.

Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In EMNLP.

Ivan Titov and Ehsan Khoddam. 2015. Unsupervised induction of semantic roles within a reconstruction-error minimization framework. In NAACL.

Ivan Titov and Alexandre Klementiev. 2011. A Bayesian model for unsupervised semantic parsing. In ACL.

Ledyard R. Tucker. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.
Tim Van de Cruys. 2010. A non-negative tensor factorization model for selectional preference induction. Journal of Natural Language Engineering, 16(4):417–437.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In ICML.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. In EMNLP.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In EMNLP.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In ACL.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP.