A Latent Variable Model Approach to PMI-based Word Embeddings

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
Computer Science Department, Princeton University
35 Olden St, Princeton, NJ 08540
{arora,yuanzhil,yingyul,tengyu,risteski}@cs.princeton.edu

Abstract

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods.

This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The methodological novelty is to use the prior to compute closed form expressions for word statistics. This provides a theoretical justification for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter choices. It also helps explain why low-dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013a) and many subsequent papers.

Experimental support is provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.

1 Introduction

Vector representations of words (word embeddings) try to capture relationships between words as distance or angle, and have many applications in computational linguistics and machine learning. They are constructed by various models whose unifying philosophy is that the meaning of a word is defined by “the company it keeps” (Firth, 1957), namely, co-occurrence statistics. The simplest methods use word vectors that explicitly represent co-occurrence statistics. Reweighting heuristics are known to improve these methods, as is dimension reduction (Deerwester et al., 1990). Some reweighting methods are nonlinear, which include taking the square root of co-occurrence counts (Rohde et al., 2006), or the logarithm, or the related Pointwise Mutual Information (PMI) (Church and Hanks, 1990). These are collectively referred to as Vector Space Models, surveyed in (Turney and Pantel, 2010).

Neural network language models (Rumelhart et al., 1986; Rumelhart et al., 1988; Bengio et al., 2006; Collobert and Weston, 2008a) propose another way to construct embeddings: the word vector is simply the neural network’s internal representation for the word. This method is nonlinear and nonconvex. It was popularized via word2vec, a family of energy-based models in (Mikolov et al., 2013b; Mikolov et al., 2013c), followed by a matrix factorization approach called GloVe (Pennington et al., 2014). The first paper also showed how to solve analogies using linear algebra on word embeddings. Experiments and theory were used to suggest that these newer methods are related to the older PMI-based models, but with new hyperparameters and/or term reweighting methods (Levy and Goldberg, 2014b).

But note that even the old PMI method is a bit mysterious. The simplest version considers a symmetric matrix with each row/column indexed by a word. The entry for (w, w′) is PMI(w, w′) = log [p(w, w′) / (p(w) p(w′))], where p(w, w′) is the empirical probability of words w, w′ appearing within a window of certain size in the corpus, and p(w) is the marginal probability of w. (More complicated models could use asymmetric matrices with columns corresponding to context words or phrases, and also involve tensorization.) Then word vectors are obtained by low-rank SVD on this matrix, or a related matrix with term reweightings. In particular, the PMI matrix is found to be closely approximated by a low rank matrix: there exist word vectors in say 300 dimensions, which is much smaller than the number of words in the dictionary, such that

    ⟨v_w, v_w′⟩ ≈ PMI(w, w′)    (1.1)

where ≈ should be interpreted loosely.
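To make (1.1) concrete, the sketch below (Python/NumPy) follows the classical pipeline just described: form a PMI matrix from co-occurrence counts and take a rank-d SVD. It is only an illustration under assumed inputs: the 3-word vocabulary, the toy count matrix, and the choice d = 2 are placeholders, and real systems additionally apply reweightings and handle zero counts (e.g., by clipping to positive PMI), which are omitted here.

    import numpy as np

    # Toy symmetric co-occurrence counts: counts[i, j] = number of times words i and j
    # co-occur within a window. In practice this is accumulated from a corpus; the
    # 3-word vocabulary and the numbers below are placeholders for illustration.
    counts = np.array([[10.0,  4.0,  1.0],
                       [ 4.0, 12.0,  3.0],
                       [ 1.0,  3.0,  8.0]])

    total = counts.sum()
    p_joint = counts / total                 # empirical p(w, w')
    p_marg = counts.sum(axis=1) / total      # marginal p(w)

    # PMI(w, w') = log p(w, w') - log p(w) - log p(w')
    pmi = np.log(p_joint) - np.log(np.outer(p_marg, p_marg))

    # Rank-d factorization via SVD. Word vectors U_d sqrt(S_d) and context vectors
    # V_d sqrt(S_d) give the best rank-d approximation <v_w, c_w'> ~ PMI(w, w');
    # since this PMI matrix is symmetric, the two factors agree up to column signs,
    # which is why (1.1) can be written with a single set of vectors.
    d = 2
    U, S, Vt = np.linalg.svd(pmi)
    word_vecs = U[:, :d] * np.sqrt(S[:d])
    ctx_vecs = Vt[:d].T * np.sqrt(S[:d])

    approx = word_vecs @ ctx_vecs.T
    print("max |approx - PMI|:", np.abs(approx - pmi).max())

In practice the dictionary has tens of thousands of words while d is a few hundred (e.g., 300, as mentioned above), so the fit in (1.1) is genuinely low rank.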
There appears to be no theoretical explanation for this empirical finding about the approximate low rank of the PMI matrix. The current paper addresses this.

Specifically, we propose a probabilistic model of text generation that augments the log-linear topic model of Mnih and Hinton (2007) with dynamics, in the form of a random walk over a latent discourse space. The chief methodological contribution is using the model priors to analytically derive a closed-form expression that directly explains (1.1); see Theorem 2.2 in Section 2. Section 3 builds on this insight to give a rigorous justification for models such as word2vec and GloVe, including the hyperparameter choices for the latter. The insight also leads to a mathematical explanation for why these word embeddings allow analogies to be solved using linear algebra; see Section 4. Section 5 shows good empirical fit to this model’s assumptions and predictions, including the surprising one that word vectors are pretty uniformly distributed (isotropic) in space.

1.1 Related work

Latent variable probabilistic models of language have been used for word embeddings before, including Latent Dirichlet Allocation (LDA) and its more complicated variants (see the survey (Blei, 2012)), and some neurally inspired nonlinear models (Mnih and Hinton, 2007; Maas et al., 2011). In fact, LDA evolved out of efforts in the 1990s to provide a generative model that “explains” the success of older vector space methods like Latent Semantic Indexing (Papadimitriou et al., 1998; Hofmann, 1999). However, none of these earlier generative models has been linked to PMI models.

Levy and Goldberg (2014b) tried to relate word2vec to PMI models. They showed that if there were no dimension constraint in word2vec, specifically, the “skip-gram with negative sampling (SGNS)” version of the model, then its solutions would satisfy (1.1), provided the right hand side were replaced by PMI(w, w′) − β for some scalar β. However, skip-gram is a discriminative model (due to the use of negative sampling), not generative. Furthermore, their argument only applies to very high-dimensional word embeddings, and thus does not address low-dimensional embeddings, which have superior quality in applications.

Hashimoto et al. (2016) focuses on issues similar to our paper. They model text generation as a random walk on words, which are assumed to be embedded as vectors in a geometric space. Given that the last word produced was w, the probability that the next word is w′ is assumed to be given by h(|v_w − v_w′|²) for a suitable function h, and this model leads to an explanation of (1.1). By contrast our random walk involves a latent discourse vector, which has a clearer semantic interpretation and has proven useful in subsequent work, e.g. understanding structure of word embeddings for polysemous words (Arora et al., 2016).
Also our work clarifies some weighting and bias terms in the training objectives of previous methods (Section 3) and also the phenomenon discussed in the next paragraph.

Researchers have tried to understand why vectors obtained from the highly nonlinear word2vec models exhibit linear structures (Levy and Goldberg, 2014a; Pennington et al., 2014). Specifically, for analogies like “man:woman::king:??,” queen happens to be the word whose vector v_queen is the most similar to the vector v_king − v_man + v_woman. This suggests that simple semantic relationships, such as masculine vs feminine tested in the above example, correspond approximately to a single direction in space, a phenomenon we will henceforth refer to as RELATIONS=LINES. (A minimal code sketch of this analogy-solving procedure is given at the end of this introduction.)

Section 4 surveys earlier attempts to explain this phenomenon and their shortcoming, namely, that they ignore the large approximation error in relationships like (1.1). This error appears larger than the difference between the best solution and the second best (incorrect) solution in analogy solving, so that this error could in principle lead to a complete failure in analogy solving. In our explanation, the low dimensionality of the word vectors plays a key role. This can also be seen as a theoretical explanation of the old observation that dimension reduction improves the quality of word embeddings for various tasks. The intuitive explanation often given—that smaller models generalize better—turns out to be fallacious, since the training method for creating embeddings makes no reference to analogy solving. Thus there is no a priori reason why low-dimensional model parameters (i.e., lower model capacity) should lead to better performance in analogy solving, just as there is no reason they are better at some other unrelated task like predicting the weather.

1.2 Benefits of generative approaches

In addition to giving some form of “unification” of existing methods, our generative model also brings more interpretability to word embeddings beyond traditional cosine similarity and even analogy solving. For example, it led to an understanding of how the different senses of a polysemous word (e.g., bank) reside in linear superposition within the word embedding (Arora et al., 2016). Such insight into embeddings may prove useful in the numerous settings in NLP and neuroscience where they are used.

Another new explanatory feature of our model is that low dimensionality of word embeddings plays a key theoretical role—unlike in previous papers where the model is agnostic about the dimension of the embeddings, and the superiority of low-dimensional embeddings is an empirical finding (starting with Deerwester et al. (1990)). Specifically, our theoretical analysis makes the key assumption that the set of all word vectors (which are latent variables of the generative model) are spatially isotropic, which means that they have no preferred direction in space. Having n vectors be isotropic in d dimensions requires d ≪ n. This isotropy is needed in the calculations (i.e., multidimensional integral) that yield (1.1). It also holds empirically for our word vectors, as shown in Section 5.

The isotropy of low-dimensional word vectors also plays a key role in our explanation of the RELATIONS=LINES phenomenon (Section 4). The isotropy has a “purification” effect that mitigates the effect of the (rather large) approximation error in the PMI models.
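As a concrete illustration of the RELATIONS=LINES phenomenon from Section 1.1, the sketch below implements the standard vector-offset procedure: to answer “man:woman::king:??”, return the vocabulary word (excluding the query words) whose vector has the highest cosine similarity to v_king − v_man + v_woman. The embeddings dictionary and its hand-made vectors are placeholders for illustration only; they are not trained vectors and not an interface of word2vec, GloVe, or the model introduced next.

    import numpy as np

    def solve_analogy(embeddings, a, b, c):
        """Answer "a is to b as c is to ?" by returning the word x (excluding a, b, c)
        that maximizes cos(v_x, v_c - v_a + v_b)."""
        target = embeddings[c] - embeddings[a] + embeddings[b]
        target = target / np.linalg.norm(target)

        best_word, best_score = None, -np.inf
        for word, vec in embeddings.items():
            if word in (a, b, c):
                continue
            score = np.dot(vec, target) / np.linalg.norm(vec)  # cosine similarity
            if score > best_score:
                best_word, best_score = word, score
        return best_word

    # Tiny hand-made example; the vectors are illustrative, not trained.
    emb = {
        "man":   np.array([1.0, 0.0, 0.2]),
        "woman": np.array([1.0, 1.0, 0.2]),
        "king":  np.array([1.0, 0.0, 1.0]),
        "queen": np.array([1.0, 1.0, 1.0]),
        "apple": np.array([0.0, 0.1, 0.0]),
    }
    print(solve_analogy(emb, "man", "woman", "king"))  # expected: "queen"

With trained low-dimensional embeddings, this is the procedure analyzed in Section 4, where we explain why the large approximation error in (1.1) does not derail it.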
2 Generative model and its properties

The model treats corpus generation as a dynamic process, where the t-th word is produced at step t. The process is driven by the random walk of a discourse vector c_t ∈