Word Embeddings as Metric Recovery in Semantic Spaces

Tatsunori B. Hashimoto, David Alvarez-Melis and Tommi S. Jaakkola
CSAIL, Massachusetts Institute of Technology
{thashim, davidam, tommi}@csail.mit.edu

Abstract

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with a Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.

1 Introduction

Continuous space models of words, objects, and signals have become ubiquitous tools for learning rich representations of data, from natural language processing to computer vision. Specifically, there has been particular interest in word embeddings, largely due to their intriguing semantic properties (Mikolov et al., 2013b) and their success as features for downstream natural language processing tasks, such as named entity recognition (Turian et al., 2010) and parsing (Socher et al., 2013).

The empirical success of word embeddings has prompted a parallel body of work that seeks to better understand their properties, associated estimation algorithms, and explore possible revisions. Recently, Levy and Goldberg (2014a) showed that linear linguistic regularities first observed with word2vec extend to other embedding methods. In particular, explicit representations of words in terms of co-occurrence counts can be used to solve analogies in the same way. In terms of algorithms, Levy and Goldberg (2014b) demonstrated that the global minimum of the skip-gram method with negative sampling of Mikolov et al. (2013b) implicitly factorizes a shifted version of the pointwise mutual information (PMI) matrix of word-context pairs. Arora et al. (2015) explored links between random walks and word embeddings, relating them to contextual (probability ratio) analogies, under specific (isotropic) assumptions about word vectors.

In this work, we take semantic spaces studied in the cognitive-psychometric literature as the prototypical objects that word embedding algorithms estimate. Semantic spaces are vector spaces over concepts where Euclidean distances between points are assumed to indicate semantic similarities. We link such semantic spaces to word co-occurrences through semantic similarity assessments, and demonstrate that the observed co-occurrence counts indeed possess statistical properties that are consistent with an underlying Euclidean space where distances are linked to semantic similarity.
[Figure 1: Inductive reasoning in semantic space proposed in Sternberg and Gardner (1983). A, B, C are given, I is the ideal point and D are the choices. The correct answer is shaded green.]

Formally, we view word embedding methods as performing metric recovery. This perspective is significantly different from current approaches. Instead of aiming for representations that exhibit specific semantic properties or that perform well at a particular task, we seek methods that recover the underlying metric of the hypothesized semantic space. The clearer foundation afforded by this perspective enables us to analyze word embedding algorithms in a principled task-independent fashion. In particular, we ask whether word embedding algorithms are able to recover the metric under specific scenarios. To this end, we unify existing word embedding algorithms as statistically consistent metric recovery methods under the theoretical assumption that co-occurrences arise from (metric) random walks over semantic spaces. The new setting also suggests a simple and direct recovery algorithm which we evaluate and compare against other embedding methods.

The main contributions of this work can be summarized as follows:

• We ground word embeddings in semantic spaces via log co-occurrence counts. We show that PMI (pointwise mutual information) relates linearly to human similarity assessments, and that nearest-neighbor statistics (centrality and reciprocity) are consistent with a Euclidean space hypothesis (Sections 2 and 3).

• In contrast to prior work (Arora et al., 2015), we take metric recovery as the key object of study, unifying existing algorithms as consistent metric recovery methods based on co-occurrence counts from simple Markov random walks over graphs and manifolds. This strong link to manifold estimation opens a promising direction for extensions of word embedding methods (Sections 4 and 5).

• We propose and evaluate a new principled direct metric recovery algorithm that performs comparably to the existing state of the art on both word embedding and manifold learning tasks, and show that GloVe (Pennington et al., 2014) is closely related to the second-order Taylor expansion of our objective.

• We construct and make available two new inductive reasoning datasets[1]—series completion and classification—to extend the evaluation of word representations beyond analogies, and demonstrate that these tasks can be solved with vector operations on word embeddings as well (examples in Table 1).

[1] http://web.mit.edu/thashim/www/supplement materials.zip

2 Word vectors and semantic spaces

Most current word embedding algorithms build on the distributional hypothesis (Harris, 1954) where similar contexts imply similar meanings so as to tie co-occurrences of words to their underlying meanings. The relationship between semantics and co-occurrences has also been studied in psychometrics and cognitive science (Rumelhart and Abrahamson, 1973; Sternberg and Gardner, 1983), often by means of free word association tasks and semantic spaces.
The semantic spaces, in particular, provide a natural conceptual framework for continuous representations of words as vector spaces where semantically related words are close to each other. For example, the observation that word embeddings can solve analogies was already shown by Rumelhart and Abrahamson (1973) using vector representations of words derived from surveys of pairwise word similarity judgments.

Table 1: Examples of the three inductive reasoning tasks proposed by Sternberg and Gardner (1983).

Task            Prompt                         Answer
Analogy         king:man::queen:?              woman
Series          penny:nickel:dime:?            quarter
Classification  horse:zebra:{deer, dog, fish}  deer

A fundamental question regarding vector space models of words is whether a Euclidean vector space is a valid representation of semantic concepts. There is substantial empirical evidence in favor of this hypothesis. For example, Rumelhart and Abrahamson (1973) showed experimentally that analogical problem solving with fictitious words and human mistake rates were consistent with a Euclidean space. Sternberg and Gardner (1983) provided further evidence supporting this hypothesis, proposing that general inductive reasoning was based upon operations in metric embeddings. Using the analogy, series completion and classification tasks shown in Table 1 as testbeds, they proposed that subjects solve these problems by finding the word closest (in semantic space) to an ideal point: the vertex of a parallelogram for analogies, a displacement from the last word in series completion, and the centroid in the case of classification (Figure 1; a minimal sketch of these rules appears at the end of this section).

We use semantic spaces as the prototypical structures that word embedding methods attempt to uncover, and we investigate the suitability of word co-occurrence counts for doing so. In the next section, we show that co-occurrences from large corpora indeed relate to semantic similarity assessments, and that the resulting metric is consistent with a Euclidean semantic space hypothesis.
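To make the ideal-point picture concrete, the following is a minimal sketch of the three rules from Figure 1 and Table 1 (the exact definitions appear in Appendix C.2). It assumes word vectors are available in a dictionary `vec` mapping words to NumPy arrays; `vec` and the helper names are illustrative stand-ins, not part of any released code.

```python
import numpy as np

# Minimal sketch of the ideal-point rules, assuming `vec` maps words to
# d-dimensional numpy arrays (any trained embedding would do).

def analogy_ideal(vec, a, b, c):
    # A:B::C:?  ->  vertex of the parallelogram, B - A + C
    return vec[b] - vec[a] + vec[c]

def series_ideal(vec, words):
    # w_1 : ... : w_n  ->  displacement past the last word (Appendix C.2)
    w = [vec[x] for x in words]
    return w[-1] + (w[-1] - w[0]) / len(w)

def classification_ideal(vec, words):
    # category members  ->  centroid
    return np.mean([vec[x] for x in words], axis=0)

def closest(vec, ideal, candidates):
    # answer = candidate nearest to the ideal point (L2 distance here)
    return min(candidates, key=lambda w: np.linalg.norm(vec[w] - ideal))
```

For instance, `closest(vec, analogy_ideal(vec, "king", "man", "queen"), options)` implements the parallelogram rule for the first row of Table 1.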
3 The semantic space of log co-occurrences

Most word embedding algorithms are based on word co-occurrence counts. In order for such methods to uncover an underlying Euclidean semantic space, we must demonstrate that co-occurrences themselves are indeed consistent with some semantic space. We must relate co-occurrences to semantic similarity assessments, on one hand, and show that they can be embedded into a Euclidean metric space, on the other. We provide here empirical evidence for both of these premises.

We commence by demonstrating in Figure 2 that the pointwise mutual information (Church and Hanks, 1990) evaluated from co-occurrence counts has a strong linear relationship with semantic similarity judgments from survey data (Pearson's r=0.75).[2] However, this suggestive linear relationship does not by itself demonstrate that log co-occurrences (with normalization) can be used to define a Euclidean metric space.

[Figure 2: Normalized log co-occurrence (PMI) linearly correlates with human semantic similarity judgments (MEN survey).]

[2] Normalizing the log co-occurrence with the unigram frequency taken to the 3/4th power maximizes the linear correlation in Figure 2, explaining this choice of normalization in prior work (Levy and Goldberg, 2014a; Mikolov et al., 2013b).

Earlier psychometric studies have asked whether human semantic similarity evaluations are consistent with a Euclidean space. For example, Tversky and Hutchinson (1986) investigate whether concept representations are consistent with the geometric sampling (GS) model: a generative model in which points are drawn independently from a continuous distribution in a Euclidean space. They use two nearest neighbor statistics to test agreement with this model, and conclude that certain hierarchical vocabularies are not consistent with a Euclidean embedding. Similar results are observed by Griffiths et al. (2007). We extend this embeddability analysis to lexical co-occurrences and show that semantic similarity estimates derived from these are mostly consistent with a Euclidean space hypothesis.

The first test statistic for the GS model, the centrality C, is defined as

$$C = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{n} N_{ij}\Big)^2$$

where $N_{ij} = 1$ iff $i$ is $j$'s nearest neighbor. Under the GS model (i.e. when the words are consistent with a Euclidean space representation), $C \le 2$ with high probability as the number of words $n \to \infty$, regardless of the dimension or the underlying density (Tversky and Hutchinson, 1986). For metrically embeddable data, typical non-asymptotic values of C range between 1 and 3, while non-embeddable hierarchical structures have C > 10.

The second statistic, the reciprocity fraction Rf (Schwarz and Tversky, 1980; Tversky and Hutchinson, 1986), is defined as

$$R_f = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} N_{ij}N_{ji}$$

and measures the fraction of words that are their nearest neighbor's nearest neighbor. Under the GS model, this fraction should be greater than 0.5.[3]

[3] Both Rf and C are asymptotically dimension independent because they rely only on the single nearest neighbor. Estimating the latent dimensionality requires other measures and assumptions (Kleindessner and von Luxburg, 2015).

Table 2: Semantic similarity data derived from multiple sources show evidence of embeddability.

Corpus            C     Rf
Free association  1.51  0.48
Wikipedia corpus  2.21  0.63
Word2vec corpus   2.24  0.73
GloVe corpus      2.66  0.62

Table 2 shows the two statistics computed on three popular large corpora and a free word association dataset (see Section 6 for details). The nearest neighbor calculations are based on PMI. The results show a surprisingly high agreement across all three corpora, with C and Rf contained in small intervals: C ∈ [2.21, 2.66] and Rf ∈ [0.62, 0.73]. These results are consistent with Euclidean semantic spaces and the GS model in particular. The largest violators of C and Rf are consistent with Tversky's analysis: the word with the largest centrality in the non-stopword Wikipedia corpus is 'the', whose inclusion would increase C to 3.46 compared to 2.21 without it. Tversky's original analysis of semantic similarities argued that certain words, such as superordinate and function words, could not be embedded. Despite such specific exceptions, we find that for an appropriately normalized corpus, the majority of words are consistent with the GS model, and therefore can be represented meaningfully as vectors in Euclidean space.

The results of this section are an important step towards justifying the use of word co-occurrence counts as the central object of interest for semantic vector representations of words. We have shown that they are empirically related to a human notion of semantic similarity and that they are metrically embeddable, a desirable condition if we expect word vectors derived from them to truly behave as elements of a metric space. This, however, does not yet fully justify their use to derive semantic representations. The missing piece is to formalize the connection between these co-occurrence counts and some intrinsic notion of semantics, such as the semantic spaces described in Section 2. In the next two sections, we establish this connection by framing word embedding algorithms that operate on co-occurrences as metric recovery methods.
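For reference, the two embeddability statistics used above take only a few lines to compute. The sketch below assumes a dense (n, n) PMI matrix `sim`; on a 100K-word vocabulary one would instead restrict to a subset or use sparse nearest-neighbor search.

```python
import numpy as np

# Centrality C and reciprocity fraction Rf of Tversky and Hutchinson,
# computed from a similarity matrix `sim` (e.g. PMI). N_ij = 1 iff i is
# j's nearest neighbor, so sum_j N_ij counts how often i is someone's NN.

def gs_statistics(sim):
    n = sim.shape[0]
    s = sim.astype(float).copy()
    np.fill_diagonal(s, -np.inf)            # exclude self-neighbors
    nn = s.argmax(axis=1)                   # nearest neighbor of each word
    counts = np.bincount(nn, minlength=n)   # how often each word is an NN
    centrality = np.mean(counts.astype(float) ** 2)        # C
    reciprocity = np.mean(nn[nn] == np.arange(n))           # Rf
    return centrality, reciprocity
```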
4 Semantic spaces and manifolds

We take a broader, unified view on metric recovery of semantic spaces since the notion of semantic spaces and the associated parallelogram rule for analogical reasoning extend naturally to objects other than words. For example, images can be approximately viewed as points in a Euclidean semantic space by representing them in terms of their underlying degrees of freedom (e.g. orientation, illumination). Thus, questions about the underlying semantic spaces and how they can be recovered should be related.

The problem of recovering an intrinsic Euclidean coordinate system over objects has been specifically addressed in manifold learning. For example, methods such as Isomap (Tenenbaum et al., 2000) reconstitute a Euclidean space over objects (when possible) based on local comparisons. Intuitively, these methods assume that naive distance metrics such as the L2 distance over pixels in an image may be meaningful only when images are very similar. Longer distances between objects are evaluated through a series of local comparisons. These longer distances—geodesic distances over the manifold—can be approximated by shortest paths on a neighborhood graph. If we view the geodesic distances on the manifold (represented as a graph) as semantic distances, then the goal is to isometrically embed these distances into a Euclidean space. Tenenbaum (1998) showed that such isometric embeddings of image manifolds can be used to solve "visual analogies" via the parallelogram rule.

Typical approaches to manifold learning as discussed above differ from word embedding in terms of how the semantic distances between objects are extracted. Word embeddings approximate semantic distances between words using the negative log co-occurrence counts (Section 3), while manifold learning approximates semantic distances using neighborhood graphs built from local comparisons of the original, high-dimensional points. Both views seek to estimate a latent geodesic distance.

In order to study the problem of metric recovery from co-occurrence counts, and to formalize the connection between word embedding and manifold learning, we introduce a simple random walk model over the underlying objects (e.g. words or images). This toy model permits us to establish clean consistency results for recovery algorithms. We emphasize that while the random walk is introduced over the words, it is not intended as a model of language but rather as a tool to understand the recovery problem.

4.1 Random walk model

Consider now a simple metric random walk $X_t$ over words where the probability of transitioning from word $i$ to word $j$ is given by

$$P(X_t = j \mid X_{t-1} = i) = h\big(\tfrac{1}{\sigma}\|x_i - x_j\|_2^2\big) \qquad (1)$$

Here $\|x_i - x_j\|_2^2$ is the Euclidean distance between words in the underlying semantic space to be recovered, and $h$ is some unknown, sub-Gaussian function linking semantic similarity to co-occurrence.[4]

[4] This toy model ignores the role of syntax and function words, but these factors can be included as long as the moment bounds originally derived in Hashimoto et al. (2015b) remain fulfilled.

Under this model, the log frequency of occurrences of word $j$ immediately after word $i$ will be proportional to $\log(h(\|x_i - x_j\|_2^2/\sigma))$ as the corpus size grows large. Here we make the surprising observation that if we consider co-occurrences over a sufficiently large window, the log co-occurrence instead converges to $-\|x_i - x_j\|_2^2/\sigma$, i.e. it directly relates to the underlying metric. Intuitively, this result is an analog of the central limit theorem for random walks. Note that, for this reason, we do not need to know the link function $h$.
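The toy model is easy to simulate. The sketch below assumes a Gaussian link $h(u) = \exp(-u)$ and randomly drawn latent coordinates, both illustrative choices on our part; the point is only to generate corpora of the kind the following lemma reasons about.

```python
import numpy as np

# Toy simulation of the metric random walk in Equation 1, assuming a
# Gaussian link h(u) = exp(-u) and randomly drawn latent coordinates X.
# Rows are normalized so the transition probabilities sum to one.

rng = np.random.default_rng(0)
n, d, sigma = 200, 2, 0.5
X = rng.normal(size=(n, d))                        # latent semantic coordinates

sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
P = np.exp(-sq_dist / sigma)                       # h(||x_i - x_j||^2 / sigma)
P /= P.sum(axis=1, keepdims=True)                  # transition matrix

def sample_walk(start, length):
    """Record one 'sentence': a random walk over word indices."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(n, p=P[walk[-1]]))
    return walk

corpus = [sample_walk(i, 200) for i in range(n)]   # one sentence per start word
```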
Formally, given an $m$-token corpus consisting of sentences generated according to Equation 1 from a vocabulary of size $n$, let $C^{m,n}_{ij}(t_n)$ be the number of times word $j$ occurs $t_n$ steps after word $i$ in the corpus.[5] We can show that there exist unigram normalizers $a^{m,n}_i$, $b^{m,n}_j$ such that the following holds:

Lemma 1. Given a corpus generated by Equation 1 there exist $a_i$ and $b_j$ such that simultaneously over all $i,j$:

$$-\log(C^{m,n}_{ij}(t_n)) - a^{m,n}_i - b^{m,n}_j \to \|x_i - x_j\|_2^2 \quad \text{as } m,n \to \infty.$$

We defer the precise statement and conditions of Lemma 1 to Corollary 6. Conceptually, this limiting[6] result captures the intuition that while one-step transitions in a sentence may be complex and include non-metric structure expressed in $h$, co-occurrences over large windows relate directly to the latent semantic metric. For ease of notation, we henceforth omit the corpus and vocabulary size descriptors $m,n$ (using $C_{ij}$, $a_i$, and $b_j$ in place of $C^{m,n}_{ij}(t_n)$, $a^{m,n}_i$, and $b^{m,n}_j$), since in practice the corpus is large but fixed.

[5] The window-size $t_n$ depends on the vocabulary size to ensure that all word pairs have nonzero co-occurrence counts in the limit of large vocabulary and corpus. For details see the definition of $g_n$ in Appendix A.

[6] In Lemma 1, we take $m \to \infty$ (growing corpus size) to ensure all word pairs appear sufficiently often, and $n \to \infty$ (growing vocabulary) to ensure that every point in the semantic space has a nearby word.

Lemma 1 serves as the basis for establishing consistency of recovery for word embedding algorithms (next section). It also allows us to establish a precise link between manifold learning and word embedding, which we describe in the remainder of this section.

4.2 Connection to manifold learning

Let $\{v_1, \ldots, v_n\} \in \mathbb{R}^D$ be points drawn i.i.d. from a density $p$, where $D$ is the dimension of observed inputs (e.g. number of pixels, in the case of images), and suppose that these points lie on a manifold $\mathcal{M} \subset \mathbb{R}^D$ that is isometrically embeddable into $d < D$ dimensions, where $d$ is the intrinsic dimensionality of the data (e.g. coordinates representing illumination or camera angle in the case of images). The problem of manifold learning consists of finding an embedding of $v_1, \ldots, v_n$ into $\mathbb{R}^d$ that preserves the structure of $\mathcal{M}$ by approximately preserving the distances between points along this manifold. In light of Lemma 1, this problem can be solved with word embedding algorithms in the following way (a minimal sketch of the first two steps follows the list):

1. Construct a neighborhood graph (e.g. connecting points within a distance ε) over $\{v_1, \ldots, v_n\}$.

2. Record the vertex sequence of a simple random walk over these graphs as a sentence, and concatenate these sequences initialized at different nodes into a corpus.

3. Use a word embedding method on this corpus to generate d-dimensional vector representations of the data.
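A minimal sketch of steps 1 and 2, assuming the inputs are rows of an array `V`; the function name and its defaults (k = 20, 10 walks of length 200, matching the MNIST experiment in Section 6.3) are illustrative rather than the authors' released code. The returned sentences can be handed to any method from Section 5, treating vertex ids as words.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Turn high-dimensional points into random-walk "sentences".

def walk_corpus(V, k=20, walks_per_node=10, walk_len=200, seed=0):
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    # 1. k-nearest-neighbor graph (each vertex keeps its k neighbors)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(V)
    _, idx = nbrs.kneighbors(V)
    neighbors = idx[:, 1:]                      # drop the self-match
    # 2. simple random walks recorded as sentences of vertex ids
    sentences = []
    for start in range(n):
        for _ in range(walks_per_node):
            walk, cur = [start], start
            for _ in range(walk_len - 1):
                cur = rng.choice(neighbors[cur])
                walk.append(cur)
            sentences.append([str(w) for w in walk])
    return sentences
```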
Under the conditions of Lemma 1, the negative log co-occurrences over the vertices of the neighborhood graph will converge, as $n \to \infty$, to the geodesic distance (squared) over the manifold. In this case we will show that the globally optimal solutions of word embedding algorithms recover the low dimensional embedding (Section 5).[7]

[7] This approach of applying random walks and word embeddings to general graphs has already been shown to be surprisingly effective for social networks (Perozzi et al., 2014), and demonstrates that word embeddings serve as a general way to connect metric random walks to embeddings.

5 Recovering semantic distances with word embeddings

We now show that, under the conditions of Lemma 1, three popular word embedding methods can be viewed as doing metric recovery from co-occurrence counts. We use this observation to derive a new, simple word embedding method inspired by Lemma 1.

5.1 Word embeddings as metric recovery

GloVe The Global Vectors (GloVe) (Pennington et al., 2014) method for word embedding optimizes the objective function

$$\min_{\hat x, \hat c, a, b} \sum_{i,j} f(C_{ij})\big(2\langle \hat x_i, \hat c_j\rangle + a_i + b_j - \log(C_{ij})\big)^2$$

with $f(C_{ij}) = \min(C_{ij}, 100)^{3/4}$. If we rewrite the bias terms as $a_i = \hat a_i - \|\hat x_i\|_2^2$ and $b_j = \hat b_j - \|\hat c_j\|_2^2$, we obtain the equivalent representation:

$$\min_{\hat x, \hat c, \hat a, \hat b} \sum_{i,j} f(C_{ij})\big(-\log(C_{ij}) - \|\hat x_i - \hat c_j\|_2^2 + \hat a_i + \hat b_j\big)^2.$$

Together with Lemma 1, we recognize this as a weighted multidimensional scaling (MDS) objective with weights $f(C_{ij})$. Splitting the word vector $\hat x_i$ and context vector $\hat c_i$ is helpful in practice but not necessary under the assumptions of Lemma 1, since the true embedding $\hat x_i = \hat c_i = x_i/\sigma$ and $\hat a_i, \hat b_i = 0$ is a global minimum whenever $\dim(\hat x) = d$. In other words, GloVe can recover the true metric provided that we set $d$ correctly.

word2vec The skip-gram model of word2vec approximates a softmax objective:

$$\min_{\hat x, \hat c} \sum_{i,j} C_{ij} \log\left(\frac{\exp(\langle \hat x_i, \hat c_j\rangle)}{\sum_{k=1}^{n} \exp(\langle \hat x_i, \hat c_k\rangle)}\right).$$

Without loss of generality, we can rewrite the above with a bias term $b_j$ by making $\dim(\hat x) = d+1$ and setting one of the dimensions of $\hat x$ to 1. By redefining the bias $\hat b_j = b_j - \|\hat c_j\|_2^2/2$, we see that word2vec solves

$$\min_{\hat x, \hat c, \hat b} \sum_{i,j} C_{ij} \log\left(\frac{\exp(-\frac12\|\hat x_i - \hat c_j\|_2^2 + \hat b_j)}{\sum_{k=1}^{n}\exp(-\frac12\|\hat x_i - \hat c_k\|_2^2 + \hat b_k)}\right).$$

Since according to Lemma 1 $C_{ij}/\sum_{k=1}^n C_{ik}$ approaches $\frac{\exp(-\|x_i - x_j\|_2^2/\sigma^2)}{\sum_{k=1}^n \exp(-\|x_i - x_k\|_2^2/\sigma^2)}$, this is the stochastic neighbor embedding (SNE) (Hinton and Roweis, 2002) objective weighted by $\sum_{k=1}^n C_{ik}$. The global optimum is achieved by $\hat x_i = \hat c_i = x_i(\sqrt{2}/\sigma)$ and $\hat b_j = 0$ (see Theorem 8). The negative sampling approximation used in practice behaves much like the SVD approach of Levy and Goldberg (2014b), and by applying the same stationary point analysis as they do, we show that in the absence of a bias term the true embedding is a global minimum under the additional assumption that $\|x_i\|_2^2(2/\sigma^2) = \log(\sum_j C_{ij}/\sqrt{\sum_{ij} C_{ij}})$ (Hinton and Roweis, 2002).
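The bias reparameterization used above for GloVe (and, with the extra constant dimension, for word2vec) is purely algebraic, so it can be checked numerically. The sketch below uses random stand-in arrays; it only verifies that the inner-product and squared-distance forms of the GloVe residual agree once $a_i = \hat a_i - \|\hat x_i\|_2^2$ and $b_j = \hat b_j - \|\hat c_j\|_2^2$.

```python
import numpy as np

# Numerical check of the bias reparameterization.
# All arrays are random stand-ins, not trained vectors.

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))               # word vectors  x_hat
Ctx = rng.normal(size=(n, d))             # context vectors c_hat
a_hat = rng.normal(size=n)
b_hat = rng.normal(size=n)
log_count = rng.normal(size=(n, n))       # stand-in for log C_ij

a = a_hat - (X ** 2).sum(1)               # a_i = a_hat_i - ||x_i||^2
b = b_hat - (Ctx ** 2).sum(1)             # b_j = b_hat_j - ||c_j||^2

residual_inner = 2 * X @ Ctx.T + a[:, None] + b[None, :] - log_count
sq_dist = ((X[:, None, :] - Ctx[None, :, :]) ** 2).sum(-1)
residual_dist = -log_count - sq_dist + a_hat[:, None] + b_hat[None, :]

assert np.allclose(residual_inner, residual_dist)
```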
Centering the matrix Mij before obtaining the SVD would relax the norm assumption, resulting ex- actly in classical MDS (Sibson, 1979). 5.2 Metric regression from log co-occurrences We have shown that by simple reparameterizations and use of Lemma 1, existing embedding algo- rithms can be interpreted as consistent metric recov- ery methods. However, the same Lemma suggests a more direct regression method for recovering the la- tent coordinates, which we propose here. This new embedding algorithm serves as a litmus test for our metric recovery paradigm. Lemma 1 describes a log-linear relationship be- tween distance and co-occurrences. The canonical way to fit this relationship would be to use a general- ized linear model, where the co-occurrences follow a negative binomial distribution Cij ∼ NegBin(θ,p), where p = θ/[θ + exp(−1 2 ||xi −xj||22 + ai + bj)]. Under this overdispersed log linear model, E[Cij] = exp(−12||xi −xj|| 2 2 + ai + bj) Var(Cij) = E[Cij]2/θ + E[Cij] Here, the parameter θ controls the contribution of large Cij, and is akin to GloVe’s f(Cij) weight function. Fitting this model is straightforward if we define the log-likelihood in terms of the expected rate λij = exp(−12||xi −xj|| 2 2 + ai + bj) as: LL(x,a,b,θ) = ∑ i,j θ log(θ) −θ log(λij + θ)+ Cij log ( 1 − θ λij +θ ) + log ( Γ(Cij +θ) Γ(θ)Γ(Cij +1) ) To generate word embeddings, we minimize the negative log-likelihood using stochastic gradient de- scent. The implementation mirrors that of GloVe and randomly selects word pairs i,j and attracts or repulses the vectors x̂ and ĉ in order to achieve the relationship in Lemma 1. Implementation details are provided in Appendix C. Relationship to GloVe The overdispersion pa- rameter θ in our metric regression model sheds light on the role of GloVe’s weight function f(Cij). Tak- ing the Taylor expansion of the log-likelihood at log(λij) ≈− log(Cij), we have LL(x,a,b,θ) = ∑ ij kij− Cijθ2(Cij +θ) (uij) 2+o ( (uij) 3 ) , where uij = (log λij − log Cij) and kij does not depend on x. Note the similarity of the sec- ond order term with the GloVe objective. As Cij grows, the weight functions Cijθ 2(Cij +θ) and f(Cij) = max(Cij,xmax) 3/4 converge to θ/2 and xmax re- spectively, down-weighting large co-occurrences. 6 Empirical validation We will now experimentally validate two aspects of our theory: the semantic space hypothesis (Sections 2 and 3), and the correspondence between word em- bedding and manifold learning (Sections 4 and 5). Our goal with this empirical validation is not to find the absolute best method and evaluation metric for word embeddings, which has been studied before (e.g. Levy et al. (2015)). Instead, we provide empir- ical evidence in favor of the semantic space hypoth- esis, and show that our simple algorithm for metric recovery is competitive with the state-of-the-art on both semantic induction tasks and manifold learn- ing. Since metric regression naturally operates over integer co-occurrences, we use co-occurrences over unweighted windows for this and—for fairness—for the other methods (see Appendix C for details). 6.1 Datasets Corpus and training: We used three different corpora for training: a Wikipedia snapshot of 03/2015 (2.4B tokens), the original word2vec cor- pus (Mikolov et al., 2013a) (6.4B tokens), and a combination of Wikipedia with Gigaword5 emulat- ing GloVe’s corpus (Pennington et al., 2014) (5.8B tokens). We preprocessed all corpora by removing punctuation, numbers and lower-casing all the text. 
6 Empirical validation

We will now experimentally validate two aspects of our theory: the semantic space hypothesis (Sections 2 and 3), and the correspondence between word embedding and manifold learning (Sections 4 and 5). Our goal with this empirical validation is not to find the absolute best method and evaluation metric for word embeddings, which has been studied before (e.g. Levy et al. (2015)). Instead, we provide empirical evidence in favor of the semantic space hypothesis, and show that our simple algorithm for metric recovery is competitive with the state-of-the-art on both semantic induction tasks and manifold learning. Since metric regression naturally operates over integer co-occurrences, we use co-occurrences over unweighted windows for this and—for fairness—for the other methods (see Appendix C for details).

6.1 Datasets

Corpus and training: We used three different corpora for training: a Wikipedia snapshot of 03/2015 (2.4B tokens), the original word2vec corpus (Mikolov et al., 2013a) (6.4B tokens), and a combination of Wikipedia with Gigaword5 emulating GloVe's corpus (Pennington et al., 2014) (5.8B tokens). We preprocessed all corpora by removing punctuation, numbers and lower-casing all the text. The vocabulary was restricted to the 100K most frequent words in each corpus. We trained embeddings using four methods: word2vec, GloVe, randomized SVD,[8] and metric regression (referred to as regression). Full implementation details are provided in the Appendix.

[8] We used randomized, rather than full SVD due to the difficulty of scaling SVD to this problem size. For performance of full SVD factorizations see Levy et al. (2015).

Table 3: Accuracies on Google, SAT analogies and on two new inductive reasoning tasks.

            Google Semantic   Google Syntactic   Google Total    SAT           Classification   Sequence
Method      L2     Cos        L2     Cos         L2     Cos      L2     Cos    L2     Cos        L2     Cos
Regression  75.5   78.4       70.9   70.8        72.6   73.7     39.2   37.8   87.6   84.6       58.3   59.0
GloVe       71.1   76.4       68.6   71.9        69.6   73.7     36.9   35.5   74.6   80.1       53.0   58.9
SVD         50.9   58.1       51.4   52.0        51.2   54.3     32.7   24.0   71.6   74.1       49.4   47.6
word2vec    71.4   73.4       70.9   73.3        71.1   73.3     42.0   42.0   76.4   84.6       54.4   56.2

Table 4: Semantic similarity alone can solve the Google analogy tasks.

           Manifold Learning   Word Embedding
Semantic   83.3                70.7
Syntactic  8.2                 76.9
Total      51.4                73.4

For fairness we fix all hyperparameters, and develop and test the code for metric regression exclusively on the first 1GB subset of the wiki dataset. For open-vocabulary tasks, we restrict the set of answers to the top 30K words, since this improves performance while covering the majority of the questions. In the following, we show performance for the GloVe corpus throughout but include results for all corpora along with our code package.

Evaluation tasks: We test the quality of the word embeddings on three types of inductive tasks: analogies, sequence completion and classification (Figure 1). For the analogies, we used the standard open-vocabulary analogy task of Mikolov et al. (2013a) (henceforth denoted Google), consisting of 19,544 semantic and syntactic questions. In addition, we use the more difficult SAT analogy dataset (version 3) (Turney and Littman, 2005), which contains 374 questions from actual exams and guidebooks. Each question consists of 5 exemplar pairs of words word1:word2, where the same relation holds for all pairs. The task is to pick from among another five pairs of words the one that best fits the category implicitly defined by the exemplars.

Inspired by Sternberg and Gardner (1983), we propose two new difficult inductive reasoning tasks beyond analogies to verify the semantic space hypothesis: sequence completion and classification. As described in Section 2, the former involves choosing the next step in a semantically coherent sequence of words (e.g. hour, minute, ...), and the latter consists of selecting an element within the same category out of five possible choices. Given the lack of publicly available datasets, we generated our own questions using WordNet (Fellbaum, 1998) relations and word-word PMI values. These datasets were constructed before training the embeddings, so as to avoid biasing them towards any one method.

For the classification task, we created in-category words by selecting words from WordNet relations associated to root words, from which we pruned to four words based on PMI-similarity to the other words in the class. Additional options for the multiple choice questions were created by searching over words related to the root by a different relation type, and selecting those most similar to the root. For the sequence completion task, we obtained WordNet trees of various relation types, and pruned these based on similarity to the root word to obtain the sequence.
For the multiple-choice questions, we proceeded as before to select additional (incorrect) options of a different relation type to the root. After pruning, we obtain 215 classification questions and 220 sequence completion questions, of which 51 are open-vocabulary and 169 are multiple choice. These two new datasets are available.[9]

[9] http://web.mit.edu/thashim/www/supplement materials.zip

6.2 Results on inductive reasoning tasks

Solving analogies using survey data alone: We demonstrate that, surprisingly, word embeddings trained directly on semantic similarity derived from survey data can solve analogy tasks. Extending a study by Rumelhart and Abrahamson (1973), we use a free-association dataset (Nelson et al., 2004) to construct a similarity graph, where vertices correspond to words and the weights $w_{ij}$ are given by the number of times word $j$ was considered most similar to word $i$ in the survey. We take the largest connected component of this graph (consisting of 4845 words and 61570 weights) and embed it using Isomap, for which squared edge distances are defined as $-\log(w_{ij}/\max_{kl}(w_{kl}))$. We use the resulting vectors as word embeddings to solve the Google analogy task. The results in Table 4 show that embeddings obtained with Isomap on survey data can outperform the corpus based metric regression vectors on semantic, but not syntactic tasks. We hypothesize that free-association surveys capture semantic, but not syntactic similarity between words.

Analogies: The results on the Google analogies shown in Table 3 demonstrate that our proposed framework of metric regression and L2 distance is competitive with the baseline of word2vec with cosine distance. The performance gap across methods is small and fluctuates across corpora, but metric regression consistently outperforms GloVe on most tasks and outperforms all methods on semantic analogies, while word2vec does better on syntactic categories. For the SAT dataset, the L2 distance performs better than the cosine similarity, and we find word2vec to perform best, followed by metric regression. The results on these two analogy datasets show that directly embedding the log co-occurrence metric and taking L2 distances between vectors is competitive with current approaches to analogical reasoning.

Sequence and classification tasks: As predicted by the semantic field hypothesis, word embeddings perform well on the two novel inductive reasoning tasks (Table 3). Again, we observe that metric recovery with metric regression coupled with L2 distance consistently performs as well as, and often better than, the current state-of-the-art word embedding methods on these two additional semantic datasets.

6.3 Word embeddings can embed manifolds

In Section 4 we proposed a reduction for solving manifold learning problems with word embeddings, which we show achieves performance comparable to manifold learning methods. We now test this relation by performing nonlinear dimensionality reduction on the MNIST digit dataset, reducing from D = 256 to two dimensions. Using a four-thousand image subset, we construct a k-nearest neighbor graph (k = 20) and generate 10 simple random walks of length 200 starting from each vertex in the graph, resulting in 40,000 sentences of length 200 each. We compare the four word embedding methods against standard dimensionality reduction methods: PCA, Isomap, SNE, and t-SNE. We evaluate the methods by clustering the resulting low-dimensional data and computing cluster purity, measured using the percentage of 5-nearest neighbors having the same digit label.

[Figure 3: Dimensionality reduction using word embedding and manifold learning. Performance is quantified by percentage of 5-nearest neighbors sharing the same digit label.]
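The cluster-purity number used here and in Figure 3 is simple to compute. A minimal sketch, assuming `Y` holds the 2-d embeddings and `labels` the digit classes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# For each embedded point, the fraction of its 5 nearest neighbors that
# share its digit label, averaged over the dataset.

def knn_purity(Y, labels, k=5):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(Y)
    _, idx = nbrs.kneighbors(Y)
    neigh_labels = labels[idx[:, 1:]]            # drop the point itself
    return (neigh_labels == labels[:, None]).mean()
```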
The resulting embeddings, shown in Figure 3, demonstrate that metric regression is highly effective at this task, outperforming metric SNE and beaten only by t-SNE (91% cluster purity), which is a visualization method specifically designed to preserve cluster separation. All word embedding methods including SVD (68%) embed the MNIST digits remarkably well and outperform baselines of PCA (48%) and Isomap (49%).

7 Discussion

Our work recasts word embedding as a metric recovery problem pertaining to the underlying semantic space. We use co-occurrence counts from random walks as a theoretical tool to demonstrate that existing word embedding algorithms are consistent metric recovery methods. Our direct regression method is competitive with the state of the art on various semantics tasks, including two new inductive reasoning problems of series completion and classification. Our framework highlights the strong interplay and common foundation between word embedding methods and manifold learning, suggesting several avenues for recovering vector representations of phrases and sentences via properly defined Markov processes and their generalizations.

A Metric recovery from Markov processes on graphs and manifolds

Consider an infinite sequence of points $\mathcal{X}_n = \{x_1, \ldots, x_n\}$, where $x_i$ are sampled i.i.d. from a density $p(x)$ over a compact Riemannian manifold equipped with a geodesic metric $\rho$. For our purposes, $p(x)$ should have a bounded log-gradient and a strict lower bound $p_0$ over the manifold. The random walks we consider are over unweighted spatial graphs defined as follows.

Definition 2 (Spatial graph). Let $\sigma_n : \mathcal{X}_n \to \mathbb{R}_{>0}$ be a local scale function and $h : \mathbb{R}_{\ge 0} \to [0,1]$ a piecewise continuous function with sub-Gaussian tails. A spatial graph $G_n$ corresponding to $\sigma_n$ and $h$ is a random graph with vertex set $\mathcal{X}_n$ and a directed edge from $x_i$ to $x_j$ with probability $p_{ij} = h(\rho(x_i,x_j)^2/\sigma_n(x_i)^2)$.

Simple examples of spatial graphs where the connectivity is not random include the ε-ball graph ($\sigma_n(x) = \epsilon$) and the k-nearest neighbor graph ($\sigma_n(x) =$ distance to the k-th neighbor).

Log co-occurrences and the geodesic will be connected in two steps: (1) we use known results to show that a simple random walk over the spatial graph, properly scaled, behaves similarly to a diffusion process; (2) the log-transition probability of a diffusion process will be related to the geodesic metric on a manifold.

(1) The limiting random walk on a graph: Just as the simple random walk over the integers converges to a Brownian motion, we may expect that under specific constraints the simple random walk $X^n_t$ over the graph $G_n$ will converge to some well-defined continuous process. We require that the scale functions converge to a continuous function $\bar\sigma$ ($\sigma_n(x) g_n^{-1} \xrightarrow{a.s.} \bar\sigma(x)$), and that the size of a single step vanish ($g_n \to 0$) but contain at least a polynomial number of points within $\sigma_n(x)$ ($g_n n^{\frac{1}{d+2}} \log(n)^{-\frac{1}{d+2}} \to \infty$). Under this limit, and under our assumptions about the density $p(x)$ and the regularity of the transitions,[10] the
following holds:

Theorem 3 ((Hashimoto et al., 2015b; Ting et al., 2011)). The simple random walk $X^n_t$ on $G_n$ converges in Skorokhod space $D([0,\infty), \mathcal{D})$ after a time scaling $\hat t = t g_n^2$ to the Itô process $Y_{\hat t}$ valued in $C([0,\infty), \mathcal{D})$ as $X^n_{\hat t g_n^{-2}} \to Y_{\hat t}$. The process $Y_{\hat t}$ is defined over the normal coordinates of the manifold $(\mathcal{D}, g)$ with reflecting boundary conditions on $\mathcal{D}$ as

$$dY_{\hat t} = \nabla \log(p(Y_{\hat t}))\,\sigma(Y_{\hat t})^2\, d\hat t + \sigma(Y_{\hat t})\, dW_{\hat t} \qquad (2)$$

[10] For $t = \Theta(g_n^{-2})$, the marginal distribution $nP(X_t|X_0)$ must be a.s. uniformly equicontinuous. For undirected spatial graphs, this is always true (Croydon and Hambly, 2008), but for directed graphs it is an open conjecture from Hashimoto et al. (2015b).

The equicontinuity constraint on the marginal densities of the random walk implies that the transition density for the random walk converges to its continuum limit.

Lemma 4 (Convergence of marginal densities; Hashimoto et al., 2015a). Let $x_0$ be some point in our domain $\mathcal{X}_n$ and define the marginal densities $\hat q_t(x) = P(Y_t = x \mid Y_0 = x_0)$ and $q_{t_n}(x) = P(X^n_t = x \mid X^n_0 = x_0)$. If $t_n g_n^2 = \hat t = \Theta(1)$, then under condition (?) and the results of Theorem 3 such that $X^n_t \to Y^n_t$ weakly, we have

$$\lim_{n\to\infty} n\, q_{t_n}(x) = \hat q_{\hat t}(x)\, p(x)^{-1}.$$

(2) Log transition probability as a metric: We may now use the stochastic process $Y_{\hat t}$ to connect the log transition probability to the geodesic distance using Varadhan's large deviation formula.

Theorem 5 ((Varadhan, 1967; Molchanov, 1975)). Let $Y_t$ be an Itô process defined over a complete Riemannian manifold $(\mathcal{D}, g)$ with geodesic distance $\rho(x_i, x_j)$; then

$$-t\log(P(Y_t = x_j \mid Y_0 = x_i)) \to \rho(x_i, x_j)^2 \quad \text{as } t \to 0.$$

This estimate holds more generally for any space admitting a diffusive stochastic process (Saloff-Coste, 2010). Taken together, we finally obtain:

Corollary 6 (Varadhan's formula on graphs). For any $\delta, \gamma, n_0$ there exist some $\hat t$, $n > n_0$, and sequence $b^n_j$ such that the following holds for the simple random walk $X^n_t$:

$$P\Big(\sup_{x_i,x_j \in \mathcal{X}_{n_0}} \big| -\hat t\log(P(X^n_{\hat t g_n^{-2}} = x_j \mid X^n_0 = x_i)) - \hat t\, b^n_j - \rho_{\sigma(x)}(x_i,x_j)^2 \big| > \delta\Big) < \gamma$$

where $\rho_{\sigma(x)}$ is the geodesic defined as

$$\rho_{\sigma(x)}(x_i,x_j) = \min_{f\in C^1:\, f(0)=x_i,\, f(1)=x_j} \int_0^1 \sigma(f(t))\, dt$$

Proof. The proof is in two parts. First, by Varadhan's formula (Theorem 5; Molchanov, 1975, Eq. 1.7), for any $\delta_1 > 0$ there exists some $\hat t$ such that:

$$\sup_{y,y' \in \mathcal{D}} \big|-\hat t\log(P(Y_{\hat t} = y' \mid Y_0 = y)) - \rho_{\sigma(x)}(y', y)^2\big| < \delta_1$$

The uniform equicontinuity of the marginals implies their uniform convergence (Lemma S4), so for any $\delta_2 > 0$ and $\gamma_0$, there exists an $n$ such that

$$P\Big(\sup_{x_j,x_i \in \mathcal{X}_{n_0}} \big|P(Y_{\hat t} = x_j \mid Y_0 = x_i) - n\, p(x_j)\, P(X^n_{g_n^{-2}\hat t} = x_j \mid X^n_0 = x_i)\big| > \delta_2\Big) < \gamma_0$$

By the lower bound on $p$ and compactness of $\mathcal{D}$, $P(Y_{\hat t}\mid Y_0)$ is lower bounded by some strictly positive constant $c$, and we can apply uniform continuity of $\log(x)$ over $(c, \infty)$ to get that for some $\delta_3$ and $\gamma$,

$$P\Big(\sup_{x_j,x_i \in \mathcal{X}_{n_0}} \big|\log(P(Y_{\hat t} = x_j \mid Y_0 = x_i)) - \log(n\, p(x_j)) - \log(P(X^n_{g_n^{-2}\hat t} = x_j \mid X^n_0 = x_i))\big| > \delta_3\Big) < \gamma. \qquad (3)$$

Finally we have the bound

$$P\Big(\sup_{x_i,x_j \in \mathcal{X}_{n_0}} \big|-\hat t\log(P(X^n_{g_n^{-2}\hat t} = x_j \mid X^n_0 = x_i)) - \hat t\log(n\, p(x_j)) - \rho_{\sigma(x)}(x_i,x_j)^2\big| > \delta_1 + \hat t\delta_3\Big) < \gamma$$

To combine the bounds, given some $\delta$ and $\gamma$, set $b^n_j = \log(n\, p(x_j))$, pick $\hat t$ such that $\delta_1 < \delta/2$, then pick $n$ such that the bound in Eq. 3 holds with probability $\gamma$ and error $\delta_3 < \delta/(2\hat t)$.
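Varadhan's formula can be sanity-checked numerically in the simplest possible setting: a standard Brownian motion on $\mathbb{R}^d$, where the transition density is the Gaussian heat kernel and the geodesic distance is Euclidean. With that normalization the limit is recovered as $-2t\log p_t(x,y) \to \|x-y\|_2^2$; the exact constant in front depends on how the diffusion in Equation 2 is scaled, which Theorem 5 and Corollary 6 absorb into $\rho_{\sigma(x)}$.

```python
import numpy as np

# Illustration of Varadhan's formula for standard Brownian motion in R^d,
# whose transition density is the Gaussian heat kernel. As t -> 0,
# -2t * log p_t(x, y) approaches ||x - y||^2 (here 9.0).

def log_heat_kernel(x, y, t):
    d = x.shape[0]
    return -0.5 * d * np.log(2 * np.pi * t) - np.sum((x - y) ** 2) / (2 * t)

x, y = np.zeros(3), np.array([1.0, 2.0, 2.0])     # squared distance = 9
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(t, -2 * t * log_heat_kernel(x, y, t))   # tends to 9.0
```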
B Consistency proofs for word embedding

Lemma 7 (Consistency of SVD). Assume the norm of the latent embedding is proportional to the unigram frequency, $\|x_i\|_2^2/\sigma^2 = C_i/(\sum_j C_j)^{\frac12}$. Under these conditions, let $\hat X$ be the embedding derived from the SVD of $M_{ij}$ as

$$2\hat X\hat X^{T} = M_{ij} = \log(C_{ij}) - \log(C_i) - \log(C_j) + \log\Big(\sum_i C_i\Big) + \tau.$$

Then there exists a $\tau$ such that this embedding is close to the true embedding, under the same equivalence class as Lemma S7:

$$P\Big(\sum_i \|A\hat x_i/\sigma^2 - x_i\|_2^2 > \delta\Big) < \epsilon.$$

Proof. By Corollary 6, for any $\delta_1 > 0$ and $\epsilon_1 > 0$ there exists an $m$ such that

$$P\Big(\sup_{i,j} \big|-\log(C_{ij}) - \|x_i - x_j\|_2^2/\sigma^2 - \log(mc)\big| > \delta_1\Big) < \epsilon_1.$$

Now additionally, if $C_i/\sqrt{\sum_j C_j} = \|x_i\|_2^2/\sigma^2$, then we can rewrite the above bound as

$$P\Big(\sup_{i,j} \big|\log(C_{ij}) - \log(C_i) - \log(C_j) + \log\big(\sum_i C_i\big) - 2\langle x_i, x_j\rangle/\sigma^2 - \log(mc)\big| > \delta_1\Big) < \epsilon_1,$$

and therefore

$$P\Big(\sup_{i,j} \big|M_{ij} - 2\langle x_i, x_j\rangle/\sigma^2 - \log(mc)\big| > \delta_1\Big) < \epsilon_1.$$

Given that the dot product matrix has error at most $\delta_1$, the resulting embedding is known to have at most $\sqrt{\delta_1}$ error (Sibson, 1979). This completes the proof, since we can pick $\tau = -\log(mc)$, $\delta_1 = \delta^2$ and $\epsilon_1 = \epsilon$.

Theorem 8 (Consistency of softmax/word2vec). Define the softmax objective function with bias as

$$g(\hat x, \hat c, \hat b) = \sum_{ij} C_{ij} \log \frac{\exp(-\|\hat x_i - \hat c_j\|_2^2 + \hat b_j)}{\sum_{k=1}^n \exp(-\|\hat x_i - \hat c_k\|_2^2 + \hat b_k)}$$

Define $x^m, c^m, b^m$ as the global minima of the above objective function for a co-occurrence $C_{ij}$ over a corpus of size $m$. For any $\epsilon > 0$ and $\delta > 0$ there exists some $m$ such that

$$P\big(|g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - g(x^m, c^m, b^m)| > \delta\big) < \epsilon$$

Proof. By differentiation, any objective of the form

$$\min_{\lambda} \sum_{ij} C_{ij} \log\Big(\frac{\exp(-\lambda_{ij})}{\sum_k \exp(-\lambda_{ik})}\Big)$$

has the minima $\lambda^*_{ij} = -\log(C_{ij}) + a_i$, up to an unidentifiable $a_i$, with objective function value $\sum_{ij} C_{ij}\log(C_{ij}/\sum_k C_{ik})$. This gives a global function lower bound

$$g(x^m, c^m, b^m) \ge \sum_{ij} C_{ij}\log\Big(\frac{C_{ij}}{\sum_k C_{ik}}\Big) \qquad (4)$$

Now consider the function value of the true embedding $\frac{x}{\sigma}$:

$$g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) = \sum_{ij} C_{ij}\log\frac{\exp(-\frac{1}{\sigma^2}\|x_i - x_j\|_2^2)}{\sum_k \exp(-\frac{1}{\sigma^2}\|x_i - x_k\|_2^2)} = \sum_{ij} C_{ij}\log\Big(\frac{\exp(\log(C_{ij}) + \delta_{ij} + a_i)}{\sum_k \exp(\log(C_{ik}) + \delta_{ik} + a_i)}\Big).$$

We can bound the error variables $\delta_{ij}$ using Corollary 6 as $\sup_{ij}|\delta_{ij}| < \delta_0$ with probability $\epsilon_0$ for sufficiently large $m$, with $a_i = \log(m_i) - \log(\sum_{k=1}^n \exp(-\|x_i - x_k\|_2^2/\sigma^2))$. Taking the Taylor expansion at $\delta_{ij} = 0$, we have

$$g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) = \sum_{ij} C_{ij}\log\frac{C_{ij}}{\sum_k C_{ik}} + \sum_{l=1}^n \frac{C_{il}}{\sum_k C_{ik}}\delta_{il} + o(\|\delta\|_2^2)$$

By the law of large numbers of $C_{ij}$,

$$P\Big(\Big|g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - \sum_{ij} C_{ij}\log\Big(\frac{C_{ij}}{\sum_k C_{ik}}\Big)\Big| > n\delta_0\Big) < \epsilon_0,$$

which combined with (4) yields

$$P\big(|g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - g(x, c, b)| > n\delta_0\big) < \epsilon_0.$$

To obtain the original theorem statement, take $m$ to fulfil $\delta_0 = \delta/n$ and $\epsilon_0 = \epsilon$.

Note that for word2vec with negative sampling, applying the stationary point analysis of Levy and Goldberg (2014b) combined with the analysis in Lemma S7 shows that the true embedding is a global minimum.

C Empirical evaluation details

C.1 Implementation details

We used off-the-shelf implementations of word2vec[11] and GloVe[12]. The two other methods, (randomized) SVD and regression embedding, are both implemented on top of the GloVe codebase. We used 300-dimensional vectors and window size 5 in all models. Further details are provided below.

[11] http://code.google.com/p/word2vec
[12] http://nlp.stanford.edu/projects/glove

word2vec. We used the skip-gram version with 5 negative samples, 10 iterations, α = 0.025 and frequent word sub-sampling with a parameter of 10^-3.
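For reference, the word2vec settings above can be reproduced with an off-the-shelf library. The sketch below uses gensim (version 4 or later parameter names); this is an assumption on our part, since the experiments here used the original C implementation.

```python
from gensim.models import Word2Vec

# Placeholder corpus: any iterable of token lists works here, e.g. the
# random-walk sentences from Sections 4.2 and 6.3.
sentences = [["penny", "nickel", "dime", "quarter"]] * 100

# Skip-gram, 5 negative samples, 10 iterations, alpha = 0.025,
# sub-sampling 1e-3, 300 dimensions, window 5 (the settings above).
model = Word2Vec(
    sentences=sentences,
    vector_size=300, window=5, sg=1,
    negative=5, sample=1e-3, alpha=0.025,
    epochs=10, min_count=1,
)
vectors = model.wv            # KeyedVectors, e.g. vectors["dime"]
```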
GloVe. We disabled GloVe's corpus weighting, since this generally produced superior results. The default step-sizes result in NaN-valued embeddings, so we reduced them. We used XMAX = 100, η = 0.01 and 10 iterations.

SVD. For the SVD algorithm of Levy and Goldberg (2014b), we use the GloVe co-occurrence counter combined with a parallel randomized projection SVD factorizer, based upon the redsvd library, due to memory and runtime constraints.[13] Following Levy et al. (2015), we used the square root factorization, no negative shifts (τ = 0 in our notation), and 50,000 random projections.

[13] https://github.com/ntessore/redsvd-h

Regression Embedding. We use standard SGD with two differences. First, we drop co-occurrence values with probability proportional to 1 − Cij/10 when Cij < 10, and scale the gradient, which resulted in training time speedups with no loss in accuracy. Second, we use an initial line search step combined with a linear step size decay by epoch. We use θ = 50, and η is line-searched starting at η = 10.

C.2 Solving inductive reasoning tasks

The ideal point for a task is defined below:

• Analogies: Given A:B::C, the ideal point is given by B − A + C (parallelogram rule).

• Analogies (SAT): Given prototype A:B and candidates C1:D1, ..., Cn:Dn, we compare Di − Ci to the ideal point B − A.

• Categories: Given a category implied by w1, ..., wn, the ideal point is I = (1/n) Σ_{i=1}^{n} w_i.

• Sequence: Given sequence w1 : ... : wn, we compute the ideal as I = w_n + (1/n)(w_n − w_1).

Once we have the ideal point I, we pick the answer as the word closest to I among the options, using L2 or cosine distance. For the latter, we normalize I to unit norm before taking the cosine distance. For L2 we do not apply any normalization.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2015. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. arXiv preprint arXiv:1502.03520.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

David A Croydon and Ben M Hambly. 2008. Local limit theorems for sequences of simple random walks on graphs. Potential Analysis, 29(4):351–389.

Christiane Fellbaum, editor. 1998. WordNet: an electronic lexical database.

Thomas L Griffiths, Mark Steyvers, and Joshua B Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211.

Zellig S Harris. 1954. Distributional structure. Word, 10(23):146–162.

Tatsunori Hashimoto, Yi Sun, and Tommi Jaakkola. 2015a. From random walks to distances on unweighted graphs. In Advances in Neural Information Processing Systems.

Tatsunori Hashimoto, Yi Sun, and Tommi Jaakkola. 2015b. Metric recovery from directed unweighted graphs. In Artificial Intelligence and Statistics, pages 342–350.

Geoffrey E Hinton and Sam T Roweis. 2002. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 833–840.

Matthäus Kleindessner and Ulrike von Luxburg. 2015. Dimensionality estimation without distances. In Proceedings of the Artificial Intelligence and Statistics Conference (AISTATS).

Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the 18th Conference on Computational Natural Language Learning, pages 171–180.

Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b.
Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

SA Molchanov. 1975. Diffusion processes and Riemannian geometry. Russian Mathematical Surveys, 30(1):1.

Douglas L Nelson, Cathy L McEvoy, and Thomas A Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–407.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, pages 1532–43.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM.

David E Rumelhart and Adele A Abrahamson. 1973. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28.

Laurent Saloff-Coste. 2010. The heat kernel and its estimates. Probabilistic Approach to Geometry, 57:405–436.

Gideon Schwarz and Amos Tversky. 1980. On the reciprocity of proximity relations. Journal of Mathematical Psychology, 22(3):157–175.

Robin Sibson. 1979. Studies in the robustness of multidimensional scaling: Perturbational analysis of classical scaling. Journal of the Royal Statistical Society, Series B (Methodological), pages 217–229.

Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Robert J Sternberg and Michael K Gardner. 1983. Unities in inductive reasoning. Journal of Experimental Psychology: General, 112(1):80.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Joshua B Tenenbaum. 1998. Mapping a manifold of perceptual observations. In Advances in Neural Information Processing Systems, pages 682–688.

Daniel Ting, Ling Huang, and Michael Jordan. 2011. An analysis of the convergence of graph Laplacians. arXiv preprint arXiv:1101.5435.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. ACL.

Peter D Turney and Michael L Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.

Amos Tversky and J Hutchinson. 1986. Nearest neighbor analysis of psychological spaces. Psychological Review, 93(1):3.

Srinivasa RS Varadhan. 1967. Diffusion processes in a small time interval. Communications on Pure and Applied Mathematics, 20(4):659–685.