Word Embeddings as Metric Recovery in Semantic Spaces

Tatsunori B. Hashimoto, David Alvarez-Melis and Tommi S. Jaakkola
CSAIL, Massachusetts Institute of Technology
{thashim, davidam, tommi}@csail.mit.edu

Abstract

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with a Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.

1 Introduction

Continuous space models of words, objects, and signals have become ubiquitous tools for learning rich representations of data, from natural language processing to computer vision. Specifically, there has been particular interest in word embeddings, largely due to their intriguing semantic properties (Mikolov et al., 2013b) and their success as features for downstream natural language processing tasks, such as named entity recognition (Turian et al., 2010) and parsing (Socher et al., 2013).

The empirical success of word embeddings has prompted a parallel body of work that seeks to better understand their properties, associated estimation algorithms, and explore possible revisions. Recently, Levy and Goldberg (2014a) showed that linear linguistic regularities first observed with word2vec extend to other embedding methods. In particular, explicit representations of words in terms of co-occurrence counts can be used to solve analogies in the same way. In terms of algorithms, Levy and Goldberg (2014b) demonstrated that the global minimum of the skip-gram method with negative sampling of Mikolov et al. (2013b) implicitly factorizes a shifted version of the pointwise mutual information (PMI) matrix of word-context pairs. Arora et al. (2015) explored links between random walks and word embeddings, relating them to contextual (probability ratio) analogies, under specific (isotropic) assumptions about word vectors.

In this work, we take semantic spaces studied in the cognitive-psychometric literature as the prototypical objects that word embedding algorithms estimate. Semantic spaces are vector spaces over concepts where Euclidean distances between points are assumed to indicate semantic similarities. We link such semantic spaces to word co-occurrences through semantic similarity assessments, and demonstrate that the observed co-occurrence counts indeed possess statistical properties that are consistent with an underlying Euclidean space where distances are linked to semantic similarity.
[Figure 1: Inductive reasoning in semantic space proposed in Sternberg and Gardner (1983). A, B, C are given, I is the ideal point and D are the choices. The correct answer is shaded green.]

Formally, we view word embedding methods as performing metric recovery. This perspective is significantly different from current approaches. Instead of aiming for representations that exhibit specific semantic properties or that perform well at a particular task, we seek methods that recover the underlying metric of the hypothesized semantic space. The clearer foundation afforded by this perspective enables us to analyze word embedding algorithms in a principled task-independent fashion. In particular, we ask whether word embedding algorithms are able to recover the metric under specific scenarios. To this end, we unify existing word embedding algorithms as statistically consistent metric recovery methods under the theoretical assumption that co-occurrences arise from (metric) random walks over semantic spaces. The new setting also suggests a simple and direct recovery algorithm which we evaluate and compare against other embedding methods.

The main contributions of this work can be summarized as follows:

• We ground word embeddings in semantic spaces via log co-occurrence counts. We show that PMI (pointwise mutual information) relates linearly to human similarity assessments, and that nearest-neighbor statistics (centrality and reciprocity) are consistent with a Euclidean space hypothesis (Sections 2 and 3).

• In contrast to prior work (Arora et al., 2015), we take metric recovery as the key object of study, unifying existing algorithms as consistent metric recovery methods based on co-occurrence counts from simple Markov random walks over graphs and manifolds. This strong link to manifold estimation opens a promising direction for extensions of word embedding methods (Sections 4 and 5).

• We propose and evaluate a new principled direct metric recovery algorithm that performs comparably to the existing state of the art on both word embedding and manifold learning tasks, and show that GloVe (Pennington et al., 2014) is closely related to the second-order Taylor expansion of our objective.

• We construct and make available two new inductive reasoning datasets[1]—series completion and classification—to extend the evaluation of word representations beyond analogies, and demonstrate that these tasks can be solved with vector operations on word embeddings as well (examples in Table 1).

[1] http://web.mit.edu/thashim/www/supplement materials.zip

2 Word vectors and semantic spaces

Most current word embedding algorithms build on the distributional hypothesis (Harris, 1954) where similar contexts imply similar meanings so as to tie co-occurrences of words to their underlying meanings. The relationship between semantics and co-occurrences has also been studied in psychometrics and cognitive science (Rumelhart and Abrahamson, 1973; Sternberg and Gardner, 1983), often by means of free word association tasks and semantic spaces.
The semantic spaces, in particular, provide a natural conceptual framework for continuous representations of words as vector spaces where semantically related words are close to each other. For example, the observation that word embeddings can solve analogies was already shown by Rumelhart and Abrahamson (1973) using vector representations of words derived from surveys of pairwise word similarity judgments.

Table 1: Examples of the three inductive reasoning tasks proposed by Sternberg and Gardner (1983).

Task            Prompt                         Answer
Analogy         king:man::queen:?              woman
Series          penny:nickel:dime:?            quarter
Classification  horse:zebra:{deer, dog, fish}  deer

A fundamental question regarding vector space models of words is whether a Euclidean vector space is a valid representation of semantic concepts. There is substantial empirical evidence in favor of this hypothesis. For example, Rumelhart and Abrahamson (1973) showed experimentally that analogical problem solving with fictitious words and human mistake rates were consistent with a Euclidean space. Sternberg and Gardner (1983) provided further evidence supporting this hypothesis, proposing that general inductive reasoning was based upon operations in metric embeddings. Using the analogy, series completion and classification tasks shown in Table 1 as testbeds, they proposed that subjects solve these problems by finding the word closest (in semantic space) to an ideal point: the vertex of a parallelogram for analogies, a displacement from the last word in series completion, and the centroid in the case of classification (Figure 1; a minimal sketch of these rules appears at the end of this section).

We use semantic spaces as the prototypical structures that word embedding methods attempt to uncover, and we investigate the suitability of word co-occurrence counts for doing so. In the next section, we show that co-occurrences from large corpora indeed relate to semantic similarity assessments, and that the resulting metric is consistent with a Euclidean semantic space hypothesis.
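To make the ideal-point picture concrete, the following is a minimal sketch of the three rules from Figure 1 and Table 1 (the exact definitions appear in Appendix C.2). It assumes word vectors are available in a dictionary `vec` mapping words to NumPy arrays; `vec` and the helper names are illustrative stand-ins, not part of any released code.

```python
import numpy as np

# Minimal sketch of the ideal-point rules, assuming `vec` maps words to
# d-dimensional numpy arrays (any trained embedding would do).

def analogy_ideal(vec, a, b, c):
    # A:B::C:?  ->  vertex of the parallelogram, B - A + C
    return vec[b] - vec[a] + vec[c]

def series_ideal(vec, words):
    # w_1 : ... : w_n  ->  displacement past the last word (Appendix C.2)
    w = [vec[x] for x in words]
    return w[-1] + (w[-1] - w[0]) / len(w)

def classification_ideal(vec, words):
    # category members  ->  centroid
    return np.mean([vec[x] for x in words], axis=0)

def closest(vec, ideal, candidates):
    # answer = candidate nearest to the ideal point (L2 distance here)
    return min(candidates, key=lambda w: np.linalg.norm(vec[w] - ideal))
```

For instance, `closest(vec, analogy_ideal(vec, "king", "man", "queen"), options)` implements the parallelogram rule for the first row of Table 1.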
3 The semantic space of log co-occurrences

Most word embedding algorithms are based on word co-occurrence counts. In order for such methods to uncover an underlying Euclidean semantic space, we must demonstrate that co-occurrences themselves are indeed consistent with some semantic space. We must relate co-occurrences to semantic similarity assessments, on one hand, and show that they can be embedded into a Euclidean metric space, on the other. We provide here empirical evidence for both of these premises.

We commence by demonstrating in Figure 2 that the pointwise mutual information (Church and Hanks, 1990) evaluated from co-occurrence counts has a strong linear relationship with semantic similarity judgments from survey data (Pearson's r=0.75).[2] However, this suggestive linear relationship does not by itself demonstrate that log co-occurrences (with normalization) can be used to define a Euclidean metric space.

[Figure 2: Normalized log co-occurrence (PMI) linearly correlates with human semantic similarity judgments (MEN survey).]

[2] Normalizing the log co-occurrence with the unigram frequency taken to the 3/4th power maximizes the linear correlation in Figure 2, explaining this choice of normalization in prior work (Levy and Goldberg, 2014a; Mikolov et al., 2013b).

Earlier psychometric studies have asked whether human semantic similarity evaluations are consistent with a Euclidean space. For example, Tversky and Hutchinson (1986) investigate whether concept representations are consistent with the geometric sampling (GS) model: a generative model in which points are drawn independently from a continuous distribution in a Euclidean space. They use two nearest neighbor statistics to test agreement with this model, and conclude that certain hierarchical vocabularies are not consistent with a Euclidean embedding. Similar results are observed by Griffiths et al. (2007). We extend this embeddability analysis to lexical co-occurrences and show that semantic similarity estimates derived from these are mostly consistent with a Euclidean space hypothesis.

The first test statistic for the GS model, the centrality C, is defined as

$$C = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{n} N_{ij}\Big)^2$$

where $N_{ij} = 1$ iff $i$ is $j$'s nearest neighbor. Under the GS model (i.e. when the words are consistent with a Euclidean space representation), $C \le 2$ with high probability as the number of words $n \to \infty$, regardless of the dimension or the underlying density (Tversky and Hutchinson, 1986). For metrically embeddable data, typical non-asymptotic values of C range between 1 and 3, while non-embeddable hierarchical structures have C > 10.

The second statistic, the reciprocity fraction Rf (Schwarz and Tversky, 1980; Tversky and Hutchinson, 1986), is defined as

$$R_f = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} N_{ij}N_{ji}$$

and measures the fraction of words that are their nearest neighbor's nearest neighbor. Under the GS model, this fraction should be greater than 0.5.[3]

[3] Both Rf and C are asymptotically dimension independent because they rely only on the single nearest neighbor. Estimating the latent dimensionality requires other measures and assumptions (Kleindessner and von Luxburg, 2015).

Table 2: Semantic similarity data derived from multiple sources show evidence of embeddability.

Corpus            C     Rf
Free association  1.51  0.48
Wikipedia corpus  2.21  0.63
Word2vec corpus   2.24  0.73
GloVe corpus      2.66  0.62

Table 2 shows the two statistics computed on three popular large corpora and a free word association dataset (see Section 6 for details). The nearest neighbor calculations are based on PMI. The results show a surprisingly high agreement across all three corpora, with C and Rf contained in small intervals: C ∈ [2.21, 2.66] and Rf ∈ [0.62, 0.73]. These results are consistent with Euclidean semantic spaces and the GS model in particular. The largest violators of C and Rf are consistent with Tversky's analysis: the word with the largest centrality in the non-stopword Wikipedia corpus is 'the', whose inclusion would increase C to 3.46 compared to 2.21 without it. Tversky's original analysis of semantic similarities argued that certain words, such as superordinate and function words, could not be embedded. Despite such specific exceptions, we find that for an appropriately normalized corpus, the majority of words are consistent with the GS model, and therefore can be represented meaningfully as vectors in Euclidean space.

The results of this section are an important step towards justifying the use of word co-occurrence counts as the central object of interest for semantic vector representations of words. We have shown that they are empirically related to a human notion of semantic similarity and that they are metrically embeddable, a desirable condition if we expect word vectors derived from them to truly behave as elements of a metric space. This, however, does not yet fully justify their use to derive semantic representations. The missing piece is to formalize the connection between these co-occurrence counts and some intrinsic notion of semantics, such as the semantic spaces described in Section 2. In the next two sections, we establish this connection by framing word embedding algorithms that operate on co-occurrences as metric recovery methods.
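For reference, the two embeddability statistics used above take only a few lines to compute. The sketch below assumes a dense (n, n) PMI matrix `sim`; on a 100K-word vocabulary one would instead restrict to a subset or use sparse nearest-neighbor search.

```python
import numpy as np

# Centrality C and reciprocity fraction Rf of Tversky and Hutchinson,
# computed from a similarity matrix `sim` (e.g. PMI). N_ij = 1 iff i is
# j's nearest neighbor, so sum_j N_ij counts how often i is someone's NN.

def gs_statistics(sim):
    n = sim.shape[0]
    s = sim.astype(float).copy()
    np.fill_diagonal(s, -np.inf)            # exclude self-neighbors
    nn = s.argmax(axis=1)                   # nearest neighbor of each word
    counts = np.bincount(nn, minlength=n)   # how often each word is an NN
    centrality = np.mean(counts.astype(float) ** 2)        # C
    reciprocity = np.mean(nn[nn] == np.arange(n))           # Rf
    return centrality, reciprocity
```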
4 Semantic spaces and manifolds

We take a broader, unified view on metric recovery of semantic spaces since the notion of semantic spaces and the associated parallelogram rule for analogical reasoning extend naturally to objects other than words. For example, images can be approximately viewed as points in a Euclidean semantic space by representing them in terms of their underlying degrees of freedom (e.g. orientation, illumination). Thus, questions about the underlying semantic spaces and how they can be recovered should be related.

The problem of recovering an intrinsic Euclidean coordinate system over objects has been specifically addressed in manifold learning. For example, methods such as Isomap (Tenenbaum et al., 2000) reconstitute a Euclidean space over objects (when possible) based on local comparisons. Intuitively, these methods assume that naive distance metrics such as the L2 distance over pixels in an image may be meaningful only when images are very similar. Longer distances between objects are evaluated through a series of local comparisons. These longer distances—geodesic distances over the manifold—can be approximated by shortest paths on a neighborhood graph. If we view the geodesic distances on the manifold (represented as a graph) as semantic distances, then the goal is to isometrically embed these distances into a Euclidean space. Tenenbaum (1998) showed that such isometric embeddings of image manifolds can be used to solve "visual analogies" via the parallelogram rule.

Typical approaches to manifold learning as discussed above differ from word embedding in terms of how the semantic distances between objects are extracted. Word embeddings approximate semantic distances between words using the negative log co-occurrence counts (Section 3), while manifold learning approximates semantic distances using neighborhood graphs built from local comparisons of the original, high-dimensional points. Both views seek to estimate a latent geodesic distance.

In order to study the problem of metric recovery from co-occurrence counts, and to formalize the connection between word embedding and manifold learning, we introduce a simple random walk model over the underlying objects (e.g. words or images). This toy model permits us to establish clean consistency results for recovery algorithms. We emphasize that while the random walk is introduced over the words, it is not intended as a model of language but rather as a tool to understand the recovery problem.

4.1 Random walk model

Consider now a simple metric random walk $X_t$ over words where the probability of transitioning from word $i$ to word $j$ is given by

$$P(X_t = j \mid X_{t-1} = i) = h\big(\tfrac{1}{\sigma}\|x_i - x_j\|_2^2\big) \qquad (1)$$

Here $\|x_i - x_j\|_2^2$ is the Euclidean distance between words in the underlying semantic space to be recovered, and $h$ is some unknown, sub-Gaussian function linking semantic similarity to co-occurrence.[4]

[4] This toy model ignores the role of syntax and function words, but these factors can be included as long as the moment bounds originally derived in Hashimoto et al. (2015b) remain fulfilled.

Under this model, the log frequency of occurrences of word $j$ immediately after word $i$ will be proportional to $\log(h(\|x_i - x_j\|_2^2/\sigma))$ as the corpus size grows large. Here we make the surprising observation that if we consider co-occurrences over a sufficiently large window, the log co-occurrence instead converges to $-\|x_i - x_j\|_2^2/\sigma$, i.e. it directly relates to the underlying metric. Intuitively, this result is an analog of the central limit theorem for random walks. Note that, for this reason, we do not need to know the link function $h$.
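The toy model is easy to simulate. The sketch below assumes a Gaussian link $h(u) = \exp(-u)$ and randomly drawn latent coordinates, both illustrative choices on our part; the point is only to generate corpora of the kind the following lemma reasons about.

```python
import numpy as np

# Toy simulation of the metric random walk in Equation 1, assuming a
# Gaussian link h(u) = exp(-u) and randomly drawn latent coordinates X.
# Rows are normalized so the transition probabilities sum to one.

rng = np.random.default_rng(0)
n, d, sigma = 200, 2, 0.5
X = rng.normal(size=(n, d))                        # latent semantic coordinates

sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
P = np.exp(-sq_dist / sigma)                       # h(||x_i - x_j||^2 / sigma)
P /= P.sum(axis=1, keepdims=True)                  # transition matrix

def sample_walk(start, length):
    """Record one 'sentence': a random walk over word indices."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(n, p=P[walk[-1]]))
    return walk

corpus = [sample_walk(i, 200) for i in range(n)]   # one sentence per start word
```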
Formally, given an $m$-token corpus consisting of sentences generated according to Equation 1 from a vocabulary of size $n$, let $C^{m,n}_{ij}(t_n)$ be the number of times word $j$ occurs $t_n$ steps after word $i$ in the corpus.[5] We can show that there exist unigram normalizers $a^{m,n}_i$, $b^{m,n}_j$ such that the following holds:

Lemma 1. Given a corpus generated by Equation 1 there exist $a_i$ and $b_j$ such that simultaneously over all $i,j$:

$$-\log(C^{m,n}_{ij}(t_n)) - a^{m,n}_i - b^{m,n}_j \to \|x_i - x_j\|_2^2 \quad \text{as } m,n \to \infty.$$

We defer the precise statement and conditions of Lemma 1 to Corollary 6. Conceptually, this limiting[6] result captures the intuition that while one-step transitions in a sentence may be complex and include non-metric structure expressed in $h$, co-occurrences over large windows relate directly to the latent semantic metric. For ease of notation, we henceforth omit the corpus and vocabulary size descriptors $m,n$ (using $C_{ij}$, $a_i$, and $b_j$ in place of $C^{m,n}_{ij}(t_n)$, $a^{m,n}_i$, and $b^{m,n}_j$), since in practice the corpus is large but fixed.

[5] The window-size $t_n$ depends on the vocabulary size to ensure that all word pairs have nonzero co-occurrence counts in the limit of large vocabulary and corpus. For details see the definition of $g_n$ in Appendix A.

[6] In Lemma 1, we take $m \to \infty$ (growing corpus size) to ensure all word pairs appear sufficiently often, and $n \to \infty$ (growing vocabulary) to ensure that every point in the semantic space has a nearby word.

Lemma 1 serves as the basis for establishing consistency of recovery for word embedding algorithms (next section). It also allows us to establish a precise link between manifold learning and word embedding, which we describe in the remainder of this section.

4.2 Connection to manifold learning

Let $\{v_1, \ldots, v_n\} \in \mathbb{R}^D$ be points drawn i.i.d. from a density $p$, where $D$ is the dimension of observed inputs (e.g. number of pixels, in the case of images), and suppose that these points lie on a manifold $\mathcal{M} \subset \mathbb{R}^D$ that is isometrically embeddable into $d < D$ dimensions, where $d$ is the intrinsic dimensionality of the data (e.g. coordinates representing illumination or camera angle in the case of images). The problem of manifold learning consists of finding an embedding of $v_1, \ldots, v_n$ into $\mathbb{R}^d$ that preserves the structure of $\mathcal{M}$ by approximately preserving the distances between points along this manifold. In light of Lemma 1, this problem can be solved with word embedding algorithms in the following way (a minimal sketch of the first two steps follows the list):

1. Construct a neighborhood graph (e.g. connecting points within a distance ε) over $\{v_1, \ldots, v_n\}$.

2. Record the vertex sequence of a simple random walk over these graphs as a sentence, and concatenate these sequences initialized at different nodes into a corpus.

3. Use a word embedding method on this corpus to generate d-dimensional vector representations of the data.
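A minimal sketch of steps 1 and 2, assuming the inputs are rows of an array `V`; the function name and its defaults (k = 20, 10 walks of length 200, matching the MNIST experiment in Section 6.3) are illustrative rather than the authors' released code. The returned sentences can be handed to any method from Section 5, treating vertex ids as words.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Turn high-dimensional points into random-walk "sentences".

def walk_corpus(V, k=20, walks_per_node=10, walk_len=200, seed=0):
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    # 1. k-nearest-neighbor graph (each vertex keeps its k neighbors)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(V)
    _, idx = nbrs.kneighbors(V)
    neighbors = idx[:, 1:]                      # drop the self-match
    # 2. simple random walks recorded as sentences of vertex ids
    sentences = []
    for start in range(n):
        for _ in range(walks_per_node):
            walk, cur = [start], start
            for _ in range(walk_len - 1):
                cur = rng.choice(neighbors[cur])
                walk.append(cur)
            sentences.append([str(w) for w in walk])
    return sentences
```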
Under the conditions of Lemma 1, the negative log co-occurrences over the vertices of the neighborhood graph will converge, as $n \to \infty$, to the geodesic distance (squared) over the manifold. In this case we will show that the globally optimal solutions of word embedding algorithms recover the low dimensional embedding (Section 5).[7]

[7] This approach of applying random walks and word embeddings to general graphs has already been shown to be surprisingly effective for social networks (Perozzi et al., 2014), and demonstrates that word embeddings serve as a general way to connect metric random walks to embeddings.

5 Recovering semantic distances with word embeddings

We now show that, under the conditions of Lemma 1, three popular word embedding methods can be viewed as doing metric recovery from co-occurrence counts. We use this observation to derive a new, simple word embedding method inspired by Lemma 1.

5.1 Word embeddings as metric recovery

GloVe The Global Vectors (GloVe) (Pennington et al., 2014) method for word embedding optimizes the objective function

$$\min_{\hat x, \hat c, a, b} \sum_{i,j} f(C_{ij})\big(2\langle \hat x_i, \hat c_j\rangle + a_i + b_j - \log(C_{ij})\big)^2$$

with $f(C_{ij}) = \min(C_{ij}, 100)^{3/4}$. If we rewrite the bias terms as $a_i = \hat a_i - \|\hat x_i\|_2^2$ and $b_j = \hat b_j - \|\hat c_j\|_2^2$, we obtain the equivalent representation:

$$\min_{\hat x, \hat c, \hat a, \hat b} \sum_{i,j} f(C_{ij})\big(-\log(C_{ij}) - \|\hat x_i - \hat c_j\|_2^2 + \hat a_i + \hat b_j\big)^2.$$

Together with Lemma 1, we recognize this as a weighted multidimensional scaling (MDS) objective with weights $f(C_{ij})$. Splitting the word vector $\hat x_i$ and context vector $\hat c_i$ is helpful in practice but not necessary under the assumptions of Lemma 1, since the true embedding $\hat x_i = \hat c_i = x_i/\sigma$ and $\hat a_i, \hat b_i = 0$ is a global minimum whenever $\dim(\hat x) = d$. In other words, GloVe can recover the true metric provided that we set $d$ correctly.

word2vec The skip-gram model of word2vec approximates a softmax objective:

$$\min_{\hat x, \hat c} \sum_{i,j} C_{ij} \log\left(\frac{\exp(\langle \hat x_i, \hat c_j\rangle)}{\sum_{k=1}^{n} \exp(\langle \hat x_i, \hat c_k\rangle)}\right).$$

Without loss of generality, we can rewrite the above with a bias term $b_j$ by making $\dim(\hat x) = d+1$ and setting one of the dimensions of $\hat x$ to 1. By redefining the bias $\hat b_j = b_j - \|\hat c_j\|_2^2/2$, we see that word2vec solves

$$\min_{\hat x, \hat c, \hat b} \sum_{i,j} C_{ij} \log\left(\frac{\exp(-\frac12\|\hat x_i - \hat c_j\|_2^2 + \hat b_j)}{\sum_{k=1}^{n}\exp(-\frac12\|\hat x_i - \hat c_k\|_2^2 + \hat b_k)}\right).$$

Since according to Lemma 1 $C_{ij}/\sum_{k=1}^n C_{ik}$ approaches $\frac{\exp(-\|x_i - x_j\|_2^2/\sigma^2)}{\sum_{k=1}^n \exp(-\|x_i - x_k\|_2^2/\sigma^2)}$, this is the stochastic neighbor embedding (SNE) (Hinton and Roweis, 2002) objective weighted by $\sum_{k=1}^n C_{ik}$. The global optimum is achieved by $\hat x_i = \hat c_i = x_i(\sqrt{2}/\sigma)$ and $\hat b_j = 0$ (see Theorem 8). The negative sampling approximation used in practice behaves much like the SVD approach of Levy and Goldberg (2014b), and by applying the same stationary point analysis as they do, we show that in the absence of a bias term the true embedding is a global minimum under the additional assumption that $\|x_i\|_2^2(2/\sigma^2) = \log(\sum_j C_{ij}/\sqrt{\sum_{ij} C_{ij}})$ (Hinton and Roweis, 2002).
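The bias reparameterization used above for GloVe (and, with the extra constant dimension, for word2vec) is purely algebraic, so it can be checked numerically. The sketch below uses random stand-in arrays; it only verifies that the inner-product and squared-distance forms of the GloVe residual agree once $a_i = \hat a_i - \|\hat x_i\|_2^2$ and $b_j = \hat b_j - \|\hat c_j\|_2^2$.

```python
import numpy as np

# Numerical check of the bias reparameterization.
# All arrays are random stand-ins, not trained vectors.

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))               # word vectors  x_hat
Ctx = rng.normal(size=(n, d))             # context vectors c_hat
a_hat = rng.normal(size=n)
b_hat = rng.normal(size=n)
log_count = rng.normal(size=(n, n))       # stand-in for log C_ij

a = a_hat - (X ** 2).sum(1)               # a_i = a_hat_i - ||x_i||^2
b = b_hat - (Ctx ** 2).sum(1)             # b_j = b_hat_j - ||c_j||^2

residual_inner = 2 * X @ Ctx.T + a[:, None] + b[None, :] - log_count
sq_dist = ((X[:, None, :] - Ctx[None, :, :]) ** 2).sum(-1)
residual_dist = -log_count - sq_dist + a_hat[:, None] + b_hat[None, :]

assert np.allclose(residual_inner, residual_dist)
```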
Centering the matrix Mij before obtaining the SVD would relax the norm assumption, resulting ex- actly in classical MDS (Sibson, 1979). 5.2 Metric regression from log co-occurrences We have shown that by simple reparameterizations and use of Lemma 1, existing embedding algo- rithms can be interpreted as consistent metric recov- ery methods. However, the same Lemma suggests a more direct regression method for recovering the la- tent coordinates, which we propose here. This new embedding algorithm serves as a litmus test for our metric recovery paradigm. Lemma 1 describes a log-linear relationship be- tween distance and co-occurrences. The canonical way to fit this relationship would be to use a general- ized linear model, where the co-occurrences follow a negative binomial distribution Cij ∼ NegBin(θ,p), where p = θ/[θ + exp(−1 2 ||xi −xj||22 + ai + bj)]. Under this overdispersed log linear model, E[Cij] = exp(−12||xi −xj|| 2 2 + ai + bj) Var(Cij) = E[Cij]2/θ + E[Cij] Here, the parameter θ controls the contribution of large Cij, and is akin to GloVe’s f(Cij) weight function. Fitting this model is straightforward if we define the log-likelihood in terms of the expected rate λij = exp(−12||xi −xj|| 2 2 + ai + bj) as: LL(x,a,b,θ) = ∑ i,j θ log(θ) −θ log(λij + θ)+ Cij log ( 1 − θ λij +θ ) + log ( Γ(Cij +θ) Γ(θ)Γ(Cij +1) ) To generate word embeddings, we minimize the negative log-likelihood using stochastic gradient de- scent. The implementation mirrors that of GloVe and randomly selects word pairs i,j and attracts or repulses the vectors x̂ and ĉ in order to achieve the relationship in Lemma 1. Implementation details are provided in Appendix C. Relationship to GloVe The overdispersion pa- rameter θ in our metric regression model sheds light on the role of GloVe’s weight function f(Cij). Tak- ing the Taylor expansion of the log-likelihood at log(λij) ≈− log(Cij), we have LL(x,a,b,θ) = ∑ ij kij− Cijθ2(Cij +θ) (uij) 2+o ( (uij) 3 ) , where uij = (log λij − log Cij) and kij does not depend on x. Note the similarity of the sec- ond order term with the GloVe objective. As Cij grows, the weight functions Cijθ 2(Cij +θ) and f(Cij) = max(Cij,xmax) 3/4 converge to θ/2 and xmax re- spectively, down-weighting large co-occurrences. 6 Empirical validation We will now experimentally validate two aspects of our theory: the semantic space hypothesis (Sections 2 and 3), and the correspondence between word em- bedding and manifold learning (Sections 4 and 5). Our goal with this empirical validation is not to find the absolute best method and evaluation metric for word embeddings, which has been studied before (e.g. Levy et al. (2015)). Instead, we provide empir- ical evidence in favor of the semantic space hypoth- esis, and show that our simple algorithm for metric recovery is competitive with the state-of-the-art on both semantic induction tasks and manifold learn- ing. Since metric regression naturally operates over integer co-occurrences, we use co-occurrences over unweighted windows for this and—for fairness—for the other methods (see Appendix C for details). 6.1 Datasets Corpus and training: We used three different corpora for training: a Wikipedia snapshot of 03/2015 (2.4B tokens), the original word2vec cor- pus (Mikolov et al., 2013a) (6.4B tokens), and a combination of Wikipedia with Gigaword5 emulat- ing GloVe’s corpus (Pennington et al., 2014) (5.8B tokens). We preprocessed all corpora by removing punctuation, numbers and lower-casing all the text. 
6 Empirical validation

We will now experimentally validate two aspects of our theory: the semantic space hypothesis (Sections 2 and 3), and the correspondence between word embedding and manifold learning (Sections 4 and 5). Our goal with this empirical validation is not to find the absolute best method and evaluation metric for word embeddings, which has been studied before (e.g. Levy et al. (2015)). Instead, we provide empirical evidence in favor of the semantic space hypothesis, and show that our simple algorithm for metric recovery is competitive with the state-of-the-art on both semantic induction tasks and manifold learning. Since metric regression naturally operates over integer co-occurrences, we use co-occurrences over unweighted windows for this and—for fairness—for the other methods (see Appendix C for details).

6.1 Datasets

Corpus and training: We used three different corpora for training: a Wikipedia snapshot of 03/2015 (2.4B tokens), the original word2vec corpus (Mikolov et al., 2013a) (6.4B tokens), and a combination of Wikipedia with Gigaword5 emulating GloVe's corpus (Pennington et al., 2014) (5.8B tokens). We preprocessed all corpora by removing punctuation, numbers and lower-casing all the text. The vocabulary was restricted to the 100K most frequent words in each corpus. We trained embeddings using four methods: word2vec, GloVe, randomized SVD,[8] and metric regression (referred to as regression). Full implementation details are provided in the Appendix.

[8] We used randomized, rather than full SVD due to the difficulty of scaling SVD to this problem size. For performance of full SVD factorizations see Levy et al. (2015).

Table 3: Accuracies on Google, SAT analogies and on two new inductive reasoning tasks.

            Google Semantic   Google Syntactic   Google Total    SAT           Classification   Sequence
Method      L2     Cos        L2     Cos         L2     Cos      L2     Cos    L2     Cos        L2     Cos
Regression  75.5   78.4       70.9   70.8        72.6   73.7     39.2   37.8   87.6   84.6       58.3   59.0
GloVe       71.1   76.4       68.6   71.9        69.6   73.7     36.9   35.5   74.6   80.1       53.0   58.9
SVD         50.9   58.1       51.4   52.0        51.2   54.3     32.7   24.0   71.6   74.1       49.4   47.6
word2vec    71.4   73.4       70.9   73.3        71.1   73.3     42.0   42.0   76.4   84.6       54.4   56.2

Table 4: Semantic similarity alone can solve the Google analogy tasks.

           Manifold Learning   Word Embedding
Semantic   83.3                70.7
Syntactic  8.2                 76.9
Total      51.4                73.4

For fairness we fix all hyperparameters, and develop and test the code for metric regression exclusively on the first 1GB subset of the wiki dataset. For open-vocabulary tasks, we restrict the set of answers to the top 30K words, since this improves performance while covering the majority of the questions. In the following, we show performance for the GloVe corpus throughout but include results for all corpora along with our code package.

Evaluation tasks: We test the quality of the word embeddings on three types of inductive tasks: analogies, sequence completion and classification (Figure 1). For the analogies, we used the standard open-vocabulary analogy task of Mikolov et al. (2013a) (henceforth denoted Google), consisting of 19,544 semantic and syntactic questions. In addition, we use the more difficult SAT analogy dataset (version 3) (Turney and Littman, 2005), which contains 374 questions from actual exams and guidebooks. Each question consists of 5 exemplar pairs of words word1:word2, where the same relation holds for all pairs. The task is to pick from among another five pairs of words the one that best fits the category implicitly defined by the exemplars.

Inspired by Sternberg and Gardner (1983), we propose two new difficult inductive reasoning tasks beyond analogies to verify the semantic space hypothesis: sequence completion and classification. As described in Section 2, the former involves choosing the next step in a semantically coherent sequence of words (e.g. hour, minute, ...), and the latter consists of selecting an element within the same category out of five possible choices. Given the lack of publicly available datasets, we generated our own questions using WordNet (Fellbaum, 1998) relations and word-word PMI values. These datasets were constructed before training the embeddings, so as to avoid biasing them towards any one method.

For the classification task, we created in-category words by selecting words from WordNet relations associated to root words, from which we pruned to four words based on PMI-similarity to the other words in the class. Additional options for the multiple choice questions were created by searching over words related to the root by a different relation type, and selecting those most similar to the root. For the sequence completion task, we obtained WordNet trees of various relation types, and pruned these based on similarity to the root word to obtain the sequence.
For the multiple-choice questions, we proceeded as before to select additional (incorrect) options of a different relation type to the root. After pruning, we obtain 215 classification questions and 220 sequence completion questions, of which 51 are open-vocabulary and 169 are multiple choice. These two new datasets are available.[9]

[9] http://web.mit.edu/thashim/www/supplement materials.zip

6.2 Results on inductive reasoning tasks

Solving analogies using survey data alone: We demonstrate that, surprisingly, word embeddings trained directly on semantic similarity derived from survey data can solve analogy tasks. Extending a study by Rumelhart and Abrahamson (1973), we use a free-association dataset (Nelson et al., 2004) to construct a similarity graph, where vertices correspond to words and the weights $w_{ij}$ are given by the number of times word $j$ was considered most similar to word $i$ in the survey. We take the largest connected component of this graph (consisting of 4845 words and 61570 weights) and embed it using Isomap, for which squared edge distances are defined as $-\log(w_{ij}/\max_{kl}(w_{kl}))$. We use the resulting vectors as word embeddings to solve the Google analogy task. The results in Table 4 show that embeddings obtained with Isomap on survey data can outperform the corpus based metric regression vectors on semantic, but not syntactic tasks. We hypothesize that free-association surveys capture semantic, but not syntactic similarity between words.

Analogies: The results on the Google analogies shown in Table 3 demonstrate that our proposed framework of metric regression and L2 distance is competitive with the baseline of word2vec with cosine distance. The performance gap across methods is small and fluctuates across corpora, but metric regression consistently outperforms GloVe on most tasks and outperforms all methods on semantic analogies, while word2vec does better on syntactic categories. For the SAT dataset, the L2 distance performs better than the cosine similarity, and we find word2vec to perform best, followed by metric regression. The results on these two analogy datasets show that directly embedding the log co-occurrence metric and taking L2 distances between vectors is competitive with current approaches to analogical reasoning.

Sequence and classification tasks: As predicted by the semantic field hypothesis, word embeddings perform well on the two novel inductive reasoning tasks (Table 3). Again, we observe that metric recovery with metric regression coupled with L2 distance consistently performs as well as, and often better than, the current state-of-the-art word embedding methods on these two additional semantic datasets.

6.3 Word embeddings can embed manifolds

In Section 4 we proposed a reduction for solving manifold learning problems with word embeddings, which we show achieves performance comparable to manifold learning methods. We now test this relation by performing nonlinear dimensionality reduction on the MNIST digit dataset, reducing from D = 256 to two dimensions. Using a four-thousand image subset, we construct a k-nearest neighbor graph (k = 20) and generate 10 simple random walks of length 200 starting from each vertex in the graph, resulting in 40,000 sentences of length 200 each. We compare the four word embedding methods against standard dimensionality reduction methods: PCA, Isomap, SNE, and t-SNE. We evaluate the methods by clustering the resulting low-dimensional data and computing cluster purity, measured using the percentage of 5-nearest neighbors having the same digit label.

[Figure 3: Dimensionality reduction using word embedding and manifold learning. Performance is quantified by percentage of 5-nearest neighbors sharing the same digit label.]
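The cluster-purity number used here and in Figure 3 is simple to compute. A minimal sketch, assuming `Y` holds the 2-d embeddings and `labels` the digit classes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# For each embedded point, the fraction of its 5 nearest neighbors that
# share its digit label, averaged over the dataset.

def knn_purity(Y, labels, k=5):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(Y)
    _, idx = nbrs.kneighbors(Y)
    neigh_labels = labels[idx[:, 1:]]            # drop the point itself
    return (neigh_labels == labels[:, None]).mean()
```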
The resulting embeddings, shown in Figure 3, demonstrate that metric regression is highly effective at this task, outperforming metric SNE and beaten only by t-SNE (91% cluster purity), which is a visualization method specifically designed to preserve cluster separation. All word embedding methods including SVD (68%) embed the MNIST digits remarkably well and outperform baselines of PCA (48%) and Isomap (49%).

7 Discussion

Our work recasts word embedding as a metric recovery problem pertaining to the underlying semantic space. We use co-occurrence counts from random walks as a theoretical tool to demonstrate that existing word embedding algorithms are consistent metric recovery methods. Our direct regression method is competitive with the state of the art on various semantics tasks, including two new inductive reasoning problems of series completion and classification. Our framework highlights the strong interplay and common foundation between word embedding methods and manifold learning, suggesting several avenues for recovering vector representations of phrases and sentences via properly defined Markov processes and their generalizations.

A Metric recovery from Markov processes on graphs and manifolds

Consider an infinite sequence of points $\mathcal{X}_n = \{x_1, \ldots, x_n\}$, where $x_i$ are sampled i.i.d. from a density $p(x)$ over a compact Riemannian manifold equipped with a geodesic metric $\rho$. For our purposes, $p(x)$ should have a bounded log-gradient and a strict lower bound $p_0$ over the manifold. The random walks we consider are over unweighted spatial graphs defined as follows.

Definition 2 (Spatial graph). Let $\sigma_n : \mathcal{X}_n \to \mathbb{R}_{>0}$ be a local scale function and $h : \mathbb{R}_{\ge 0} \to [0,1]$ a piecewise continuous function with sub-Gaussian tails. A spatial graph $G_n$ corresponding to $\sigma_n$ and $h$ is a random graph with vertex set $\mathcal{X}_n$ and a directed edge from $x_i$ to $x_j$ with probability $p_{ij} = h(\rho(x_i,x_j)^2/\sigma_n(x_i)^2)$.

Simple examples of spatial graphs where the connectivity is not random include the ε-ball graph ($\sigma_n(x) = \epsilon$) and the k-nearest neighbor graph ($\sigma_n(x) =$ distance to the k-th neighbor).

Log co-occurrences and the geodesic will be connected in two steps: (1) we use known results to show that a simple random walk over the spatial graph, properly scaled, behaves similarly to a diffusion process; (2) the log-transition probability of a diffusion process will be related to the geodesic metric on a manifold.

(1) The limiting random walk on a graph: Just as the simple random walk over the integers converges to a Brownian motion, we may expect that under specific constraints the simple random walk $X^n_t$ over the graph $G_n$ will converge to some well-defined continuous process. We require that the scale functions converge to a continuous function $\bar\sigma$ ($\sigma_n(x) g_n^{-1} \xrightarrow{a.s.} \bar\sigma(x)$), and that the size of a single step vanish ($g_n \to 0$) but contain at least a polynomial number of points within $\sigma_n(x)$ ($g_n n^{\frac{1}{d+2}} \log(n)^{-\frac{1}{d+2}} \to \infty$). Under this limit, and under our assumptions about the density $p(x)$ and the regularity of the transitions,[10] the
following holds:

Theorem 3 ((Hashimoto et al., 2015b; Ting et al., 2011)). The simple random walk $X^n_t$ on $G_n$ converges in Skorokhod space $D([0,\infty), \mathcal{D})$ after a time scaling $\hat t = t g_n^2$ to the Itô process $Y_{\hat t}$ valued in $C([0,\infty), \mathcal{D})$ as $X^n_{\hat t g_n^{-2}} \to Y_{\hat t}$. The process $Y_{\hat t}$ is defined over the normal coordinates of the manifold $(\mathcal{D}, g)$ with reflecting boundary conditions on $\mathcal{D}$ as

$$dY_{\hat t} = \nabla \log(p(Y_{\hat t}))\,\sigma(Y_{\hat t})^2\, d\hat t + \sigma(Y_{\hat t})\, dW_{\hat t} \qquad (2)$$

[10] For $t = \Theta(g_n^{-2})$, the marginal distribution $nP(X_t|X_0)$ must be a.s. uniformly equicontinuous. For undirected spatial graphs, this is always true (Croydon and Hambly, 2008), but for directed graphs it is an open conjecture from Hashimoto et al. (2015b).

The equicontinuity constraint on the marginal densities of the random walk implies that the transition density for the random walk converges to its continuum limit.

Lemma 4 (Convergence of marginal densities; Hashimoto et al., 2015a). Let $x_0$ be some point in our domain $\mathcal{X}_n$ and define the marginal densities $\hat q_t(x) = P(Y_t = x \mid Y_0 = x_0)$ and $q_{t_n}(x) = P(X^n_t = x \mid X^n_0 = x_0)$. If $t_n g_n^2 = \hat t = \Theta(1)$, then under condition (?) and the results of Theorem 3 such that $X^n_t \to Y^n_t$ weakly, we have

$$\lim_{n\to\infty} n\, q_{t_n}(x) = \hat q_{\hat t}(x)\, p(x)^{-1}.$$

(2) Log transition probability as a metric: We may now use the stochastic process $Y_{\hat t}$ to connect the log transition probability to the geodesic distance using Varadhan's large deviation formula.

Theorem 5 ((Varadhan, 1967; Molchanov, 1975)). Let $Y_t$ be an Itô process defined over a complete Riemannian manifold $(\mathcal{D}, g)$ with geodesic distance $\rho(x_i, x_j)$; then

$$-t\log(P(Y_t = x_j \mid Y_0 = x_i)) \to \rho(x_i, x_j)^2 \quad \text{as } t \to 0.$$

This estimate holds more generally for any space admitting a diffusive stochastic process (Saloff-Coste, 2010). Taken together, we finally obtain:

Corollary 6 (Varadhan's formula on graphs). For any $\delta, \gamma, n_0$ there exist some $\hat t$, $n > n_0$, and sequence $b^n_j$ such that the following holds for the simple random walk $X^n_t$:

$$P\Big(\sup_{x_i,x_j \in \mathcal{X}_{n_0}} \big| -\hat t\log(P(X^n_{\hat t g_n^{-2}} = x_j \mid X^n_0 = x_i)) - \hat t\, b^n_j - \rho_{\sigma(x)}(x_i,x_j)^2 \big| > \delta\Big) < \gamma$$

where $\rho_{\sigma(x)}$ is the geodesic defined as

$$\rho_{\sigma(x)}(x_i,x_j) = \min_{f\in C^1:\, f(0)=x_i,\, f(1)=x_j} \int_0^1 \sigma(f(t))\, dt$$

Proof. The proof is in two parts. First, by Varadhan's formula (Theorem 5; Molchanov, 1975, Eq. 1.7), for any $\delta_1 > 0$ there exists some $\hat t$ such that:

$$\sup_{y,y' \in \mathcal{D}} \big|-\hat t\log(P(Y_{\hat t} = y' \mid Y_0 = y)) - \rho_{\sigma(x)}(y', y)^2\big| < \delta_1$$

The uniform equicontinuity of the marginals implies their uniform convergence (Lemma S4), so for any $\delta_2 > 0$ and $\gamma_0$, there exists an $n$ such that

$$P\Big(\sup_{x_j,x_i \in \mathcal{X}_{n_0}} \big|P(Y_{\hat t} = x_j \mid Y_0 = x_i) - n\, p(x_j)\, P(X^n_{g_n^{-2}\hat t} = x_j \mid X^n_0 = x_i)\big| > \delta_2\Big) < \gamma_0$$

By the lower bound on $p$ and compactness of $\mathcal{D}$, $P(Y_{\hat t}\mid Y_0)$ is lower bounded by some strictly positive constant $c$, and we can apply uniform continuity of $\log(x)$ over $(c, \infty)$ to get that for some $\delta_3$ and $\gamma$,

$$P\Big(\sup_{x_j,x_i \in \mathcal{X}_{n_0}} \big|\log(P(Y_{\hat t} = x_j \mid Y_0 = x_i)) - \log(n\, p(x_j)) - \log(P(X^n_{g_n^{-2}\hat t} = x_j \mid X^n_0 = x_i))\big| > \delta_3\Big) < \gamma. \qquad (3)$$

Finally we have the bound

$$P\Big(\sup_{x_i,x_j \in \mathcal{X}_{n_0}} \big|-\hat t\log(P(X^n_{g_n^{-2}\hat t} = x_j \mid X^n_0 = x_i)) - \hat t\log(n\, p(x_j)) - \rho_{\sigma(x)}(x_i,x_j)^2\big| > \delta_1 + \hat t\delta_3\Big) < \gamma$$

To combine the bounds, given some $\delta$ and $\gamma$, set $b^n_j = \log(n\, p(x_j))$, pick $\hat t$ such that $\delta_1 < \delta/2$, then pick $n$ such that the bound in Eq. 3 holds with probability $\gamma$ and error $\delta_3 < \delta/(2\hat t)$.
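Varadhan's formula can be sanity-checked numerically in the simplest possible setting: a standard Brownian motion on $\mathbb{R}^d$, where the transition density is the Gaussian heat kernel and the geodesic distance is Euclidean. With that normalization the limit is recovered as $-2t\log p_t(x,y) \to \|x-y\|_2^2$; the exact constant in front depends on how the diffusion in Equation 2 is scaled, which Theorem 5 and Corollary 6 absorb into $\rho_{\sigma(x)}$.

```python
import numpy as np

# Illustration of Varadhan's formula for standard Brownian motion in R^d,
# whose transition density is the Gaussian heat kernel. As t -> 0,
# -2t * log p_t(x, y) approaches ||x - y||^2 (here 9.0).

def log_heat_kernel(x, y, t):
    d = x.shape[0]
    return -0.5 * d * np.log(2 * np.pi * t) - np.sum((x - y) ** 2) / (2 * t)

x, y = np.zeros(3), np.array([1.0, 2.0, 2.0])     # squared distance = 9
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(t, -2 * t * log_heat_kernel(x, y, t))   # tends to 9.0
```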
B Consistency proofs for word embedding

Lemma 7 (Consistency of SVD). Assume the norm of the latent embedding is proportional to the unigram frequency, $\|x_i\|_2^2/\sigma^2 = C_i/(\sum_j C_j)^{\frac12}$. Under these conditions, let $\hat X$ be the embedding derived from the SVD of $M_{ij}$ as

$$2\hat X\hat X^{T} = M_{ij} = \log(C_{ij}) - \log(C_i) - \log(C_j) + \log\Big(\sum_i C_i\Big) + \tau.$$

Then there exists a $\tau$ such that this embedding is close to the true embedding, under the same equivalence class as Lemma S7:

$$P\Big(\sum_i \|A\hat x_i/\sigma^2 - x_i\|_2^2 > \delta\Big) < \epsilon.$$

Proof. By Corollary 6, for any $\delta_1 > 0$ and $\epsilon_1 > 0$ there exists an $m$ such that

$$P\Big(\sup_{i,j} \big|-\log(C_{ij}) - \|x_i - x_j\|_2^2/\sigma^2 - \log(mc)\big| > \delta_1\Big) < \epsilon_1.$$

Now additionally, if $C_i/\sqrt{\sum_j C_j} = \|x_i\|_2^2/\sigma^2$, then we can rewrite the above bound as

$$P\Big(\sup_{i,j} \big|\log(C_{ij}) - \log(C_i) - \log(C_j) + \log\big(\sum_i C_i\big) - 2\langle x_i, x_j\rangle/\sigma^2 - \log(mc)\big| > \delta_1\Big) < \epsilon_1,$$

and therefore

$$P\Big(\sup_{i,j} \big|M_{ij} - 2\langle x_i, x_j\rangle/\sigma^2 - \log(mc)\big| > \delta_1\Big) < \epsilon_1.$$

Given that the dot product matrix has error at most $\delta_1$, the resulting embedding is known to have at most $\sqrt{\delta_1}$ error (Sibson, 1979). This completes the proof, since we can pick $\tau = -\log(mc)$, $\delta_1 = \delta^2$ and $\epsilon_1 = \epsilon$.

Theorem 8 (Consistency of softmax/word2vec). Define the softmax objective function with bias as

$$g(\hat x, \hat c, \hat b) = \sum_{ij} C_{ij} \log \frac{\exp(-\|\hat x_i - \hat c_j\|_2^2 + \hat b_j)}{\sum_{k=1}^n \exp(-\|\hat x_i - \hat c_k\|_2^2 + \hat b_k)}$$

Define $x^m, c^m, b^m$ as the global minima of the above objective function for a co-occurrence $C_{ij}$ over a corpus of size $m$. For any $\epsilon > 0$ and $\delta > 0$ there exists some $m$ such that

$$P\big(|g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - g(x^m, c^m, b^m)| > \delta\big) < \epsilon$$

Proof. By differentiation, any objective of the form

$$\min_{\lambda} \sum_{ij} C_{ij} \log\Big(\frac{\exp(-\lambda_{ij})}{\sum_k \exp(-\lambda_{ik})}\Big)$$

has the minima $\lambda^*_{ij} = -\log(C_{ij}) + a_i$, up to an unidentifiable $a_i$, with objective function value $\sum_{ij} C_{ij}\log(C_{ij}/\sum_k C_{ik})$. This gives a global function lower bound

$$g(x^m, c^m, b^m) \ge \sum_{ij} C_{ij}\log\Big(\frac{C_{ij}}{\sum_k C_{ik}}\Big) \qquad (4)$$

Now consider the function value of the true embedding $\frac{x}{\sigma}$:

$$g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) = \sum_{ij} C_{ij}\log\frac{\exp(-\frac{1}{\sigma^2}\|x_i - x_j\|_2^2)}{\sum_k \exp(-\frac{1}{\sigma^2}\|x_i - x_k\|_2^2)} = \sum_{ij} C_{ij}\log\Big(\frac{\exp(\log(C_{ij}) + \delta_{ij} + a_i)}{\sum_k \exp(\log(C_{ik}) + \delta_{ik} + a_i)}\Big).$$

We can bound the error variables $\delta_{ij}$ using Corollary 6 as $\sup_{ij}|\delta_{ij}| < \delta_0$ with probability $\epsilon_0$ for sufficiently large $m$, with $a_i = \log(m_i) - \log(\sum_{k=1}^n \exp(-\|x_i - x_k\|_2^2/\sigma^2))$. Taking the Taylor expansion at $\delta_{ij} = 0$, we have

$$g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) = \sum_{ij} C_{ij}\log\frac{C_{ij}}{\sum_k C_{ik}} + \sum_{l=1}^n \frac{C_{il}}{\sum_k C_{ik}}\delta_{il} + o(\|\delta\|_2^2)$$

By the law of large numbers of $C_{ij}$,

$$P\Big(\Big|g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - \sum_{ij} C_{ij}\log\Big(\frac{C_{ij}}{\sum_k C_{ik}}\Big)\Big| > n\delta_0\Big) < \epsilon_0,$$

which combined with (4) yields

$$P\big(|g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - g(x, c, b)| > n\delta_0\big) < \epsilon_0.$$

To obtain the original theorem statement, take $m$ to fulfil $\delta_0 = \delta/n$ and $\epsilon_0 = \epsilon$.

Note that for word2vec with negative sampling, applying the stationary point analysis of Levy and Goldberg (2014b) combined with the analysis in Lemma S7 shows that the true embedding is a global minimum.

C Empirical evaluation details

C.1 Implementation details

We used off-the-shelf implementations of word2vec[11] and GloVe[12]. The two other methods, (randomized) SVD and regression embedding, are both implemented on top of the GloVe codebase. We used 300-dimensional vectors and window size 5 in all models. Further details are provided below.

[11] http://code.google.com/p/word2vec
[12] http://nlp.stanford.edu/projects/glove

word2vec. We used the skip-gram version with 5 negative samples, 10 iterations, α = 0.025 and frequent word sub-sampling with a parameter of 10^-3.
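For reference, the word2vec settings above can be reproduced with an off-the-shelf library. The sketch below uses gensim (version 4 or later parameter names); this is an assumption on our part, since the experiments here used the original C implementation.

```python
from gensim.models import Word2Vec

# Placeholder corpus: any iterable of token lists works here, e.g. the
# random-walk sentences from Sections 4.2 and 6.3.
sentences = [["penny", "nickel", "dime", "quarter"]] * 100

# Skip-gram, 5 negative samples, 10 iterations, alpha = 0.025,
# sub-sampling 1e-3, 300 dimensions, window 5 (the settings above).
model = Word2Vec(
    sentences=sentences,
    vector_size=300, window=5, sg=1,
    negative=5, sample=1e-3, alpha=0.025,
    epochs=10, min_count=1,
)
vectors = model.wv            # KeyedVectors, e.g. vectors["dime"]
```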
GloVe. We disabled GloVe's corpus weighting, since this generally produced superior results. The default step-sizes result in NaN-valued embeddings, so we reduced them. We used XMAX = 100, η = 0.01 and 10 iterations.

SVD. For the SVD algorithm of Levy and Goldberg (2014b), we use the GloVe co-occurrence counter combined with a parallel randomized projection SVD factorizer, based upon the redsvd library, due to memory and runtime constraints.[13] Following Levy et al. (2015), we used the square root factorization, no negative shifts (τ = 0 in our notation), and 50,000 random projections.

[13] https://github.com/ntessore/redsvd-h

Regression Embedding. We use standard SGD with two differences. First, we drop co-occurrence values with probability proportional to 1 − Cij/10 when Cij < 10, and scale the gradient, which resulted in training time speedups with no loss in accuracy. Second, we use an initial line search step combined with a linear step size decay by epoch. We use θ = 50, and η is line-searched starting at η = 10.

C.2 Solving inductive reasoning tasks

The ideal point for a task is defined below:

• Analogies: Given A:B::C, the ideal point is given by B − A + C (parallelogram rule).

• Analogies (SAT): Given prototype A:B and candidates C1:D1, ..., Cn:Dn, we compare Di − Ci to the ideal point B − A.

• Categories: Given a category implied by w1, ..., wn, the ideal point is I = (1/n) Σ_{i=1}^{n} w_i.

• Sequence: Given sequence w1 : ... : wn, we compute the ideal as I = w_n + (1/n)(w_n − w_1).

Once we have the ideal point I, we pick the answer as the word closest to I among the options, using L2 or cosine distance. For the latter, we normalize I to unit norm before taking the cosine distance. For L2 we do not apply any normalization.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2015. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. arXiv preprint arXiv:1502.03520.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

David A Croydon and Ben M Hambly. 2008. Local limit theorems for sequences of simple random walks on graphs. Potential Analysis, 29(4):351–389.

Christiane Fellbaum, editor. 1998. WordNet: an electronic lexical database.

Thomas L Griffiths, Mark Steyvers, and Joshua B Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211.

Zellig S Harris. 1954. Distributional structure. Word, 10(23):146–162.

Tatsunori Hashimoto, Yi Sun, and Tommi Jaakkola. 2015a. From random walks to distances on unweighted graphs. In Advances in Neural Information Processing Systems.

Tatsunori Hashimoto, Yi Sun, and Tommi Jaakkola. 2015b. Metric recovery from directed unweighted graphs. In Artificial Intelligence and Statistics, pages 342–350.

Geoffrey E Hinton and Sam T Roweis. 2002. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 833–840.

Matthäus Kleindessner and Ulrike von Luxburg. 2015. Dimensionality estimation without distances. In Proceedings of the Artificial Intelligence and Statistics Conference (AISTATS).

Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the 18th Conference on Computational Natural Language Learning, pages 171–180.

Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b.
Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

SA Molchanov. 1975. Diffusion processes and Riemannian geometry. Russian Mathematical Surveys, 30(1):1.

Douglas L Nelson, Cathy L McEvoy, and Thomas A Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–407.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, pages 1532–43.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM.

David E Rumelhart and Adele A Abrahamson. 1973. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28.

Laurent Saloff-Coste. 2010. The heat kernel and its estimates. Probabilistic Approach to Geometry, 57:405–436.

Gideon Schwarz and Amos Tversky. 1980. On the reciprocity of proximity relations. Journal of Mathematical Psychology, 22(3):157–175.

Robin Sibson. 1979. Studies in the robustness of multidimensional scaling: Perturbational analysis of classical scaling. Journal of the Royal Statistical Society, Series B (Methodological), pages 217–229.

Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Robert J Sternberg and Michael K Gardner. 1983. Unities in inductive reasoning. Journal of Experimental Psychology: General, 112(1):80.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Joshua B Tenenbaum. 1998. Mapping a manifold of perceptual observations. In Advances in Neural Information Processing Systems, pages 682–688.

Daniel Ting, Ling Huang, and Michael Jordan. 2011. An analysis of the convergence of graph Laplacians. arXiv preprint arXiv:1101.5435.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. ACL.

Peter D Turney and Michael L Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.

Amos Tversky and J Hutchinson. 1986. Nearest neighbor analysis of psychological spaces. Psychological Review, 93(1):3.

Srinivasa RS Varadhan. 1967. Diffusion processes in a small time interval. Communications on Pure and Applied Mathematics, 20(4):659–685.