Higher-order Lexical Semantic Models for Non-factoid Answer Reranking

Daniel Fried1, Peter Jansen1, Gustave Hahn-Powell1, Mihai Surdeanu1, and Peter Clark2
1 University of Arizona, Tucson, AZ, USA
2 Allen Institute for Artificial Intelligence, Seattle, WA, USA
{dfried,pajansen,hahnpowell,msurdeanu}@email.arizona.edu
peterc@allenai.org

Abstract

Lexical semantic models provide robust performance for question answering, but, in general, can only capitalize on direct evidence seen during training. For example, monolingual alignment models acquire term alignment probabilities from semi-structured data such as question-answer pairs; neural network language models learn term embeddings from unstructured text. All this knowledge is then used to estimate the semantic similarity between question and answer candidates. We introduce a higher-order formalism that allows all these lexical semantic models to chain direct evidence to construct indirect associations between question and answer texts, by casting the task as the traversal of graphs that encode direct term associations. Using a corpus of 10,000 questions from Yahoo! Answers, we experimentally demonstrate that higher-order methods are broadly applicable to alignment and language models, across both word and syntactic representations. We show that an important criterion for success is controlling for the semantic drift that accumulates during graph traversal. All in all, the proposed higher-order approach improves five out of the six lexical semantic models investigated, with relative gains of up to +13% over their first-order variants.

1 Introduction

Open-domain question answering (QA), which finds short textual answers to natural language questions, is often viewed as the successor to keyword search (Etzioni, 2011) and one of the most difficult and widely applicable end-user applications of natural language processing (NLP). Spanning syntactic parsing, discourse processing, and lexical semantics, QA requires a level of functionality across a variety of topics that makes it a natural, yet challenging, proving ground for many aspects of NLP. Here, we address a particularly challenging QA subtask: open-domain non-factoid QA, where queries take the form of complex questions (e.g., manner or How questions), and answers range from single sentences to entire paragraphs. Because this task is so complex and large in scope, current state-of-the-art open-domain systems perform at only about 30% P@1, i.e., they answer roughly one out of three questions correctly (Jansen et al., 2014).

In this paper we focus on answer ranking (AR), a key component of non-factoid QA that orders candidate answers by the likelihood that they capture the information needed to answer a question. Berger et al. (2000) observed that, unlike in keyword search, lexical matching methods are generally insufficient for QA, where questions and answers often have little to no lexical overlap (as in the case of Where should we go for breakfast? and Zoe's Diner has great pancakes). Previous work has shown that lexical semantics (LS) models are well suited to bridging this "lexical chasm", and at least two flavors of lexical semantics have been successfully applied to QA. The first treats QA as a monolingual alignment problem, learning associations between words (or other structures) that appear in question-answer pairs (Surdeanu et al., 2011; Yao et al., 2013).
The second computes the semantic similarity between question and answer using language models acquired from relevant texts (Yih et al., 2013; Jansen et al., 2014).

Here we argue that while these models begin to bridge the "lexical chasm", many still suffer from sparsity and only capitalize on direct evidence. Returning to our example question, if we also train on the QA pair What goes well with pancakes? and hashbrowns and toast, we can use the indirect association between breakfast – pancakes and pancakes – hashbrowns to locate additional answers for where to visit for breakfast, including Regee's has the best hashbrowns in town.

Figure 1: A hypothetical graph derived from a lexical semantic alignment model.

We can represent LS models as a graph, as in Figure 1. For a word-based alignment model, the graph nodes represent individual words, and the (directed) edges capture the likelihood of a word wa appearing in a gold answer given word wq in the question. Grounding this in our example, w1 may represent breakfast, w2 pancakes, and w4 hashbrowns. For a language model, the edges capture the similarity of contexts between the words in the nodes. For both alignment and language model graphs, we observe that the semantic association between two words (or structures) stored in the nodes can be determined by investigating the different paths that connect these two nodes, even when they are not directly connected.

We call this class of models higher-order lexical models, in contrast to the first-order lexical models introduced in previous work, which rely only on direct evidence to estimate association strength. For example, all alignment models in previous work can estimate P(wq|wa), i.e., the probability of generating the question word wq if the answer contains the word wa, only if this pair has been seen at least once in training. On the other hand, our approach can estimate P(wq|wa) from indirect evidence, e.g., by chaining two distinct question/answer pairs that contain the words (wq, wi) and (wi, wa), respectively.

The contributions of this work are:

1. This is the first work to our knowledge that proposes higher-order LS models for QA. We show that an important criterion for success is controlling for the semantic drift that accumulates in the graph traversal paths, which plagues traditional random walk algorithms. For example, we empirically demonstrate that paths up to a length of three are generally beneficial, but longer paths hurt or do not improve performance.

2. We show that a variety of LS models and representations, including alignment and language models, over both words and syntactic structures, can be adapted to the proposed higher-order formalism. In this latter respect we introduce a novel syntax-based variant of the neural network language model (NNLM) of Mikolov et al. (2013) that models syntactic dependencies rather than words, which allows it to capture knowledge that is complementary to that of word-based NNLMs.

3. The training process for alignment models requires a large corpus of QA pairs.
Due to these resource requirements, we evaluate our higher-order LS models on a commu- nity question answering (CQA) task (Wang et al., 2009; Jansen et al., 2014) across thousands of how questions, and show that most higher-order models perform signifi- cantly better than their first-order variants. 4. We demonstrate that language models and alignment models capture complementary in- formation, and can be combined to improve the performance of the CQA system for man- ner questions. 2 Related Work We focus on statistical LS methods for open- domain QA, in particular CQA tasks, such as the ones driven by Yahoo! Answers datasets (Wang et al., 2009; Jansen et al., 2014). Berger et al. (2000) were the first to propose a statistical LS model to “bridge the lexical chasm” between questions and answers for a QA task. Building off this work, a number of LS models us- ing either words or other syntactic and semantic structures have been proposed for QA (Echihabi and Marcu, 2003; Soricut and Brill, 2006; Rie- zler et al., 2007; Surdeanu et al., 2011; Yao et al., 2013; Jansen et al., 2014), or related tasks, such as semantic textual similarity (Sultan et al., 2014). However, all of these models fall into the class of first-order models. To our knowledge, this work is the first to investigate higher-order LS models for QA, which can take advantage of indirect ev- idence, i.e., the “neighbors of neighbors” in the association graph. 198 Second-order lexical models have recently been brought to bear on a variety of other NLP tasks. Zapirain et al. (2013) use a second-order model based on distributional similarity (Lin, 1998) to improve a selectional preference model, and Lee et al. (2012) use a similar approach for coreference resolution. Our work expands on these ideas with models of arbitrary order, an approach to control semantic drift, and an application to QA. This work falls under the larger umbrella of algorithms for graph-based inference, which have been successfully applied to other NLP and information retrieval problems, such as rela- tion extraction (Chakrabarti and Agarwal, 2006; Chakrabarti, 2007; Lao and Cohen, 2010), in- ference over knowledge bases (Lao et al., 2011), name disambiguation (Minkov et al., 2006), and search (Bhalotia et al., 2002; Tong et al., 2006). While many of these approaches use random walk algorithms, in pilot experiments we observed that random walks, such as PageRank (PR) (Page et al., 1999), tend to accumulate semantic drift from the originating node because they consider all possi- ble paths in the graph. This semantic drift reduces the quality of the higher-order associations, which impacts QA performance. Here we implement a conservative graph traversal algorithm, similar in spirit to the “cautious bootstrapping” algorithm of Yarowsky (Whitney and Sarkar, 2012; Yarowsky, 1995). By constraining the traversal paths, our al- gorithm runs two orders of magnitude faster than PR, while controlling for semantic drift. 3 Approach The architecture of our proposed QA framework is illustrated in Figure 2. Here we evaluate both first- order and higher-order LS models in the context of community question answering (CQA), using a large dataset of QA pairs from Yahoo! Answers1. We use a standard CQA evaluation task (Jansen et al., 2014), where one must rank a set of user- generated answers to a given question, such that the community-selected best answer appears in the top position. 
More specifically, for a given question, the candidate retrieval (CR) component fetches all its community-generated answers from the answer collection, and gives each answer candidate an initial ranking using shallow information retrieval scoring.2 These answer candidates are then passed on to the answer reranking component, the focus of this work.

Figure 2: Architecture of the reranking framework for QA, showing a reranking component that incorporates two classes of lexical semantic models (alignment and neural network language models), implemented over two representations of content (words and syntax).

The answer reranking (AR) component analyzes the answer candidates using more expensive techniques to extract first-order and higher-order LS features, and these features are then used in concert with a learning framework to rerank the candidates and elevate correct answers to higher positions. For the learning framework, we use SVMrank, a variant of Support Vector Machines for structured output adapted to ranking problems.3 In addition to these features, each reranker also includes a single feature containing the score of each candidate, as computed by the above candidate retrieval component.4

1 http://answers.yahoo.com
2 We used the same scoring as Jansen et al. (2014): cosine similarity between the question and candidate answer's lemma vector representations, with lemmas weighted using tf.idf (Ch. 6, (Manning et al., 2008)).
3 http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
4 Including these scores as features in the reranker model is a common strategy that ensures that the reranker takes advantage of the analysis already performed by the CR model.

4 First-Order Lexical Models

We next introduce a series of first-order LS models, which serve as the foundation for the higher-order models discussed in the next section.

4.1 Neural Network Language Models

Inspired by previous work (Yih et al., 2013; Jansen et al., 2014), we adapt the word2vec NNLM of Mikolov et al. (2013) to this QA task. In particular, we use their skip-gram model with hierarchical sampling. This model predicts the context of a word given the word itself, and through this process embeds words into a latent conceptual space with a fixed number of dimensions. Consequently, related words tend to have vector representations that are close to each other in this space. This type of predictive algorithm has been found to perform considerably better than count-based approaches to distributional similarity on a variety of semantic tasks (Baroni et al., 2014).

We derive four LS measures from these vectors, which are then included as features in the reranker. The first is a measure of the overall similarity of the question and answer candidate, computed as the cosine similarity between two composite vectors. These composite vectors are assembled by summing the vectors for individual question (or answer candidate) words, and re-normalizing this composite vector to unit length.5 In addition to this overall similarity score, we compute the pairwise similarities between each word in the question and answer candidates, and include as features the average, minimum, and maximum pairwise similarities, as sketched below.

5 We also tried the multiplicative strategy for word-vector combination of Levy and Goldberg (2014b), but it did not improve our results.
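For illustration only, the following minimal sketch shows one way these four NNLM-derived features could be computed. The function and variable names (e.g., nnlm_features, embeddings) are ours, not part of the original system, and we assume word embeddings are available as NumPy arrays keyed by word.

```python
# Minimal sketch (not the authors' code) of the four NNLM-derived features:
# overall composite cosine similarity plus average/min/max pairwise similarities.
import numpy as np

def composite_vector(words, embeddings):
    """Sum the word vectors and re-normalize the result to unit length."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    v = np.sum(vecs, axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nnlm_features(question_words, answer_words, embeddings):
    """Return the four lexical semantic features used by the reranker."""
    q_vec = composite_vector(question_words, embeddings)
    a_vec = composite_vector(answer_words, embeddings)
    overall = cosine(q_vec, a_vec) if q_vec is not None and a_vec is not None else 0.0

    pairwise = [cosine(embeddings[q], embeddings[a])
                for q in question_words if q in embeddings
                for a in answer_words if a in embeddings]
    if not pairwise:
        pairwise = [0.0]
    return {
        "nnlm_overall": overall,
        "nnlm_avg_pairwise": float(np.mean(pairwise)),
        "nnlm_min_pairwise": float(min(pairwise)),
        "nnlm_max_pairwise": float(max(pairwise)),
    }
```

These feature values would then be concatenated with the candidate retrieval score and passed to the SVMrank learner described above.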
4.2 Alignment Models

Berger et al. (2000) showed that learning question-to-answer transformations using a statistical machine translation (SMT) model begins to "bridge the lexical chasm" between questions and answers. We build upon this observation, using the IBM Model 1 (Brown et al., 1993) variant of Surdeanu et al. (2011) to determine the probability that a question Q is a translation of an answer A, P(Q|A):

P(Q|A) = \prod_{q \in Q} P(q|A)    (1)

P(q|A) = (1 - \lambda) P_{ml}(q|A) + \lambda P_{ml}(q|C)    (2)

P_{ml}(q|A) = \sum_{a \in A} T(q|a) P_{ml}(a|A)    (3)

where the probability that the question term q is generated from answer A, P(q|A), is smoothed using the prior probability that the term q is generated from the entire collection of answers C, P_{ml}(q|C). P_{ml}(q|A) is computed as the sum of the probabilities that the question term q is a translation of any answer term a, T(q|a), weighted by the probability that a is generated from A. The translation table for T(q|a) is computed using GIZA++ (Och and Ney, 2003).

Similar to Surdeanu et al. (2011) we: (a) set P_{ml}(q|C) to a small value for out-of-vocabulary words; (b) modify the T(.|.) distributions to guarantee that the probability of translating a word to itself, i.e., T(w|w), is highest (Murdock and Croft, 2005); and (c) tune the smoothing parameter \lambda on a development corpus.

QA systems generally use the above model to determine the global alignment probability between a given question and answer candidate, P(Q|A). A novel contribution of our work is that we also use the alignment model's probability distributions (from a source word to destination words) as distributed representations for source words, based on the observation that words with similar alignment distributions are likely to have a similar meaning. Formally, we denote the alignment vector representation for the ith word in the vocabulary as w(i), and define the jth entry in the vector as w(i)_j = T(word_j|word_i), where T(.|.) is the translation probability distribution from Eq. 3. That is, the jth entry of the vector for word i is the conditional probability of seeing word_j in a question, given that word_i was seen in a corresponding answer. The vector for each word then represents a discrete (conditional) probability distribution.

To compare two words, we use the square root of the Jensen-Shannon divergence (JSD) between their conditional distributions.6 Let m(i,j) be the element-wise average of the vectors w(i) and w(j), that is (w(i) + w(j))/2, and let K(w,v) be the Kullback-Leibler divergence between distributions (represented as vectors) w and v:

K(w, v) = \sum_{i=0}^{|V|} w_i \ln \frac{w_i}{v_i}    (4)

where |V| is the vocabulary size (and the dimensionality of w). Then the distance between words i and j is

J(w(i), w(j)) = \sqrt{\frac{K(w(i), m(i,j)) + K(w(j), m(i,j))}{2}}    (5)

6 We use the square root of the Jensen-Shannon divergence, derived from Kullback-Leibler divergence, since it is a distance metric (in particular it is finite and symmetric).

We derive four additional LS features from these alignment vector representations, which parallel the features derived from the NNLM vector representations (§4.1). The first is the JSD between the composite alignment vectors for the question and the answer candidate, where the composite is constructed by summing the vectors for individual question (or answer candidate) words, and dividing by the number of vectors summed; a sketch of this distance computation appears below.
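The following sketch illustrates the distance of Eq. 5 and the composite-vector feature. It is our own illustrative code (names such as jsd_distance are not from the paper), and it assumes each term's alignment distribution is available as a NumPy array over the destination vocabulary.

```python
# Illustrative sketch (not the authors' implementation) of the square-root
# Jensen-Shannon distance of Eq. 5 and the composite alignment-vector feature.
import numpy as np

def kl_divergence(w, v, eps=1e-12):
    """K(w, v) = sum_i w_i * ln(w_i / v_i); eps guards against zero entries."""
    w = np.asarray(w, dtype=float) + eps
    v = np.asarray(v, dtype=float) + eps
    return float(np.sum(w * np.log(w / v)))

def jsd_distance(w_i, w_j):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    m = (np.asarray(w_i, dtype=float) + np.asarray(w_j, dtype=float)) / 2.0
    jsd = (kl_divergence(w_i, m) + kl_divergence(w_j, m)) / 2.0
    return float(np.sqrt(jsd))

def composite_distribution(words, align_vectors):
    """Sum the words' alignment distributions and divide by the number summed."""
    vecs = [align_vectors[w] for w in words if w in align_vectors]
    if not vecs:
        return None
    return np.sum(vecs, axis=0) / len(vecs)

# Composite JSD feature between a question and an answer candidate:
# q_dist = composite_distribution(question_words, align_vectors)
# a_dist = composite_distribution(answer_words, align_vectors)
# composite_jsd = jsd_distance(q_dist, a_dist)
```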
We also generate features from the av- erage, minimum, and maximum pairwise JSD between words in the question and words in the answer candidate. In practice, we have found that the addition of these four features considerably improves the performance of all alignment mod- els investigated for the QA task. 4.3 Modeling Syntactic Structures Surdeanu et al. (2011) demonstrated that align- ment models can readily capitalize on represen- tations other than words, including representa- tions of syntax. Here we expand on their work and explore three syntactic variations of the align- ment and NNLM models. For these models we make use of collapsed syntactic dependencies with processed CC complements (de Marneffe et al., 2006), extracted from the Annotated English Gi- gaword (Napoles et al., 2012). The first model is a straightforward adaptation of the alignment model in §4.2, where the repre- sentations of questions and answers are changed from bags of words to “bags of syntactic depen- dencies” (Surdeanu et al., 2011). That is, the terms to be aligned (q and a in Eq. 1 to 3) become 〈head,label,dependent〉 dependencies, such as 〈eat,dobj,pancakes〉, rather than words. The motivation behind this change is that it allows the model to align simple propositions rather than words, which are a more accurate representation of the information encoded in the question. While tuning on the development set of questions we found that unlabeled dependencies, which ignore the middle term in the above tuples, performed better than labeled dependencies, possibly due to the sparsity of the labeled dependencies. As such, all our dependency alignment models use unla- beled dependencies. The second model adapts the idea of working with bags of syntactic dependencies, instead of bags of words, to NNLMs, and builds an LM over dependencies rather than words. However, be- cause the NNLM embeddings are generated from context, we must find a consistent ordering of de- pendencies that maintains contextual information. Here we order dependencies using a depth-first, left-to-right traversal of the dependency graph originating at the predicate or root node.7 Fig- ure 3 shows an example of such a traversal, which results in the following ordering (using unlabeled dependencies): (6) be time time the time same be country country no be should be business business the business concealing concealing history history its Figure 3: DFS L→R ordering of word pairs starting from the sentential root. The sentence is taken from LTW ENG 20081201.0002 of the English Gigaword. The model uses the default word2vec imple- mentation to compute the term embeddings. To our knowledge, this syntax-driven extension of NNLMs is novel. During tuning we compared the labeled and unlabeled dependency representations against an NNLM model using traditional bigrams (where words are paired based on their linear order, i.e., without any syntactic motivation). Again, we found that unlabeled dependencies outperformed surface bigrams and labeled dependencies. Lastly, we also evaluate the recent NNLM model proposed by Levy and Goldberg (2014a), who modified the word2vec embedding proce- dure to include context derived from dependen- cies, arguing that this embedding model encodes functional similarity better than Mikolov et al.’s unigram surface structure model. 
It is important to note that while in our syntactic NNLM model the LM terms are (lexicalized unlabeled) syntactic dependency tuples, in Levy and Goldberg's model the LM terms are still words, but the context is driven by syntax instead of surface structure.

7 Other ordering schemes were tested during development, but DFS proved to be the most stable across sentences.

5 Higher-order lexical models

The first-order models we have just described make use of the direct evidence seen in training data to create associations between words (or dependencies). Unfortunately this training data is often quite sparse, and while it may be possible to exhaustively create training data with broad coverage for artificially narrow domains such as things to eat for breakfast, this quickly becomes intractable for larger domains. To mitigate this sparsity, here we propose a method that identifies associations that were not explicitly seen during training. Returning to our example in §1, had an alignment model learned the associations breakfast – pancakes and pancakes – hashbrowns, the model should be able to make use of the indirect association between breakfast – hashbrowns to further "bridge the lexical chasm", and identify answer candidates containing the word hashbrowns to the question Where is a good place to eat breakfast?

At first glance, algorithms such as PageRank (Page et al., 1999) appear to offer an attractive and intuitive method to determine the association between arbitrary nodes in a weighted graph by modeling a random walk to convergence (from any given starting node). As we show in Section 6, we have found that these methods quickly accumulate semantic drift (McIntosh, 2010), because: (a) they run to convergence, i.e., they traverse paths of arbitrary length, and (b) they explore the entire association space encoded in the neighborhood graph, both of which introduce considerable noise. For example, by chaining even small numbers of direct associations, such as breakfast – pancakes, pancakes – hashbrowns, hashbrowns – potato, and potato – field, a model that does not control for semantic drift may be tempted to answer questions about breakfast venues with answers that discuss wheat fields or soccer fields. Thus, keeping these indirect associations short and on-context is critical for QA.

Based on the above observations, here we outline a general method for creating higher-order lexical semantic models that incorporate simple measures to control for semantic drift, and demonstrate how to apply it to both alignment graph representations and the distributed vector space representations of a NNLM. Our method relies on two general properties of the first-order LS models: (a) all models include a distributed representation for each term8 that is encoded as a vector (e.g., for NNLMs this is a term's embedding in vector space, and for alignment models this is a vector of alignment probabilities from a given answer term to all possible question terms), and (b) models define a pairwise function that assigns a lexical similarity score between pairs of directly connected terms (e.g., for NNLMs this is the cosine similarity of the corresponding term embeddings, and for alignment models this is given directly by T(q|a), from Equation 3); a sketch of these two pairwise functions is given below.

8 Henceforth we use "term" to indicate either a word or a dependency in a lexical semantics model.
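As an illustration of property (b), the two pairwise scoring functions used in the higher-order construction could look as follows. This is a hedged sketch in our own notation (the names s_cos and s_align are not from the paper); it assumes the NNLM embeddings and the IBM Model 1 translation table are available as Python dictionaries.

```python
# Sketch of the two pairwise similarity functions assumed by the higher-order
# construction: cosine similarity for NNLM embeddings (cf. Eq. 9), and the
# translation probability T(term_j | term_i) for alignment models (cf. Eq. 8).
import numpy as np

def s_cos(embeddings, term_i, term_j):
    """NNLM setting: cosine similarity of the two term embeddings."""
    u, v = embeddings[term_i], embeddings[term_j]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_align(translation_table, term_i, term_j):
    """Alignment setting: T(term_j | term_i) from the IBM Model 1 table."""
    return translation_table.get(term_i, {}).get(term_j, 0.0)
```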
Intuitively, we incorporate indirect evidence in our models by averaging the distributed representation of a term t with the distributed representations of its neighbors, i.e., those terms with highest lexical similarity to t. To control for semantic drift we consider: (a) only the k most similar neighbors to the current term, and (b) short paths in the association graph, e.g., neighbors of neighbors but no further. For example, for a second-order model, we construct this new vector representation for each node in the graph by averaging the vector of the node itself and the vectors of its k closest neighbors (weighted by each pair's similarity), so that the vector representation of breakfast becomes a weighted sum of breakfast, morning, food, cereal, and so forth. By using only the closest neighbors, we exclude low associates of breakfast, such as spoon or vacation, which are only marginally related and therefore likely to generate semantic drift.

Formally, let w be the vector representation (either from an NNLM or alignment model) of a given term. We construct a higher-order representation ŵ by linearly interpolating w with its top associates, those vectors with highest value of a pairwise scoring function s:

\hat{w}(i) = \sum_{j \in N_k(w(i))} s(w(i), w(j)) \, w(j)    (7)

where s is the function that measures the lexical similarity between terms, and depends on the representations used (s_align or s_cos, see below). N_k(w(i)), the neighborhood of the vector w(i), denotes the indices of the k term vectors with highest values of s(w(i), ·), and k is a hyperparameter that controls how many vectors are interpolated. The resulting vectors are renormalized, as discussed later.9

9 N_k(w) includes w, because in all LS models proposed here, each term has maximal similarity with itself. The normalization applied after combining all vectors has the effect of modulating the contribution of the vector w in the higher-order representation ŵ, based on how similar the other vectors in N_k(w) are to w.

It is important to note that this process of creating higher-order vectors can be trivially iterated arbitrarily, constructing a second-order model from a first-order model, then a third-order model from a second, etc. Intuitively, this models paths of increasing length in the association graph over terms. We now describe the concrete implementations of the procedure in the alignment and the NNLM settings.

Higher-order Alignment: In the alignment setting, where the vector w(i) for term i contains the probabilities that the source (answer) term i is aligned with the destination (question) term j, we use

s_align(w(i), w(j)) = T(term_j | term_i)    (8)

where T is the translation probability from Equation (3). We normalize the ŵ vectors of Equation (7) using the L1 norm, so that each continues to represent a probability distribution. Intuitively, the second-order alignment distribution ŵ(i) for source term i is a linear combination of term i's most probable destination terms' distributions, weighted by the destination probabilities from w(i), as sketched below.
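The following sketch shows one way the interpolation of Eq. 7 could be implemented for the alignment setting, with the NNLM variant (cosine weights passed through a softmax, described next) included for contrast. It is our own illustrative code under simplifying assumptions: vectors are dense NumPy arrays indexed by term, and names such as higher_order_vectors are not from the paper.

```python
# Illustrative sketch (not the authors' code) of the higher-order construction
# of Eq. 7: each term's vector is replaced by a similarity-weighted combination
# of its k nearest neighbors' vectors. For alignment models the weights are the
# translation probabilities T(term_j | term_i) and the result is L1-normalized;
# for NNLMs the weights are the k highest cosine similarities after a softmax.
import numpy as np

def softmax(x):
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

def higher_order_vectors(vectors, pairwise_sim, k=20, mode="alignment"):
    """vectors: dict term -> np.ndarray; pairwise_sim(t_i, t_j) -> float."""
    terms = list(vectors)
    new_vectors = {}
    for t_i in terms:
        # Score every candidate neighbor, including t_i itself (which has
        # maximal similarity to itself in all models considered here).
        scores = [(pairwise_sim(t_i, t_j), t_j) for t_j in terms]
        scores.sort(key=lambda pair: pair[0], reverse=True)
        top = scores[:k]

        weights = [s for s, _ in top]
        if mode == "nnlm":
            weights = softmax(weights)           # positive, consistent range
        w_hat = sum(w * vectors[t_j] for w, (_, t_j) in zip(weights, top))

        if mode == "alignment":
            w_hat = w_hat / np.abs(w_hat).sum()  # L1 norm: keep a distribution
        new_vectors[t_i] = w_hat
    return new_vectors

# Iterating the construction yields third- and higher-order models, e.g.:
# second = higher_order_vectors(first, sim, k=20, mode="alignment")
# third  = higher_order_vectors(second, sim_from(second), k=20, mode="alignment")
# where sim_from is a hypothetical helper that rebuilds the pairwise similarity
# function from the newly interpolated vectors.
```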
The higher-order models greatly reduce the sparsity of the first-order alignment graph. In a first-order word alignment model (i.e., using only direct evidence) trained on approximately 70,000 question/answer pairs (see §6.1), source words align with an average of 31 destination words with non-zero probability, out of a 103,797-word vocabulary. The second-order model aligns 1,535 destination words to each source word on average; the third-order model, 8,287; and the fourth-order model, 15,736. In the results section, we observe that this sparsity reduction is only helpful up to a certain point; having too many associates for each word reduces performance on the QA task.

Higher-order NNLM: In the NNLM setting, we use the cosine similarity of vectors as interpolation weights and to choose the nearest neighbors:

s_cos(w(i), w(j)) = \frac{w(i) \cdot w(j)}{||w(i)|| \, ||w(j)||}    (9)

We found that applying the softmax function to each term's vector of k-highest s_cos similarities, to ensure all interpolation weights are positive and have a consistent range across terms, improved performance. As such, all higher-order NNLM models use this softmax normalization.

The resulting interpolation can be conceptualized in several ways. Viewing cosine similarity as a representation of entailment (Beltagy et al., 2013), the higher-order NNLM model reflects multiple-hop inference on top of the corresponding association graph, similar to the higher-order alignment model. The interpolation could also be viewed as smoothing term representations in vector space, averaging each term's vector with its nearest neighbors according to their cosine similarity.

Higher-order Hybrid Model: We also implement a hybrid model, which interpolates the alignment distribution vectors, but using the pairwise cosine similarities from the NNLM setting:

\hat{w}_A(i) = \sum_{j \in N_k(w_N(i))} s_cos(w_N(i), w_N(j)) \, w_A(j)

where w_N(i) and w_A(i) are respectively the NNLM vector and alignment vector representations for term i, and s_cos is cosine similarity. In addition, we experimented with the opposite hybrid model, interpolating the NNLM vectors using alignment associate probabilities as weights, but found that it did not perform as well as using NNLM similarities to interpolate alignment vectors. Our conjecture is that the higher-order technique can be viewed as a method of reducing sparsity (either in the vectors or in the context used to create them), and since the NNLM vectors trained on words (as opposed to dependencies) are less sparse, they benefit less from this technique.

6 Experiments

6.1 Data

To test the utility of our approach, we experimented with the QA scenario introduced in §3 using the subset of the Yahoo! Answers corpus introduced by Jansen et al. (2014).10 Yahoo! Answers11 is an open-domain community-generated QA site, with questions and answers that span formal and precise to informal and ambiguous language. This corpus contains 10,000 QA pairs drawn from a corpus of how questions. Each question contains at least four community-generated answers, one of which was voted as the top answer. The number of answers to each question ranged from 4 to over 50, with an average of 9. 50% of QA pairs were used for training, 25% for development, and 25% for test.

10 http://nlp.sista.arizona.edu/releases/acl2014
11 http://answers.yahoo.com

The following additional resources were used:

NNLM Corpus: We generated vector representations for words using the word2vec model of Mikolov et al. (2013), using the skip-gram architecture with hierarchical sampling. Vectors are trained using the entire Gigaword corpus of approximately 4G words.12 We used 200-dimensional vectors (Jansen et al., 2014).
Alignment Corpus: We use a separate partition of QA pairs to train an alignment model between questions and top answers. This corpus of 67,972 QA pairs does not overlap with the collection of 10,000 pairs used for the training, development, and testing partitions, but was chosen according to the same criteria. The alignment model was trained using GIZA++ over five iterations of IBM Model 1. 6.2 Tuning The following hyperparameters were tuned inde- pendently to maximize P@1 on the development partition of the Yahoo! Answers dataset: Number of Vectors Interpolated: We investi- gated the effects of varying k in Eq. 7, i.e., the number of most-similar neighbor vectors interpo- lated when constructing a higher-order model. We experimented with values of k ranging from 1 to 100 and found that the value does not substantially affect results across the second-order word align- ment models. Since vectors used in the higher- order interpolation are weighted by their similar- ity to a vector w, this is likely due to a swift de- crease in values of the pairwise similarity function, s(w, ·), beyond the vectors closest to w. We chose to use k = 20 because it comes from a stabler maximum, and it is sufficiently small to make con- struction of higher-order models feasible in terms of time and memory usage. Base Representation Type: We compared the effects of using either words or lemmas as the base lexical unit for the LS models, and found that words achieved higher P@1 scores in both the alignment and NNLM models on the develop- ment dataset. As such, all results reported here use words for the syntax-independent models, and tu- ples of words for the syntax-driven models. Content Filtering: We investigated using part- of-speech (POS) tags to filter the content consid- 12LDC catalog number LDC2012T21 ered by the lexical similarity models, by excluding certain non-informative classes of words such as determiners. Using POS tags generated by Stan- ford’s CoreNLP (Manning et al., 2014), we fil- tered content to only include nouns, adjectives, verbs, and adverbs for the word-based models, and tuples where both words have one of these four POS tags for the syntax-based models. We found that this increased P@1 scores for all word- based alignment and NNLM models (including the Levy-Goldberg (L-G) model13), but did not improve performance for models that used depen- dency representations.14 Results reported in the remainder of this paper use this POS filtering for all word-based alignment and NNLM models (in- cluding L-G’s) as well as the dependency align- ment model, but not for our dependency NNLM model.15 6.3 Baselines We compare against five baseline models: Random: selects an answer randomly; CR: uses the ranking provided by our shallow an- swer Candidate Retrieval (CR) component; Jansen et al. (2014): the best-performing lexi- cal semantic model reported in Jansen et al. (2014) (row 6 in their Table 1).16 PRU: PageRank (Page et al., 1999) with uni- form teleportation probabilities, constructed over the word alignment graph. Let A be the row- stochastic transition matrix of this graph, where the ith row of A is the vector w(i) (see §4.2). Fol- lowing the PR algorithm, we add small “teleporta- tion” probabilities from a teleportation matrix T 17 to the alignment matrix, producing a PageRank 13We modified the L-G algorithm to ignore any dependen- cies not matching this filtering criterion when building the context for a word. 
14We hypothesize that our filtering criterion for dependen- cies, which requires both head and modifiers to have one of the four POS tags, was too aggressive. We will explore other filtering criteria in future work. 15Although filtered dependencies yielded slightly lower re- sults in our tuning experiments, we use them in the depen- dency alignment experiments because the unfiltered third- order dependency alignment model was too large to fit into memory. 16We do not report the performance of their discourse- based models, because these models use additional resources making them incompatible with the work shown here. 17We construct the teleportation matrix T by setting each row of T to be a uniform probability distribution, i.e., each entry in a row of T is 1/|V |, where |V | is the vocabulary size. 204 matrix P = α∗A + (1 −α) ∗T . Following pre- vious work (Page et al., 1999), we used α = 0.15. Then, we compute the matrix products P2,P3, ..., using rows in these products as vector representa- tions for terms. Thus, the ith row of Pk gives the conditional probabilities of arriving at each term in the graph after k transitions beginning at term i. PRP: Personalized PageRank (Agirre and Soroa, 2009). This method follows the same algorithm as PRU, but uses NNLM similarities of terms to de- fine non-uniform teleportation probabilities (fol- lowing the intuition that teleportation likelihood should be correlated with semantic similarity of terms). In particular, we obtain the ith row of T by computing the pairwise cosine similarities of term i’s NNLM vector with every other term’s vector, and taking the softmax of these similarities to produce a probability distribution, v(i). To ac- count for missing terms (due to the difference in the alignment and NNLM corpora) we smooth all teleportation probabilities by adding a small value, �, to each entry of v(i). The renormalized v(i) then becomes the ith row of T . 6.4 Results Analysis of higher-order models Table 1 shows the performance of first-order and higher-order alignment and NNLM models across representation type. Rows in the table are labeled with the orders of the representations used. W in- dicates a model where the base units are words, while D indicates a model where the base units are dependencies; A denotes an alignment model, while N denotes an NNLM model. Distinct fea- tures are generated for each order representation (§§4.1, 4.2, 5) and used in the SVM model. For example, WN(1-3) uses features constructed from the 1st, 2nd, and 3rd order representations of the word-based NNLM. Combining multiple orders into a single feature space is important, as each or- der captures different information. For example, the top five closest associates for the term dog un- der WN(1) are: puppy, cat, dogs, dachshund, and pooch, whereas the top associates under WN(2) are: mixedbreed, hound, puppy, rottweiler, and groomer. Aggregating features from all orders into a single ranking model guarantees that all this information is jointly modeled. We use the standard implementations for pre- cision at 1 (P@1) and mean reciprocal rank (MRR) (Manning et al., 2008). Several trends are apparent in the table: (i) All first-order lexical models outperform the random and CR baselines, as well as the system of Jansen et al. The latter result is explained by the novel features proposed in §4.1 and §4.2. We detail the contribution of these features in the ab- lation experiment summarized later in Table 4. 
(ii) More importantly, both P@1 and MRR gener- ally increase as we incorporate higher-order ver- sions of each lexical model. Out of the six LS models investigated, five improve under their cor- responding higher-order configurations. This gain is most pronounced for the dependency align- ment (DA) model, whose P@1 increases from 25.9% for the first-order model to 29.4% for a model that incorporates first, second, and third- order features. In general, the performance in- crease in higher-order models is larger for sparser first-order models, where the higher-order mod- els fill in the knowledge gaps of their correspond- ing first-order variants. This sparsity is gener- ated by multiple causes: (a) alignment models are sparser than NNLMs because they were trained on less data (approximately 67K question-answer pairs vs. the entire Gigaword); (b) models that use syntactic dependencies as the base lexical units are sparser than models that use words. We conjec- ture that this is why we see the biggest improve- ment from higher order for DA (which combines both sources of sparsity), and we do not see an improvement for the word-based NNLM (WN). This hypothesis is supported by the analysis of the corresponding association graphs: in the WA model, counting the number of words compris- ing the top 99% of the probability mass distribu- tion for each word shows sparse representations with 10.1 most-similar words on average; in the WN model, the same statistic shows dense dis- tributed representations with the top 99% of mass distributed across 1.2 million most-similar words per word (or, about 99% of the vocabulary). (iii) The higher-order models perform well up to an order of 3 or 4, but not further. For exam- ple, the performance of the word alignment model (WA) decreases at order 4 (row 7); similarly, the performance of the word NNLM (WN) decreases at order 5 (row 12). This supports our initial ob- servation that long paths in the association graph accumulate noise, which leads to semantic drift. (iv) The Levy-Goldberg NNLM (DNL-G) is the best first-order model, whereas our dependency- 205 P@1 MRR # Model P@1 Impr. MRR Impr. Baselines 1 Random 14.29 26.12 2 CR 19.57 – 43.15 – 3 Jansen et al. 
(2014) 26.57 – 49.31 – Word Alignment (WA) 4 CR + WA(1) 27.33 +40% 49.20 +14% 5 CR + WA(1-2) 29.01* +48% 50.87* +18% 6 CR + WA(1-3) 30.49* +56% 51.92* +20% 7 CR + WA(1-4) 29.65* +52% 51.38* +19% Word NNLM (WN) 8 CR + WN(1) 30.69 +57% 52.02 +21% 9 CR + WN(1-2) 29.57 +51% 51.29 +19% 10 CR + WN(1-3) 30.21 +54% 51.75 +20% 11 CR + WN(1-4) 30.41 +55% 51.91 +20% 12 CR + WN(1-5) 30.05 +54% 51.87 +20% Dependency Alignment (DA) 13 CR + DA(1) 25.89 +32% 48.24 +12% 14 CR + DA(1-2) 28.81* +47% 50.50* +17% 15 CR + DA(1-3) 29.41* +50% 51.14* +19% Our Dependency NNLM (DNO) 16 CR + DNO(1) 30.85 +58% 51.93 +20% 17 CR + DNO(1-2) 31.69* +62% 52.35* +21% 18 CR + DNO(1-3) 31.89* +63% 52.43* +22% Levy-Goldberg Dependency NNLM (DNL-G) 19 CR + DNL-G(1) 31.25 +60% 52.60 +22% 20 CR + DNL-G(1-2) 31.57 +61% 52.76 +22% 21 CR + DNL-G(1-3) 31.69 +62% 52.73 +22% Hybrid Model: Word Alignment with NNLM Similarity (WAN) 22 CR + WAN(1-2) 29.37 +50% 51.05 +18% 23 CR + WAN(1-3) 30.45* +56% 51.84* +20% PageRank Model: Uniform Teleportation (PRU) 24 CR + PRU(1) 27.29 +39% 49.21 +14% 25 CR + PRU(1-2) 30.33* +55% 51.74* +20% 26 CR + PRU(1-3) 30.17* +54% 51.87* +20% PageRank Model: Personalized Teleportation (PRP) 27 CR + PRP(1) 27.09 +38% 49.10 +14% 28 CR + PRP(1-2) 31.01* +58% 52.25* +21% 29 CR + PRP(1-3) 29.89* +53% 51.70* +20% Table 1: Overall results on the Yahoo! Answers dataset using first-order representations (1) and first-order in combination with higher-order representations (1-n). Bold font indicates the best score in a given column for each model group. An asterisk indicates that a score is significantly better (p < 0.05) than the first-order version of that model. All significance tests were implemented using one-tailed non-parametric bootstrap resampling using 10,000 iterations. model creation time matrix size WA (1) – 75 MB WA (2) 33 seconds 1.8 GB WA (3) 4.5 minutes 9.7 GB WA (4) 8.6 minutes 19 GB PR (1) – 41 GB PR (2) 45.6 hours 41 GB PR (3) 45.6 hours 41 GB Table 2: Runtime and memory requirements for creation of matrices, for the original PageRank and our cautious random walk algorithms. Creation of both higher-order WA and PR models is trivially parallelizable, and we report runtimes on a 2.7 GHz processor with parallel execution on 10 cores. based NNLM (DNO) is the best higher-order model. This is an encouraging result for DNO, considering its simplicity. In general, NNLMs perform better than alignment models for answer reranking, which is not surprising considering the difference in size of the training datasets. To our knowledge, this is the first time this type of analy- sis has been performed for QA. (v) Both PRU(1-2) and PRP(1-2) perform better than their first-order variants and better than our 206 P@1 MRR # Model/Features P@1 Impr. MRR Impr. Baselines 1 Random 14.29 26.12 2 CR 19.57 – 43.15 – 3 Jansen et al. (2014) 26.57 – 49.31 – Words 4 CR + WA(1) + WN(1) 30.85 +58% 52.24 5 CR + WA(1-2) + WN(1-2) 31.85* +63% 52.86* +23% 6 CR + WA(1-3) + WN(1-3) 32.09* +64% 52.97* +23% 7 CR + WA(1-4) + WN(1-4) 31.69 +62% 52.75 +22% Dependencies 8 CR + DA(1) + DNO(1) + DNL-G(1) 31.49 +61% 52.93 +23% 9 CR + DA(1-2) + DNO(1-2) + DNL-G(1-2) 32.85* +68% 53.73* +25% 10 CR + DA(1-3) + DNO(1-3) + DNL-G(1-3) 32.77* +67% 53.71* +24% Words and Dependencies 11 CR + WA(1) + WN(1) + DA(1) + DNO(1) + DNL-G(1) 31.85 +63% 53.03 +23% 12 CR + WA(1-2) + WN(1-2) + DA(1-2) + DNO(1-2) + DNL-G(1-2) 32.89† +68% 53.69† +24% 13 CR + WA(1-3) + WN(1-3) + DA(1-3) + DNO(1-3) + DNL-G(1-3) 33.01† +69% 53.96† +25% Table 3: Performance on the Yahoo! 
Answers dataset for word, dependency, and models that combine the word and de- pendency representations. Bold font indicates the best score in a given column for each model group. An asterisk indicates that a score is significantly better (p < 0.05) than the first-order version of that model group (1), and † indicates that a score approaches significance (p < 0.10). P@1 MRR Model P@1 Delta MRR Delta CR + WA(1) all features 27.33 – 49.20 – − P(Q|A) 25.69 -6% 48.35 -2% − max JSD 27.33 0% 49.10 0% − min JSD 23.57 -14% 46.64 -5% − composite JSD 27.17 -1% 49.02 0% − average JSD 25.41 -7% 47.70 -3% CR + WN(1) all features 30.69 – 52.02 – − max cosine sim. 29.65 -3% 51.49 -1% − min cosine sim. 29.69 -3% 51.43 -1% − composite cosine sim. 27.01 -12% 49.13 -6% − average cosine sim. 26.49 -14% 49.04 -6% Table 4: Ablation experiments for first-order word models. Each row removes a single feature from the corresponding complete model (“all features”). word-based second-order models (WA and WN). As expected, the personalized PR algorithm per- forms better than PRU. However, their perfor- mance drops for the third-order models (rows 26 and 29), whereas our models continue to improve in their third-order configuration. All our third- order models perform better than all third-order PR models. This is caused by our “cautious” traversal strategy that considers only the closest neighbors during random walks, which controls for semantic drift better than the PR methods. Fur- thermore, the resources required for PR are con- siderably larger than the ones needed to imple- ment our methods. Table 2 summarizes the re- quirements of the PR algorithm compared to our closest method (WA). As the table shows, PR has a runtime that is two orders of magnitude larger than WA’s, and requires four times as much mem- ory to store the generated higher-order matrices. Combining NNLMs and alignment models We explore combinations of NNLMs and align- ment models in Table 3, which lists results when higher-order NNLMs and alignment models are combined for a given representation type. Each line in the table corresponds to a single an- swer reranking model, which incorporates features from multiple LS models. These experiments re- inforce our previous observations: higher-order models perform better than their first-order vari- ants regardless of representation type, but perfor- mance increases only for relatively low orders, e.g., 3 orders for words and words combined with dependencies, or 2 orders for dependencies. Importantly, the table shows that the combina- tion of NNLMs and alignment models across rep- resentation types is beneficial. For example, the best first-order combined model (row 11 in Ta- ble 3) performs similarly to the best higher-order individual LS model (row 18 in Table 1). The best combined higher-order model (row 13 in Ta- ble 3) has a P@1 score of 33.01, approximately 1.2% higher than the best individual LS model (row 18 in Table 1). To our knowledge, this is the 207 highest performance reported on this Yahoo! An- swers corpus, surpassing the best model of Jansen et. al (2014), which incorporates both shallow and deep discourse information, and achieves 32.9 P@1. Feature ablation experiments Although the main focus of this work is on higher- order LS models, Table 1 shows that even our first-order models perform better than the previous state of the art. This is caused by the novel features proposed over alignment models and NNLMs. 
To understand the contribution of each of these fea- tures, we performed an ablation experiment, sum- marized in Table 4, for two models: one alignment model (WA), and one NNLM (WN). This analysis indicates that many of the features proposed are important. For example, two of the novel JSD fea- tures for WA have a higher contribution to over- all QA performance than the alignment probability (P(Q|A)) proposed in previous work (Surdeanu et al., 2011). For WN, the two new features pro- posed here (maximum and minimum cosine sim- ilarity between embedding vectors) also have a positive contribution to overall performance, but less than the other two features proposed in previ- ous work (Yih et al., 2013; Jansen et al., 2014). 6.5 Discussion One important observation yielded by the previ- ous empirical analysis is that higher-order models perform well for sparse first-order variants (e.g., DA), but not for first-order models that rely on already dense association graphs (e.g., WN). We suspect that in densely populated graphs, model- ing context becomes critical for successful higher- order models. Returning to our example from §1, a densely populated graph may already have the associations breakfast – pancakes and pancakes – hashbrowns that allow it to identify restaurants with favorable breakfast foods. Exploring higher- order associations in this situation is only ben- eficial if context is carefully maintained, other- wise an answer reranking system may erroneously select answer candidates with different contexts due to accumulated semantic drift (e.g., answers that discuss films, using associations from texts reviewing Breakfast at Tiffany’s). Incorporating word-sense disambiguation or topic modeling into alignment or NNLM models may begin to address these contextual issues by preferentially associat- ing terms within a given topic (such as restau- rants or films), ultimately reducing semantic drift, and extending these higher-order methods beyond a few hops in the graph. Our analysis suggests that the extra sparsity that contextually-dependent representations introduce may make those models even more amenable to the higher-order methods discussed here. More generally, we believe that graph-based in- ference provides a robust but approximate middle- ground to inference for QA. Where inference us- ing first-order logic offers provably-correct an- swers but is relatively brittle (see Ch. 1 in Mac- Cartney (2009)), LS methods offer robustness, but have lacked explanatory power and the ability to connect knowledge into short inference chains. Here we have demonstrated that higher-order methods capitalize on indirect evidence gathered by connecting multiple direct associations, and in doing so significantly increase performance over LS approaches that use direct evidence alone. By incorporating contextual information and making use of the short inference chains generated by traversing these graphical models, we hypothe- size that graph-based systems will soon be able to construct simple justifications for their answer se- lection (e.g., pancakes are a breakfast food, and hashbrowns go well with pancakes). We hope it will soon be possible to fill this gap in inference methods for QA with higher-order LS models for robust but approximate inference. 7 Conclusions We introduce a higher-order formalism that al- lows lexical semantic models to capitalize on both direct and indirect evidence. 
We demon- strate that many lexical semantic models, includ- ing monolingual alignment and neural network language models, working over surface or syntac- tic representations, can be trivially adapted to this higher-order formalism. Using a corpus of thou- sands of non-factoid how questions, we experi- mentally demonstrate that higher-order methods perform better than their first-order variants for most lexical semantic models investigated, with statistically-significant relative gains of up to 13% over the corresponding first order models. Acknowledgements We thank the Allen Institute for Artificial Intelli- gence for funding this work. We would also like to thank the three anonymous reviewers for their helpful comments and suggestions. 208 References Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Pro- ceedings of the 12th Conference of the European Chapter of the Association for Computational Lin- guistics. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Islam Beltagy, Cuong Chau, Gemma Boleda, Dan Gar- rette, Katrin Erk, and Raymond Mooney. 2013. Montague meets Markov: Deep semantics with probabilistic logical form. In Proceedings of the Second Joint Conference on Lexical and Computa- tional Semantics. Adam Berger, Rich Caruana, David Cohn, Dayne Frey- tag, and Vibhu Mittal. 2000. Bridging the lexical chasm: Statistical approaches to answer finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research & Development on Information Retrieval. Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and Shashank Sudarshan. 2002. Keyword searching and browsing in databases using BANKS. In Proceedings of 18th International Conference on Data Engineering (ICDE). Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311. Soumen Chakrabarti and Alekh Agarwal. 2006. Learning parameters in entity relationship graphs from ranking preferences. In Proceedings of 10th European Conference on Principle and Practice of Knowledge Discovery in Databases (PKDD). Soumen Chakrabarti. 2007. Dynamic personalized PageRank in entity-relation graphs. In Proceedings of the 16th International World Wide Web Confer- ence (WWW). Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Abdessamad Echihabi and Daniel Marcu. 2003. A noisy-channel approach to question answering. In Proceedings of the 41st Annual Meeting of the Asso- ciation for Computational Linguistics (ACL). Oren Etzioni. 2011. Search needs a shake-up. Nature, 476(7358):25–26. Peter Jansen, Mihai Surdeanu, and Peter Clark. 2014. Discourse complements lexical semantics for non- factoid answer reranking. In Proceedings of the 52nd Annual Meeting of the Association for Com- putational Linguistics (ACL). Ni Lao and William W. Cohen. 2010. Relational re- trieval using a combination of path-constrained ran- dom walks. Machine Learning, 81(1):53–67. Ni Lao, Tom Mitchell, and William W. Cohen. 2011. 
Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP). Heeyoung Lee, Marta Recasens, Angel Chang, Mi- hai Surdeanu, and Dan Jurafsky. 2012. Joint en- tity and event coreference resolution across docu- ments. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Process- ing and Computational Natural Language Learning (EMNLP-CoNLL). Omer Levy and Yoav Goldberg. 2014a. Dependency- based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computa- tional Linguistics (ACL). Omer Levy and Yoav Goldberg. 2014b. Linguistic reg- ularities in sparse and explicit word representations. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL). Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Lin- guistics and the 17th International Conference on Computational Linguistics (COLING-ACL-1998). Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David Mc- Closky. 2014. The Stanford CoreNLP natural lan- guage processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Com- putational Linguistics (ACL). Tara McIntosh. 2010. Reducing Semantic Drift in Biomedical Lexicon Bootstrapping. Ph.D. thesis, University of Sydney. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word repre- sentations in vector space. In Proceedings of the International Conference on Learning Representa- tions (ICLR). Einat Minkov, William W. Cohen, and Andrew Y. Ng. 2006. Contextual search and name disambiguation in email using graphs. In Proceedings of SIGIR. 209 Vanessa Murdock and W. Bruce Croft. 2005. A trans- lation model for sentence retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Pro- cessing, pages 684–691. Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Pro- ceedings of the Joint Workshop on Automatic Knowl- edge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Compu- tational Linguistics. Franz Josef Och and Hermann Ney. 2003. A sys- tematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Re- port 1999-66, Stanford InfoLab, November. Stefan Riezler, Alexander Vasserman, Ioannis Tsochantaridis, Vibhu Mittal, and Yi Liu. 2007. Statistical machine translation for query expansion in answer retrieval. In Proceedings of the 45th An- nual Meeting of the Association for Computational Linguistics (ACL). Radu Soricut and Eric Brill. 2006. Automatic question answering using the web: Beyond the factoid. Jour- nal of Information Retrieval - Special Issue on Web Information Retrieval, 9(2):191–206. Md. Arafat Sultan, Steven Bethard, and Tamara Sum- ner. 2014. Back to basics for monolingual align- ment: Exploiting word similarity and contextual ev- idence. 
Transactions of the Association for Compu- tational Linguistics, 2:219–230. Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2011. Learning to rank answers to non- factoid questions from web collections. Computa- tional Linguistics, 37(2):351–383. Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast random walk with restart and its applica- tions. In Proceedings of the 6th International Con- ference on Data Mining (ICDM). Xin-Jing Wang, Xudong Tu, Dan Feng, and Lei Zhang. 2009. Ranking community answers by modeling question-answer relationships via analogical reason- ing. In Proceedings of the Annual ACM SIGIR Con- ference. Max Whitney and Anoop Sarkar. 2012. Bootstrap- ping via graph propagation. In Proceedings of the 50th Annual Meeting of the Association for Compu- tational Linguistics. Xuchen Yao, Benjamin Van Durme, Chris Callison- Burch, and Peter Clark. 2013. Semi-Markov phrase-based monolingual alignment. In Proceed- ings of the Conference on Empirical Methods in Nat- ural Language Processing (EMNLP). David Yarowsky. 1995. Unsupervised word sense dis- ambiguation rivaling supervised methods. In Pro- ceedings of the 33rd Annual Meeting of the Associa- tion for Computational Linguistics (ACL). Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Benat Zapirain, Eneko Agirre, Lluis Marquez, and Mi- hai Surdeanu. 2013. Selectional preferences for se- mantic role classification. Computational Linguis- tics, 39(3). 210