Learning Distributed Representations of Texts and Entities from Knowledge Base

Ikuya Yamada (1,4)   Hiroyuki Shindo (2)   Hideaki Takeda (3)   Yoshiyasu Takefuji (4)
ikuya@ousia.jp   shindo@is.naist.jp   takeda@nii.ac.jp   takefuji@sfc.keio.ac.jp
(1) Studio Ousia, Japan   (2) Nara Institute of Science and Technology, Japan
(3) National Institute of Informatics, Japan   (4) Keio University, Japan

Abstract

We describe a neural network model that jointly learns distributed representations of texts and knowledge base (KB) entities. Given a text in the KB, we train our proposed model to predict the entities that are relevant to the text. Our model is designed to be generic, with the ability to address various NLP tasks with ease. We train the model using a large corpus of texts and their entity annotations extracted from Wikipedia. We evaluated the model on three important NLP tasks (i.e., semantic textual similarity, entity linking, and factoid question answering) involving both unsupervised and supervised settings, and achieved state-of-the-art results on all three tasks. Our code and trained models are publicly available for further academic research at https://github.com/studio-ousia/ntee.

1 Introduction

Methods capable of learning distributed representations of arbitrary-length texts (i.e., fixed-length continuous vectors that encode the semantics of texts), such as sentences and paragraphs, have recently attracted considerable attention (Le and Mikolov, 2014; Kiros et al., 2015; Li et al., 2015; Wieting et al., 2016; Hill et al., 2016b; Kenter et al., 2016). These methods aim to learn generic representations that are useful across domains, similar to word embedding methods such as Word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014).

Another interesting approach is learning distributed representations of entities in a knowledge base (KB) such as Wikipedia or Freebase. These methods encode information about KB entities into a continuous vector space and have been shown to be effective for various KB-related tasks such as entity search (Hu et al., 2015), entity linking (Hu et al., 2015; Yamada et al., 2016), and link prediction (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015).

In this paper, we describe a novel method that bridges these two approaches. In particular, we propose the Neural Text-Entity Encoder (NTEE), a neural network model that jointly learns distributed representations of texts (i.e., sentences and paragraphs) and KB entities. For every text in the KB, our model aims to predict its relevant entities, and places the text and the relevant entities close to each other in a continuous vector space. We use human-edited entity annotations obtained from Wikipedia (see Table 1) as supervised data of the entities relevant to the texts containing these annotations. (Entity annotations in Wikipedia can be viewed as supervised data of relevant entities because Wikipedia instructs its contributors to create annotations only where they are relevant; see its manual of style: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style.)

KB entities have conventionally been used to model the semantics of texts. A representative example is Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007), which represents the semantics of a text using a sparse vector space in which each dimension corresponds to the relevance score of the text to an entity. Essentially, ESA shows that a text can be accurately represented using a small set of its relevant entities.
Based on this fact, we hypothesize that the annotations of relevant entities can be used as supervised data for learning text representations. Furthermore, placing texts and entities in the same vector space enables us to easily compute the similarity between texts and entities, which can be beneficial for various KB-related tasks.

In order to test this hypothesis, we conduct three experiments involving both unsupervised and supervised tasks. First, we use standard semantic textual similarity datasets to evaluate the quality of the learned text representations of our method in an unsupervised fashion. Our method clearly outperformed the state-of-the-art methods. Furthermore, to test the effectiveness of our method on KB-related tasks, we address two important problems in the supervised setting: entity linking (EL) and factoid question answering (QA). In both tasks, we adopt a simple multi-layer perceptron (MLP) classifier with the learned representations as features. We tested our method using two standard datasets (i.e., CoNLL 2003 and TAC 2010) for the EL task and a popular factoid QA dataset based on the quiz bowl trivia game for the factoid QA task. Our method outperformed recent state-of-the-art methods on both the EL and the factoid QA tasks.

Additionally, methods that map words and entities into the same continuous vector space have also been proposed (Wang et al., 2014; Yamada et al., 2016; Fang et al., 2016). Our work differs from these works because we aim to map texts (i.e., sentences and paragraphs) and entities into the same vector space.

Our contributions are summarized as follows:

• We propose a neural network model that jointly learns vector representations of texts and KB entities. We train the model using a large amount of entity annotations extracted directly from Wikipedia.

• We demonstrate that our proposed representations are surprisingly effective for various NLP tasks. In particular, we apply the proposed model to three different NLP tasks, namely semantic textual similarity, entity linking, and factoid question answering, and achieve state-of-the-art results on all three tasks.

• We release our code and trained models to the community at https://github.com/studio-ousia/ntee to facilitate further academic research.

Table 1: An example of a sentence with entity annotations.
  Sentence: The Lord of the Rings is an epic high-fantasy novel written by English author J. R. R. Tolkien.
  Entity annotations: The Lord of the Rings, Epic (genre), High fantasy, J. R. R. Tolkien

2 Our Approach

In this section, we present our approach for learning distributed representations of texts and entities in a KB.

2.1 Model

Given a text t (a sequence of words w_1, ..., w_N), we train our model to predict the entities e_1, ..., e_n that appear in t.
Formally, the probability that an entity e appears in t is defined by the following softmax function:

    P(e|t) = \frac{\exp(v_e^\top v_t)}{\sum_{e' \in E_{KB}} \exp(v_{e'}^\top v_t)},    (1)

where E_KB is the set of all entities in the KB, and v_e \in R^d and v_t \in R^d are the vector representations of the entity e and the text t, respectively.

We compute v_t as the element-wise sum of the word vectors in t with L2 normalization, followed by a fully connected layer. Letting v_s denote the sum of the word vectors (v_s = \sum_{i=1}^{N} v_{w_i}), v_t is computed as

    v_t = W \frac{v_s}{\|v_s\|} + b,    (2)

where W \in R^{d \times d} is a weight matrix and b \in R^d is a bias vector. We initialize v_w and v_e using the pre-trained representations described in the next section.

The loss function of our model is defined as

    L = - \sum_{(t, E_t) \in \Gamma} \sum_{e \in E_t} \log P(e|t),    (3)

where \Gamma denotes a set of pairs, each consisting of a text t and its entity annotations E_t in the KB.

One problem in training our model is that the denominator in Eq. (1) is computationally very expensive because it involves a summation over all entities in the KB. We address this problem by replacing E_KB in Eq. (1) with E*, the union of the positive entity e and k randomly chosen negative entities that do not appear in t. This method can be viewed as negative sampling (Mikolov et al., 2013b) with a uniform negative distribution.

In addition, because the length of a text t is arbitrary in our model, we test the following two settings: t as a paragraph and t as a sentence (we use the open-source Apache OpenNLP to detect sentences).

2.2 Parameters

The parameters learned by our model are the vector representations of the words and entities in our vocabulary V, the weight matrix W, and the bias vector b. Consequently, the total number of parameters is |V| × d + d^2 + d.

We initialize the representations of words and entities using pre-trained representations to reduce the training time. We use the skip-gram model of Word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) with negative sampling, trained on Wikipedia articles. To create a corpus for the skip-gram model from Wikipedia, we simply replace the name of each entity annotation in the Wikipedia articles with the unique identifier of the entity the annotation refers to. This simple method enables us to train distributed representations of words and entities simultaneously. We used a Wikipedia dump generated in July 2016 (downloaded from Wikimedia Downloads: https://dumps.wikimedia.org/). For the hyper-parameters of the skip-gram model, we used standard settings, namely a context window size of 10 and 5 negative samples, and we used the Python Word2vec implementation in Gensim (https://radimrehurek.com/gensim/). Additionally, the entity representations were normalized to unit length before they were used as the pre-trained representations.
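To make the model concrete, the following is a minimal NumPy sketch of the text encoder in Eq. (2) and the negative-sampling approximation of Eqs. (1) and (3). It is our own illustration under assumed names (ntee_text_vector, ntee_neg_loss), not the released Theano implementation.

```python
import numpy as np

def ntee_text_vector(word_vecs, W, b):
    """Eq. (2): L2-normalize the element-wise sum of the word vectors of t,
    then apply a fully connected layer."""
    v_s = word_vecs.sum(axis=0)
    return W.dot(v_s / np.linalg.norm(v_s)) + b

def ntee_neg_loss(v_t, pos_entity_vec, neg_entity_vecs):
    """Negative-sampling version of Eqs. (1) and (3) for one (text, entity) pair:
    a softmax over the positive entity and k uniformly sampled negative entities."""
    candidates = np.vstack([pos_entity_vec[None, :], neg_entity_vecs])  # (k + 1, d)
    scores = candidates.dot(v_t)          # unnormalized log-likelihoods v_e . v_t
    scores -= scores.max()                # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[0]                  # -log P(e|t); the positive entity is row 0

# Toy usage with random parameters (the paper uses d = 300 and k = 30).
d, N, k = 300, 12, 30
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(d, d)), np.zeros(d)
v_t = ntee_text_vector(rng.normal(size=(N, d)), W, b)
loss = ntee_neg_loss(v_t, rng.normal(size=d), rng.normal(size=(k, d)))
```

In the full model, this per-annotation loss is summed over all (text, entity) pairs in Γ and minimized with SGD whose learning rate is controlled by RMSprop, as described in Section 2.4.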
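The pre-training step just described can be approximated with Gensim as sketched below. The replace_anchors helper, the anchor format, and the file name are assumptions made for illustration (real preprocessing has to parse Wikipedia markup); only the skip-gram setting, the window size of 10, the 5 negative samples, and the dimensionality of 300 come from the paper. Depending on the Gensim version, the dimension argument is vector_size (4.x) or size (3.x).

```python
import re
from gensim.models import Word2Vec

def replace_anchors(text):
    """Hypothetical preprocessing: rewrite each anchor '[[Entity name|surface]]' (or '[[Entity name]]')
    as a single token 'ENTITY/Entity_name' so that words and entities share one vocabulary."""
    return re.sub(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]",
                  lambda m: "ENTITY/" + m.group(1).strip().replace(" ", "_"),
                  text)

# One pre-extracted article or sentence per line of an assumed dump file.
sentences = [replace_anchors(line).split() for line in open("wiki_with_anchors.txt")]

model = Word2Vec(
    sentences,
    vector_size=300,  # `size=300` on Gensim 3.x
    sg=1,             # skip-gram
    negative=5,       # 5 negative samples
    window=10,        # context window size of 10
    min_count=5,      # assumed frequency cutoff
    workers=8,
)
word_vec = model.wv["tennis"]                          # word representation
entity_vec = model.wv["ENTITY/The_Lord_of_the_Rings"]  # entity representation (if in the vocabulary)
```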
2.3 Corpus

We trained our model using the English DBpedia abstract corpus (Brümmer et al., 2016), an open corpus of Wikipedia texts with entity annotations, extracted from the first introductory sections of 4.4 million Wikipedia articles. The corpus also includes annotations generated using heuristics; we did not use these pseudo-annotations and used only the entity annotations created manually by Wikipedia contributors. We train our model by iterating over the texts and their entity annotations in the corpus.

We used words that appear five times or more and entities that appear three times or more in the corpus, and simply ignored the other words and entities. As a result, our vocabulary V consisted of 705,168 words and 957,207 entities. The numbers of valid words and entity annotations were approximately 382 million and 28 million, respectively.

Additionally, we introduce one heuristic method to generate entity annotations. For each text, we add a pseudo-annotation that points to the entity whose KB page is the source of the text. Because every KB page describes its corresponding entity, it typically contains many mentions referring to that entity. However, because hyper-linking a page to itself does not make sense, these mentions cannot be observed as annotations in Wikipedia. We therefore use this heuristic to address the problem.

2.4 Other Details

Our model has several hyper-parameters. Following Kenter et al. (2016), the number of dimensions was d = 300. The mini-batch size was fixed at 100, the number of negative samples k was set to 30, and the training consisted of one epoch.

The model was implemented using Python and Theano (Theano Development Team, 2016). Training took approximately six days using an NVIDIA K80 GPU. We trained the model using stochastic gradient descent (SGD), with the learning rate controlled by RMSprop (Tieleman and Hinton, 2012).

3 Experiments

In order to evaluate the model presented in the previous section, we conduct experiments on three important NLP tasks using the representations learned by our model. First, we conduct an experiment on a semantic textual similarity task to evaluate the quality of the learned text representations. Next, we conduct experiments on two important NLP problems (i.e., EL and factoid QA) to test the effectiveness of our proposed representations as features for downstream NLP tasks. Finally, we qualitatively analyze the learned representations. We describe how we address each task using our representations in the corresponding subsection of each experiment.

3.1 Semantic Textual Similarity

Semantic textual similarity tests how well a model reflects human judgments of the semantic similarity between sentence pairs. The task has been used as a standard method for evaluating the quality of distributed representations of sentences in past work (Kiros et al., 2015; Hill et al., 2016a; Kenter et al., 2016).

3.1.1 Setup

Our experimental setup follows that of a previously published experiment (Hill et al., 2016a). We use two standard datasets: (1) the STS 2014 dataset (Agirre et al., 2014), consisting of 3,750 sentence pairs with human ratings from six different sources (e.g., newswire, web forums, dictionary glosses), and (2) the SICK dataset (Marelli et al., 2014), consisting of 10,000 sentence pairs with human ratings. In both datasets, the ratings take values between 1 and 5, where a rating of 1 indicates that the sentence pair is not related and a rating of 5 means that the pair is highly related. All sentence pairs except the 500 SICK trial pairs were used in our experiments.

We train our model with both paragraphs and sentences. Further, we introduce another training setting (denoted by fixed NTEE), in which the word representations and the entity representations are fixed throughout training.

We compute the cosine similarity between the vectors of the two sentences in each pair (derived using Eq. (2)) and measure the Pearson's r and Spearman's ρ correlations between these scores and the gold-standard human ratings. We use Pearson's r as our primary score.
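This unsupervised evaluation reduces to a cosine similarity followed by two correlation measures. A short sketch, assuming the sentence vectors have already been computed with Eq. (2) (the function names are ours, not part of the released code):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(u, v):
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_sts(vecs_a, vecs_b, gold_ratings):
    """vecs_a, vecs_b: one sentence vector per pair; gold_ratings: human scores in [1, 5].
    Returns (Pearson's r, Spearman's rho) between cosine similarities and the gold ratings."""
    sims = [cosine(a, b) for a, b in zip(vecs_a, vecs_b)]
    r, _ = pearsonr(sims, gold_ratings)
    rho, _ = spearmanr(sims, gold_ratings)
    return r, rho
```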
3.1.2 Baselines

As baselines for this experiment, we selected the following recent state-of-the-art models:

• Word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) is a popular word embedding model. We compute a sentence representation by element-wise addition of the vectors of its words (Mitchell and Lapata, 2008). We add its skip-gram and CBOW variants to our baselines and train them with the hyper-parameters and the Wikipedia corpus explained in Section 2.2; the skip-gram model is thus equivalent to the pre-trained representations used in our model. Furthermore, in order to conduct a fair comparison between the skip-gram model and our model, we also add skip-gram (plain), a skip-gram model trained on a different corpus: the Wikipedia corpus augmented by appending the texts of the DBpedia abstract corpus, with entity annotations treated as regular text phrases rather than replaced with their unique identifiers.

• Skip-thought (Kiros et al., 2015) is a model trained to predict the adjacent sentences of each sentence in a corpus. Sentences are encoded using a recurrent neural network (RNN) with gated recurrent units (GRUs).

• Siamese CBOW (Kenter et al., 2016) is a model that aims to predict sentences occurring next to each other in a corpus. A sentence representation is derived as the vector average of the words in the sentence.

We obtain the score of a sentence pair as the cosine similarity between the sentence representations of the pair.

Table 2: Pearson's r and Spearman's ρ correlations of our models and the state-of-the-art models on the semantic textual similarity task (each cell shows r/ρ). The first six columns are the STS 2014 sources; the last column is SICK.

Name                    News     Forum    OnWN     Twitter  Images   Headlines  SICK
NTEE (sentence)         .74/.68  .56/.55  .72/.74  .75/.66  .82/.77  .69/.63    .71/.60
NTEE (paragraph)        .74/.68  .52/.51  .66/.69  .74/.66  .77/.72  .68/.61    .69/.61
Fixed NTEE (sentence)   .72/.69  .47/.46  .75/.78  .74/.67  .78/.74  .65/.61    .73/.61
Fixed NTEE (paragraph)  .72/.69  .47/.47  .75/.78  .73/.67  .77/.74  .65/.61    .72/.61
Skip-gram               .65/.67  .36/.39  .62/.69  .65/.66  .54/.56  .62/.60    .66/.58
Skip-gram (plain)       .63/.65  .36/.39  .61/.69  .62/.62  .56/.57  .60/.58    .66/.58
CBOW                    .58/.59  .35/.36  .57/.64  .70/.68  .54/.55  .57/.53    .61/.58
Skip-thought            .45/.44  .15/.14  .34/.39  .43/.42  .60/.55  .44/.43    .60/.57
Siamese CBOW            .59/.58  .41/.42  .61/.66  .73/.71  .65/.65  .64/.63    -

3.1.3 Results

Table 2 shows our experimental results together with the baseline methods. We obtained the scores of Skip-thought from Hill et al. (2016a) and those of Siamese CBOW from Kenter et al. (2016).

Our NTEE models outperformed the state-of-the-art models on all datasets in terms of Pearson's r. Moreover, our fixed NTEE models outperformed the NTEE models on several datasets and the skip-gram models on all datasets. Further, our model trained with sentences consistently outperformed the model trained with paragraphs. Additionally, the skip-gram models performed mostly similarly regardless of the difference in their corpora.
Note that, because we fix the word representations and the entity representations during the training of the fixed NTEE models, the only difference between the fixed NTEE models and the skip-gram model is the learned fully connected layer. Because our model places a text representation and the representations of its relevant entities close to each other, this layer can be seen as an affine transformation from the word-based text representation to the entity-based text representation. We believe the fixed NTEE model performed well across datasets because the entity-based text representations are more semantic (less syntactic) and contain less noise than the word-based text representations, and are thus better suited to this task.

3.2 Entity Linking

Entity linking (EL) (Cucerzan, 2007; Mihalcea and Csomai, 2007; Milne and Witten, 2008; Ratinov et al., 2011; Hajishirzi et al., 2013; Ling et al., 2015) is the task of resolving ambiguous mentions of entities to their referent entities in a KB. EL has recently received considerable attention because of its effectiveness in various NLP tasks such as information extraction and semantic search. The task is challenging because of the ambiguity of entity mentions (e.g., "Washington" can refer to the state, the capital of the US, the first US president George Washington, and so forth).

The key to improving the performance of EL is to accurately model the semantic context of entity mentions. Because our model learns the likelihood of an entity appearing in a given text, it can naturally be used to model the context in EL.

3.2.1 Setup

Our experimental setup follows past work (Chisholm and Hachey, 2015; He et al., 2013; Yamada et al., 2016). We use two standard datasets: the CoNLL dataset and the TAC 2010 dataset. The CoNLL dataset, proposed in Hoffart et al. (2011), includes training, development, and test sets consisting of 946, 216, and 231 documents, respectively. We use the training set to train our EL method and the test set to measure its performance. We report the standard micro-accuracy (aggregated over all mentions) and macro-accuracy (aggregated over all documents) of the top-ranked candidate entities.

The TAC 2010 dataset was constructed for the Text Analysis Conference (TAC; http://www.nist.gov/tac/) (Ji et al., 2010). It comprises training and test sets containing 1,043 and 1,013 documents, respectively. We use only mentions with a valid entry in the KB and report the micro-accuracy of the top-ranked candidate entities. We evaluate our method on the 1,020 such mentions contained in the test set. Further, we randomly select 10% of the documents from the training set and use them as a development set.

Additionally, we collected two measures that have frequently been used in past EL work: entity popularity and prior probability. The entity popularity of an entity e is defined as log(|A_{e,*}| + 1), where A_{e,*} is the set of KB anchors that point to e. The prior probability of a mention m referring to an entity e is defined as |A_{e,m}| / |A_{*,m}|, where A_{*,m} is the set of all KB anchors with the same surface as m, and A_{e,m} is the subset of A_{*,m} that points to e. Both measures were collected directly from the Wikipedia dump described in Section 2.2.
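Both statistics are simple counts over Wikipedia anchor links. A small sketch of how they could be computed, assuming anchors is an iterable of (surface, entity) pairs extracted from the dump (the function name and data layout are ours):

```python
import math
from collections import Counter, defaultdict

def anchor_statistics(anchors):
    """anchors: iterable of (surface, entity) pairs taken from Wikipedia anchor links.
    Returns the entity popularity log(|A_{e,*}| + 1) and the prior probability |A_{e,m}| / |A_{*,m}|."""
    entity_counts = Counter()                     # |A_{e,*}|
    surface_entity_counts = defaultdict(Counter)  # |A_{e,m}|, indexed by surface m
    for surface, entity in anchors:
        entity_counts[entity] += 1
        surface_entity_counts[surface][entity] += 1

    popularity = {e: math.log(c + 1) for e, c in entity_counts.items()}
    prior = {
        (m, e): c / sum(counts.values())          # |A_{e,m}| / |A_{*,m}|
        for m, counts in surface_entity_counts.items()
        for e, c in counts.items()
    }
    return popularity, prior

# Example: prior[("washington", "George Washington")] is the fraction of anchors with the
# surface "washington" that link to the George Washington page.
```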
3.2.2 Our Method

Following past work, we address the EL task by solving two sub-tasks: candidate generation and mention disambiguation.

Candidate Generation. In candidate generation, candidate referent entities are generated for each mention. We use the candidate generation method proposed in Yamada et al. (2016) for compatibility with their state-of-the-art results. In particular, we use the public dataset proposed in Pershina et al. (2015) for the CoNLL dataset. For the TAC 2010 dataset, we use a dictionary built directly from the Wikipedia dump explained in Section 2.2. We retrieved the possible mention surfaces of an entity from (1) the title of the entity, (2) the titles of other entities redirecting to the entity, and (3) the names of anchors that point to the entity. Furthermore, to improve recall, we also tokenize the title of each entity and treat the resulting tokens as possible mention surfaces of that entity. We sort the entity candidates by their entity popularity and retain the top 100 candidates for computational efficiency. The recall of the candidate generation was 99.9% and 94.6% on the test sets of the CoNLL and TAC 2010 datasets, respectively.

Mention Disambiguation. We address the mention disambiguation task using a multi-layer perceptron (MLP) with a single hidden layer. Figure 1 shows the architecture of our neural network model.

[Figure 1: Architecture of our neural network for the EL and QA tasks.]

The model selects an entity from among the entity candidates of each mention m in a document t. For each entity candidate e, we input the vector of the entity v_e (normalized to unit length because of its overall higher accuracy), the vector of the document v_t (computed with Eq. (2)), the dot product of v_e and v_t, and the small number of EL features described below. The dot product represents the unnormalized likelihood that e appears in t (see Eq. (1)); we also tested the cosine similarity instead of the dot product, but it slightly degraded the performance in both the EL task and the factoid QA task described below. On top of these features, we stack a hidden layer with rectified linear unit (ReLU) nonlinearity and dropout. We then add an output layer onto the hidden layer and select the most relevant entity using a softmax over the entity candidates.

Similar to past work (Chisholm and Hachey, 2015; Yamada et al., 2016), we include a small number of features in our model. First, we use the following three standard EL features: the entity popularity of e, the prior probability of m referring to e, and the maximum prior probability of e over all mentions in t. In addition, we optionally add features representing string similarities between the title of e and the surface of m (Meij et al., 2012; Yamada et al., 2016). These indicate whether the title of e exactly equals or contains the surface of m, and whether the title of e starts or ends with the surface of m.

We tuned two hyper-parameters using the micro-accuracy on the development set of each dataset: the number of units in the hidden layer and the dropout probability. The resulting values are listed in Table 3.

Further, we trained the model using stochastic gradient descent (SGD). The learning rate was controlled by RMSprop, and the mini-batch size was set to 100. We also used the micro-accuracy on the development set to locate the best epoch for testing.
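As a rough illustration of this candidate ranker, the sketch below builds the per-candidate feature vector and scores the candidates of one mention with a single ReLU hidden layer and a softmax. It is our own minimal NumPy rendering (weight initialization, dropout, and the training loop are omitted), not the released Theano model.

```python
import numpy as np

def candidate_features(v_e, v_t, el_feats):
    """Per-candidate input: unit-normalized entity vector, document vector from Eq. (2),
    their dot product, and the handcrafted EL features (popularity, priors, string similarities)."""
    v_e = v_e / np.linalg.norm(v_e)
    return np.concatenate([v_e, v_t, [v_e.dot(v_t)], el_feats])

def rank_candidates(feature_rows, W_h, b_h, w_o, b_o):
    """Single ReLU hidden layer followed by a softmax over the entity candidates of one mention."""
    X = np.stack(feature_rows)                   # (num_candidates, feature_dim)
    hidden = np.maximum(0.0, X.dot(W_h) + b_h)   # ReLU; dropout would be applied here during training
    scores = hidden.dot(w_o) + b_o               # one logit per candidate
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                       # probability of each candidate being the referent

# Shapes only, for up to 100 candidates per mention, d = 300, and (say) 7 handcrafted features:
# feature_dim = 300 + 300 + 1 + 7 = 608; W_h: (608, hidden_units); w_o: (hidden_units,),
# where hidden_units is taken from Table 3 (e.g., 2,000 for CoNLL with the NTEE model).
```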
We tested the NTEE model and the fixed NTEE model for initializing the parameters of the representations v_t and v_e. Furthermore, we also tested two simple methods based on the pre-trained representations (i.e., skip-gram). In the first, the representations of words and entities are initialized using the pre-trained representations presented in Section 2.2, and the other parameters are initialized randomly (denoted by SG-proj). The second is the same as SG-proj except that the training corpus of the pre-trained representations is augmented with the DBpedia abstract corpus (denoted by SG-proj-dbp); we augmented the corpus by simply concatenating the Wikipedia corpus and the DBpedia abstract corpus and, as with the Wikipedia corpus, replaced each entity annotation in the DBpedia abstract corpus with the unique identifier of the entity it refers to.

For the NTEE and fixed NTEE models, sentences (rather than paragraphs) were used to train the proposed representations because of their superior performance on both the CoNLL and TAC 2010 datasets. Further, we did not update the representations of words (v_w) and entities (v_e) during the training of our EL method, because updating them did not generally improve performance. Additionally, we used a vector filled with zeros as the representation of entities not contained in our vocabulary.

Table 3: Hyper-parameters used for the EL and QA tasks. Hidden units is the number of units in the hidden layer, and dropout is the dropout probability.

Dataset (model)             Hidden units  Dropout
EL:
  CoNLL (NTEE)              2,000         0.3
  CoNLL (NTEE w/o strsim)   3,000         0.8
  CoNLL (Fixed NTEE)        5,000         0.0
  CoNLL (SG-proj)           2,000         0.9
  CoNLL (SG-proj-dbp)       5,000         0.6
  TAC10 (NTEE)              5,000         0.0
  TAC10 (NTEE w/o strsim)   5,000         0.0
  TAC10 (Fixed NTEE)        5,000         0.0
  TAC10 (SG-proj)           2,000         0.4
  TAC10 (SG-proj-dbp)       5,000         0.0
Factoid QA:
  History (NTEE)            1,000         0.4
  History (Fixed NTEE)      1,000         0.4
  History (SG-proj)         3,000         0.1
  History (SG-proj-dbp)     5,000         0.1
  Literature (NTEE)         2,000         0.1
  Literature (Fixed NTEE)   2,000         0.1
  Literature (SG-proj)      2,000         0.1
  Literature (SG-proj-dbp)  5,000         0.1

3.2.3 Baselines

We adopt the following six recent state-of-the-art EL methods as our baselines:

• Hoffart (Hoffart et al., 2011) used a graph-based approach that finds a dense subgraph of entities in a document to address EL.

• He (He et al., 2013) proposed a method for learning representations of mention contexts and entities from the KB using stacked denoising auto-encoders; these representations were then used to address EL.

• Chisholm (Chisholm and Hachey, 2015) used a support vector machine (SVM) with various features derived from the KB and the Wikilinks dataset (Singh et al., 2012).

• Pershina (Pershina et al., 2015) improved EL by modeling coherence using the personalized PageRank algorithm.

• Globerson (Globerson et al., 2016) improved the coherence model for EL by introducing an attention mechanism that focuses only on strong relations between entities.

• Yamada (Yamada et al., 2016) proposed a model for learning joint distributed representations of words and KB entities from the KB, and addressed EL using context models based on these representations.

3.2.4 Results

Table 4 compares the results of our method with those obtained by the state-of-the-art methods. Our method achieved strong results on both the CoNLL and the TAC 2010 datasets. In particular, the NTEE model clearly outperformed our other proposed models.
We also tested the performance of the NTEE model without the string similarity features (strsim) and found that these features contributed to the performance as well.

Table 4: Accuracies of the proposed methods and the state-of-the-art methods on the EL task.

Name               CoNLL (Micro)  CoNLL (Macro)  TAC10 (Micro)
NTEE               94.7           94.3           87.7
NTEE (w/o strsim)  92.9           92.7           85.8
Fixed NTEE         92.6           93.1           85.9
SG-proj            87.8           89.5           82.5
SG-proj-dbp        86.5           89.5           83.0
Hoffart (2011)     82.5           81.7           -
He (2013)          85.6           84.0           81.0
Chisholm (2015)    88.7           -              80.7
Pershina (2015)    91.8           89.9           -
Globerson (2016)   92.7           -              87.2
Yamada (2016)      93.1           92.6           85.2

Furthermore, our method successfully outperformed all of the recent strong state-of-the-art methods on both datasets. This is remarkable because most state-of-the-art EL methods, including all baseline methods except that of He, adopt global approaches, in which all entity mentions in a document are disambiguated simultaneously based on the coherence among the disambiguation decisions. Our method depends only on the local (textual) context available in the target document; its performance can therefore likely be improved further by combining a global model with our local model, as frequently observed in past work (Ratinov et al., 2011; Chisholm and Hachey, 2015; Yamada et al., 2016).

We also conducted a brief error analysis by randomly inspecting 200 errors of the NTEE model on the test set of the CoNLL dataset. We found that 22% of the errors were mentions whose referent entities were not contained in our vocabulary; in these cases, our method could not incorporate any contextual information, which likely resulted in disambiguation errors. The other major type of error involved mentions of location names. The dataset contains many location names (e.g., Japan) referring to sports team entities (e.g., Japan national football team), and our method often failed to distinguish whether a location name refers to the location itself or to a sports team. In particular, it often wrongly resolved mentions referring to sports team entities to the corresponding location entities and vice versa; these two error types accounted for 20.5% and 14.5% of the total errors, respectively. Moreover, we observed several difficult cases such as selecting Hindu instead of Hindu nationalism, Christian instead of Catholicism, New York City instead of New York, and so forth.

3.3 Factoid Question Answering

Question answering (QA) has been one of the central problems in NLP research for the last few decades. Factoid QA is a typical type of QA that aims to predict an entity (e.g., an event, author, or actor) discussed in a given question. Quiz bowl is a popular trivia quiz game in which players are asked questions consisting of 4-6 sentences describing entities. The quiz bowl dataset has frequently been used for evaluating factoid QA methods in the recent QA literature (Iyyer et al., 2014; Iyyer et al., 2015; Xu and Li, 2016). In this section, we demonstrate that our proposed representations can be used effectively as background knowledge for the QA task.

3.3.1 Setup

We followed an existing method (Xu and Li, 2016) for our experimental setup. We used the public quiz bowl dataset proposed in Iyyer et al. (2014), downloaded from https://cs.umd.edu/~miyyer/qblearn/. (This public dataset is significantly smaller than the one used in past work (Iyyer et al., 2014; Iyyer et al., 2015), which additionally used a proprietary dataset.) Following past work (Iyyer et al., 2014; Iyyer et al., 2015; Xu and Li, 2016), we only used questions belonging to the history and literature categories, and only used answers that appeared at least six times.
For the questions referring to each answer, we sampled 20% for the development set, 20% for the test set, and used the remaining 60% for the training set. As a result, we obtained 1,535 training, 511 development, and 511 test questions for history, and 2,524 training, 840 development, and 840 test questions for literature. The numbers of possible answers were 303 and 424 in the history and literature categories, respectively.

3.3.2 Our Method

Following past work (Iyyer et al., 2014; Iyyer et al., 2015; Xu and Li, 2016), we address this task as a classification problem that selects the most relevant answer from the possible answers observed in the dataset. We adopt the same neural network architecture described in Section 3.2.2 (see Figure 1). We use the following three features: the vector of the entity v_e (normalized to unit length, as in our EL method, because of its overall higher accuracy), the vector of the question v_t (computed using Eq. (2)), and the dot product of v_e and v_t. We do not include any other features in this task.

The hyper-parameters of our model (i.e., the number of units in the hidden layer and the dropout probability) are shown in Table 3; we tuned them using the development set of each dataset.

Unlike in the EL task, we updated all parameters, including the representations of words and entities, when training our QA method. We used stochastic gradient descent (SGD) to train the model. The mini-batch size was fixed at 100, and the learning rate was controlled by RMSprop. We used the accuracy on the development set of each dataset to detect the best epoch.

Similar to the EL task, we tested the four models for initializing the representations v_t and v_e, i.e., the NTEE, fixed NTEE, SG-proj, and SG-proj-dbp models. Further, the representations of the NTEE model and the fixed NTEE model were those trained with sentences, because of their overall superior accuracy compared to those trained with paragraphs.
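For the QA task, only the feature construction differs from the EL sketch above: each candidate answer entity is scored from its unit-normalized entity vector, the question vector, and their dot product. A brief sketch using our own function name and reusing the rank_candidates helper from the EL sketch:

```python
import numpy as np

def qa_candidate_features(v_e, v_t):
    """Exactly three feature groups: normalized answer-entity vector, question vector from Eq. (2),
    and their dot product. No handcrafted features are used for factoid QA."""
    v_e = v_e / np.linalg.norm(v_e)
    return np.concatenate([v_e, v_t, [v_e.dot(v_t)]])

# The predicted answer is the candidate with the highest softmax probability, e.g.:
# probs = rank_candidates([qa_candidate_features(v, v_t) for v in answer_entity_vecs],
#                         W_h, b_h, w_o, b_o)
# prediction = possible_answers[int(np.argmax(probs))]
```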
3.3.3 Baselines

We use two types of baselines: two conventional bag-of-words (BOW) models and two state-of-the-art neural network models:

• BOW (Iyyer et al., 2014) is a conventional approach using a logistic regression (LR) classifier trained with binary BOW features to predict the correct answer.

• BOW-DT (Iyyer et al., 2014) is the BOW baseline with the feature set augmented with dependency relation indicators.

• QANTA (Iyyer et al., 2014) is an approach based on a recursive neural network that derives distributed representations of questions. The method then uses an LR classifier with the derived representations as features.

• FTS-BRNN (Xu and Li, 2016) is based on a bidirectional recurrent neural network (RNN) with gated recurrent units (GRUs). Similar to QANTA, the method uses an LR classifier with the derived representations as features.

3.3.4 Results

Table 5 shows the results of our methods compared with those of the baseline methods. The results of BOW, BOW-DT, and QANTA were obtained from Xu and Li (2016). We also include the result reported in Iyyer et al. (2014) (denoted by QANTA-full), which used a significantly larger dataset than ours for training and testing.

Table 5: Accuracies of the proposed methods and the state-of-the-art methods on the factoid QA task.

Name         History  Literature
NTEE         94.7     95.1
Fixed NTEE   90.0     93.5
SG-proj      86.5     87.9
SG-proj-dbp  86.5     87.3
BOW          50.8     46.2
BOW-DT       60.9     57.4
QANTA        65.8     63.0
QANTA-full   73.7     69.1
FTS-BRNN     88.1     93.1

The experimental results show that our NTEE model achieved the best performance among our proposed models and all baseline methods on both the history and the literature datasets. In particular, despite the simplicity of our neural network architecture compared to the state-of-the-art methods (i.e., QANTA and FTS-BRNN), our method clearly outperformed them. This demonstrates the effectiveness of our proposed representations as background knowledge for the QA task.

We also conducted a brief error analysis using the test set of the history dataset. Our method was nearly perfect at predicting the types of target answers (e.g., locations, events, and people), but it erred in delicate cases such as predicting Henry II of England instead of Henry I of England, and Syracuse, Sicily instead of Sicily.

3.4 Qualitative Analysis

In order to investigate what happens inside our model, we conducted a qualitative analysis using our proposed representations trained with sentences.

We first inspected the word representations of our model and of our pre-trained representations (i.e., the skip-gram model) by computing the top five most similar words (by cosine similarity) of five words: her, dry, spanish, tennis, and moon. The results are presented in Table 6. Interestingly, our model is somewhat more specific than the skip-gram model. For example, there is only one word (she) whose cosine similarity to the word her exceeds 0.5 in our model, whereas all of the corresponding similar words in the skip-gram model (i.e., she, his, herself, him, and mother) satisfy that condition. We observe a similar trend for the words similar to dry. Furthermore, all of the words similar to tennis are strictly related to the sport itself in our model, whereas the corresponding similar words of the skip-gram model include broader terms such as other ball sports (e.g., badminton and volleyball). Similar trends can be observed for the words similar to spanish and moon.

Table 6: Examples of the top five similar words, with their cosine similarities, in our learned word representations compared with those of the skip-gram model.

her
  Our model: she (0.65), to (0.41), and (0.40), his (0.40), in (0.39)
  Skip-gram: she (0.86), his (0.77), herself (0.71), him (0.66), mother (0.64)
dry
  Our model: wet (0.48), arid (0.46), moisture (0.44), grows (0.44), dried (0.43)
  Skip-gram: wet (0.81), moist (0.73), drier (0.72), drying (0.70), moister (0.69)
tennis
  Our model: doubles (0.86), atp (0.79), wimbledon (0.78), wta (0.75), slam (0.74)
  Skip-gram: badminton (0.75), hardcourt (0.73), volleyball (0.72), racquetball (0.71), squash (0.68)
spanish
  Our model: spain (0.76), madrid (0.70), andalusia (0.64), valencia (0.61), seville (0.60)
  Skip-gram: spain (0.68), portuguese (0.68), french (0.68), catalan (0.67), mexican (0.67)
moon
  Our model: lunar (0.78), crater (0.66), rim (0.66), craters (0.65), midpoint (0.59)
  Skip-gram: lunar (0.68), moons (0.68), sun (0.68), earth (0.67), sadasaa (0.67)

Similarly, we compared our entity representations with the pre-trained representations by computing the top five most similar entities (by cosine similarity) of six entities: Europe, Golf, Tea, Smartphone, Scarlett Johansson, and The Lord of the Rings. Table 7 contains the results. For the entities Europe and Golf, we observe trends similar to those of our word representations. In particular, in our model the most similar entities of Europe and Golf are Eastern Europe and Golf course, respectively, whereas in the skip-gram model they are Asia and Tennis. However, for most entities (e.g., Tea, Smartphone, Scarlett Johansson, and The Lord of the Rings), the similar entities appear to be similar between our model and the skip-gram model.

4 Related Work

Various neural network models that learn distributed representations of arbitrary-length texts (e.g., paragraphs and sentences) have recently been proposed.
These models aim to produce general-purpose text representations that can be used with ease in various downstream NLP tasks. Although most of these models learn text representations from an unstructured text corpus (Le and Mikolov, 2014; Kiros et al., 2015; Kenter et al., 2016), models have also been proposed that learn text representations by leveraging structured linguistic resources. For instance, Wieting et al. (2016) trained their model using a large number of noisy phrase pairs retrieved from the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013), and Hill et al. (2016b) used several public dictionaries to train a model that maps the definition texts in a dictionary to the representations of the words they define. To our knowledge, ours is the first work to learn generic text representations with the supervision of entity annotations.

Several methods have also been proposed for extending word embedding methods. For example, Levy and Goldberg (2014) proposed a method to train word embeddings with dependency-based contexts, and Luan et al. (2016) used semantic role labeling to generate contexts for training word embeddings. Moreover, a few recent studies have learned entity embeddings based on word embedding methods (Hu et al., 2015; Li et al., 2016). These models are typically based on the skip-gram model and directly model the semantic relatedness between KB entities. Our work differs from these studies because we aim to learn representations of arbitrary-length texts in addition to entities.

Another related approach is relational embedding (or knowledge embedding) (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015), which encodes entities as continuous vectors and relations as operations on the vector space, such as vector addition. These models typically learn representations from large KB graphs consisting of entities and relations. Similarly, the universal schema (Riedel et al., 2013; Toutanova et al., 2015; Verga et al., 2016) jointly learns continuous representations of KB relations, entities, and surface text patterns for the relation extraction task.

Finally, Yamada et al. (2016) recently proposed a method to jointly learn the embeddings of words and entities from Wikipedia using the skip-gram model and applied it to EL. Our method differs from theirs in that their method does not directly model arbitrary-length texts (i.e., paragraphs and sentences), which we have shown to be highly effective for various tasks in this paper.
Moreover, we also showed that the joint embedding of texts and entities can be applied not only to EL but also to wider applications such as semantic textual similarity and factoid QA.

5 Conclusions

In this paper, we presented a novel model capable of jointly learning distributed representations of texts and entities from a large number of entity annotations in Wikipedia. Our aim was to construct a general-purpose model that enables practitioners to address various NLP tasks with ease. We achieved state-of-the-art results on three important NLP tasks (i.e., semantic textual similarity, entity linking, and factoid question answering), which clearly demonstrates the effectiveness of our model. Furthermore, our qualitative analysis showed that the characteristics of the learned representations differ from those of the conventional word embedding model (i.e., the skip-gram model), which we plan to investigate in more detail in the future. We make our code and trained models publicly available for future research.

Future work includes analyzing our model more extensively and exploring its effectiveness on other NLP tasks. We also aim to test more expressive neural network models (e.g., LSTMs) for deriving our text representations. Furthermore, we believe that one promising direction is to incorporate the rich structural data of the KB, such as relationships between entities, links between entities, and the hierarchical category structure of entities.

Table 7: Examples of the top five similar entities, with their cosine similarities, in our learned entity representations compared with those of the skip-gram model.

Europe
  Our model: Eastern Europe (0.67), Western Europe (0.66), Central Europe (0.64), Asia (0.64), North America (0.64)
  Skip-gram: Asia (0.85), Western Europe (0.78), North America (0.76), Central Europe (0.75), Americas (0.73)
Golf
  Our model: Golf course (0.76), PGA Tour (0.74), LPGA (0.74), Professional golfer (0.73), U.S. Open (0.71)
  Skip-gram: Tennis (0.74), LPGA (0.72), PGA Tour (0.69), Golf course (0.68), Nicklaus Design (0.66)
Tea
  Our model: Coffee (0.82), Green tea (0.81), Black tea (0.80), Camellia sinensis (0.78), Spice (0.76)
  Skip-gram: Coffee (0.78), Green tea (0.76), Black tea (0.75), Camellia sinensis (0.74), Spice (0.73)
Smartphone
  Our model: Tablet computer (0.93), Mobile device (0.89), Personal digital assistant (0.88), Android (operating system) (0.86), iPhone (0.85)
  Skip-gram: Tablet computer (0.91), Personal digital assistant (0.84), Mobile device (0.84), Android (operating system) (0.82), Feature phone (0.82)
Scarlett Johansson
  Our model: Kirsten Dunst (0.85), Anne Hathaway (0.85), Cameron Diaz (0.85), Natalie Portman (0.85), Jessica Biel (0.84)
  Skip-gram: Anne Hathaway (0.79), Natalie Portman (0.78), Kirsten Dunst (0.78), Cameron Diaz (0.78), Kate Beckinsale (0.77)
The Lord of the Rings
  Our model: The Hobbit (0.85), J. R. R. Tolkien (0.84), The Silmarillion (0.81), The Fellowship of the Ring (0.80), The Lord of the Rings (film series) (0.78)
  Skip-gram: The Hobbit (0.77), J. R. R. Tolkien (0.76), The Silmarillion (0.71), The Fellowship of the Ring (0.70), Elvish languages (0.69)

Acknowledgements

We would like to thank the TACL editor Kristina Toutanova and the anonymous reviewers for their helpful comments on an earlier draft of this paper.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity.
In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 81–91.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, pages 2787–2795.

Martin Brümmer, Milan Dojchinovski, and Sebastian Hellmann. 2016. DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation.

Andrew Chisholm and Ben Hachey. 2015. Entity Disambiguation with Web Links. Transactions of the Association for Computational Linguistics, 3:145–156.

Silviu Cucerzan. 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 708–716.

Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and Ming Li. 2016. Entity Disambiguation by Knowledge and Text Jointly Embedding. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 260–269.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In International Joint Conference on Artificial Intelligence, pages 1606–1611.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringaard, and Fernando Pereira. 2016. Collective Entity Resolution with Multi-Focal Attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 621–631.

Hannaneh Hajishirzi, Leila Zilles, Daniel S Weld, and Luke Zettlemoyer. 2013. Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 289–299.

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning Entity Representation for Entity Disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 30–34.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016a. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2016b. Learning to Understand Phrases by Embedding the Dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust Disambiguation of Named Entities in Text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792.

Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, and Eric Xing. 2015. Entity Hierarchy Embedding.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1292–1300.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A Neural Network for Factoid Question Answering over Paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 633–644.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.

Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the TAC 2010 Knowledge Base Population Track. In Proceedings of the Text Analysis Conference.

Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 941–951.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In Advances in Neural Information Processing Systems 28, pages 3294–3302.

Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning (Volume 32), pages 1188–1196.

Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308.

Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A Hierarchical Neural Autoencoder for Paragraphs and Documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1106–1115.

Yuezhang Li, Ronghuo Zheng, Tian Tian, Zhiting Hu, Rahul Iyer, and Katia Sycara. 2016. Joint Embedding of Hierarchical Categories and Entities for Concept Categorization and Dataless Classification. In Proceedings of the 26th International Conference on Computational Linguistics, pages 2678–2688.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2181–2187.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design Challenges for Entity Linking. Transactions of the Association for Computational Linguistics, 3:315–328.

Yi Luan, Yangfeng Ji, Hannaneh Hajishirzi, and Boyang Li. 2016. Multiplicative Representations for Unsupervised Semantic Role Induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 118–123.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK Cure for the Evaluation of Compositional Distributional Semantic Models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 216–223.
Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. 2012. Adding Semantics to Microblog Posts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 563–572.

Rada Mihalcea and Andras Csomai. 2007. Wikify!: Linking Documents to Encyclopedic Knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 233–242.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, pages 1–12.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

David Milne and Ian H. Witten. 2008. Learning to Link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509–518.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT, pages 236–244.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Maria Pershina, Yifan He, and Ralph Grishman. 2015. Personalized Page Rank for Named Entity Disambiguation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 238–243.

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1375–1384.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation Extraction with Matrix Factorization and Universal Schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84.

Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2012. Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia. Technical Report UM-CS-2012-015.

Theano Development Team. 2016. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv preprint arXiv:1605.02688v1.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing Text for Joint Embedding of Text and Knowledge Bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016. Multilingual Relation Extraction using Compositional Universal Schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 886–896.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge Graph and Text Jointly Embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1591–1601.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards Universal Paraphrastic Sentence Embeddings. In Proceedings of the 2016 International Conference on Learning Representations.

Dong Xu and Wu-Jun Li. 2016. Full-Time Supervision based Bidirectional RNN for Factoid Question Answering. arXiv preprint arXiv:1606.05854v2.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259.