Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning

Manaal Faruqui, Carnegie Mellon University, mfaruqui@cs.cmu.edu
Ryan McDonald, Google Inc., ryanmcd@google.com
Radu Soricut, Google Inc., rsoricut@google.com

Abstract

Morpho-syntactic lexicons provide information about the morphological and syntactic roles of words in a language. Such lexicons are not available for all languages and even when available, their coverage can be limited. We present a graph-based semi-supervised learning method that uses the morphological, syntactic and semantic relations between words to automatically construct wide coverage lexicons from small seed sets. Our method is language-independent, and we show that we can expand a 1000 word seed lexicon to more than 100 times its size with high quality for 11 languages. In addition, the automatically created lexicons provide features that improve performance in two downstream tasks: morphological tagging and dependency parsing.

1 Introduction

Morpho-syntactic lexicons contain information about the morphological attributes and syntactic roles of words in a given language. A typical lexicon contains all possible attributes that can be displayed by a word. Table 1 shows some entries in a sample English morpho-syntactic lexicon.

  played    POS:VERB, TENSE:PAST, VFORM:FIN, ...
  playing   POS:VERB, TENSE:PRES, VFORM:GER, ...
  awesome   POS:ADJ, DEGREE:POS

Table 1: A sample English morpho-syntactic lexicon.

As these lexicons contain rich linguistic information, they are useful as features in downstream NLP tasks like machine translation (Nießen and Ney, 2004; Minkov et al., 2007; Green and DeNero, 2012), part of speech tagging (Schmid, 1994; Denis and Sagot, 2009; Moore, 2015), dependency parsing (Goldberg et al., 2009), language modeling (Arisoy et al., 2010) and morphological tagging (Müller and Schuetze, 2015), inter alia. There are three major factors that limit the use of such lexicons in real world applications: (1) they are often constructed manually and are expensive to obtain (Kokkinakis et al., 2000; Dukes and Habash, 2010); (2) they are currently available for only a few languages; and (3) the size of available lexicons is generally small.

In this paper, we present a method that takes as input a small seed lexicon, containing a few thousand annotated words, and outputs an automatically constructed lexicon which contains morpho-syntactic attributes (henceforth referred to as attributes) for a large number of words of a given language. We model the problem of morpho-syntactic lexicon generation as a graph-based semi-supervised learning problem (Zhu, 2005; Bengio et al., 2006; Subramanya and Talukdar, 2014). We construct a graph where nodes represent word types and the goal is to label them with attributes. The seed lexicon provides attributes for a subset of these nodes. Nodes are connected to each other through edges that denote features shared between them or surface morphological transformations between them.

Our entire framework of lexicon generation, including the label propagation algorithm and the feature extraction module, is language independent. We only use word-level morphological, semantic and syntactic relations between words that can be induced from unannotated corpora in an unsupervised manner.
One particularly novel aspect of our graph-based framework is that edges are featurized. Some of these features measure similarity, e.g., singular nouns tend to occur in similar distributional contexts as other singular nouns, but some also measure transformations from one inflection to another, e.g., adding an 's' suffix could indicate flipping the NUM:SING attribute to NUM:PLUR (in English). For every attribute to be propagated, we learn weights over features on the edges separately. This is in contrast to traditional label propagation, where edges indicate similarity exclusively (Zhu, 2005).

We construct lexicons in 11 languages of varying morphological complexity. We perform intrinsic evaluation of the quality of generated lexicons obtained from either the universal dependency treebank or created manually by humans (§4). We show that these automatically created lexicons provide useful features in two extrinsic NLP tasks which require identifying the contextually plausible morphological and syntactic roles: morphological tagging (Hajič and Hladká, 1998; Hajič, 2000) and syntactic dependency parsing (Kübler et al., 2009). We obtain an average of 15.4% and 5.3% error reduction across 11 languages for morphological tagging and dependency parsing respectively on a set of publicly available treebanks (§5). We anticipate that the lexicons thus created will be useful in a variety of NLP problems.

2 Graph Construction

The approach we take propagates information over lexical graphs (§3). In this section we describe how to construct the graph that serves as the backbone of our model. We construct a graph in which nodes are word types and directed edges are present between nodes that share one or more features. Edges between nodes denote that there might be a relationship between the attributes of the two nodes, which we intend to learn. As we want to keep our model language independent, we use edge features that can be induced between words without using any language specific tools. To this end, we describe three features in this section that can be obtained using unlabeled corpora for any given language.1

1 Some of these features can cause the graph to become very dense, making label propagation prohibitive. We keep the size of the graph in check by only allowing a word node to be connected to at most 100 other (randomly selected) word nodes sharing one particular feature. This reduces edges while still keeping the graph connected.

Fig. 1 shows a subgraph of the full graph constructed for English.

Figure 1: A subgraph from the complete graph of English showing different kinds of features shared on the edges between words. Some possible features/edges have been removed for enhancing clarity.

Word Clusters. Previous work has shown that unlabeled text can be used to induce unsupervised word clusters which can improve the performance of many NLP tasks in different languages (Clark, 2003; Koo et al., 2008; Turian et al., 2010; Faruqui and Padó, 2010; Täckström et al., 2012; Owoputi et al., 2013). Word clusters capture semantic and syntactic similarities between words; for example, play and run are present in the same cluster. We obtain word clusters by using the exchange clustering algorithm (Kneser and Ney, 1993; Martin et al., 1998; Uszkoreit and Brants, 2008) on a large unlabeled corpus of every language. As in Täckström et al. (2012), we use one year of news articles scraped from a variety of sources and cluster only the most frequent 1M words into 256 different clusters. An edge is introduced for every word pair sharing the same word cluster, and a feature for that cluster is fired. Thus, there are 256 possible cluster features on an edge, though in our case only a single one can fire.
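To make the cluster feature concrete, the following is a minimal Python sketch of how such edges could be derived from a word-to-cluster mapping. The function and variable names (cluster_edges, word2cluster, max_neighbors) are illustrative assumptions rather than the authors' implementation; the neighbor cap mirrors the sparsification described in footnote 1.

```python
import random
from collections import defaultdict

def cluster_edges(word2cluster, max_neighbors=100, seed=0):
    """Yield directed featurized edges (w, v, feature) for words in the same cluster.

    word2cluster maps word types (e.g. the ~1M most frequent words) to one of
    256 exchange-algorithm cluster ids; each word is connected to at most
    max_neighbors randomly chosen words of its own cluster.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for word, cid in word2cluster.items():
        by_cluster[cid].append(word)

    edges = []
    for cid, words in by_cluster.items():
        feature = "clus:%d" % cid
        for w in words:
            others = [v for v in words if v != w]
            rng.shuffle(others)
            for v in others[:max_neighbors]:
                edges.append((w, v, feature))   # directed edge carrying the cluster feature
    return edges

# toy usage with two clusters
toy = {"play": 20, "run": 20, "played": 15, "playing": 15}
for edge in cluster_edges(toy, max_neighbors=2):
    print(edge)   # e.g. ('play', 'run', 'clus:20'), ('played', 'playing', 'clus:15'), ...
```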
Suffix & Prefix. Suffixes are often strong indicators of the morpho-syntactic attributes of a word (Ratnaparkhi, 1996; Clark, 2003). For example, in English, -ing denotes gerund verb forms like studying, playing and -ed denotes past tense like studied, played etc. Prefixes like un-, in- often denote adjectives. Thus we include both 2-gram and 3-gram suffixes and prefixes as edge features.2 We introduce an edge between two words sharing a particular suffix or prefix feature.

2 We only include those suffixes and prefixes which appear at least twice in the seed lexicon.

Morphological Transformations. Soricut and Och (2015) presented an unsupervised method of inducing prefix- and suffix-based morphological transformations between words using word embeddings. In their method, statistically, most of the transformations are induced between words with the same lemma (without using any prior information about the word lemma). For example, their method induces the transformation between played and playing as suffix:ed:ing. This feature indicates TENSE:PAST to turn off and TENSE:PRES to turn on.3 We train the morphological transformation prediction tool of Soricut and Och (2015) on the news corpus (same as the one used for training word clusters) for each language. An edge is introduced between two words that exhibit a morphological transformation feature from one word to another as indicated by the tool's output.

3 Our model will learn the following transformation: TENSE:PAST: 1 → -1, TENSE:PRESENT: -1 → 1 (§3).

Motivation for the Model. To motivate our model, consider the words played and playing. They have a common attribute POS:VERB but they differ in tense, showing TENSE:PAST and TENSE:PRES respectively. Typical graph-propagation algorithms model similarity (Zhu, 2005) and thus propagate all attributes along the edges. However, we want to model whether an attribute should propagate or change across an edge. For example, having a shared cluster feature is an indication of a similar POS tag (Clark, 2003; Koo et al., 2008; Turian et al., 2010), but a surface morphological transformation feature like suffix:ed:ing possibly indicates a change in the tense of the word. Thus, we will model attribute propagation/transformation as a function of the features shared on the edges between words. The features described in this section are especially suitable for languages that exhibit concatenative morphology, like English, German, Greek etc., and might not work very well with languages that exhibit non-concatenative morphology, i.e., where root modification is highly frequent, as in Arabic and Hebrew. However, it is important to note that our framework is not limited to just the features described here, but can incorporate any arbitrary information over word pairs (§8).
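Continuing the sketch above, the remaining edge features of this section (shared affixes and induced morphological transformations) can be collected into the binary feature set on a directed edge. This is a hedged illustration: the seed-lexicon frequency filter of footnote 2 is omitted, and morph_transforms stands in for the output of a Soricut and Och (2015)-style tool, which is not reimplemented here.

```python
def affix_features(word, lengths=(2, 3)):
    """2-gram and 3-gram suffix and prefix features for a word."""
    feats = set()
    for n in lengths:
        if len(word) > n:
            feats.add("suffix:" + word[-n:])
            feats.add("prefix:" + word[:n])
    return feats

def edge_features(w, v, word2cluster, morph_transforms):
    """Binary feature set phi(w, v) on the directed edge from w to v.

    word2cluster maps words to cluster ids; morph_transforms is assumed to map
    ordered pairs (w, v) to an induced transformation string such as 'suffix:ed:ing'.
    """
    feats = set()
    if word2cluster.get(w) is not None and word2cluster.get(w) == word2cluster.get(v):
        feats.add("clus:%d" % word2cluster[w])
    # shared suffix / prefix features
    feats |= affix_features(w) & affix_features(v)
    # surface morphological transformation from w to v, if one was induced
    transform = morph_transforms.get((w, v))
    if transform is not None:
        feats.add(transform)
    return feats

# toy usage: features on the edge between "played" and "playing"
print(sorted(edge_features("played", "playing",
                           {"played": 15, "playing": 15},
                           {("played", "playing"): "suffix:ed:ing"})))
# -> ['clus:15', 'prefix:pl', 'prefix:pla', 'suffix:ed:ing']
```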
3 Graph-based Label Propagation

We now describe our model. Let W = {w_1, w_2, ..., w_|W|} be the vocabulary with |W| words and A = {a_1, a_2, ..., a_|A|} be the set of lexical attributes that words in W can express; e.g. W = {played, playing, ...} and A = {NUM:SING, NUM:PLUR, TENSE:PAST, ...}. Each word type w ∈ W is associated with a vector a_w ∈ [−1, 1]^|A|, where a_{i,w} = 1 indicates that word w has attribute i and a_{i,w} = −1 indicates its absence; values in between are treated as degrees of uncertainty. For example, TENSE:PAST_played = 1 and TENSE:PAST_playing = −1.4

4 We constrain a_{i,w} ∈ [−1, 1] as it is easier to model the flipping of an attribute value from −1 to 1 as opposed to [0, 1].

The vocabulary W is divided into two disjoint subsets: the labeled words L for which we know their a_w's (obtained from the seed lexicon)5 and the unlabeled words U whose attributes are unknown. In general |U| ≫ |L|. The words in W are organized into a directed graph with edges E between words. Let the vector φ(w,v) ∈ [0, 1]^|F| denote the features on the directed edge between words w and v, with 1 indicating the presence and 0 the absence of feature f_k ∈ F, where F = {f_1, f_2, ..., f_|F|} is the set of possible binary features shared between two words in the graph. For example, the features on the edge between played and playing from Fig. 1 are:

  φ_k(played, playing) = 1, if f_k = suffix:ed:ing
                         1, if f_k = prefix:pl
                         0, if f_k = suffix:ly
                         ...

5 We use labeled, seed, and training lexicon to mean the same thing interchangeably.

We seek to determine which subsets of A are valid for each word w ∈ W. We learn how a particular attribute of a node is a function of that same attribute of its neighboring nodes and the features on the edges connecting them. Let a_{i,w} be an attribute of word w and let â_{i,w} be the empirical estimate of that attribute. We posit that â_{i,w} can be estimated from the neighbors N(w) of w as follows:

  â_{i,w} = tanh( ∑_{v ∈ N(w)} (φ(w,v) · θ_i) × a_{i,v} )    (1)

where θ_i ∈ R^|F| is the weight vector of the edge features for estimating attribute a_i, and '·' represents the dot product between two vectors. We use tanh as the non-linearity to make sure that â_{i,w} ∈ [−1, 1]. The set of such weights θ ∈ R^{|A|×|F|} for all attributes are the model parameters that we learn. Our graph resembles the Ising model, a lattice model proposed for describing intermolecular forces (Ising, 1925), and eq. 1 solves the naive mean field approximation of the Ising model (Wang et al., 2007).

Intuitively, one can view the node-to-node message function from v to w, φ(w,v) · θ_i × a_{i,v}, as either (1) supporting the value a_{i,v} when φ(w,v) · θ_i > 0; (2) inverting a_{i,v} when φ(w,v) · θ_i < 0; or (3) dampening or neutering a_{i,v} when φ(w,v) · θ_i ≈ 0. Returning to our motivation, if w = played and v = playing, a feature indicating the suffix substitution suffix:ed:ing should have a highly negative weight for TENSE:PAST, indicating a change in value. This is because TENSE:PAST = −1 for playing, and a negative value of φ(w,v) · θ_i will push it to positive for played.
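To make eq. 1 and the message function concrete, here is a minimal numpy sketch of the empirical estimate â_{i,w}. The data layout (dictionaries of attribute vectors and edge feature vectors) is an assumption made for illustration, not the authors' implementation.

```python
import numpy as np

def estimate_attribute(w, i, neighbors, attr, phi, theta):
    """Empirical estimate of attribute i for word w, following eq. 1.

    neighbors: dict word -> list of neighbor words N(w)
    attr:      dict word -> np.array of length |A|, values in [-1, 1]
    phi:       dict (w, v) -> binary np.array of length |F| (edge features)
    theta:     np.array of shape (|A|, |F|), one weight vector per attribute
    """
    total = 0.0
    for v in neighbors[w]:
        # message from v to w: (phi(w, v) . theta_i) * a_{i,v}
        total += phi[(w, v)].dot(theta[i]) * attr[v][i]
    return np.tanh(total)

# toy example: one neighbor 'playing' with TENSE:PAST = -1; the edge carries a
# single feature (suffix:ed:ing) whose weight for TENSE:PAST is negative, so
# the estimate for 'played' is pushed towards +1, i.e. the value is flipped.
theta = np.array([[-2.0]])                       # |A| = 1, |F| = 1
attr = {"playing": np.array([-1.0])}
phi = {("played", "playing"): np.array([1.0])}
print(estimate_attribute("played", 0, {"played": ["playing"]}, attr, phi, theta))
# -> ~0.96
```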
It should be noted that this framework for constructing lexicons does not explicitly distinguish between morpho-syntactic paradigms, but simply identifies all possible attribute-values a word can take. If we consider an example like "games" and two attributes, the syntactic part-of-speech, POS, and number, NUM, games can either be (1) {POS:VERB, NUM:SING}, as in John games the system; or (2) {POS:NOUN, NUM:PLUR}, as in The games have started. Our framework will merely return that all the above attribute-values are possible, which implies that the singular noun and plural verb interpretations are also valid. One possible way to account for this is to make full morphological paradigms the "attributes" in our model. But this leads to slower runtimes and sparser learning. We leave as future work extensions to full paradigm prediction.

Our framework has three critical components, each described below: (1) model estimation, i.e., learning θ; (2) label propagation to U; and optionally (3) paradigm projection to known valid morphological paradigms. The overall procedure is illustrated in Figure 2 and made concrete in Algorithm 1.

Figure 2: Word graph with edges between words showing the labeled (grey) and the unlabeled (white) word nodes. Only nodes connected via solid edges are visible to each other; dotted edges block visibility. This figure demonstrates interaction between nodes during model estimation (left), label propagation (center), and paradigm projection (right). Attribute-value vectors of the words are shown in angled brackets. The solid edge in the right figure shows the closest attribute paradigm to which the empirical vector is projected.

Algorithm 1: Graph-based semi-supervised label propagation algorithm.
  Data: W, L, U, A, F, P
  Result: θ, labeled U
  // model estimation
  1   while not convergence do
  2       for w ∈ L do
  3           loss ← ‖a_w − â_w‖²₂
  4           Update θ using ∂loss/∂θ
  // label propagation
  5   while not convergence do
  6       for w ∈ U do
  7           a_w ← â_w
  // paradigm projection
  8   for w ∈ U do
  9       mindist ← ∞, closest ← ∅
  10      for p ∈ P do
  11          dist ← ‖a_w − p‖²₂
  12          if dist < mindist then
  13              mindist ← dist, closest ← p
  14      a_w ← closest

3.1 Model Estimation

We estimate all individual elements of an attribute vector using eq. 1. We define the loss as the squared loss between the empirical and observed attribute vectors on every labeled node in the graph; thus the total loss can be computed as:

  ∑_{w ∈ L} ‖a_w − â_w‖²₂    (2)

We train the edge feature weights θ by minimizing the loss function in eq. 2. In this step, we only use labeled nodes and the edge connections between labeled nodes. As such, this is strictly a supervised learning setup. We minimize the loss function using online adaptive gradient descent (Duchi et al., 2011) with ℓ2 regularization on the feature weights θ. This is the first step in Algorithm 1 (lines 1–4).

3.2 Label Propagation

In the second step, we use the learned weights of the edge features to estimate the attribute values over unlabeled nodes iteratively. The attribute vector of all unlabeled words is initialized to null, ∀w ∈ U, a_w = ⟨0, 0, ..., 0⟩. In every iteration, an unlabeled node estimates its empirical attributes by looking at the corresponding attributes of its labeled and unlabeled neighbors using eq. 1; thus this is the semi-supervised step. We stop after the squared Euclidean distance between the attribute vectors at two consecutive iterations for a node becomes less than 0.1 (averaged over all unlabeled nodes). This is the second step in Algorithm 1 (lines 5–7). After convergence, we can directly obtain attributes for a word by thresholding: a word w is said to possess an attribute a_i if a_{i,w} > 0.
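The following is a compact Python sketch of Algorithm 1, covering model estimation, label propagation, and the paradigm projection step that is detailed in §3.3 below. The data structures follow the earlier sketches; the fixed epoch count, learning rate, and AdaGrad details are simplifying assumptions rather than the settings used in the paper.

```python
import numpy as np

def train_and_propagate(words, labeled, attr, neighbors, phi,
                        num_attrs, num_feats,
                        epochs=10, lr=0.1, l2=1e-3, tol=0.1):
    """Sketch of Algorithm 1: estimation, propagation, paradigm projection.

    words: all word types W; labeled: seed words L whose gold vectors are in
    attr; attr: dict word -> np.array in [-1, 1]^|A|; neighbors: dict word ->
    list N(w); phi: dict (w, v) -> binary feature vector of length |F|.
    """
    theta = np.zeros((num_attrs, num_feats))
    grad_sq = np.full_like(theta, 1e-8)           # AdaGrad accumulators

    # unlabeled words start as all-zero vectors; their messages are zero, so
    # the estimation step below effectively uses labeled neighbors only
    unlabeled = [w for w in words if w not in labeled]
    for w in unlabeled:
        attr[w] = np.zeros(num_attrs)

    def estimate(w, i):
        # eq. 1: tanh of the summed messages from the neighbors of w
        s = sum(phi[(w, v)].dot(theta[i]) * attr[v][i] for v in neighbors[w])
        return np.tanh(s)

    # model estimation on labeled nodes (Algorithm 1, lines 1-4)
    for _ in range(epochs):
        for w in labeled:
            for i in range(num_attrs):
                a_hat = estimate(w, i)
                g = sum(phi[(w, v)] * attr[v][i] for v in neighbors[w])
                grad = 2 * (a_hat - attr[w][i]) * (1 - a_hat ** 2) * g
                grad = grad + l2 * theta[i]       # l2 regularization
                grad_sq[i] += grad ** 2
                theta[i] -= lr * grad / np.sqrt(grad_sq[i])

    # label propagation over unlabeled nodes (lines 5-7)
    while unlabeled:
        delta = 0.0
        for w in unlabeled:
            new = np.array([estimate(w, i) for i in range(num_attrs)])
            delta += np.sum((new - attr[w]) ** 2)
            attr[w] = new
        if delta / len(unlabeled) < tol:
            break

    # paradigm projection to observed seed paradigms (lines 8-14)
    paradigms = [attr[w] for w in labeled]
    for w in unlabeled:
        best = min(paradigms, key=lambda p: np.sum((attr[w] - p) ** 2))
        attr[w] = best.copy()
    return theta, attr
```

If paradigm projection is not selected for a language (§4.1), the final loop can simply be skipped and attributes read off by thresholding a_{i,w} > 0 as described above.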
3.3 Paradigm Projection

Since a word can be labeled with multiple lexical attributes, this is a multi-label classification problem. For such a task, several advanced methods that take into account the correlation between attributes have been proposed (Ghamrawi and McCallum, 2005; Tsoumakas and Katakis, 2006; Fürnkranz et al., 2008; Read et al., 2011); here, for its simplicity, we have adopted the binary relevance method, which trains a classifier for every attribute independently of the other attributes (Godbole and Sarawagi, 2004; Zhang and Zhou, 2005).

However, as the decision for the presence of an attribute over a word is independent of all the other attributes, the final set of attributes obtained for a word in §3.2 might not be a valid paradigm.6 For example, a word cannot exhibit only the two attributes POS:NOUN and TENSE:PAST, since the presence of the tense attribute implies POS:VERB should also be true. Further, we want to utilize the inherent correlations between attribute labels to obtain better solutions. We thus present an alternative, simpler method to account for this problem. To ensure that we obtain a valid attribute paradigm, we project the empirical attribute vector obtained after propagation to the space of all valid paradigms.

6 A paradigm is defined as a set of attributes.

We first collect all observed, and thus valid, attribute paradigms from the seed lexicon (P = {a_w | w ∈ L}). We replace the empirical attribute vector obtained in §3.2 by the valid attribute paradigm vector which is nearest to it according to Euclidean distance. This projection step is inspired by the decoding step in label-space transformation approaches to multilabel classification (Hsu et al., 2009; Ferng and Lin, 2011; Zhang and Schneider, 2011). This is the last step in Algorithm 1 (lines 8–14). We investigate for each language whether paradigm projection is helpful (§4.1).

       |L| (k)  |W| (k)  |E| (m)   |A|    |P|   Prop (k)
  eu     3.4      130      13       83    811     118
  bg     8.1      309      27       57     53     285
  hr     6.9      253      26       51    862     251
  cs    58.5      507      51      122  4,789     403
  da     5.9      259      26       61    352     246
  en     4.1    1,006     100       50    412     976
  fi    14.4      372      30      100  2,025     251
  el     3.9      358      26       40    409     236
  hu     1.9      451      25       85    490     245
  it     8.5      510      28       52    568     239
  sv     4.8      305      26       41    265     268

Table 2: Graph statistics for different languages, showing the approximate number of labeled seed nodes (|L|), labeled and unlabeled nodes (|W|), edges between words (|E|), the number of unique attributes (|A|), attribute paradigms (|P|) and the size of the constructed lexicon (Prop). k: thousands, m: millions.

4 Intrinsic Evaluation

To ascertain how well our graph-propagation framework predicts morphological attributes for words, we provide an intrinsic evaluation where we compare predicted attributes to gold lexicons that have been either read off from a treebank or derived manually.

4.1 Dependency Treebank Lexicons

The universal dependency treebank (McDonald et al., 2013; De Marneffe et al., 2014; Agić et al., 2015) contains dependency annotations for sentences and morpho-syntactic annotations for words in context for a number of languages.7 A word can display different attributes depending on its role in a sentence. In order to create a morpho-syntactic lexicon for every language, we take the union of all the attributes that the word realizes in the entire treebank. Although it is possible that this lexicon might not contain all realizable attributes if a particular attribute or paradigm is not seen in the treebank (we address this issue in §4.2), the utility of evaluating against treebank derived lexicons is that it allows us to evaluate on a large set of languages.
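A minimal sketch of this union-over-tokens lexicon construction follows. The treebank interface (an iterable of sentences of (word, attributes) pairs) and the function name are illustrative assumptions; the min_count filter mirrors the seed-lexicon frequency restriction described later in this section (words occurring at least twice in the training split).

```python
from collections import defaultdict

def treebank_lexicon(sentences, min_count=1):
    """Build a lexicon by taking, for every word type, the union of all
    attributes it realizes in the treebank (Section 4.1).

    sentences: iterable of lists of (word, attrs) pairs, where attrs is a
    set of strings like {"POS:VERB", "TENSE:PAST"}.
    """
    attrs_by_word = defaultdict(set)
    counts = defaultdict(int)
    for sent in sentences:
        for word, attrs in sent:
            attrs_by_word[word] |= set(attrs)
            counts[word] += 1
    return {w: a for w, a in attrs_by_word.items() if counts[w] >= min_count}

# toy usage: "played" occurs twice with different in-context readings
toy = [[("played", {"POS:VERB", "TENSE:PAST", "VFORM:FIN"})],
       [("played", {"POS:VERB", "VFORM:PART"})]]
print(sorted(treebank_lexicon(toy, min_count=2)["played"]))
# -> ['POS:VERB', 'TENSE:PAST', 'VFORM:FIN', 'VFORM:PART']
```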
In particular, in the universal dependency treebanks v1.1 (Agić et al., 7We use version 1.1 released in May 2015. 2015), 11 diverse languages contain the morphology layer, including Romance, Germanic and Slavic lan- guages plus isolates like Basque and Greek. We use the train/dev/test set of the treebank to cre- ate training (seed),8 development and test lexicons for each language. We exclude words from the dev and test lexicon that have been seen in seed lexicon. For every language, we create a graph with the fea- tures described in §2 with words in the seed lexicon as labeled nodes. The words from development and test set are included as unlabeled nodes for the prop- agation stage.9 Table 2 shows statistics about the constructed graph for different languages.10 We perform feature selection and hyperparame- ter tuning by optimizing prediction on words in the development lexicon and then report results on the test lexicon. The decision whether paradigm projec- tion (§3.3) is useful or not is also taken by tuning performance on the development lexicon. Table 3 shows the features that were selected for each lan- guage. Now, for every word in the test lexicon we obtain predicted lexical attributes from the graph. For a given attribute, we count the number of words for which it was correctly predicted (true positive), wrongly predicted (false positive) and not predicted (false negative). Aggregating these counts over all attributes (A), we compute the micro-averaged F1 score and achieve 74.3% on an average across 11 languages (cf. Table 4). Note that this systemati- cally underestimates performance due to the effect of missing attributes/paradigms that were not ob- served in the treebank. Propagated Lexicons. The last column in Table 2 shows the number of words in the propagated lexi- con, and the first column shows the number of words in the seed lexicon. The ratio of the size of propa- gated and seed lexicon is different across languages, which presumably depends on how densely con- nected each language’s graph is. For example, for 8We only include those words in the seed lexicon that occur at least twice in the training set of the treebank. 9Words from the news corpus used for word clustering are also used as unlabeled nodes. 10Note that the size of the constructed lexicon (cf. Table 2) is always less than or equal to the total number of unlabeled nodes in the graph because some unlabeled nodes are not able to collect enough mass for acquiring an attribute i.e, ∀a ∈ A : aw < 0 and thus they remain unlabeled (cf. §3.2). 6 Clus Suffix Prefix MorphTrans Proj eu X X X bg X X hr X X X cs X X X X X da X X X en X X X fi X X el X X X X hu X X X it X X X sv X X X Table 3: Features selected and the decision of paradigm projection (Proj) tuned on the development lexicon for each language. Xdenotes a selected feature. English the propagated lexicon is around 240 times larger than the seed lexicon, whereas for Czech, its 8 times larger. We can individually tune how densely connected graph we want for each language depend- ing on the seed size and feature sparsity, which we leave for future work. Selected Edge Features. The features most fre- quently selected across all the languages are the word cluster and the surface morphological transfor- mation features. This essentially translates to hav- ing a graph that consists of small connected com- ponents of words having the same lemma (discov- ered in an unsupervised manner) with semantic links connecting such components using word cluster fea- tures. 
Suffix features are useful for highly inflected languages like Czech and Greek, while the prefix feature is only useful for Czech. Overall, the se- lected edge features for different languages corre- spond well to the morphological structure of these languages (Dryer, 2013). Corpus Baseline. We compare our results to a corpus-based method of obtaining morpho-syntactic lexicons. We hypothesize that if we use a morpho- logical tagger of reasonable quality to tag the entire wikipedia corpus of a language and take the union of all the attributes for a word type across all its occur- rences in the corpus, then we can acquire all possible attributes for a given word. Hence, producing a lex- icon of reasonable quality. Moore (2015) used this technique to obtain a high quality tag dictionary for words Corpus Propagation eu 3409 54.0 57.5 bg 2453 66.4 73.6 hr 1054 69.7 71.6 cs 14532 79.1 80.8 da 1059 68.2 78.1 en 3142 57.2 72.0 fi 2481 58.2 68.2 el 1093 72.2 82.4 hu 1004 64.9 70.9 it 1244 78.8 81.7 sv 3134 69.8 80.7 avg. 3146 67.1 74.3 Table 4: Micro-averaged F1 score (%) for prediction of lexical attributes on the test set using our propagation algorithm (Propagation) and the corpus-based baseline (Corpus). Also, shown are the no. of words in test set. words F1 cs 115,218 87.5 fi 39,856 71.9 hu 135,386 79.7 avg. 96,820 79.7 Table 5: Micro-averaged F1 score (%) for prediction of lexical attributes on the test lexicon of human-curated lexicons. POS-tagging. We thus train a morphological tagger (detail in §5.1) on the training portion of the depen- dency treebank and use it to tag the entire wikipedia corpus. For every word, we add an attribute to the lexicon if it has been seen at least k times for the word in the corpus, where k ∈ [2,20]. This thresh- old on the frequency of the word-attribute pair helps prevent noisy judgements. We tune k for each lan- guage on the development set and report results on the test set in Table 4. We call this method the Cor- pus baseline. It can be seen that for every language we outperform this baseline, which on average has an F1 score of 67.1%. 4.2 Manually Curated Lexicons We have now showed that its possible to automat- ically construct large lexicons from smaller seed lexicons. However, the seed lexicons used in §4.1 have been artifically constructed from aggregating attributes of word types over the treebank. Thus, it 7 word exchange cluster∗ lowercase(word) capitalization {1,2,3}-g suffix∗ digit {1,2,3}-g prefix∗ punctuation Table 6: Features used to train the morphological tagger on the universal dependency treebank. ∗:on for word off- sets {-2, -1, 0, 1, 2}. Conjunctions of the above are also included. can be argued that these constructed lexicons might not be complete i.e, the lexicon might not exhibit all possible attributes for a given word. On the other hand, manually curated lexicons are unavailable for many languages, inhibiting proper evaluation. To test the utility of our approach on manually cu- rated lexicons, we investigate publicly available lex- icons for Finnish (Pirinen, 2011), Czech (Hajič and Hladká, 1998) and Hungarian (Trón et al., 2006). We eliminate numbers and punctuation from all lex- icons. For each of these languages, we select 10000 words for training and the rest of the word types for evaluation. We train models obtained in §4.1 for a given language using suffix, brown and morpholog- ical transformation features with paradigm projec- tion. The only difference is the source of the seed lexicon and test set. 
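Both the treebank-derived and the manually curated evaluations report micro-averaged F1 over attribute predictions, aggregating true positives, false positives and false negatives over all attributes as described in §4.1. The following sketch shows one way such a metric can be computed; the dictionary-of-sets data layout is an assumption for illustration.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over attributes, as reported in Tables 4 and 5.

    gold and predicted map each test word to a set of attribute strings;
    counts are aggregated over all attributes of all words.
    """
    tp = fp = fn = 0
    for word, gold_attrs in gold.items():
        pred_attrs = predicted.get(word, set())
        tp += len(gold_attrs & pred_attrs)
        fp += len(pred_attrs - gold_attrs)
        fn += len(gold_attrs - pred_attrs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# toy usage: one correct attribute, one missed, one spurious
gold = {"played": {"POS:VERB", "TENSE:PAST"}}
pred = {"played": {"POS:VERB", "VFORM:PART"}}
print(round(micro_f1(gold, pred), 2))  # -> 0.5
```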
Results are reported in Table 5 averaged over 10 different randomly selected seed set for every language. For each language we ob- tain more than 70% F1 score and on an average ob- tain 79.7%. Critically, the F1 score on human cu- rated lexicons is higher for each language than the treebank constructed lexicons, in some cases as high as 9% absolute. This shows that the average 74.3% F1 score across all 11 languages is likely underesti- mated. 5 Extrinsic Evaluation We now show that the automatically generated lex- icons provide informative features that are useful in two downstream NLP tasks: morphological tagging (§5.1) and syntactic dependency parsing (§5.2). 5.1 Morphological Tagging Morphological tagging is the task of assigning a morphological reading to a token in context. The morphological reading consists of features such as part of speech, case, gender, person, tense etc. None Seed Propagation eu 84.1 84.4 85.2 bg 94.2 94.6 95.9 hr 92.5 93.6 93.2 cs 96.8 97.1 97.1 da 96.4 97.1 97.3 en 94.4 94.7 94.8 fi 92.8 93.6 94.0 el 93.4 94.6 94.2 hu 91.7 92.3 93.5 it 96.8 97.1 97.1 sv 95.4 96.5 96.5 avg. 93.5 94.2 94.5 Table 7: Macro-averaged F1 score (%) for morphologi- cal tagging: without using any lexicon (None), with seed lexicon (Seed), with propagated lexicon (Propagation). (Oflazer and Kuruöz, 1994; Hajič and Hladká, 1998). The model we use is a standard atomic se- quence classifier, that classifies the morphological bundle for each word independent of the others (with the exception of features derived from these words). Specifically, we use a linear SVM model classifier with hand tuned features. This is similar to com- monly used analyzers like SVMTagger (Giménez and Marquez, 2004) and MateTagger (Bohnet and Nivre, 2012). Our taggers are trained in a language independent manner (Hajič, 2000; Smith et al., 2005; Müller et al., 2013). The list of features used in training the tagger are listed in Table 6. In addition to the stan- dard features, we use the morpho-syntactic attributes present in the lexicon for every word as features in the tagger. As shown in Müller and Schuetze (2015), this is typically the most important feature for mor- phological tagging, even more useful than clusters or word embeddings. While predicting the contex- tual morphological tags for a given word, the mor- phological attributes present in the lexicon for the current word, the previous word and the next word are used as features. We use the same 11 languages from the univer- sal dependency treebanks (Agić et al., 2015) that contain morphological tags to train and evaluate the morphological taggers. We use the pre-specified train/dev/test splits that come with the data. Ta- ble 7 shows the macro-averaged F1 score over all 8 None Seed Propagation eu 60.5 62.3 62.9 bg 78.3 78.8 79.3 hr 72.8 74.7 74.7 cs 78.3 78.4 78.4 da 67.5 69.4 70.1 en 74.4 74.1 74.4 fi 66.1 67.4 67.9 el 75.0 75.6 75.8 hu 67.6 69.0 71.1 it 82.4 82.8 83.1 sv 69.7 70.1 70.1 avg. 72.0 73.0 73.5 Table 8: Labeled accuracy score (LAS, %) for depen- dency parsing: without using any lexicon (None), with seed (Seed), with propagated lexicon (Propagation). attributes for each language on the test lexicon. The three columns show the F1 score of the tagger when no lexicon is used; when the seed lexicon derived from the training data is used; and when label prop- agation is applied. Overall, using lexicons provides a significant im- provement in accuracy, even when just using the seed lexicon. 
For 9 out of 11 languages, the high- est accuracy is obtained using the lexicon derived from graph propagation. In some cases the gain is quite substantial, e.g., 94.6% → 95.9% for Bulgar- ian. Overall there is 1.0% and 0.3% absolute im- provement over the baseline and seed resp., which corresponds roughly to a 15% and 5% relative re- duction in error. It is not surprising that the seed lexicon performs on par with the derived lexicon for some languages, as it is derived from the train- ing corpus, which likely contains the most frequent words of the language. 5.2 Dependency Parsing We train dependency parsers for the same 11 uni- versal dependency treebanks that contain the mor- phological layer (Agić et al., 2015). We again use the supplied train/dev/test split of the dependency treebank to develop the models. Our parsing model is the transition-based parsing system of Zhang and Nivre (2011) with identical features and a beam of size 8. We augment the features of Zhang and Nivre Figure 3: Micro-average F1 score on test lexicon while using varying seed sizes for cs, hu and fi. (2011) in two ways: using the context-independent morphological attributes present in the different lex- icons; and using the corresponding morphological taggers from §5.1 to generate context-dependent at- tributes. For each of the above two kinds of features, we fire the attributes for the word on top of the stack and the two words on at the front of the buffer. Ad- ditionally we take the cross product of these features between the word on the top of the stack and at the front of the buffer. Table 8 shows the labeled accuracy score (LAS) for all languages. Overall, the generated lexicon gives an improvement of absolute 1.5% point over the baseline (5.3% relative reduction in error) and 0.5% over the seed lexicon on an average across 11 languages. Critically this improvement holds for 10/11 languages over the baseline and 8/11 lan- guages over the system that uses seed lexicon only. 6 Further Analysis In this section we further investigate our model and results in detail. Size of seed lexicon. We first test how the size of the seed lexicon affects performance of attribute prediction on the test set. We use the manually constructed lexicons described in §4.2 for experi- ments. For each language, instead of using the full seed lexicon of 10000 words, we construct sub- sets of this lexicon by taking 1000 and 5000 ran- domly sampled words. We then train models ob- tained in §4.1 on these lexicons and plot the perfor- mance on the test set in Figure 3. On average across 9 Word Attributes en study (seed) POS:Verb, VForm:Fin, Mood:Ind, Tense:Pres, Num:Sing, POS:Noun studied POS:Verb, VForm:Fin, Mood:Ind, Tense:Past, VForm:Part taught POS:Verb, VForm:Fin, Mood:Ind, Tense:Past, VForm:Part, Voice:Pass it tavola (seed) POS:Noun, Gender:Fem, Num:Sing tavoli POS:Noun, Gender:Masc, Num:Plur divano POS:Noun, Gender:Masc, Num:Sing Table 9: Attributes induced for words which are semantically or syntactically related to a word in the seed lexicon for English and Italian. VFORM:GER NUM:PLUR Clus:105 + Clus:19 + Clus:77 + Clus:97 + Clus:44 + Clus:177 + suffix:ing:{null} - suffix:ies:y - suffix:ping:{null} - suffix:gs:g - suffix:ing:er - suffix:ons:on - Table 10: Highest (upper half) and lowest (lower half) weighted features (with their sign) for predicting a given attribute of English words. 
three languages, we observe that the absolute perfor- mance improvement from 1000 to 5000 seed words is ≈10% whereas it reduces to ≈2% from 5000 to 10000 words. Feature analysis. Table 10 shows the highest and the lowest weighted features for predicting a given attribute of English words. The highest weighted features for both VFORM:GER and NUM:PLUR are word clusters, indicating that word clusters exhibit strong syntactic and semantic coherence. More interestingly, it can be seen that for predicting VFORM:GER i.e, continuous verb forms, the lowest weighted features are those morphological transfor- mations that substitute “ing” with something else. Thus, if there exists an edge between the words studying and study, containing the feature: suf- fix:ing:{null}, the model would correctly predict that studying is VFORM:GER as study is not so and the negative feature weight can flip the label values. The same observation holds true for NUM:PLUR. Feature ablation. One key question is which of the features in our graph are important for project- ing morphological attribute-values. Table 3 suggests that this is language specific, which is intuitive, as cs hu fi S + C + MT 87.5 79.9 71.6 S + C 86.5 78.8 68.2 S + MT 85.7 77.0 68.7 C + MT 75.7 57.4 62.2 S + C + MT + P 86.7 66.0 61.3 Table 11: Feature ablation study for induced lexicons evaluated on manually curated gold lexicons. Reported scores are micro-averaged F1 score (%) for prediction of lexical attributes. S = suffix; P = prefix; C = clusters; and MT = morphological transformations. morphology can be represented more or less regu- larly through the surface form depending on the lan- guage. To understand this, we did a feature ablation study for the three languages with manually curated lexicons (§4.2) using the same feature set as before: clusters, suffix and morphological transformations with paradigm projection. We then leave out each feature to measure how performance drops. Unlike §4.2, we do not average over 10 runs but use a sin- gle static graph where features (edges) are added or removed as necessary. Table 11 contains the results. Critically, all fea- tures are required for top accuracy across all lan- guages and leaving out suffix features has the most detrimental effect. This is not surprising considering all three language primarily express morphological properties via suffixes. Furthermore, suffix features help to connect the graph and assist label propaga- tion. Note that the importance of suffix features here is in contrast to the evaluation on treebank derived lexicons in §4.1, where suffix features were only se- lected for 4 out of 11 languages based on the devel- opment data (Table 3), and not for Hungarian and Finnish. This could be due to the nature of the lex- 10 icons derived from treebanks versus complete lexi- cons constructed by humans. Additionally, we also added back prefix features and found that for all languages, this resulted in a drop in accuracy, particularly for Finnish and Hun- garian. The primary reason for this is that prefix fea- tures often create spurious edges in the graph. This in and of itself is not a problem for our model, as the edge weights should learn to discount this feature. However, the fact that we sample edges to make in- ference tractable means that more informative edges could be dropped in favor of those that are only con- nected via a prefix features. Prediction examples. Table 9 shows examples of predictions made by our model for English and Ital- ian. 
For each language, we first select a random word from the seed lexicon, then we pick one syn- tactic and one semantically related word to the se- lected word from the set of unlabeled words. For e.g., in Italian tavola means table, whereas tavoli is the plural form and divano means sofa. We correctly identify attributes for these words. 7 Related Work We now review the areas of related work. Lexicon generation. Eskander et al. (2013) con- struct morpho-syntactic lexicons by incrementally merging inflectional classes with shared morpholog- ical features. Natural language lexicons have often been created from smaller seed lexcions using var- ious methods. Thelen and Riloff (2002) use pat- terns extracted over a large corpus to learn semantic lexicons from smaller seed lexicons using bootstrap- ping. Alfonseca et al. (2010) use distributional simi- larity scores across instances to propagate attributes using random walks over a graph. Das and Smith (2012) learn potential semantic frames for unknown predicates by expanding a seed frame lexicon. Sen- timent lexicons containing semantic polarity labels for words and phrases have been created using boot- strapping and graph-based learning (Banea et al., 2008; Mohammad et al., 2009; Velikovich et al., 2010; Takamura et al., 2007; Lu et al., 2011). Graph-based learning. In general, graph-based semi-supervised learning is heavily used in NLP (Talukdar and Cohen, 2013; Subramanya and Taluk- dar, 2014). Graph-based learning has been used for class-instance acquisition (Talukdar and Pereira, 2010), text classification (Subramanya and Bilmes, 2008), summarization (Erkan and Radev, 2004), structured prediction problems (Subramanya et al., 2010; Das and Petrov, 2011; Garrette et al., 2013) etc. Our work differs from most of these approaches in that we specifically learn how different features shared between the nodes can correspond to either the propagation of an attribute or an inversion of the attribute value (cf. equ 1). In terms of the ca- pability of inverting an attribute value, our method is close to Goldberg et al. (2007), who present a framework to include dissimilarity between nodes and Talukdar et al. (2012), who learn which edges can be excluded for label propagation. In terms of featurizing the edges, our work resembles previous work which measured similarity between nodes in terms of similarity between the feature types that they share (Muthukrishnan et al., 2011; Saluja and Navrátil, 2013). Our work is also related to graph- based metric learning, where the objective is to learn a suitable distance metric between the nodes of a graph for solving a given problem (Weinberger et al., 2005; Dhillon et al., 2012). Morphology. High morphological complexity ex- acerbates the problem of feature sparsity in many NLP applications and leads to poor estimation of model parameters, emphasizing the need of mor- phological analysis. Morphological analysis en- compasses fields like morphological segmentation (Creutz and Lagus, 2007; Demberg, 2007; Snyder and Barzilay, 2008; Poon et al., 2009; Narasimhan et al., 2015), and inflection generation (Yarowsky and Wicentowski, 2000; Wicentowski, 2004). Such models of segmentation and inflection generation are used to better understand the meaning and re- lations between words. Our task is complementary to the task of morphological paradigm generation. 
Paradigm generation requires generating all possible morphological forms of a given base-form according to different linguistic transformations (Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Ahlberg et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015; Faruqui et al., 2016), whereas our task requires iden- tifying linguistic transformations between two dif- 11 ferent word forms. Low-resourced languages. Our algorithm can be used to generate morpho-syntactic lexicons for low- resourced languages, where the seed lexicon can be constructed, for example, using crowdsourcing (Callison-Burch and Dredze, 2010; Irvine and Kle- mentiev, 2010). Morpho-syntactic resources have been developed for east-european languages like Slovene (Dzeroski et al., 2000; Erjavec, 2004), Bulgarian (Simov et al., 2004) and highly agglu- tinative languages like Turkish (Sak et al., 2008). Morpho-syntactic lexicons are crucial components in acousting modeling and automatic speech recog- nition, where they have been developed for low- resourced languages (Huet et al., 2008; Besacier et al., 2014). One alternative method to extract morphosyntac- tic lexicons is via parallel data (Das and Petrov, 2011). However, such methods assume that both the source and target langauges are isomorphic with respect to morphology. This can be the case with attributes like coarse part-of-speech or case, but is rarely true for other attributes like gender, which is very language specific. 8 Future Work There are three major ways in which the current model can be possibly improved. Joint learning and propagation. In the current model, we are first learning the weights in a su- pervised manner (§3.1) and then propagating labels across nodes in a semi-supervised step with fixed feature weights (§3.2). These can also be performed jointly: perform one iteration of weight learning, propagate labels using these weights, perform an- other iteration of weight learning assuming empir- ical labels as gold labels and continue to learn and propagate until convergence. This joint learning would be slower than the current approach as prop- agating labels across the graph is an expensive step. Multi-label classifcation. We are currently using the binary relevance method which trains a binary classifier for every attribute independently (Godbole and Sarawagi, 2004; Zhang and Zhou, 2005) with paradigm projection as a post-processing step (§3.3). Thus we are accounting for attribute correlations only at the end. We can instead model such cor- relations as constraints during the learning step to obtain better solutions (Ghamrawi and McCallum, 2005; Tsoumakas and Katakis, 2006; Fürnkranz et al., 2008; Read et al., 2011). Richer feature set. In addition our model can ben- efit from a richer set of features. Word embeddings can be used to connect word node which are similar in meaning (Mikolov et al., 2013). We can use ex- isting morphological segmentation tools to discover the morpheme and inflections of a word to connect it to word with similar inflections which might be bet- ter than the crude suffix or prefix features. We can also use rich lexical resources like Wiktionary11 to extract relations between words that can be encoded on our graph edges. 9 Conclusion We have presented a graph-based semi-supervised method to construct large annotated morpho- syntactic lexicons from small seed lexicons. Our method is language independent and we have con- structed lexicons for 11 different languages. 
We showed that the lexicons thus constructed help im- prove performance in morphological tagging and de- pendency parsing, when used as features. Acknowledgement This work was performed when the first author was an intern at Google. We thank action editor Alexander Clark, and the three anonymous review- ers for their helpful suggestions in preparing the manuscript. We thank David Talbot for his help in developing the propagation framework and help- ful discussions about evaluation. We thank Avneesh Saluja, Chris Dyer and Partha Pratim Talukdar for their comments on drafts of this paper. References Željko Agić, Maria Jesus Aranzabe, Aitziber Atutxa, Cristina Bosco, Jinho Choi, Marie-Catherine de Marn- effe, Timothy Dozat, Richárd Farkas, Jennifer Foster, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, Jan Hajič, Anders Trærup Johannsen, Jenna 11https://www.wiktionary.org/ 12 Kanerva, Juha Kuokkala, Veronika Laippala, Alessan- dro Lenci, Krister Lindén, Nikola Ljubešić, Teresa Lynn, Christopher Manning, Héctor Alonso Martı́nez, Ryan McDonald, Anna Missilä, Simonetta Monte- magni, Joakim Nivre, Hanna Nurmi, Petya Osenova, Slav Petrov, Jussi Piitulainen, Barbara Plank, Prokopis Prokopidis, Sampo Pyysalo, Wolfgang Seeker, Moj- gan Seraji, Natalia Silveira, Maria Simi, Kiril Simov, Aaron Smith, Reut Tsarfaty, Veronika Vincze, and Daniel Zeman. 2015. Universal dependencies 1.1. LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proc. of EACL. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proc. of NAACL. Enrique Alfonseca, Marius Pasca, and Enrique Robledo- Arnuncio. 2010. Acquisition of instance attributes via labeled and related instances. In Proc. of SIGIR. Ebru Arisoy, Murat Saraçlar, Brian Roark, and Izhak Shafran. 2010. Syntactic and sub-lexical features for Turkish discriminative language models. In Proc. of ICASSP. Carmen Banea, Janyce M. Wiebe, and Rada Mihalcea. 2008. A bootstrapping method for building subjectiv- ity lexicons for languages with scarce resources. In Proc. of LREC. Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. 2006. Label propagation and quadratic criterion. In Semi-Supervised Learning. MIT Press. Laurent Besacier, Etienne Barnard, Alexey Karpov, and Tanja Schultz. 2014. Automatic speech recogni- tion for under-resourced languages: A survey. Speech Communication, 56:85–100. Bernd Bohnet and Joakim Nivre. 2012. A transition- based system for joint part-of-speech tagging and la- beled non-projective dependency parsing. In Proc. of EMNLP. Chris Callison-Burch and Mark Dredze. 2010. Creat- ing speech and language data with amazon’s mechan- ical turk. In Proc. of NAACL Workshop on Creating Speech and Language Data with Amazon’s Mechani- cal Turk. Alexander Clark. 2003. Combining distributional and morphological information for part of speech induc- tion. In Proc. of EACL. Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP), 4(1):3. Dipanjan Das and Slav Petrov. 2011. Unsupervised part- of-speech tagging with bilingual graph-based projec- tions. In Proc. of ACL. Dipanjan Das and Noah A. Smith. 2012. Graph-based lexicon expansion with sparsity-inducing penalties. In Proc. of NAACL. 
Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal stan- ford dependencies: A cross-linguistic typology. In Proceedings of LREC. Vera Demberg. 2007. A language-independent unsuper- vised model for morphological segmentation. In Proc. of ACL. Pascal Denis and Benoı̂t Sagot. 2009. Coupling an anno- tated corpus and a morphosyntactic lexicon for state- of-the-art POS tagging with less human effort. In Proc. of PACLIC. Paramveer S. Dhillon, Partha Talukdar, and Koby Cram- mer. 2012. Metric learning for graph-based domain adaptation. In Proc. of COLING. Markus Dreyer and Jason Eisner. 2011. Discover- ing morphological paradigms from plain text using a Dirichlet process mixture model. In Proc. of EMNLP. Matthew S. Dryer. 2013. Prefixing vs. Suffixing in Inflec- tional Morphology. Max Planck Institute for Evolu- tionary Anthropology. John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159. Kais Dukes and Nizar Habash. 2010. Morphological annotation of quranic Arabic. In Proc. of LREC. Greg Durrett and John DeNero. 2013. Supervised learn- ing of complete morphological paradigms. In Proc. of NAACL. Saso Dzeroski, Tomaz Erjavec, and Jakub Zavrel. 2000. Morphosyntactic tagging of Slovene: Evaluating tag- gers and tagsets. In Proc. of LREC. Tomaz Erjavec. 2004. Multext-east version 3: Multilin- gual morphosyntactic specifications, lexicons and cor- pora. In Proc. of LREC. Günes Erkan and Dragomir R. Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text sum- marization. Journal of Artificial Intelligence Re- search, 22(1):457–479. Ramy Eskander, Nizar Habash, and Owen Rambow. 2013. Automatic extraction of morphological lexicons from morphologically annotated corpora. In Proc. of EMNLP. Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with se- mantic generalization. In Proc. of KONVENS. 13 Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection gener- ation using character sequence to sequence learning. arXiv:1104.2086. Chun-Sung Ferng and Hsuan-Tien Lin. 2011. Multi- label classification with error-correcting codes. In Proc. of ACML. Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencı́a, and Klaus Brinker. 2008. Multilabel classifi- cation via calibrated label ranking. Machine learning, 73(2):133–153. Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proc. of ACL. Nadia Ghamrawi and Andrew McCallum. 2005. Collec- tive multi-label classification. In Proc. of CIKM. Jesús Giménez and Lluı́s Marquez. 2004. Svmtool: A general POS tagger generator based on support vector machines. In Proc. of LREC. Shantanu Godbole and Sunita Sarawagi. 2004. Discrim- inative methods for multi-labeled classification. In Proc. of KDD. Andrew B. Goldberg, Xiaojin Zhu, and Stephen J. Wright. 2007. Dissimilarity in graph-based semi- supervised classification. In Proc. of AISTATS. Yoav Goldberg, Reut Tsarfaty, Meni Adler, and Michael Elhadad. 2009. Enhancing unlexicalized parsing per- formance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. In Proc. of EACL. Spence Green and John DeNero. 2012. 
A class-based agreement model for generating accurately inflected translations. In Proc. of ACL. Jan Hajič and Barbora Hladká. 1998. Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset. In Proc. of COLING. Jan Hajič. 2000. Morphological tagging: Data vs. dictio- naries. In Proc. of NAACL. Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. 2009. Multi-label prediction via compressed sensing. In Proc. of NIPS. Stéphane Huet, Guillaume Gravier, and Pascale Sébillot. 2008. Morphosyntactic resources for automatic speech recognition. In Proc. of LREC. Ann Irvine and Alexandre Klementiev. 2010. Using me- chanical turk to annotate lexicons for less commonly used languages. In Proc. of NAACL Workshop on Cre- ating Speech and Language Data with Amazon’s Me- chanical Turk. Ernst Ising. 1925. Beitrag zur theorie des ferromag- netismus. Zeitschrift für Physik A Hadrons and Nu- clei, 31(1):253–258. Reinhard Kneser and Hermann Ney. 1993. Improved clustering techniques for class-based statistical lan- guage modelling. In Proc. of Eurospeech. Dimitrios Kokkinakis, Maria Toporowska-Gronostaj, and Karin Warmenius. 2000. Annotating, disambiguat- ing & automatically extending the coverage of the Swedish SIMPLE lexicon. In Proc. of LREC. Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proc. of ACL. Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. Yue Lu, Malu Castellanos, Umeshwar Dayal, and ChengXiang Zhai. 2011. Automatic construction of a context-aware sentiment lexicon: An optimization ap- proach. In Proc. of WWW. Sven Martin, Jörg Liermann, and Hermann Ney. 1998. Algorithms for bigram and trigram word clustering. Speech communication, 24(1):19–37. Ryan McDonald, Joakim Nivre, Yvonne Quirmbach- Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Os- car Täckström, Claudia Bedini, Núria B. Castelló, and Jungmee Lee. 2013. Universal dependency annota- tion for multilingual parsing. In Proc. of ACL. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proc. of NAACL. Einat Minkov, Kristina Toutanova, and Hisami Suzuki. 2007. Generating complex morphology for machine translation. In Proc. of ACL. Saif Mohammad, Cody Dunne, and Bonnie Dorr. 2009. Generating high-coverage semantic orientation lexi- cons from overtly marked words and a thesaurus. In Proc. of EMNLP. Robert Moore. 2015. An improved tag dictionary for faster part-of-speech tagging. In Proc. of EMNLP. Thomas Müller and Hinrich Schuetze. 2015. Robust morphological tagging with word representations. In Proceedings of NAACL. Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proc. of EMNLP. Pradeep Muthukrishnan, Dragomir Radev, and Qiaozhu Mei. 2011. Simultaneous similarity learning and feature-weight learning for document clustering. In Proc. of TextGraphs. Karthik Narasimhan, Regina Barzilay, and Tommi Jaakkola. 2015. An unsupervised method for uncov- ering morphological chains. Transactions of the Asso- ciation for Computational Linguistics, 3:157–167. 14 Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proc. of NAACL. Sonja Nießen and Hermann Ney. 2004. 
Statistical ma- chine translation with scarce resources using morpho- syntactic information. Computational Linguistics, 30(2). Kemal Oflazer and Ìlker Kuruöz. 1994. Tagging and morphological disambiguation of Turkish text. In Proc. of ANLP. Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversa- tional text with word clusters. In Proc. of NAACL. Tommi A Pirinen. 2011. Modularisation of finnish finite- state language description—towards wide collabora- tion in open source development of a morphological analyser. In Proc. of NODALIDA. Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proc. of NAACL. Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proc. of EMNLP. Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359. Haşim Sak, Tunga Güngör, and Murat Saraçlar. 2008. Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In Proc. of ANLP. Avneesh Saluja and Jirı Navrátil. 2013. Graph-based unsupervised learning of word similarities using het- erogeneous feature types. In Proc. of TextGraphs. Helmut Schmid. 1994. Probabilistic part-of-speech tag- ging using decision trees. In Proc. of the International Conference on New Methods in Language Processing. Kiril Ivanov Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, and Dimitar Doikoff. 2004. A language resources infrastructure for Bulgarian. In Proc. of LREC. Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-based morphological disambiguation with random fields. In Proc. of EMNLP. Benjamin Snyder and Regina Barzilay. 2008. Unsuper- vised multilingual learning for morphological segmen- tation. In Proc. of ACL. Radu Soricut and Franz Och. 2015. Unsupervised mor- phology induction using word embeddings. In Proc. of NAACL. Amarnag Subramanya and Jeff Bilmes. 2008. Soft- supervised learning for text classification. In Proc. of EMNLP. Amarnag Subramanya and Partha Pratim Talukdar. 2014. Graph-based semi-supervised learning. Synthesis Lec- tures on Artificial Intelligence and Machine Learning, 8(4). Amarnag Subramanya, Slav Petrov, and Fernando Pereira. 2010. Efficient graph-based semi-supervised learning of structured tagging models. In Proc. of EMNLP. Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proc. of NAACL. Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2007. Extracting semantic orientations of phrases from dictionary. In Proc. of NAACL. Partha Pratim Talukdar and William Cohen. 2013. Scaling graph-based semi-supervised learning to large number of labels using count-min sketch. In Proc. of AISTATS. Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proc. of ACL. Partha Pratim Talukdar, Derry Wijaya, and Tom Mitchell. 2012. Acquiring temporal constraints between rela- tions. In Proc. of CIKM. Michael Thelen and Ellen Riloff. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proc. of ACL. Viktor Trón, Péter Halácsy, Péter Rebrus, András Rung, Péter Vajda, and Eszter Simon. 2006. Morphdb.hu: Hungarian lexical database and morphological gram- mar. In Proc. 
of LREC. Grigorios Tsoumakas and Ioannis Katakis. 2006. Multi- label classification: An overview. Dept. of Informat- ics, Aristotle University of Thessaloniki, Greece. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proc. of ACL. Jakob Uszkoreit and Thorsten Brants. 2008. Distributed word clustering for large scale class-based language modeling in machine translation. In Proc. of ACL. Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Han- nan, and Ryan McDonald. 2010. The viability of web- derived polarity lexicons. In Proc. of NAACL. Fei Wang, Shijun Wang, Changshui Zhang, and Ole Winther. 2007. Semi-supervised mean fields. In Proc. of AISTATS. Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. 2005. Distance metric learning for large mar- gin nearest neighbor classification. In Proc. of NIPS. Richard Wicentowski. 2004. Multilingual noise-robust supervised morphological analysis using the word- frame model. In Proc. of SIGPHON. 15 David Yarowsky and Richard Wicentowski. 2000. Min- imally supervised morphological analysis by multi- modal alignment. In Proc. of ACL. Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proc. of ACL. Yi Zhang and Jeff G. Schneider. 2011. Multi-label output codes using canonical correlation analysis. In Proc. of AISTATS. Min-Ling Zhang and Zhi-Hua Zhou. 2005. A k-nearest neighbor based algorithm for multi-label classifica- tion. In Proc. of IEEE Conference on Granular Com- puting. Xiaojin Zhu. 2005. Semi-supervised Learning with Graphs. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, USA. AAI3179046. 16