Transactions of the Association for Computational Linguistics, 1 (2013) 255–266.

Minimally-Supervised Morphological Segmentation using Adaptor Grammars

Kairit Sirts, Institute of Cybernetics, Tallinn University of Technology, sirts@phon.ioc.ee
Sharon Goldwater, School of Informatics, The University of Edinburgh, sgwater@inf.ed.ac.uk

Abstract

This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation. We compare three training methods: unsupervised training, semi-supervised training, and a novel model selection method. In the model selection method, we train unsupervised Adaptor Grammars using an over-articulated metagrammar, then use a small labelled data set to select which potential morph boundaries identified by the metagrammar should be returned in the final output. We evaluate on five languages and show that semi-supervised training provides a boost over unsupervised training, while the model selection method yields the best average results over all languages and is competitive with state-of-the-art semi-supervised systems. Moreover, this method provides the potential to tune performance according to different evaluation metrics or downstream tasks.

1 Introduction

Research into unsupervised learning of morphology has a long history, starting with the work of Harris (1951). While early research was mostly motivated by linguistic interests, more recent work in NLP often aims to reduce data sparsity in morphologically rich languages for tasks such as automatic speech recognition, statistical machine translation, or automatic text generation. For these applications, however, completely unsupervised systems may not be ideal if even a small amount of segmented training data is available. In this paper, we explore the use of Adaptor Grammars (Johnson et al., 2007) for minimally supervised morphological segmentation.
Adaptor Grammars (AGs) are a nonparametric Bayesian modelling framework that can learn latent tree structures over an input corpus of strings. For example, they can be used to define a morpholog- ical grammar where each word consists of zero or more prefixes, a stem, and zero or more suffixes; the actual forms of these morphs (and the segmentation of words into morphs) are learned from the data. In this general approach AGs are similar to many other unsupervised morphological segmentation systems, such as Linguistica (Goldsmith, 2001) and the Mor- fessor family (Creutz and Lagus, 2007). A major difference, however, is that the morphological gram- mar is specified as an input to the program, rather than hard-coded, which allows different grammars to be explored easily. For the task of segmenting utterances into words, for example, Johnson and col- leagues have experimented with grammars encoding different kinds of sub-word and super-word structure (e.g., syllables and collocations), showing that the best grammars far outperform other systems on the same corpora (Johnson, 2008a; Johnson and Goldwa- ter, 2009; Johnson and Demuth, 2010). These word segmentation papers demonstrated both the power of the AG approach and the syner- gistic behavior that occurs when learning multiple levels of structure simultaneously. However, the best- performing grammars were selected using the same corpus that was used for final testing, and each paper dealt with only one language. The ideal unsuper- vised learner would use a single grammar tuned on 255 one or more development languages and still perform well on other languages where development data is unavailable. Indeed, this is the basic principle be- hind Linguistica and Morfessor. However, we know that different languages can have very different mor- phological properties, so using a single grammar for all languages may not be the best approach if there is a principled way to choose between grammars. Though AGs make it easy to try many different pos- sible grammars, the process of proposing and testing plausible options can still be time-consuming. In this paper, we propose a novel method for au- tomatically selecting good morphological grammars for different languages (English, Finnish, Turkish, German, and Estonian) using a small amount of gold-segmented data (1000 word types). We use the AG framework to specify a very general binary- branching grammar of depth four with which we learn a parse tree of each word that contains several possible segmentation splits for the word. Then, we use the gold-segmented data to learn, for each lan- guage, which of the proposed splits from the original grammar should actually be used in order to best segment that language. We evaluate our approach on both a small devel- opment set and the full Morpho Challenge test set for each language—up to three million word types. In doing so, we demonstrate that using the posterior grammar of an AG model to decode unseen data is a feasible way to scale these models to large data sets. We compare to several baselines which use the annotated data to different degrees: parameter tuning, grammar tuning, supervised training, or no use of annotated data. In addition to existing approaches— unsupervised and semi-supervised Morfessor, unsu- pervised Morsel (Lignos, 2010), and unsupervised AGs—we also show how to use the annotated data to train semi-supervised AGs (using the data to accumu- late rule statistics rather than for grammar selection). 
The grammar selection method yields comparable results to the best of these other approaches. To summarize, our contributions in this paper are: 1) scaling AGs to large data sets by using the poste- rior grammar to define an inductive model; 2) demon- strating how to train semi-supervised AG models, and showing that this improves morphological segmenta- tion over unsupervised training; and 3) introducing a novel grammar selection method for AG models whose segmentation results are competitive with the best existing systems. Before providing details of our methods and re- sults, we first briefly review Adaptor Grammars. For a formal definition, see Johnson et al. (2007). 2 Adaptor Grammars Adaptor Grammars are a framework for specifying nonparametric Bayesian models that can be used to learn latent tree structures from a corpus of strings. There are two components to an AG model: the base distribution, which is just a PCFG, and the adaptor, which “adapts” the probabilities assigned to individ- ual subtrees under the PCFG model, such that the probability of a subtree under the complete model may be considerably higher than the product of the probabilities of the PCFG rules required to construct it. Although in principle the adaptor can be any func- tion that maps one distribution onto another, Johnson et al. (2007) use a Pitman-Yor Process (PYP) (Pit- man and Yor, 1997) as the adaptor because it acts as a caching model. Under a PYP AG model, the posterior probability of a particular subtree will be roughly proportional to the number of times that sub- tree occurs in the current analysis of the data (with the probability of unseen subtrees being computed under the base PCFG distribution). An AG model can be defined by specifying the CFG rules (the support for the base distribution) and indicating which non-terminals are “adapted”, i.e., can serve as the root of a cached subtree. Given this definition and an input corpus of strings, Markov chain Monte Carlo samplers can be used to infer the posterior distribution over trees (and all hyperparam- eters of the model, including PCFG probabilities in the base distribution and PYP hyperparameters). Any frequently recurring substring (e.g., a common pre- fix) will tend to be parsed consistently, as this permits the model to treat the subtree spanning that string as a cached subtree, assigning it higher probability than under the PCFG distribution. Adaptor Grammars have been applied to a wide variety of tasks, including segmenting utterances into words (Johnson, 2008a; Johnson and Goldwa- ter, 2009; Johnson and Demuth, 2010), classifying documents according to perspective (Hardisty et al., 2010), machine transliteration of names (Huang et 256 al., 2011), native language identification (Wong et al., 2012), and named entity clustering (Elsner et al., 2009). There have also been AG experiments with morphological segmentation, but more as a proof of concept than an attempt to achieve state-of-the-art results (Johnson et al., 2007; Johnson, 2008b). 3 Using AGs for Learning Morphology Originally, the AG framework was designed for un- supervised learning. This section first describes how AGs can be used for unsupervised morphological segmentation, and then introduces two ways to use a small labelled data set to improve performance: semi-supervised learning and grammar selection. 3.1 Unsupervised Adaptor Grammars We define three AG models to use as unsupervised baselines in our segmentation experiments. 
The first of these is very simple:

  Word → Morph+
  Morph → Char+          (1)

The underline notation indicates an adapted non-terminal, and + abbreviates a set of recursive rules, e.g., Word → Morph+ is short for

  Word → Morphs
  Morphs → Morph Morphs
  Morphs → Morph

Grammar 1 (MorphSeq) is just a unigram model over morphs: the Morph symbol is adapted, so the probability of each Morph will be roughly proportional to its (inferred) frequency in the corpus. The grammar specifies no further structural relationships between morphs or inside of morphs (other than a geometric distribution on their length in characters).

Experiments with AGs for unsupervised word segmentation suggest that adding further latent structure can help with learning. Here, we add another layer of structure below the morphs,1 calling the resulting grammar SubMorphs:

  Word → Morph+
  Morph → SubMorph+
  SubMorph → Char+       (2)

1 Because the nonterminal labels are arbitrary, this grammar can also be interpreted as adding another layer on top of morphs, allowing the model to learn morph collocations that encode dependencies between morphs (which themselves have no substructure). However, preliminary experiments showed that the morph-submorph interpretation scored better than the collocation-morph interpretation, hence we chose the corresponding nonterminal names.

For capturing the rules of morphotactics, a grammar with linguistically motivated non-terminals can be created. There are many plausible options and the best-performing grammar may be somewhat language-dependent. Rather than experimenting extensively, we designed our third grammar to replicate as closely as possible the grammar that is implicitly implemented in the Morfessor system. This Compounding grammar distinguishes between prefixes, stems and suffixes, allows compounding, defines the order in which the morphs can occur and also allows the morphs to have inner latent structure:

  Word → Compound+
  Compound → Prefix∗ Stem Suffix∗
  Prefix → SubMorph+
  Stem → SubMorph+
  Suffix → SubMorph+
  SubMorph → Char+       (3)

3.2 Semi-Supervised Adaptor Grammars

The first new use of AGs we introduce is the semi-supervised AG, where we use the labelled data to extract counts of the different rules and subtrees used in the gold-standard analyses. We then run the MCMC sampler as usual over both the unlabelled and labelled data, treating the counts from the labelled data as fixed.

We assume that the labelled data provides a consistent bracketing (no two spans in the bracketing can partially overlap) and that the labels of the spans are compatible with the grammar. However, the bracketing may not specify all levels of structure in the grammar. In our case, we have morpheme bracketings but not, e.g., submorphs. Thus, using the SubMorphs grammar in semi-supervised mode will constrain the sampler so that Morph spans in the labelled data will remain fixed, while the SubMorphs inside those Morphs will be resampled.

The main change made to the AG inference process2 for implementing the semi-supervised AG was to prune out from the sampling distribution any non-terminals that are inconsistent with the spans/labels in the given labelling.

2 We started with Mark Johnson's PYAG implementation, http://web.science.mq.edu.au/~mjohnson/code/py-cfg.tgz, which we also used for our unsupervised and grammar selection experiments.
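To make the pruning step concrete, the sketch below shows one way such a consistency test could look. It is our own illustration rather than the actual py-cfg modification: the representation of a gold bracketing as a mapping from character spans to labels, and the helper names, are assumptions.

```python
def crosses(span, gold_span):
    """True if the two spans partially overlap (neither nested nor disjoint)."""
    (s, e), (gs, ge) = span, gold_span
    return (s < gs < e < ge) or (gs < s < ge < e)

def is_consistent(label, span, gold):
    """Could nonterminal `label` be proposed over `span` of a labelled word?

    `gold` maps annotated character spans (start, end) to their labels,
    e.g. {(0, 4): "Morph", (4, 6): "Morph"} for walk+ed.
    """
    for gspan, glabel in gold.items():
        if crosses(span, gspan):
            return False              # would break a gold bracket
        if span == gspan and label != glabel:
            return False              # right span, incompatible label
    return True

gold = {(0, 4): "Morph", (4, 6): "Morph"}          # walk + ed
assert is_consistent("Morph", (0, 4), gold)        # gold Morph spans stay fixed
assert not is_consistent("Morph", (2, 5), gold)    # crosses the walk|ed boundary
assert is_consistent("SubMorph", (0, 2), gold)     # finer structure is still free
```

In the sampler, any (nonterminal, span) pair that fails such a test is simply removed from the proposal distribution for the labelled word, which is all that the pruning described above needs to mean.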
3.3 AG Select

Both the unsupervised and semi-supervised methods described above assume the definition of a grammar that adequately captures the phenomena being modelled. Although the AG framework makes it easy to experiment with different grammars, these experiments can be time-consuming and require some good guesses as to what a plausible grammar might be. These problems can be overcome by automating the grammar development process to systematically evaluate different grammars and find the best one.

We propose a minimally supervised model selection method, AG Select, that uses the AG framework to automatically identify the best grammar for different languages and data sets. We first define a very general binary-branching CFG grammar for AG training that we call the metagrammar. The metagrammar learns a parse tree for each word where each branch contains a different structure in the word. The granularity of these structures is determined by the depth of this tree. For example, Grammar 4 generates binary trees of depth two and can learn segmentations of up to four segments.

  Word → M1
  Word → M1 M2
  M1 → M11
  M1 → M11 M12
  M2 → M21
  M2 → M21 M22
  M11 → Chars+
  M12 → Chars+
  M21 → Chars+
  M22 → Chars+           (4)

Next we introduce the notion of a morphological template, which is an ordered sequence of non-terminals whose concatenated yields constitute the word and which are used to parse out a specific segmentation of the word. For example, using Grammar 4 the parse tree of the word saltiness is shown in Figure 1. There are four possible templates with four different segmentations: M1 M2 (salt iness), M11 M12 M2 (sal t iness), M1 M21 M22 (salt i ness), and M11 M12 M21 M22 (sal t i ness).

Figure 1: The parse tree generated by the metagrammar of depth 2 for the word saltiness: (Word (M1 (M11 s a l) (M12 t)) (M2 (M21 i) (M22 n e s s))).

The morphological template consisting only of non-terminals from the lowest cached level of the parse tree is expected to have high recall, whereas the template containing the non-terminals just below the Word is expected to have high precision. Our goal is to find the optimal template by using a small labelled data set. The grammar selection process iterates over the set of all templates. For each template, the segmentations of the words in the labelled data set are parsed out and the value of the desired evaluation metric is computed. The template that obtained the highest score is then chosen.

For each language we use a single template to segment all the words in that language. However, even using (say) a four-morph template such as M11 M12 M21 M22, some words may contain fewer morphs because the metagrammar permits either unary or binary branching rules, so some parses may not contain M12 or M2 (and thus M21 M22) spans. Thus, we can represent segmentations of different lengths (from 1 to 2^n, where n is the depth of the metagrammar) with a single template.3

3 We also experimented with selecting different templates for words of different length but observed no improvements over the single template approach.

For our experiments we use a metagrammar of depth four. This grammar allows words to consist of up to 16 segments, which we felt would be enough for any word in the training data. Also, iterating over all the templates of a grammar with bigger depth would not be feasible as the number of different templates increases very rapidly.4

4 The number of templates of each depth can be expressed recursively as N_i = (N_{i-1} + 1)^2, where N_{i-1} is the number of templates in the grammar of depth one less and N_0 = 0.
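As a concrete picture of the template space, the sketch below (our own illustration, not the authors' code) enumerates the templates of a metagrammar of a given depth, using the paper's M1/M11/M12 naming convention, and reproduces the counts given by the recursion in footnote 4.

```python
from itertools import product

def covers(node, depth):
    """All ordered nonterminal sequences that can cover the yield of `node`,
    whose subtree extends `depth` further binary levels below it."""
    if depth == 0:
        return [[node]]
    left = covers(node + "1", depth - 1)    # e.g. M1 -> M11 ...
    right = covers(node + "2", depth - 1)   # ... and M12
    return [[node]] + [l + r for l, r in product(left, right)]

def templates(depth):
    """Morphological templates of a metagrammar with the given depth."""
    return [l + r for l, r in product(covers("M1", depth - 1),
                                      covers("M2", depth - 1))]

print(templates(2))
# [['M1', 'M2'], ['M1', 'M21', 'M22'],
#  ['M11', 'M12', 'M2'], ['M11', 'M12', 'M21', 'M22']]
for d in (1, 2, 3, 4):
    print(d, len(templates(d)))   # 1, 4, 25, 676 -- i.e. N_i = (N_{i-1} + 1)^2
```

At depth four this gives the 676 templates that are searched exhaustively; one level deeper the recursion already yields (676 + 1)^2 = 458,329 templates, which is why larger depths are impractical.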
3.4 Inductive Learning

Previous work on AGs has used relatively small data sets and run the sampler on the entire input corpus (some or all of which is also used for evaluation)—a transductive learning scenario. However, our larger data sets contain millions of word types, where sampling over the whole set is not feasible. For example, 1000 training iterations on 50k word types took about a week on one 2.67 GHz CPU. To solve this problem, we need an inductive learner that can be trained on a small set of data and then used to segment a different larger set.

To create such a learner, we run the sampler on up to 50k word types, and then extract the posterior grammar as a PCFG.5 This grammar contains all the initial CFG rules, plus rules to generate each of the cached subtrees inferred by the sampler. The sampler counts of all rules are normalized to obtain a PCFG, and we can then use a standard CKY parser to decode the remaining data using this PCFG.

5 This can be seen as a form of Structure Compilation (Liang et al., 2008), where the solution found by a more costly model is used to define a less costly model. However, in Liang et al.'s case both models were already inductive.
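A minimal sketch of this extraction step is shown below. It is our own reconstruction under stated assumptions: rule counts are represented as a simple (parent, right-hand side) → count mapping, and each cached subtree is flattened into a single rule from its root to its terminal yield, which is one way to realise the "rules to generate each of the cached subtrees".

```python
from collections import defaultdict

def posterior_pcfg(rule_counts):
    """Normalise rule counts into relative-frequency probabilities, per parent."""
    totals = defaultdict(float)
    for (parent, _rhs), count in rule_counts.items():
        totals[parent] += count
    return {rule: count / totals[rule[0]] for rule, count in rule_counts.items()}

# Toy counts for a MorphSeq-style grammar after sampling (invented numbers).
counts = {
    ("Word", ("Morph", "Word")): 40.0,
    ("Word", ("Morph",)): 60.0,
    ("Morph", ("i", "n", "g")): 15.0,    # a cached subtree, flattened to one rule
    ("Morph", ("Char", "Morph")): 25.0,  # base-grammar rules for novel morphs
    ("Morph", ("Char",)): 10.0,
}
pcfg = posterior_pcfg(counts)
print(pcfg[("Morph", ("i", "n", "g"))])  # 0.3
# The resulting PCFG is what the CKY parser uses to decode word types
# that were never seen by the sampler.
```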
4 Experiments

4.1 Data

We test on languages with a range of morphological complexity: English, Finnish, Turkish, German, and Estonian. For each language we use two small sets of gold-annotated data—a labelled set for semi-supervised training or model selection and a dev set for development results—and one larger gold-annotated dataset for final tests. We also have a large unlabelled training set for each language. Table 1 gives statistics.

             Unlab.   Lab.   Dev    Test
  English    0.9M     1000   1212   16K
  Finnish    2.9M     1000   1494   225K
  Turkish    0.6M     1000   1531   64K
  German     2.3M     1000   785    62K
  Estonian   2.1M     1000   1500   74K

Table 1: Number of word types in our data sets.

The data sets for English, Finnish, Turkish and German are from the Morpho Challenge 2010 competition6 (MC2010). We use the MC2010 training set of 1000 annotated word types as our labelled data, and for our dev sets we collate together the development data from all years of the MC competition. Final evaluation is done on the official MC2010 test sets, which are not public, so we rely on the MC organizers to perform the evaluation. The words in each test set are an unknown subset of the words in the unlabelled corpus, so to evaluate we segmented the entire unlabelled corpus and sent the results to the MC team, who then computed scores on the test words.

6 http://research.ics.aalto.fi/events/morphochallenge2010/datasets.shtml

The Estonian wordlist is gathered from the newspaper texts of a mixed corpus of Estonian.7 Gold standard segmentations of some of these words are available from the Estonian morphologically disambiguated corpus;8 we used these for the test set, with small subsets selected randomly for the labelled and dev sets.

7 http://www.cl.ut.ee/korpused/segakorpus/epl
8 http://www.cl.ut.ee/korpused/morfkorpus/

For semi-supervised tests of the AG Compounding grammar we annotated the morphemes in the English, Finnish and Estonian labelled sets as prefixes, stems or suffixes. We could not do so for Turkish because none of the authors knows Turkish.

4.2 Evaluation

We evaluate our results with two measures: segment border F1-score (SBF1) and EMMA (Spiegler and Monson, 2010). SBF1 is one of the simplest and most popular evaluation metrics for morphological segmentations. It computes F1-score from the precision and recall of ambiguous segment boundaries—i.e., word edges are not counted. It is easy and quick to compute but has the drawback that it gives no credit for one-morpheme words that have been segmented correctly (i.e., are assigned no segment borders). Also, it can only be used on systems and gold standards where the output is just a segmentation of the surface string (e.g., availabil+ity) rather than a morpheme analysis (e.g., available+ity). For this reason we cannot report SBF1 on our German data, whose annotations contain only analyses.

EMMA is a newer measure that addresses both of these issues—correctly segmented one-morpheme words are reflected in the score, and it can evaluate both concatenative and non-concatenative morphology. EMMA works by finding the best one-to-one mapping between the hypothesized and true segments. The induced segments are then replaced with their mappings and, based on that, F1-score on matching segments is calculated. Using EMMA we can evaluate the induced segmentations of German words against gold standard analyses. EMMA has a freely available implementation,9 but is slow to compute because it uses Integer Linear Programming.

9 http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp#eval

For our dev results, we computed both scores using the entire dev set, but for the large test sets, the evaluation is done on batches of 1000 word types selected randomly from the test set. This procedure is repeated 10 times and the average is reported, just as in the MC2010 competition (Kohonen et al., 2010a).

4.3 Baseline Models

We compare our AG models to several other morphology learning systems. We were able to obtain implementations of two of the best unsupervised systems from MC2010, Morfessor (Creutz and Lagus, 2007) and Morsel (Lignos, 2010), and we use these for comparisons on both the dev and test sets. We also report test results from MC2010 for the only semi-supervised system in the competition, semi-supervised Morfessor (Kohonen et al., 2010a; Kohonen et al., 2010b). No dev results are reported for this system since we were unable to obtain an implementation. This section briefly reviews the systems.

4.3.1 Morfessor Categories-MAP

Morfessor Categories-MAP (Morfessor) is a state-of-the-art unsupervised morphology learning system. Its implementation is freely available,10 so it is widely used both as a preprocessing step in tasks requiring morphological segmentations and as a baseline for evaluating morphology learning systems.

10 http://www.cis.hut.fi/projects/morpho/morfessorcatmapdownloadform.shtml

Morfessor uses the Minimum Description Length (MDL) principle to choose the optimal segment lexicon and the corpus segmentation. Each morph in the segment lexicon is labelled as a stem, prefix, suffix or non-morph. The morphotactic rules are encoded as an HMM, which specifies the allowed morph sequences with respect to the labels (e.g., a suffix cannot directly follow a prefix).

The morphs in the segment lexicon can have a hierarchical structure, containing submorphs which themselves can consist of submorphs, etc. We hypothesize that this hierarchical structure is one of the key reasons why Morfessor has been so successful, as the experiments in this paper with different grammars also show that the ability to learn latent structures is crucial for learning good segmentations.
One essential difference between Morfessor and the proposed AG Select is that while we use the la- belled data to choose which levels of the hierarchy are to be used as morphs, Morfessor makes this de- cision based on the labels of the segments, choosing the most fine-grained morph sequence that does not contain the non-morph label. Morfessor includes a free parameter, perplexity threshold, which we found can affect the SBF1 score considerably (7 points or more). The best value for this parameter depends on the size of the training set, characteristics of the language being learned, and also the evaluation metric being used, as in some cases the best SBF1 and EMMA scores are obtained with completely different values. Thus, we tuned the value of the perplexity thresh- old on the labelled set for each language and evalua- tion metric for different unlabelled training set sizes. 4.3.2 Semi-Supervised Morfessor Recently, the Morfessor system has been adapted to allow semi-supervised training. Four versions of the system were evaluated in MC2010, using differ- ent degrees of supervision. Results reported here are from the Morfessor S+W system, which performed best of those that use the same kind of labelled data as we do.11 This system uses the Morfessor Base- line model (not Cat-MAP), which incorporates a lexicon prior and data likelihood term. The semi- supervised version maintains separate likelihoods for the labelled and unlabelled data, and uses the devel- opment set to tune two parameters that weight these terms with respect to each other and the prior. 11Morfessor S+W+L performs better, but uses training data with morpheme analyses rather than surface segmentations. 260 4.3.3 Morsel Morsel (Lignos, 2010) is an unsupervised mor- phology learning system introduced in MC2010; we obtained the implementation from the author. Morsel learns morphological analyses rather than segmenta- tions, so it can be evaluated only using EMMA. There are two options provided for running Morsel: aggres- sive and conservative. We used the development set to choose the best in each experimental case. The MC data sets contain gold standard morpho- logical analyses (as well as segmentations) so we could compute Morsel’s EMMA scores using the analyses. However, we found that Morsel obtains higher EMMA scores when evaluated against gold standard segmentations and thus we used this option in all the experiments. (EMMA scores for other sys- tems were also computed using the segmentations.) 4.4 Method The experiments were conducted in two parts. First, we evaluated different aspects of the AG models and compared to all baseline models using the dev set data. Then we evaluated the most competitive models on the final test data. For the development experiments, we compiled un- labelled training sets with sizes ranging from 10k to 50k word types (using the most frequent word types in each case). For the AG results, we report the aver- age of five different runs made on the same training set. We let the sampler run for 1000 iterations. No annealing was used as it did not seem to help. The table label resampling option was turned on and the hyperparameter values were inferred. We trained all AG and baseline models on each of these training sets. For AG Select, the words from the labelled data set were added to the training set to allow for template selection.12 To compute results in transductive mode, the words from the dev set were also added to the training data. 
In inductive mode, the dev set was instead parsed with the CKY parser.

12 We also experimented with smaller sets of labelled data. In most cases, the template selected based on only 300 word types was the same as the one selected with 1000 word types.

Preliminary experiments showed that the performance of unsupervised AG and AG Select improved with larger training sets, though the effect is small (see Figure 2 for results of AG Select in transductive mode; the trend in inductive mode is similar). Based on these and similar results with other baseline systems, all results reported later for unsupervised models (AG and baseline) and AG Select were obtained using training sets of 50k words.

Figure 2: Effect of training data size on dev set SBF1 for AG Select (left) and semi-supervised SubMorphs grammar (right) in transductive mode. (Both panels plot F-score against the number of word types, from 10k to 50k, for English, Estonian, Finnish and Turkish.)

In contrast to the above models, the semi-supervised AG does not always improve with more unlabelled data (see Figure 2) and in the limit, it will match the performance of the same grammar in the unsupervised setting. Other semi-supervised approaches often solve this problem by weighting the labelled data more heavily than the unlabelled data when estimating model parameters—effectively, assuming that each labelled item has actually been observed more than once. However, duplicating the labelled data does not make sense in the AG framework, because duplicate items will in most cases just be cached at the root (Word) node, providing no additional counts of Morphs (which are where the useful information is). It might be possible to come up with a different way to weight the labelled data more heavily when larger unlabelled sets are used; however, for this paper we instead kept the labelled data the same and tuned the amount of unlabelled data. We used the dev set to choose the amount of unlabelled data (in the range from 10k to 50k types); results for semi-supervised AG are reported using the optimal amount of unlabelled data for each experiment.

For test set experiments with semi-supervised AG, we evaluated each language using whichever grammar performed best on that language's dev set. For test set experiments with AG Select, we chose the templates with a two-pass procedure. First, we trained 5 samplers on the 50k training set with the labelled set added, and used the labelled data to choose the best template for each inferred grammar. Then, we decoded the dev set using each of the 5 grammar/template pairs and, based on these results, chose the best of these pairs to decode the test set.
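Template selection and the tuning steps above repeatedly score candidate segmentations against gold ones. As a concrete reference for the SBF1 numbers reported below, here is a minimal sketch of the segment border F1 described in Section 4.2. It is our own code, not the official evaluation script: it counts only word-internal boundary positions (so correctly unsegmented one-morpheme words get no credit, exactly the drawback noted there), and it pools boundary counts over the whole word list, which the paper does not specify.

```python
def boundaries(segments):
    """Word-internal boundary positions, e.g. ['salt', 'i', 'ness'] -> {4, 5}."""
    positions, offset = set(), 0
    for seg in segments[:-1]:
        offset += len(seg)
        positions.add(offset)
    return positions

def sbf1(predicted, gold):
    """Segment border F1 over parallel lists of segmentations (lists of morphs)."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        p, g = boundaries(pred), boundaries(ref)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [["salt", "i", "ness"], ["walk", "ed"], ["dog"]]
pred = [["salt", "iness"], ["walk", "ed"], ["do", "g"]]
print(round(sbf1(pred, gold), 3))  # 0.667: precision and recall are both 2/3
```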
4.5 Results

We present the dev set results in Table 2(a) for transductive and in Table 2(b) for inductive learning. In each table, unsupervised models are shown in the upper section and the semi-supervised models and AG Select below. Morsel appears only in Table 2(a) since it only works transductively. Semi-supervised grammars cannot be trained on German, since we only have gold standard analyses, not segmentations.

The SubMorphs grammar performs the best of the unsupervised AG models, with the Compounding grammar being only slightly worse. We also tried the Compounding grammar without the sub-morph structures but the results were even worse than those of MorphSeq. This shows that the latent structures are important for learning good segmentations.

In all cases, the semi-supervised AGs perform better (often much better) than the corresponding unsupervised grammars. Even though their average scores are not as high as AG Select's, they give the best dev set results in many cases. This shows that although for semi-supervised AG the grammar must be chosen manually, with a suitable choice of the grammar and only a small set of labelled data it can improve considerably over unsupervised AG.

In transductive mode, AG Select performs the best in several cases. In both transductive and inductive mode, the results of AG Select are close to the best results obtained and are consistently good across all languages—it achieves the best average scores of all models, suggesting that the model selection method is robust to different types of morphology and annotation schemes.

Table 3 presents the test set results. We include scores for unsupervised Morfessor in both transductive and inductive mode, where transductive mode trains on the entire unlabelled corpus and inductive mode trains on the 50k subset. The semi-supervised Morfessor scores are taken from the MC results page13 after verifying that the evaluation methodology and labelled data used is the same as ours.14

13 http://research.ics.aalto.fi/events/morphochallenge/
14 Sami Virpioja, personal communication.

There is a good deal of variation between development and test results, indicating that the dev sets may not be a representative sample. The most notable differences are in Turkish, where all models perform far worse on the test than dev set. However, AG Select performs slightly better on the test set for the other languages. Thus its average SBF1 score actually improves on the test set and is not much worse than semi-supervised Morfessor. While its average performance drops somewhat on test set EMMA, it is still as good as any other model on that measure. Again, these results support the idea that AG Select is robust to variations in language and data set. We also note the surprisingly good performance of Morfessor in transductive mode on Estonian; this could possibly be due to the larger amount of training data used for the test set results, but it is not clear why this would improve performance so much on Estonian and not on the other languages.

5 Discussion

To give a sense of what the AG Select model is learning, we provide some examples of both correctly and incorrectly induced segmentations in Table 4. These examples suggest that, for example in English, M1 is used to model the stem, M21 is for the suffix or the second stem in the compound word, and the rest of the elements in the template are for the remaining suffixes (if any).

Table 5 presents examples of some of the most frequently used metagrammar rules and cached rules
for English, together with their relative frequencies. It shows that at the Word level the binary rule is selected over three times more frequently than the unary rule. Also, most of the more frequently used grammar rules expand the first branch (rooted in M1) into more fine-grained structures. The second branch (M2) is mostly modelled with the unary rule. Among the frequently cached rules we see the common English prefixes and suffixes. One of the most frequent cached rules stores the single letter e at the end of a word, which often causes oversegmentation of words ending in e (as seen in the incorrect examples in Table 4). This problem is common in unsupervised morphological segmentation of English (Goldwater et al., 2006; Goldsmith, 2001).

(a) Transductive mode

                        Border F1-score                      EMMA
                        Eng   Est   Fin   Tur   Avg    Eng   Est   Fin   Tur   Ger   Avg
  AG MorphSeq           61.5  54.0  56.9  59.5  58.0   74.7  74.1  63.7  53.5  59.4  65.1
  AG SubMorphs          66.2  66.9  60.5  59.5  63.3   79.1  83.4  66.8  53.4  57.4  68.0
  AG Compounding        63.0  64.8  60.9  60.9  62.4   75.4  81.6  65.5  53.7  62.2  67.7
  Morfessor             69.5  55.7  65.0  69.3  64.9   81.3  75.3  67.8  62.2  62.7  69.9
  Morsel                -     -     -     -     -      76.8  74.4  66.1  50.1  55.9  64.7
  AG ssv MorphSeq       64.4  57.3  63.0  68.9  63.4   74.4  75.9  65.6  59.6  -     -
  AG ssv SubMorphs      67.6  69.1  64.4  63.4  66.1   79.5  84.4  69.2  56.1  -     -
  AG ssv Compounding    70.0  67.5  71.8  -     -      79.5  82.8  74.0  -     -     -
  AG Select             71.9  68.5  70.2  72.6  70.8   77.5  81.8  73.2  63.0  62.4  71.6

(b) Inductive mode

                        Border F1-score                      EMMA
                        Eng   Est   Fin   Tur   Avg    Eng   Est   Fin   Tur   Ger   Avg
  AG MorphSeq           57.6  54.0  55.4  59.8  56.7   72.0  73.8  62.6  53.7  58.9  64.2
  AG SubMorphs          66.1  67.5  61.6  59.8  63.7   78.6  83.7  67.4  53.4  56.0  67.8
  AG Compounding        62.0  64.8  57.4  61.1  61.3   73.5  81.1  61.9  53.2  61.0  66.2
  Morfessor             68.9  51.1  63.5  68.2  62.9   80.9  72.0  68.1  60.6  60.8  68.5
  AG ssv MorphSeq       64.6  56.9  63.1  70.3  63.8   72.7  73.3  65.9  61.2  -     -
  AG ssv SubMorphs      70.1  69.7  66.3  67.9  68.4   80.4  83.7  70.5  59.0  -     -
  AG ssv Compounding    70.5  67.2  70.0  -     -      77.3  81.9  70.5  -     -     -
  AG Select             69.8  68.8  67.5  70.1  69.1   77.3  81.9  71.1  59.7  62.6  70.5

Table 2: Dev set results for all models in (a) transductive and (b) inductive mode. Unsupervised AG models and baselines are shown in the top part of each table; semi-supervised AG models and the grammar selection method are below.

                   Border F1-score                            EMMA
                   Eng    Est    Fin    Tur    Avg    -Est    Eng    Est    Fin    Tur    Ger    Avg    -Est/Ger
  Morf. trans      67.3   73.9   61.2   57.1   64.9   61.9    78.4   78.8   61.8   49.8   65.2   66.8   63.3
  Morf. ind        65.7   57.7   60.8   60.1   61.1   62.2    76.5   70.5   59.6   47.0   64.1   63.5   61.0
  Morsel           -      -      -      -      -      -       81.9   77.2   63.3   47.8   59.0   65.8   64.3
  Morf. ssv        77.8   -      71.7   68.9   -      72.8    80.6   -      62.1   49.9   -      -      64.2
  AG ssv best      70.3⋆  68.6†  64.9⋆  58.2∗  65.5   64.5    75.9⋆  80.3†  61.3⋆  46.1∗  -      -      61.1
  AG Select        74.4   71.7   70.0   61.4   69.4   68.6    81.3   81.0   64.0   47.5   63.8   67.5   64.3

Table 3: Test set results for unsupervised baselines Morfessor CatMAP (in transductive and inductive mode) and Morsel; semi-supervised Morfessor; and AG semi-supervised (⋆ marks the Compounding grammar, † the SubMorphs grammar, and ∗ the MorphSeq grammar) and grammar selection methods. Results are shown for each language, averaged over all languages (when possible: Avg), and averaged over just the languages where scores are available for all systems (-Est, -Est/Ger).
We also took a look at the most frequent cached rules learned by the semi-supervised AG with the SubMorphs grammar, and observed that Morphs tended to contain only a single SubMorph. This helps to explain why the SubMorphs grammar in semi-supervised AG improved less over the unsupervised AG as compared to the MorphSeq grammar—the rules with only a single SubMorph under the Morph are essentially the same as they would be in the MorphSeq grammar.

  Correct segmentations                                        Incorrect segmentations
  Word             Segmentation                                Induced                  Correct
  treatable        [tr.ea.t]M1 [a.b.le]M21                     disagree s               dis agree s
  disciplined      [dis.cip.l.i.n]M1 [e.d]M21                  reduc e                  reduce
  monogamous       [mon.o.g.a.m]M1 [o.u.s]M21                  revalu e                 re value
  streakers        [st.r.e.a.k]M1 [e.r]M21 [s]M2211            derid e                  deride
  tollgate         [t.o.l.l.]M1 [g.a.t.e]M21 [s]M2211          accompani ed             ac compani ed
  foxhunting       [f.o.x]M1 [h.u.n.t]M21 [ing]M2211           war y                    wary
  muscovites       [m.u.sc.o.v]M1 [i.t.e]M21 [s]M2211          indescrib able           in describ able
  standardizes     [st.a.n.d.a.rd]M1 [i.z.e]M21 [s]M2211       orat es                  orate s
  slavers’         [sl.a.v]M1 [e.r]M21 [s]M2211 [’]M2212       alger ian s              algeri an s
  earthiness’      [e.ar.th]M1 [i]M2111 [ness]M2211 [’]M2212   disput e s               dispute s
  instinkt         [in.st.in.kt]M1                             meister likkust          meisterlikkus t
  rebis            [re.b.i]M1 [s]M2                            min a                    mina
  toitsid          [to.it]M1 [s.id]M2                          teiste                   teis te
  armuavaldus      [a.rm.u]M11 [ava.ld.u.s]M12                 kuritegu de sse          kuri tegu desse
  määgivale        [mää.g.i]M11 [v.a]M12 [l.e]M2               liharoa ga               liha roa ga
  keskuskoulussa   [kesk.us]M11 [koul.u]M12 [s.sa]M2           polte tti in             polte tt i in
  peruslähteille   [per.u.s]M11 [l.ä.ht.e]M12 [i]M211 [ll.e]M212    kulttuuri se lt a kin    kulttuurise lta kin
  perunakaupoista  [per.u.n.a]M11 [k.au.p.o]M12 [i]M211 [st.a]M212  tuote palki ntoja        tuote palkinto j a
  yöpaikkaani      [yö]M11 [p.ai.kk.a]M12 [a]M21 [n.i]M22      veli puo lt a            veli puol ta
  nimettäköön      [ni.m.e]M11 [tt.ä]M12 [k.ö]M21 [ö.n]M22     ota ttava                otatta va

Table 4: Examples of segmented words in English (top), Estonian (middle) and Finnish (bottom). Correctly segmented words are in the left part of the table. The identified segments are in brackets indexed by the respective template nonterminal; dots separate the metagrammar-generated parse tree leaves. Examples of incorrectly segmented words together with the correct segmentation are on the right.

  Freq (%)  Rule                  Freq (%)  Cached rule
  9.9       Word → M1 M2          1.2       (M2 (M21 (M211 (M2111 s))))
  5.7       M1 → M11 M12          0.9       (M2 (M21 (M211 (M2111 e)) (M212 (M2121 d))))
  3.1       Word → M1             0.7       (M2 (M21 (M211 (M2111 i)) (M212 (M2121 n g))))
  2.5       M11 → M111            0.6       (M2 (M21 (M211 (M2111 e))))
  1.8       M2 → M21              0.4       (M2 (M21 (M211 (M2111 ’))) (M22 (M221 (M2211 s))))
  1.4       M12 → M121 M122       0.3       (M1112 a)
  1.4       M111 → M1111 M1112    0.3       (M2 (M21 (M211 (M2111 y))))
  0.9       M12 → M121            0.3       (M2 (M21 (M211 (M2111 e)) (M212 (M2121 r))))
  0.9       M11 → M111 M112       0.2       (M2 (M21 (M211 (M2111 a))))

Table 5: Examples of the most frequently used English metagrammar rules and cached rules, together with their relative occurrence frequencies (in percentages).

Finally, we examined the consistency of the templates chosen for each of the 5 samplers during model selection for the test set (Section 4.4). We found that there was some variability in the templates, but in most experiments the same template was chosen for the majority of the samplers (see Table 6). While this majority template is not always the optimal one on the dev set, we observed that it does produce consistently good results.
It is possible that using the majority template, rather than the optimal template for the dev set, would actually produce better results on the test set, especially if (as appears to be the case here, and may often be the case in real applications) the dev and test sets are from somewhat different distributions.

              Majority template
  English     M1 M21 M2211 M2212 M222
  Finnish     M11 M12 M211 M212 M22
  Turkish     M11 M12 M211 M212 M22
  German      M11 M121 M122 M21 M221 M222
  Estonian    M11 M12 M2

Table 6: Majority templates for each language. Note that the Estonian gold standard contains less fine-grained segmentations than some of the other languages.

It must be noted that both AG Select and semi-supervised AG are computationally more demanding than the comparison systems. Since we do inference over tree structures, the complexity is cubic in the input word length, while most segmentation systems are quadratic or linear. Even compared to the unsupervised AG, AG Select is more expensive, because of the larger grammar and number of cached symbols. Nevertheless, our systems can feasibly be run on the large Morpho Challenge datasets.

Other recent unsupervised systems have reported state-of-the-art results by incorporating additional information from surrounding words (Lee et al., 2011), multilingual alignments (Snyder and Barzilay, 2008), or overlapping context features in a log-linear model (Poon et al., 2009), but they have only been run on Semitic languages and English (and in the latter case, a very small corpus). Since they explicitly enumerate and sample from all possible segmentations of each word (often with some heuristic constraints), they could have trouble with the much longer words of the agglutinative languages tested here. In any case the results are not directly comparable to ours.

6 Conclusion

In this paper we have introduced three new methods for Adaptor Grammars and demonstrated their usefulness for minimally supervised morphological segmentation. First, we showed that AG models can be scaled to large data sets by using the posterior grammar to define an inductive model, which on average gives the same accuracy as full transductive training.

Second, we implemented semi-supervised AG inference, which uses labelled data to constrain the sampler, and showed that in all cases it performs much better than the unsupervised AG with the same grammar. Semi-supervised AG could benefit from the labelled data reweighting techniques frequently used in semi-supervised learning, and studying the proper ways of doing so within the AG framework would be a potential topic for future research.

Our final contribution is the AG Select method, where the initial model is trained using a very general grammar that oversegments the data, and the labelled data is used to select which granularity of segments to use. Unlike other morphological segmentation models, this method can adapt its grammar to languages with different structures, rather than having to use the same grammar for every language. Indeed, we found that AG Select performs well across a range of languages and also seems to be less sensitive to differences between data sets (here, dev vs. test). In addition, it can be trained on either morphological analyses or segmentations. Although we tuned all results to optimize the SBF1 metric, in principle the same method could be used to optimize other measures, including extrinsic measures on downstream applications such as machine translation or information retrieval.
In future we hope to show that this method can be used to improve performance on such applications, and also to explore its use for related segmentation tasks such as stemming or syllabification. Also, the method itself could potentially be improved by designing a classifier to determine the best template for each word based on a set of features, rather than using a single template for all words in the language.

Acknowledgments

This work was supported by the Tiger University program of the Estonian Information Technology Foundation for the first author. We thank Constantine Lignos for releasing his Morsel code to us, Sami Virpioja for evaluating test set results, and Federico Sangati for providing useful scripts.

References

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):1–34, February.

Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of NAACL, pages 164–172. Association for Computational Linguistics.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, June.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, pages 459–466, Cambridge, MA. MIT Press.

Eric A. Hardisty, Jordan Boyd-Graber, and Philip Resnik. 2010. Modeling perspective using adaptor grammars. In Proceedings of EMNLP, pages 284–292. Association for Computational Linguistics.

Zellig Harris. 1951. Structural Linguistics. University of Chicago Press.

Yun Huang, Min Zhang, and Chew Lim Tan. 2011. Nonparametric Bayesian machine transliteration with synchronous adaptor grammars. In Proceedings of ACL: Short Papers - Volume 2, pages 534–539. Association for Computational Linguistics.

Mark Johnson and Katherine Demuth. 2010. Unsupervised phonemic Chinese word segmentation using adaptor grammars. In Proceedings of COLING, pages 528–536. Association for Computational Linguistics.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of NAACL, pages 317–325. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 641–648. MIT Press, Cambridge, MA.

Mark Johnson. 2008a. Unsupervised word segmentation for Sesotho using adaptor grammars. In Proceedings of ACL Special Interest Group on Computational Morphology and Phonology, pages 20–27. Association for Computational Linguistics.

Mark Johnson. 2008b. Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic structure. In Proceedings of ACL, pages 398–406. Association for Computational Linguistics.

Oskar Kohonen, Sami Virpioja, and Krista Lagus. 2010a. Semi-supervised learning of concatenative morphology. In Proceedings of ACL Special Interest Group on Computational Morphology and Phonology, pages 78–86. Association for Computational Linguistics.

Oskar Kohonen, Sami Virpioja, Laura Leppänen, and Krista Lagus. 2010b.
Semi-supervised extensions to Morfessor baseline. In Proceedings of the Morpho Challenge 2010 Workshop, pages 30–34. Aalto University School of Science and Technology.

Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of CoNLL, pages 1–9. Association for Computational Linguistics.

Percy Liang, Hal Daumé III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proceedings of ICML, pages 592–599. Association for Computing Machinery.

Constantine Lignos. 2010. Learning from unseen data. In Mikko Kurimo, Sami Virpioja, and Ville T. Turunen, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 35–38. Aalto University School of Science and Technology.

Jim Pitman and Marc Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900.

Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of NAACL, pages 209–217. Association for Computational Linguistics.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proceedings of ACL, pages 737–745. Association for Computational Linguistics.

Sebastian Spiegler and Christian Monson. 2010. EMMA: A novel evaluation metric for morphological analysis. In Proceedings of COLING, pages 1029–1037. Association for Computational Linguistics.

Sze-Meng Jojo Wong, Mark Dras, and Mark Johnson. 2012. Exploring adaptor grammars for native language identification. In Proceedings of EMNLP, pages 699–709. Association for Computational Linguistics.