A Generative Model of Phonotactics

Richard Futrell, Brain and Cognitive Sciences, Massachusetts Institute of Technology, futrell@mit.edu
Adam Albright, Department of Linguistics, Massachusetts Institute of Technology, albright@mit.edu
Peter Graff, Intel Corporation, graffmail@gmail.com
Timothy J. O'Donnell, Department of Linguistics, McGill University, timothy.odonnell@mcgill.ca

Abstract

We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2007; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models' ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.

1 Introduction

People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: Although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965). Such intuitions reveal that speakers are aware of the restrictions on sound sequences which can make up possible morphemes in their language—the phonotactics of the language. Phonotactic restrictions mean that each language uses only a subset of the logically, or even articulatorily, possible strings of phonemes. Admissible phoneme combinations, on the other hand, typically recur in multiple morphemes, leading to redundancy.

It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011). To account for such graded judgments, there have been a variety of probabilistic (or, more generally, weighted) models proposed to handle phonotactic learning and generalization over the last two decades (see Daland et al. (2011) and below for review). However, inspired by optimality-theoretic approaches to phonology, the most linguistically informed and successful such models have been constraint-based—formulating the problem of phonotactic generalization in terms of restrictions that penalize illicit combinations of sounds (e.g., ruling out *bn-).

In this paper, by contrast, we adopt a generative approach to modeling phonotactic structure. Our approach harkens back to early work on the sound structure of lexical items which made use of morpheme structure rules or conditions (Halle, 1959; Stanley, 1967; Booij, 2011; Rasin and Katzir, 2014). Such approaches explicitly attempted to model the
redundancy within the set of allowable lexical forms in a language. We adopt a probabilistic version of this idea, conceiving of the phonotactic system as the component of the linguistic system which generates the phonological form of lexical items such as words and morphemes.1 Our system learns inventories of reusable phonotactically licit structures from existing lexical items, and assembles new lexical items by combining these learned phonotactic patterns using phonologically plausible structure-building operations. Thus, instead of modeling phonotactic generalizations in terms of constraints, we treat the problem as a problem of learning language-specific inventories of phonological units and language-specific biases on how these phones are likely to be combined.

1 Ultimately, we conceive of phonotactics as the module of phonology which generates the underlying forms of lexical items, which are then subject to phonological transformations (i.e., transductions). In this work, however, we do not attempt to model transformations from underlying to surface forms.

Although there have been a number of earlier generative models of phonotactic structure (see Section 4), these models have mostly used relatively simplistic or phonologically implausible representations of phones and phonological structure-building. By contrast, our model is built around three representational assumptions inspired by the generative phonology literature. First, we capture sparsity in the space of feature-specifications of phonemes by using feature dependency graphs—an idea inspired by work on feature geometries and the contrastive hierarchy (Clements, 1985; Dresher, 2009). Second, our system can represent phonotactic generalizations not only at the level of fully specified segments, but also allows the storage and reuse of subsegments, inspired by the autosegments and class nodes of autosegmental phonology. Finally, also inspired by autosegmental phonology, we make use of a structure-building operation which is sensitive to tier-based contextual structure.

To model phonotactic learning, we make use of tools from Bayesian nonparametric statistics. In particular, we make use of the notion of lexical memoization (Goodman et al., 2008; Wood et al., 2009; O'Donnell, 2015)—the idea that language-specific generalizations can be captured by the storage and reuse of frequent patterns from a linguistically universal inventory. In our case, this amounts to the idea that an inventory of segments and subsegments can be acquired by a learner that stores and reuses commonly occurring segments in particular, phonologically relevant contexts. In short, we view the problem of learning the phoneme inventory as one of concentrating probability mass on the segments which have been observed before, and the problem of phonotactic generalization as learning which (sub-)segments are likely in particular tier-based phonological contexts.

2 Model Motivations

In this section, we give an overview of how our model works and discuss the phenomena and theoretical ideas that motivate it.

2.1 Feature Dependency Graphs

Most formal models of phonology posit that segments are grouped into sets, known as natural classes, that are characterized by shared articulatory and acoustic properties, or phonological features (Trubetzkoy, 1939; Jakobson et al., 1952; Chomsky and Halle, 1968).
For example, the segments /n/ and /m/ are classified with a positive value of a nasality feature (i.e., NASALITY:+). Similarly, /m/ and /p/ can be classified using the labial value of a PLACE feature, PLACE:labial. These features allow compact description of many phonotactic generalizations.2

2 For compatibility with the data sources used in evaluation (Section 5.2), the feature system we use here departs in several ways from standard feature sets: (1) We use multivalent rather than binary-valued features. (2) We represent manner with a single feature, which has values such as vocalic, stop, and fricative. This approach allows us to refer to manners more compactly than in systems that employ combinations of features such as sonorant, continuant, and consonantal. For example, rather than referring to vowels as 'non-syllabic', we refer to them using the feature value vocalic for the feature MANNER.

From a probabilistic structure-building perspective, we need to specify a generative procedure which assembles segments out of parts defined in terms of these features. In this section, we will build up such a procedure starting from the simplest possible procedure and progressing towards one which is more phonologically informed. We will clarify the generative process here using an analogy to PCFGs, but this analogy will break down in later sections.

The simplest procedure for generating a segment from features is to specify each feature independently. For example, consider the set of feature-value pairs for /t/: {NASALITY:-, PLACE:alveolar, ...}. In a naive generative procedure, one could generate an instance of /t/ by independently choosing values for each feature in the set {NASALITY, PLACE, ...}. We express this process using the and-or graph notation below. Box-shaped nodes—called or-nodes—represent features such as NASALITY, while circular nodes represent groups of features whose values are chosen independently and are called and-nodes.

[Diagram: a root and-node with or-node children NASALITY, ..., PLACE.]

This generative procedure is equivalent (ignoring order) to a PCFG with rules:

SEGMENT → NASALITY ... PLACE
NASALITY → +
NASALITY → -
PLACE → bilabial
PLACE → alveolar
...

Not all combinations of feature-value pairs correspond to possible phonemes. For example, while /l/ is distinguished from other consonants by the feature LATERAL, it is incoherent to specify vowels as LATERAL. In order to concentrate probability mass on real segments, our process should optimally assign zero probability mass to these incoherent phonemes. We can avoid specifying a LATERAL feature for vowels by structuring the generative process as below, so that the LATERAL or-node is only reached for consonants:

[Diagram: an or-node VOCALIC whose consonant arc leads to an and-node A with children LATERAL, ..., and whose vowel arc leads to an and-node B with children HEIGHT, ....]

Beyond generating well-formed phonemes, a basic requirement of a model of phonotactics is that it concentrates mass only on the segments in a particular language's segment inventory. For example, the model of English phonotactics should put zero or nominal mass on any sequence containing the segment /x/, although this is a logically possible phoneme. So our generative procedure for a phoneme must be able to learn to generate only the licit segments of a language, given some probability distributions at the and- and or-nodes. For this task, independently sampling values at and-nodes does not give us a way to rule out particular combinations of features such as those forming /x/.
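To make the base generative procedure concrete, the sketch below samples a subsegment by walking such an and-or graph: or-nodes choose exactly one outgoing arc, and-nodes recurse into all of their children, and nodes with no children stand for atomic feature values. The graph, node names, and uniform arc weights are illustrative assumptions, not the feature system or implementation used in the paper. Note that LATERAL is only reachable on the consonant branch, so laterality is never specified for vowels, but nothing here yet prevents coherent-but-unattested segments such as /x/.

```python
import random

# A minimal sketch (not the paper's implementation) of the base sampling
# procedure over an and-or feature dependency graph. The graph below and the
# uniform choice at or-nodes are illustrative assumptions.
FEATURE_GRAPH = {
    "ROOT":      ("and", ["VOCALIC"]),
    "VOCALIC":   ("or",  ["CONSONANT", "VOWEL"]),
    "CONSONANT": ("and", ["LATERAL", "PLACE"]),   # LATERAL exists only for consonants
    "VOWEL":     ("and", ["HEIGHT", "BACKNESS"]),
    "LATERAL":   ("or",  ["lateral+", "lateral-"]),
    "PLACE":     ("or",  ["bilabial", "alveolar", "velar"]),
    "HEIGHT":    ("or",  ["high", "mid", "low"]),
    "BACKNESS":  ("or",  ["front", "back"]),
}
# Any node not listed above is a childless and-node: an atomic feature value.

def sample_subsegment(node):
    kind, children = FEATURE_GRAPH.get(node, ("and", []))
    if kind == "or":
        # or-node: choose exactly one outgoing arc (weights are learned in the model)
        return {node: sample_subsegment(random.choice(children))}
    if not children:
        return node                                   # leaf: atomic feature value
    # and-node: sample every child subgraph independently
    return {node: [sample_subsegment(child) for child in children]}

print(sample_subsegment("ROOT"))
# e.g. {'ROOT': [{'VOCALIC': {'VOWEL': [{'HEIGHT': 'mid'}, {'BACKNESS': 'front'}]}}]}
```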
Our approach to this problem uses the idea of stochastic memoization (or adaptation), in which the results of certain computations are stored and may be probabilistically reused "as wholes," rather than recomputed from scratch (Michie, 1968; Goodman et al., 2008). This technique has been applied to the problem of learning lexical items at various levels of linguistic structure (de Marcken, 1996; Johnson et al., 2007; Goldwater, 2006; O'Donnell, 2015). Given our model so far, applying stochastic memoization is equivalent to specifying an adaptor grammar over the PCFGs described so far.

Let f be a stochastic function which samples feature values using the and-or graph representation described above. We apply stochastic memoization to each node. Following Johnson et al. (2007) and Goodman et al. (2008), we use a distribution for probabilistic memoization known as the Dirichlet Process (DP) (Ferguson, 1973; Sethuraman, 1994). Let mem{f} be a DP-memoized version of f. The behavior of a DP-memoized function can be described as follows. The first time we invoke mem{f}, the feature specification of a new segment will be sampled using f. On subsequent invocations, we either choose a value from among the set of previously sampled values (a memo draw), or we draw a new value from f (a base draw). The probability of sampling the ith old value in a memo draw is n_i / (N + θ), where N is the number of tokens sampled so far, n_i is the number of times that value i has been used in the past, and θ > 0 is a parameter of the model. A base draw happens with probability θ / (N + θ). This process induces a bias to reuse items from f which have been frequently generated in the past.

We apply mem recursively to the sampling procedure for each node in the feature dependency graph. The more times that we use some particular set of features under a node to generate words in a language, the more likely we are to reuse that set of features in the future in a memo draw. This dynamic leads our model to rapidly concentrate probability mass on the subset of segments which occur in the inventory of a language.
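The sketch below shows one way to implement DP-based stochastic memoization of such a sampling function, using exactly the memo-draw and base-draw probabilities given above, n_i / (N + θ) and θ / (N + θ). It is a simplified illustration under our own naming and representation choices, not the paper's implementation; wrapping one such memoizer around each node of the feature dependency graph yields the recursive storage of subsegments described in the text.

```python
import random
from collections import Counter

class DPMem:
    """A sketch of DP-based stochastic memoization, mem{f}. Values returned
    by the base function must be hashable in this toy version."""

    def __init__(self, base_draw, theta=1.0):
        self.base_draw = base_draw   # the underlying stochastic function f
        self.theta = theta           # concentration parameter, theta > 0
        self.counts = Counter()      # n_i: reuse counts for previously sampled values
        self.total = 0               # N: number of tokens sampled so far

    def sample(self):
        # With probability n_i / (N + theta), reuse old value i (memo draw);
        # with probability theta / (N + theta), call f again (base draw).
        r = random.uniform(0, self.total + self.theta)
        for value, n_i in self.counts.items():
            r -= n_i
            if r < 0:
                self.counts[value] += 1
                self.total += 1
                return value
        value = self.base_draw()
        self.counts[value] += 1
        self.total += 1
        return value

# Usage: memoize a uniform base distribution over a toy segment inventory.
mem_f = DPMem(lambda: random.choice("ptkaiu"), theta=1.0)
print([mem_f.sample() for _ in range(20)])   # tends to re-sample early draws
```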
2.2 Class Node Structure

Our use of and-or graphs and lexical memoization to model inter-feature dependencies is inspired by work in phonology on distinctiveness and markedness hierarchies (Kean, 1975; Berwick, 1985; Dresher, 2009). In addition to using feature hierarchies to delineate possible segments, the literature has used these structures to designate bundles of features that have privileged status in phonological description, i.e. feature geometries (Clements, 1985; Halle, 1995; McCarthy, 1988). For example, many analyses group features concerning laryngeal states (e.g., VOICE, ASPIRATION, etc.) under a laryngeal node, which is distinct from the node containing oral place-of-articulation features (Clements and Hume, 1995). These nodes are known as class nodes. In these analyses, features grouped together under the laryngeal class node may covary while being independent of features grouped under the oral class node.

The lexical memoization technique discussed above captures this notion of class node directly, because the model learns an inventory of subsegments under each node. Consider the feature dependency graph below.

[Diagram: a root and-node A with two children: an and-node B grouping NASALITY, VOICE, ..., and an or-node VOCALIC whose vowel arc leads to an and-node C grouping BACKNESS, HEIGHT, ..., alongside a separate consonant arc.]

In this graph, the and-node A generates fully specified segments. And-node B can be thought of as generating the non-oral properties of a segment, including voicing and nasality. And-node C is a class node bundling together the oral features of vowel segments. The features under B are outside of the VOCALIC node, so these features are specified for both consonant and vowel segments. This allows combinations such as voiced nasal consonants, and also rarer combinations such as unvoiced nasal vowels. Because all and-nodes are recursively memoized, our model is able to bind together particular non-oral choices (node B), learning for instance that the combination {NASALITY:+, VOICED:+} commonly recurs for both vowels and consonants in a language. That is, {NASALITY:+, VOICED:+} becomes a high-probability memo draw.

Since the model learns an inventory of fully specified segments at node A, the model could learn one-off exceptions to this generalization as well. For example, it could store at a high level a segment with {NASALITY:+, VOICED:-} along with some other features, while maintaining the generalization that {NASALITY:+, VOICED:+} is highly frequent in base draws. Language-specific phoneme inventories abound with such combinations of class-node-based generalizations and idiosyncrasies. By using lexical memoization at multiple different levels, our model can capture both the broader generalizations described in class node terminology and the exceptions to those generalizations.

2.3 Sequential Structure as Memoization in Context

In Section 2.2, we focused on the role that features play in defining a language's segment inventory. We gave a phonologically-motivated generative process, equivalent to an adaptor grammar, for phonemes in isolation. However, features also play an important role in characterizing licit sequences. We model sequential restrictions as context-dependent segment inventories. Our model learns a distribution over segments and subsegments conditional on each preceding sequence of (sub)segments, using lexical memoization. Introducing context-dependence means that the model can no longer be formulated as an adaptor grammar.

2.4 Tier-based Interaction

One salient property of sequential restrictions in phonotactics is that segments are often required to bear the same feature values as nearby segments. For example, a sequence of a nasal and a following stop must agree in place features at the end of a morpheme in English. Such restrictions may even be non-local. For example, many languages prefer combinations of vowels that agree in features such as HEIGHT, BACKNESS, or ROUNDING, even across arbitrary numbers of intervening consonants (i.e., vowel harmony).
In this case, the vowels themselves need not be adjacent—by as- suming that vowel quality features are not present in consonants, it is possible to say that two vowels are adjacent on a tier defined by the nodes HEIGHT, BACKNESS, and ROUNDING. Our full generative process for a segment follow- ing other segments is the following. We follow the example of the generation of a phoneme conditional on a preceding context of /ak/, shown with simpli- fied featural specifications and tiers in Figure 1. At each node in the feature dependency graph, we can either generate a fully-specified subsegment for that node (memo draw), or assemble a novel subseg- ment for that node out of parts defined by the fea- ture dependency graph (base draw). Starting at the root node of the feature dependency graph, we de- cide whether to do a memo draw or base draw con- ditional on the previous n subsegments at that node. So in order to generate the next segment follow- ing /ak/ in the example, we start at node A in the next draw from the feature geometry, with some probabil- ity we do a memo draw conditioned on /ak/, defined by the red tier. If we decide to do a base draw in- stead, we then repeat the procedure conditional on the previous n − 1 segments, recursively until we are conditioning on the empty context. That is, we do a memo draw conditional on /k/, or conditional on the empty context. This process of conditioning on successively smaller contexts is a standard tech- nique in Bayesian nonparametric language modeling (Teh, 2006; Goldwater et al., 2006). At the empty context, if we decide to do a base draw, then we generate a novel segment by repeat- ing the whole process at each child node, to gen- erate several subsegments. In the example, we would assemble a phoneme by independently sam- pling subsegments at the nasal/laryngeal node B and the MANNER node, and then combining them. Cru- cially, the conditioning context consists only of the values at the current node in the previous phonemes. So when we sample a subsegment from node B, it is conditional on the previous two values at node B, { VOICE:+, NASAL:-} and { VOICE:-, NASAL:-}, defined by the blue tier in the figure. The process continues down the feature dependency graph recur- sively. At the point where the model decides on vowel place features such as height and backness, these will be conditioned only on the vowel places features of the preceding /a/, with /k/ skipped en- tirely as it does not have values at vowel place nodes. This section has provided motivations and a walk- through of our proposed generative procedure for se- quences of segments. In the next section, we give the formalization of the model. 3 Formalization of the Models Here we give a full formal description of our pro- posed model in three steps. First, in Section 3.1, we formalize the generative process for a segment in isolation. Second, in Section 3.2, we give for- mulation of Bayesian nonparametric N-gram mod- els with backoff. Third, in Section 3.3, we show how to drop the generative process for a phoneme into the N-gram model such that tier-based interac- tions emerge naturally. 77 3.1 Generative Process for a Segment A feature dependency graph G is a fully connected, singly rooted, directed, acyclic graph given by the triple 〈V,A,t,r〉 where V is a set of vertices or nodes, A is a set of directed arcs, t is a total function t(n) : V 7→ {and,or}, and r is a distinguished root node in V . A directed arc is a pair 〈p,c〉 where the parent p and child c are both elements in V . 
3 Formalization of the Models

Here we give a full formal description of our proposed model in three steps. First, in Section 3.1, we formalize the generative process for a segment in isolation. Second, in Section 3.2, we give a formulation of Bayesian nonparametric N-gram models with backoff. Third, in Section 3.3, we show how to drop the generative process for a phoneme into the N-gram model such that tier-based interactions emerge naturally.

3.1 Generative Process for a Segment

A feature dependency graph G is a fully connected, singly rooted, directed, acyclic graph given by the tuple ⟨V, A, t, r⟩, where V is a set of vertices or nodes, A is a set of directed arcs, t is a total function t(n) : V → {and, or}, and r is a distinguished root node in V. A directed arc is a pair ⟨p, c⟩ where the parent p and the child c are both elements of V. The function t(n) identifies whether n is an and-node or an or-node. Define ch(n) to be the function that returns all children of node n, that is, all n′ ∈ V such that ⟨n, n′⟩ ∈ A.

A subgraph G_s of a feature dependency graph G is the graph obtained by starting from node s and retaining only the nodes and arcs reachable by traversing arcs starting from s. A subsegment p_s is a subgraph rooted in node s for which each or-node contains exactly one outgoing arc. Subsegments represent sampled phone constituents. A segment is a subsegment rooted in r—that is, a fully specified phoneme.

The distribution associated with a subgraph G_s is given by G^s below. G^s is a distribution over subsegments; the distribution for the full graph G^r is a distribution over fully specified segments. We occasionally overload the notation such that G^s(p_s) will refer to the probability mass function associated with the distribution G^s evaluated at the subsegment p_s.

$$H^s \sim \mathrm{DP}(\theta^s, G^s) \qquad (1)$$

$$G^s(p_s) = \begin{cases} \prod_{s' \in ch(s)} H^{s'}(p_{s'}) & t(s) = \text{AND} \\ \sum_{s' \in ch(s)} \psi^s_{s'}\, H^{s'}(p_{s'}) & t(s) = \text{OR} \end{cases}$$

The first case of the definition covers and-nodes. We assume that the leaves of our feature dependency graph—which represent atomic feature values such as the laryngeal value of a PLACE feature—are childless and-nodes.

The second case of the definition covers or-nodes in the graph, where ψ^s_{s'} is the probability associated with choosing the outgoing arc ⟨s, s′⟩ from parent or-node s to child node s′. Thus, or-nodes define mixture distributions over outgoing arcs. The mixture weights are drawn from a Dirichlet process. In particular, for or-node s in the underlying graph G, the vector of probabilities over outgoing arcs is distributed as follows:

$$\vec{\psi}^{\,s} \sim \mathrm{DP}(\theta^s, \mathrm{UNIFORM}(|ch(s)|))$$

Note that in both cases the distribution over child subgraphs is drawn from a Dirichlet process, as in Equation 1, capturing the notion of subsegmental storage discussed above.
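As a concrete reading of Equation 1, the sketch below computes G^s(p_s) for a toy graph, treating the child distributions H^{s'} as plain recursive calls rather than DP-memoized distributions. The graph, the arc weights ψ, and the dictionary encoding of subsegments are illustrative assumptions, not the paper's code.

```python
from math import prod

# A sketch of the base distribution G^s of Equation 1 over a toy graph.
GRAPH = {
    "ROOT":   ("and", ["VOICE", "MANNER"]),
    "VOICE":  ("or",  ["+", "-"]),
    "MANNER": ("or",  ["stop", "fricative", "vocalic"]),
    "+": ("and", []), "-": ("and", []),
    "stop": ("and", []), "fricative": ("and", []), "vocalic": ("and", []),
}
PSI = {  # or-node arc probabilities psi^s_{s'}; uniform here, learned in the model
    "VOICE":  {"+": 0.5, "-": 0.5},
    "MANNER": {"stop": 1 / 3, "fricative": 1 / 3, "vocalic": 1 / 3},
}

def H(s, p):
    # Stand-in for the DP-memoized H^s, which mixes reuse of stored
    # subsegments with base draws from G^s; here it simply defers to G^s.
    return G(s, p)

def G(s, p):
    """Probability of subsegment p rooted at node s. A subsegment is a dict:
    an and-node maps each child to its subsegment; an or-node contains only
    the single chosen child; leaves (childless and-nodes) are empty dicts."""
    kind, children = GRAPH[s]
    if kind == "and":
        return prod(H(c, p[c]) for c in children) if children else 1.0
    (chosen, p_child), = p.items()           # or-node: exactly one outgoing arc
    return PSI[s][chosen] * H(chosen, p_child)

p_t = {"VOICE": {"-": {}}, "MANNER": {"stop": {}}}   # an unvoiced stop, /t/-like
print(G("ROOT", p_t))                                # 0.5 * 1/3
```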
3.2 N-Gram Models with DP-Backoff

Let T be a set of discrete objects (e.g., atomic symbols or structured segments as defined in the preceding sections). Let T* be the set of all finite-length strings which can be generated by combining elements of T under concatenation, ·, including the empty string ε. A context u is any finite string beginning with a special distinguished start symbol and ending with some sequence in T*, that is, u ∈ {start · T*}. For any string α, define hd(α) to be the function that returns the first symbol in the string, tl(α) to be the function that returns the suffix of α minus the first symbol, and |α| to be the length of α, with hd(ε) = tl(ε) = ε and |ε| = 0. Write the concatenation of two strings α and α′ as α · α′.

Let H_u be a distribution on next symbols—that is, objects in T ∪ {stop}—conditioned on a given context u. For an N-gram model of order N, the probability of a string β in T* is given by K^N_start(β · stop), where K^N_u(α) is defined as:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \epsilon \\ H_{f_N(u)}(\mathrm{hd}(\alpha)) \times K^N_{u \cdot \mathrm{hd}(\alpha)}(\mathrm{tl}(\alpha)) & \text{otherwise,} \end{cases} \qquad (2)$$

where f_n(·) is a context-management function which determines which parts of the left-context should be used to determine the probability of the current symbol. In the case of the N-gram models used in this paper, f_n(·) takes a sequence u and returns only the rightmost n − 1 elements of the sequence, or the entire sequence if it has length less than n.

Note two aspects of this formulation of N-gram models. First, H_u is a family of distributions over next symbols or more general objects. Later, we will drop in phonological-feature-based generative processes for these distributions. Second, the function f_n is a parameter of the above definitions. In what follows, we will use a variant of this function which is sensitive to tier-based structure, returning the previous n − 1 elements only on the appropriate tier.

MacKay and Peto (1994) introduced a hierarchical Dirichlet process-based backoff scheme for N-gram models, with generalizations in Teh (2006) and Goldwater et al. (2006). In this setup, the distribution over next symbols given a context u is drawn hierarchically from a Dirichlet process whose base measure is another Dirichlet process associated with context tl(u), and so on, with all draws ultimately backing off into some unconditioned distribution over all possible next symbols. That is, in a hierarchical Dirichlet process N-gram model, H_{f_n(u)} is given as follows:

$$H_{f_n(u)} \sim \begin{cases} \mathrm{DP}(\theta_{f_n(u)}, H_{f_{n-1}(u)}) & n \ge 1 \\ \mathrm{DP}(\theta_{f_n(u)}, \mathrm{UNIFORM}(T \cup \{\text{stop}\})) & n = 0 \end{cases}$$

3.3 Tier-Based Interactions

To make the N-gram model defined in the last section capture tier-based interactions, we make two changes. First, we generalize the generative process H^s from Equation 1 to H^s_u, which generates subsegments conditional on a sequence u. Second, we define a context-truncating function f^s_n(u) which takes a context of segments u and returns the rightmost n − 1 non-empty subsegments whose root node is s. Then we substitute the generative process H^s_{f^s_n(u)} (which applies the context-management function f^s_n(·) to the context u) for H_{f_n(u)} in Equation 2. The resulting probability distribution is:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \epsilon \\ H^r_{f^r_N(u)}(\mathrm{hd}(\alpha)) \times K^N_{u \cdot \mathrm{hd}(\alpha)}(\mathrm{tl}(\alpha)) & \text{otherwise.} \end{cases}$$

K^N_u(α) is the distribution over continuations given a context of segments. Its definition depends on H^s_{f^s_n(u)}, which is the generalization of the generative process for segments H^s to be conditional on some tier-based N-gram context f^s_n(u). H^s_{f^s_n(u)} is:

$$H^s_{f^s_n(u)} \sim \begin{cases} \mathrm{DP}(\theta^s_{f^s_n(u)}, H^s_{f^s_{n-1}(u)}) & n \ge 1 \\ \mathrm{DP}(\theta^s_{f^s_n(u)}, G^s_{f^s_N(u)}) & n = 0 \end{cases}$$

$$G^s_{f^s_n(u)}(p_s) = \begin{cases} \prod_{s' \in ch(s)} H^{s'}_{f^{s'}_n(u)}(p_{s'}) & t(s) = \text{AND} \\ \sum_{s' \in ch(s)} \psi^s_{s'}\, H^{s'}_{f^{s'}_n(u)}(p_{s'}) & t(s) = \text{OR.} \end{cases}$$

H^s_{f^s_n(u)} and G^s_{f^s_n(u)} above are mutually recursive functions. H^s_{f^s_n(u)} implements backoff in the tier-based context of previous subsegments; G^s_{f^s_n(u)} implements backoff by going down into the probability distributions defined by the feature dependency graph.

Note that the function H^s_{f^s_n(u)} recursively backs off to the empty context, but its ultimate base distribution is indexed by f^s_N(u), using the global maximum N-gram order N. So when samples are drawn from the feature dependency graph, they are conditioned on non-empty tier-based contexts. In this way, subsegments are generated based on tier-based context and based on featural backoff in an interleaved fashion.
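The context-truncating function f^s_n(u) is simple to state in code. The sketch below assumes a toy representation in which each generated segment is a dict from node names to the subsegments stored under them; this encoding is our illustrative assumption, not the paper's.

```python
# A sketch of f^s_n(u): keep, from the context u of previously generated
# segments, only the rightmost n-1 subsegments rooted at node s, skipping
# segments with no material under s (so /k/ contributes nothing to the
# context of vowel-place nodes).

def f(s, n, u):
    tier = [segment[s] for segment in u if s in segment]   # non-empty subsegments at s
    return tuple(tier[-(n - 1):]) if n > 1 else ()

# Example: for a vowel-place node, /k/ is skipped entirely.
u = [{"A": "a", "VPLACE": {"HEIGHT": "low", "BACKNESS": "back"}},   # /a/
     {"A": "k"}]                                                    # /k/: no vowel place
print(f("VPLACE", 3, u))   # -> ({'HEIGHT': 'low', 'BACKNESS': 'back'},)
```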
3.4 Inference

We use the Chinese Restaurant Process representation for sampling. Inference in the model is over seating arrangements for observations of subsegments and over the hyperparameters θ for each restaurant. We perform Gibbs sampling on seating arrangements in the Dirichlet N-gram models by removing and re-adding observations in each restaurant. These Gibbs sweeps had negligible impact on model behavior. For the concentration parameter θ, we set a Gamma(10, 0.1) prior. We draw posterior samples using the slice sampler described in Johnson and Goldwater (2009). We draw one posterior sample of the hyperparameters for each Gibbs sweep. In contrast to the Gibbs sweeps, we found resampling hyperparameters to be crucial for achieving the performance described below (Section 5.3).

4 Related Work

Phonotactics has proven a fruitful problem domain for computational models. Most such work has adopted a constraint-based approach, attempting to design a scoring function based on phonological features to separate acceptable forms from unacceptable ones, typically by formulating restrictions or constraints to rule out less-good structures.

This concept has led naturally to the use of undirected (maximum-entropy, log-linear) models. In this class of models, a form is scored by evaluation against a number of predicates, called factors3—for example, whether two adjacent segments have the phonological features VOICE:+ and VOICE:-. Each factor is associated with a weight, and the score for a form is the sum of the weights of the factors which are true for the form. The well-known model of Hayes and Wilson (2008) adopts this framework, pairing it with a heuristic procedure for finding explanatory factors while preventing overfitting. Similarly, Albright (2009) assigns a score to forms based on factors defined over natural classes of adjacent segments. Constraint-based models have the advantage of flexibility: it is possible to score forms using arbitrarily complex and overlapping sets of factors. For example, one can state a constraint against adjacent phonemes having features VOICE:+ and LATERAL:+, or any combination of feature values.

3 Factors are also commonly called "features"—a term we avoid to prevent confusion with phonological features.

In contrast, we have presented a model where forms are built out of parts by structure-building operations. From this perspective, the goal of a model is not to rule out bad forms, but rather to discover repeating structures in good forms, such that new forms with those structures can be generated. In this setting there is less flexibility in how phonological features can affect well-formedness. For a structure-building model to assign "scores" to arbitrary pairs of co-occurring features, there must be a point in the generative process where those features are considered in isolation. Coming up with such a process has been challenging. As a result of this limitation, structure-building models of phonotactics have not generally included rich featural interactions. For example, Coleman and Pierrehumbert (1997) give a probabilistic model for phonotactics where words are generated using a grammar over units such as syllables, onsets, and rhymes. This model does not incorporate fine-grained phonological features such as voicing and place.

In fact, it has been argued that a constraint-based approach is required in order to capture rich feature-based interactions. For example, Goldsmith and Riggle (2012) develop a tier-based structure-building model of Finnish phonotactics which captures nonlocal vowel harmony interactions, but argue that this model is inadequate because it does not assign higher probabilities to forms than an N-gram model, a common baseline model for phonotactics (Daland et al., 2011). They argue that this deficiency is because the model cannot simultaneously model nonlocal vowel-vowel interactions and local consonant-vowel interactions.
Because of our tier-based conditioning mechanism (Sections 2.4 and 3.3), our model can simultaneously produce local and nonlocal interactions between features using structure-building operations, and does assign higher probabilities to held-out forms than an N-gram model (Section 5.3). From this perspective, our model can be seen as a proof of concept that it is possible to have rich feature-based conditioning without adopting a constraint-based approach.

While our model can capture featural interactions, it is less flexible than a constraint-based model in that the allowable interactions are specified by the feature dependency graph. For example, there is no way to encode a direct constraint against adjacent phonemes having features VOICE:+ and LATERAL:+. We consider this a strength of the approach: A particular feature dependency graph is a parameter of our model, and a specific scientific hypothesis about the space of likely featural interactions between phonemes, similar to feature geometries from classical generative phonology (Clements, 1985; McCarthy, 1988; Halle, 1995).4

4 We do, however, note that it may be possible to learn feature hierarchies on a language-by-language basis from universal articulatory and acoustic biases, as suggested by Dresher (2009).

While probabilistic approaches have mostly taken a constraint-based approach, recent formal language theoretic approaches to phonology have investigated what basic parts and structure building operations are needed to capture realistic feature-based interactions (Heinz et al., 2011; Jardine and Heinz, 2015). We see probabilistic structure-building approaches such as this work as a way to unify the recent formal language theoretic advances in computational phonology with computational phonotactic modeling.

Our model joins other NLP work attempting to do sequence generation where each symbol is generated based on a rich featural representation of previous symbols (Bilmes and Kirchhoff, 2003; Duh and Kirchhoff, 2004), though we focus more on phonology-specific representations. Our and-or graphs are similar to those used in computer vision to represent possible objects (Jin and Geman, 2006).

5 Model Evaluation and Experiments

Here we evaluate some of the design decisions of our model and compare it to a baseline N-gram model and to a widely-used constraint-based model, BLICK. In order to probe model behavior, we also present evaluations on artificial data, and a sampling of "representative forms" preferred by one model as compared to another.

Our model consists of structure-building operations over a learned inventory of subsegments. If our model can exploit more repeated structure in phonological forms than the N-gram model or constraint-based models, then it should assign higher probabilities to forms. The log probability of a form under a model corresponds to the description length of that form under the model; if a model assigns a higher log probability to a form, that means the model is capable of compressing the form more than other models. Therefore, we compare models on their ability to assign high probabilities to phonological forms, as in Goldsmith and Riggle (2012).

5.1 Evaluation of Model Components

We are interested in discovering the extent to which each model component described above—feature dependency graphs (Section 2.1), class node structure (Section 2.2), and tier-based conditioning (Section 2.4)—contributes to the ability of the model to explain wordforms.
To evaluate the contribution of feature dependency graphs, we compare our models with a baseline N-gram model, which represents phonemes as atomic units. For this N-gram model, we use a Hierarchical Dirichlet Process with n = 3.

To evaluate feature dependency graphs with and without articulated class node structure, we compare models using the graph shown in Figure 3 (the minimal structure required to produce well-formed phonemes) to models with the graph shown in Figure 2, which includes phonologically motivated "class nodes".5

To evaluate tier-based conditioning, we compare models with the conditioning described in Sections 2.4 and 3.3 to models where all decisions are conditioned on the full featural specification of the previous n − 1 phonemes. This allows us to isolate improvements due to tier-based conditioning beyond improvements from the feature hierarchy.

5 These feature dependency graphs differ from those in the exposition in Section 2 in that they do not include a VOCALIC feature, but rather treat vowel as a possible value of MANNER.

Figure 2: Feature dependency graph with class node structure used in our experiments (nodes: duration, laryngeal, nasal, manner, suprasegmental, backness, height, rounding, C place, 2nd art., lateral). Plain text nodes are OR-nodes with no child distributions. The arc marked otherwise represents several arcs, each labelled with a consonant manner such as stop, fricative, etc.

Figure 3: "Flat" feature dependency graph over the same nodes.

5.2 Lexicon Data

The WOLEX corpus provides transcriptions for words in dictionaries of 60 diverse languages, represented in terms of phonological features (Graff, 2012). In addition to words, the dictionaries include some short set phrases, such as of course. We use the featural representation of WOLEX, and design our feature dependency graphs to generate only well-formed phonemes according to this feature system. For space reasons, we present the evaluation of our model on 14 of these languages, chosen based on the quality of their transcribed lexicons and the authors' knowledge of their phonological systems.

5.3 Held-Out Evaluation

Here we test whether the different model configurations described above assign high probability to held-out forms. This tests the models' ability to generalize beyond their training data. We train each model on 2500 randomly selected wordforms from a WOLEX dictionary, and compute posterior predictive probabilities for the remaining wordforms from the final state of the model.
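A sketch of this protocol, assuming a model object exposing fit and logprob methods; these names, and the model classes in the usage comment, are hypothetical stand-ins rather than the paper's code:

```python
import random

# A sketch of the held-out evaluation: train on 2500 randomly selected forms
# and report the average log posterior predictive probability of the rest.
def heldout_eval(model_class, lexicon, n_train=2500, seed=0):
    random.seed(seed)
    forms = list(lexicon)
    random.shuffle(forms)
    train, heldout = forms[:n_train], forms[n_train:]
    model = model_class()
    model.fit(train)                       # assumed training interface
    scores = [model.logprob(f) for f in heldout]
    return sum(scores) / len(scores)       # average log probability per form

# Usage (hypothetical model classes and corpus object):
# print(heldout_eval(AutosegmentalModel, wolex["English"]))
# print(heldout_eval(DPTrigramModel, wolex["English"]))
```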
Language        ngram    flat     cl. node   flat/no tiers   cl. node/no tiers
English         -22.20   -22.15   -21.73**   -22.15          -22.14
French          -18.30   -18.28   -17.93**   -18.29          -18.28
Georgian        -20.21   -20.17   -19.64*    -20.18          -20.18
German          -24.77   -24.72   -24.07**   -24.73          -24.74
Greek           -22.48   -22.45   -21.65**   -22.45          -22.45
Haitian Creole  -16.09   -16.04   -15.82**   -16.05          -16.04
Lithuanian      -19.03   -18.99   -18.58*    -18.99          -18.99
Mandarin        -13.95   -13.83*  -13.78**   -13.82*         -13.82*
Mor. Arabic     -16.15   -16.10   -16.00*    -16.13          -16.12
Polish          -20.12   -20.08   -19.76**   -20.08          -20.07
Quechua         -14.35   -14.30   -13.87*    -14.30          -14.31
Romanian        -18.71   -18.68   -18.32**   -18.69          -18.68
Tatar           -16.21   -16.18   -15.65**   -16.19          -16.19
Turkish         -18.88   -18.85   -18.55**   -18.85          -18.84

Table 1: Average log posterior predictive probability of a held-out form. "ngram" is the DP Backoff 3-gram model. "flat" models use the feature dependency graph in Figure 3. "cl. node" models use the graph in Figure 2. See text for motivations of these graphs. "no tiers" models condition each decision on the previous phoneme, rather than on tiers of previous features. Asterisks indicate statistical significance according to a t-test comparing with the scores under the N-gram model. * = p < .05; ** = p < .001.

Table 1 shows the average probability of a held-out word under our models and under the N-gram model for one model run.6 For all languages, we get a statistically significant increase in probabilities by adopting the autosegmental model with class nodes and tier-based conditioning. Model variants without either component do not significantly outperform the N-gram model except in Chinese. The combination of class nodes and tier-based conditioning results in model improvements beyond the contributions of the individual features.

6 The mean standard deviation per form of log probabilities over 50 runs of the full model ranged from .09 for Amharic to .23 for Dutch.

5.4 Evaluation on Artificial Data

Our model outperforms the N-gram model in predicting held-out forms, but it remains to be shown that this performance is due to capturing the kinds of linguistic intuitions discussed in Section 2. An alternative possibility is that the Autosegmental N-gram model, which has many more parameters than a plain N-gram model, can simply learn a more accurate model of any sequence, even if that sequence has none of the structure discussed above. To evaluate this possibility, we compare the performance of our model in predicting held-out linguistic forms to its performance in predicting held-out forms from artificial lexicons which expressly do not have the linguistic structure we are interested in.

If the autosegmental model outperforms the N-gram model even on artificial data with no phonological structure, then its performance on the real linguistic data in Section 5.3 might be overfitting. On the other hand, if the autosegmental model does better on real data but not artificial data, then we can conclude that it is picking up on some real distinctive structure of that data.

For each real lexicon Lr, we generate an artificial lexicon La by training a DP 3-gram model on Lr and forward-sampling |Lr| forms. Additionally, the forms in La are constrained to have the same distribution over lengths as the forms in Lr. The resulting lexicons have no tier-based or featural interactions except as they appear by chance from the N-gram model trained on these lexica. For each La we then train our models on the first 2500 forms and score the probabilities of the held-out forms, the same procedure as in Section 5.3.
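One way to realize this construction is sketched below: forward-sample from a trained trigram model and keep forms by rejection until the length histogram of the real lexicon is matched exactly. The trigram model object and its sample_form method are assumptions for illustration, and rejection sampling is simply one convenient way to enforce the length constraint.

```python
import random
from collections import Counter

# A sketch of artificial-lexicon construction: sample |Lr| forms from a
# trained trigram model, matching the real lexicon's length distribution.
def make_artificial_lexicon(real_lexicon, trigram_model, seed=0):
    random.seed(seed)
    target_lengths = Counter(len(form) for form in real_lexicon)
    artificial = []
    # Keep sampling until every length bucket of the real lexicon is filled.
    # (Assumes the trigram model eventually produces forms of every needed length.)
    while target_lengths:
        form = trigram_model.sample_form()
        if target_lengths.get(len(form), 0) > 0:
            artificial.append(form)
            target_lengths[len(form)] -= 1
            if target_lengths[len(form)] == 0:
                del target_lengths[len(form)]
    return artificial
```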
We ran this procedure for all the lexicons shown in Table 1. For all but one lexicon, we find that the autosegmental models do not significantly outperform the N-gram models on artificial data. The exception is Mandarin Chinese, where the average log probability of an artificial form is −13.81 under the N-gram model and −13.71 under the full autosegmental model. The result suggests that the anomalous behavior of Mandarin Chinese in Section 5.3 may be due to overfitting.

When exposed to data that explicitly does not have autosegmental structure, the model is not more accurate than a plain sequence model for almost all languages. But when exposed to real linguistic data, the model is more accurate. This result provides evidence that the generative model developed in Section 2 captures true distributional properties of lexicons that are absent in N-gram distributions, such as featural and tier-based interactions.

5.5 Comparison with a Constraint-Based Model

Here we provide a comparison with Hayes and Wilson (2008)'s Phonotactic Learner, which outputs a phonotactic grammar in the form of a set of weighted constraints on feature co-occurrences. This grammar is optimized to match the constraint violation profile in a training lexicon, and so can be seen as a probabilistic model of that lexicon. The authors have distributed one such grammar, BLICK, as a "reference point for phonotactic probability in experimentation" (Hayes, 2012). Here we compare our model against BLICK on its ability to assign probabilities to forms, as in Section 5.3.

Ideally, we would simply compute the probability of forms as we did in our earlier model comparisons. BLICK returns scores for each form. However, since the probabilistic model underlying BLICK is undirected, these scores are in fact unnormalized log probabilities, so they cannot be compared directly to the normalized probabilities assigned by the other models. Furthermore, because the probabilistic model underlying BLICK does not penalize forms for length, the normalizing constant over all forms is in fact infinite, making straightforward comparison of predictive probabilities impossible.

Nevertheless, we can turn BLICK scores into probabilities by conditioning on further constraints, such as the length k of the form. We enumerate all possible forms of length k to compute the normalizing constant for the distribution over forms of that length. The same procedure can also be used to compute the probabilities of each form, conditioned on the length of the form k, under the N-gram and Autosegmental models.
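The sketch below illustrates this length-conditioned normalization for an arbitrary scorer that returns unnormalized log probabilities (as BLICK-style scores are described above): enumerate every form of length k over the segment inventory, compute the log normalizing constant, and subtract. The scorer and alphabet here are toy stand-ins, not BLICK or our models.

```python
from itertools import product
from math import exp, log

# A sketch of conditioning on form length k: normalize an unnormalized
# log score over all |alphabet|^k forms of that length.
def length_conditional_logprob(form, score, alphabet):
    k = len(form)
    scores = [score(candidate) for candidate in product(alphabet, repeat=k)]
    m = max(scores)
    log_z = m + log(sum(exp(s - m) for s in scores))   # log-sum-exp for stability
    return score(tuple(form)) - log_z

# Example with a toy scorer over a 3-segment alphabet.
toy_score = lambda f: -0.5 * sum(1 for x in f if x == "a")
print(length_conditional_logprob(("a", "b"), toy_score, alphabet=("a", "b", "c")))
```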
To compare our models against BLICK, we calculate conditional probabilities for forms of length 2 through 5 from the English lexicon.7 The forms are those in the WOLEX corpus; we include them for this evaluation if they are k symbols long in the WOLEX representation. For our N-gram and Autosegmental models, we use the same models as in Section 5.3. The average probabilities of forms under the three models are shown in Table 2. For lengths 3-5, the autosegmental model assigns the highest probabilities, followed by the N-gram model and BLICK. For length 2, BLICK outperforms the DP N-gram model but not the autosegmental model.

7 Enumerating and scoring the 22,164,361,129 possible forms of length 6 was computationally impractical.

Length   BLICK   ngram   cl. node
2        -6.50   -6.81   -5.18
3        -9.38   -8.76   -7.95
4        -14.1   -11.7   -11.4
5        -18.1   -14.2   -13.9

Table 2: Average log posterior predictive probability of an English form of fixed length under BLICK and our models.

Our model assigns higher probabilities to short forms than BLICK. That is, our models have identified more redundant structure in the forms than BLICK, allowing them to compress the data more. However, the comparison is imperfect in several ways. First, BLICK and our models were trained on different data; it is possible that our training data are more representative of our test data than BLICK's training data were. Second, BLICK uses a different underlying featural decomposition than our models; it is possible that our feature system is more accurate. Nevertheless, these results show that our model concentrates more probability mass on (short) forms attested in a language, whereas BLICK likely spreads its probability mass more evenly over the space of all possible (short) strings.

5.6 Representative Forms

In order to get a sense of the differences between models, we investigate what phonological forms are preferred by different kinds of models. These forms might be informative about the phonotactic patterns that our model is capturing which are not well-represented in simpler models. We calculate the representativeness of a form f with respect to model m1 as opposed to m2 as p(f|m1)/p(f|m2) (Good, 1965; Tenenbaum and Griffiths, 2001). The forms that are most "representative" of model m1 are not the forms that m1 assigns the highest probability, but rather the forms that m1 ranks highest relative to m2.

Tables 3 and 4 show forms from the lexicon that are most representative of our full model and of the N-gram model for English and Turkish.

English N-gram     English Full Model
collaborationist   mistrustful
a posteriori       inharmoniousness
sacristy           absentmindedness
matter of course   blamelessness
earnest money      phlegmatically

Table 3: Most representative forms for the N-gram model and for our full model ("cl. node" in Table 1) in English. Forms are presented in native orthography, but were scored based on their phonetic form.

Turkish N-gram   Turkish Full Model
üstfamilya       büyükkarapınar
dekstrin         kızılcapınar
mnemotekni       altınpınar
ekskavatör       sarımehmetler
foksterye        karaelliler

Table 4: Most representative forms for the N-gram and Autosegmental models in Turkish.

The most uniquely representative forms for our full model are morphologically complex forms consisting of many productive, frequently reused morphemes such as -ness. On the other hand, the representative forms for the N-gram model include foreign forms such as a posteriori (for English) and ekskavatör (for Turkish), which are not built out of parts that frequently repeat in those languages. The representative forms suggest that the full model places more probability mass on words which are built out of highly productive, phonotactically well-formed parts.
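Computing these representativeness rankings only requires the two models' log probabilities. A sketch, with hypothetical model objects exposing a logprob method:

```python
# Rank forms by log p(f | m1) - log p(f | m2): the forms m1 prefers most
# relative to m2. Model objects and their logprob method are assumptions.
def most_representative(forms, m1, m2, top_k=5):
    scored = [(m1.logprob(f) - m2.logprob(f), f) for f in forms]
    scored.sort(reverse=True)
    return [f for _, f in scored[:top_k]]

# Usage (hypothetical):
# most_representative(english_lexicon, full_model, ngram_model)  # Table 3, right column
# most_representative(english_lexicon, ngram_model, full_model)  # Table 3, left column
```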
6 Discussion

We find that our models succeed in assigning high probabilities to unseen forms, that they do so specifically for linguistic forms and not random sequences, that they tend to favor forms with many productive parts, and that they perform comparably to a state-of-the-art constraint-based model in assigning probabilities to short forms.

The improvement for our models over the N-gram baseline is consistent but not large. We attribute this to the way in which phonological generalizations are used in the present model: in particular, phonological generalizations function primarily as a form of backoff for a sequence model. Our models have lexical memoization at each node in a feature dependency graph; as such, the top node in the graph ends up representing transition probabilities for whole phonemes conditioned on previous phonemes, and the rest of the feature dependency graph functions as a backoff distribution. When a model has been exposed to many training forms, its behavior will be largely dominated by the N-gram-like behavior of the top node. In future work it might be effective to learn an optimal backoff procedure which gives more influence to the base distribution (Duh and Kirchhoff, 2004; Wood and Teh, 2009).

While the tier-based conditioning in our model would seem to be capable of modeling nonlocal interactions such as vowel harmony, we have not found that the models do well at reproducing these nonlocal interactions. We believe this is because the model's behavior is dominated by nodes high in the feature dependency graph. In any case, a simple Markov model defined over tiers, as we have presented here, might not be enough to fully model vowel harmony. Rather, a model of phonological processes, transducing underlying forms to surface forms, seems like a more natural way to capture these phenomena.

We stress that this model is not tied to a particular feature dependency graph. In fact, we believe our model provides a novel way of testing different hypotheses about feature structures, and could form the basis for learning the optimal feature hierarchy for a given data set. The choice of feature dependency graph has a large effect on what featural interactions the model can represent directly. For example, neither feature dependency graph has shared place features for consonants and vowels, so the model has limited ability to represent place-based restrictions on consonant-vowel sequences such as requirements for labialized or palatalized consonants in the context of /u/ or /i/. These interactions can be treated in our framework if vowels and consonants share place features, as in Padgett (2011).

7 Conclusion

We have presented a probabilistic generative model for sequences of phonemes defined in terms of phonological features, based on representational ideas from generative phonology and tools from Bayesian nonparametric modeling. We consider our model as a proof of concept that probabilistic structure-building models can include rich featural interactions. Our model robustly outperforms an N-gram model on simple metrics, and learns to generate forms consisting of highly productive parts. We also view this work as a test of the scientific hypotheses that phonological features can be organized in a hierarchy and that they interact along tiers: in our model evaluation, we found that both concepts were necessary to get an improvement over a baseline N-gram model.

Acknowledgments

We would like to thank Tal Linzen, Leon Bergen, Edward Flemming, Edward Gibson, Bob Berwick, Jim Glass, and the audiences at MIT's Phonology Circle, SIGMORPHON, and the LSA 2016 Annual Meeting for helpful comments. This work was supported in part by NSF DDRIG Grant #1551543 to R.F.

References

Adam Albright. 2009. Feature-based generalization as a source of gradient acceptability. Phonology, 26:9–41.
Robert C. Berwick. 1985. The acquisition of syntactic knowledge. MIT Press, Cambridge, MA.
Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003–Short Papers–Volume 2, pages 4–6. Association for Computational Linguistics.
Geert Booij. 2011. Morpheme structure constraints. In The Blackwell Companion to Phonology. Blackwell.
Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1:97–138.
Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York, NY.
George N. Clements and Elizabeth V. Hume. 1995. The internal organization of speech sounds.
In The Hand- book of Phonological Theory, pages 24–306. Black- well, Oxford. George N. Clements. 1985. The geometry of phonologi- cal features. Phonology Yearbook, 2:225–252. John Coleman and Janet B. Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. In John Coleman, editor, Proceedings of the 3rd Meet- ing of the ACL Special Interest Group in Computa- tional Phonology, pages 49–56, Somserset, NJ. Asso- ciation for Computational Linguistics. Robert Daland, Bruce Hayes, James White, Marc Garellek, Andreas Davis, and Ingrid Normann. 2011. Explaining sonority projection effects. Phonology, 28:197–234. Carl de Marcken. 1996. Linguistic structure as com- position and perturbation. In Proceedings of the 34th annual meeting on Association for Computational Lin- guistics, pages 335–341. Association for Computa- tional Linguistics. B. Elan Dresher. 2009. The Contrastive Hierarchy in Phonology. Cambridge University Press. Kevin Duh and Katrin Kirchhoff. 2004. Automatic learning of language model structure. In Proceedings of COLING 2004, Geneva, Switzerland. Thomas S. Ferguson. 1973. A bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230. John Goldsmith and Jason Riggle. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural langauge and linguistic the- ory, 30(3):859–96. Sharon Goldwater, Thomas L. Griffiths, and Mark John- son. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neu- ral Information Processing Systems, volume 18, pages 459–466, Cambridge, MA. MIT Press. Sharon Goldwater. 2006. Nonparametric Bayesian Mod- els of Lexical Acquisition. Ph.D. thesis, Brown Uni- versity. Irving John Good. 1965. The Estimation of Probabili- ties. MIT Press, Cambridge, MA. Noah D. Goodman, Vikash K. Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: A language for generative models. In Un- certainty in Artificial Intelligence, Helsinki, Finland. AUAI Press. Peter Graff. 2012. Communicative Efficiency in the Lexi- con. Ph.D. thesis, Massachusetts Institute of Technol- ogy. Morris Halle. 1959. The Sound Pattern of Russian: A linguistic and acoustical investigation. Mouton, The Hague, The Netherlands. Morris Halle. 1995. Feature geometry and feature spreading. Linguistic Inquiry, 26(1):1–46. Bruce Hayes and Colin Wilson. 2008. A maximum en- tropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440. Bruce Hayes. 2012. Blick - a phonotactic probability calculator. Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner. 2011. Tier-based strictly local constraints for phonol- ogy. In The 49th Annual Meeting of the Association for Computational Linguistics. Roman Jakobson, C. Gunnar M. Fant, and Morris Halle. 1952. Preliminaries to Speech Analysis: The Distinc- tive Features and their Correlates. The MIT Press, Cambridge, Massachusetts and London, England. Adam Jardine and Jeffrey Heinz. 2015. A concatenation operation to derive autosegmental graphs. In Proceed- ings of the 14th Meeting on the Mathematics of Lan- guage (MoL 2015), pages 139–151, Chicago, USA, July. 85 Ya Jin and Stuart Geman. 2006. Context and hierar- chy in a probabilistic image model. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVRP’06), pages 2145–2152. Mark Johnson and Sharon Goldwater. 2009. 
Improving nonparameteric bayesian inference: experiments on unsupervised word segmentation with adaptor gram- mars. In NAACL-HLT 2009, pages 317–325. Associa- tion for Computational Linguistics. Mark Johnson, Thomas L Griffiths, and Sharon Goldwa- ter. 2007. Adaptor grammars: A framework for speci- fying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Sys- tems, pages 641–648. Mary-Louise Kean. 1975. The Theory of Markedness in Generative Grammar. Ph.D. thesis, Massachusetts Institute of Technology. David J.C. MacKay and Linda C. Bauman Peto. 1994. A hierarchical Dirichlet language model. Natural Lan- guage Engineering, 1:1–19. John McCarthy. 1988. Feature geometry and depen- dency: A review. Phonetica, 43:84–108. Donald Michie. 1968. “memo” functions and machine learning. Nature, 218:19–22. Timothy J. O’Donnell. 2015. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press, Cambridge, Massachusetts and London, England. Jaye Padgett. 2011. Consonant-vowel place feature in- teractions. In The Blackwell Companion to Phonol- ogy, pages 1761–1786. Blackwell Publishing, Malden, MA. Ezer Rasin and Roni Katzir. 2014. A learnability argu- ment for constraints on underlying representations. In Proceedings of the 45th Annual Meeting of the North East Linguistic Society (NELS 45), Cambridge, Mas- sachusetts. Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650. Richard Stanley. 1967. Redundancy rules in phonology. Language, 43(2):393–436. Yee Whye Teh. 2006. A Bayesian interpretation of in- terpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singa- pore. Joshua B Tenenbaum and Thomas L Griffiths. 2001. The rational basis of representativeness. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, pages 1036–1041. Nikolai S. Trubetzkoy. 1939. Grundzüge der Phonolo- gie. Number 7 in Travaux du Cercle Linguistique de Prague. Vandenhoeck & Ruprecht, Göttingen. Frank Wood and Yee Whye Teh. 2009. A hierarchi- cal nonparametric Bayesian approach to statistical lan- guage model domain adaptation. In Artificial Intelli- gence and Statistics, pages 607–614. Frank Wood, Cédric Archambeau, Jan Gasthaus, James Lancelot, and Yee Whye Teh. 2009. A stochastic memoizer for sequence data. In Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada. 86