Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models Karl Stratos, Michael Collins∗ and Daniel Hsu Department of Computer Science, Columbia University {stratos, mcollins, djhsu}@cs.columbia.edu Abstract We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call an- chor HMMs, assume that each tag is associ- ated with at least one word that can have no other tag, which is a relatively benign con- dition for POS tagging (e.g., “the” is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong base- lines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg- Kirkpatrick et al. (2010). Furthermore, it pro- duces an interpretable model in which hidden states are automatically lexicalized by words. 1 Introduction Part-of-speech (POS) tagging without supervision is a quintessential problem in unsupervised learning for natural language processing (NLP). A major ap- plication of this task is reducing annotation cost: for instance, it can be used to produce rough syntactic annotations for a new language that has no labeled data, which can be subsequently refined by human annotators. Hidden Markov models (HMMs) are a natural choice of model and have been a workhorse for this problem. Early works estimated vanilla HMMs ∗Currently on leave at Google Inc. New York. with standard unsupervised learning methods such as the expectation-maximization (EM) algorithm, but it quickly became clear that they performed very poorly in inducing POS tags (Merialdo, 1994). Later works improved upon vanilla HMMs by incorporat- ing specific structures that are well-suited for the task, such as a sparse prior (Johnson, 2007) or a hard-clustering assumption (Brown et al., 1992). In this work, we tackle unsupervised POS tagging with HMMs whose structure is deliberately suitable for POS tagging. These HMMs impose an assump- tion that each hidden state is associated with an ob- servation state (“anchor word”) that can appear un- der no other state. For this reason, we denote this class of restricted HMMs by anchor HMMs. Such an assumption is relatively benign for POS tagging; it is reasonable to assume that each POS tag has at least one word that occurs only under that tag. For example, in English, “the” is an anchor word for the determiner tag; “laughed” is an anchor word for the verb tag. We build on the non-negative matrix factoriza- tion (NMF) framework of Arora et al. (2013) to de- rive a consistent estimator for anchor HMMs. We make several new contributions in the process. First, to our knowledge, there is no previous work di- rectly building on this framework to address unsu- pervised sequence labeling. Second, we generalize the NMF-based learning algorithm to obtain exten- sions that are important for empirical performance (Table 1). Third, we perform extensive experiments on unsupervised POS tagging and report competitive results against strong baselines such as the cluster- ing method of Brown et al. (1992) and the log-linear 245 Transactions of the Association for Computational Linguistics, vol. 4, pp. 245–257, 2016. Action Editor: Hinrich Schütze. Submission batch: 1/2016; Revision batch: 3/2016; Published 6/2016. 
c©2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. model of Berg-Kirkpatrick et al. (2010). One characteristic of the approach is the imme- diate interpretability of inferred hidden states. Be- cause each hidden state is associated with an obser- vation, we can examine the set of such anchor obser- vations to qualitatively evaluate the learned model. In our experiments on POS tagging, we find that an- chor observations correspond to possible POS tags across different languages (Table 3). This property can be useful when we wish to develop a tagger for a new language that has no labeled data; we can label only the anchor words to achieve a complete label- ing of the data. This paper is structured as follows. In Section 2, we establish the notation we use throughout. In Sec- tion 3, we define the model family of anchor HMMs. In Section 4, we derive a matrix decomposition al- gorithm for estimating the parameters of an anchor HMM. In Section 5, we present our experiments on unsupervised POS tagging. In Section 6, we discuss related works. 2 Notation We use [n] to denote the set of integers {1, . . . ,n}. We use E[X] to denote the expected value of a ran- dom variable X. We define ∆m−1 := {v ∈ Rm : vi ≥ 0 ∀i, ∑ i vi = 1}, i.e., the (m−1)-dimensional probability simplex. Given a vector v ∈ Rm, we use diag(v) ∈ Rm×m to denote the diagonal matrix with v1 . . .vm on the main diagonal. Given a matrix M ∈ Rn×m, we write Mi ∈ Rm to denote the i-th row of M (as a column vector). 3 The Anchor Hidden Markov Model Definition 3.1. An anchor HMM (A-HMM) is a 6- tuple (n,m,π,t,o,A) for positive integers n,m and functions π,t,o,A where • [n] is a set of observation states. • [m] is a set of hidden states. • π(h) is the probability of generating h ∈ [m] in the first position of a sequence. • t(h′|h) is the probability of generating h′ ∈ [m] given h ∈ [m]. • o(x|h) is the probability of generating x ∈ [n] given h ∈ [m]. • A(h) := {x ∈ [n] : o(x|h) > 0 ∧ o(x|h′) = 0 ∀h′ 6= h} is non-empty for each h ∈ [m]. In other words, an A-HMM is an HMM in which each hidden state h is associated with at least one “anchor” observation state that can be generated by, and only by, h. Note that the anchor condition im- plies n ≥ m. An equivalent definition of an A-HMM is given by organizing the parameters in matrix form. Under this definition, an A-HMM has parameters (π,T,O) where π ∈ Rm is a vector and T ∈ Rm×m,O ∈ Rn×m are matrices whose entries are set to: • πh = π(h) for h ∈ [m] • Th′,h = t(h′|h) for h,h′ ∈ [m] • Ox,h = o(x|h) for h ∈ [m],x ∈ [n] The anchor condition implies that rank(O) = m. To see this, consider the rows Oa1 . . .Oam where ah ∈ A(h). Since each Oah has a single non-zero entry at the h-th index, the columns of O are linearly independent. We assume rank(T) = m. An important special case of A-HMM introduced by Brown et al. (1992) is defined below. Under this A-HMM, every observation state is an anchor of some hidden state. Definition 3.2. A Brown model is an A-HMM in which A(1) . . .A(m) partition [n]. 4 Parameter Estimation for A-HMMs We now derive an algorithm for learning A-HMMs. The algorithm reduces the learning problem to an instance of NMF from which the model parameters can be computed in closed-form. 4.1 NMF We give a brief review of the NMF method of Arora et al. (2013). (Exact) NMF is the following problem: given an n × d matrix A = BC where B ∈ Rn×m and C ∈ Rm×d have non-negativity constraints, re- cover B and C. 
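As a concrete illustration of the matrix form above, the following minimal sketch (not part of the original paper) builds a toy A-HMM with n = 5 observation states and m = 2 hidden states in NumPy and samples from it; the dimensions, parameter values, and the helper name sample_sequence are chosen purely for illustration.

```python
import numpy as np

# Toy A-HMM: columns of O are emission distributions o(.|h), columns of T are t(.|h).
pi = np.array([0.6, 0.4])                  # pi[h] = P(H_1 = h)
T = np.array([[0.7, 0.4],                  # T[h', h] = t(h' | h)
              [0.3, 0.6]])
O = np.array([[0.5, 0.0],                  # observation 0: anchor for state 0
              [0.2, 0.1],
              [0.3, 0.2],
              [0.0, 0.4],                  # observation 3: anchor for state 1
              [0.0, 0.3]])                 # O[x, h] = o(x | h)

def sample_sequence(pi, T, O, length, rng=np.random.default_rng(0)):
    """Draw (x_1..x_N, h_1..h_N) from the HMM parameterized by (pi, T, O)."""
    m, n = T.shape[0], O.shape[0]
    h = rng.choice(m, p=pi)
    xs, hs = [], []
    for _ in range(length):
        hs.append(h)
        xs.append(rng.choice(n, p=O[:, h]))
        h = rng.choice(m, p=T[:, h])
    return xs, hs

# The anchor condition: each hidden state has an observation whose row of O
# has a single nonzero entry.
anchors = [x for x in range(O.shape[0]) if np.count_nonzero(O[x]) == 1]
print(anchors)  # -> [0, 3, 4]
```

In this toy model observations 0 and 3 (and incidentally 4) place probability mass under exactly one hidden state, so the anchor condition of Definition 3.1 is satisfied.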
This problem is NP-hard in general (Vavasis, 2009), but Arora et al. (2013) provide an exact and efficient method when A has the follow- ing special structure: Condition 4.1. A matrix A ∈ Rn×d satisfies this condition if A = BC for B ∈ Rn×m and C ∈ Rm×d where 246 Anchor-NMF Input: A ∈ Rn×d satisfying Condition 4.1 with A = BC for some B ∈ Rn×m and C ∈ Rm×d, value m • For h = 1 . . .m, find a vertex ah as U ← Gram-Schmidt({Aal}h−1l=1 ) ah ← arg max x∈[n] ∣∣∣∣Ax −UU>Ax ∣∣∣∣ 2 where Gram-Schmidt({Aal}h−1l=1 ) is the Gram- Schmidt process that orthonormalizes {Aal}h−1l=1 . • For x = 1 . . .n, recover the x-th row of B as Bx ← arg min u∈∆m−1 ∣∣∣∣∣ ∣∣∣∣∣Ax − m∑ h=1 uhAah ∣∣∣∣∣ ∣∣∣∣∣ 2 (1) • Set C = [Aa1 . . .Aam ]>. Output: B and C such that B>h = B > ρ(h) and Ch = Cρ(h) for some permutation ρ : [m] → [m] Figure 1: Non-negative matrix factorization algorithm of Arora et al. (2012). 1. For each x ∈ [n], Bx ∈ ∆m−1. I.e., each row of B defines a probability distribution over [m]. 2. For each h ∈ [m], there is some ah ∈ [n] such that Bah,h = 1 and Bah,h′ = 0 for all h ′ 6= h. 3. rank(C) = m. Since rank(B) = rank(C) = m (by property 2 and 3), the matrix A must have rank m. Note that by property 1, each row of A is given by a convex com- bination of the rows of C: for x ∈ [n], Ax = m∑ h=1 Bx,h ×Ch Furthermore, by property 2 each h ∈ [m] has an as- sociated row ah ∈ [n] such that Aah = Cah . These properties can be exploited to recover B and C. A concrete algorithm for factorizing a matrix sat- isfying Condition 4.1 is given in Figure 1 (Arora et al., 2013). It first identifies a1 . . .am (up to some permutation) by greedily locating the row of A furthest away from the subspace spanned by the vertices selected so far. Then it recovers each Bx as the convex coefficients required to combine Aa1 . . .Aam to yield Ax. The latter computation (1) can be achieved with any constrained optimization method; we use the Frank-Wolfe algorithm (Frank and Wolfe, 1956). See Arora et al. (2013) for a proof of the correctness of this algorithm. 4.2 Random Variables To derive our algorithm, we make use of cer- tain random variables under the A-HMM. Let (X1, . . . ,XN ) ∈ [n]N be a random sequence of N ≥ 2 observations drawn from the model, along with the corresponding hidden state sequence (H1, . . . ,HN ) ∈ [m]N ; independently, pick a posi- tion I ∈ [N − 1] uniformly at random. Let YI ∈ Rd be a d-dimensional vector which is conditionally in- dependent of XI given HI, i.e., P(YI|HI,XI) = P(YI|HI). We will provide how to define such a variable in Section 4.4.1: YI will be a function of (X1, . . . ,XN ) serving as a “context” representation of XI revealing the hidden state HI. Define unigram probabilities u∞,u1 ∈ Rn and bigram probabilities B ∈ Rn×n where u∞x := P(XI = x) ∀x ∈ [n] u1x := P(XI = x|I = 1) ∀x ∈ [n] Bx,x′ := P(XI = x,XI+1 = x′) ∀x,x′ ∈ [n] Additionally, define π̄ ∈ Rm where π̄h = P(HI = h) ∀h ∈ [m] (2) We assume π̄h > 0 for all h ∈ [m]. 4.3 Derivation of a Learning Algorithm The following proposition provides a way to use the NMF algorithm in Figure 1 to recover the emission parameters O up to scaling. Proposition 4.1. Let XI ∈ [n] and YI ∈ Rd be respectively an observation and a context vector drawn from the random process described in Sec- tion 4.2. Define a matrix Ω ∈ Rn×d with rows Ωx = E[YI|XI = x] ∀x ∈ [n] (3) If rank(Ω) = m, then Ω satisfies Condition 4.1: Ω = ÕΘ 247 where Õx,h = P(HI = h|XI = x) and Θh = E[YI|HI = h]. Proof. 
E[YI|XI = x] = m∑ h=1 P(HI = h|XI = x) × E[YI|HI = h,XI = x] = m∑ h=1 P(HI = h|XI = x) × E[YI|HI = h] The last equality follows by the conditional inde- pendence of YI. This shows Ω = ÕΘ. By the an- chor assumption of the A-HMM, each h ∈ [m] has at least one x ∈ A(h) such that P(HI = h|XI = x) = 1, thus Ω satisfies Condition 4.1. A useful interpretation of Ω in Proposition 4.1 is that its rows Ω1 . . . Ωn are d-dimensional vec- tor representations of observation states forming a convex hull in Rd. This convex hull has m ver- tices Ωa1 . . . Ωam corresponding to anchors ah ∈ A(h) which can be convexly combined to realize all Ω1 . . . Ωn. Given Õ, we can recover the A-HMM parameters as follows. First, we recover the stationary state dis- tribution π̄ as: π̄h = ∑ x∈[n] P(HI = h|XI = x) ×P(XI = x) = ∑ x∈[n] Õx,h ×u∞x The emission parameters O are given by Bayes’ the- orem: Ox,h = P(HI = h|XI = x) ×P(XI = x)∑ x∈[n] P(HI = h|XI = x) ×P(XI = x) = Õx,h ×u∞x π̄h Using the fact that the emission probabilities are position-independent, we see that the initial state distribution π satisfies u1 = Oπ: u1x = P(XI = x|I = 1) = ∑ h∈[m] P(XI = x|HI = h,I = 1) ×P(HI = h|I = 1) = ∑ h∈[m] Ox,h ×πh Learn-Anchor-HMM Input: Ω in Proposition 4.1, number of hidden states m, bigram probabilities B, unigram probabilities u∞,u1 • Compute (Õ, Θ) ← Anchor-NMF(Ω,m). • Recover the parameters: π̄ ← Õ>u∞ (4) O ← diag(π̄)−1diag(u∞)Õ (5) π = O+u1 (6) T ← (diag(π̄)−1O+B(O>)+)> (7) Output: A-HMM parameters (π,T,O) Figure 2: NMF-based learning algorithm for A-HMMs. The algorithm Anchor-NMF is given in Figure 1. Thus π can be recovered as π = O+u1 where O+ is the Moore–Penrose pseudoinverse of O. Fi- nally, it can be algebraically verified that B = Odiag(π̄)T>O> (Hsu et al., 2012). Since all the in- volved matrices have rank m, we can directly solve for T as T = (diag(π̄)−1O+B(O>)+)> Figure 2 shows the complete algorithm. As input, it receives a matrix Ω satisfying Proposition 4.1, the number of hidden states, and the probabilities of ob- served unigrams and bigrams. It first decomposes Ω using the NMF algorithm in Figure 1. Then it computes the A-HMM parameters whose solution is given analytically. The following theorem guarantees the consis- tency of the algorithm. Theorem 4.1. Let (π,T,O) be an A-HMM such that rank(T) = m and π̄ defined in (2) has strictly pos- itive entries π̄h > 0. Given random variables Ω satisfying Proposition 4.1 and B,u∞,u1 under this model, the algorithm Learn-Anchor-HMM in Fig- ure 2 outputs (π,T,O) up to a permutation on hid- den states. Proof. By Proposition 4.1, Ω satisfies Condition 4.1 with Ω = ÕΘ, thus Õ can be recovered up to a permutation on columns with the algorithm Anchor- NMF. The consistency of the recovered parameters follows from the correctness of (4–7) under the rank conditions. 248 4.3.1 Constrained Optimization for π and T Note that (6) and (7) require computing the pseu- doinverse of the estimated O, which can be expen- sive and vulnerable to sampling errors in practice. To make our parameter estimation more robust, we can explicitly impose probability constraints. We re- cover π by solving: π = arg min π′∈∆m−1 ∣∣∣∣u1 −Oπ′ ∣∣∣∣ 2 (8) which can again be done with algorithms such as Frank-Wolfe. We recover T by maximizing the log likelihood of observation bigrams ∑ x,x′ Bx,x′ log   ∑ h,h′∈[m] π̄hOx,hTh′,hOx′,h′   (9) subject to the constraint (T>)h ∈ ∆m−1. Since (9) is concave in T with other parameters O and π̄ fixed, we can use EM to find the global optimum. 
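To make the estimation procedure concrete, here is a compact NumPy sketch (not the authors' released implementation) of the algorithms in Figures 1 and 2: Anchor-NMF with a plain Frank-Wolfe inner loop for the simplex step (1), followed by the closed-form recovery (4)-(7). The emission update is written element-wise as O_{x,h} = Õ_{x,h} u∞_x / π̄_h, matching the Bayes-rule derivation above, and the simpler pseudoinverse forms are used in place of the constrained optimization of Section 4.3.1; the function names are illustrative only.

```python
import numpy as np

def anchor_nmf(A, m, fw_iters=200):
    """Figure 1: greedily pick m anchor rows of A, then write every row of A as a
    convex combination of the anchor rows (Frank-Wolfe for the simplex step (1))."""
    n, d = A.shape
    anchors, basis = [], np.zeros((0, d))    # basis: orthonormal rows spanning chosen anchors
    for _ in range(m):
        residual = A - A @ basis.T @ basis   # distance of each row from the current span
        a = int(np.argmax(np.linalg.norm(residual, axis=1)))
        anchors.append(a)
        v = A[a] - basis.T @ (basis @ A[a])  # Gram-Schmidt step
        basis = np.vstack([basis, v / np.linalg.norm(v)])
    C = A[anchors]                           # m x d matrix of anchor rows
    B = np.zeros((n, m))
    for x in range(n):                       # recover B_x on the probability simplex
        u = np.full(m, 1.0 / m)
        for k in range(fw_iters):
            grad = C @ (C.T @ u - A[x])
            s = np.zeros(m)
            s[int(np.argmin(grad))] = 1.0
            u += 2.0 / (k + 2) * (s - u)
        B[x] = u
    return B, C, anchors

def learn_anchor_hmm(Omega, m, Bigram, u_inf, u_1):
    """Figure 2: recover (pi, T, O) from Omega and the unigram/bigram probabilities.
    Uses the pseudoinverse forms (6)-(7); the constrained versions (8)-(9) can be
    substituted for robustness as described in Section 4.3.1."""
    O_tilde, _, anchors = anchor_nmf(Omega, m)
    pi_bar = O_tilde.T @ u_inf                                   # (4)
    O = (u_inf[:, None] * O_tilde) / pi_bar[None, :]             # (5): O[x,h] = u_inf[x] O_tilde[x,h] / pi_bar[h]
    O_pinv = np.linalg.pinv(O)
    pi = O_pinv @ u_1                                            # (6)
    T = (np.diag(1.0 / pi_bar) @ O_pinv @ Bigram @ np.linalg.pinv(O.T)).T  # (7)
    return pi, T, O, anchors
```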
4.4 Construction of the Convex Hull Ω In this section, we provide several ways to construct a convex hull Ω satisfying Proposition 4.1. 4.4.1 Choice of the Context YI In order to satisfy Proposition 4.1, we need to de- fine the context variable YI ∈ Rd with two proper- ties: • P(YI|HI,XI) = P(YI|HI) • The matrix Ω with rows Ωx = E[YI|XI = x] ∀x ∈ [n] has rank m. A simple construction (Arora et al., 2013) is given by defining YI ∈ Rn to be an indicator vector for the next observation: [YI]x′ = { 1 if XI+1 = x′ 0 otherwise (10) The first condition is satisfied since XI+1 does not depend on XI given HI. For the second condition, observe that Ωx,x′ = P(XI+1 = x′|XI = x), or in matrix form Ω = diag (u∞)−1 B (11) Under the rank conditions in Theorem 4.1, (11) has rank m. More generally, we can let YI be an observation (encoded as an indicator vector as in (10)) randomly drawn from a window of L ∈ N nearby observa- tions. We can either only use the identity of the cho- sen observation (in which case YI ∈ Rn) or addi- tionally indicate the relative position in the window (in which case YI ∈ RnL). It is straightforward to verify that the above two conditions are satisfied un- der these definitions. Clearly, (11) is a special case with L = 1. 4.4.2 Reducing the Dimension of Ωx With the definition of Ω in the previous section, the dimension of Ωx is d = O(n) which can be dif- ficult to work with when n � m. Proposition 4.1 allows us to reduce the dimension as long as the fi- nal matrix retains the form in (3) and has rank m. In particular, we can multiply Ω by any rank-m projec- tion matrix Π ∈ Rd×m on the right side: if Ω sat- isfies the properties in Proposition 4.1, then so does ΩΠ with m-dimensional rows (ΩΠ)x = E[YIΠ|XI = x] Since rank(Ω) = m, a natural choice of Π is the projection onto the best-fit m-dimensional subspace of the row space of Ω. We mention that previous works on the NMF- learning framework have employed various projec- tion methods, but they do not examine relative mer- its of their choices. For instance, Arora et al. (2013) simply use random projection, which is convenient for theoretical analysis. Cohen and Collins (2014) use a projection based on canonical correlation anal- ysis (CCA) without further exploration. In con- trast, we give a full comparison of valid construc- tion methods and find that the choice of Ω is crucial in practice. 4.4.3 Construction of Ω for the Brown Model We can formulate an alternative way to construct a valid Ω when the model is further restricted to be a Brown model. Since every observation is an anchor, Ox ∈ Rm has a single nonzero entry for every x. Thus the rows defined by Ωx = Ox/ ||Ox|| (an in- dicator vector for the unique hidden state of x) form 249 Input: bigram probabilities B, unigram probabilities u∞, number of hidden states m, construction method τ Scaled Matrices: ( √ · is element-wise) B := diag (u∞)−1/2 Bdiag (u∞)−1/2 B̃ := diag (√ u∞ )−1/2 √ Bdiag (√ u∞ )−1/2 Singular Vectors: U(M) (V (M)) is an n×m matrix of the left (right) singular vectors of M corresponding to the largest m singular values • If τ 6= brown: set Ω ← diag (u∞)−1 BΠ where the projection matrix Π ∈ Rn×m is given by Πi,j ∼N(0, 1/m) if τ = random Π = V (diag (u∞)−1 B) if τ = best-fit Π = diag (u∞)−1/2 V (B) if τ = cca • If τ = brown: compute the transformed emission matrix as f(O) = U(B̃) and set Ω ← diag(v)−1f(O) where vx := ||f(O)x||2 is the length of the x-th row of f(O). 
Output: Ω ∈ Rn×m in Proposition 4.1 Figure 3: Algorithm for constructing a valid Ω with dif- ferent construction methods. For simplicity, we only show the bigram construction (context size L = 1), but an extension for larger context (L > 1) is straightforward. a trivial convex hull in which every point is a ver- tex. This corresponds to choosing an oracle context YI ∈ Rm where [YI]h = { 1 if HI = h 0 otherwise It is possible to recover the Brown model param- eters O up to element-wise scaling and rotation of rows using the algorithm of Stratos et al. (2015). More specifically, let f(O) ∈ Rn×m denote the out- put of their algorithm. Then they show that for some vector s ∈ Rm with strictly positive entries and an orthogonal matrix Q ∈ Rm×m: f(O) = O〈1/4〉diag(s)Q> where O〈1/4〉 is an element-wise exponentiation of O by 1/4. Since the rows of f(O) are simply some scaling and rotation of the rows of O, using Ωx = f(O)x/ ||f(O)x|| yields a valid Ω. While we need to impose an additional assump- tion (the Brown model restriction) in order to justify this choice of Ω, we find in our experiments that it performs better than other alternatives. We specu- late that this is because a Brown model is rather ap- propriate for the POS tagging task; many words are indeed unambiguous with respect to POS tags (Ta- ble 5). Also, the general effectiveness of f(O) for representational purposes has been demostrated in previous works (Stratos et al., 2014; Stratos et al., 2015). By restricting the A-HMM to be a Brown model, we can piggyback on the proven effective- ness of f(O). Figure 3 shows an algorithm for constructing Ω with these different construction methods. For sim- plicity, we only show the bigram construction (con- text size L = 1), but an extension for larger context (L > 1) is straightforward as discussed earlier. The construction methods random (random projection), best-fit (projection to the best-fit subspace), and cca (CCA projection) all compute (11) and differ only in how the dimension is reduced. The construction method brown computes the transformed Brown pa- rameters f(O) as the left singular vectors of a scaled covariance matrix and then normalizes its rows. We direct the reader to Stratos et al. (2015) for a deriva- tion of this calculation. 4.4.4 Ω with Feature Augmentation The x-th row of Ω is a d-dimensional vector repre- sentation of x lying in a convex set with m vertices. This suggests a natural way to incorporate domain- specific features: we can add additional dimensions that provide information about hidden states from the surface form of x. For instance, consider the the POS tagging task. In the simple construction (11), the representation of word x is defined in terms of neighboring words x′: [Ωx]x′ = E [ 1 ( XI+1 = x ′) |XI = x ] where 1(·) ∈ {0, 1} is the indicator function. We can augment this vector with s additional dimen- 250 sions indicating the spelling features of x. For in- stance, the (n + 1)-th dimension may be defined as: [Ωx]n+1 = E [1 (x ends in “ing” ) |XI = x] This value will be generally large for verbs and small for non-verbs, nudging verbs closer together and away from non-verbs. The modified (n + s)- dimensional representation is followed by the usual dimension reduction. Note that the spelling features are a deterministic function of a word, and we are implicitly assuming that they are independent of the word given its tag. While this is of course not true in practice, we find that these features can significantly boost the tagging performance. 
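The sketch below (again not the paper's code release) instantiates the bigram (L = 1) constructions of Figure 3 in NumPy; the function name construct_omega and its arguments are invented for illustration. The feature augmentation of Section 4.4.4 would correspond to appending s extra, down-weighted columns to the representation before the dimension-reduction step.

```python
import numpy as np

def construct_omega(Bigram, u_inf, m, method="brown", rng=np.random.default_rng(0)):
    """Figure 3 (bigram construction, L = 1): build an n x m convex hull Omega from
    the bigram matrix B and the unigram vector u_inf with one of four methods."""
    n = len(u_inf)
    if method != "brown":
        base = np.diag(1.0 / u_inf) @ Bigram                 # Omega = diag(u_inf)^-1 B, eq. (11)
        if method == "random":
            Pi = rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))
        elif method == "best-fit":
            Pi = np.linalg.svd(base)[2][:m].T                # top-m right singular vectors of base
        else:                                                # "cca"
            scaled = np.diag(u_inf ** -0.5) @ Bigram @ np.diag(u_inf ** -0.5)
            Pi = np.diag(u_inf ** -0.5) @ np.linalg.svd(scaled)[2][:m].T
        return base @ Pi
    # "brown": top-m left singular vectors of the element-wise-sqrt scaled bigram
    # matrix (the transformed emission matrix f(O)), with rows normalized to unit length.
    scaled = np.diag(u_inf ** -0.25) @ np.sqrt(Bigram) @ np.diag(u_inf ** -0.25)
    fO = np.linalg.svd(scaled)[0][:, :m]
    return fO / np.linalg.norm(fO, axis=1, keepdims=True)
```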
5 Experiments We evaluate our A-HMM learning algorithm on the task of unsupervised POS tagging. The goal of this task is to induce the correct sequence of POS tags (hidden states) given a sequence of words (observa- tion states). The anchor condition corresponds to as- suming that each POS tag has at least one word that occurs only under that tag. 5.1 Background on Unsupervised POS Tagging Unsupervised POS tagging has long been an active area of research (Smith and Eisner, 2005a; John- son, 2007; Toutanova and Johnson, 2007; Haghighi and Klein, 2006; Berg-Kirkpatrick et al., 2010), but results on this task are complicated by vary- ing assumptions and unclear evaluation metrics (Christodoulopoulos et al., 2010). Rather than ad- dressing multiple alternatives for evaluating unsu- pervised POS tagging, we focus on a simple and widely used metric: many-to-one accuracy (i.e., we map each hidden state to the most frequently coin- ciding POS tag in the labeled data and compute the resulting accuracy). 5.1.1 Better Model v.s. Better Learning Vanilla HMMs are notorious for their mediocre performance on this task, and it is well known that they perform poorly largely because of model mis- specification, not because of suboptimal parameter estimation (e.g., because EM gets stuck in local op- tima). More generally, a large body of work points to the inappropriateness of simple generative mod- els for unsupervised induction of linguistic structure (Merialdo, 1994; Smith and Eisner, 2005b; Liang and Klein, 2008). Consequently, many works focus on using more expressive models such as log-linear models (Smith and Eisner, 2005a; Berg-Kirkpatrick et al., 2010) and Markov random fields (MRF) (Haghighi and Klein, 2006). These models are shown to deliver good performance even though learning is approxi- mate. Thus one may question the value of having a consistent estimator for A-HMMs and Brown mod- els in this work: if the model is wrong, what is the point of learning it accurately? However, there is also ample evidence that HMMs are competitive for unsupervised POS induc- tion when they incorporate domain-specific struc- tures. Johnson (2007) is able to outperform the sophisticated MRF model of Haghighi and Klein (2006) on one-to-one accuracy by using a sparse prior in HMM estimation. The clustering method of Brown et al. (1992) which is based on optimizing the likelihood under the Brown model (a special case of HMM) remains a baseline difficult to outperform (Christodoulopoulos et al., 2010). We add to this evidence by demonstrating the ef- fectiveness of A-HMMs on this task. We also check the anchor assumption on data and show that the A- HMM model structure is in fact appropriate for the problem (Table 5). 5.2 Experimental Setting We use the universal treebank dataset (version 2.0) which contains sentences annotated with 12 POS tag types for 10 languages (McDonald et al., 2013). All models are trained with 12 hidden states. We use the English portion to experiment with different hy- perparameter configurations. At test time, we fix a configuration (based on the English portion) and ap- ply it across all languages. The list of compared methods is given below: BW The Baum-Welch algorithm, an EM algorithm for HMMs (Baum and Petrie, 1966). CLUSTER A parameter estimation scheme for HMMs based on Brown clustering (Brown et al., 1992). We run the Brown clustering algorithm1 to obtain 12 word clusters C1 . . .C12. Then we set 1We use the implementation of Liang (2005). 
251 the emission parameters o(x|h), transition param- eters t(h′|h), and prior π(h) to be the maximum- likelihood estimates under the fixed clusters. ANCHOR Our algorithm Learn-Anchor-HMM in Figure 2 but with the constrained optimization (8) and (9) for estimating π and T .2 ANCHOR-FEATURES Same as ANCHOR but em- ploys the feature augmentation scheme described in Section 4.4.4. LOG-LINEAR The unsupervised log-linear model described in Berg-Kirkpatrick et al. (2010). Instead of emission parameters o(x|h), the model maintains a miniature log-linear model with a weight vector w and a feature function φ. The probability of a word x given tag h is computed as p(x|h) = exp(w >φ(x,h))∑ x∈[n] exp(w >φ(x,h)) The model can be trained by maximizing the like- lihood of observed sequences. We use L-BFGS to directly optimize this objective.3 This approach ob- tains the current state-of-the-art accuracy on fine- grained (45 tags) English WSJ dataset. We use maximum marginal decoding for HMM predictions: i.e., at each position, we predict the most likely tag given the entire sentence. 5.3 Practical Issues with the Anchor Algorithm In our experiments, we find that Anchor-NMF (Fig- ure 1) tends to propose extremely rare words as an- chors. A simple fix is to search for anchors only among relatively frequent words. We find that any reasonable frequency threshold works well; we use the 300 most frequent words. Note that this is not a problem if these 300 words include anchor words corresponding to all the 12 tags. We must define the context for constructing Ω. We use the previous and next words (i.e., context size L = 2) marked with relative positions. Thus Ω has 2n columns before dimension reduction. Table 1 shows the performance on the English portion with different construction methods for Ω. The Brown 2https://github.com/karlstratos/anchor 3We use the implementation of Berg-Kirkpatrick et al. (2010) (personal communication). Choice of Ω Accuracy Random 48.2 Best-Fit 53.4 CCA 57.0 Brown 66.1 Table 1: Many-to-one accuracy on the English data with different choices of the convex hull Ω (Figure 3). These results do not use spelling features. construction (τ = brown in Figure 3) clearly per- forms the best: essentially, the anchor algorithm is used to extract the HMM parameters from the CCA- based word embeddings of Stratos et al. (2015). We also explore feature augmentation discussed in Section 4.4.4. For comparison, we employ the same word features used by Berg-Kirkpatrick et al. (2010): • Indicators for whether a word is capitalized, contains a hyphen, or contains a digit • Suffixes of length 1, 2, and 3 We weigh the l2 norm of these extra dimensions in relation to the original dimensions: we find a small weight (e.g., 0.1 of the norm of the original dimensions) works well. We also find that these fea- tures can sometimes significantly improve the per- formance. For instance, the accuracy on the English portion can be improved from 66.1% to 71.4% with feature augmentation. Another natural experiment is to refine the HMM parameters obtained from the anchor algorithm (or Brown clusters) with a few iterations of the Baum- Welch algorithm. In our experiments, however, it did not significantly improve the tagging perfor- mance, so we omit this result. 5.4 Tagging Accuracy Table 2 shows the many-to-one accuracy on all lan- guages in the dataset. 
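Since all results below are reported in many-to-one accuracy, the following short sketch (not the evaluation code used in the paper) makes the state-to-tag mapping of Section 5.1 explicit; the function name is illustrative.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(predicted_states, gold_tags):
    """Map each hidden state to the gold POS tag it most frequently coincides
    with, then score the resulting relabeling against the gold tags."""
    counts = defaultdict(Counter)
    for state, tag in zip(predicted_states, gold_tags):
        counts[state][tag] += 1
    mapping = {state: tag_counts.most_common(1)[0][0]
               for state, tag_counts in counts.items()}
    correct = sum(mapping[s] == t for s, t in zip(predicted_states, gold_tags))
    return correct / len(gold_tags)

# e.g. many_to_one_accuracy([3, 3, 7, 1], ["DET", "DET", "NOUN", "VERB"]) == 1.0
```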
For the Baum-Welch algo- rithm and the unsupervised log-linear models, we report the mean and the standard deviation (in paren- theses) of 10 random restarts run for 1,000 itera- tions. Both ANCHOR and ANCHOR-FEATURES compete favorably. On 5 out of 10 languages, ANCHOR- FEATURES achieves the highest accuracy, often 252 Model de en es fr id it ja ko pt-br sv BW (4.8) 45.5 (3.4) 59.8 (2.2) 60.6 (3.6) 60.1 (3.1) 49.6 (2.6) 51.5 (2.1) 59.5 (0.6) 51.7 (3.7) 59.5 (3.0) 42.4 CLUSTER 60.0 62.9 67.4 66.4 59.3 66.1 60.3 47.5 67.4 61.9 ANCHOR 61.1 66.1 69.0 68.2 63.7 60.4 65.3 53.8 64.9 51.1 ANCHOR-FEATURES 63.4 71.4 74.3 71.9 67.3 60.2 69.4 61.8 65.8 61.0 LOG-LINEAR (1.8) 67.5 (3.5) 62.4 (3.1) 67.1 (4.5) 62.1 (3.9) 61.3 (2.9) 52.9 (2.9) 78.2 (3.6) 60.5 (2.2) 63.2 (2.5) 56.7 Table 2: Many-to-one accuracy on each language using 12 universal tags. The first four models are HMMs estimated with the Baum-Welch algorithm (BW), the clustering algorithm of Brown et al. (1992), the anchor algorithm without (ANCHOR) and with (ANCHOR-FEATURES) feature augmentation. LOG-LINEAR is the model of Berg-Kirkpatrick et al. (2010) trained with the direct-gradient method using L-BFGS. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations. closely followed by ANCHOR. The Brown clustering estimation is also competitive and has the highest accuracy on 3 languages. Not surprisingly, vanilla HMMs trained with BW perform the worst (see Sec- tion 5.1.1 for a discussion). LOG-LINEAR is a robust baseline and performs the best on the remaining 2 languages. It per- forms especially strongly on Japanese and Korean datasets in which poorly segmented strings such as “1950年11月5日には” (on November 5, 1950) and “40.3%로” (by 40.3%) abound. In these datasets, it is crucial to make effective use of morphological features. 5.5 Qualitative Analysis 5.5.1 A-HMM Parameters An A-HMM can be easily interpreted since each hidden state is marked with an anchor observation. Table 3 shows the 12 anchors found in each lan- guage. Note that these anchor words generally have a wide coverage of possible POS tags. We also experimented with using true anchor words (obtained from labeled data), but they did not improve performance over automatically induced anchors. Since anchor discovery is inherently tied to parameter estimation, it is better to obtain anchors in a data-driven manner. In particular, certain POS tags (e.g., X) appear quite infrequently, and the model is worse off by being forced to allocate a hidden state for such a tag. Table 4 shows words with highest emission prob- abilities o(x|h) under each anchor. We observe that an anchor is representative of a certain group of words. For instance, the state “loss” rep- resents noun-like words, “1” represents numbers, “on” represents preposition-like words, “one” rep- resents determiner-like words, and “closed” repre- sents verb-like words. The conditional distribution is peaked for anchors that represent function tags (e.g., determiners, punctuation) and flat for anchors that represent content tags (e.g., nouns). Occasion- ally, an anchor assigns high probabilities to words that do not seem to belong to the corresponding POS tag. But this is to be expected since o(x|h) ∝ P(XI = x) is generally larger for frequent words. 5.5.2 Model Assumptions on Data Table 5 checks the assumptions in A-HMMs and Brown models on the universal treebank dataset. 
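A check of this kind can be reproduced directly from any tagged corpus; the sketch below (not the authors' code) computes the fraction of anchored tags and of unambiguous word types, where counting word types rather than tokens is an assumption of this sketch.

```python
from collections import defaultdict

def check_assumptions(tagged_corpus):
    """Given a corpus as a list of (word, tag) pairs, report (i) the percentage of
    tags with at least one anchor word (a word type seen only under that tag) and
    (ii) the percentage of word types that are unambiguous (seen under one tag)."""
    tags_of_word = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_of_word[word].add(tag)
    all_tags = {tag for _, tag in tagged_corpus}
    anchored_tags = {next(iter(ts)) for ts in tags_of_word.values() if len(ts) == 1}
    pct_anchored = 100.0 * len(anchored_tags) / len(all_tags)
    pct_unambiguous = 100.0 * sum(len(ts) == 1 for ts in tags_of_word.values()) / len(tags_of_word)
    return pct_anchored, pct_unambiguous
```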
The anchor assumption is indeed satisfied with 12 universal tags: in every language, each tag has at least one word uniquely associated with the tag. The Brown assumption (each word has exactly one possible tag) is of course not satisfied, since some words are genuinely ambiguous with respect to their POS tags. However, the percentage of unambigu- ous words is very high (well over 90%). This analy- sis supports that the model assumptions made by A- HMMs and Brown models are appropriate for POS tagging. Table 6 reports the log likelihood (normalized by the number of words) on the English portion of different estimation methods for HMMs. BW and CLUSTER obtain higher likelihood than the anchor algorithm, but this is expected given that both EM 253 de en es fr id it ja ko pt-br sv empfehlen loss y avait bulan radar お世話に 완전 E och wie 1 hizo commune tetapi però ないと 중에 de bör ; on - Le wilayah sulle ことにより 경우 partida grund Sein one especie de - - されている。 줄 fazer mellan Berlin closed Además président Bagaimana Stati ものを 같아요 meses i und are el qui , Lo , 많은 os sociala , take paı́ses ( sama legge した , : . - , la à . al それは 볼 diretor bli der vice España États dan far- 、 자신의 2010 den im to en Unis Utara di 幸福の 받고 , , des York de Cette pada la ことが 맛있는 uma tid Region Japan municipio quelques yang art. 通常の 위한 O Detta Table 3: Anchor words found in each language (model ANCHOR-FEATURES). loss year (.02) market (.01) share (.01) company (.01) stock (.01) quarter (.01) shares (.01) price (.01) 1 1 (.03) 10 (.02) 30 (.02) 15 (.02) 8 (.02) 2 (.01) 20 (.01) 50 (.01) on of (.14) in (.12) . (.08) for (.06) on (.04) by (.04) from (.04) and (.03) one the (.23) a (.12) “ (.03) an (.03) $ (.02) its (.02) that (.02) this (.02) closed said (.05) ’s (.02) is (.02) says (.02) was (.01) has (.01) had (.01) expected (.01) are and (.08) is (.08) are (.05) was (.04) ’s (.04) “ (.04) has (.03) of (.03) take be (.04) % (.02) have (.02) million (.02) But (.02) do (.01) The (.01) make (.01) , , (.53) . (.25) and (.05) ” (.04) % (.01) million (.01) – (.01) that (.01) vice ’s (.03) The (.02) “ (.02) New (.01) and (.01) new (.01) first (.01) chief (.01) to to (.39) . (.11) a (.06) will (.04) $ (.03) n’t (.03) would (.02) % (.02) York the (.15) a (.05) The (.04) of (.04) ’s (.04) million (.01) % (.01) its (.01) Japan Mr. (.03) it (.02) ” (.02) $ (.02) he (.02) that (.02) which (.01) company (.01) Table 4: Most likely words under each anchor word (English model ANCHOR-FEATURES). Emission probabilities o(x|h) are given in parentheses. and Brown clustering directly optimize likelihood. In contrast, the anchor algorithm is based on the method of moments and does not (at least directly) optimize likelihood. Note that high likelihood does not imply high accuracy under HMMs. 6 Related Work 6.1 Latent-Variable Models There has recently been great progress in estima- tion of models with latent variables. Despite the NP-hardness in general cases (Terwijn, 2002; Arora et al., 2012), many algorithms with strong theoreti- cal guarantees have emerged under natural assump- tions. For example, for HMMs with full-rank condi- tions, Hsu et al. (2012) derive a consistent estimator of the marginal distribution of observed sequences. Anandkumar et al. (2014) propose an exact tensor decomposition method for learning a wide class of latent variable models with similar non-degeneracy conditions. Arora et al. (2013) derive a provably cor- rect learning algorithm for topic models with a cer- tain parameter structure. 
The anchor-based framework has been originally formulated for learning topic models (Arora et al., 2013). It has been subsequently adopted to learn other models such as latent-variable probabilistic context-free grammars (Cohen and Collins, 2014). In our work, we have extended this framework to address unsupervised sequence labeling. Zhou et al. (2014) also extend Arora et al. (2013)’s framework to learn various models includ- ing HMMs, but they address a more general prob- lem. Consequently, their algorithm draws from Anandkumar et al. (2012) and is substantially dif- ferent from ours. 6.2 Unsupervised POS Tagging Unsupervised POS tagging is a classic problem in unsupervised learning that has been tackled with various approaches. Johnson (2007) observes that 254 de en es fr id it ja ko pt-br sv % anchored tags 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 % unambig words 96.6 91.5 94.0 94.2 94.8 94.8 99.5 98.4 94.8 97.4 . VERB PRON ADP NOUN ADV CONJ DET NUM ADJ X PRT , said it from Mr. n’t or which billion new bono na 53928 6339 5147 4856 4436 3582 2748 2458 1935 1542 8 5 Table 5: Verifying model assumptions on the universal treebank. The anchor assumption is satisfied in every language. The Brown assumption (each word has exactly one possible tag) is violated but not by a large margin. The lower table shows the most frequent anchor word and its count under each tag on the English portion. Model Normalized LL Acc BW -6.45 59.8 CLUSTER -6.71 62.9 ANCHOR -7.06 66.1 ANCHOR-FEATURES -7.05 71.4 Table 6: Log likelihood normalized by the number of words on English (along with accuracy). For BW, we report the mean of 10 random restarts run for 1,000 it- erations. EM performs poorly in this task because it induces flat distributions; this is not the case with our algo- rithm as seen in the peaky distributions in Table 4. Haghighi and Klein (2006) assume a set of proto- typical words for each tag and report high accuracy. In contrast, our algorithm automatically finds such prototypes in a subroutine. Berg-Kirkpatrick et al. (2010) achieve the state- of-the-art result in unsupervised fine-grained POS tagging (mid-70%). As described in Section 5.2, their model is an HMM in which probabilties are given by log-linear models. Table 7 provides a point of reference comparing our work with Berg- Kirkpatrick et al. (2010) in their setting: models are trained and tested on the entire 45-tag WSJ dataset. Their model outperforms our approach in this set- ting: with fine-grained tags, spelling features be- come more important, for instance to distinguish “played” (VBD) from “play” (VBZ). Nonetheless, we have shown that our approach is competitive when universal tags are used (Table 2). Many past works on POS induction predate the introduction of the universal tagset by Petrov et al. (2012) and thus report results with fine-grained tags. More recent works adopt the universal tagset but Models Accuracy BW 62.6 (1.1) CLUSTER 65.6 ANCHOR 67.2 ANCHOR-FEATURES 67.7 LOG-LINEAR 74.9 (1.5) Table 7: Many-to-one accuracy on the English data with 45 original tags. We use the same setting as in Table 2. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations. they leverage additional resources. For instance, Das and Petrov (2011) and Täckström et al. (2013) use parallel data to project POS tags from a supervised source language. Li et al. (2012) use tag dictionar- ies built from Wiktionary. 
Thus their results are not directly comparable to ours.4 7 Conclusion We have presented an exact estimation method for learning anchor HMMs from unlabeled data. There are several directions for future work. An important direction is to extend the method to a richer family of models such as log-linear models or neural net- works. Another direction is to further generalize the method to handle a wider class of HMMs by relax- ing the anchor condition (Condition 4.1). This will require a significant extension of the NMF algorithm in Figure 1. 4Das and Petrov (2011) conduct unsupervised experiments using the model of Berg-Kirkpatrick et al. (2010), but their dataset and evaluation method differ from ours. 255 Acknowledgments We thank Taylor Berg-Kirkpatrick for providing the implementation of Berg-Kirkpatrick et al. (2010). We also thank anonymous reviewers for their con- structive comments. References Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. 2012. A method of moments for mixture models and hidden Markov models. In Twenty-Fifth Annual Conference on Learning Theory. Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Ten- sor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773– 2832. Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models–going beyond SVD. In Foun- dations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE. Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learn- ing (ICML-13), pages 280–288. Leonard E. Baum and Ted Petrie. 1966. Statisti- cal inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statis- tics, 37(6):1554–1563. Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsu- pervised learning with features. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Com- putational Linguistics, pages 582–590. Association for Computational Linguistics. Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vin- cent J. Della Pietra, and Jenifer C. Lai. 1992. Class- based n-gram models of natural language. Computa- tional Linguistics, 18(4):467–479. Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceed- ings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Asso- ciation for Computational Linguistics. Shay B. Cohen and Michael Collins. 2014. A provably correct learning algorithm for latent-variable PCFGs. In Proceedings of ACL. Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based pro- jections. In Proceedings of the 49th Annual Meet- ing of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 600– 609. Association for Computational Linguistics. Marguerite Frank and Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110. Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. 
In Proceedings of the main conference on Human Language Technol- ogy Conference of the North American Chapter of the Association of Computational Linguistics, pages 320– 327. Association for Computational Linguistics. Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2012. A spectral algorithm for learning hidden Markov mod- els. Journal of Computer and System Sciences, 78(5):1460–1480. Mark Johnson. 2007. Why doesn’t EM find good HMM POS-taggers? In EMNLP-CoNLL, pages 296–305. Shen Li, Joao V. Graça, and Ben Taskar. 2012. Wiki- ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1389–1398. Asso- ciation for Computational Linguistics. Percy Liang and Dan Klein. 2008. Analyzing the errors of unsupervised learning. In ACL, pages 879–887. Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology. Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach- Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Os- car Täckström, Claudia Bedini, Núria B. Castelló, and Jungmee Lee. 2013. Universal dependency annota- tion for multilingual parsing. In ACL, pages 92–97. Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171. Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Re- sources and Evaluation (LREC’12). Noah A. Smith and Jason Eisner. 2005a. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 354–362. Association for Computational Linguistics. Noah A. Smith and Jason Eisner. 2005b. Guiding un- supervised grammar induction using contrastive esti- mation. In Proc. of IJCAI Workshop on Grammatical Inference Applications, pages 73–82. 256 Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu. 2014. A spectral algorithm for learning class- based n-gram models of natural language. In Proceed- ings of the thirtieth conference on Uncertainty in Arti- ficial Intelligence. Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguis- tics and the 7th International Joint Conference on Nat- ural Language Processing (Volume 1: Long Papers), pages 1282–1291, Beijing, China, July. Association for Computational Linguistics. Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan Mc- Donald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12. Sebastiaan A. Terwijn. 2002. On the learnability of hid- den Markov models. In Grammatical Inference: Al- gorithms and Applications, pages 261–268. Springer. Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi-supervised part- of-speech tagging. In Advances in Neural Information Processing Systems, pages 1521–1528. Stephen A. Vavasis. 2009. On the complexity of nonneg- ative matrix factorization. SIAM Journal on Optimiza- tion, 20(3):1364–1377. Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. 2014. Divide-and-conquer learning by anchoring a conical hull. 
In Advances in Neural Information Processing Systems, pages 1242–1250.