Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models Karl Stratos, Michael Collins∗ and Daniel Hsu Department of Computer Science, Columbia University {stratos, mcollins, djhsu}@cs.columbia.edu Abstract We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call an- chor HMMs, assume that each tag is associ- ated with at least one word that can have no other tag, which is a relatively benign con- dition for POS tagging (e.g., “the” is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong base- lines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg- Kirkpatrick et al. (2010). Furthermore, it pro- duces an interpretable model in which hidden states are automatically lexicalized by words. 1 Introduction Part-of-speech (POS) tagging without supervision is a quintessential problem in unsupervised learning for natural language processing (NLP). A major ap- plication of this task is reducing annotation cost: for instance, it can be used to produce rough syntactic annotations for a new language that has no labeled data, which can be subsequently refined by human annotators. Hidden Markov models (HMMs) are a natural choice of model and have been a workhorse for this problem. Early works estimated vanilla HMMs ∗Currently on leave at Google Inc. New York. with standard unsupervised learning methods such as the expectation-maximization (EM) algorithm, but it quickly became clear that they performed very poorly in inducing POS tags (Merialdo, 1994). Later works improved upon vanilla HMMs by incorporat- ing specific structures that are well-suited for the task, such as a sparse prior (Johnson, 2007) or a hard-clustering assumption (Brown et al., 1992). In this work, we tackle unsupervised POS tagging with HMMs whose structure is deliberately suitable for POS tagging. These HMMs impose an assump- tion that each hidden state is associated with an ob- servation state (“anchor word”) that can appear un- der no other state. For this reason, we denote this class of restricted HMMs by anchor HMMs. Such an assumption is relatively benign for POS tagging; it is reasonable to assume that each POS tag has at least one word that occurs only under that tag. For example, in English, “the” is an anchor word for the determiner tag; “laughed” is an anchor word for the verb tag. We build on the non-negative matrix factoriza- tion (NMF) framework of Arora et al. (2013) to de- rive a consistent estimator for anchor HMMs. We make several new contributions in the process. First, to our knowledge, there is no previous work di- rectly building on this framework to address unsu- pervised sequence labeling. Second, we generalize the NMF-based learning algorithm to obtain exten- sions that are important for empirical performance (Table 1). Third, we perform extensive experiments on unsupervised POS tagging and report competitive results against strong baselines such as the cluster- ing method of Brown et al. (1992) and the log-linear 245 Transactions of the Association for Computational Linguistics, vol. 4, pp. 245–257, 2016. Action Editor: Hinrich Schütze. Submission batch: 1/2016; Revision batch: 3/2016; Published 6/2016. 
c©2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. model of Berg-Kirkpatrick et al. (2010). One characteristic of the approach is the imme- diate interpretability of inferred hidden states. Be- cause each hidden state is associated with an obser- vation, we can examine the set of such anchor obser- vations to qualitatively evaluate the learned model. In our experiments on POS tagging, we find that an- chor observations correspond to possible POS tags across different languages (Table 3). This property can be useful when we wish to develop a tagger for a new language that has no labeled data; we can label only the anchor words to achieve a complete label- ing of the data. This paper is structured as follows. In Section 2, we establish the notation we use throughout. In Sec- tion 3, we define the model family of anchor HMMs. In Section 4, we derive a matrix decomposition al- gorithm for estimating the parameters of an anchor HMM. In Section 5, we present our experiments on unsupervised POS tagging. In Section 6, we discuss related works. 2 Notation We use [n] to denote the set of integers {1, . . . ,n}. We use E[X] to denote the expected value of a ran- dom variable X. We define ∆m−1 := {v ∈ Rm : vi ≥ 0 ∀i, ∑ i vi = 1}, i.e., the (m−1)-dimensional probability simplex. Given a vector v ∈ Rm, we use diag(v) ∈ Rm×m to denote the diagonal matrix with v1 . . .vm on the main diagonal. Given a matrix M ∈ Rn×m, we write Mi ∈ Rm to denote the i-th row of M (as a column vector). 3 The Anchor Hidden Markov Model Definition 3.1. An anchor HMM (A-HMM) is a 6- tuple (n,m,π,t,o,A) for positive integers n,m and functions π,t,o,A where • [n] is a set of observation states. • [m] is a set of hidden states. • π(h) is the probability of generating h ∈ [m] in the first position of a sequence. • t(h′|h) is the probability of generating h′ ∈ [m] given h ∈ [m]. • o(x|h) is the probability of generating x ∈ [n] given h ∈ [m]. • A(h) := {x ∈ [n] : o(x|h) > 0 ∧ o(x|h′) = 0 ∀h′ 6= h} is non-empty for each h ∈ [m]. In other words, an A-HMM is an HMM in which each hidden state h is associated with at least one “anchor” observation state that can be generated by, and only by, h. Note that the anchor condition im- plies n ≥ m. An equivalent definition of an A-HMM is given by organizing the parameters in matrix form. Under this definition, an A-HMM has parameters (π,T,O) where π ∈ Rm is a vector and T ∈ Rm×m,O ∈ Rn×m are matrices whose entries are set to: • πh = π(h) for h ∈ [m] • Th′,h = t(h′|h) for h,h′ ∈ [m] • Ox,h = o(x|h) for h ∈ [m],x ∈ [n] The anchor condition implies that rank(O) = m. To see this, consider the rows Oa1 . . .Oam where ah ∈ A(h). Since each Oah has a single non-zero entry at the h-th index, the columns of O are linearly independent. We assume rank(T) = m. An important special case of A-HMM introduced by Brown et al. (1992) is defined below. Under this A-HMM, every observation state is an anchor of some hidden state. Definition 3.2. A Brown model is an A-HMM in which A(1) . . .A(m) partition [n]. 4 Parameter Estimation for A-HMMs We now derive an algorithm for learning A-HMMs. The algorithm reduces the learning problem to an instance of NMF from which the model parameters can be computed in closed-form. 4.1 NMF We give a brief review of the NMF method of Arora et al. (2013). (Exact) NMF is the following problem: given an n × d matrix A = BC where B ∈ Rn×m and C ∈ Rm×d have non-negativity constraints, re- cover B and C. 
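As a concrete illustration of the matrix form above, the following minimal sketch (not part of the original paper) builds a toy A-HMM with n = 5 observation states and m = 2 hidden states in NumPy and samples from it; the dimensions, parameter values, and the helper name sample_sequence are chosen purely for illustration.

```python
import numpy as np

# Toy A-HMM: columns of O are emission distributions o(.|h), columns of T are t(.|h).
pi = np.array([0.6, 0.4])                  # pi[h] = P(H_1 = h)
T = np.array([[0.7, 0.4],                  # T[h', h] = t(h' | h)
              [0.3, 0.6]])
O = np.array([[0.5, 0.0],                  # observation 0: anchor for state 0
              [0.2, 0.1],
              [0.3, 0.2],
              [0.0, 0.4],                  # observation 3: anchor for state 1
              [0.0, 0.3]])                 # O[x, h] = o(x | h)

def sample_sequence(pi, T, O, length, rng=np.random.default_rng(0)):
    """Draw (x_1..x_N, h_1..h_N) from the HMM parameterized by (pi, T, O)."""
    m, n = T.shape[0], O.shape[0]
    h = rng.choice(m, p=pi)
    xs, hs = [], []
    for _ in range(length):
        hs.append(h)
        xs.append(rng.choice(n, p=O[:, h]))
        h = rng.choice(m, p=T[:, h])
    return xs, hs

# The anchor condition: each hidden state has an observation whose row of O
# has a single nonzero entry.
anchors = [x for x in range(O.shape[0]) if np.count_nonzero(O[x]) == 1]
print(anchors)  # -> [0, 3, 4]
```

In this toy model observations 0 and 3 (and incidentally 4) place probability mass under exactly one hidden state, so the anchor condition of Definition 3.1 is satisfied.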
This problem is NP-hard in general (Vavasis, 2009), but Arora et al. (2013) provide an exact and efficient method when A has the follow- ing special structure: Condition 4.1. A matrix A ∈ Rn×d satisfies this condition if A = BC for B ∈ Rn×m and C ∈ Rm×d where 246 Anchor-NMF Input: A ∈ Rn×d satisfying Condition 4.1 with A = BC for some B ∈ Rn×m and C ∈ Rm×d, value m • For h = 1 . . .m, find a vertex ah as U ← Gram-Schmidt({Aal}h−1l=1 ) ah ← arg max x∈[n] ∣∣∣∣Ax −UU>Ax ∣∣∣∣ 2 where Gram-Schmidt({Aal}h−1l=1 ) is the Gram- Schmidt process that orthonormalizes {Aal}h−1l=1 . • For x = 1 . . .n, recover the x-th row of B as Bx ← arg min u∈∆m−1 ∣∣∣∣∣ ∣∣∣∣∣Ax − m∑ h=1 uhAah ∣∣∣∣∣ ∣∣∣∣∣ 2 (1) • Set C = [Aa1 . . .Aam ]>. Output: B and C such that B>h = B > ρ(h) and Ch = Cρ(h) for some permutation ρ : [m] → [m] Figure 1: Non-negative matrix factorization algorithm of Arora et al. (2012). 1. For each x ∈ [n], Bx ∈ ∆m−1. I.e., each row of B defines a probability distribution over [m]. 2. For each h ∈ [m], there is some ah ∈ [n] such that Bah,h = 1 and Bah,h′ = 0 for all h ′ 6= h. 3. rank(C) = m. Since rank(B) = rank(C) = m (by property 2 and 3), the matrix A must have rank m. Note that by property 1, each row of A is given by a convex com- bination of the rows of C: for x ∈ [n], Ax = m∑ h=1 Bx,h ×Ch Furthermore, by property 2 each h ∈ [m] has an as- sociated row ah ∈ [n] such that Aah = Cah . These properties can be exploited to recover B and C. A concrete algorithm for factorizing a matrix sat- isfying Condition 4.1 is given in Figure 1 (Arora et al., 2013). It first identifies a1 . . .am (up to some permutation) by greedily locating the row of A furthest away from the subspace spanned by the vertices selected so far. Then it recovers each Bx as the convex coefficients required to combine Aa1 . . .Aam to yield Ax. The latter computation (1) can be achieved with any constrained optimization method; we use the Frank-Wolfe algorithm (Frank and Wolfe, 1956). See Arora et al. (2013) for a proof of the correctness of this algorithm. 4.2 Random Variables To derive our algorithm, we make use of cer- tain random variables under the A-HMM. Let (X1, . . . ,XN ) ∈ [n]N be a random sequence of N ≥ 2 observations drawn from the model, along with the corresponding hidden state sequence (H1, . . . ,HN ) ∈ [m]N ; independently, pick a posi- tion I ∈ [N − 1] uniformly at random. Let YI ∈ Rd be a d-dimensional vector which is conditionally in- dependent of XI given HI, i.e., P(YI|HI,XI) = P(YI|HI). We will provide how to define such a variable in Section 4.4.1: YI will be a function of (X1, . . . ,XN ) serving as a “context” representation of XI revealing the hidden state HI. Define unigram probabilities u∞,u1 ∈ Rn and bigram probabilities B ∈ Rn×n where u∞x := P(XI = x) ∀x ∈ [n] u1x := P(XI = x|I = 1) ∀x ∈ [n] Bx,x′ := P(XI = x,XI+1 = x′) ∀x,x′ ∈ [n] Additionally, define π̄ ∈ Rm where π̄h = P(HI = h) ∀h ∈ [m] (2) We assume π̄h > 0 for all h ∈ [m]. 4.3 Derivation of a Learning Algorithm The following proposition provides a way to use the NMF algorithm in Figure 1 to recover the emission parameters O up to scaling. Proposition 4.1. Let XI ∈ [n] and YI ∈ Rd be respectively an observation and a context vector drawn from the random process described in Sec- tion 4.2. Define a matrix Ω ∈ Rn×d with rows Ωx = E[YI|XI = x] ∀x ∈ [n] (3) If rank(Ω) = m, then Ω satisfies Condition 4.1: Ω = ÕΘ 247 where Õx,h = P(HI = h|XI = x) and Θh = E[YI|HI = h]. Proof. 
E[YI|XI = x] = m∑ h=1 P(HI = h|XI = x) × E[YI|HI = h,XI = x] = m∑ h=1 P(HI = h|XI = x) × E[YI|HI = h] The last equality follows by the conditional inde- pendence of YI. This shows Ω = ÕΘ. By the an- chor assumption of the A-HMM, each h ∈ [m] has at least one x ∈ A(h) such that P(HI = h|XI = x) = 1, thus Ω satisfies Condition 4.1. A useful interpretation of Ω in Proposition 4.1 is that its rows Ω1 . . . Ωn are d-dimensional vec- tor representations of observation states forming a convex hull in Rd. This convex hull has m ver- tices Ωa1 . . . Ωam corresponding to anchors ah ∈ A(h) which can be convexly combined to realize all Ω1 . . . Ωn. Given Õ, we can recover the A-HMM parameters as follows. First, we recover the stationary state dis- tribution π̄ as: π̄h = ∑ x∈[n] P(HI = h|XI = x) ×P(XI = x) = ∑ x∈[n] Õx,h ×u∞x The emission parameters O are given by Bayes’ the- orem: Ox,h = P(HI = h|XI = x) ×P(XI = x)∑ x∈[n] P(HI = h|XI = x) ×P(XI = x) = Õx,h ×u∞x π̄h Using the fact that the emission probabilities are position-independent, we see that the initial state distribution π satisfies u1 = Oπ: u1x = P(XI = x|I = 1) = ∑ h∈[m] P(XI = x|HI = h,I = 1) ×P(HI = h|I = 1) = ∑ h∈[m] Ox,h ×πh Learn-Anchor-HMM Input: Ω in Proposition 4.1, number of hidden states m, bigram probabilities B, unigram probabilities u∞,u1 • Compute (Õ, Θ) ← Anchor-NMF(Ω,m). • Recover the parameters: π̄ ← Õ>u∞ (4) O ← diag(π̄)−1diag(u∞)Õ (5) π = O+u1 (6) T ← (diag(π̄)−1O+B(O>)+)> (7) Output: A-HMM parameters (π,T,O) Figure 2: NMF-based learning algorithm for A-HMMs. The algorithm Anchor-NMF is given in Figure 1. Thus π can be recovered as π = O+u1 where O+ is the Moore–Penrose pseudoinverse of O. Fi- nally, it can be algebraically verified that B = Odiag(π̄)T>O> (Hsu et al., 2012). Since all the in- volved matrices have rank m, we can directly solve for T as T = (diag(π̄)−1O+B(O>)+)> Figure 2 shows the complete algorithm. As input, it receives a matrix Ω satisfying Proposition 4.1, the number of hidden states, and the probabilities of ob- served unigrams and bigrams. It first decomposes Ω using the NMF algorithm in Figure 1. Then it computes the A-HMM parameters whose solution is given analytically. The following theorem guarantees the consis- tency of the algorithm. Theorem 4.1. Let (π,T,O) be an A-HMM such that rank(T) = m and π̄ defined in (2) has strictly pos- itive entries π̄h > 0. Given random variables Ω satisfying Proposition 4.1 and B,u∞,u1 under this model, the algorithm Learn-Anchor-HMM in Fig- ure 2 outputs (π,T,O) up to a permutation on hid- den states. Proof. By Proposition 4.1, Ω satisfies Condition 4.1 with Ω = ÕΘ, thus Õ can be recovered up to a permutation on columns with the algorithm Anchor- NMF. The consistency of the recovered parameters follows from the correctness of (4–7) under the rank conditions. 248 4.3.1 Constrained Optimization for π and T Note that (6) and (7) require computing the pseu- doinverse of the estimated O, which can be expen- sive and vulnerable to sampling errors in practice. To make our parameter estimation more robust, we can explicitly impose probability constraints. We re- cover π by solving: π = arg min π′∈∆m−1 ∣∣∣∣u1 −Oπ′ ∣∣∣∣ 2 (8) which can again be done with algorithms such as Frank-Wolfe. We recover T by maximizing the log likelihood of observation bigrams ∑ x,x′ Bx,x′ log   ∑ h,h′∈[m] π̄hOx,hTh′,hOx′,h′   (9) subject to the constraint (T>)h ∈ ∆m−1. Since (9) is concave in T with other parameters O and π̄ fixed, we can use EM to find the global optimum. 
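To make the estimation procedure concrete, here is a compact NumPy sketch (not the authors' released implementation) of the algorithms in Figures 1 and 2: Anchor-NMF with a plain Frank-Wolfe inner loop for the simplex step (1), followed by the closed-form recovery (4)-(7). The emission update is written element-wise as O_{x,h} = Õ_{x,h} u∞_x / π̄_h, matching the Bayes-rule derivation above, and the simpler pseudoinverse forms are used in place of the constrained optimization of Section 4.3.1; the function names are illustrative only.

```python
import numpy as np

def anchor_nmf(A, m, fw_iters=200):
    """Figure 1: greedily pick m anchor rows of A, then write every row of A as a
    convex combination of the anchor rows (Frank-Wolfe for the simplex step (1))."""
    n, d = A.shape
    anchors, basis = [], np.zeros((0, d))    # basis: orthonormal rows spanning chosen anchors
    for _ in range(m):
        residual = A - A @ basis.T @ basis   # distance of each row from the current span
        a = int(np.argmax(np.linalg.norm(residual, axis=1)))
        anchors.append(a)
        v = A[a] - basis.T @ (basis @ A[a])  # Gram-Schmidt step
        basis = np.vstack([basis, v / np.linalg.norm(v)])
    C = A[anchors]                           # m x d matrix of anchor rows
    B = np.zeros((n, m))
    for x in range(n):                       # recover B_x on the probability simplex
        u = np.full(m, 1.0 / m)
        for k in range(fw_iters):
            grad = C @ (C.T @ u - A[x])
            s = np.zeros(m)
            s[int(np.argmin(grad))] = 1.0
            u += 2.0 / (k + 2) * (s - u)
        B[x] = u
    return B, C, anchors

def learn_anchor_hmm(Omega, m, Bigram, u_inf, u_1):
    """Figure 2: recover (pi, T, O) from Omega and the unigram/bigram probabilities.
    Uses the pseudoinverse forms (6)-(7); the constrained versions (8)-(9) can be
    substituted for robustness as described in Section 4.3.1."""
    O_tilde, _, anchors = anchor_nmf(Omega, m)
    pi_bar = O_tilde.T @ u_inf                                   # (4)
    O = (u_inf[:, None] * O_tilde) / pi_bar[None, :]             # (5): O[x,h] = u_inf[x] O_tilde[x,h] / pi_bar[h]
    O_pinv = np.linalg.pinv(O)
    pi = O_pinv @ u_1                                            # (6)
    T = (np.diag(1.0 / pi_bar) @ O_pinv @ Bigram @ np.linalg.pinv(O.T)).T  # (7)
    return pi, T, O, anchors
```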
4.4 Construction of the Convex Hull Ω In this section, we provide several ways to construct a convex hull Ω satisfying Proposition 4.1. 4.4.1 Choice of the Context YI In order to satisfy Proposition 4.1, we need to de- fine the context variable YI ∈ Rd with two proper- ties: • P(YI|HI,XI) = P(YI|HI) • The matrix Ω with rows Ωx = E[YI|XI = x] ∀x ∈ [n] has rank m. A simple construction (Arora et al., 2013) is given by defining YI ∈ Rn to be an indicator vector for the next observation: [YI]x′ = { 1 if XI+1 = x′ 0 otherwise (10) The first condition is satisfied since XI+1 does not depend on XI given HI. For the second condition, observe that Ωx,x′ = P(XI+1 = x′|XI = x), or in matrix form Ω = diag (u∞)−1 B (11) Under the rank conditions in Theorem 4.1, (11) has rank m. More generally, we can let YI be an observation (encoded as an indicator vector as in (10)) randomly drawn from a window of L ∈ N nearby observa- tions. We can either only use the identity of the cho- sen observation (in which case YI ∈ Rn) or addi- tionally indicate the relative position in the window (in which case YI ∈ RnL). It is straightforward to verify that the above two conditions are satisfied un- der these definitions. Clearly, (11) is a special case with L = 1. 4.4.2 Reducing the Dimension of Ωx With the definition of Ω in the previous section, the dimension of Ωx is d = O(n) which can be dif- ficult to work with when n � m. Proposition 4.1 allows us to reduce the dimension as long as the fi- nal matrix retains the form in (3) and has rank m. In particular, we can multiply Ω by any rank-m projec- tion matrix Π ∈ Rd×m on the right side: if Ω sat- isfies the properties in Proposition 4.1, then so does ΩΠ with m-dimensional rows (ΩΠ)x = E[YIΠ|XI = x] Since rank(Ω) = m, a natural choice of Π is the projection onto the best-fit m-dimensional subspace of the row space of Ω. We mention that previous works on the NMF- learning framework have employed various projec- tion methods, but they do not examine relative mer- its of their choices. For instance, Arora et al. (2013) simply use random projection, which is convenient for theoretical analysis. Cohen and Collins (2014) use a projection based on canonical correlation anal- ysis (CCA) without further exploration. In con- trast, we give a full comparison of valid construc- tion methods and find that the choice of Ω is crucial in practice. 4.4.3 Construction of Ω for the Brown Model We can formulate an alternative way to construct a valid Ω when the model is further restricted to be a Brown model. Since every observation is an anchor, Ox ∈ Rm has a single nonzero entry for every x. Thus the rows defined by Ωx = Ox/ ||Ox|| (an in- dicator vector for the unique hidden state of x) form 249 Input: bigram probabilities B, unigram probabilities u∞, number of hidden states m, construction method τ Scaled Matrices: ( √ · is element-wise) B := diag (u∞)−1/2 Bdiag (u∞)−1/2 B̃ := diag (√ u∞ )−1/2 √ Bdiag (√ u∞ )−1/2 Singular Vectors: U(M) (V (M)) is an n×m matrix of the left (right) singular vectors of M corresponding to the largest m singular values • If τ 6= brown: set Ω ← diag (u∞)−1 BΠ where the projection matrix Π ∈ Rn×m is given by Πi,j ∼N(0, 1/m) if τ = random Π = V (diag (u∞)−1 B) if τ = best-fit Π = diag (u∞)−1/2 V (B) if τ = cca • If τ = brown: compute the transformed emission matrix as f(O) = U(B̃) and set Ω ← diag(v)−1f(O) where vx := ||f(O)x||2 is the length of the x-th row of f(O). 
Output: Ω ∈ Rn×m in Proposition 4.1 Figure 3: Algorithm for constructing a valid Ω with dif- ferent construction methods. For simplicity, we only show the bigram construction (context size L = 1), but an extension for larger context (L > 1) is straightforward. a trivial convex hull in which every point is a ver- tex. This corresponds to choosing an oracle context YI ∈ Rm where [YI]h = { 1 if HI = h 0 otherwise It is possible to recover the Brown model param- eters O up to element-wise scaling and rotation of rows using the algorithm of Stratos et al. (2015). More specifically, let f(O) ∈ Rn×m denote the out- put of their algorithm. Then they show that for some vector s ∈ Rm with strictly positive entries and an orthogonal matrix Q ∈ Rm×m: f(O) = O〈1/4〉diag(s)Q> where O〈1/4〉 is an element-wise exponentiation of O by 1/4. Since the rows of f(O) are simply some scaling and rotation of the rows of O, using Ωx = f(O)x/ ||f(O)x|| yields a valid Ω. While we need to impose an additional assump- tion (the Brown model restriction) in order to justify this choice of Ω, we find in our experiments that it performs better than other alternatives. We specu- late that this is because a Brown model is rather ap- propriate for the POS tagging task; many words are indeed unambiguous with respect to POS tags (Ta- ble 5). Also, the general effectiveness of f(O) for representational purposes has been demostrated in previous works (Stratos et al., 2014; Stratos et al., 2015). By restricting the A-HMM to be a Brown model, we can piggyback on the proven effective- ness of f(O). Figure 3 shows an algorithm for constructing Ω with these different construction methods. For sim- plicity, we only show the bigram construction (con- text size L = 1), but an extension for larger context (L > 1) is straightforward as discussed earlier. The construction methods random (random projection), best-fit (projection to the best-fit subspace), and cca (CCA projection) all compute (11) and differ only in how the dimension is reduced. The construction method brown computes the transformed Brown pa- rameters f(O) as the left singular vectors of a scaled covariance matrix and then normalizes its rows. We direct the reader to Stratos et al. (2015) for a deriva- tion of this calculation. 4.4.4 Ω with Feature Augmentation The x-th row of Ω is a d-dimensional vector repre- sentation of x lying in a convex set with m vertices. This suggests a natural way to incorporate domain- specific features: we can add additional dimensions that provide information about hidden states from the surface form of x. For instance, consider the the POS tagging task. In the simple construction (11), the representation of word x is defined in terms of neighboring words x′: [Ωx]x′ = E [ 1 ( XI+1 = x ′) |XI = x ] where 1(·) ∈ {0, 1} is the indicator function. We can augment this vector with s additional dimen- 250 sions indicating the spelling features of x. For in- stance, the (n + 1)-th dimension may be defined as: [Ωx]n+1 = E [1 (x ends in “ing” ) |XI = x] This value will be generally large for verbs and small for non-verbs, nudging verbs closer together and away from non-verbs. The modified (n + s)- dimensional representation is followed by the usual dimension reduction. Note that the spelling features are a deterministic function of a word, and we are implicitly assuming that they are independent of the word given its tag. While this is of course not true in practice, we find that these features can significantly boost the tagging performance. 
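The sketch below (again not the paper's code release) instantiates the bigram (L = 1) constructions of Figure 3 in NumPy; the function name construct_omega and its arguments are invented for illustration. The feature augmentation of Section 4.4.4 would correspond to appending s extra, down-weighted columns to the representation before the dimension-reduction step.

```python
import numpy as np

def construct_omega(Bigram, u_inf, m, method="brown", rng=np.random.default_rng(0)):
    """Figure 3 (bigram construction, L = 1): build an n x m convex hull Omega from
    the bigram matrix B and the unigram vector u_inf with one of four methods."""
    n = len(u_inf)
    if method != "brown":
        base = np.diag(1.0 / u_inf) @ Bigram                 # Omega = diag(u_inf)^-1 B, eq. (11)
        if method == "random":
            Pi = rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))
        elif method == "best-fit":
            Pi = np.linalg.svd(base)[2][:m].T                # top-m right singular vectors of base
        else:                                                # "cca"
            scaled = np.diag(u_inf ** -0.5) @ Bigram @ np.diag(u_inf ** -0.5)
            Pi = np.diag(u_inf ** -0.5) @ np.linalg.svd(scaled)[2][:m].T
        return base @ Pi
    # "brown": top-m left singular vectors of the element-wise-sqrt scaled bigram
    # matrix (the transformed emission matrix f(O)), with rows normalized to unit length.
    scaled = np.diag(u_inf ** -0.25) @ np.sqrt(Bigram) @ np.diag(u_inf ** -0.25)
    fO = np.linalg.svd(scaled)[0][:, :m]
    return fO / np.linalg.norm(fO, axis=1, keepdims=True)
```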
5 Experiments We evaluate our A-HMM learning algorithm on the task of unsupervised POS tagging. The goal of this task is to induce the correct sequence of POS tags (hidden states) given a sequence of words (observa- tion states). The anchor condition corresponds to as- suming that each POS tag has at least one word that occurs only under that tag. 5.1 Background on Unsupervised POS Tagging Unsupervised POS tagging has long been an active area of research (Smith and Eisner, 2005a; John- son, 2007; Toutanova and Johnson, 2007; Haghighi and Klein, 2006; Berg-Kirkpatrick et al., 2010), but results on this task are complicated by vary- ing assumptions and unclear evaluation metrics (Christodoulopoulos et al., 2010). Rather than ad- dressing multiple alternatives for evaluating unsu- pervised POS tagging, we focus on a simple and widely used metric: many-to-one accuracy (i.e., we map each hidden state to the most frequently coin- ciding POS tag in the labeled data and compute the resulting accuracy). 5.1.1 Better Model v.s. Better Learning Vanilla HMMs are notorious for their mediocre performance on this task, and it is well known that they perform poorly largely because of model mis- specification, not because of suboptimal parameter estimation (e.g., because EM gets stuck in local op- tima). More generally, a large body of work points to the inappropriateness of simple generative mod- els for unsupervised induction of linguistic structure (Merialdo, 1994; Smith and Eisner, 2005b; Liang and Klein, 2008). Consequently, many works focus on using more expressive models such as log-linear models (Smith and Eisner, 2005a; Berg-Kirkpatrick et al., 2010) and Markov random fields (MRF) (Haghighi and Klein, 2006). These models are shown to deliver good performance even though learning is approxi- mate. Thus one may question the value of having a consistent estimator for A-HMMs and Brown mod- els in this work: if the model is wrong, what is the point of learning it accurately? However, there is also ample evidence that HMMs are competitive for unsupervised POS induc- tion when they incorporate domain-specific struc- tures. Johnson (2007) is able to outperform the sophisticated MRF model of Haghighi and Klein (2006) on one-to-one accuracy by using a sparse prior in HMM estimation. The clustering method of Brown et al. (1992) which is based on optimizing the likelihood under the Brown model (a special case of HMM) remains a baseline difficult to outperform (Christodoulopoulos et al., 2010). We add to this evidence by demonstrating the ef- fectiveness of A-HMMs on this task. We also check the anchor assumption on data and show that the A- HMM model structure is in fact appropriate for the problem (Table 5). 5.2 Experimental Setting We use the universal treebank dataset (version 2.0) which contains sentences annotated with 12 POS tag types for 10 languages (McDonald et al., 2013). All models are trained with 12 hidden states. We use the English portion to experiment with different hy- perparameter configurations. At test time, we fix a configuration (based on the English portion) and ap- ply it across all languages. The list of compared methods is given below: BW The Baum-Welch algorithm, an EM algorithm for HMMs (Baum and Petrie, 1966). CLUSTER A parameter estimation scheme for HMMs based on Brown clustering (Brown et al., 1992). We run the Brown clustering algorithm1 to obtain 12 word clusters C1 . . .C12. Then we set 1We use the implementation of Liang (2005). 
251 the emission parameters o(x|h), transition param- eters t(h′|h), and prior π(h) to be the maximum- likelihood estimates under the fixed clusters. ANCHOR Our algorithm Learn-Anchor-HMM in Figure 2 but with the constrained optimization (8) and (9) for estimating π and T .2 ANCHOR-FEATURES Same as ANCHOR but em- ploys the feature augmentation scheme described in Section 4.4.4. LOG-LINEAR The unsupervised log-linear model described in Berg-Kirkpatrick et al. (2010). Instead of emission parameters o(x|h), the model maintains a miniature log-linear model with a weight vector w and a feature function φ. The probability of a word x given tag h is computed as p(x|h) = exp(w >φ(x,h))∑ x∈[n] exp(w >φ(x,h)) The model can be trained by maximizing the like- lihood of observed sequences. We use L-BFGS to directly optimize this objective.3 This approach ob- tains the current state-of-the-art accuracy on fine- grained (45 tags) English WSJ dataset. We use maximum marginal decoding for HMM predictions: i.e., at each position, we predict the most likely tag given the entire sentence. 5.3 Practical Issues with the Anchor Algorithm In our experiments, we find that Anchor-NMF (Fig- ure 1) tends to propose extremely rare words as an- chors. A simple fix is to search for anchors only among relatively frequent words. We find that any reasonable frequency threshold works well; we use the 300 most frequent words. Note that this is not a problem if these 300 words include anchor words corresponding to all the 12 tags. We must define the context for constructing Ω. We use the previous and next words (i.e., context size L = 2) marked with relative positions. Thus Ω has 2n columns before dimension reduction. Table 1 shows the performance on the English portion with different construction methods for Ω. The Brown 2https://github.com/karlstratos/anchor 3We use the implementation of Berg-Kirkpatrick et al. (2010) (personal communication). Choice of Ω Accuracy Random 48.2 Best-Fit 53.4 CCA 57.0 Brown 66.1 Table 1: Many-to-one accuracy on the English data with different choices of the convex hull Ω (Figure 3). These results do not use spelling features. construction (τ = brown in Figure 3) clearly per- forms the best: essentially, the anchor algorithm is used to extract the HMM parameters from the CCA- based word embeddings of Stratos et al. (2015). We also explore feature augmentation discussed in Section 4.4.4. For comparison, we employ the same word features used by Berg-Kirkpatrick et al. (2010): • Indicators for whether a word is capitalized, contains a hyphen, or contains a digit • Suffixes of length 1, 2, and 3 We weigh the l2 norm of these extra dimensions in relation to the original dimensions: we find a small weight (e.g., 0.1 of the norm of the original dimensions) works well. We also find that these fea- tures can sometimes significantly improve the per- formance. For instance, the accuracy on the English portion can be improved from 66.1% to 71.4% with feature augmentation. Another natural experiment is to refine the HMM parameters obtained from the anchor algorithm (or Brown clusters) with a few iterations of the Baum- Welch algorithm. In our experiments, however, it did not significantly improve the tagging perfor- mance, so we omit this result. 5.4 Tagging Accuracy Table 2 shows the many-to-one accuracy on all lan- guages in the dataset. 
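Since all results below are reported in many-to-one accuracy, the following short sketch (not the evaluation code used in the paper) makes the state-to-tag mapping of Section 5.1 explicit; the function name is illustrative.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(predicted_states, gold_tags):
    """Map each hidden state to the gold POS tag it most frequently coincides
    with, then score the resulting relabeling against the gold tags."""
    counts = defaultdict(Counter)
    for state, tag in zip(predicted_states, gold_tags):
        counts[state][tag] += 1
    mapping = {state: tag_counts.most_common(1)[0][0]
               for state, tag_counts in counts.items()}
    correct = sum(mapping[s] == t for s, t in zip(predicted_states, gold_tags))
    return correct / len(gold_tags)

# e.g. many_to_one_accuracy([3, 3, 7, 1], ["DET", "DET", "NOUN", "VERB"]) == 1.0
```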
For the Baum-Welch algo- rithm and the unsupervised log-linear models, we report the mean and the standard deviation (in paren- theses) of 10 random restarts run for 1,000 itera- tions. Both ANCHOR and ANCHOR-FEATURES compete favorably. On 5 out of 10 languages, ANCHOR- FEATURES achieves the highest accuracy, often 252 Model de en es fr id it ja ko pt-br sv BW (4.8) 45.5 (3.4) 59.8 (2.2) 60.6 (3.6) 60.1 (3.1) 49.6 (2.6) 51.5 (2.1) 59.5 (0.6) 51.7 (3.7) 59.5 (3.0) 42.4 CLUSTER 60.0 62.9 67.4 66.4 59.3 66.1 60.3 47.5 67.4 61.9 ANCHOR 61.1 66.1 69.0 68.2 63.7 60.4 65.3 53.8 64.9 51.1 ANCHOR-FEATURES 63.4 71.4 74.3 71.9 67.3 60.2 69.4 61.8 65.8 61.0 LOG-LINEAR (1.8) 67.5 (3.5) 62.4 (3.1) 67.1 (4.5) 62.1 (3.9) 61.3 (2.9) 52.9 (2.9) 78.2 (3.6) 60.5 (2.2) 63.2 (2.5) 56.7 Table 2: Many-to-one accuracy on each language using 12 universal tags. The first four models are HMMs estimated with the Baum-Welch algorithm (BW), the clustering algorithm of Brown et al. (1992), the anchor algorithm without (ANCHOR) and with (ANCHOR-FEATURES) feature augmentation. LOG-LINEAR is the model of Berg-Kirkpatrick et al. (2010) trained with the direct-gradient method using L-BFGS. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations. closely followed by ANCHOR. The Brown clustering estimation is also competitive and has the highest accuracy on 3 languages. Not surprisingly, vanilla HMMs trained with BW perform the worst (see Sec- tion 5.1.1 for a discussion). LOG-LINEAR is a robust baseline and performs the best on the remaining 2 languages. It per- forms especially strongly on Japanese and Korean datasets in which poorly segmented strings such as “1950年11月5日には” (on November 5, 1950) and “40.3%로” (by 40.3%) abound. In these datasets, it is crucial to make effective use of morphological features. 5.5 Qualitative Analysis 5.5.1 A-HMM Parameters An A-HMM can be easily interpreted since each hidden state is marked with an anchor observation. Table 3 shows the 12 anchors found in each lan- guage. Note that these anchor words generally have a wide coverage of possible POS tags. We also experimented with using true anchor words (obtained from labeled data), but they did not improve performance over automatically induced anchors. Since anchor discovery is inherently tied to parameter estimation, it is better to obtain anchors in a data-driven manner. In particular, certain POS tags (e.g., X) appear quite infrequently, and the model is worse off by being forced to allocate a hidden state for such a tag. Table 4 shows words with highest emission prob- abilities o(x|h) under each anchor. We observe that an anchor is representative of a certain group of words. For instance, the state “loss” rep- resents noun-like words, “1” represents numbers, “on” represents preposition-like words, “one” rep- resents determiner-like words, and “closed” repre- sents verb-like words. The conditional distribution is peaked for anchors that represent function tags (e.g., determiners, punctuation) and flat for anchors that represent content tags (e.g., nouns). Occasion- ally, an anchor assigns high probabilities to words that do not seem to belong to the corresponding POS tag. But this is to be expected since o(x|h) ∝ P(XI = x) is generally larger for frequent words. 5.5.2 Model Assumptions on Data Table 5 checks the assumptions in A-HMMs and Brown models on the universal treebank dataset. 
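A check of this kind can be reproduced directly from any tagged corpus; the sketch below (not the authors' code) computes the fraction of anchored tags and of unambiguous word types, where counting word types rather than tokens is an assumption of this sketch.

```python
from collections import defaultdict

def check_assumptions(tagged_corpus):
    """Given a corpus as a list of (word, tag) pairs, report (i) the percentage of
    tags with at least one anchor word (a word type seen only under that tag) and
    (ii) the percentage of word types that are unambiguous (seen under one tag)."""
    tags_of_word = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_of_word[word].add(tag)
    all_tags = {tag for _, tag in tagged_corpus}
    anchored_tags = {next(iter(ts)) for ts in tags_of_word.values() if len(ts) == 1}
    pct_anchored = 100.0 * len(anchored_tags) / len(all_tags)
    pct_unambiguous = 100.0 * sum(len(ts) == 1 for ts in tags_of_word.values()) / len(tags_of_word)
    return pct_anchored, pct_unambiguous
```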
The anchor assumption is indeed satisfied with 12 universal tags: in every language, each tag has at least one word uniquely associated with the tag. The Brown assumption (each word has exactly one possible tag) is of course not satisfied, since some words are genuinely ambiguous with respect to their POS tags. However, the percentage of unambigu- ous words is very high (well over 90%). This analy- sis supports that the model assumptions made by A- HMMs and Brown models are appropriate for POS tagging. Table 6 reports the log likelihood (normalized by the number of words) on the English portion of different estimation methods for HMMs. BW and CLUSTER obtain higher likelihood than the anchor algorithm, but this is expected given that both EM 253 de en es fr id it ja ko pt-br sv empfehlen loss y avait bulan radar お世話に 완전 E och wie 1 hizo commune tetapi però ないと 중에 de bör ; on - Le wilayah sulle ことにより 경우 partida grund Sein one especie de - - されている。 줄 fazer mellan Berlin closed Además président Bagaimana Stati ものを 같아요 meses i und are el qui , Lo , 많은 os sociala , take paı́ses ( sama legge した , : . - , la à . al それは 볼 diretor bli der vice España États dan far- 、 자신의 2010 den im to en Unis Utara di 幸福の 받고 , , des York de Cette pada la ことが 맛있는 uma tid Region Japan municipio quelques yang art. 通常の 위한 O Detta Table 3: Anchor words found in each language (model ANCHOR-FEATURES). loss year (.02) market (.01) share (.01) company (.01) stock (.01) quarter (.01) shares (.01) price (.01) 1 1 (.03) 10 (.02) 30 (.02) 15 (.02) 8 (.02) 2 (.01) 20 (.01) 50 (.01) on of (.14) in (.12) . (.08) for (.06) on (.04) by (.04) from (.04) and (.03) one the (.23) a (.12) “ (.03) an (.03) $ (.02) its (.02) that (.02) this (.02) closed said (.05) ’s (.02) is (.02) says (.02) was (.01) has (.01) had (.01) expected (.01) are and (.08) is (.08) are (.05) was (.04) ’s (.04) “ (.04) has (.03) of (.03) take be (.04) % (.02) have (.02) million (.02) But (.02) do (.01) The (.01) make (.01) , , (.53) . (.25) and (.05) ” (.04) % (.01) million (.01) – (.01) that (.01) vice ’s (.03) The (.02) “ (.02) New (.01) and (.01) new (.01) first (.01) chief (.01) to to (.39) . (.11) a (.06) will (.04) $ (.03) n’t (.03) would (.02) % (.02) York the (.15) a (.05) The (.04) of (.04) ’s (.04) million (.01) % (.01) its (.01) Japan Mr. (.03) it (.02) ” (.02) $ (.02) he (.02) that (.02) which (.01) company (.01) Table 4: Most likely words under each anchor word (English model ANCHOR-FEATURES). Emission probabilities o(x|h) are given in parentheses. and Brown clustering directly optimize likelihood. In contrast, the anchor algorithm is based on the method of moments and does not (at least directly) optimize likelihood. Note that high likelihood does not imply high accuracy under HMMs. 6 Related Work 6.1 Latent-Variable Models There has recently been great progress in estima- tion of models with latent variables. Despite the NP-hardness in general cases (Terwijn, 2002; Arora et al., 2012), many algorithms with strong theoreti- cal guarantees have emerged under natural assump- tions. For example, for HMMs with full-rank condi- tions, Hsu et al. (2012) derive a consistent estimator of the marginal distribution of observed sequences. Anandkumar et al. (2014) propose an exact tensor decomposition method for learning a wide class of latent variable models with similar non-degeneracy conditions. Arora et al. (2013) derive a provably cor- rect learning algorithm for topic models with a cer- tain parameter structure. 
The anchor-based framework has been originally formulated for learning topic models (Arora et al., 2013). It has been subsequently adopted to learn other models such as latent-variable probabilistic context-free grammars (Cohen and Collins, 2014). In our work, we have extended this framework to address unsupervised sequence labeling. Zhou et al. (2014) also extend Arora et al. (2013)’s framework to learn various models includ- ing HMMs, but they address a more general prob- lem. Consequently, their algorithm draws from Anandkumar et al. (2012) and is substantially dif- ferent from ours. 6.2 Unsupervised POS Tagging Unsupervised POS tagging is a classic problem in unsupervised learning that has been tackled with various approaches. Johnson (2007) observes that 254 de en es fr id it ja ko pt-br sv % anchored tags 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 % unambig words 96.6 91.5 94.0 94.2 94.8 94.8 99.5 98.4 94.8 97.4 . VERB PRON ADP NOUN ADV CONJ DET NUM ADJ X PRT , said it from Mr. n’t or which billion new bono na 53928 6339 5147 4856 4436 3582 2748 2458 1935 1542 8 5 Table 5: Verifying model assumptions on the universal treebank. The anchor assumption is satisfied in every language. The Brown assumption (each word has exactly one possible tag) is violated but not by a large margin. The lower table shows the most frequent anchor word and its count under each tag on the English portion. Model Normalized LL Acc BW -6.45 59.8 CLUSTER -6.71 62.9 ANCHOR -7.06 66.1 ANCHOR-FEATURES -7.05 71.4 Table 6: Log likelihood normalized by the number of words on English (along with accuracy). For BW, we report the mean of 10 random restarts run for 1,000 it- erations. EM performs poorly in this task because it induces flat distributions; this is not the case with our algo- rithm as seen in the peaky distributions in Table 4. Haghighi and Klein (2006) assume a set of proto- typical words for each tag and report high accuracy. In contrast, our algorithm automatically finds such prototypes in a subroutine. Berg-Kirkpatrick et al. (2010) achieve the state- of-the-art result in unsupervised fine-grained POS tagging (mid-70%). As described in Section 5.2, their model is an HMM in which probabilties are given by log-linear models. Table 7 provides a point of reference comparing our work with Berg- Kirkpatrick et al. (2010) in their setting: models are trained and tested on the entire 45-tag WSJ dataset. Their model outperforms our approach in this set- ting: with fine-grained tags, spelling features be- come more important, for instance to distinguish “played” (VBD) from “play” (VBZ). Nonetheless, we have shown that our approach is competitive when universal tags are used (Table 2). Many past works on POS induction predate the introduction of the universal tagset by Petrov et al. (2012) and thus report results with fine-grained tags. More recent works adopt the universal tagset but Models Accuracy BW 62.6 (1.1) CLUSTER 65.6 ANCHOR 67.2 ANCHOR-FEATURES 67.7 LOG-LINEAR 74.9 (1.5) Table 7: Many-to-one accuracy on the English data with 45 original tags. We use the same setting as in Table 2. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations. they leverage additional resources. For instance, Das and Petrov (2011) and Täckström et al. (2013) use parallel data to project POS tags from a supervised source language. Li et al. (2012) use tag dictionar- ies built from Wiktionary. 
Thus their results are not directly comparable to ours.4 7 Conclusion We have presented an exact estimation method for learning anchor HMMs from unlabeled data. There are several directions for future work. An important direction is to extend the method to a richer family of models such as log-linear models or neural net- works. Another direction is to further generalize the method to handle a wider class of HMMs by relax- ing the anchor condition (Condition 4.1). This will require a significant extension of the NMF algorithm in Figure 1. 4Das and Petrov (2011) conduct unsupervised experiments using the model of Berg-Kirkpatrick et al. (2010), but their dataset and evaluation method differ from ours. 255 Acknowledgments We thank Taylor Berg-Kirkpatrick for providing the implementation of Berg-Kirkpatrick et al. (2010). We also thank anonymous reviewers for their con- structive comments. References Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. 2012. A method of moments for mixture models and hidden Markov models. In Twenty-Fifth Annual Conference on Learning Theory. Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Ten- sor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773– 2832. Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models–going beyond SVD. In Foun- dations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE. Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learn- ing (ICML-13), pages 280–288. Leonard E. Baum and Ted Petrie. 1966. Statisti- cal inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statis- tics, 37(6):1554–1563. Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsu- pervised learning with features. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Com- putational Linguistics, pages 582–590. Association for Computational Linguistics. Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vin- cent J. Della Pietra, and Jenifer C. Lai. 1992. Class- based n-gram models of natural language. Computa- tional Linguistics, 18(4):467–479. Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceed- ings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Asso- ciation for Computational Linguistics. Shay B. Cohen and Michael Collins. 2014. A provably correct learning algorithm for latent-variable PCFGs. In Proceedings of ACL. Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based pro- jections. In Proceedings of the 49th Annual Meet- ing of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 600– 609. Association for Computational Linguistics. Marguerite Frank and Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110. Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. 
In Proceedings of the main conference on Human Language Technol- ogy Conference of the North American Chapter of the Association of Computational Linguistics, pages 320– 327. Association for Computational Linguistics. Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2012. A spectral algorithm for learning hidden Markov mod- els. Journal of Computer and System Sciences, 78(5):1460–1480. Mark Johnson. 2007. Why doesn’t EM find good HMM POS-taggers? In EMNLP-CoNLL, pages 296–305. Shen Li, Joao V. Graça, and Ben Taskar. 2012. Wiki- ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1389–1398. Asso- ciation for Computational Linguistics. Percy Liang and Dan Klein. 2008. Analyzing the errors of unsupervised learning. In ACL, pages 879–887. Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology. Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach- Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Os- car Täckström, Claudia Bedini, Núria B. Castelló, and Jungmee Lee. 2013. Universal dependency annota- tion for multilingual parsing. In ACL, pages 92–97. Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171. Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Re- sources and Evaluation (LREC’12). Noah A. Smith and Jason Eisner. 2005a. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 354–362. Association for Computational Linguistics. Noah A. Smith and Jason Eisner. 2005b. Guiding un- supervised grammar induction using contrastive esti- mation. In Proc. of IJCAI Workshop on Grammatical Inference Applications, pages 73–82. 256 Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu. 2014. A spectral algorithm for learning class- based n-gram models of natural language. In Proceed- ings of the thirtieth conference on Uncertainty in Arti- ficial Intelligence. Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguis- tics and the 7th International Joint Conference on Nat- ural Language Processing (Volume 1: Long Papers), pages 1282–1291, Beijing, China, July. Association for Computational Linguistics. Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan Mc- Donald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12. Sebastiaan A. Terwijn. 2002. On the learnability of hid- den Markov models. In Grammatical Inference: Al- gorithms and Applications, pages 261–268. Springer. Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi-supervised part- of-speech tagging. In Advances in Neural Information Processing Systems, pages 1521–1528. Stephen A. Vavasis. 2009. On the complexity of nonneg- ative matrix factorization. SIAM Journal on Optimiza- tion, 20(3):1364–1377. Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. 2014. Divide-and-conquer learning by anchoring a conical hull. 
In Advances in Neural Information Processing Systems, pages 1242–1250.