Transactions of the Association for Computational Linguistics, 1 (2013) 75–88. Action Editor: Sharon Goldwater. Submitted 11/2012; Revised 1/2013; Published 3/2013. © 2013 Association for Computational Linguistics.

An HDP Model for Inducing Combinatory Categorial Grammars

Yonatan Bisk and Julia Hockenmaier
Department of Computer Science
The University of Illinois at Urbana-Champaign
201 N Goodwin Ave, Urbana, IL 61801
{bisk1,juliahmr}@illinois.edu

Abstract

We introduce a novel nonparametric Bayesian model for the induction of Combinatory Categorial Grammars from POS-tagged text. It achieves state-of-the-art performance on a number of languages and induces linguistically plausible lexicons.

1 Introduction

What grammatical representation is appropriate for unsupervised grammar induction? Initial attempts with context-free grammars (CFGs) were not very successful (Carroll and Charniak, 1992; Charniak, 1993). One reason may be that CFGs require the specification of a finite inventory of nonterminal categories and rewrite rules, but unless one adopts linguistic principles such as X-bar theory (Jackendoff, 1977), these nonterminals are essentially arbitrary labels that can be combined in arbitrary ways. While further CFG-based approaches have been proposed (Clark, 2001; Kurihara and Sato, 2004), most recent work has followed Klein and Manning (2004) in developing models for the induction of projective dependency grammars. It has been shown that more sophisticated probability models (Headden III et al., 2009; Gillenwater et al., 2011; Cohen and Smith, 2010) and learning regimes (Spitkovsky et al., 2010), as well as the incorporation of prior linguistic knowledge (Cohen and Smith, 2009; Berg-Kirkpatrick and Klein, 2010; Naseem et al., 2010), can lead to significant improvements over Klein and Manning's baseline model. The use of dependency grammars circumvents the question of how to obtain an appropriate inventory of categories, since dependency parses are simply defined by unlabeled edges between the lexical items in the sentence. But dependency grammars also make it difficult to capture non-local structures, and Blunsom and Cohn (2010) show that it may be advantageous to reformulate the underlying dependency grammar in terms of a tree-substitution grammar (TSG), which pairs words with treelets that specify the number of left and right dependents they have. In this paper, we explore yet another option: instead of dependency grammars, we use Combinatory Categorial Grammar (CCG, Steedman (1996; 2000)), a linguistically expressive formalism that pairs lexical items with rich categories that capture all language-specific information. This may seem a puzzling choice, since CCG requires a significantly larger inventory of categories than is commonly assumed for CFGs. However, unlike CFG nonterminals, CCG categories are not arbitrary symbols: they encode, and are determined by, the basic word order of the language and the number of arguments each word takes. CCG is very similar to TSG in that it also pairs lexical items with rich elementary objects that capture all language-specific information. Like TSG and projective dependency grammars, we restrict ourselves to a weakly context-free fragment of CCG. But while TSG does not distinguish between argument and modifier dependencies, CCG makes an explicit distinction between the two.
And while the elementary trees of Blunsom and Cohn (2010)'s TSG and their internal node labels have no obvious linguistic interpretation, the syntactic behavior of any CCG constituent can be directly inferred from its category. To see whether the algorithm has identified the basic syntactic properties of the language, it is hence sufficient to inspect the induced lexicon. Conversely, Boonkwan and Steedman (2011) show that knowledge of these basic syntactic properties makes it very easy to create a language-specific lexicon for accurate unsupervised CCG parsing. We have recently proposed an algorithm for inducing CCGs (Bisk and Hockenmaier, 2012b) that has been shown to be competitive with other approaches even when paired with a very simple probability model (Gelling et al., 2012). In this paper, we pair this induction algorithm with a novel nonparametric Bayesian model that is based on a different factorization of CCG derivations, and show that it outperforms our original model and many other approaches on a large number of languages. Our results indicate that the use of CCG yields grammars that are significantly more robust when dealing with longer sentences than most dependency grammar-based approaches.

2 Combinatory Categorial Grammar

Combinatory Categorial Grammar (Steedman, 2000) is a linguistically expressive, lexicalized grammar formalism that associates rich syntactic types with words and constituents. For simplicity, we restrict ourselves to the standard two atomic types S (sentences) and N (encompassing both nouns and noun phrases), from which we recursively build categories. Complex categories are of the form X/Y or X\Y, and represent functions which return a result of type X when combined with an argument of type Y. The directionality of the slash indicates whether the argument precedes or follows the functor. We write X|Y when the direction of the slash does not matter.

The CCG lexicon encodes all language-specific information. It pairs every word with a set of categories that define both its specific syntactic behavior and the overall word order of the language:

  N:           {he, girl, lunch, ...}
  N/N:         {good, the, eating, ...}
  S\N:         {sleeps, ate, eating, ...}
  (S\N)/N:     {sees, ate, ...}
  S\S:         {quickly, today, ...}
  (S\N)/(S\N): {good, the, ...}

To draw a simple contrast, in Spanish we would expect adjectives to take the category N\N, because Spanish word order dictates that the adjective follow the noun. The lexical categories also capture word-word dependencies: head-argument relations are captured by the lexical category of the head (e.g. (S\N)/N), whereas head-modifier relations are captured by the lexical category of the modifier, which is of the form X\X or X/X, and may take further arguments of its own. Our goal will be to automatically learn these types of lexicons for a language. In Figure 3, we juxtapose several such lexicons which were automatically discovered by our system.

The rules of CCG are defined by a small set of combinatory rules, which are traditionally written as schemas that define how constituents can be combined in a bottom-up fashion (although generative probability models for CCG view them in a top-down manner, akin to CFG rules). The first, and most obvious, of these rules is function application:

  X/Y  Y  ⇒  X    (B0>)
  Y  X\Y  ⇒  X    (B0<)

Here the functor X/Y or X\Y is applied to an argument Y, resulting in X.
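To make the category calculus concrete, the following minimal sketch shows one way to represent categories and the two application rules. It is illustrative Python; the class and function names are ours and do not correspond to a released implementation.

  # Minimal sketch of CCG categories and function application (B0).
  class Category:
      """An atomic category (N, S) or a complex one (X/Y, X\\Y)."""
      def __init__(self, atom=None, result=None, slash=None, arg=None):
          self.atom, self.result, self.slash, self.arg = atom, result, slash, arg

      def __eq__(self, other):
          return str(self) == str(other)

      def __str__(self):
          return self.atom if self.atom else "(%s%s%s)" % (self.result, self.slash, self.arg)

  N, S = Category(atom="N"), Category(atom="S")

  def forward_apply(left, right):
      """B0>: X/Y combined with a following Y yields X."""
      if left.slash == "/" and left.arg == right:
          return left.result
      return None

  def backward_apply(left, right):
      """B0<: Y combined with a following X\\Y yields X."""
      if right.slash == "\\" and right.arg == left:
          return right.result
      return None

  # (S\N)/N applied to N yields S\N; N combined with S\N yields S.
  tv = Category(result=Category(result=S, slash="\\", arg=N), slash="/", arg=N)
  vp = forward_apply(tv, N)     # the category S\N
  print(backward_apply(N, vp))  # prints: S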
While standard CCG has a number of additional combinatory rules (type-raising, generalized variants of composition, and substitution) that increase its generative capacity beyond context-free grammars and allow an elegant treatment of non-local dependencies arising in extraction, coordination and scrambling, we follow Bisk and Hockenmaier (2012b) and use a restricted fragment, without type-raising, that allows only basic composition and is context-free:

  X/Y  Y/Z  ⇒  X/Z    (B1>)
  X/Y  Y\Z  ⇒  X\Z    (B1X>)
  Y\Z  X\Y  ⇒  X\Z    (B1<)
  Y/Z  X\Y  ⇒  X/Z    (B1X<)

The superscript 1 denotes the arity of the composition, which is too low to recover non-projective dependencies; our grammar is thus weakly equivalent to the dependency grammar representations that are commonly used for grammar induction. The main role of composition in our fragment is that it allows sentential and verb modifiers to take categories of the form S\S and S/S. Composition introduces spurious ambiguities, which we eliminate by using Eisner (1996)'s normal form.1

1 The normal form of Hockenmaier and Bisk (2010) is not required for this fragment of CCG.

Coordinating conjunctions have a special category conj, and we binarize coordination as follows (Hockenmaier and Steedman, 2007):

  X  X[conj]  ⇒  X          (&1)
  conj  X     ⇒  X[conj]    (&2)

3 Category induction

Unlike dependency grammars, CCG requires an inventory of lexical categories. Given a set of lexical categories, the combinatory rules define the set of parses for each sentence. We follow the algorithm proposed by Bisk and Hockenmaier (2012b) to automatically induce these categories. The lexicon is initialized by pairing all nominal tags (nouns, pronouns and determiners) with the category N, all verb tags with the category S, and coordinating conjunctions with the category conj:

  CONJ                  → conj
  DET, NOUN, NUM, PRON  → N
  VERB                  → S

Although our lexicons are defined over corpus-specific POS tags, we use a slightly modified version of Petrov et al. (2012)'s Universal POS tagset to categorize them into these broad classes. The primary changes we make to their mappings are the addition of a distinction (where possible) between subordinating and coordinating conjunctions and between main and auxiliary verbs.2

2 This distinction was suggested by the authors (p.c.).

Since the initial lexicon consists only of atomic categories, it cannot parse any complex sentences:

  The  man  ate  quickly
  DT   NNS  VBD  RB
  -    N    S    -

Complex lexical categories are induced by considering the local context in which tokens appear. Given an input sentence and a current lexicon which assigns categories to at least some of the tokens in the sentence, we apply the following two rules to add new categories to the lexicon. The argument rule allows any lexical tokens that have categories other than N and conj to take immediately adjacent Ns as arguments. The modifier rule allows any token (other than coordinating conjunctions that appear in the middle of sentences) to modify an immediate neighbor that has the category S or N or is a modifier (S|S or N|N) itself.

  The  man     ate          quickly
  DT   NNS     VBD          RB
  N/N  N, S/S  S, N\N, S\N  S\S

These rules can be applied iteratively to form more complex categories; a sketch of one such pass is given below. We restrict lexical categories to a maximal arity of 2, and disallow the category (S/N)\N, since it is equivalent to (S\N)/N.
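The following sketch illustrates a single pass of these two rules over a POS-tagged sentence. It is illustrative Python with a string representation of our own choosing; the full induction procedure adds further restrictions (e.g. the arity cap and the sentence-medial conjunction check) that are only partially reflected here.

  # One pass of the argument and modifier rules; categories are strings.
  ATOMIC = {"N", "S"}
  MODIFIABLE = {"S", "N", "S/S", "S\\S", "N/N", "N\\N"}  # S, N, or an S|S / N|N modifier

  def wrap(c):
      return c if c in ATOMIC else "(%s)" % c

  def induction_pass(lexicon, tags):
      """lexicon: dict tag -> set of categories; tags: the sentence's POS tags."""
      new = {t: set() for t in lexicon}
      for i, t in enumerate(tags):
          for j in (i - 1, i + 1):                 # immediate neighbours only
              if not 0 <= j < len(tags):
                  continue
              slash = "\\" if j < i else "/"       # neighbour on the left or right
              for nb in lexicon[tags[j]]:
                  if nb == "N":                    # argument rule: take adjacent Ns
                      for c in lexicon[t]:
                          if c not in ("N", "conj"):
                              new[t].add(wrap(c) + slash + "N")
                  if nb in MODIFIABLE and "conj" not in lexicon[t]:
                      new[t].add(wrap(nb) + slash + wrap(nb))   # modifier rule
      for t in new:
          lexicon[t] |= new[t]

  lex = {"DT": {"N"}, "NNS": {"N"}, "VBD": {"S"}, "RB": set()}
  induction_pass(lex, ["DT", "NNS", "VBD", "RB"])
  print(sorted(lex["VBD"]))   # ['N\\N', 'S', 'S\\N']

Running the pass again over the enlarged lexicon yields the more complex modifier categories shown in the next example.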
  The          man           ate           quickly
  DT           NNS           VBD           RB
  N/N,         N, S/S,       S, N\N, S\N,  S\S,
  (S/S)/(S/S)  (N\N)/(N\N),  (S/S)\(S/S),  (N\N)\(N\N)
               (N/N)\(N/N)   (S\S)/(S\S)

The resulting, overly general lexicon is then used to parse the training data. Each complete parse has to be of category S or N, with the constraint that sentences that contain a main verb can only form parses of category S.

4 A new probability model for CCG

Generative models define the probability of a parse tree τ as the product of individual rule probabilities. Our previous work (Bisk and Hockenmaier, 2012b) uses the most basic model of Hockenmaier and Steedman (2002), which first generates the head direction (left, right, unary, or lexical), followed by the head category, and finally the sister category.3 This factorization does not take advantage of the unique functional nature of CCG. We therefore introduce a new factorization we call the Argument Model. It exploits the fact that CCG imposes strong constraints on a category's left and right children, since these must combine to create the parent type via one of the combinators. In practice this means that given the parent X/Z, the choice of combinator4 c, and an argument Y, we can uniquely determine the categories of the left and right children:

3 Huang et al. (2012) present a (deficient) variant and Bayesian extension of the Bisk and Hockenmaier (2012b) model without k-best smoothing that both underperform our published results.
4 If X is an atomic category, only application is possible.

  Parent  c    ⇒  Left     Right
  X/Z     B0>     (X/Z)/Y  Y
          B0<     Y        (X/Z)\Y
          B1>     X/Y      Y/Z
          B1<     Y/Z      X\Y

and correspondingly for X\Z:

  Parent  c    ⇒  Left     Right
  X\Z     B0>     (X\Z)/Y  Y
          B0<     Y        (X\Z)\Y
          B1>     X/Y      Y\Z
          B1<     Y\Z      X\Y

While type-changing and type-raising are not used in this work, the model's treatment of root productions extends easily to handle these other unary cases. We simply treat the argument Y as the unary outcome, so that the parent, combinator and argument uniquely specify every detail of the unary rule:

  Parent    c    ⇒  Y
  TOP       TOP     ∈ {S, N}
  S/(S\N)   T>      N
  S\(S/N)   T<      N

We still distinguish the same rule types as before (lexical, unary, and binary with head left or right), leading us to the following model definition. Given a parent P := X/Z and a rule type t, where type(t) ∈ {Left, Right, Unary, Lex}:

  p(t | P) × p(w | P, t)                    if t = Lex
  p(t | P) × p(Y | P, t) × p(c | P, t, Y)   otherwise

where Y is the argument and c the combinator. Note that this model generates only one CCG category but uniquely defines the two children of a parent node. We will see below that this greatly simplifies the development of nonparametric extensions.
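Concretely, the child categories can be read off the tables above. The following sketch returns them for a parent X|Z, a combinator, and an argument Y; it is illustrative Python over a string representation of our own choosing.

  # Deterministic child construction of the Argument Model.
  def wrap(c):
      return c if c in ("N", "S") else "(%s)" % c

  def children(P, X, Z, slash, comb, Y):
      """Children of parent P = X slash Z under combinator comb, argument Y."""
      if comb == "B0>":
          return wrap(P) + "/" + wrap(Y), Y                           # (X|Z)/Y   Y
      if comb == "B0<":
          return Y, wrap(P) + "\\" + wrap(Y)                          # Y   (X|Z)\Y
      if comb == "B1>":
          return wrap(X) + "/" + wrap(Y), wrap(Y) + slash + wrap(Z)   # X/Y   Y|Z
      if comb == "B1<":
          return wrap(Y) + slash + wrap(Z), wrap(X) + "\\" + wrap(Y)  # Y|Z   X\Y
      raise ValueError(comb)

  # Expansions of the parent S\N with argument N:
  print(children("S\\N", "S", "N", "\\", "B0>", "N"))   # ('(S\\N)/N', 'N')
  print(children("S\\N", "S", "N", "\\", "B1<", "N"))   # ('N\\N', 'S\\N')

Only the rule type, the argument Y, and the combinator c are generated stochastically; everything else follows deterministically, which is what keeps the parameterization of the model small.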
5 HDP-CCG: a nonparametric model

Simple generative models such as PCFGs or Bisk and Hockenmaier (2012b)'s CCG model are not robust in the face of sparsity, since they assign zero probability to any unseen event. Sparsity is a particular problem for formalisms like CCG that have a rich inventory of object types. Nonparametric Bayesian models, e.g. Dirichlet Processes (Teh, 2010), their hierarchical variants (Teh et al., 2006) and generalizations (Teh, 2006), overcome this problem in a very elegant manner, and are used by many state-of-the-art grammar induction systems (Naseem et al., 2010; Blunsom and Cohn, 2010; Boonkwan and Steedman, 2011). They also impose a rich-get-richer behavior that seems to be advantageous in many modeling applications. By contrast, Bisk and Hockenmaier (2012b) propose a weighted top-k scheme to address these issues in an ad-hoc manner.

The argument model introduced above lends itself particularly well to nonparametric extensions such as the standard Hierarchical Dirichlet Process (HDP). In this work the size of the grammar and the number of productions are fixed and small, but we present the formulation as infinite to allow for easy extension in the future. Specifically, this framework allows for extensions which grow the grammar during parsing/training or fully lexicalize the productions. Additionally, while our current work uses only a restricted fragment of CCG that has only a finite set of categories, the literature's generalized variants of composition make it possible to generate categories of unbounded arity. We therefore believe that this is a very natural probabilistic framework for CCG, since HDPs make it possible to consider a potentially infinite set of categories that can instantiate the Y slot, while allowing the model to capture language-specific preferences for the set of categories that can appear in this position.

The HDP-CCG model. In Bayesian models, multinomials are drawn from a corresponding n-dimensional Dirichlet distribution. The Dirichlet Process (DP) generalizes the Dirichlet distribution to an infinite number of possible outcomes, allowing us to deal with a potentially infinite set of categories or words. DPs are defined in terms of a base distribution H that corresponds to the mean of the DP, and a concentration or shape parameter α. In a Hierarchical Dirichlet Process (Teh et al., 2006), there is a hierarchy of DPs, such that the base distribution of a DP at level n is a DP at level n−1.

The HDP-CCG (Figure 1) is a reformulation of the Argument Model introduced above in terms of Hierarchical Dirichlet Processes.5 At the heart of the model is a distribution over CCG categories. By combining a stick-breaking process with a multinomial over categories, we can define a DP over CCG categories whose stick weights (β^Y) correspond to the frequency of the category in the corpus.

5 An alternative HDP model for semantic parsing with CCG is proposed by Kwiatkowski et al. (2012).

Figure 1: The generative story of the HDP-CCG.

  1) Draw global parameters
     Define MLE root parameter θ^TOP
     Draw top-level symbol weights β^Y ~ GEM(α^Y)
     Draw top-level lexical weights β^L ~ GEM(α^L)
     For each grammar symbol z ∈ {1, 2, ...}:
        Define MLE rule type parameters θ^T_z
        Draw argument parameters φ^Y_z ~ DP(α^Y, β^Y)
        Draw lexical emission parameters φ^L_z ~ DP(α^L, β^L)
        For each grammar symbol y ∈ {1, 2, ...}:
           Define MLE combinator parameters θ^C_{z,y}
  2) For each parse tree:
     Generate root node z^TOP ~ Binomial(θ^TOP)
     For each node i in the parse tree:
        Choose rule type t_i ~ Multinomial(θ^T_{z_i})
        If t_i == Lex:
           Emit terminal symbol x_i ~ Multinomial(φ^L_{z_i})
        If t_i == Left/Right/Unary:
           Generate argument category y_i ~ Multinomial(φ^Y_{z_i})
           Generate combinator c_i ~ Multinomial(θ^C_{z_i,y_i})
           Deterministically create z_{L(i)} (and z_{R(i)} if binary)

[The accompanying plate diagram is not reproduced here.] Because we are working with CCG, the parent z_i, argument y_i and combinator c_i uniquely define the two children categories (z_{L(i)}, z_{R(i)}); in the diagram, dashed arrows represent this deterministic process. The HDP-CCG has two base distributions, one over the space of categories and the other over words (or tags). For every grammar symbol, an argument distribution and an emission distribution are drawn from the corresponding Dirichlet Processes. In addition, there are several MLE distributions tied to a given symbol for generating rule types, combinators and lexical tokens.
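The hierarchy of Figure 1 can be illustrated with a truncated stick-breaking construction. The sketch below is illustrative Python/NumPy; the truncation level K and the forward sampling are simplifications for exposition, since the actual model is estimated variationally, as described below.

  import numpy as np

  rng = np.random.default_rng(0)

  def gem(alpha, K):
      """Truncated GEM(alpha): stick-breaking weights over K outcomes."""
      v = rng.beta(1.0, alpha, size=K)
      v[-1] = 1.0                        # close off the stick at truncation K
      stick_left = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
      return v * stick_left

  K, alpha_Y = 50, 10.0
  beta_Y = gem(alpha_Y, K)               # global weights over argument categories

  # One argument distribution per grammar symbol z, all tied through beta_Y;
  # a larger alpha_Y makes each phi_Y[z] hew more closely to the global mean.
  phi_Y = {z: rng.dirichlet(alpha_Y * beta_Y) for z in range(5)}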
Next we build the hierarchical component of our model by choosing an argument distribution (φ^Y), again over the space of categories, for every parent X/Z. This argument distribution is drawn from the previously defined base DP, allowing for an important level of parameter tying across all argument distributions. While the base DP does define the mean around which all argument distributions are drawn, we also require a notion of variance or precision which determines how similar individual draws will be. This precision is determined by the magnitude of the hyperparameter α^Y. This hierarchy is paralleled for lexical productions, which are drawn from a unigram base DP over terminal symbols controlled by α^L. For simplicity we use the same scheme for setting the values of α^L as of α^Y. We present experimental results in which we vary the value of α^Y as a function of the number of outcomes allowed by the grammar for argument categories, or by the corpus in the case of terminal symbols. Specifically, we set α^Y = n^p for conditioning contexts with n outcomes. Since Liang et al. (2009) found that the ideal value for α appears to be superlinear but subquadratic in n, we present results where p takes the values 0, 1.0, 1.5, and 2.0 to explore the range from uniform to quadratic. This setting for α is the only free parameter in the model. By controlling precision we can tell the model to what extent global corpus statistics should be trusted. We believe this has a similar effect to Bisk and Hockenmaier (2012b)'s top-k upweighting and smoothing scheme.

One advantage of the argument model is that it only requires a single distribution over categories for each binary tree. In contrast to similar proposals for CFGs (Liang et al., 2007), which impose no formal restrictions on the nonterminals X, Y, Z that can appear in a rewrite rule X → Y Z, this greatly simplifies the modeling problem (yielding effectively a model that is more akin to nonparametric HMMs), since it avoids the need to capture correlations between different base distributions for Y and Z.

Variational Inference. HDPs need to be estimated with approximate techniques. As an alternative to Gibbs sampling (Teh et al., 2006), which is exact but typically very slow and has no clear convergence criteria, variational inference algorithms (Bishop, 2006; Blei and Jordan, 2004) estimate the parameters of a truncated model to maximize a lower bound on the likelihood of the actual model. This allows for factorization of the model and a training procedure analogous to the Inside-Outside algorithm (Lari and Young, 1991), allowing training to run very quickly and in a trivially parallelizable manner.

To initialize the base DP's stick weights, we follow the example of Kurihara et al. (2007) and use an MLE model initialized with uniform distributions to compute global counts for the categories in our grammar. When normalized, these provide a better initialization than a uniform set of weights. Updates to the distributions are then performed in a coordinate descent manner which includes re-estimation of the base DPs.

In variational inference, multinomial weights W take the place of probabilities. The weights for an outcome Y with conditioning variable P are computed by summing pseudocounts with a scaled mean vector from the base DP. The computation involves moving in the direction of the gradient of the Dirichlet distribution, which results in the following difference of digammas (Ψ):

  W_P(Y) = Ψ(C(P, Y) + α_P β_Y) − Ψ(C(P, ∗) + α_P)
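A sketch of this update in code, where the count arguments stand for the expected pseudocounts collected by inside-outside and α_P = n^p as above (illustrative Python; the function and argument names are ours):

  from scipy.special import digamma

  def weight(C_PY, C_P_total, beta_Y, n_outcomes, p=1.5):
      """Log-space multinomial weight W_P(Y) via the digamma difference."""
      alpha_P = n_outcomes ** p
      return digamma(C_PY + alpha_P * beta_Y) - digamma(C_P_total + alpha_P)

  # A rare outcome is pushed below its MLE log-probability log(2/100) = -3.9:
  print(weight(2.0, 100.0, 0.01, 50))    # roughly -4.5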
Importantly, the digamma and multinomial weights comprise a rich-get-richer scheme, biasing the model against rare outcomes. In addition, since variational inference is done by coordinate descent, it is trivially parallelizable. In practice, training and testing our models on the corpora used in this paper, which contain sentences of up to length 15, takes between one minute and at most three hours on a single 12-core machine, depending on their size.

6 Evaluation

As is standard for this task, we evaluate our systems against a number of different dependency treebanks, and measure performance in terms of the accuracy of directed dependencies (i.e. the percentage of words in the test corpus that are correctly attached). We use the data from the PASCAL Challenge on Grammar Induction (Gelling et al., 2012), the data from the CoNLL-X Shared Task (Buchholz and Marsi, 2006), and Goldberg (2011)'s Hebrew corpus.

Converting CCG derivations into dependencies is mostly straightforward, since the CCG derivation identifies the root word of each sentence, and head-argument and head-modifier dependencies are easily read off of CCG derivations, since the lexicon defines them explicitly. Unlike dependency grammar, CCG is designed to recover non-local dependencies that arise in control and binding constructions as well as in wh-extraction and non-standard coordination, but since this requires re-entrancies, or co-indexation of arguments within the lexical categories that trigger these constructions (Hockenmaier and Steedman, 2007), our current system returns only local dependencies. But since dependency grammars also capture only local dependencies, this has no negative influence on our current evaluation.

However, a direct comparison between dependency treebanks and dependencies produced by CCG is more difficult (Clark and Curran, 2007), since dependency grammars allow considerable freedom in how to analyze specific constructions, such as verb clusters (which verb is the head?), prepositional phrases and particles (is the head the noun or the preposition/particle?), and subordinating conjunctions (is the conjunction a dependent of the head of the main clause and the head of the embedded clause a dependent of the conjunction, or vice versa?), and this is reflected in the fact that the treebanks we consider often apply different conventions for these cases. Although remedying this issue is beyond the scope of this work, these discrepancies very much hint at the need for a better mechanism for evaluating linguistically equivalent structures, or for treebank standardization.

The most problematic construction is coordination. In standard CCG-to-dependency schemes, both conjuncts are independent, and the conjunction itself is not attached to the dependency graph, whereas dependency grammars have to stipulate that either one of the conjuncts or the conjunction itself is the head, with multiple possibilities of where the remaining constituents attach. In addition to the standard CCG scheme, we have identified five main styles of conjunction in our data (Figure 2), although several corpora distinguish multiple types of coordinating conjunctions which use different styles (not all shown here). Since our system has explicit rules for coordination, we transform its output into the desired target representation specific to each language.

Figure 2: [Five arc diagrams over "noun conj noun", one per treebank group: Ar/Eu/Cs/Nl/WSJ/Ch/He; Da/He; Es/Bg/De/Pt; Sv/Sl; Ja. The arcs themselves are not reproduced here.] In the treebanks used for evaluation, different standards exist for annotating coordination. While not exhaustive, this figure demonstrates five of the most common schemes used in the literature. Syntactically these are identical, and traditionally CCG draws arcs only to the arguments, without attaching the conjunction. For the purposes of comparison with the literature, we have implemented these five translation schemes.
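Since the arc conventions of Figure 2 differ per treebank, this transformation amounts to a small per-language rewrite of the dependency graph. The sketch below shows one such rewrite for a convention, used here purely as an example, in which the conjunction attaches to the first conjunct and the second conjunct attaches to the conjunction (illustrative Python; the experiments use per-language rewrites following Figure 2).

  # Rewrite CCG coordination output into one target dependency convention.
  # CCG leaves the conjunction unattached and gives both conjuncts the same
  # external head; dependencies are dependent -> head index pairs.
  def rewrite_coordination(deps, conj, left, right):
      deps = dict(deps)
      deps[conj] = left        # the conjunction depends on the first conjunct
      deps[right] = conj       # the second conjunct depends on the conjunction
      return deps

  # "apples and oranges" with external head 3: CCG output attaches both
  # conjuncts (0 and 2) to 3 and leaves 'and' (1) unattached.
  print(rewrite_coordination({0: 3, 2: 3}, conj=1, left=0, right=2))
  # {0: 3, 2: 1, 1: 0}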
7 Experiments

We evaluate our system on 13 different languages. In each case, we follow the test and training regimes that were used to obtain previously published results, in order to allow a direct comparison. We compare our system to the results presented at the PASCAL Challenge on Grammar Induction (Gelling et al., 2012)6, as well as to Gillenwater et al. (2011) and Naseem et al. (2012). We use Nivre (2006)'s Penn2Malt implementation of Collins (2003)'s head rules to translate the WSJ Penn Treebank (Marcus et al., 1993) into dependencies. Finally, when training the MLE version of our model, we use a simple smoothing scheme which assigns a small rule probability (e^-15) to prevent any rule used during training from going to zero.

6 Numbers are from personal correspondence with the organizers. The previously published numbers are not comparable to the literature due to an error in the evaluation. http://wiki.cs.ox.ac.uk/InducingLinguisticStructure/ResultsDepComparable

7.1 PASCAL Challenge on Grammar Induction

In Table 1, we compare the performance of the basic Argument model (MLE), of our HDP model with four different settings of the hyperparameters (as explained above), and of the systems presented in the PASCAL Challenge on Grammar Induction (Gelling et al., 2012). The systems in this competition were instructed to train over the full dataset, including the unlabelled test data. They include Bisk and Hockenmaier (2012a)'s CCG-based system (BH) and Cohn et al. (2010)'s reimplementation of Klein and Manning (2004)'s DMV model in a tree-substitution grammar framework (BC), as well as three other dependency-based systems, which either incorporate Naseem et al. (2010)'s rules in a deterministic fashion (Søgaard, 2012), rely on extensive tuning on the development set (Tu, 2012), or incorporate millions of additional tokens from Wikipedia to estimate model parameters (Mareček and Žabokrtský, 2012). We ignore punctuation for all experiments reported in this paper, but since the training data (though not the evaluation) includes punctuation marks, participants were free to choose whether to include punctuation or ignore it.

While BH is the only other system with directly interpretable linguistic output, we also include a direct comparison with BC, whose TSG representation is equally expressive to ours. Finally, we present a row with the maximum performance among the other three models. As we have no knowledge of how much data was used in the training of other systems, we simply present results for systems trained on sentences of up to length 15 (not including punctuation) and then evaluated at lengths 10 and 15.

The MLE version of our model shows rather variable performance: although its results are particularly bad on Basque (Eu), it outperforms both BH and BC in some other settings. By contrast, the HDP system is always better than the MLE model. It outperforms all other systems on half of the corpora. On average, it outperforms BH and BC by 10.3% and 9.3% at length 10, and by 9.7% and 7.8% at length 15, respectively. The main reason why our system does not outperform BC by an even higher margin is the very obvious 11.4%/15.4% deficit on Slovene. However, the Slovene dependency treebank seems to follow a substantially different annotation scheme.
In particular, the gold-standard annotation of the 1,000 sentences in the Slovene development set treats many of them as consisting of independent sentences (often separated by punctuation marks that our system has no access to), so that the average number of roots per sentence is 2.7:

  »  verjeti  ti  ,  «  je  mehko  rekla
  "  believe  I   ,  "  is  soft   said

When our system is presented with these short components in isolation, it oftentimes analyzes them correctly, but since it has to return a tree with a single root, its performance degrades substantially.

We believe the HDP performs so well compared to the MLE model because of the influence of the shared base distribution, which allows the global category distribution to influence each of the more specific distributions. Further, it provides a very simple knob in the choice of hyperparameters, which has a substantial effect on performance. A side effect of the hyperparameters is that their strength also determines the rate of convergence. This may be one of the reasons for the high variance seen across the four settings tested, although we note that since our initialization is always uniform, and not random, consecutive runs do not introduce variance in the model's performance.

            Arabic     Danish     Slovene      Swedish    Dutch
  # Tokens  5,470      25,341     54,032       61,877     78,737
  # Tags    20         24         36           30         304
  PASCAL:
  BC        60.8/58.4  44.7/39.4  62.6/57.9    63.2/56.6  51.8/52.0
  Max       67.2/66.8  60.1/56.0  65.6/61.8    72.8/63.4  51.1/47.6
  BH        41.6/43.7  46.4/43.8  49.6/43.9    63.7/57.0  49.7/43.6
  This work:
  MLE       41.6/42.9  43.4/39.2  46.1/41.1    70.1/59.7  52.2/47.2
  HDP0.0    48.0/50.0  63.9/58.5  44.8/39.8    67.6/62.1  45.0/33.9
  HDP1.0    45.6/47.1  45.7/42.3  53.9/46.9    74.5/66.9  58.5/54.4
  HDP1.5    49.6/50.4  58.7/54.4  53.2/48.2    74.3/67.1  57.4/54.5
  HDP2.0    66.4/65.1  56.5/49.5  54.2/46.4    71.6/64.1  51.7/48.3
  +/-       -0.8/-1.7  +3.8/+2.5  -11.4/-15.4  +1.7/+3.5  +6.7/+2.4

            Basque     Portuguese  WSJ        Childes      Czech
  # Tokens  81,345     158,648     163,727    290,604      436,126
  # Tags    14         23          36         69           62
  PASCAL:
  BC        53.0/48.9  52.4/50.2   68.6/63.3  47.4/46.1    47.9/43.1
  Max       53.7/47.8  67.0/61.8   71.2/64.8  56.0/54.5    58.3/54.4
  BH        45.1/39.6  70.8/67.2   68.2/59.6  61.4/59.8    45.0/38.9
  This work:
  MLE       29.6/26.5  62.2/59.7   59.5/52.4  53.3/51.9    50.5/45.8
  HDP0.0    41.6/39.1  71.0/66.0   59.8/52.9  56.3/55.2    54.0/49.0
  HDP1.0    50.1/44.6  65.1/60.6   64.3/56.5  71.5/70.3    55.8/50.7
  HDP1.5    50.6/45.0  70.0/64.7   65.5/57.2  69.6/68.6    55.6/50.3
  HDP2.0    49.4/43.3  76.3/70.5   70.7/62.9  74.1/73.3    54.4/48.5
  +/-       -3.1/-3.9  +5.5/+3.3   -0.5/-1.9  +12.7/+13.5  -2.5/-3.7

Table 1: A comparison of the basic Argument model (MLE) and four hyperparameter settings of the HDP-CCG against two syntactic formalisms that participated in the PASCAL Challenge (Gelling et al., 2012), BH (Bisk and Hockenmaier, 2012a) and BC (Blunsom and Cohn, 2010), in addition to a max over all other participants. Each cell shows directed accuracy when evaluated at lengths 10/15. We trained on length-15 data (punctuation removed), including the test data, as recommended by the organizers. The last row indicates the difference between our best system and the competition.
7.2 Comparison with systems that capture linguistic constraints

Since our induction algorithm is based on knowledge of which POS tags are nouns and verbs, we compare our system in Table 2 to Naseem et al. (2010), who present a nonparametric dependency model that incorporates thirteen universal linguistic constraints. Three of these constraints correspond to our rules that verbs are the roots of sentences and may take nouns as dependents, but the other ten constraints (e.g. that adjectives modify nouns, adverbs modify adjectives or verbs, etc.) have no equivalent in our system. Although our system has less prior knowledge, it still performs competitively.

             Sl    Es    Da    Pt    Sv
  ~#Tokens   3.8K  4.2K  9.5K  15K   24K
  N10        50.9  67.2  51.9  71.5  63.3
  HDP        56.6  62.1  51.5  74.7  69.8

Table 2: A comparison of our system with Naseem et al. (2010) (N10), both trained and tested on the length-10 training data from the CoNLL-X Shared Task.

On the WSJ, Naseem et al. demonstrate the importance and effect of the specific choice of syntactic rules by comparing the performance of their system with hand-crafted universal rules (71.9), with English-specific rules (73.8), and with rules proposed by Druck et al. (2009) (64.9). The performance of Naseem et al.'s system drops very significantly as sentence length (and presumably parse complexity) increases, whereas our system shows significantly less decline, and outperforms their universal system by a significant margin:7

                            ≤ 10   ≤ 20
  Naseem Universal Rules    71.9   50.4
  Naseem English Rules      73.8   66.1
  HDP-CCG                   68.2   64.2
  HDP-CCG (train ≤ 20)      71.9   -

7 Our earlier generative model showed similar behavior, although the results in Bisk and Hockenmaier (2012b) are not directly comparable due to differences in the data.

In contrast to Spitkovsky et al. (2010), who report that the performance of their dependency-based system degrades when trained on longer sentences, our performance on length-10 sentences increases to 71.9 when we train on sentences of up to length 20.

Another system that is also based on CCG, but captures significantly more linguistic knowledge than ours, was presented by Boonkwan and Steedman (2011), who achieve an accuracy of 74.5 on WSJ10 section 23 (trained on sections 02-22). Using the same settings, our system achieves an accuracy of 68.4. Unlike our approach, Boonkwan and Steedman do not automatically induce an appropriate inventory of lexical categories, but use an extensive questionnaire that defines prototype categories for various syntactic constructions, and require significant manual engineering of which POS tags are mapped to which categories to generate a language-specific lexicon. However, their performance degrades significantly when only a subset of the questions is considered. Using only the first 14 questions, covering facts about the ordering of subjects, verbs and objects, adjectives, adverbs, auxiliaries, adpositions, possessives and relative markers, they achieve an accuracy of 68.2, which is almost identical to ours, even though we use significantly less initial knowledge.
However, the lexicons we present below indicate that we are in fact learning many of the exact details that in their system are constructed by hand. The remaining 14 questions in Boonkwan and Steedman's questionnaire cover less frequent phenomena such as the order of negative markers, dative shift, and pro-drop. The obvious advantage of their approach is that it allows them to define a much more fine-grained inventory of lexical categories than our system can automatically induce. We also stipulate that for certain languages, knowledge of pro-drop could play a significant role in the success of their approach: if complete sentences are allowed to be of the form S\N or S/N, the same lexical category can be used for the verb regardless of whether the subject is present or has been dropped.

7.3 Additional Languages

In order to provide results on additional languages, we present in Table 3 a comparison to the work of Gillenwater et al. (2010) (G10), using the CoNLL-X Shared Task data (Buchholz and Marsi, 2006). Following Gillenwater et al., we train only on sentences of length 10 from the training set and evaluate on the test set. Since this is a different training regime, and for many languages these corpora differ from those of the PASCAL challenge, the numbers in Table 1 cannot be compared directly with those in Table 3. We have also applied our model to Goldberg (2011)'s Hebrew corpus, where it achieves an accuracy of 62.1 (trained and tested on all sentences of length 10; 7,253 tokens) and 59.6 (length 15; 21,422 tokens).

             Sl      Es      Da      Pt      Sv
  #Tokens    3,857   4,230   9,549   15,015  24,021
  G10        51.2    62.4    47.2    54.3    48.6
  HDP        57.9    65.4    49.3    73.5    73.2

             Bg      WSJ     Nl      Ja      De
  #Tokens    38,220  42,442  43,405  43,501  77,705
  G10        59.8    64.4    47.5    60.2    47.4
  HDP        66.1    70.3    56.2    64.1    68.4

Table 3: A comparison of our system with Gillenwater et al. (2010), both trained on the length-10 training data and tested on the length-10 test data from the CoNLL-X Shared Task.

8 The Induced Lexicons

Since our approach is based on a lexicalized formalism such as CCG, our system automatically induces lexicons that pair words (or, in our case, POS tags) with language-specific categories that capture their syntactic behavior. If our approach is successful, it should learn the basic syntactic properties of each language, which will be reflected in the corresponding lexicon.

        Arabic        Swedish      WSJ          Childes      Japanese     Czech
  VERB  (S\N)/N  56   S        45  S\N      52  S/N      44  S        84  S        26
        (S/N)/N  29   S\N      20  (S\N)/N  19  S        37               S\N      25
  ADP   N\N      68   (S\S)/N  49  (S\S)/N  46  (S\S)/N  45  (S/S)\N  44  (S\S)/N  42
        N/N      21   (N\N)/N  25  (N\N)/N  20  N/N      25  N\N      23  (S/S)/N  26
  NOUN  N\N      50   N        91  N        79  N        89  N        73  N        74
        N        35                N/N      12
  ADJ   N\N      82   N/N      50  N/N      70  N/N      46  S/S      64  N/N      55

Figure 3: Partial lexicons demonstrating language-specific knowledge learned automatically for five languages. For ease of comparison between languages, we use the universal tag labels (VERB, ADP, NOUN, ADJ). Shown are the most common categories and the fraction of occurrences of the tag that are assigned this category (according to the Viterbi parses).

In Figure 3 one sees how verbs subcategorize differently, how word order differs by language, and how the attachment structures of prepositions are automatically discovered and differ across languages. In Arabic, for example, the system learns that word order is variable and that the verb must therefore allow for both SVO and VOS style constructions. We generally learn that adpositions (prepositions or postpositions) take nouns as arguments. In Czech, PPs can appear before and after the verb, leading to two different categories ((S\S)/N and (S/S)/N).
Japanese has postpositions that appear in preverbal position ((S/S)\N), but when this category is assigned to nominal particles that correspond to case markers, it effectively absorbs the noun, leading to a preference for verbs that do not take any arguments (S), and to a misanalysis of adjectives as verb modifiers (S/S). Our lexicons also reflect differences in style: while Childes and the WSJ are both English, they represent very different registers. We learn that subjects are mostly absent in the informal speech and child-directed instructions contained in Childes, while effectively mandatory in the Wall Street Journal.

9 Conclusions

This paper has introduced a novel factorization for CCG models and has shown that, when combined with nonparametric Bayesian statistics, it can compete with every other grammar induction system currently available, including those that capture a significant amount of prior linguistic knowledge. The use of a powerful syntactic formalism proves beneficial both in terms of requiring very limited universal knowledge and in terms of robustness at longer sentence lengths. Unlike standard grammar induction systems that are based on dependency grammar, our system returns linguistically interpretable lexicons for each language that demonstrate it has discovered their basic word order. Of particular note is the simplicity of the model, both algorithmically and in terms of implementation. By not faltering on longer sentences or requiring extensive tuning, the system can be easily and quickly deployed on a new language and return state-of-the-art performance and easily interpretable lexicons. In this paper, we have applied this model only to a restricted fragment of CCG, but future work will address the impact of lexicalization and the inclusion of richer combinators.

10 Acknowledgements

This work is supported by NSF CAREER award 1053856 (Bayesian Models for Lexicalized Grammars).

References

Taylor Berg-Kirkpatrick and Dan Klein. 2010. Phylogenetic Grammar Induction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1288–1297, Uppsala, Sweden, July.

Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer-Verlag, August.

Yonatan Bisk and Julia Hockenmaier. 2012a. Induction of Linguistic Structure with Combinatory Categorial Grammars. In NAACL HLT Workshop on Induction of Linguistic Structure, pages 90–95, Montréal, Canada, June.

Yonatan Bisk and Julia Hockenmaier. 2012b. Simple Robust Grammar Induction with Combinatory Categorial Grammars. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI-12), pages 1643–1649, Toronto, Canada, July.

David M. Blei and Michael I. Jordan. 2004. Variational Methods for the Dirichlet Process. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, July.

Phil Blunsom and Trevor Cohn. 2010. Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1204–1213, October.

Prachya Boonkwan and Mark Steedman. 2011. Grammar Induction from Text Using Small Syntactic Prototypes. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 438–446, Chiang Mai, Thailand, November.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, New York City, June.
Glenn Carroll and Eugene Charniak. 1992. Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. Working Notes of the Workshop on Statistically-Based NLP Techniques, pages 1–13.

Eugene Charniak. 1993. Statistical Language Learning. The MIT Press, Cambridge, Massachusetts.

Stephen Clark and James R. Curran. 2007. Formalism-Independent Parser Evaluation with CCG and DepBank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 248–255, Prague, Czech Republic, June.

Alex Clark. 2001. Unsupervised Language Acquisition: Theory and Practice. Ph.D. thesis, University of Sussex, September.

Shay B. Cohen and Noah A. Smith. 2009. Variational Inference for Grammar Induction with Prior Knowledge. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 1–4.

Shay B. Cohen and Noah A. Smith. 2010. Covariance in Unsupervised Learning of Probabilistic Grammars. The Journal of Machine Learning Research, pages 3117–3151, November.

Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing Tree-Substitution Grammars. The Journal of Machine Learning Research, 11:3053–3096, November.

Michael Collins. 2003. Head-Driven Statistical Models for Natural Language Parsing. Computational Linguistics, 29(4):589–637, December.

Gregory Druck, Gideon Mann, and Andrew McCallum. 2009. Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 360–368, Suntec, Singapore, August.

Jason Eisner. 1996. Efficient Normal-Form Parsing for Combinatory Categorial Grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 79–86, Santa Cruz, California, USA, June.

Douwe Gelling, Trevor Cohn, Phil Blunsom, and João V. Graça. 2012. The PASCAL Challenge on Grammar Induction. In NAACL HLT Workshop on Induction of Linguistic Structure, pages 64–80, Montréal, Canada, June.

Jennifer Gillenwater, Kuzman Ganchev, João V. Graça, Fernando Pereira, and Ben Taskar. 2010. Sparsity in Dependency Grammar Induction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 194–199, Uppsala, Sweden, July.

Jennifer Gillenwater, Kuzman Ganchev, João V. Graça, Fernando Pereira, and Ben Taskar. 2011. Posterior Sparsity in Unsupervised Dependency Parsing. The Journal of Machine Learning Research, 12:455–490, February.

Yoav Goldberg. 2011. Automatic Syntactic Processing of Modern Hebrew. Ph.D. thesis, Ben-Gurion University of the Negev, November.

William P. Headden III, Mark Johnson, and David McClosky. 2009. Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 101–109, Boulder, Colorado, June.

Julia Hockenmaier and Yonatan Bisk. 2010. Normal-form parsing for Combinatory Categorial Grammars with generalized composition and type-raising. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 465–473, Beijing, China, August.

Julia Hockenmaier and Mark Steedman. 2002. Generative Models for Statistical Parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 335–342, Philadelphia, Pennsylvania, USA, July.
Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396, September.

Yun Huang, Min Zhang, and Chew Lim Tan. 2012. Improved Combinatory Categorial Grammar Induction with Boundary Words and Bayesian Inference. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), Mumbai, India, December.

Ray Jackendoff. 1977. X-Bar Syntax: A Study of Phrase Structure. The MIT Press.

Dan Klein and Christopher D. Manning. 2004. Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 478–485, Barcelona, Spain, July.

Kenichi Kurihara and Taisuke Sato. 2004. An Application of the Variational Bayesian Approach to Probabilistic Context-Free Grammars. In International Joint Conference on Natural Language Processing Workshop Beyond Shallow Analyses, March.

Kenichi Kurihara, Max Welling, and Yee-Whye Teh. 2007. Collapsed Variational Dirichlet Process Mixture Models. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), pages 2796–2801, Hyderabad, India, January.

Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer, and Mark Steedman. 2012. A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 234–244, Avignon, France, April.

Karim Lari and Steve J. Young. 1991. Applications of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech & Language, 5(3):237–257, January.

Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein. 2007. The Infinite PCFG Using Hierarchical Dirichlet Processes. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 688–697, Prague, Czech Republic.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Probabilistic Grammars and Hierarchical Dirichlet Processes. In The Oxford Handbook of Applied Bayesian Analysis. Oxford University Press.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, June.

David Mareček and Zdeněk Žabokrtský. 2012. Unsupervised Dependency Parsing using Reducibility and Fertility features. In NAACL HLT Workshop on Induction of Linguistic Structure, pages 84–89, Montréal, Canada, June.

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using Universal Linguistic Knowledge to Guide Grammar Induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1234–1244, Cambridge, MA, October.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective Sharing for Multilingual Dependency Parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 629–637, Jeju, Republic of Korea, July.
Joakim Nivre. 2006. Inductive Dependency Parsing. Springer.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A Universal Part-of-Speech Tagset. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012), pages 2089–2096, Istanbul, Turkey, May.

Anders Søgaard. 2012. Two baselines for unsupervised dependency parsing. In NAACL HLT Workshop on Induction of Linguistic Structure, pages 81–83, Montréal, Canada, June.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010. From Baby Steps to Leapfrog: How "Less is More" in Unsupervised Dependency Parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 751–759, Los Angeles, California, June.

Mark Steedman. 1996. Surface Structure and Interpretation. The MIT Press, January.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, September.

Yee-Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581.

Yee-Whye Teh. 2006. A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, Sydney, Australia, July.

Yee-Whye Teh. 2010. Dirichlet Process. In Encyclopedia of Machine Learning, pages 280–287. Springer.

Kewei Tu. 2012. Combining the Sparsity and Unambiguity Biases for Grammar Induction. In NAACL HLT Workshop on Induction of Linguistic Structure, pages 105–110, Montréal, Canada, June.