Fine-Grained Prediction of Syntactic Typology: Discovering Latent Structure with Supervised Learning

Dingquan Wang and Jason Eisner
Department of Computer Science, Johns Hopkins University
{wdd,eisner}@jhu.edu

Abstract

We show how to predict the basic word-order facts of a novel language given only a corpus of part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the directionalities of all dependency relations. Such typological properties could be helpful in grammar induction. While such a problem is usually regarded as unsupervised learning, our innovation is to treat it as supervised learning, using a large collection of realistic synthetic languages as training data. The supervised learner must identify surface features of a language's POS sequence (hand-engineered or neural features) that correlate with the language's deeper structure (latent trees). In experiments, we show: 1) Given a small set of real languages, it helps to add many synthetic languages to the training data. 2) Our system is robust even when the POS sequences include noise. 3) Our system on this task outperforms a grammar induction baseline by a large margin.

1 Introduction

Descriptive linguists often characterize a human language by its typological properties. For instance, English is an SVO-type language because its basic clause order is Subject-Verb-Object (SVO), and also a prepositional-type language because its adpositions normally precede the noun. Identifying basic word order must happen early in the acquisition of syntax, and presumably guides the initial interpretation of sentences and the acquisition of a finer-grained grammar. In this paper, we propose a method for doing this. While we focus on word order, one could try similar methods for other typological classifications (syntactic, morphological, or phonological).

The problem is challenging because the language's true word order statistics are computed from syntax trees, whereas our method has access only to a POS-tagged corpus. Based on these POS sequences alone, we predict the directionality of each type of dependency relation. We define the directionality to be a real number in [0, 1]: the fraction of tokens of this relation that are "right-directed," in the sense that the child (modifier) falls to the right of its parent (head). For example, the dobj relation points from a verb to its direct object (if any), so a directionality of 0.9—meaning that 90% of dobj dependencies are right-directed—indicates a dominant verb-object order. (See Table 1 for more such examples.) Our system is trained to predict the relative frequency of rightward dependencies for each of 57 dependency types from the Universal Dependencies project (UD). We assume that all languages draw on the same set of POS tags and dependency relations that is proposed by the UD project (see §3), so that our predictor works across languages.

Why do this? Liu (2010) has argued for using these directionality numbers in [0, 1] as fine-grained and robust typological descriptors. We believe that these directionalities could also be used to help define an initializer, prior, or regularizer for tasks like grammar induction or syntax-based machine translation. Finally, the vector of directionalities—or the feature vector that our method extracts in order to predict the directionalities—can be regarded as a language embedding computed from the POS-tagged corpus.
This language embedding may be useful as an input to multilingual NLP systems, such as the cross-linguistic neural dependency parser of Ammar et al. (2016). In fact, some multilingual NLP systems already condition on typological properties looked up in the World Atlas of Language Structures, or WALS (Dryer and Haspelmath, 2013), as we review in §8. However, WALS does not list all properties of all languages, and may be somewhat inconsistent since it collects work by many linguists. Our system provides an automatic alternative as well as a methodology for generalizing to new properties.

Table 1: Three typological properties in the World Atlas of Language Structures (Dryer and Haspelmath, 2013), and how they affect the directionality of Universal Dependencies relations.

    Typology                  Example                                    Relation
    Verb-Object (English)     She gave me a raise                        dobj
    Object-Verb (Hindi)       She me a raise gave
                              (vah mujhe ek uthaane diya)                dobj
    Prepositional (English)   She is in a car                            case
    Postpositional (Hindi)    She a car in is
                              (vah ek kaar mein hai)                     case
    Adjective-Noun (English)  This is a red car                          amod
    Noun-Adjective (French)   This is a car red
                              (Ceci est une voiture rouge)               amod

More broadly, we are motivated by the challenge of determining the structure of a language from its superficial features. Principles & Parameters theory (Chomsky, 1981; Chomsky and Lasnik, 1993) famously—if controversially—hypothesized that human babies are born with an evolutionarily tuned system that is specifically adapted to natural language, which can predict typological properties ("parameters") by spotting telltale configurations in purely linguistic input. Here we investigate whether such a system can be tuned by gradient descent. It is at least plausible that useful superficial features do exist: e.g., if nouns often precede verbs but rarely follow verbs, then the language may be verb-final.

2 Approach

We depart from the traditional approach to latent structure discovery, namely unsupervised learning. Unsupervised syntax learners in NLP tend to be rather inaccurate—partly because they fail to maximize an objective that has many local optima, and partly because that objective does not capture all the factors that linguists consider when assigning syntactic structure. Our remedy in this paper is a supervised approach. We simply imitate how linguists have analyzed other languages. This supervised objective goes beyond the log-likelihood of a PCFG-like model given the corpus, because linguists do not merely try to predict the surface corpus. Their dependency annotations may reflect a cross-linguistic theory that considers semantic interpretability and equivalence, rare but informative phenomena, consistency across languages, a prior across languages, and linguistic conventions (including the choice of latent labels such as dobj). Our learner does not consider these factors explicitly, but we hope it will identify correlates (e.g., using deep learning) that can make similar predictions. Being supervised, our objective should also suffer less from local optima. Indeed, we could even set up our problem with a convex objective, such as (kernel) logistic regression, to predict each directionality separately. Why hasn't this been done before?
Our setting presents unusually sparse data for supervised learning, since each training example is an entire language. The world presumably does not offer enough natural languages—particularly with machine-readable corpora—to train a good classifier to detect, say, Object-Verb-Subject (OVS) languages, especially given the class imbalance problem that OVS languages are empirically rare, and the non-IID problem that the available OVS languages may be evolutionarily related.1 We mitigate this issue by training on the Galactic Dependencies treebanks (Wang and Eisner, 2016), a collection of more than 50,000 human-like synthetic languages. The treebank of each synthetic language is generated by stochastically permuting the subtrees in a given real treebank to match the word order of other real languages. Thus, we have many synthetic languages that are Object-Verb like Hindi but also Noun-Adjective like French. We know the true directionality of each synthetic language and we would like our classifier to predict that directionality, just as it would for a real language. We will show that our system's accuracy benefits from fleshing out the training set in this way, which can be seen as a form of regularization.

1 Properties shared within an OVS language family may appear to be consistently predictive of OVS, but are actually confounds that will not generalize to other families in test data.

A possible criticism of our work is that obtaining the input POS sequences requires human annotators, and perhaps these annotators could have answered the typological classification questions as well. Indeed, this criticism also applies to most work on grammar induction. We will show that our system is at least robust to noise in the input POS sequences (§7.4). In the future, we hope to devise similar methods that operate on raw word sequences.

3 Data

We use two datasets in our experiment:

UD: Universal Dependencies version 1.2 (Nivre et al., 2015). A collection of dependency treebanks for 37 languages, annotated in a consistent style with POS tags and dependency relations.

GD: Galactic Dependencies version 1.0 (Wang and Eisner, 2016). A collection of projective dependency treebanks for 53,428 synthetic languages, using the same format as UD. The treebank of each synthetic language is generated from the UD treebank of some real language by stochastically permuting the dependents of all nouns and/or verbs to match the dependent orders of other real UD languages. Using this "mix-and-match" procedure, the GD collection fills in gaps in the UD collection—which covers only a few possible human languages.

4 Task Formulation

We now formalize the setup of the fine-grained typological prediction task. Let R be the set of universal relation types, with N = |R|. We use r→ to denote a rightward dependency token with label r ∈ R.

Input for language L: A POS-tagged corpus u. ("u" stands for "unparsed.")

Output for language L: Our system predicts p(→ | r, L), the probability that a token in language L of an r-labeled dependency will be right-oriented. It predicts this for each dependency relation type r ∈ R, such as r = dobj. Thus, the output is a vector of predicted probabilities p̂ ∈ [0, 1]^N.

Training: We set the parameters of our system using a collection of training pairs (u, p∗), each of which corresponds to some UD or GD training language L. Here p∗ denotes the true vector of probabilities as empirically estimated from L's treebank.
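To make the supervision targets concrete, the following is a minimal sketch of how the empirical vector p∗ could be computed from a single treebank, assuming the treebank is in the standard CoNLL-U format used by UD; the function name and file-handling details are illustrative rather than part of the system.

    from collections import Counter

    def empirical_directionalities(conllu_path):
        """Estimate p*(-> | r, L) for each relation r from one CoNLL-U treebank.

        Assumes the standard 10-column CoNLL-U layout: column 1 is the token's
        1-based position, column 7 is its head position, column 8 is the relation.
        """
        right = Counter()  # tokens of relation r whose child lies to the right of its head
        total = Counter()  # all tokens of relation r
        with open(conllu_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:
                    continue  # skip multiword-token and empty-node lines
                child, head, rel = int(cols[0]), int(cols[6]), cols[7]
                if head == 0:
                    continue  # the root attachment has no direction
                total[rel] += 1
                if child > head:
                    right[rel] += 1
        return {r: right[r] / total[r] for r in total}

The same per-relation totals also yield the relation frequencies count_L(r) that weight the evaluation metric below.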
Evaluation: Over pairs (u, p∗) that correspond to held-out real languages, we evaluate the expected loss of the predictions p̂. We use ε-insensitive loss2 with ε = 0.1, so our evaluation metric is

    ∑_{r∈R} p∗(r | L) · lossε( p̂(→ | r, L), p∗(→ | r, L) )        (1)

where
• lossε(p̂, p∗) ≡ max(|p̂ − p∗| − ε, 0),
• p∗(→ | r, L) = count_L(r→) / count_L(r) is the empirical estimate of p(→ | r, L),
• p̂(→ | r, L) is the system's prediction of p∗.

The aggregate metric (1) is an expected loss that is weighted by p∗(r | L) = count_L(r) / ∑_{r′∈R} count_L(r′), to emphasize relation types that are more frequent in L.

2 Proposed by V. Vapnik for support vector regression.

Why this loss function? We chose an L1-style loss, rather than L2 loss or log-loss, so that the aggregate metric is not dominated by outliers. We took ε > 0 in order to forgive small errors: if some predicted directionality is already "in the ballpark," we prefer to focus on getting other predictions right, rather than fine-tuning this one. Our intuition is that errors < ε in p̂'s elements will not greatly harm downstream tasks that analyze individual sentences, and might even be easy to correct by grammar reestimation (e.g., EM) that uses p̂ as a starting point.

In short, we have the intuition that if our predicted p̂ achieves small lossε on the frequent relation types, then p̂ will be helpful for downstream tasks, although testing that intuition is beyond the scope of this paper. One could tune ε on a downstream task.

5 Simple "Expected Count" Baseline

Before launching into our full models, we warm up with a simple baseline heuristic called expected count (EC), which is reminiscent of Principles & Parameters. The idea is that if ADJs tend to precede nearby NOUNs in the sentences of language L, then amod probably tends to point leftward in L. After all, the training languages show that when ADJ and NOUN are nearby, they are usually linked by amod. Fleshing this out, EC estimates directionalities as

    p̂(→ | r, L) = ecount_L(r→) / ( ecount_L(r→) + ecount_L(r←) )        (2)

where we estimate the expected r→ and r← counts by

    ecount_L(r→) = ∑_{u∈u} ∑_{1≤i
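As a rough illustration of the EC idea, the sketch below scores each nearby ordered pair of POS tags with attachment statistics gathered from the training treebanks, accumulates soft counts for r→ and r←, and then applies equation (2). The window size max_dist, the attach_prob table, and the treatment of "nearby" are illustrative assumptions, not the paper's exact definition of ecount_L.

    from collections import defaultdict

    def ec_directionalities(tagged_corpus, attach_prob, max_dist=3):
        """Expected-count (EC) baseline, sketched.

        tagged_corpus: list of sentences, each a list of POS tags.
        attach_prob:   dict mapping (left_tag, right_tag, distance) to a dict
                       {(r, '->'): prob, (r, '<-'): prob, ...}, estimated from
                       the training treebanks (hypothetical preprocessing step).
        Returns a dict of estimated directionalities p_hat(-> | r, L).
        """
        ecount_right = defaultdict(float)
        ecount_left = defaultdict(float)
        for sent in tagged_corpus:
            for i in range(len(sent)):
                for j in range(i + 1, min(i + 1 + max_dist, len(sent))):
                    probs = attach_prob.get((sent[i], sent[j], j - i), {})
                    for (r, direction), p in probs.items():
                        if direction == '->':   # head at i, child at j: rightward
                            ecount_right[r] += p
                        else:                   # head at j, child at i: leftward
                            ecount_left[r] += p
        return {
            r: ecount_right[r] / (ecount_right[r] + ecount_left[r])
            for r in set(ecount_right) | set(ecount_left)
            if ecount_right[r] + ecount_left[r] > 0
        }

In this sketch, the attach_prob table would be estimated by counting, in the training treebanks, how often a given tag pair at a given distance is linked by each relation in each direction, mirroring the observation that nearby ADJ–NOUN pairs are usually linked by amod.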