Unsupervised dependency parsing with acoustic cues

John K Pate (j.k.pate@sms.ed.ac.uk)
Sharon Goldwater (sgwater@inf.ed.ac.uk)
School of Informatics, University of Edinburgh
10 Crichton St., Edinburgh EH8 9AB, UK

Abstract

Unsupervised parsing is a difficult task that infants readily perform. Progress has been made on this task using text-based models, but few computational approaches have considered how infants might benefit from acoustic cues. This paper explores the hypothesis that word duration can help with learning syntax. We describe how duration information can be incorporated into an unsupervised Bayesian dependency parser whose only other source of information is the words themselves (without punctuation or parts of speech). Our results, evaluated on both adult-directed and child-directed utterances, show that using word duration can improve parse quality relative to words-only baselines. These results support the idea that acoustic cues provide useful evidence about syntactic structure for language-learning infants, and motivate the use of word duration cues in NLP tasks with speech.

1 Introduction

Unsupervised learning of syntax is difficult for NLP systems, yet infants perform this task routinely. Previous work in NLP has focused on using the implicit syntactic information available in part-of-speech (POS) tags (Klein and Manning, 2004), punctuation (Seginer, 2007; Spitkovsky et al., 2011b; Ponvert et al., 2011), and syntactic similarities between related languages (Cohen and Smith, 2009; Cohen et al., 2011). However, these approaches likely use the data in a very different way from children: neither POS tags nor punctuation are observed during language acquisition (although see Spitkovsky et al. (2011a) and Christodoulopoulos et al. (2012) for encouraging results using unsupervised POS tags), and many children learn in a broadly monolingual environment. This paper explores a possible source of information that NLP systems typically ignore: word duration, or the length of time taken to pronounce each word.

There are good reasons to think that word duration might be useful for learning syntax. First, the well-established Prosodic Bootstrapping hypothesis (Gleitman and Wanner, 1982) proposes that infants use acoustic-prosodic cues (such as word duration) to help them identify syntactic structure, because prosodic and syntactic structures sometimes coincide. More recently, we proposed (Pate and Goldwater, 2011) that infants might use word duration as a direct cue to syntactic structure (i.e., without requiring intermediate prosodic structure), because words in high-probability syntactic structures tend to be pronounced more quickly (Gahl and Garnsey, 2004; Gahl et al., 2006; Tily et al., 2009).

Like most recent work on unsupervised parsing, we focus on learning syntactic dependencies. Our work is based on Headden et al. (2009)'s Bayesian version of the Dependency Model with Valence (DMV) (Klein and Manning, 2004), using interpolated backoff techniques to incorporate multiple information sources per token. However, whereas Headden et al. used words and POS tags as input, we use words and word duration information, presenting three variants of their model that use this information in slightly different ways.[1]

[1] By using neither gold-standard nor learned POS tags as input, our work differs from nearly all previous work on unsupervised dependency parsing. While learned tags might be plausible in a model of language acquisition, gold tags certainly are not.
To our knowledge, this is the first work to incorporate acoustic cues into an unsupervised system for learning full syntactic parses. The methods in this paper were inspired by our previous approach (Pate and Goldwater, 2011), which showed that word duration measurements could improve the performance of an unsupervised lexicalized syntactic chunker over a words-only baseline. However, that work was limited to HMM-like sequence models, tested on adult-directed speech (ADS) only, and none of the models outperformed uniform-branching baselines. Here, we extend our results to full dependency parsing, and experiment on transcripts of both spontaneous ADS and child-directed speech (CDS). Our models using word duration outperform words-only baselines, as well as the Common Cover Link parser of Seginer (2007) and the Unsupervised Partial Parser of Ponvert et al. (2011), two unsupervised lexicalized parsers that have obtained state-of-the-art results on standard newswire treebanks (though their performance here is worse, as our input lacks punctuation). We also outperform uniform-branching baselines.

2 Syntax and Word Duration

Before presenting our models and experiments, we first discuss why word duration might be a useful cue to syntax. This section reviews the two possible reasons mentioned above: duration as a cue to prosodic structure, or as a cue to predictability.

2.1 Prosodic Bootstrapping

Prosody is the structure of speech as conveyed by rhythm and intonation, which are, in turn, conveyed by such measurable phenomena as variation in fundamental frequency, word duration, and spectral tilt. Prosodic structure is typically analyzed as imposing a shallow, hierarchical grouping structure on speech, with the ends of prosodic phrases (constituents) being cued in part by lengthening the last word of the phrase (Beckman and Pierrehumbert, 1986).

The Prosodic Bootstrapping hypothesis (Gleitman and Wanner, 1982) points out that prosodic phrases are often also syntactic phrases, and proposes that language-acquiring infants exploit this correlation. Specifically, if infants can learn about prosodic phrase structure using word duration (and fundamental frequency), they may be able to identify syntactic phrases more easily using word strings and prosodic trees than using word strings alone.

Several behavioral experiments support the connection between prosody and syntax, and the prosodic bootstrapping hypothesis specifically. For example, there is evidence that adults use prosodic information for syntactic disambiguation (Millotte et al., 2007; Price et al., 1991) and to help in learning the syntax of an artificial language (Morgan et al., 1987), while infants can use acoustic-prosodic cues for utterance-internal clause segmentation (Seidl, 2007).

On the computational side, we are aware of only our previous HMM-based chunkers (Pate and Goldwater, 2011), which learned shallow syntax from words, words and word durations, or words and hand-annotated prosody. Using these chunkers, we found that words plus prosodic annotation worked better than words alone, and words plus word duration worked even better. While these results are consistent with the prosodic bootstrapping hypothesis, we suggested that predictability bootstrapping (see below) might be a more plausible explanation.
Other computational work has combined prosody with syntax, but only in supervised systems, and typically using hand-annotated prosodic information. For example, Huang and Harper (2010) used annotated prosodic breaks as a kind of punctuation in a supervised PCFG, while prosodic breaks learned in a semi-supervised way have been used as features for parse reranking (Kahn et al., 2005) or PCFG state-splitting (Dreyer and Shafran, 2007). In contrast to these methods, our approach observes neither parse trees nor prosodic annotations.

2.2 Predictability Bootstrapping

On the basis of our HMM chunkers, we introduced the predictability bootstrapping hypothesis (Pate and Goldwater, 2011): the idea that word durations could be a useful cue to syntactic structure not (or not only) because they provide information about prosodic structure, but because they are a direct cue to syntactic predictability. It is well-established that talkers tend to pronounce words more quickly when they are more predictable, as measured by, e.g., word frequency, n-gram probability, or whether or not the word has been previously mentioned (Aylett and Turk, 2004; Bell et al., 2009). However, syntactic probability also seems to matter, with studies showing that verbs tend to be pronounced more quickly when they appear in their preferred syntactic frame (transitive vs. intransitive, or direct object vs. sentential complement) (Gahl and Garnsey, 2004; Gahl et al., 2006; Tily et al., 2009). While this syntactic evidence concerns only verbs, taken together with the evidence for effects of other notions of predictability it suggests that such syntactic effects may be widespread. If so, the duration of a word could give clues as to whether it is being used in a high-probability or low-probability structure, and thus what the correct structure is.

We found that our syntactic chunkers benefited more from duration information than from prosodic annotations, providing some preliminary evidence in favor of predictability bootstrapping, but not ruling out prosodic bootstrapping. So, we are left with two plausible mechanisms by which word duration could help with learning syntax. Slow pronunciations may cue the end of a prosodic phrase, which is sometimes also the end of a syntactic phrase. Alternatively, slow pronunciations may indicate that the hidden syntactic structure is low probability, facilitating the induction of a probabilistic grammar. This paper will not seek to determine which mechanism is at work, instead taking the presence of two possible mechanisms as encouraging for the prospect of incorporating word duration into unsupervised parsing.

3 Models[2]

As mentioned, we will be incorporating word duration into unsupervised dependency parsing, producing analyses like the one in Figure 1. Each arc is between two words, with the head at the non-arrow end of the arc, and the dependent at the arrow end. One word, the root, depends on no word, and all other words depend on exactly one word. Following previous work on unsupervised dependency parsing, we will not label the arcs.

[Figure 1: Example unlabeled dependency parse of "you threw it right at the basket".]

[2] The implementation of these models is available at http://github.com/jpate/predictabilityParsing

3.1 Dependency Model with Valence

All of our models are ultimately based on the Dependency Model with Valence (DMV) of Klein and Manning (2004), a generative, probabilistic model for projective (i.e., no crossing arcs), unlabeled dependency parses, such as the one in Figure 1.
The DMV generates dependency parses using three probability distributions, which together comprise the model parameters θ. First, the root of the sentence is drawn from P_root. Second, we decide whether to stop generating dependents of the head h in direction dir ∈ {left, right} with probability P_stop(·|h, dir, v), where v is T if h has a dir-ward dependent and F otherwise. If we decide to stop, then h takes no more dependents in direction dir. If we do not stop, we use the third probability distribution, P_choose(d|h, dir), to determine which dependent d to generate. The second and third steps repeat for each generated word until all words have stopped generating dependents in both directions.

The DMV was the first unsupervised parsing model to outperform a uniform-branching baseline on the Wall Street Journal corpus. It was trained using EM to obtain a maximum-likelihood estimate of the parameters θ, and learned from POS tags to avoid rare events. However, all work on syntactic predictability effects on word duration has been lexicalized (looking at, e.g., the transitivity bias of particular verbs). In addition, it is unlikely that children have access to the correct parts of speech when first learning syntactic structure. Thus, we want a DMV variant that learns from words rather than POS tags. We therefore adopt several extensions to the DMV due to Headden et al. (2009), described next.

3.2 The DMV with Backoff

Headden et al. (2009) sought to improve the DMV by incorporating lexical information in addition to POS tags. However, arcs between particular words are rare, so they modified the DMV in two ways to deal with this sparsity. First, they switched from MLE to a Bayesian approach, estimating a probability distribution over model parameters θ and dependency trees T given the training corpus C and a prior distribution α over models: P(T, θ|C, α).

Headden et al. avoided overestimating the probability of rare events that happen to occur in the training data by picking α to assign low probability to models θ which give high probability to rare events. Accordingly, models that overcommit to rare events will contribute little to the final average over models. Specifically, Headden et al. use Dirichlet priors, with α being the Dirichlet hyperparameters.

Headden et al.'s second innovation was to adapt interpolated backoff methods from language modeling with n-grams, where one can estimate the probability of word w_n given word w_{n-1} by interpolating between unigram and bigram probability estimates:

\hat{P}(w_n \mid w_{n-1}) = \lambda P(w_n \mid w_{n-1}) + (1 - \lambda) P(w_n)

with λ ∈ [0, 1]. Ideally, λ should be large when w_{n-1} is frequent, and small when w_{n-1} is rare. Headden et al. (2009) apply this method to the DMV by backing off from Choose and Stop distributions that condition on both the head word and POS to distributions that condition on only the head POS.

In the equation above, λ is a scalar parameter. However, it actually specifies a probability distribution over the decision to back off (B) or not back off (¬B), and we can use different notation to reflect this view. Specifically, λ_stop(·) and λ_choose(·) will represent our backoff distributions for the Stop and Choose decisions, respectively. Using hp and dp to represent the head and dependent POS tags, and hw and dw to represent the head and dependent words, one of the models Headden et al. explored estimates:

\hat{P}_{choose}(dp \mid hw, hp, dir) =
    \lambda_{choose}(\neg B \mid hw, hp, dir)\, P_{choose}(dp \mid hw, hp, dir)
  + \lambda_{choose}(B \mid hw, hp, dir)\, P_{choose}(dp \mid hp, dir)    (1)

with an analogous backoff for P_stop.
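To make the interpolation concrete, the following minimal Python sketch evaluates a backed-off Choose distribution in the style of Equation 1. It is not the authors' implementation (which is available at the repository cited above); the function name, the dictionary-based parameterization, and the toy numbers are all illustrative assumptions.

```python
# Sketch of an interpolated backoff Choose distribution in the style of
# Equation 1: a distribution conditioned on (head word, head POS, direction)
# is mixed with a backoff distribution conditioned on (head POS, direction).
# All names and numbers here are illustrative.

def choose_prob(dep_pos, head_word, head_pos, direction,
                p_specific, p_backoff, lam):
    """P-hat_choose(dep_pos | head_word, head_pos, direction).

    p_specific[(head_word, head_pos, direction)] maps dependent POS tags to
    probabilities; p_backoff[(head_pos, direction)] likewise; lam[ctx] is the
    probability of NOT backing off in that conditioning context.
    """
    ctx = (head_word, head_pos, direction)
    no_backoff = lam.get(ctx, 0.0)                     # lambda_choose(notB | ctx)
    specific = p_specific.get(ctx, {}).get(dep_pos, 0.0)
    backoff = p_backoff.get((head_pos, direction), {}).get(dep_pos, 0.0)
    return no_backoff * specific + (1.0 - no_backoff) * backoff


# Toy usage: a frequent head mostly trusts its own distribution,
# while an unseen head (lam defaults to 0) falls back to the POS-only estimate.
p_specific = {("threw", "VBD", "right"): {"NN": 0.6, "PRP": 0.3, "DT": 0.1}}
p_backoff = {("VBD", "right"): {"NN": 0.4, "PRP": 0.4, "DT": 0.2}}
lam = {("threw", "VBD", "right"): 0.8}

print(choose_prob("NN", "threw", "VBD", "right", p_specific, p_backoff, lam))
# 0.8 * 0.6 + 0.2 * 0.4 = 0.56
```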
We can see from Equation 1 that P̂_choose backs off from a distribution that conditions on hw to a distribution that marginalizes out hw, and that the extent of backoff varies across hw; we can use this to back off more when we have less evidence about hw. This model only conditions on words; it does not generate them in the dependents. This means it is actually a conditional, rather than fully generative, model of observed POS tags and unobserved syntax conditioned on the observed words.

Since identifying the true posterior distribution P(T, θ|C, α) is intractable, Headden et al. use Mean-field Variational Bayes (Kurihara and Sato, 2006; Johnson, 2007), which finds an approximation to the posterior using an iterative EM-like algorithm. In the E-step of VBEM, expected counts E(r_i) are gathered for each latent variable using the Inside-Outside algorithm, exactly as in the E-step of traditional EM. The Maximization step differs from the M-step of EM in two ways. First, the expected counts for each value of the latent variable x_i are incremented by the hyperparameter α_i. Second, the numerator and denominator are scaled by the function exp(ψ(·)), which reduces the probability of rare events. Specifically, the P_choose distribution is estimated using expectations for each arc a_{dp,h,dir} from head h to dependent POS tag dp in direction dir, and the update equation for P_choose from iteration n to n+1 is:

\hat{P}^{n+1}_{choose}(dp \mid h, dir) =
  \frac{\exp(\psi(E^n(a_{dp,h,dir}) + \alpha_{dp,h,dir}))}
       {\exp(\psi(\sum_c (E^n(a_{c,h,dir}) + \alpha_{c,h,dir})))}    (2)

where h is the head POS tag for the backoff distribution, and the head (word, POS) pair for the no-backoff distribution. The update equation for P_stop is analogous. Now consider the update equations for λ_choose:

\hat{\lambda}^{n+1}_{choose}(\neg B \mid hw, hp, dir) =
  \frac{\exp(\psi(\alpha_{\neg B} + \sum_c E^n(a_{c,hw,hp,dir})))}
       {\exp(\psi(\alpha_B + \alpha_{\neg B} + \sum_c E^n(a_{c,hw,hp,dir})))}

\hat{\lambda}^{n+1}_{choose}(B \mid hw, hp, dir) =
  \frac{\exp(\psi(\alpha_B))}
       {\exp(\psi(\alpha_B + \alpha_{\neg B} + \sum_c E^n(a_{c,hw,hp,dir})))}

Only the ¬B numerator includes the expected counts, so as we see hw in direction dir more often, the ¬B numerator will swamp the B numerator. By picking α_B larger than α_{¬B}, we can bias our λ distribution to prefer backing off until we expect at least α_B − α_{¬B} arcs out of hw with tag hp in direction dir.

To obtain good performance, Headden et al. replaced each word that appeared fewer than 100 times in the training data with the token "UNK." We will also use such an UNK cutoff.
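As a rough illustration of how the exp(ψ(·)) scaling in Equation 2 and the λ updates penalize rare events, here is a small Python sketch using SciPy's digamma function. The function names, the dictionary representation, and the toy counts are our own illustrative choices, not the authors' code.

```python
# Sketch of the Variational Bayes M-step in the style of Equation 2 and the
# lambda updates: expected counts are added to Dirichlet hyperparameters and
# passed through exp(digamma(.)), which shrinks the weight of rare events.
from math import exp
from scipy.special import digamma

def vb_choose_update(expected_counts, alpha):
    """expected_counts[c] = E^n(a_{c,h,dir}) for one (head, direction) context;
    alpha[c] is the corresponding Dirichlet hyperparameter.  Returns the
    quasi-probabilities P-hat^{n+1}_choose(c | h, dir)."""
    denom = exp(digamma(sum(expected_counts.get(c, 0.0) + alpha[c] for c in alpha)))
    return {c: exp(digamma(expected_counts.get(c, 0.0) + alpha[c])) / denom
            for c in alpha}

def vb_lambda_update(total_arcs_out_of_head, alpha_b, alpha_not_b):
    """Backoff weights for one (head word, head POS, direction) context, where
    total_arcs_out_of_head = sum_c E^n(a_{c,hw,hp,dir})."""
    denom = exp(digamma(alpha_b + alpha_not_b + total_arcs_out_of_head))
    return {
        "B":    exp(digamma(alpha_b)) / denom,
        "notB": exp(digamma(alpha_not_b + total_arcs_out_of_head)) / denom,
    }

# With alpha_B = 10 and alpha_notB = 1 (the values used later in Section 4.2),
# a rarely seen head backs off almost completely; a frequent head does not.
print(vb_lambda_update(0.5, alpha_b=10, alpha_not_b=1))    # mostly "B"
print(vb_lambda_update(200.0, alpha_b=10, alpha_not_b=1))  # mostly "notB"
```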
3.3 DMV with Duration

We explore three models. One is a straightforward application of the DMV with Backoff to words and (quantized) word duration, and the other two are fully generative variants. We also consider using words and POS tags as input to these models. Backoff models are given two streams of information, providing two of word identity, POS tag, or word duration for each observed token. We call one stream the "backoff" stream, and the other the "extra" stream. Backoff models learn a probability distribution conditioning on both streams, backing off to a distribution that conditions on only the backoff stream.

Our first words-and-duration model takes the duration as the extra stream and the word identity as the backoff stream, and, using ha to represent the acoustic information for the head, defines:

\hat{P}_{choose}(dw \mid hw, ha, dir) =
    \lambda_{choose}(\neg B \mid hw, ha, dir)\, P_{choose}(dw \mid hw, ha, dir)
  + \lambda_{choose}(B \mid hw, ha, dir)\, P_{choose}(dw \mid hw, dir)    (3)

with an analogous backoff scheme for P_stop. We will refer to this conditional model as "Cond." in our experiments. This equation is similar to Equation 1, except that it uses words and duration instead of words and POS tags, and backs off to, not away from, words. We back off to the sparse words, rather than the less sparse duration, because duration provides almost no information about syntax in isolation.[3]

[3] Preliminary dev-set experiments confirmed this intuition, as models that backed off to word duration performed poorly.

Directly modelling the extra stream among the dependents may allow us to capture selectional restrictions in the words-and-POS models, or to exploit effects of syntactic predictability on dependent duration. We therefore explore variants that generate both streams in the dependents. First, we examine a model ("Joint") that generates them jointly:

\hat{P}_{choose}(dw, da \mid hw, ha, dir) =
    \lambda_{choose}(\neg B \mid hw, ha, dir)\, P_{choose}(dw, da \mid hw, ha, dir)
  + \lambda_{choose}(B \mid hw, ha, dir)\, P_{choose}(dw, da \mid hw, dir)    (4)

However, this joint model will have a very large state space and may suffer from the same data sparsity, so we also explore a model ("Indep.") that generates the extra and backoff streams independently:

\hat{P}_{choose}(dw, da \mid hw, ha, dir) =
    \lambda_{choose}(\neg B \mid hw, ha, dir)\, P^{backoff}_{choose}(dw \mid hw, ha, dir)\, P^{extra}_{choose}(da \mid hw, ha, dir)
  + \lambda_{choose}(B \mid hw, ha, dir)\, P^{backoff}_{choose}(dw \mid hw, dir)\, P^{extra}_{choose}(da \mid hw, dir)    (5)

We also modified the DMV with Backoff to handle heavily lexicalized models. In Headden et al. (2009), arcs between words that never appear in the same sentence are given probability mass only by virtue of the backoff distribution to POS tags, which all appear in the same sentence at least once. We want both to avoid relying on POS tags and to use held-out development and test sets to avoid implicitly overfitting the data when exploring different model structures. To this end, we add one extra α_UNK hyperparameter to the Dirichlet prior of P_choose for each combination of conditioning events. This hyperparameter reserves probability mass for a head h to take a word dw as a dependent if h and dw never appeared together in the training data. The amount of probability mass reserved decreases as we see hw more often. This is implemented in training by adding α_UNK to the denominator of the P_choose update equation for each h and dir. At test time, if a word dw appears as an unseen dependent for head h, h takes dw as a dependent with probability:

\hat{P}_{choose}(dw \mid h, dir) =
  \frac{\exp(\psi(\alpha_{UNK}))}
       {\exp(\psi(\alpha_{UNK} + \sum_c (E^{last}(a_{c,h,dir}) + \alpha_{c,h,dir})))}    (6)

Here, h may be a word, a (word, POS) pair, or a (word, duration) pair. Since this event by definition never occurs in the training data, α_UNK does not appear in the numerator during training.[4]

[4] Note also that α_UNK is different from a global UNK cutoff, which is imposed in preprocessing, and so affects every occurrence of an UNK'd word in the model. α_UNK affects only dependents in P_choose, and treats a dependent as UNK iff it did not occur on that particular side of that particular head word in any sentence. We used both global UNK cutoffs (optimized on the dev set) and these α_UNK hyperparameters.

Finally, the conditional model ignores the extra stream in P_root, and the generative models estimate P_root over both streams jointly.
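The following sketch illustrates Equation 6: because α_UNK appears only in the denominator, the probability reserved for an unseen dependent shrinks as a head accumulates expected arcs. The representation and names are illustrative assumptions, not part of the released implementation.

```python
# Sketch of the unseen-dependent probability of Equation 6 for one
# (head, direction) context: alpha_UNK is added only to the denominator,
# reserving mass that decreases as the head is seen with more dependents.
from math import exp
from scipy.special import digamma

def unseen_dependent_prob(expected_counts, alpha, alpha_unk):
    """P-hat_choose(unseen dw | h, dir) given the final expected arc counts
    E^last(a_{c,h,dir}) and hyperparameters alpha_{c,h,dir} for this context."""
    total = alpha_unk + sum(expected_counts.get(c, 0.0) + alpha[c] for c in alpha)
    return exp(digamma(alpha_unk)) / exp(digamma(total))

# A head seen with many dependents reserves little mass for novel ones:
alphas = {"ball": 1.0, "it": 1.0}
print(unseen_dependent_prob({"ball": 40.0, "it": 25.0}, alphas, alpha_unk=1.0))  # small
print(unseen_dependent_prob({"ball": 0.5}, alphas, alpha_unk=1.0))               # larger
```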
4 Experimental Setup

4.1 Datasets

We evaluate on three datasets: wsj10, sentences of length 10 or less from the Wall Street Journal portion of the Penn Treebank; swbdnxt10, sentences of length 10 or less from the Switchboard dataset of ADS used by Pate and Goldwater (2011); and brent, part of the Brent corpus of CDS (Brent and Siskind, 2001). Table 1 presents corpus statistics.

             Train    Dev    Test
wsj10
 Word tokens 42,505  1,765  2,571
 Word types   7,804    818  1,134
 Sentences    6,007    233    357
swbdnxt10
 Word tokens 24,998  2,980  3,052
 Word types   2,647    760    767
 Sentences    3,998    488    491
brent
 Word tokens 20,954  2,127  2,206
 Word types   1,390    482    488
 Sentences    6,249    424    449

Table 1: Statistics for our three corpora.

4.1.1 wsj10

We present a new evaluation of the DMV with Backoff on wsj10, which does not have any acoustic information, simply to verify that α_UNK performs sensibly on a standard corpus. Additionally, Headden et al. (2009) use an intensive initializer that relies on dozens of random restarts, and so, strictly speaking, only show that the backoff technology is useful for good initializations. Our new evaluation will show that the backoff technology provides a substantial benefit even for harmonic initialization.

wsj10 was created in the standard way; all punctuation and traces were removed, and sentences containing more than ten tokens were discarded. For our fully lexicalized version of wsj10, all words were lowercased, and numbers were replaced with the token "NUMBER."[5] Following standard practice, we used sections 2-21 for training, section 22 for development, and section 23 for test. wsj10 contains hand-annotated constituency parses, not dependency parses, so we used the standard constituency-to-dependency conversion tool of Johansson and Nugues (2007) to obtain high-quality CoNLL-style dependency parses.

[5] Numbers were treated in this way only in wsj10.

4.1.2 swbdnxt10

Next, we evaluate on swbdnxt10, which contains all sentences up to length 10 from the same sections of the swbdnxt version of Switchboard used by Pate and Goldwater (2011). Short sentences are usually formulaic discourse responses (e.g. "oh ok"), so this dataset also excludes sentences shorter than three words. As our models successfully use word durations, this evaluation provides an important replication of the basic result from Pate and Goldwater (2011) with a different kind of syntactic model.

swbdnxt10 has a forced alignment of a dictionary-based phonetic transcription of each utterance to audio, providing our word duration information. As a very simple model of hyper-articulation and hypo-articulation, we classify each word as falling in the longest, shortest, or middle third of durations. To minimize effects of word form, this classification was based on vowel count (counting a diphthong as one vowel): each word with n vowels is classified as falling in the shortest, middle, or longest tercile of duration among words with n vowels (a sketch of this binning appears at the end of this subsection).

Like wsj10, swbdnxt10 is annotated only with constituency parses, so to provide approximate "gold-standard" dependencies, we used the same constituency-to-dependency conversion tool as for wsj10. We evaluated 200 randomly-selected sentences to check the accuracy of the conversion tool, which was designed for newspaper text. Excluding arcs involving words with no clear role in dependency structure (such as "um"), about 86% of the arcs were correct. While this rate is uncomfortably low, it is still much higher than unsupervised dependency parsers typically achieve, and so may provide a reasonable measure of relative dependency parse quality among competing systems.
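As a concrete illustration of the vowel-count-conditioned tercile binning described above (the brent corpus below is binned the same way), here is a minimal Python sketch; the token representation and the exact tie-breaking are our own assumptions, since the paper does not specify them.

```python
# Sketch of the duration binning: each word token is labelled as being in the
# shortest, middle, or longest third of durations among tokens with the same
# number of vowels.  Input format and names are illustrative; the corpora's
# actual forced-alignment format will differ.
from collections import defaultdict

def duration_terciles(tokens):
    """tokens: list of (word, n_vowels, duration_in_seconds).
    Returns a parallel list of 'short', 'mid', or 'long' labels."""
    by_vowels = defaultdict(list)
    for i, (_, n_vowels, dur) in enumerate(tokens):
        by_vowels[n_vowels].append((dur, i))

    labels = [None] * len(tokens)
    for group in by_vowels.values():
        group.sort()                              # sort tokens by duration
        n = len(group)
        for rank, (_, i) in enumerate(group):
            if rank < n / 3:
                labels[i] = "short"
            elif rank < 2 * n / 3:
                labels[i] = "mid"
            else:
                labels[i] = "long"
    return labels

tokens = [("you", 1, 0.11), ("threw", 1, 0.26), ("it", 1, 0.09),
          ("right", 1, 0.21), ("at", 1, 0.07), ("the", 1, 0.05),
          ("basket", 2, 0.43)]
print(duration_terciles(tokens))
```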
4.1.3 brent

We also evaluated our models on the "Large Brent" dataset introduced in Rytting et al. (2010), a portion of the Brent corpus of child-directed speech (Brent and Siskind, 2001). We call this corpus brent. It consists of utterances from four of the mothers in Brent and Siskind's (2001) study, and, like swbdnxt10, has a forced alignment from which we obtain duration terciles. Rytting et al. (2010) used a 90%/10% train/test partition. We extracted every ninth utterance from the original training partition to create a dev set, producing an 80%/10%/10% partition. We also separated clitics from their base words. This dataset has only 186 sentences longer than ten words, with a maximum length of 22 words, so we discarded only sentences shorter than three words from the evaluation sets.

The Brent corpus is distributed via CHILDES (MacWhinney, 2000) with automatic dependency annotations. However, these are not hand-corrected, and rely on a different tokenization of the dataset than is present on the transcription tier. To produce a reliable gold standard,[6] we annotated all sentences of length 2 or greater from the development and test sets with dependencies drawn from the Stanford Typed Dependency set (de Marneffe and Manning, 2008), using the annotation tool used for the Copenhagen Dependency Treebank (Kromann, 2003).

[6] Available at http://homepages.inf.ed.ac.uk/s0930006/brentDep/

4.2 Parameters

In all experiments, the hyperparameters for P_root, P_stop, and P_choose (and their backed-off distributions, and including α_UNK) were 1, α_B was 10, and α_¬B was 1. VBEM was run on the training set until the data log-likelihood changed by less than 0.001%, and then the parameters were held fixed and used to obtain Viterbi parses for the evaluation sentences. Finally, we explored different global UNK cutoffs, replacing each word that appeared fewer than c times with the token UNK (sketched below). We ran each model for each c ∈ {0, 1, 25, 50, 100}, and picked the best-scoring c on the development set for running on the test set and presentation here. We used a harmonic initializer similar to the one in Klein and Manning (2004).
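For concreteness, the global UNK preprocessing described above can be sketched as follows; the function name and data format are illustrative assumptions, and the cutoff c is the value selected on the development set.

```python
# Sketch of the global UNK cutoff: every word type occurring fewer than c
# times in the training data is replaced with "UNK" in all splits.
# (c is chosen from {0, 1, 25, 50, 100} on the development set.)
from collections import Counter

def apply_unk_cutoff(train_sents, other_sents, c):
    """train_sents / other_sents: lists of lists of word strings."""
    counts = Counter(w for sent in train_sents for w in sent)
    keep = {w for w, n in counts.items() if n >= c}
    unk = lambda sent: [w if w in keep else "UNK" for w in sent]
    return [unk(s) for s in train_sents], [unk(s) for s in other_sents]

train = [["you", "threw", "it"], ["you", "threw", "the", "ball"]]
dev = [["you", "threw", "a", "ball"]]
print(apply_unk_cutoff(train, dev, c=2))
```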
4.3 Evaluation

In addition to evaluating the various incarnations of the DMV with backoff and different input types, we compare to uniform-branching baselines, the Common Cover Link (CCL) parser of Seginer (2007), and the Unsupervised Partial Parser (UPP) of Ponvert et al. (2011). The UPP produces a constituency parse from words and punctuation using a series of finite-state chunkers; we use the best-performing (Probabilistic Right Linear Grammar) version. The CCL parser produces a constituency parse using a novel "Cover Link" representation, scoring these links heuristically. Both CCL and UPP rely on punctuation (though, according to Ponvert et al. (2011), UPP less so), which our input lacks. The left-headed "LH" (right-headed "RH") baseline assumes that each word takes the first word to its right (left) as a dependent, and corresponds to a uniform right-branching (left-branching) constituency baseline.

We evaluate the output of all models in terms of both constituency scores and dependency accuracy. Our wsj10 and swbdnxt10 corpora are originally annotated for constituency structure, with the dependency gold standard derived as described above, while our brent corpus is originally annotated for dependency structure, with the constituency gold standard derived by defining a constituent to span a head and each of its dependents (ignoring any one-word "constituents"). As the CCL and UPP parsers do not produce dependencies, only constituency scores are provided for them. For constituency scores, we present the standard unlabeled Precision, Recall, and F-measure scores. For dependency scores, we present Directed attachment accuracy, Undirected attachment accuracy, and the "Neutral Edge Detection" (NED) score introduced by Schwartz et al. (2011). Directed attachment accuracy counts an arc as a true positive if it correctly identifies both a head and a dependent, whereas undirected attachment accuracy ignores arc direction in counting true positives. NED counts an arc as a true positive if it would be a true positive under the Undirected attachment score, or if the proposed head is the gold-standard grandparent of the proposed dependent. This avoids penalizing parses for flipping an arc, such as making determiners, rather than nouns, the heads of noun phrases (a scoring sketch follows at the end of this subsection).

To assess statistical significance, we carried out stratified shuffling tests, with 10,000 random shuffles, for all measures. Tables indicate significant differences between the backoff models and the most competitive baseline model on that measure, indicated by an italic score. A star (∗) indicates p < 0.05, and a dagger (†) indicates p < 0.01. To see the direction of a significant difference (i.e., whether the backoff model is better or worse than the baseline), look to the scores themselves.
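The three dependency scores can be summarized with a short sketch, following the descriptions above. This is our own illustrative code (including the convention of marking the root's head as −1), not an official implementation of Schwartz et al.'s NED scorer.

```python
# Sketch of the three dependency scores.  Parses are maps from dependent
# index to head index, with the root's head given as -1 (a representation
# chosen for this example; any equivalent one works).

def dependency_scores(gold, proposed):
    """gold, proposed: dicts {dependent: head} over the same token indices."""
    directed = undirected = ned = 0
    for dep, p_head in proposed.items():
        g_head = gold[dep]
        if p_head == g_head:
            directed += 1
        # undirected: also allow the same arc with head and dependent swapped
        if p_head == g_head or gold.get(p_head) == dep:
            undirected += 1
        # NED: additionally allow the proposed head to be the gold grandparent
        if (p_head == g_head or gold.get(p_head) == dep
                or gold.get(g_head) == p_head):
            ned += 1
    n = len(proposed)
    return directed / n, undirected / n, ned / n

# "you threw it right at the basket": one plausible gold analysis with
# "threw" as root, versus a parse that flips the arc between "at" and "basket".
gold     = {0: 1, 1: -1, 2: 1, 3: 4, 4: 1, 5: 6, 6: 4}
proposed = {0: 1, 1: -1, 2: 1, 3: 4, 4: 6, 5: 6, 6: 1}
print(dependency_scores(gold, proposed))  # directed and undirected drop, NED does not
```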
5 Results

In all results, when a model sees only one kind of information, this is expressed by writing out the abbreviation for the relevant stream: "Wds" for words, "POS" for part-of-speech tags, "Dur" for word duration. For baseline models that see two streams, the abbreviations are joined by a "×" symbol (as they treat input pairs as atoms drawn from the cross-product of the two streams' vocabularies). For the backoff models, the abbreviations are joined by a "+" symbol (as they combine the information sources with a weighted sum), with the "extra" stream name first.

5.1 Results: wsj10

The left half of Table 2 presents results on wsj10. For the baseline models, the second column indicates the input, while for the backoff (Wds+POS) models, it indicates whether and how the extra stream is modeled in the dependents (as described in Section 3.3).

                          wsj10                                      swbdnxt10
                  UNK  Dir.   Undir.  NED    P      R      F      UNK  Dir.   Undir.  NED    P      R      F
EM       Wds       25  32.5   52.5   67.0   49.5   48.5   49.0     25  30.6   50.9   66.8   45.4   47.1   46.3
         POS        —  46.4   63.8   78.1   59.2   58.1   58.6      —  53.0   65.0   76.8   52.5   52.9   52.7
VB       Wds       25  29.4   52.4   70.5   51.3   52.6   52.0     25  36.1   54.9   72.7   49.0   50.0   49.5
         POS        —  43.5   61.9   77.3   59.7   57.1   58.4      —  51.3   62.5   74.3   47.1   46.6   46.8
Wds+POS  Cond.     50  49.9†  66.1†  79.6∗  64.2†  61.9†  63.0†   100  45.5†  62.4†  77.8   58.4†  58.9†  58.7†
         Joint     50  46.0   63.7   79.0   62.0†  59.1   60.5∗     1  49.4†  63.7   79.6†  60.0†  52.9   56.3†
         Indep.    25  52.5†  68.0†  83.5†  63.5†  61.5†  62.5†   100  55.7†  65.8   74.6†  61.5†  57.9†  59.6†
LH                  —  26.0   55.8   74.3   53.1   69.6   60.3      —  24.1   50.8   72.7   60.8   82.5   70.0
RH                  —  31.2   56.4   61.4   25.8   33.8   29.3      —  29.2   52.0   57.9   22.2   30.1   25.5
CCL                 —     —      —      —   50.8   40.7   45.2      —     —      —      —   53.6   47.4   50.3
UPP                 —     —      —      —   52.8   37.2   43.7      —     —      —      —   60.0   46.6   52.4

Table 2: Performance on wsj10 and swbdnxt10 for models using words and POS tags only. Dir., Undir., and NED are dependency scores; P, R, and F are constituency scores; UNK is the dev-selected cutoff. Bold scores indicate the best performance of all models and baselines on that measure. †: significantly different from the best non-uniform baseline (italics) by a stratified shuffling test, p < 0.01; ∗: p < 0.05.

The EM model with POS input is largely a replication of the original DMV, differing in the use of separate train, dev, and test sets, and possibly the details of the harmonic initializer. Our replication achieves an undirected attachment score of 63.8 on the test set, similar to the score of 64.5 reported by Klein and Manning (2004) when training and evaluating on all of wsj10. Cohen et al. (2008) use the same train/dev/test partition that we do, and report a directed attachment score of 45.8, similar to our directed attachment score of 46.4.

The VB model which learns from POS tags does not outperform the EM model which learns from POS tags, suggesting that data sparsity does not hurt the DMV when using POS tags. As expected, the words-only models perform much worse than both the POS-input models and the uniform LH baseline. VB does improve the words-only constituency performance.

The Cond. and Indep. backoff models outperform the POS-only baseline on all measures, but the Joint backoff model does not demonstrate a clear advantage over the POS-only baseline on any measure. The success of the Indep. model indicates that modelling dependent word identity does provide enough information to justify the increase in sparsity. The failure of the Joint model to provide a further improvement indicates that the extra information in the full joint distribution over dependents does not justify the large increase in parameters. We also see that several models outperform the LH baseline on dependencies, but the advantage is much smaller in F-score, underscoring the loss of information in the conversion of dependencies to constituencies. Finally, all models outperform CCL and UPP on F-score, emphasizing their reliance on the punctuation we removed.

                    UNK  Dir.   Undir.  NED    P      R      F
EM       Wds         25  30.6   50.9   66.8   45.4   47.1   46.3
         Wds×Dur     25  26.1   46.5   62.0   45.6   48.7   47.1
VB       Wds         25  36.4   55.1   73.0   49.1   50.0   49.6
         Wds×Dur     25  31.8   51.7   71.3   49.2   55.9   52.3
Dur+Wds  Cond.       25  32.6†  55.1   74.5†  59.1†  71.4†  64.7†
         Joint       50  31.8†  51.8†  70.8∗  54.4†  60.5†  57.3†
         Indep.      50  40.3†  59.1†  76.0†  56.1†  61.7†  58.8†
LH                    —  24.1   50.8   72.7   60.8   82.5   70.0
RH                    —  29.2   52.0   57.9   22.2   30.1   25.5
CCL                   —     —      —      —   53.6   47.4   50.3
UPP                   —     —      —      —   60.0   46.6   52.4

[Scatterplot: "Switchboard Model Performance"; x-axis: Undirected Attachment Score; y-axis: Constituency F-score; points for Wds, Wds×Dur, Cond., Joint, Indep., and LH.]

Table 3: Performance on swbdnxt10 for models using words and duration. The scatterplot includes a subset of the information in the table: F-score and undirected attachment accuracy for the backoff models and the VB and LH baselines. Bold, italics, and significance annotations as in Table 2.

5.2 Results: swbdnxt10

The right half of Table 2 presents performance figures on swbdnxt10 for input involving words and POS tags. As expected, the EM and VB baselines perform best when learning from gold-standard POS tags, and we again see no benefit for the VB POS-only model compared to the EM POS-only model. The POS-only baselines far outperform the uniform-attachment baselines on the dependency measures; to our knowledge this is the first demonstration outside the newspaper domain that the DMV outperforms a uniform-branching strategy on these measures.

The other comparisons among the systems listed in Table 2 are largely inconclusive. Models do comparatively well on either the constituency or the dependency evaluation, but not both. The backoff models outperform the baseline POS-only models in the constituency evaluation, but underperform or match those same models in the dependency evaluation. Conversely, most models outperform the LH baseline in the dependency evaluation, but not in the constituency evaluation.
There are probably two causes for the ambiguity in these results. First, the noise in the dependency gold standard may have overwhelmed any advantage from backoff. Second, as we saw with wsj10, the conversion from dependencies to constituencies removes information, which may explain the failure of any model to outperform the LH baseline in the constituency evaluation.

Table 3 presents performance figures on swbdnxt10 for input involving words and duration, including a scatterplot of Undirected attachment against constituency F-score for the interesting comparisons. In the scatterplot, models up and to the right performed better, and we see that the negative correlation between the dependency and constituency evaluations persists with words-and-duration input. VB substantially outperforms EM among the baselines, indicating that good smoothing is helpful when learning from words. Other comparisons are again ambiguous; the dependency evaluation is noisy, and the backoff models outperform the baseline models on the constituency evaluation but not the LH baseline. Still, the backoff models outperform all words-only baselines in constituency score, with two performing slightly worse in dependency score and one performing much better. So there is some evidence that word duration is useful, but we will find clearer evidence on the brent corpus.

5.3 Results: brent

Table 4 presents results on the brent dataset. VB is even more effective here than on the other datasets for improving performance among the baseline models, leading to double-digit improvements on some measures. Moreover, the best dev-set UNK cutoff drops to 1 for all VB models, indicating that, on this dataset, VB provides good smoothing even in models without backoff. This difference between datasets is likely related to differences in vocabulary diversity; the type:token ratio in the brent training set is about 1:15, compared to 1:5 and 1:9 in the wsj10 and swbdnxt10 training sets, respectively.

                    UNK  Dir.   Undir.  NED    P      R      F
EM       Wds         25  36.9   56.3   70.7   52.4   69.5   59.8
         Wds×Dur     25  31.3   51.1   66.9   50.7   64.7   56.9
VB       Wds          1  51.2   64.2   77.3   63.3   68.1   66.0
         Wds×Dur      1  47.0   60.5   74.0   66.2   64.9   65.5
Dur+Wds  Cond.        1  53.1∗  65.5∗  78.7∗  65.4   68.6   67.0∗
         Joint        1  50.7   63.0   76.3   65.6   65.4†  65.5
         Indep.       1  53.2   66.7†  79.6†  61.5†  67.9   64.5
LH                    —  28.3   53.6   78.3   47.9   85.6   61.4
RH                    —  27.2   48.8   61.1   26.2   46.8   33.6
CCL                   —     —      —      —   41.7   58.8   48.8
UPP                   —     —      —      —   56.8   63.8   60.1

[Scatterplot: "Brent Model Performance"; x-axis: Undirected Attachment Score; y-axis: Constituency F-score; points for Wds, Wds×Dur, Cond., Joint, Indep., and LH.]

Table 4: Performance on brent for models using words and duration. The scatterplot includes a subset of the information in the table: F-score and undirected attachment accuracy for the backoff models and the VB and LH baselines. Bold, italics, and significance annotations as in Table 2.

More importantly for our main hypothesis, all three backoff models using words and duration outperform the words-only baselines (including CCL and UPP) on all dependency measures (the most accurate measures on this corpus, which has hand-annotated dependencies), and the Cond. model also wins on F-score.

6 Conclusion

In this paper, we showed how to use the DMV with Backoff and two fully generative variants to explore the utility of word duration in fully lexicalized unsupervised dependency parsing.
Although other researchers have incorporated features beyond words and POS tags into DMV-like models (e.g., semantics: Naseem and Barzilay (2011); morphology: Berg-Kirkpatrick et al. (2009)), we believe this is the first example based on Headden et al. (2009)'s backoff method. As far as we know, our work is also the first test of a DMV-based model on transcribed conversational speech and the first to outperform uniform-branching baselines without using either POS tags or punctuation in the input. Our results show that fully lexicalized models can do well if they are smoothed properly and exploit multiple cues.

Our experiments also suggest that CDS is especially easy to learn from. Model performance on the brent dataset was generally higher than on swbdnxt10, with a much lower UNK threshold. This latter point, and the fact that brent has a much lower word type:token ratio than the other datasets, suggest that CDS provides more and clearer evidence about words' syntactic behavior.

Finally, our results provide more evidence, using a different, more powerful syntactic model than that of Pate and Goldwater (2011), that word duration is a useful cue for unsupervised parsing. We found that several ways of incorporating duration were useful, although the extra sparsity of Joint emissions was not justified in any of our investigations. Our results are consistent with both the prosodic and predictability bootstrapping hypotheses of language acquisition, providing the first computational support for these hypotheses using a full syntactic parsing model tested on child-directed speech. While our models do not provide a mechanistic account of how children might use duration information to help with learning syntax, they do show that this information is useful in principle, even without any knowledge of latent prosodic structure or its relationship to syntax. In addition, our results suggest it may be useful to explore using word duration to enrich NLP tasks in speech-related technologies, such as syntactically-inspired language models for text-to-speech generation. In the future, we also hope to investigate why duration is helpful, designing experiments to tease apart the roles of prosody and predictability in learning syntax.

References

Matthew Aylett and Alice Turk. 2004. The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47(1):31–56.

Mary Beckman and Janet Pierrehumbert. 1986. Intonational structure in Japanese and English. Phonology Yearbook, 3:255–309.

Alan Bell, Jason M Brenier, Michelle Gregory, Cynthia Girand, and Dan Jurafsky. 2009. Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language, 60:92–111.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2009. Painless unsupervised learning with features. In Proceedings of NAACL.

Michael R Brent and Jeffrey M Siskind. 2001. The role of exposure to isolated words in early vocabulary development. Cognition, 81:31–44.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2012. Turning the pipeline into a loop: Iterated unsupervised dependency parsing and PoS induction. In Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure, pages 96–99, Montréal, Canada, June. Association for Computational Linguistics.

Shay B Cohen and Noah A Smith. 2009. Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of NAACL.
Shay B Cohen, Kevin Gimpel, and Noah A Smith. 2008. Logistic normal priors for unsupervised probabilistic grammar induction. In Advances in Neural Information Processing Systems 22.

Shay B Cohen, Dipanjan Das, and Noah A Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In Proceedings of EMNLP.

Marie-Catherine de Marneffe and Christopher D Manning. 2008. Stanford typed dependencies manual. Technical report.

Markus Dreyer and Izhak Shafran. 2007. Exploiting prosody for PCFGs with latent annotations. In Proceedings of Interspeech, Antwerp, Belgium, August.

Susanne Gahl and Susan M Garnsey. 2004. Knowledge of grammar, knowledge of usage: Syntactic probabilities affect pronunciation variation. Language, 80:748–775.

Susanne Gahl, Susan M Garnsey, Cynthia Fisher, and Laura Matzen. 2006. "That sounds unlikely": Syntactic probabilities affect pronunciation. In Proceedings of the 27th Meeting of the Cognitive Science Society.

Lila Gleitman and Eric Wanner. 1982. Language acquisition: The state of the art. In Eric Wanner and Lila Gleitman, editors, Language Acquisition: The State of the Art, pages 3–48. Cambridge University Press, Cambridge, UK.

Will Headden, Mark Johnson, and David McClosky. 2009. Improved unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of NAACL-HLT.

Zhongqiang Huang and Mary Harper. 2010. Appropriately handled prosodic breaks help PCFG parsing. In Proceedings of NAACL-HLT, pages 37–45, Los Angeles, California, June. Association for Computational Linguistics.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In Proceedings of NODALIDA 2007.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of EMNLP-CoNLL, pages 296–305.

Jeremy G Kahn, Matthew Lease, Eugene Charniak, Mark Johnson, and Mari Ostendorf. 2005. Effective use of prosody in parsing conversational speech. In Proceedings of HLT-EMNLP, pages 233–240.

Dan Klein and Christopher D Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of ACL, pages 479–486.

Matthias Trautner Kromann. 2003. The Danish Dependency Treebank and the DTAG treebank tool. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories, pages 217–220.

Kenichi Kurihara and Taisuke Sato. 2006. Variational Bayesian grammar induction for natural language. In Proceedings of the International Colloquium on Grammatical Inference, pages 84–96.

Brian MacWhinney. 2000. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, Mahwah, NJ, third edition.

Séverine Millotte, Roger Wales, and Anne Christophe. 2007. Phrasal prosody disambiguates syntax. Language and Cognitive Processes, 22(6):898–909.

James L Morgan, Richard P Meier, and Elissa L Newport. 1987. Structural packaging in the input to language learning: Contributions of prosodic and morphological marking of phrases to the acquisition of language. Cognitive Psychology, 19:498–550.

Tahira Naseem and Regina Barzilay. 2011. Using semantic cues to learn syntax. In Proceedings of AAAI.

John K Pate and Sharon Goldwater. 2011. Unsupervised syntactic chunking with acoustic cues: Computational models for prosodic bootstrapping. In Proceedings of the 2nd ACL Workshop on Cognitive Modeling and Computational Linguistics.
Elias Ponvert, Jason Baldridge, and Katrin Erk. 2011. Simple unsupervised grammar induction from raw text with cascaded finite state models. In Proceedings of ACL-HLT.

Patti J Price, Mari Ostendorf, Stefanie Shattuck-Hufnagel, and Cynthia Fong. 1991. The use of prosody in syntactic disambiguation. In Proceedings of the HLT Workshop on Speech and Natural Language, pages 372–377, Morristown, NJ, USA. Association for Computational Linguistics.

C Anton Rytting, Chris Brew, and Eric Fosler-Lussier. 2010. Segmenting words from natural speech: Subsegmental variation in segmental cues. Journal of Child Language, 37(3):513–543.

Roy Schwartz, Omri Abend, Roi Reichart, and Ari Rappoport. 2011. Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In Proceedings of the 49th ACL, pages 663–672.

Yoav Seginer. 2007. Fast unsupervised incremental parsing. In Proceedings of ACL.

Amanda Seidl. 2007. Infants' use and weighting of prosodic cues in clause segmentation. Journal of Memory and Language, 57(1):24–48.

Valentin I Spitkovsky, Hiyan Alshawi, Angel X Chang, and Daniel Jurafsky. 2011a. Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of EMNLP.

Valentin I Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2011b. Punctuation: Making a point in unsupervised dependency parsing. In Proceedings of CoNLL.

Harry Tily, Susanne Gahl, Inbal Arnon, Neal Snider, Anubha Kothari, and Joan Bresnan. 2009. Syntactic probabilities affect pronunciation variation in spontaneous speech. Language and Cognition, 1(2):147–165.