Parsing entire discourses as very long strings: Capturing topic continuity in grounded language learning

Minh-Thang Luong, Department of Computer Science, Stanford University, Stanford, California. lmthang@stanford.edu
Michael C. Frank, Department of Psychology, Stanford University, Stanford, California. mcfrank@stanford.edu
Mark Johnson, Department of Computing, Macquarie University, Sydney, Australia. Mark.Johnson@MQ.edu.au

Abstract

Grounded language learning, the task of mapping from natural language to a representation of meaning, has attracted increasing interest in recent years. In most work on this topic, however, utterances in a conversation are treated independently and discourse structure information is largely ignored. In the context of language acquisition, this independence assumption discards cues that are important to the learner, e.g., the fact that consecutive utterances are likely to share the same referent (Frank et al., 2013). The current paper describes an approach to the problem of simultaneously modeling grounded language at the sentence and discourse levels. We combine ideas from parsing and grammar induction to produce a parser that can handle long input strings with thousands of tokens, creating parse trees that represent full discourses. By casting grounded language learning as a grammatical inference task, we use our parser to extend the work of Johnson et al. (2012), investigating the importance of discourse continuity in children's language acquisition and its interaction with social cues. Our model boosts performance in a language acquisition task and yields good discourse segmentations compared with human annotators.

1 Introduction

Learning mappings between natural language (NL) and meaning representations (MR) is an important goal for both computational linguistics and cognitive science. Accurately learning novel mappings is crucial in grounded language understanding tasks, and such systems can suggest insights into the nature of children's language learning.

Two influential examples of grounded language learning tasks are the sportscasting task, RoboCup, where the NL is the running commentary and the MR is the set of logical forms representing actions like kicking or passing (Chen and Mooney, 2008), and the cross-situational word-learning task, where the NL is the caregiver's utterances and the MR is the set of objects present in the context (Siskind, 1996; Yu and Ballard, 2007). Work in these domains suggests that, based on the co-occurrence between words and their referents in context, it is possible to learn mappings between NL and MR even under substantial ambiguity.

Nevertheless, contexts like RoboCup—where every single utterance is grounded—are extremely rare. Much more common are cases where a single topic is introduced and then discussed at length throughout a discourse. In a television news show, for example, a topic might be introduced by presenting a relevant picture or video clip. Once the topic is introduced, the anchors can discuss it by name or even using a pronoun without showing a picture. The discourse is grounded without having to ground every utterance. Moreover, although previous work has largely treated utterance order as independent, the order of utterances is critical in grounded discourse contexts: if the order is scrambled, it can become impossible to recover the topic. Supporting this idea, Frank et al.
(2013) found that topic continuity—the tendency to talk about the same topic in multiple utterances that are contiguous in time—is both prevalent and informative for word learning. This paper examines the importance of topic continuity through a grammatical inference problem. We build on Johnson et al. (2012)'s work that used grammatical inference to learn word-object mappings and to investigate the role of social information (cues like eye-gaze and pointing) in a child language acquisition task.

Figure 1: Unigram Social Cue PCFGs (Johnson et al., 2012) – shown is a parse tree of the input utterance "wheres the piggie" accompanied with social cue prefixes, indicating that the caregiver is holding a pig toy while the child is looking at it; at the same time, a dog toy is present in the scene.

Our main contribution lies in the novel integration of existing techniques and algorithms in parsing and grammar induction to offer a complete solution for simultaneously modeling grounded language at the sentence and discourse levels. Specifically, we: (1) use the Earley algorithm to exploit the special structure of our grammars, which are deterministic or have at most bounded ambiguity, to achieve approximately linear parsing time; (2) suggest a rescaling approach that enables us to build a PCFG parser capable of handling very long strings with thousands of tokens; and (3) employ Variational Bayes for grammatical inference to obtain better grammars than those given by the EM algorithm.

By parsing entire discourses at once, we shed light on a scientifically interesting question about why the child's own gaze is a positive cue for word learning (Johnson et al., 2012). Our data provide support for the hypothesis (from previous work) that caregivers "follow in": they name objects that the child is already looking at (Tomasello and Farrar, 1986). In addition, our discourse model produces a performance improvement in a language acquisition task and yields good discourse segmentations compared with human annotators.

2 Related Work

Supervised semantic parsers. Previous work has developed supervised semantic parsers to map sentences to meaning representations of various forms, including meaning hierarchies (Lu et al., 2008) and, most dominantly, λ-calculus expressions (Zettlemoyer and Collins, 2005; Zettlemoyer, 2007; Wong and Mooney, 2007; Kwiatkowski et al., 2010). These approaches rely on training data of annotated sentence-meaning pairs, however. Such data are costly to obtain and are quite different from the experience of language learners.

Grounded Language Learning. In contrast to semantic parsers, grounded language learning systems aim to learn the meanings of words and sentences given an observed world state (Yu and Ballard, 2004; Gorniak and Roy, 2007).
A growing body of work in this field employs techniques from a wide variety of perspectives, from text-to-record alignment using structured classification (Barzilay and Lapata, 2005; Snyder and Barzilay, 2007), iterative retraining (Chen et al., 2010), and generative models of segmentation and alignment (Liang et al., 2009), to text-to-interaction mapping using reinforcement learning (Branavan et al., 2009; Vogel and Jurafsky, 2010), graphical model semantics representations (Tellex et al., 2011a; Tellex et al., 2011b), and Combinatory Categorial Grammar (Artzi and Zettlemoyer, 2013). A number of systems have also used alternative forms of supervision, including sentences paired with responses (Clarke et al., 2010; Goldwasser and Roth, 2011; Liang et al., 2011) and no supervision (Poon and Domingos, 2009; Goldwasser et al., 2011).

Recent work has also introduced an alternative approach to grounded learning by reducing it to a grammatical inference problem. Börschinger et al. (2011) cast the problem of learning a semantic parser as a PCFG induction task, achieving state-of-the-art performance in the RoboCup domain. Kim and Mooney (2012) extended the technique to make it tractable for more complex problems. Later, Kim and Mooney (2013) adapted discriminative reranking to the grounded learning problem using a form of weak supervision. We employ this general grammatical inference approach in the current work.

Child Language Acquisition. In the context of language acquisition, Frank et al. (2008) proposed a system that learned words and jointly inferred speakers' intended referent (utterance topic) using graphical models. Johnson et al. (2012) used grammatical inference to demonstrate the importance of social cues in children's early word learning. We extend this body of work by capturing discourse-based dependencies among utterances rather than treating each utterance independently.

Discourse Parsing. A substantial literature has examined formal representations of discourse across a wide variety of theoretical perspectives (Mann and Thompson, 1988; Scha and Polanyi, 1988; Hobbs, 1990; Lascarides and Asher, 1993; Knott and Sanders, 1997). Although much of this work was highly influential, Marcu (1997)'s work on discourse parsing brought this task to special prominence. Since then, increasingly sophisticated models of discourse analysis have been developed, e.g., (Marcu, 1999; Soricut and Marcu, 2003; Forbes et al., 2003; Polanyi et al., 2004; Baldridge and Lascarides, 2005; Subba and Di Eugenio, 2009; Hernault et al., 2010; Lin et al., 2012; Feng and Hirst, 2012). Our contribution to work on this task is to examine latent discourse structure specifically in grounded language learning.

3 A Grounded Learning Task

Our focus in this paper is to develop computational models that help us better understand children's language acquisition. The goal is to learn both the long-term lexicon of mappings between words and objects (language learning) and the intended topic of individual utterances (language comprehension).

We consider a corpus of child-directed speech annotated with social cues, described in (Frank et al., 2013). There are a total of 4,763 utterances in the corpus, each of which is orthographically transcribed from videos of caregivers playing with pre-linguistic children of various ages (6, 12, and 18 months) during home visits.1 Each utterance was hand-annotated with objects present in the (non-linguistic) context, e.g.
dog and pig (Figure 1), together with sets of social cues, one set per object. The social cues describe objects the caregiver is looking at (mom.eyes), holding onto (mom.hands), or pointing to (mom.point); similarly for child.eyes and child.hands.

1 Caregivers were given pairs of toys to play with, e.g., a stuffed dog and pig, or a wooden car and truck.

3.1 Sentence-level Models

Motivated by the importance of social information in children's early language acquisition (Carpenter et al., 1998), Johnson et al. (2012) proposed a joint model of non-linguistic information, including the physical context and social cues, and the linguistic content of individual utterances. They framed the joint problem of inferring word-object mappings and inferring sentence topics as a grammar induction task in which input strings are utterances prefixed with non-linguistic information. Objects present in the non-linguistic context of an utterance are considered its potential topics. There is also a special null topic, None, to indicate non-topical utterances. The goal of the model is then to select the most probable topic for each utterance.

Top-level rules, Sentence → Topic_t Words_t (unigram PCFG) or Sentence → Topic_t Collocs_t (collocation Adaptor Grammar), are tailored to link the two modalities (t ranges over T', the set of all available topics T plus None). These rules enforce sharing of topics between prefixes (Topic_t) and words (Words_t or Collocs_t). Each word in the utterance is drawn from either a topic-specific distribution Word_t or a general "null" distribution Word_None.

As illustrated in Figure 1, the selected topic, pig, is propagated down to the input string through two paths: (a) through topical nodes until an object is reached, in this case the .pig object, and (b) through lexical nodes to topical word tokens, e.g., piggie. Social cues are then generated by a series of binary decisions, as detailed in Johnson et al. (2012). The key feature of these grammars is that parameter inference corresponds both to learning word-topic relations and to learning the salience of social cues in grounded learning.

In the current work, we restrict our attention to the unigram PCFG model in order to focus on investigating the role of topic continuity. Unlike the approach of Johnson et al. (2012), which uses Markov Chain Monte Carlo techniques to perform grammatical inference, we experiment with Variational Bayes methods, detailed in Section 6.

3.2 A Discourse-level Model

Topic continuity—the tendency to group utterances into coherent discourses about a single topic—may be an important source of information for children learning the meanings of words (Frank et al., 2013). To address this issue, we consider a new discourse-level model of grounded language that captures dependencies between utterances. By linking multiple utterances in a single parse, our proposed grammatical formalism is a bigram Markov process that models transitions among utterance topics.

Our grammar starts with a root symbol Discourse, which then selects a starting topic through a set of discourse-initial rules, Discourse → Discourse_t for t ∈ T'. Each of the Discourse_t nodes generates an utterance of the same topic, and advances into other topics through transition rules, Discourse_t → Sentence_t Discourse_t' for t' ∈ T'. Discourses terminate by ending rules, Discourse_t → Sentence_t. The other rules of the unigram PCFG model of Johnson et al. (2012) are reused, except for the top-level rules, in which we replace the non-terminal Sentence with topic-specific non-terminals Sentence_t.
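To make the grammar schema above concrete, the following sketch enumerates the rule skeletons for a toy topic set. This is our own illustration: the non-terminal names Discourse_t, Sentence_t, Topic_t, and Words_t follow the notation above, but the function itself is not part of the released Earleyx code, and rule probabilities are left to the learner.

```python
def discourse_grammar_rules(topics):
    """Enumerate the schematic rules of the discourse-level grammar
    for a given set of object topics."""
    t_prime = sorted(topics) + ["None"]        # T' = T plus the null topic
    rules = []
    for t in t_prime:
        # Discourse-initial rules: the root selects a starting topic.
        rules.append(("Discourse", ("Discourse_" + t,)))
        # Transition rules: emit one utterance on topic t, then move to t'.
        for t2 in t_prime:
            rules.append(("Discourse_" + t, ("Sentence_" + t, "Discourse_" + t2)))
        # Ending rule: the discourse terminates with a final utterance on t.
        rules.append(("Discourse_" + t, ("Sentence_" + t,)))
        # Topic-specific top-level rule of the unigram PCFG.
        rules.append(("Sentence_" + t, ("Topic_" + t, "Words_" + t)))
    return rules

# discourse_grammar_rules({"pig", "dog"}) contains, for example,
# Discourse_pig -> Sentence_pig Discourse_dog  and  Discourse_pig -> Sentence_pig
```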
3.3 Parsing Discourses and Challenges

Using a discourse-level grammar, we must parse a concatenation of all the utterances (with annotations) in each conversation. This concatenation results in an extremely long string: in the social-cue corpus (Frank et al., 2013), the average length of these per-recording concatenations is 2,152 tokens (σ=972). Parsing such strings poses many challenges for existing algorithms.

For familiar algorithms such as CYK, runtime quickly becomes enormous: the time complexity of CYK is O(n^3) for an input of length n. Fortunately, we can take advantage of a special structural property of our grammars. The shape of the parse tree is completely determined by the input string; the only variation is in the topic annotations in the nonterminal labels. So even though the number of possible parses grows exponentially with input length n, the number of possible constituents grows only linearly with input length, and the possible constituents can be identified from the left context.2 These constraints ensure that the Earley algorithm3 (Earley, 1970) will parse an input of length n with this grammar in time O(n).

2 The prefix markers # and ## and the topic markers such as ".dog" enable a left-to-right parser to unambiguously identify its location in the input string.
3 In order to achieve linear time, the parsing chart must have suitable indexing; see Aho and Ullman (1972), Leo (1991), and Aycock and Horspool (2002) for details.

A second challenge in parsing very long strings is that the probability of a parse is the product of the probabilities of the rules involved in its derivation. As the length of a derivation grows linearly with the length of the input, parse probabilities decrease exponentially as a function of sentence length, causing floating-point underflow on inputs of even moderate length. The standard method for handling this is to compute log probabilities (which decrease linearly as a function of input length, rather than exponentially), but as we explain later (Section 5), we can use the ability of the Earley algorithm to compute prefix probabilities (Stolcke, 1995) to rescale the probability of the parse incrementally and avoid floating-point underflow.

In the next section, we provide background information on the Earley algorithm for PCFGs, the prefix probability scheme we use, and the inside-outside algorithm in the Earley context.

4 Background

4.1 Earley Algorithm for PCFGs

The Earley algorithm was developed by Earley (1970) and is known to be efficient for certain kinds of CFGs (Aho and Ullman, 1972). An Earley parser constructs left-most derivations of strings, using dotted productions to keep track of partial derivations. Specifically, each state in an Earley parser is represented as [l,r]: X → α . β to indicate that input symbols x_l, ..., x_{r-1} have been processed and the parser is expecting to expand β. States are generated on the fly using three transition operations: predict (add states to charts), scan (shift dots across terminals), and complete (merge two states). Figure 2 shows an example of a completion step, which also illustrates the implicit binarization automatically performed by the Earley algorithm.

Figure 2: Completion step – merging two states [l,m]: X → α . Y β and [m,r]: Y → ν . to produce a new state [l,r]: X → α Y . β.
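For readers unfamiliar with the algorithm, here is a minimal, textbook-style Earley recognizer showing the three operations on dotted states. This is a self-contained sketch for illustration only: it ignores probabilities and empty productions and is not the authors' Earleyx implementation.

```python
from collections import namedtuple

# Dotted rule spanning input positions [start, end): lhs -> rhs[:dot] . rhs[dot:]
State = namedtuple("State", "lhs rhs dot start end")

def earley_recognize(tokens, grammar, start="S"):
    """grammar maps each non-terminal to a list of right-hand sides (tuples);
    any symbol without an entry is treated as a terminal.
    Assumes the grammar has no empty (epsilon) productions."""
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {State(start, tuple(rhs), 0, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            st = agenda.pop()
            if st.dot < len(st.rhs):
                sym = st.rhs[st.dot]
                if sym in grammar:                          # PREDICT
                    for rhs in grammar[sym]:
                        new = State(sym, tuple(rhs), 0, i, i)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
                elif i < n and tokens[i] == sym:            # SCAN
                    chart[i + 1].add(State(st.lhs, st.rhs, st.dot + 1, st.start, i + 1))
            else:                                           # COMPLETE
                for par in list(chart[st.start]):
                    if par.dot < len(par.rhs) and par.rhs[par.dot] == st.lhs:
                        new = State(par.lhs, par.rhs, par.dot + 1, par.start, i)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
    return any(s.lhs == start and s.dot == len(s.rhs) and s.start == 0
               for s in chart[n])
```

Note that the naive set-based chart above is not linear-time; the indexing requirements mentioned in footnote 3 are what make O(n) behavior achievable on the grammars described in §3.3.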
In order to handle PCFGs, Stolcke (1995) extends the Earley parsing algorithm with the notion of an Earley path, a sequence of states linked by Earley operations. By establishing a one-to-one mapping between partial derivations and Earley paths, Stolcke can assign each path a derivation probability, namely the product of the probabilities of all rules used in the predicted states of that path. Here, each production X → ν corresponds to a predicted state [l,l]: X → . ν.

Besides parsing, the ability to compute string and prefix probabilities by summing derivation probabilities is also of great importance. To compute these sums efficiently, each Earley state is associated with a forward and an inner probability, which are updated incrementally as new states are spawned by the three transition operations.

4.2 Forward and Prefix Probabilities

Intuitively, the forward probability of a state [l,r]: X → α . β is the probability of an Earley path through that state, generating input up to position r−1. This probability generalizes the corresponding concept in HMMs and lends itself to the computation of prefix probabilities, sums of forward probabilities over scanned states yielding a prefix x.

Computing prefix probabilities is important because it enables probabilistic prediction of possible follow-words x_{i+1} as P(x_{i+1} | x_0 ... x_i) = P(x_0 ... x_i x_{i+1}) / P(x_0 ... x_i) (Jelinek and Lafferty, 1991). These conditional probabilities allow estimation of the incremental costs of a stack decoder (Bahl et al., 1983). In (Huang and Sagae, 2010), a conceptually similar prefix cost is defined to order states in a beam search decoder. Moreover, the negative logarithm of such a conditional probability is termed a surprisal value in the psycholinguistics literature (e.g., Hale, 2001; Levy, 2008), describing how difficult a word is in a given context. In Section 5, we show that prefix probabilities also allow us to construct a parser that can handle extremely long strings.

4.3 Inside-Outside Algorithm

To extend the Inside-Outside (IO) algorithm (Baker, 1979) to the Earley context, Stolcke introduced inner and outer probabilities, which generalize the inside and outside probabilities of the IO algorithm. Specifically, the inner probability of a state [l,r]: X → α . β is the probability of generating an input substring x_l, ..., x_{r-1} from a non-terminal X using a production X → α β.4

4 Summing the inner probabilities of all states Y → ν . exactly yields Baker's inside probability for Y.

Figure 3: Inner and outer probabilities. The outer probability of X → α . Y β is a sum of products of its parent's outer probability (X → α Y . β) and its sibling's inner probability (Y → ν .). Similarly, the outer probability of Y → ν . is derived from the outer probability of X → α Y . β and the inner probability of X → α . Y β.

Once all inner probabilities have been populated in a forward pass, outer probabilities are derived backward, starting from the outer probability of the goal state [0,n]: → S ., which is 1. Here, each Earley state is associated with an outer probability, which complements the inner probability by referring precisely to those parts of the complete paths generating the input string x that are not covered by the corresponding inner probability. The implicit binarization in Earley parsing allows outer probabilities to be accumulated in a similar way to their counterparts in the IO algorithm (see Figure 3).

These quantities allow for efficient grammatical inference, in which the expected count of each rule X → λ given a string x is computed as:

\[
c(X \rightarrow \lambda \mid x) \;=\; \frac{\sum_{s:[l,r]:\,X \rightarrow\,.\,\lambda}\; \mathrm{outer}(s)\cdot\mathrm{inner}(s)}{P(S \Rightarrow^{*} x)} . \tag{1}
\]
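For reference, the accumulation described verbally in the Figure 3 caption can be written as follows. This is our notation and a sketch of the backward pass only, ignoring Stolcke's (1995) correction terms for cyclic unit productions:

\[
\begin{aligned}
\mathrm{outer}\big([l,m]\!:\,X \rightarrow \alpha\,.\,Y\beta\big) \;&\mathrel{+}=\; \mathrm{outer}\big([l,r]\!:\,X \rightarrow \alpha Y\,.\,\beta\big)\cdot \mathrm{inner}\big([m,r]\!:\,Y \rightarrow \nu\,.\big),\\
\mathrm{outer}\big([m,r]\!:\,Y \rightarrow \nu\,.\big) \;&\mathrel{+}=\; \mathrm{outer}\big([l,r]\!:\,X \rightarrow \alpha Y\,.\,\beta\big)\cdot \mathrm{inner}\big([l,m]\!:\,X \rightarrow \alpha\,.\,Y\beta\big).
\end{aligned}
\]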
5 A Rescaling Approach for Parsing

Our parser originated from the prefix-probability parser of Levy (2008), but has diverged markedly since then. The parser, called Earleyx,5 is capable of producing Viterbi parses and performing grammatical induction based on the expectation-maximization and variational Bayes algorithms.

5 Parser code is available at http://nlp.stanford.edu/~lmthang/earleyx.

To tackle the underflow problem posed by parsing discourses (§3.3), we borrow the rescaling concept from HMMs (Rabiner, 1990) and extend the probabilistic Earley algorithm. Specifically, the probability of each Earley path is scaled by a constant c_i each time it passes through a scanned state generating the input symbol x_i. In fact, each path passes through each scanned state exactly once, so we consistently accumulate scaling factors for the forward and inner probabilities of a state [l,r]: X → α . β as c_0 ... c_{r-1} and c_l ... c_{r-1}, respectively.

Arguably, the most intuitive choice of scaling factors is based on the prefix probabilities; this choice essentially resets the probability of any Earley path starting from any position i to 1. Concretely, we set c_0 = 1/P(x_0) and c_i = P(x_0 ... x_{i-1}) / P(x_0 ... x_i) for i = 1, ..., n−1, where n is the input length. As noted in §4.2, the logarithm of c_i gives us the surprisal value of the input symbol x_i.

Rescaling factors are only introduced in the forward pass; as a result, the outer probability of a state [l,r]: X → α . β ends up scaled by the factors c_0 ... c_{l-1} c_r ... c_{n-1}.6 More importantly, when computing expected counts, the scaling factors in the outer and inner terms cancel out with those in the string probability in Eq. (1), implying that rule probability estimation is unaffected by rescaling.

6 The outer probability of a state is essentially the product of inner probabilities covering all input symbols outside the span of that state. For grammars containing cyclic unit productions, we also need to multiply with terms from the unit-production relation matrix (Stolcke, 1995).

5.1 Parsing Time on Dense Grammars

Table 1 compares the parsing time (on a 2.4GHz Xeon CPU) of our parser (Earleyx) and Levy's. The task is to compute surprisal values for a 22-word sentence over a dense grammar.7 Because our parser can perform rescaling to avoid underflow, we avoid converting probabilities to logarithmic form, which yields a speedup of about 4 times over Levy's parser.

7 MLE estimated from the English Penn Treebank.

Table 1: Parsing time (dense grammars) – to compute surprisal values for a 22-word sentence using Levy's parser and ours (Earleyx).

Parser              Time (s)
(Levy, 2008)        640
Earleyx + scaling   145

5.2 Parsing Time on Sparse Grammars

Figure 4 shows the time taken (as a function of input length) for Earleyx to compute Viterbi parses over our sparse grammars (§3.2).

Figure 4: Parsing time (sparse grammars) – to compute Viterbi parses for sentences of increasing lengths (x-axis: # Words, 200–2,000; y-axis: seconds, 1–7).

The plot confirms our analysis in §3.3: the special structure of our grammars yields approximately linear parsing time in the input length.
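To see why rescaling matters numerically, the toy sketch below accumulates a path probability for a long input once without scaling and once with the factors c_i defined in Section 5. The per-symbol conditional probabilities here are made up for illustration; in Earleyx they come from the prefix probabilities of §4.2.

```python
import math

# Hypothetical per-symbol conditional probabilities P(x_i | x_0..x_{i-1}).
cond = [0.01] * 5000                 # a 5,000-token "discourse"

raw, scaled, surprisals = 1.0, 1.0, []
for p in cond:
    raw *= p                         # unscaled path probability
    c = 1.0 / p                      # c_i = P(x_0..x_{i-1}) / P(x_0..x_i)
    scaled *= p * c                  # rescaled path probability stays at 1.0
    surprisals.append(math.log(c))   # log c_i = surprisal of x_i

print(raw)              # 0.0: double precision underflows after ~150 tokens
print(scaled)           # 1.0: rescaling keeps intermediate values in range
print(sum(surprisals))  # -log P(x), recoverable from the scaling factors
```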
6 Grammar Induction

We employ a Variational Bayes (VB) approach to perform grammatical inference instead of the standard Inside-Outside (IO) algorithm, or equivalently the Expectation Maximization (EM) algorithm, for several reasons: (1) VB has been shown to be less prone to over-fitting for PCFGs than EM (Kurihara and Sato, 2004), and (2) implementation-wise, VB is a straightforward extension of EM: both share the same process of computing expected counts (the IO part) and differ only in how rule probabilities are re-estimated. At the same time, VB has also been demonstrated to do well on large datasets and is competitive with Gibbs samplers while having the fastest convergence time among these estimators (Gao and Johnson, 2008).

The rule re-estimation in VB is carried out as follows. Let α_r be the prior hyperparameter of a rule r in the rule set R and c_r be its expected count accumulated over the entire corpus after an IO iteration. The posterior hyperparameter for r is α*_r = α_r + c_r. Letting ψ denote the digamma function, the rule parameter update formula is:

\[
\theta_{r:\,X\rightarrow\lambda} \;=\; \exp\!\Big[\psi\big(\alpha^{*}_{r}\big) \;-\; \psi\Big(\sum_{r':\,X\rightarrow\lambda'} \alpha^{*}_{r'}\Big)\Big].
\]

Whereas IO minimizes the negative log-likelihood of the observed data (sentences), −log p(x), VB minimizes a quantity called the free energy, which we will use later to monitor convergence. Here x denotes the observed data and θ represents the model parameters (PCFG rule probabilities). Following (Kurihara and Sato, 2006), we compute the free energy as:

\[
F(x,\theta) \;=\; -\log p(x) \;+\; \sum_{X\in N} \log \frac{\Gamma\!\big(\sum_{r:\,X\rightarrow\lambda} \alpha^{*}_{r}\big)}{\Gamma\!\big(\sum_{r:\,X\rightarrow\lambda} \alpha_{r}\big)} \;-\; \sum_{r\in R} \Big( \log \frac{\Gamma(\alpha^{*}_{r})}{\Gamma(\alpha_{r})} \;+\; c_{r}\,\log\theta_{r} \Big),
\]

where Γ denotes the gamma function.

6.1 Sparse Dirichlet Priors

In our application, since each topic should only be associated with a few words rather than the entire vocabulary, we impose sparse Dirichlet priors over the Word_t distributions by setting a symmetric prior α < 1 for all rules Word_t → w (∀t ∈ T, w ∈ W), where W is the set of all words in the corpus. This biases the model to select only a few rules per non-terminal Word_t.8 For all other rules, a uniform hyperparameter value of 1 is used. We initialized rule probabilities with uniform distributions plus random noise. It is worth mentioning that sparse Dirichlet priors were proposed in Johnson (2010)'s work on learning Latent Dirichlet Allocation topic models using Bayesian inference for PCFGs.

8 It is important not to sparsify the Word_None distribution, since Word_None can expand into many non-topical words.
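A minimal sketch of this re-estimation step is shown below. This is our code, not part of Earleyx; it uses SciPy's digamma and assumes rules are keyed as (lhs, rhs) pairs, with the sparse prior of §6.1 encoded in the per-rule hyperparameters.

```python
import math
from collections import defaultdict
from scipy.special import digamma

def vb_rule_update(alpha, counts):
    """One VB re-estimation step.

    alpha:  dict rule -> prior hyperparameter (e.g. 0.001 for Word_t -> w
            rules, 1.0 for everything else, as in Section 6.1)
    counts: dict rule -> expected count c_r from the inside-outside pass
    Returns dict rule -> updated weight theta_r."""
    post = {r: a + counts.get(r, 0.0) for r, a in alpha.items()}   # alpha*_r
    lhs_total = defaultdict(float)
    for (lhs, _rhs), a_star in post.items():
        lhs_total[lhs] += a_star                                   # sum over r': X -> lambda'
    # theta_r = exp( psi(alpha*_r) - psi(sum_{r'} alpha*_{r'}) )
    return {r: math.exp(digamma(a_star) - digamma(lhs_total[r[0]]))
            for r, a_star in post.items()}
```

Note that, for each left-hand side, the resulting weights sum to less than one; this sub-normalization is expected behavior for the mean-field VB update.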
7 Experiments

Our experiments apply sentence- and discourse-level models to the annotated corpus of child-directed speech described in Section 3. Each model is evaluated on (a) topic accuracy—how many utterances are labeled with correct topics (including the null), (b) topic metrics (f-score/precision/recall)—how well the model predicts non-null topical utterances, (c) word metrics—how well the model predicts topical words,9 and (d) lexicon metrics—how well word types are assigned to the topic that they attach to most frequently. For example, in Figure 1, the model assigns the topic pig to the entire utterance. At the word level, it labels piggie with topic pig and assigns the null topic to wheres and the. See (Johnson et al., 2012) for more details of these metrics.

9 Topics assigned by the model are compared with those given by the gold dictionary provided by (Johnson et al., 2012).

In Section 7.1, we examine baseline models that do not make use of social cues (mother and child's eye-gaze and hand position) to discover the topic; these baselines are contrasted with models that use a range of social cues (§7.2 and §7.3). In Section 7.4, we evaluate the discourse structures discovered by our models.

7.1 Baseline Models (No Social Cues)

To create baselines for later experiments, we evaluate our models without social information. We compare sentence-level models using three different inference procedures—Markov Chain Monte Carlo (MCMC) (Johnson et al., 2012), Expectation Maximization (EM), and Variational Bayes (VB)10—as well as the discourse-level model described above.

10 To determine the best sparsity hyperparameter α for lexical rules (§6.1), we performed a line search over {1, 0.1, 0.01, 0.001, 0.0001}. As α decreases, performance improves, peaking at 0.001, the value used for all reported results.

Table 3: Baseline (non-social) models. Comparison of sentence-level models (MCMC (Johnson et al., 2012), EM, VB) and the discourse-level model.

Model       Acc.    Topic F1   Word F1   Lexicon F1
MCMC        33.95   40.44      20.07     10.37
EM          32.08   39.76      13.31      6.09
VB          39.64   39.22      17.40     12.27
discourse   40.63   42.01      19.31     12.72

Results in Table 3 suggest that incorporating topic continuity through the discourse model boosts performance compared to sentence-level models. Within the sentence-level models, EM is inferior to both MCMC and VB (in accordance with the consensus that EM is likely to overfit for PCFGs). Comparing VB and MCMC, VB is significantly better at topic accuracy but worse at topic F1. This result suggests that VB predicts more utterances as non-topical than MCMC does, perhaps explaining why MCMC has the highest word F1. Nevertheless, unlike VB, the discourse model outperforms MCMC on all topic metrics, indicating that topic continuity helps in predicting both null and topical utterances.

The discourse model is also capable of capturing topical transitions. Examining one instance of a learned grammar reveals that the distribution under Discourse_t is often dominated by a few major transitions. For example, car tends to have transitions into car (0.72) and truck (0.19), while pig prefers to transition into pig (0.69) and dog (0.24). These learned transitions nicely recover the structure of the task that caregivers were given: to play with toy pairs like car/truck and pig/dog.
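The transition behavior quoted above can be read directly off a learned grammar. A small sketch, using the same (lhs, rhs) rule encoding as the earlier example after §3.2 (illustrative only, not part of Earleyx):

```python
def topic_transitions(rule_probs, topics):
    """rule_probs maps (lhs, rhs) -> learned probability.  Returns, for each
    topic t, the distribution over next topics t' implied by the rules
    Discourse_t -> Sentence_t Discourse_t'."""
    return {t: {t2: rule_probs.get(("Discourse_" + t,
                                    ("Sentence_" + t, "Discourse_" + t2)), 0.0)
                for t2 in topics}
            for t in topics}

# In the learned grammar discussed above, topic_transitions(probs, topics)["car"]
# would be dominated by car (0.72) and truck (0.19).
```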
7.2 Social-cue Models

We next explore how topic continuity interacts with social information via a set of simulations mirroring those in the previous section. Results are shown in Table 2.

Table 2: Social-cue models. Comparison of sentence- and discourse-level models (init: initialized from the VB sentence-level model) over the full metrics. Free energies are shown to compare the VB-based models.

Model            Topic Acc./F1/P/R           Word F1/P/R         Lexicon F1/P/R       Energy
MCMC             49.07/60.64/48.67/80.43     29.50/17.63/90.31   14.83/8.10/88.10     –
VB               53.14/60.89/50.53/76.59     25.62/14.94/89.91   16.71/9.25/85.71     156719
discourse        51.02/59.40/48.60/76.35     23.86/13.82/87.33   15.05/8.27/83.33     150023
discourse+init   55.78/60.91/52.15/73.22     29.75/17.91/87.65   21.11/11.95/90.48    149458

For the sentence-level models using social cues, VB now outperforms MCMC in topic accuracy and F1, as well as in the lexicon evaluations, suggesting that VB is overall quite competitive with MCMC.11

11 A detailed breakdown of the word f-scores reveals that MCMC is much better at precision, indicating that VB predicts more words as topical than MCMC. One explanation for this effect is that we use the same α for all lexical rules, which results in suboptimal sparsity levels for the Word_t distributions. For MCMC, Johnson et al. (2012) used the adaptor grammar software to learn the hyperparameters automatically from data.

Turning to the discourse models, social information and topic continuity both independently boost learning performance (as evidenced in Johnson et al. (2012) and in Section 7.1). Nevertheless, joint inference using both information sources (discourse row) resulted in a performance decrement. Rather than reflecting issues in the model itself, this decrement may stem from the increased complexity of the inference problem. To test this explanation, we initialized our discourse-level model with the VB sentence-level model. Results are shown in the discourse+init row.

With a sentence-level initialization, performance improved substantially, yielding the best results on most metrics. In addition, the discourse model with sentence-level initialization achieved lower free energy than the discourse model with standard initialization. Both of these results support the hypothesis that initialization facilitated inference in the more complex discourse model. From a cognitive science perspective, this sort of result may point to the utility of beginning the task of discourse segmentation with some initial sentence-level expectations.

7.3 Effects of Individual Social Cues

The importance of particular social cues and their relationship to discourse continuity is an additional topic of interest from the cognitive science perspective (Frank et al., 2013). Returning to one of the questions that motivated this work, we can use our discourse model to answer the question about the role that the child.eyes cue plays in child-directed discourses.

Table 4: Social cue influence. Ablation test results across models without discourse (MCMC, VB) and with discourse (discourse+init). We start with the full set of social cues and drop one at a time. Each cell contains results for the metrics topic accuracy/topic F1/word F1/lexicon F1. For the discourse+init row, we compare models with/without a social cue using chi-square tests and denote statistically significant results (p < .05) at the utterance (∗) and word (+) levels.

                 all                   no.child.eyes          no.child.hands        no.mom.eyes           no.mom.hands          no.mom.point
MCMC             49.1/60.6/29.5/14.8   38.4/46.6/21.5/11.1    49.1/60.6/29.6/15.3   48.0/59.7/29.0/15.5   48.7/60.0/29.3/15.6   48.8/60.3/29.3/15.6
VB               53.1/60.9/25.62/16.71 49.3/56.0/22.6/15.1    52.9/60.4/26.2/16.2   51.5/59.1/24.6/16.3   51.9/59.2/25.3/16.3   52.9/60.6/25.5/16.6
discourse+init   55.8/60.9/29.8/21.1   53.7/59.2/27.8/19.7∗+  55.2/60.7/29.0/21.4+  54.7/60.0/29.0/21.6   55.2/60.1/29.1/21.4   55.6/60.8/29.5/21.7

Table 5: Social cue influence. Add-one test results across models without discourse (MCMC, VB) and with discourse (discourse). We start with no social information and add one cue at a time. Each cell contains results for the metrics topic accuracy/topic F1/word F1/lexicon F1. For the discourse row, we compare models with/without a social cue using chi-square tests and denote statistically significant results (p < .05) at the utterance (∗) and word (+) levels.

             none                   child.eyes             child.hands            mom.eyes               mom.hands             mom.point
MCMC         34.0/40.4/20.1/10.4    45.7/57.3/28.9/13.6    34.0/40.1/20.1/9.7     33.8/40.2/19.9/9.7     35.6/42.8/19.8/10.0   30.6/35.5/18.1/9.2
VB           39.6/39.2/17.4/12.27   47.2/53.0/21.9/13.9    43.0/45.8/15.4/12.9    42.9/46.5/14.6/12.4    41.1/43.8/17.1/12.4   39.7/39.7/17.5/13.4
discourse    40.7/41.8/19.2/12.1    47.8/55.4/22.8/14.2∗+  44.6/50.8/20.3/13.1∗+  44.7/50.1/21.7/14.3∗+  42.7/46.4/19.0/11.6+  38.7/40.2/16.6/11.9∗+
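The captions of Tables 4 and 5 refer to chi-square tests at the utterance and word levels. One plausible way to set up such a comparison, assuming binary per-utterance (or per-word) correctness indicators for the two model variants being compared; this is our reading of the setup, not a description of the authors' exact procedure:

```python
from scipy.stats import chi2_contingency

def significance_test(correct_a, correct_b):
    """correct_a, correct_b: lists of 0/1 correctness flags for two models."""
    table = [[sum(correct_a), len(correct_a) - sum(correct_a)],
             [sum(correct_b), len(correct_b) - sum(correct_b)]]
    chi2, p, dof, _expected = chi2_contingency(table)
    return chi2, p   # p < .05 would earn a * (utterance level) or + (word level)
```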
Johnson et al. (2012) raised two hypotheses that could explain the importance of child.eyes as a social cue: (1) caregivers "follow in" on the child's gaze: they tend to talk about what the child is looking at (Baldwin, 1993), or (2) the child.eyes cue encodes the topic of the previous sentence, inadvertently giving a non-discourse model access to rudimentary discourse information.

To address this question, we conduct two tests: (1) ablation – eliminating each social cue in turn (e.g., child.eyes), and (2) add-one – using a single social cue at a time. Tables 4 and 5 show the corresponding results for models without discourse (the MCMC and VB sentence-level models) and with discourse (discourse+init for the ablation test and discourse for the add-one test). We observe similar trends to Johnson et al. (2012): the child's gaze is the most important cue. Removing it from the full model with all social cues or adding it to the base model with no cues both result in the largest performance change; in both cases this change is statistically reliable.12 The large performance differences for child.eyes are consistent with the hypothesis that caregivers are following in, or discussing the object that children are interested in – even controlling for the continuity of discourse, a confound in previous analyses. In other words, the importance of child.eyes in the discourse model suggests that this cue encodes useful information in addition to the inter-sentential discourse topic.

12 It is somewhat surprising that child.eyes has much less influence on VB than on MCMC in the ablation test. Though results in the add-one test reveal that VB generalizes much better than MCMC when presented with a single social cue, it remains interesting to find out what internally causes the difference.

7.4 Discourse Structure Evaluation

While the discourse model performs well using metrics from previous work, these metrics do not fully reflect an important strength of the model: its ability to capture inter-utterance structure. For example, consider the sequence of utterances in Table 6. Our previous evaluation is based on the raw annotation, which labels as topical only utterances containing topical words or pronouns referring to an object. As a result, classifying "there" as car is incorrect. From the perspective of a human listener, however, "there" is part of a broader discourse about the car, and labeling it with the same topic captures the fact that it encodes useful information for learners.

Table 6: Topic annotation examples – raw (previous metrics) and discourse (new metrics).

Raw   Discourse   Utterance
car   car         come here lets find the car
      car         there
car   car         is that a car
car   car         the car goes vroom vroom vroom

To differentiate these cases, Frank and Rohde (under review) added a new set of annotations (to the dataset used in Section 7) based on the discourse structure perceived by human listeners, similar to the discourse column of Table 6.

We utilize these new annotations to judge the topics predicted by our discourse model, and we adopt previous metrics for discourse segmentation evaluation: a=b, a simple proportion measure of equivalent discourse assignments; pk, a window method (Beeferman et al., 1999) that measures the probability of two random utterances being correctly classified as belonging to the same discourse; and WindowDiff (Pevzner and Hearst, 2002), an improved version of pk that gives "partial credit" to boundaries close to the correct ones.
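For concreteness, textbook forms of the two window-based metrics just described are sketched below. This is our implementation of the standard definitions, in which both metrics are error rates; the agreement-style numbers in Tables 7 and 8, where higher is better, presumably report a complementary form that we do not reproduce here.

```python
def pk(ref, hyp, k):
    """Beeferman et al. (1999): probability that two positions k apart are
    classified inconsistently (same vs. different discourse) by ref and hyp.
    ref, hyp: equal-length sequences of discourse/topic labels per utterance."""
    n = len(ref)
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)

def window_diff(ref, hyp, k):
    """Pevzner and Hearst (2002): compare the number of discourse boundaries
    falling inside each length-k window, crediting near-miss boundaries."""
    def boundaries(seq, lo, hi):
        return sum(seq[j] != seq[j + 1] for j in range(lo, hi))
    n = len(ref)
    errors = sum(boundaries(ref, i, i + k) != boundaries(hyp, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)
```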
Results in Table 7 demonstrate that our model is in better agreement with the human annotation (model-human) than the raw annotation is (raw-human) across all metrics. As is visible from the limited change in the a=b metric, relatively few topic assignments are altered; yet these alterations create much more coherent discourses that allow for far better segmentation performance under pk and WindowDiff.

Table 7: Discourse evaluation. Single-annotator sample: comparison between topics assigned by the raw annotation, our discourse model, and a human coder.

              raw-human   model-human
a=b           63.6        69.3
pk            57.0        83.6
WindowDiff    36.2        61.2

To put an upper bound on possible discourse segmentation results, we further evaluated performance on a subset of 634 utterances for which multiple annotations were collected. Results in Table 8 demonstrate that our model predicts discourse topics (columns m-h1 and m-h2) at a level quite close to the level of agreement between the human annotators (column h1-h2).

Table 8: Discourse evaluation. Multiple-annotator sample: comparison between raw annotations (r), our model (m), and two independent human coders (h1, h2).

              r-h1   r-h2   m-h1   m-h2   h1-h2
a=b           60.1   65.6   70.4   72.4   81.7
pk            50.7   51.8   85.1   84.9   89.7
WindowDiff    29.0   30.1   60.1   66.9   72.7

8 Conclusion and Future Work

In this paper, we proposed a novel integration of existing techniques in parsing and grammar induction to offer a complete solution for simultaneously modeling grounded language at the sentence and discourse levels. Specifically, we used the Earley algorithm to exploit the special structure of our grammars to achieve approximately linear parsing time, introduced a rescaling approach to handle very long input strings, and utilized Variational Bayes for grammar induction to obtain better solutions than the Expectation Maximization algorithm.

By transforming a grounded language learning problem into a grammatical inference task, we used our parser to study how discourse structure could facilitate children's language acquisition. In addition, we investigated the interaction between discourse structure and social cues, both important and complementary sources of information in language learning (Baldwin, 1993; Frank et al., 2013). We also examined why individual children's gaze was an important predictor of reference in previous work (Johnson et al., 2012). Using ablation tests, we showed that the information provided by the child's gaze is still valuable even in the presence of discourse continuity, supporting the hypothesis that parents "follow in" on the particular focus of children's attention (Tomasello and Farrar, 1986).

Lastly, we showed that our models can produce accurate discourse segmentations. Our system's output is considerably better than the raw topic annotations provided in the previous social-cue corpus (Frank et al., 2013) and is in good agreement with the discourse topics assigned by human annotators in Frank and Rohde (under review).
In conclusion, although previous work on grounded language learning has treated individual utterances as independent entities, we have shown that the ability to incorporate discourse information can be quite useful for such problems. Discourse continuity is an important source of information in children's language acquisition and may be a valuable part of future grounded language learning systems.

Acknowledgements

We thank the TACL action editor, Mark Steedman, and the anonymous reviewers for their valuable feedback, as well as Chris Manning for helpful discussions. This research was supported under the Australian Research Council's Discovery Projects funding scheme (project numbers DP110102506 and DP110102593).

References

Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation and Compiling; Volume 1: Parsing. Prentice-Hall, Englewood Cliffs, New Jersey.
Yoav Artzi and Luke S. Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.
John Aycock and R. Nigel Horspool. 2002. Practical Earley parsing. The Computer Journal, 45(6):620–630.
Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):179–190.
James K. Baker. 1979. Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1):S132.
Jason Baldridge and Alex Lascarides. 2005. Probabilistic head-driven parsing for discourse structure. In CoNLL.
Dare A. Baldwin. 1993. Infants' ability to consult the speaker for clues to word reference. Journal of Child Language, 20:395–418.
Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In HLT-EMNLP.
Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1-3):177–210.
Benjamin Börschinger, Bevan K. Jones, and Mark Johnson. 2011. Reducing grounded learning tasks to grammatical inference. In EMNLP.
S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In ACL-IJCNLP.
Malinda Carpenter, Katherine Nagell, Michael Tomasello, George Butterworth, and Chris Moore. 1998. Social cognition, joint attention, and communicative competence from 9 to 15 months of age. Monographs of the Society for Research in Child Development, 63(4).
David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In ICML.
David L. Chen, Joohyun Kim, and Raymond J. Mooney. 2010. Training a multilingual sportscaster: Using perceptual context to learn language. Journal of Artificial Intelligence Research, 37:397–435.
James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In CoNLL.
Jay Earley. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102.
Vanessa Wei Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In ACL.
Katherine Forbes, Eleni Miltsakaki, Rashmi Prasad, Anoop Sarkar, Aravind Joshi, and Bonnie Webber. 2003. D-LTAG system: Discourse parsing with a lexicalized tree adjoining grammar. Journal of Logic, Language and Information, 12:261–279.
Michael C. Frank and Hannah Rohde. Under review. Markers of topical discourse in child-directed speech.
Michael C. Frank, Noah D. Goodman, and Josh B. Tenenbaum. 2008. A Bayesian framework for cross-situational word-learning. Advances in Neural Information Processing Systems 20.
Michael C. Frank, Joshua B. Tenenbaum, and Anne Fernald. 2013. Social and discourse contributions to the determination of reference in cross-situational word learning. Language Learning and Development, 9(1):1–24.
Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In EMNLP.
Dan Goldwasser and Dan Roth. 2011. Learning from natural instructions. In IJCAI.
Dan Goldwasser, Roi Reichart, James Clarke, and Dan Roth. 2011. Confidence driven unsupervised semantic parsing. In ACL.
Peter Gorniak and Deb Roy. 2007. Situated language understanding as filtering perceived affordances. Cognitive Science, 31(2):197–231.
Hugo Hernault, Helmut Prendinger, David A. duVerle, and Mitsuru Ishizuka. 2010. HILDA: A discourse parser using support vector machine classification. Dialogue and Discourse, 1(3):1–33.
Jerry R. Hobbs. 1990. Literature and Cognition. CSLI Lecture Notes 21.
Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In ACL.
Frederick Jelinek and John D. Lafferty. 1991. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics, 17(3):315–323.
Mark Johnson, Katherine Demuth, and Michael Frank. 2012. Exploiting social information in grounded language learning via grammatical reduction. In ACL.
Mark Johnson. 2010. PCFGs, topic models, adaptor grammars and learning topical collocations and the structure of proper names. In ACL.
Joohyun Kim and Raymond J. Mooney. 2012. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In EMNLP-CoNLL.
Joohyun Kim and Raymond J. Mooney. 2013. Adapting discriminative reranking to grounded language learning. In ACL.
Alistair Knott and Ted Sanders. 1997. The classification of coherence relations and their linguistic markers: An exploration of two languages. Journal of Pragmatics, 30(2):135–175.
Kenichi Kurihara and Taisuke Sato. 2004. An application of the variational Bayesian approach to probabilistic context-free grammars. In IJCNLP Workshop Beyond Shallow Analyses.
Kenichi Kurihara and Taisuke Sato. 2006. Variational Bayesian grammar induction for natural language. In ICGI.
Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In EMNLP.
Alex Lascarides and Nicholas Asher. 1993. Temporal interpretation, discourse relations, and common sense entailment. Linguistics and Philosophy, 16(5):437–493.
Joop M. I. M. Leo. 1991. A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead. Theoretical Computer Science, 82(1):165–176.
Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.
Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL-IJCNLP, pages 91–99.
Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL.
Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2012. A PDTB-styled end-to-end discourse parser. Natural Language Engineering, FirstView:1–34.
Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In EMNLP.
William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.
Daniel Marcu. 1997. The rhetorical parsing of natural language texts. In ACL.
Daniel Marcu. 1999. A decision-based approach to rhetorical parsing. In ACL.
Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.
Livia Polanyi, Chris Culy, Martin Van Den Berg, Gian Lorenzo Thione, and David Ahn. 2004. A rule based approach to discourse parsing. In SIGDIAL.
Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In EMNLP.
Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition.
Remko Scha and Livia Polanyi. 1988. An augmented context free grammar for discourse. In COLING.
Jeffrey M. Siskind. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):39–91.
Benjamin Snyder and Regina Barzilay. 2007. Database-text alignment via structured multilabel classification. In IJCAI.
Radu Soricut and Daniel Marcu. 2003. Sentence level discourse parsing using syntactic and lexical information. In NAACL.
Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201.
Rajen Subba and Barbara Di Eugenio. 2009. An effective discourse parser that uses rich linguistic information. In NAACL.
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis G. Banerjee, Seth Teller, and Nicholas Roy. 2011a. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 32(4):64–76.
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis G. Banerjee, Seth Teller, and Nicholas Roy. 2011b. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.
Michael Tomasello and Michael Jeffrey Farrar. 1986. Joint attention and early language. Child Development, pages 1454–1463.
Adam Vogel and Daniel Jurafsky. 2010. Learning to follow navigational directions. In ACL.
Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In ACL.
Chen Yu and Dana H. Ballard. 2004. On the integration of grounding language and learning objects. In AAAI.
Chen Yu and Dana H. Ballard. 2007. A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70(13):2149–2165.
Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.
Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL.