Exploring the Role of Stress in Bayesian Word Segmentation using Adaptor Grammars Exploring the Role of Stress in Bayesian Word Segmentation using Adaptor Grammars Benjamin Börschinger1,2 Mark Johnson1,3 1Department of Computing, Macquarie University, Sydney, Australia 2Department of Computational Linguistics, Heidelberg University, Heidelberg, Germany 3Santa Fe Institute, Santa Fe, USA {benjamin.borschinger|mark.johnson}@mq.edu.au Abstract Stress has long been established as a major cue in word segmentation for English infants. We show that enabling a current state-of-the-art Bayesian word segmentation model to take ad- vantage of stress cues noticeably improves its performance. We find that the improvements range from 10 to 4%, depending on both the use of phonotactic cues and, to a lesser ex- tent, the amount of evidence available to the learner. We also find that in particular early on, stress cues are much more useful for our model than phonotactic cues by themselves, consistent with the finding that children do seem to use stress cues before they use phono- tactic cues. Finally, we study how the model’s knowledge about stress patterns evolves over time. We not only find that our model cor- rectly acquires the most frequent patterns rel- atively quickly but also that the Unique Stress Constraint that is at the heart of a previously proposed model does not need to be built in but can be acquired jointly with word segmen- tation. 1 Introduction Among the first tasks a child language learner has to solve is picking out words from the fluent speech that constitutes its linguistic input.1 For English, stress has long been claimed to be a useful cue in infant word segmentation (Jusczyk et al., 1993; Jusczyk et al., 1999b), following the demonstra- 1The datasets and software to replicate our experiments are available from http://web.science.mq.edu.au/ ˜bborschi/ tion of its effectiveness in adult speech process- ing (Cutler et al., 1986). Several studies have investigated the role of stress in word segmenta- tion using computational models, using both neu- ral network and “algebraic” (as opposed to “statis- tical”) approaches (Christiansen et al., 1998; Yang, 2004; Lignos and Yang, 2010; Lignos, 2011; Lig- nos, 2012). Bayesian models of word segmenta- tion (Brent, 1999; Goldwater, 2007), however, have until recently completely ignored stress. The sole exception in this respect is Doyle and Levy (2013) who added stress cues to the Bigram model (Gold- water et al., 2009), demonstrating that this leads to an improvement in segmentation performance. In this paper, we extend their work and show how to integrate stress cues into the flexible Adaptor Gram- mar framework (Johnson et al., 2007). This allows us to both start from a stronger baseline model and to investigate how the role of stress cues interacts with other aspects of the model. In particular, we find that phonotactic cues to word-boundaries inter- act with stress cues, indicating synergistic effects for small inputs and partial redundancy for larger in- puts. Overall, we find that stress cues add roughly 6% token f-score to a model that does not account for phonotactics and 4% to a model that already in- corporates phonotactics. Relatedly and in line with the finding that stress cues are used by infants be- fore phonotactic cues (Jusczyk et al., 1999a), we ob- serve that phonotactic cues require more input than stress cues to be used efficiently. A closer look at the knowledge acquired by our models shows that the Unique Stress Constraint of Yang (2004) can be acquired jointly with segmenting the input instead 93 Transactions of the Association for Computational Linguistics, 2 (2014) 93–104. Action Editor: Stefan Riezler. Submitted 12/2013; Published 2/2014. c©2014 Association for Computational Linguistics. of having to be pre-specified; and that our models correctly identify the predominant stress pattern of the input but underestimate the frequency of iambic words, which have been found to be missegmented by infant learners. The outline of the paper is as follows. In Section 2 we review prior work. In Section 3 we introduce our own models. In Section 4 we explain our experimen- tal evaluation and its results. Section 5 discusses our findings, and Section 6 concludes and provides some suggestions for future research. 2 Background and related work Lexical stress is the “accentuation of syllables within words” (Cutler, 2005) and has long been ar- gued to play an important role in adult word recog- nition. Following Cutler and Carter (1987)’s obser- vation that stressed syllables tend to occur at the be- ginnings of words in English, Jusczyk et al. (1993) investigated whether infants acquiring English take advantage of this fact. Their study demonstrated that this is indeed the case for 9 month olds, al- though they found no indication of using stressed syllables as cues for word boundaries in 6 month olds. Their findings have been replicated and ex- tended in subsequent work (Jusczyk et al., 1999b; Thiessen and Saffran, 2003; Curtin et al., 2005; Thiessen and Saffran, 2007). A brief summary of the key findings is as follows: English infants treat stressed syllables as cues for the beginnings of words from roughly 7 months of age, suggesting that the role played by stress needs to be acquired, and that this requires antecedent segmentation by non- stress-based means (Thiessen and Saffran, 2007). They also exhibit a preference for low-pass filtered stress-initial words from this age, suggesting that it is indeed stress and not other phonetic or phono- tactic properties that are treated as a cue for word- beginnings (Jusczyk et al., 1993). In fact, phontactic cues seem to be used later than stress cues (Jusczyk et al., 1999a) and seem to be outweighed by stress cues (Mattys and Jusczyk, 2000). The earliest computational model for word seg- mentation incorporating stress cues we are aware of is the recurrent network model of Christiansen et al. (1998) and Christiansen and Curtin (1999). They only reported a word-token f-score of 44% (roughly, segmentation accuracy: see Section 4), which is considerably below the performance of subsequent models, making a direct comparison complicated. Yang (2004) introduced a simple incremental algo- rithm that relies on stress by embodying a Unique Stress Constraint (USC) that allows at most a sin- gle stressed syllable per word. On pre-syllabified child directed speech, he reported a word token f- score of 85.6% for a non-statistical algorithm that exploits the USC. While the USC has been argued to be near-to-universal and follows from the “cul- minative function of stress” (Fromkin, 2001; Cutler, 2005), the high score Yang reported crucially de- pends on every word token carrying stress, including function words. More recently, Lignos (2010, 2011, 2012) further explored Yang’s original algorithm, taking into account that function words should not be assumed to possess lexical stress cues. While his scores are in line with those reported by Yang, the importance of stress for this learner were more modest, providing a gain of around 2.5% (Lignos, 2011). Also, the Yang/Lignos learner is unable to acquire knowledge about the role stress plays in the language, e.g. that stress tends to fall on particular positions within words. Doyle and Levy (2013) extend the Bigram model of Goldwater et al. (2009) by adding stress- templates to the lexical generator. A stress-template indicates how many syllables the word has, and which of these syllables (if any) are stressed. This allows the model to acquire knowledge about the stress patterns of its input by assigning different probabilities to the different stress-templates. How- ever, Doyle and Levy (2013) do not directly exam- ine the probabilities assigned to the stress-templates; they only report that their model does slightly prefer stress-initial words over the baseline model by cal- culating the fraction of stress-initial word types in the output segmentations of their models. They also demonstrate that stress cues do indeed aid segmen- tation, although their reported gain of 1% in token f-score is even smaller than that reported by Lig- nos (2011). Our own approach differs from theirs in assuming phonemic rather than pre-syllabified in- put (although our model could, trivially, be run on syllabified input as well) and makes use of Adap- tor Grammars instead of the Goldwater et al. (2009) Bigram model, providing us with a flexible frame- work for exploring the usefulness of stress in differ- 94 ent models. Adaptor Grammar (Johnson et al., 2007) is a grammar-based formalism for specifying non- parametric hierarchical models. Previous work ex- plored the usefulness of, for example, syllable- structure (Johnson, 2008b; Johnson and Goldwa- ter, 2009) or morphology (Johnson, 2008b; Johnson, 2008a) in word segmentation. The closest work to our own is Johnson and Demuth (2010) who investi- gate the usefulness of tones for Mandarin phonemic segmentation. Their way of adding tones to a model of word segmentation is very similar to our way of incorporating stress. 3 Models We give an intuitive description of the mathemati- cal background of Adaptor Grammars in 3.1, refer- ring the reader to Johnson et al. (2007) for technical details. The models we examine are derived from the collocational model of Johnson and Goldwater (2009) by varying three parameters, resulting in 6 models: two baselines that do not take advantage of stress cues and either do or do not use phonotactics, as described in Section 3.2; and four stress models that differ with respect to the use of phonotactics, and as to whether they embody the Unique Stress Constraint introduced by Yang (2004). We describe these models in section 3.3. 3.1 Adaptor Grammars Briefly, an Adaptor Grammar (AG) can be seen as a probabilistic context-free grammar (PCFG) with a special set of adapted non-terminals. We use un- derlining to distinguish adapted non-terminals ( X ) from non-adapted non-terminals ( Y ). The distri- bution for each adapted non-terminal X is drawn from a Pitman-Yor Process which takes as its base- distribution the tree-distribution over trees rooted in X as defined by the PCFG. As an effect, each adapted non-terminal can be seen as having associ- ated with it a cache of previously-generated subtrees that can be reused without having to be regenerated using the individual PCFG rules. This allows AGs to learn reusable sub-trees such as words, sequences of words, or smaller units such as Onsets and Codas. Thus, while ordinary PCFGs have a finite number of parameters (one probability for each rule), Adap- tor Grammars in addition have a parameter for every possible complete tree rooted in any of its adapted non-terminals, leading to a potentially infinite num- ber of such parameters. The Pitman-Yor Process in- duces a rich-get-richer dynamics, biasing the model towards identifying a small set of units that can be reused as often as possible. In the case of word seg- mentation, the model will try to identify as compact a lexicon as possible to segment the unsegmented input. 3.2 Baseline models Our starting point is the state-of-the-art AG model for word segmentation, Johnson and Goldwater (2009)’s colloc3-syll model, reproduced in Fig- ure 1.2 The model assumes that words are grouped into larger collocational units that themselves can be grouped into even larger collocational units. This accounts for the fact that in natural language, there are strong word-to-word dependencies that need to be accounted for if severe undersegmentations of the form “is in the” are to be avoided (Goldwater, 2007; Johnson and Goldwater, 2009; Börschinger et al., 2012). It also uses a language-independent form of syllable structure to constrain the space of possi- ble words. Finally, this model can learn word-initial onsets and word-final codas. In a language like En- glish, this ability provides additional cues to word- boundaries as certain onsets are much more likely to occur word-initially than medially (e.g. “bl” in “black”), and analogously for certain codas (e.g. “dth” in “width” or “ngth” in “strength”). We define an additional baseline model by replac- ing rules (5) and (6) by (17), and deleting rules (7) to (12). This removes the model’s ability to use phono- tactic cues to word-boundaries. Word → Syll ( Syll ) ( Syll ) ( Syll ) (17) We refer to the model in Figure 1 as the colloc3- phon model, and the model that results from sub- stituting and removing rules as described as the colloc3-nophon model. Alternatively, one could limit the models ability to capture word-to-word de- pendencies by removing rules (1) to (3). This results 2We follow Johnson and Goldwater (2009) in limiting the length of possible words to four syllables to speed up runtime. In pilot experiments, this choice did not have a noticeable effect on segmentation performance. 95 Collocations3 → Collocation3 + (1) Collocation3 → Collocation2 + (2) Collocation2 → Collocation + (3) Collocation → Word + (4) Word → SyllIF (5) Word → SyllI ( Syll ) ( Syll ) SyllF (6) SyllIF → ( OnsetI ) RhymeF (7) SyllI → ( OnsetI ) Rhyme (8) SyllF → ( Onset ) RhymeF (9) CodaF → Consonant + (10) RhymeF → Vowel ( CodaF ) (11) OnsetI → Consonant + (12) Syll → ( Onset ) Rhyme (13) Rhyme → Vowel ( Coda ) (14) Onset → Consonant + (15) Coda → Consonant + (16) Figure 1: The baseline model. We use regular-expression notation to abbreviate multiple rules. X{n} stands for up to n repetitions of X , brackets indicate optionality, and X + stands for one or more repetitions of X . X indicates an adapted non-terminal. Rules that introduce terminals for the pre-terminals Vowel , Consonant are omitted. Refer to the main text for an explanation of the grammar. in the colloc-model (Johnson, 2008b) that has previ- ously been found to behave similarly to the Bigram model used in Doyle and Levy (2013) (Johnson, 2008b; Börschinger et al., 2012). We performed ex- periments with the colloc-model as well and found similar results to Doyle and Levy (2013) which are, while overall worse, similar in trend to the results obtained for the colloc3-models. For the rest of the paper, therefore, we will focus on variants of the colloc3-model. 3.3 Stress-based models In order for stress cues to be helpful, the model must have some way of associating the position of stress with word-boundaries. Intuitively, the reason stress helps infants in segmenting English is that a stressed syllable is a reliable indicator of the beginning of a word (Jusczyk et al., 1993). More generally, if there is a (reasonably) reliable relationship between the position of stressed syllables and beginnings (or Word →{SSyll | USyll}{1,4} (18) SSyll → ( Onset ) RhymeS (19) USyll → ( Onset ) RhymeU (20) RhymeS → Vowel ∗( Coda ) (21) RhymeU → Vowel ( Coda ) (22) Onset → Consonant + (23) Coda → Consonant + (24) Figure 2: Description of the all-stress-patterns model. We use X{m,n} for “at least m and at most n repetitions of X ” and {X | Y} for “either X or Y ”. Stress is asso- ciated with a vowel by suffixing it with the special termi- nal symbol ∗ , leading to a distinction between stressed ( SSyll ) and unstressed ( USyll ) syllables. A word can consist of any possible sequence of up to four syllables, as indicated by the regular-expression notation. By ad- ditionally adding initial and final variants of SSyll and USyll as in Figure 1, phonotactics can be combined with stress cues. endings) of words, a learner might exploit this rela- tionship. In a Bayesian model, this intuition can be captured by modifying the lexical generator, that is, the distribution that generates Word s. Here, changing the lexical generator corresponds to modifying the rules expanding Word . A straight- forward way to modify it accordingly is to enu- merate all possible sequences of stressed and un- stressed syllables.3 A lexical generator like this is given in Figure 2. In the data, stress cues are rep- resented using a special terminal “∗” that follows a stressed vowel, as illustrated in Figure 3. In the grammar, “∗” is constrained to only surface follow- ing a Vowel , rendering a syllable in which it occurs stressed ( SSyll ). Syllables that do not contain a “∗” are considered unstressed ( USyll ). By performing inference for the probabilities assigned to the dif- ferent expansions of rule (18), our models can, for example, learn that a bi-syllabic word that is stress- initial (a trochee) is more probable than one that puts stress on the second syllable (an iamb). This (partly) captures the tendency of English for stress-initial words and thus provide an additional cue for identi- fying words; and it is exactly the kind of preference infant learners of English seem to acquire (Jusczyk 3This is, in essence, also the strategy chosen by Doyle and Levy (2013). 96 grammar phon stress USC colloc3-nophon colloc3-phon • colloc3-nophon-stress • colloc3-phon-stress • • colloc3-nophon-stress-usc • • colloc3-phon-stress-usc • • • Table 1: The different models used in our experiments. “phon” indicates whether phonotactics are used, “stress” whether stress cues are used and “usc” whether the Unique Stress Constraint is assumed. orthographic the do-gie no-stress dh ah d ao g iy stress dh ah d ao * g iy Figure 3: Illustration of the input-representation we choose. We indicate primary stress according to the dic- tionary with bold-face in the orthography. The phonemic transcription uses ARPABET and is produced using an extended version of CMUDict. Primary stress is indi- cated by inserting the special symbol “*” after the vowel of a stressed syllable. et al., 1993). We can combine this lexical generator with the colloc3-nophon baseline, resulting in the colloc3- nophon-stress model. We can also add phonotac- tics to the lexical generator in Figure 2 by adding initial and final variants of SSyll and USyll , anal- ogous to rules (5) to (12) in Figure 1. This yields the colloc3-phon-stress model. We can also add the Unique Stress Constraint (USC) (Yang, 2004) by excluding all variants of rule (18) that generate two or more stressed syllables. For example, while the lexical generator for the colloc3-nophon-stress model will include the rule Word → SSyll SSyll , the lexical generator embodying the USC lacks this rule. We refer to the models that include the USC as colloc3-nophon-stress-usc and colloc3-phon-stress- usc models. A compact overview of the six different models is given in Table 1. 4 Experiments We evaluate our models on several corpora of child directed speech. We first describe the corpora we used, then the experimental methodology employed and finally the experimental results. As the trend is comparable across all corpora, we only discuss in detail results obtained on the Alex corpus. For com- pleteness, however, Table 3 reports the “standard” evaluation of performing inference over all of the three corpora. 4.1 Corpora and corpus creation Following Christiansen et al. (1998) and Doyle and Levy (2013), we use the Korman corpus (Korman, 1984) as one of our corpora. It comprises child- directed speech for very young infants, aged be- tween 6 and 16 weeks and, like all other cor- pora used in this paper, is available through the CHILDES database (MacWhinney, 2000). We de- rive a phonemicized version of the corpus using an extended version of CMUDict (Carnegie Mellon University, 2008)4, as we were unable to obtain the stress-annotated version of this corpus used in previ- ous experiments. The phonemicized version is pro- duced by replacing each orthographic word in the transcript with the first pronunciation given by the dictionary. CMUDict also annotates lexical stress, and we use this information to add stress cues to the corpus. We only code primary lexical stresses in the input, ignoring secondary stresses in line with ex- perimental work that indicates that human listeners are capable of reliably distinguishing primary and secondary stress (Mattys, 2000). Due to the very low frequency of words with 3 or more syllables in these corpora, this choice has very little effect on the number of stress cues available in the input. Our ver- sion of the Korman corpus contains, in total, 11413 utterances. Unlike Christiansen et al. (1998), Yang (2004), and Doyle and Levy (2013), we follow Lig- nos and Yang (2010) in making the more realistic as- sumption that the 94 mono-syllabic function words listed by Selkirk (1984) never surface with lexical stress. As function words account for roughly 50% of the tokens but only roughly 5% of the types in our corpora, this means that the type and token distribu- tion of stress patterns differs dramatically in all our corpora, as can be seen from Table 2. We also added stress information to the Brent- Bernstein-Ratner corpus (Bernstein-Ratner, 1987; Brent, 1999), following the procedure just out- lined. This corpus is a de-facto standard for evaluat- 4http://svn.code.sf.net/p/cmusphinx/ code/trunk/cmudict/cmudict.0.7a 97 Pattern brent korman alex Tok Typ Tok Typ Tok Typ W+ .48 .07 .47 .08 .44 .05 SW∗ .49 .86 .49 .86 .52 .87 WSW∗ .03 .07 .03 .06 .04 .07 Other .00 .00 .00 .00 .00 .00 Table 2: Relative frequencies for stress patterns for the corpora used in our study. X∗ stands for 0 or more, X+ for one or more repetitions of X, and S for a stressed and W for an unstressed syllable. Note the stark asymmetry between type and token frequencies for unstressed words. Up to two-decimal places, patterns other than the ones given have relative frequency 0.00 (frequencies might not sum to 1 as an artefact of rounding to 2 decimal places). ing models of Bayesian word segmentation (Brent, 1999; Goldwater, 2007; Goldwater et al., 2009; Johnson and Goldwater, 2009), comprising in total 9790 utterances. As our third corpus, we use the Alex portion of the Providence corpus (Demuth et al., 2006; Börschinger et al., 2012). A major benefit of the Providence corpus is that the video-recordings from which the transcripts were produced are available through CHILDES alongside the transcripts. This will allow future work to rely on even more realis- tic stress cues that can be derived directly from the acoustic signal. While beyond the scope of this pa- per, we believe choosing a corpus that makes richer information available will be important for future work on stress (and other acoustic) cues. Another major benefit of the Alex corpus is that it provides longitudinal data for a single infant, rather than be- ing a concatenation of transcripts collected from multiple children, such as the Korman and the Brent- Bernstein-Ratner corpus. In total, the Alex corpus comprises 17948 utterances. Note that despite the differences in age of the in- fants and overall make-up of the corpora, the dis- tribution of stress patterns across the corpora is roughly the same, as shown by Table 2 for the first 10,000 utterances of each of the corpora. This sug- gests that the distribution of stress patterns both at a token and type level is a robust property of English child-directed speech. 4.2 Evaluation procedure The aim of our experiments is to understand the contribution of stress cues to the Bayesian word segmentation models described in Section 3. To get an idea of how input size interacts with this, we look at prefixes of the corpora with increasing sizes (100, 200, 500, 1000, 2000, 5000, and 10,000 utterances). In addition, we are interested in under- standing what kind of stress pattern preferences our models acquire. For this, we also collect samples of the probabilities assigned to the different expansions of rule (18), allowing us to examine this directly. The standard evaluation of segmentation models in- volves having them segment their input in an un- supervised manner and evaluating performance on how well they segmented that input. We addition- ally evaluate the models on a test set for each cor- pus. Use of a separate test set has previously been suggested as a means of testing how well the knowl- edge a learner acquired generalizes to novel utter- ances (Pearl et al., 2011), and is required for the kind of comparison across different sizes of input we are interested in to determine whether there the role of stress cues interacts with the input size. We create the test-sets by taking the final 1000 ut- terances for each corpus. These 1000 utterances will be segmented by the model after it has performed inference on its input, without making any further changes to the lexicon that the model has induced. In other words, the model will have to segment each of the test utterances using only the lexicon (and any additional knowledge about co-occurrences, phono- tactics, and stress) it has acquired from the training portion of the corpus during inference. We measure segmentation performance using the standard metric of token f-score (Brent, 1999) which is the harmonic mean of token precision and recall. Token f-score provides an overall impression of how accurate individual word tokens were identified. To illustrate, if the gold segmentation is “the dog”, the segmentation “th e dog” has a token precision of 1 3 (one out of three predicted words is correct); a token recall of 1 2 (one of the two gold words was correctly identified); and a token f-score of 0.4. 4.3 Inference For inference, we closely follow Johnson and Gold- water (2009): we put vague priors on all the hyper- 98 p s usc alex korman brent train test train test train test .81 .81 .85 .83 .82 .82 • .85 .84 .86 .84 .86 .86 • .86 .87 .87 .86 .86 .87 • • .88 .88 .88 .87 .87 .87 • • .87 .88 .87 .88 .86 .87 • • • .88 .88 .88 .87 .87 .88 Table 3: Token f-scores on both train and test portions for all three corpora when inference is performed over the full corpus. Note that the benefit of stress is clearer when evaluating on the test set, and that overall, perfor- mance of the different models is comparable across all three corpora. Models are coded according to the key in Table 1. parameters of our models and run 4 chains for 1000 iterations, collecting 20 samples from each chain with a lag of 10 iterations between each sample af- ter a burn-in of 800 iterations, using both batch- initialization and table-label resampling to ensure good convergence of the sampler. We construct a single segmentation from the posterior samples us- ing their minimum Bayes risk decoding, providing a single score for each condition. 4.4 Experimental conditions Each of our six models is evaluated on inputs of in- creasing size, starting at 100 and ending at 10,000 utterances, allowing us to investigate both how per- formance and “knowledge” of the learner varies as a function of input size. For completeness, we also report the “standard” evaluation, i.e. performance of our models on all corpora when trained on the entire input in Table 3. We will focus our discussion on the results obtained on the Alex corpus, which are de- picted in Figure 4, where the input size is depicted on the x-axis, and the segmentation f-score for the test-set on the y-axis. 5 Discussion We find a clear improvement for the stress-models over both the colloc3-nophon and the colloc3-phon models. As can be seen in Table 3, the overall trend is the same for all three corpora, both when evaluating on the input and the separate test-set.5 5We performed Wilcox rank sum tests on the individual scores of the 4 independent chains for each model on the full training data sets and found that the stress-models were always Note how the relative gain for stress is roughly 1% higher when evaluating on the test-set; this might have to do with Jusczyk (1997)’s observa- tion that the advantage of stress “might be more evident for relatively unexpected or unfamiliarized strings” (Jusczyk, 1997). A closer look at Figure 4 indicates further interesting differences between the colloc3-nophon and the colloc3-phon models that only become evident when considering different in- put sizes. 5.1 Stress cues without phonotactics For the colloc3-nophon models, we observe a rel- atively stable improvement by adding stress cues of 6-7%, irrespective of input size and whether or not the Unique Stress Constraint (USC) is assumed. The sole exception to this occurs when the learner only gets to see 100 utterances: in this case, the colloc-nophon-stress model only shows a 3% im- provement, whereas the colloc3-nophon-stress-usc model obtains a boost of roughly 8%. Noticeable consistent differences between the colloc3-nophon- stress and colloc3-nophon-stress-usc model, how- ever, all but disappear starting from around 500 ut- terances. This is somewhat surprising, considering that it is the USC that was argued by Yang (2004) to be key for taking advantage of stress.6 We take this behaviour to indicate that even with as little evidence as 200 to 500 utterances, a Bayesian ideal learner can effectively infer that something like the USC is true of English. This also becomes clear when examining how the learn- ers’ preferences for different stress patterns evolve over time, as we do in Section 5.3 below. 5.2 Stress cues and phonotactics Overall, the models including phonotactic cues per- form better than those that do not rely on phono- tactics. However, the overall gain contributed by stress to the colloc3-phon baseline is smaller, al- significantly more accurate (p < 0.05) than the baseline models except when evaluating on the training data for the Korman and Brent corpora. 6On data in which function words are marked for stress (as in Yang (2004) and Doyle and Levy (2013)), the USC yields ex- tremely high scores across all models, simply because roughly every second word is a function word. Given that this assump- tion is extremely unnatural, we do not take this as an argument for the USC. 99 0.65 0.70 0.75 0.80 0.85 100 200 500 1000 2000 5000 10000 number of input utterances se g m e n ta tio n f − sc o re colloc3−nophon colloc3−phon colloc3−nophon−stress colloc3−phon−stress colloc3−nophon−stress−usc colloc3−phon−stress−usc Figure 4: Segmentation performance of the different models, across different input sizes and as evaluated on the test-set for the Alex corpus. The no-stress baselines are given in red, the stress-models without the Unique Stress Constraint (USC) in green and the ones including the USC in black. Solid lines indicate models that use, dashed lines models that do not use phonotactics. Refer to the text for discussion. though this seems to depend on the size of the input. While phonotactics by itself appears to be a pow- erful cue, yielding a noticeable 4-5% improvement over the colloc3-nophon baseline, the learner seems to require at least around 500 utterances before the colloc3-phon model becomes clearly more accurate than the colloc3-nophon model. In contrast, even for only 100 utterances stress cues by themselves provide a 3% improvement to the colloc3-nophon model, indicating that they can be taken advantage of earlier. While the number of utterances processed by a Bayesian ideal learner is not directly related to developmental stages, this observation is consistent with the psycholinguists’ claim that phonotactics are used by infants for word segmentation after they have begun to use stress for segmentation (Jusczyk et al., 1999a). Turning to the interaction between stress and phonotactics, we see that there is no consistent ad- vantage of including the USC in the model. This is, in fact, even clearer than for the colloc3-nophon model where at least for small inputs of size 100, the USC added almost 5% in performance. For the colloc3-phon models, we only observe a 1-2% im- provement by adding the USC up until 500 utter- ances. This further strengthens the point that even in the absence of such an innate constraint, a statisti- cal learner can take advantage of stress cues and, as we show below, actually acquire something like the USC from the input. The 4% difference between the colloc3-phon- stress / colloc3-phon-stress-usc models to the colloc3-phon baseline is smaller than the 7% dif- ference between the colloc3-nophon and colloc3- nophon-stress models. This shows that there is a redundancy between phonotactic and stress cues in large amounts of data, as their joint contribution to the colloc3-nophon baseline is less than the sum of their individual contributions at 10,000 utterances, of 4% (for phonotactics) and 7% (for stress). Unlike for the colloc3-nophon models, we also see a clear impact of input size. In particular, at 100 utterances the addition of stress cues leads to an 8 – 10% improvement, depending on whether or not the USC is assumed, whereas for the colloc3- nophon model we only observed a 3 – 8% improve- ment. This is particularly striking when we con- sider that by themselves, the phonotactic cues only contribute a 1% improvement to the colloc3-nophon baseline when trained on the 100 utterance corpus, 100 indicating a synergistic interaction (rather than re- dundancy) between phonotactics and stress for small inputs. This effect disappears starting from around 1000 utterances; for inputs of size 1000 and larger, the net-gain of stress drops from roughly 10% to a 3–4% improvement. That is, while we did not notice any relationship between input size and impact of stress cues for the colloc3-nophon model, we do see such an interaction for the combination of phonotac- tics and stress cues which, taken together, lead to a larger relative gain in performance on smaller inputs than on large ones. 5.3 Acquisition of stress patterns In addition to acquiring a lexicon, the Bayesian learner acquires knowledge about the possible stress patterns of English words. The fact that this knowl- edge is explicitly represented through the PCFG rules and their probabilities that define the lexi- cal generator allows us to study the generalisations about stress the model actually acquires. While Doyle and Levy (2013) suggest carrying out such an analysis, they restrict themselves to estimating the fraction of stress patterns in the segmented out- put. As shown in Table 2, however, the type and token distributions of stress patterns can differ sub- stantially. We therefore investigate the stress pref- erences acquired by our learner by examining the probabilities assigned to the different expansions of rule (18), aggregating the probabilities of the indi- vidual rules into patterns. For example, the rules Word → SSyll ( USyll ){0,3} correspond to the pattern “Stress on the first syllable”, whereas the rules Word → USyll{1,4} correspond to the pat- tern “Unstressed word”. By computing the respec- tive probabilities, we get the overall probability as- signed by a learner to the pattern. Figure 5 provides this information for several dif- ferent rule patterns. Additionally, these plots in- clude the empirical type (red dotted) and token pro- portions (red double-dashed) for the input corpus. Note how for the two major patterns, all models suc- cessfully track the type, rather than the token fre- quency, correctly developing a preference for stress- initial over unstressed words, despite the compa- rable token frequency of these two patterns. This is compatible with a recent proposal by Thiessen and Saffran (2007), who argue that infants infer the stress pattern over their lexicon. For a Bayesian model such as ours or Goldwater et al. (2009)’s, there is no need to pre-specify that the distribution ought to be learned over types rather than tokens, as the models automatically interpolate between type and token statistics according to the properties of their input (Goldwater et al., 2006). In addition, a Bayesian framework provides a simple answer to the question of how a learner might identify the role of stress in its language without already having ac- quired at least some words. By combining differ- ent kinds of cues, e.g. distributional, phonotactic and prosodic, in a principled manner a Bayesian learner can jointly segment its input and learn the appropriate role of each cue, without having to pre- specify specific preferences that might differ across languages. The iambic rule pattern that puts stress on the sec- ond syllable is much more infrequent on a token level. All models track this low token frequency, underestimating the type frequency of this pattern by a fair amount. This suggests that learning this pattern correctly requires considerably more input than for the other patterns. Indeed, the iambic pat- tern is known to pose problems for infants when they start using stress as an effective cue. It is only from roughly 10 months of age that infants successfully segment iambic words (Jusczyk et al., 1999b). Not surprisingly, the USC doesn’t aid in learning about this pattern because it is completely silent on where stress might fall (and does not noticeably improve segmentation performance to begin with). Finally, we can also investigate whether the models that lack the USC nevertheless learn that words contain at most one lexically stressed syl- lable. The bottom-right graph in Figure 5 plots the probability assigned by the models to patterns that violate the USC. This includes, for example, the rules Word → SyllS SyllS and Word → SyllS SyllU SyllS . Note how the probabilities as- signed to these rules approaches zero, indicating that the learner becomes more certain that there are no words that contain more than one syllable with lex- ical stress. As we argued above, this suggests that a Bayesian learner can acquire the USC from a mod- est amount of data — it will properly infer that the unnatural patterns are simply not supported by the input. To summarize, by examining the internal 101 0.55 0.60 0.65 0.70 0.75 0.80 0.85 100 200 500 1000 2000 5000 10000 P (S tr e ss o n f ir st ) 0.02 0.03 0.04 0.05 0.06 0.07 100 200 500 1000 2000 5000 10000 number of input utterances P (S tr e ss o n s e co n d ) 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 100 200 500 1000 2000 5000 10000 P (U n st re ss e d w o rd ) 0.05 0.10 100 200 500 1000 2000 5000 10000 number of input utterances P (V io la te s U S C ) colloc3−nophon−stress colloc3−phon−stress colloc3−nophon−stress−usc colloc3−phon−stress−usc Figure 5: Evolution of the knowledge the learner acquires on the Alex corpus. The red dotted line indicates the empirical type distribution of a specific pattern, and the double-dashed line the empirical token distribution. Top-Left: Stress-initial pattern, Top-Right: Unstressed Words, Bottom-Left: Stress-second pattern, Bottom-Right: Patterns that violate the USC. state of the Bayesian learners we can characterise how their knowledge about the stress preferences of their languages develops, rather than merely measur- ing how well they perform word segmentation. We find that the iambic pattern that has been observed to pose problems for infant learners also is harder for the Bayesian learner to acquire, arguably due to its extremely low token-frequency. 6 Conclusion and Future Work We have presented Adaptor Grammar models of word segmentation that are able to take advantage of stress cues and are able to learn from phonemic input. We find that phonotactics and stress interact in interesting ways, and that stress cues makes a sta- ble contribution to existing word segmentation mod- els, improving their performance by 4-6% token f- score. We also find that the USC introduced by Yang (2004) need not be prebuilt into a model but can be acquired by a Bayesian learner from the data. Sim- ilarly, we directly investigate the stress preferences acquired by our models and find that for stress-initial and unstressed words, they track type rather than token frequencies. The rare stress-second pattern seems to require more input to be properly acquired, which is compatible with infant development data. An important goal for future research is to eval- uate segmentation models on typologically different languages and to study the relative usefulness of dif- ferent cues cross-lingually. For example, languages such as French lack lexical stress; it would be inter- esting to know whether in such a case, phonotactic (or other) cues are more important. Relatedly, recent work such as Börschinger et al. (2013) has found that artificially created data often masks the com- plexity exhibited by real speech. This suggests that future work should use data directly derived from the acoustic signal to account for contextual effects, rather than using dictionary look-up or other heuris- tics. In using the Alex corpus, for which good qual- ity audio is available, we have taken a first step in this direction. 102 Acknowledgements This research was supported by the Australian Research Council’s Discovery Projects funding scheme (project numbers DP110102506 and DP110102593). We’d like to thank Professor Dupoux and our other colleagues at the Laboratoire de Sciences Cognitives et Psycholinguistique in Paris for hosting us while this research was per- formed, as well as the Mairie de Paris, the fondation Pierre Gilles de Gennes, the Ecole des Hautes Etudes en Sciences Sociales, the Ecole Normale Supérieure, The Region Ile de France, the European Research Council (ERC-2011-AdG-295810 BOOT- PHON), the Agence Nationale pour la Recherche (ANR-2010-BLAN-1901-1 BOOTLANG, ANR- 10-IDEX-0001-02 and ANR-10-LABX-0087) and the Fondation de France. We’d also like to thank three anonymous reviewers for helpful comments and suggestions. References N. Bernstein-Ratner. 1987. The phonology of parent- child speech. In K. Nelson and A. van Kleeck, editors, Children’s Language, volume 6. Erlbaum, Hillsdale, NJ. Benjamin Börschinger, Katherine Demuth, and Mark Johnson. 2012. Studying the effect of input size for Bayesian word segmentation on the Providence cor- pus. In Proceedings of the 24th International Con- ference on Computational Linguistics (Coling 2012), pages 325–340. Coling 2012 Organizing Committee. Benjamin Börschinger, Mark Johnson, and Katherine De- muth. 2013. A joint model of word segmentation and phonological variation for English word-final /t/- deletion. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1508–1516. Association for Computational Linguistics. M. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Ma- chine Learning, 34:71–105. M. Christiansen and S. Curtin. 1999. The power of sta- tistical learning: No need for algebraic rules. In Pro- ceedings of the 21st Annual Conference of the Cogni- tive Science Society. Morten H Christiansen, Joseph Allen, and Mark S Sei- denberg. 1998. Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes, 13(2-3):221–268. Suzanne Curtin, Toben H Mintz, and Morten H Chris- tiansen. 2005. Stress changes the representational landscape: Evidence from word segmentation. Cog- nition, 96(3):233–262. Anne Cutler and David M Carter. 1987. The predomi- nance of strong initial syllables in the English vocabu- lary. Computer Speech and Language, 2(3):133–142. Anne Cutler, Jacques Mehler, Dennis Norris, and Juan Segui. 1986. The syllable’s differing role in the seg- mentation of French and English. Journal of Memory and Language, 25(4):385 – 400. Anne Cutler. 2005. Lexical stress. In David B. Pisoni and Robert E. Remez, editors, The Handbook of Speech Perception, pages 264–289. Blackwell Pub- lishing. K. Demuth, J. Culbertson, and J. Alter. 2006. Word- minimality, epenthesis, and coda licensing in the ac- quisition of English. Language and Speech, 49:137– 174. Gabriel Doyle and Roger Levy. 2013. Combining mul- tiple information types in Bayesian word segmenta- tion. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technolo- gies, pages 117–126. Association for Computational Linguistics. Victoria Fromkin, editor. 2001. Linguistics: An Intro- duction to Linguistic Theory. Blackwell, Oxford, UK. Sharon Goldwater, Tom Griffiths, and Mark John- son. 2006. Interpolating between types and tokens by estimating power-law generators. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 459–466. MIT Press. Sharon Goldwater, Thomas L. Griffiths, and Mark John- son. 2009. A Bayesian framework for word segmen- tation: Exploring the effects of context. Cognition, 112(1):21–54. Sharon Goldwater. 2007. Nonparametric Bayesian Mod- els of Lexical Acquisition. Ph.D. thesis, Brown Uni- versity. Mark Johnson and Katherine Demuth. 2010. Unsu- pervised phonemic Chinese word segmentation using Adaptor Grammars. In Proceedings of the 23rd In- ternational Conference on Computational Linguistics (Coling 2010), pages 528–536. Coling 2010 Organiz- ing Committee. Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor gram- mars. In Proceedings of Human Language Technolo- gies: The 2009 Annual Conference of the North Amer- ican Chapter of the Association for Computational 103 Linguistics, pages 317–325. Association for Compu- tational Linguistics. Mark Johnson, Thomas L. Griffiths, and Sharon Goldwa- ter. 2007. Adaptor Grammars: A framework for spec- ifying compositional nonparametric Bayesian models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Ad- vances in Neural Information Processing Systems 19, pages 641–648. MIT Press, Cambridge, MA. Mark Johnson. 2008a. Unsupervised word segmentation for Sesotho using Adaptor Grammars. In Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology, pages 20–27. Association for Computational Linguistics. Mark Johnson. 2008b. Using Adaptor Grammars to identify synergies in the unsupervised acquisition of linguistic structure. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguis- tics, pages 398–406. Association for Computational Linguistics. Peter W Jusczyk, Anne Cutler, and Nancy J Redanz. 1993. Infants’ preference for the predominant stress patterns of English words. Child Development, 64(3):675–687. Peter W. Jusczyk, E. A. Hohne, and A. Bauman. 1999a. Infants’ sensitivity to allophonic cues for word seg- mentation. Perception and Psychophysics, 61:1465– 1476. Peter W. Jusczyk, Derek M. Houston, and Mary New- some. 1999b. The beginnings of word segmentation in English-learning infants. Cognitive Psychology, 39(3- 4):159–207. Peter Jusczyk. 1997. The discovery of spoken language. MIT Press, Cambridge, MA. Myron Korman. 1984. Adaptive aspects of maternal vo- calizations in differing contexts at ten weeks. First Language, 5:44–45. Constantine Lignos and Charles Yang. 2010. Reces- sion segmentation: simpler online word segmentation using limited resources. In Proceedings of the Four- teenth Conference on Computational Natural Lan- guage Learning, pages 88–97. Association for Com- putational Linguistics. Constantine Lignos. 2011. Modeling infant word seg- mentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 29–38. Association for Computational Linguistics. Constantine Lignos. 2012. Infant word segmentation: An incremental, integrated model. In Proceedings of the West Coast Conference on Formal Linguistics 30. Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk: Volume I: Transcription format and programs, volume II: The database. Computational Linguistics, 26(4):657–657. Sven L Mattys and Peter W Jusczyk. 2000. Phonotac- tic cues for segmentation of fluent speech by infants. Cognition, 78(2):91–121. Sven L Mattys. 2000. The perception of primary and secondary stress in English. Perception and Psy- chophysics, 62(2):253–265. Lisa Pearl, Sharon Goldwater, and Mark Steyvers. 2011. Online learning mechanisms for Bayesian models of word segmentation. Research on Language and Com- putation, 8(2):107–132. Elisabeth O. Selkirk. 1984. Phonology and Syntax: The Relation Between Sound and Structure. MIT Press. Erik D Thiessen and Jenny R Saffran. 2003. When cues collide: use of stress and statistical cues to word boundaries by 7-to 9-month-old infants. Developmen- tal Psychology, 39(4):706. Erik D Thiessen and Jenny R Saffran. 2007. Learning to learn: Infants acquisition of stress-based strategies for word segmentation. Language Learning and Develop- ment, 3(1):73–100. Carnegie Mellon University. 2008. The CMU pronounc- ing dictionary, v.0.7a. Charles Yang. 2004. Universal grammar, statistics or both? Trends in Cognitive Sciences, 8(10):451–456. 104