Transactions of the Association for Computational Linguistics, vol. 7, pp. 357–373, 2019. Action Editor: Trevor Cohn. Submission batch: 10/2018; Revision batch: 1/2019; Published 7/2019. c© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. A Generative Model for Punctuation in Dependency Trees Xiang Lisa Li⇤ and Dingquan Wang⇤ and Jason Eisner Department of Computer Science, Johns Hopkins University xli150@jhu.edu, {wdd,jason}@cs.jhu.edu Abstract Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree’s “true” punctuation marks are not observed (Nunberg, 1990). These latent “underlying” marks serve to delimit or separate constituents in the syn- tax tree. When the tree’s yield is rendered as a written sentence, a string rewriting mech- anism transduces the underlying marks into “surface” marks, which are part of the ob- served (surface) string but should not be re- garded as part of the tree. We formalize this idea in a generative model of punc- tuation that admits efficient dynamic pro- gramming. We train it without observing the underlying marks, by locally maximiz- ing the incomplete data likelihood (simi- larly to the EM algorithm). When we use the trained model to reconstruct the tree’s underlying punctuation, the results appear plausible across 5 languages, and in par- ticular are consistent with Nunberg’s anal- ysis of English. We show that our gener- ative model can be used to beat baselines on punctuation restoration. Also, our recon- struction of a sentence’s underlying punctu- ation lets us appropriately render the surface punctuation (via our trained underlying-to- surface mechanism) when we syntactically transform the sentence. 1 Introduction Punctuation enriches the expressiveness of writ- ten language. When converting from spoken to written language, punctuation indicates pauses or pitches; expresses propositional attitude; and is conventionally associated with certain syntactic constructions such as apposition, parenthesis, quo- tation, and conjunction. In this paper, we present a latent-variable model of punctuation usage, inspired by the rule- based approach to English punctuation of Nun- berg (1990). Training our model on English data ⇤Equal contribution. learns rules that are consistent with Nunberg’s hand-crafted rules. Our system is automatic, so we use it to obtain rules for Arabic, Chinese, Spanish, and Hindi as well. Moreover, our rules are stochastic, which al- lows us to reason probabilistically about ambigu- ous or missing punctuation. Across the 5 lan- guages, our model predicts surface punctuation better than baselines, as measured both by per- plexity (§4) and by accuracy on a punctuation restoration task (§ 6.1). We also use our model to correct the punctuation of non-native writers of English (§ 6.2), and to maintain natural punc- tuation style when syntactically transforming En- glish sentences (§ 6.3). In principle, our model could also be used within a generative parser, al- lowing the parser to evaluate whether a candidate tree truly explains the punctuation observed in the input sentence (§8). Punctuation is interesting In The Linguistics of Punctuation, Nunberg (1990) argues that punctu- ation (in English) is more than a visual counter- part of spoken-language prosody, but forms a lin- guistic system that involves “interactions of point indicators (i.e. 
commas, semicolons, colons, pe- riods and dashes).” He proposes that much as in phonology (Chomsky and Halle, 1968), a gram- mar generates underlying punctuation which then transforms into the observed surface punctuation. Consider generating a sentence from a syntactic grammar as follows: Hail the king [, Arthur Pendragon ,] [, who wields [ “ Excalibur ” ] ,] . Although the full tree is not depicted here, some of the constituents are indicated with brackets. In this underlying generated tree, each appositive NP is surrounded by commas. On the surface, however, the two adjacent commas after Pendragon will now be collapsed into one, and the final comma will be absorbed into the adjacent period. Fur- thermore, in American English, the typographic 357 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 convention is to move the final punctuation inside the quotation marks. Thus a reader sees only this modified surface form of the sentence: Hail the king, Arthur Pendragon, who wields “Excalibur.” Note that these modifications are string transfor- mations that do not see or change the tree. The resulting surface punctuation marks may be clues to the parse tree, but (contrary to NLP convention) they should not be included as nodes in the parse tree. Only the underlying marks play that role. Punctuation is meaningful Pang et al. (2002) use question and exclamation marks as clues to sentiment. Similarly, quotation marks may be used to mark titles, quotations, reported speech, or dubious terminology (University of Chicago, 2010). Because of examples like this, methods for determining the similarity or meaning of syntax trees, such as a tree kernel (Agarwal et al., 2011) or a recursive neural network (Tai et al., 2015), should ideally be able to consider where the un- derlying punctuation marks attach. Punctuation is helpful Surface punctuation re- mains correlated with syntactic phrase structure. NLP systems for generating or editing text must be able to deploy surface punctuation as human writ- ers do. Parsers and grammar induction systems benefit from the presence of surface punctuation marks (Jones, 1994; Spitkovsky et al., 2011). It is plausible that they could do better with a linguisti- cally informed model that explains exactly why the surface punctuation appears where it does. Pat- terns of punctuation usage can also help identify the writer’s native language (Markov et al., 2018). Punctuation is neglected Work on syntax and parsing tends to treat punctuation as an af- terthought rather than a phenomenon governed by its own linguistic principles. Treebank annota- tion guidelines for punctuation tend to adopt sim- ple heuristics like “attach to the highest possi- ble node that preserves projectivity” (Bies et al., 1995; Nivre et al., 2018).1 Many dependency parsing works exclude punctuation from evalua- tion (Nivre et al., 2007b; Koo and Collins, 2010; Chen and Manning, 2014; Lei et al., 2014; Kiper- wasser and Goldberg, 2016), although some others retain punctuation (Nivre et al., 2007a; Goldberg and Elhadad, 2010; Dozat and Manning, 2017). 1http://universaldependencies.org/u/ dep/punct.html Unpunctuated Tree: T Dale means river valley rootnsubj dobj ATTACH tree: T 0 Underlying sequence: u sentence: ū Surface sentence: x̄sequence: x NOISYCHANNEL u0 u1 u2 u3 u4 x0 x1 x2 x3 x4 “ Dale ” means “ river valley ” . “ Dale ” means “ river valley . ” root. nsubj“ ” dobj“ ” Figure 1: The generative story of a sentence. 
Given an unpunctuated tree T at top, at each node w ∈ T, the ATTACH process stochastically attaches a left puncteme l and a right puncteme r, which may be empty. The resulting tree T′ has underlying punctuation u. Each slot's punctuation uᵢ ∈ u is rewritten to xᵢ ∈ x by NOISYCHANNEL.

In tasks such as word embedding induction (Mikolov et al., 2013; Pennington et al., 2014) and machine translation (Zens et al., 2002), punctuation marks are usually either removed or treated as ordinary words (Řehůřek and Sojka, 2010). Yet to us, building a parse tree on a surface sentence seems as inappropriate as morphologically segmenting a surface word. In both cases, one should instead analyze the latent underlying form, jointly with recovering that form. For example, the proper segmentation of English hoping is not hop-ing but hope-ing (with underlying e), and the proper segmentation of stopping is neither stopp-ing nor stop-ping but stop-ing (with only one underlying p). Cotterell et al. (2015, 2016) get this right for morphology. We attempt to do the same for punctuation.

2 Formal Model

We propose a probabilistic generative model of sentences (Figure 1):

    p(\bar{x}) = \sum_{T, T'} p_{\mathrm{syn}}(T) \cdot p_\theta(T' \mid T) \cdot p_\phi(\bar{x} \mid \bar{u}(T'))    (1)

First, an unpunctuated dependency tree T is stochastically generated by some recursive process p_syn (e.g., Eisner, 1996, Model C).² Second, each constituent (i.e., dependency subtree) sprouts optional underlying punctuation at its left and right edges, according to a probability distribution p_θ that depends on the constituent's syntactic role (e.g., dobj for "direct object"). This punctuated tree T′ yields the underlying string ū = ū(T′), which is edited by a finite-state noisy channel p_φ to arrive at the surface sentence x̄.

Footnote 2: Our model could easily be adapted to work on constituency trees instead.

This third step may alter the sequence of punctuation tokens at each slot between words—for example, in §1, collapsing the double comma , , between Pendragon and who. u and x denote just the punctuation at the slots of ū and x̄ respectively, with uᵢ and xᵢ denoting the punctuation token sequences at the ith slot. Thus, the transformation at the ith slot is uᵢ ↦ xᵢ.

Since this model is generative, we could train it without any supervision to explain the observed surface string x̄: maximize the likelihood p(x̄) in (1), marginalizing out the possible T, T′ values. In the present paper, however, we exploit known T values (as observed in the "depunctuated" version of a treebank). Because T is observed, we can jointly train θ, φ to maximize just

    p(x \mid T) = \sum_{T'} p_\theta(T' \mid T) \cdot p_\phi(x \mid u(T'))    (2)

That is, the p_syn model that generated T becomes irrelevant, but we still try to predict what surface punctuation will be added to T. We still marginalize over the underlying punctuation marks u. These are never observed, but they must explain the surface punctuation marks x (§2.2), and they must be explained in turn by the syntax tree T (§2.1). The trained generative model then lets us restore or correct punctuation in new trees T (§6).
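As a concrete (and drastically simplified) illustration of this pipeline, the Python sketch below mimics the two stochastic steps for a couple of constituents. The dependency relations, allowed puncteme pairs, and feature weights are invented for illustration, and the channel is a deterministic stand-in rather than the trained probabilistic FST of §2.2.

    import math, random

    # Toy ATTACH step: p_theta(l, r | w) is a log-linear distribution over the
    # puncteme pairs allowed for the node's dependency relation.  The relations,
    # punctemes, and weights below are hypothetical, not the trained values.
    ALLOWED = {"appos": [("", ""), (",", ","), ("(", ")")],
               "advcl": [("", ""), (",", ",")],
               "root":  [("", ".")]}
    WEIGHT  = {("appos", ",", ","): 1.5, ("appos", "(", ")"): 0.2,
               ("advcl", ",", ","): 1.0, ("root", "", "."): 3.0}

    def sample_pair(dep):
        pairs  = ALLOWED[dep]
        scores = [math.exp(WEIGHT.get((dep, l, r), 0.0)) for (l, r) in pairs]
        return random.choices(pairs, weights=scores)[0]

    # Toy NOISYCHANNEL step at one slot: a deterministic stand-in that collapses
    # a doubled comma and lets a period absorb an adjacent comma.
    def channel(u):
        out = []
        for tok in u:
            if out and (out[-1], tok) in {(",", ","), (",", ".")}:
                out[-1] = tok if tok == "." else out[-1]   # keep the "stronger" token
            else:
                out.append(tok)
        return out

    # Two nested constituents whose right punctemes meet at the same slot,
    # as with the appositive and the relative clause in the §1 example.
    l1, r1 = sample_pair("appos")
    l2, r2 = sample_pair("advcl")
    u_slot = [t for t in (r1, r2) if t]      # underlying tokens at the shared slot
    print("underlying:", u_slot, "-> surface:", channel(u_slot))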
2.1 Generating Underlying Punctuation

The ATTACH model characterizes the probability of an underlying punctuated tree T′ given its corresponding unpunctuated tree T:

    p_\theta(T' \mid T) = \prod_{w \in T} p_\theta(l_w, r_w \mid w)    (3)

where l_w, r_w ∈ V are the left and right punctemes that T′ attaches to the tree node w. Each puncteme (Krahn, 2014) in the finite set V is a string of 0 or more underlying punctuation tokens.³ The probability p_θ(l, r | w) is given by a log-linear model

    p_\theta(l, r \mid w) \propto \begin{cases} \exp(\theta^\top f(l, r, w)) & \text{if } (l, r) \in W_{d(w)} \\ 0 & \text{otherwise} \end{cases}    (4)

where V is the finite set of possible punctemes and W_d ⊆ V² gives the possible puncteme pairs for a node w that has dependency relation d = d(w) to its parent. V and W_d are estimated heuristically from the tokenized surface data (§4). f(l, r, w) is a sparse binary feature vector, and θ is the corresponding parameter vector of feature weights. The feature templates in Appendix A⁴ consider the symmetry between l and r, and their compatibility with (a) the POS tag of w's head word, (b) the dependency paths connecting w to its children and the root of T, (c) the POS tags of the words flanking the slots containing l and r, and (d) surface punctuation already added to w's subconstituents.

Footnote 3: Multi-token punctemes are occasionally useful. For example, the puncteme ... might consist of either 1 or 3 tokens, depending on how the tokenizer works; similarly, the puncteme ?! might consist of 1 or 2 tokens. Also, if a single constituent of T gets surrounded by both parentheses and quotation marks, this gives rise to punctemes (“ and ”). (A better treatment would add the parentheses as a separate puncteme pair at a unary node above the quotation marks, but that would have required T′ to introduce this extra node.)

Table 1: Some of Nunberg's punctuation interaction rules in English, in priority order.
  1. Point Absorption:      ,, ↦ ,    ,. ↦ .    -, ↦ -    -; ↦ ;    ;. ↦ .
  2. Quote Transposition:   ”, ↦ ,”   ”. ↦ .”
  3. Period Absorption:     .? ↦ ?    .! ↦ !    abbv. ↦ abbv
  4. Bracket Absorptions:   ,) ↦ )    -) ↦ )    (, ↦ (    ,” ↦ ”    “, ↦ “
The absorption rules ensure that when there are two adjacent tokens, the "weaker" one is deleted (where the strength ordering is {?, !, (, ), “, ”} > . > {;, :} > - > ,), except that bracketing tokens such as () and “” do not absorb tokens outside the material they bracket.

2.2 From Underlying to Surface

From the tree T′, we can read off the sequence of underlying punctuation tokens uᵢ at each slot i between words. Namely, uᵢ concatenates the right punctemes of all constituents ending at i with the left punctemes of all constituents starting at i (as illustrated by the examples in §1 and Figure 1). The NOISYCHANNEL model then transduces uᵢ to a surface token sequence xᵢ, for each i = 0, . . . , n independently (where n is the sentence length).

Nunberg's formalism. Much like Chomsky and Halle's (1968) phonological grammar of English, Nunberg's (1990) descriptive English punctuation grammar (Table 1) can be viewed computationally as a priority string rewriting system, or Markov algorithm (Markov, 1960; Caracciolo di Forino, 1968). The system begins with a token string u. At each step it selects the highest-priority local rewrite rule that can apply, and applies it as far left as possible. When no more rules can apply, the final state of the string is returned as x.

Footnote 4: The appendices (supplementary material) are available at https://arxiv.org/abs/1906.11298.

[Figure 2 contents—each row shows the partially rewritten string and, at right, the chosen edit:
  abcde    ab ↦ ab
  abcde    bc ↦ b
  a bde    bd ↦ db
  a dbe    be ↦ e
  a d e ]
Figure 2: Editing abcde ↦ ade with a sliding window. (When an absorption rule maps 2 tokens to 1, our diagram leaves blank space that is not part of the output string.) At each step, the left-to-right process has already committed to the green tokens as output; has not yet looked at the blue input tokens; and is currently considering how to (further) rewrite the black tokens. The right column shows the chosen edit.
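To make the control regime concrete, here is a minimal interpreter for such a priority rewriting system. The rule list is a tiny illustrative subset in the spirit of Table 1, not Nunberg's full grammar; higher-priority rules come first.

    # Toy Markov-algorithm interpreter: repeatedly fire the highest-priority rule
    # that matches, at its leftmost match, until no rule applies.
    RULES = [([",", ","], [","]),        # point absorption:    ,,  ->  ,
             ([",", "."], ["."]),        # point absorption:    ,.  ->  .
             (["”", "."], [".", "”"])]   # quote transposition: ”.  ->  .”

    def rewrite(tokens):
        tokens = list(tokens)
        while True:
            for lhs, rhs in RULES:                       # highest priority first
                hits = [i for i in range(len(tokens) - len(lhs) + 1)
                        if tokens[i:i + len(lhs)] == lhs]
                if hits:                                 # apply as far left as possible
                    i = hits[0]
                    tokens[i:i + len(lhs)] = rhs
                    break
            else:
                return tokens                            # no rule applies: done

    print(rewrite([",", ",", "”", "."]))   # -> ['.', '”']

Under these three rules the doubled comma collapses first, the period then transposes inside the closing quote, and finally the period absorbs the remaining comma.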
Simplifying the formalism. Markov algorithms are Turing complete. Fortunately, Johnson (1972) noted that in practice, phonological u ↦ x maps described in this formalism can usually be implemented with finite-state transducers (FSTs). For computational simplicity, we will formulate our punctuation model as a probabilistic FST (PFST)—a locally normalized left-to-right rewrite model (Cotterell et al., 2014). The probabilities for each language must be learned, using gradient descent. Normally we expect most probabilities to be near 0 or 1, making the PFST nearly deterministic (i.e., close to a subsequential FST). However, permitting low-probability choices remains useful to account for typographical errors, dialectal differences, and free variation in the training corpus. Our PFST generates a surface string, but the invertibility of FSTs will allow us to work backwards when analyzing a surface string (§3).

A sliding-window model. Instead of having rule priorities, we apply Nunberg-style rules within a 2-token window that slides over u in a single left-to-right pass (Figure 2). Conditioned on the current window contents ab, a single edit is selected stochastically: either ab ↦ ab (no change), ab ↦ b (left absorption), ab ↦ a (right absorption), or ab ↦ ba (transposition). Then the window slides rightward to cover the next input token, together with the token that is (now) to its left. a and b are always real tokens, never boundary symbols. φ specifies the conditional edit probabilities.⁵

Footnote 5: Rather than learn a separate edit probability distribution for each bigram ab, one could share parameters across bigrams. For example, Table 1's caption says that "stronger" tokens tend to absorb "weaker" ones. A model that incorporated this insight would not have to learn O(|Σ|²) separate absorption probabilities (two per bigram ab), but only O(|Σ|) strengths (one per unigram a, which may be regarded as a 1-dimensional embedding of the punctuation token a). We figured that the punctuation vocabulary Σ was small enough (Table 2) that we could manage without the additional complexity of embeddings or other featurization, although this does presumably hurt our generalization to rare bigrams.

These specific edit rules (like Nunberg's) cannot insert new symbols, nor can they delete all of the underlying symbols. Thus, surface xᵢ is a good clue to uᵢ: all of its tokens must appear underlyingly, and if xᵢ = ε (the empty string) then uᵢ = ε. The model can be directly implemented as a PFST (Appendix D) using Cotterell et al.'s (2014) more general PFST construction.

Our single-pass formalism is less expressive than Nunberg's. It greedily makes decisions based on at most one token of right context ("label bias"). It cannot rewrite ’”. ↦ .’” or ”,. ↦ .” because the . is encountered too late to percolate leftward; luckily, though, we can handle such English examples by sliding the window right-to-left instead of left-to-right. We treat the sliding direction as a language-specific parameter.⁶

Footnote 6: We could have handled all languages uniformly by making ≥ 2 passes of the sliding window (via a composition of ≥ 2 PFSTs), with at least one pass in each direction.
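A rough procedural sketch of this channel follows (sampling only, left-to-right, with invented edit probabilities standing in for the learned parameters φ; the actual model is implemented as a PFST so that it can also be inverted and summed over, as §3 requires).

    import random

    # Sketch of the 2-token sliding-window channel.  For each window ab we sample
    # one of four edits; the probabilities below are invented for illustration.
    def edit_probs(a, b):
        # keep ab, left-absorb (ab -> b), right-absorb (ab -> a), transpose (ab -> ba)
        if (a, b) == (",", ","):
            return {"keep": 0.02, "left": 0.96, "right": 0.01, "swap": 0.01}
        if (a, b) == ("”", "."):
            return {"keep": 0.45, "left": 0.02, "right": 0.03, "swap": 0.50}
        return {"keep": 0.94, "left": 0.02, "right": 0.02, "swap": 0.02}

    def channel_sample(u):
        out = []                          # tokens already committed to the output
        for tok in u:
            if not out:
                out.append(tok)
                continue
            a, b = out[-1], tok           # window: rightmost output token + new token
            probs = edit_probs(a, b)
            edit = random.choices(list(probs), weights=list(probs.values()))[0]
            if edit == "keep":
                out.append(b)
            elif edit == "left":          # ab -> b : a is absorbed
                out[-1] = b
            elif edit == "swap":          # ab -> ba
                out[-1] = b
                out.append(a)
            # "right" (ab -> a): b is absorbed, nothing to append
        return out

    print(channel_sample([",", "”", "."]))   # e.g. [',', '.', '”'] or [',', '”', '.']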
2.3 Training Objective

Building on equation (2), we train θ, φ to locally maximize the regularized conditional log-likelihood

    \Big( \sum_{x,T} \log p(x \mid T) \;-\; \xi \cdot \mathbb{E}_{T'}[c(T')]^2 \Big) \;-\; \zeta \cdot \lVert\theta\rVert^2    (5)

where the sum is over a training treebank.⁷ The expectation E[···] is taken over T′ ∼ p(· | T, x).

Footnote 7: In retrospect, there was no good reason to square the E_{T′}[c(T′)] term. However, when we started redoing the experiments, we found the results essentially unchanged.

This generalized expectation term provides posterior regularization (Mann and McCallum, 2010; Ganchev et al., 2010), by encouraging parameters that reconstruct trees T′ that use symmetric punctuation marks in a "typical" way. The function c(T′) counts the nodes in T′ whose punctemes contain "unmatched" symmetric punctuation tokens: for example, ) is "matched" only when it appears in a right puncteme with ( at the comparable position in the same constituent's left puncteme. The precise definition is given in Appendix B.

In our development experiments on English, the posterior regularization term was necessary to discover an aesthetically appealing theory of underlying punctuation. When we dropped this term (ξ = 0) and simply maximized the ordinary regularized likelihood, we found that the optimization problem was underconstrained: different training runs would arrive at different, rather arbitrary underlying punctemes. For example, one training run learned an ATTACH model that used underlying “. to terminate sentences, along with a NOISYCHANNEL model that absorbed the left quotation mark into the period. By encouraging the underlying punctuation to be symmetric, we broke the ties. We also tried making this a hard constraint (ξ = ∞), but then the model was unable to explain some of the training sentences at all, giving them probability 0. For example, I went to the “ special place ” cannot be explained, because special place is not a constituent.⁸

3 Inference

In principle, working with the model (1) is straightforward, thanks to the closure properties of formal languages. Provided that p_syn can be encoded as a weighted CFG, it can be composed with the weighted tree transducer p_θ and the weighted FST p_φ to yield a new weighted CFG (similarly to Bar-Hillel et al., 1961; Nederhof and Satta, 2003). Under this new grammar, one can recover the optimal T, T′ for x̄ by dynamic programming, or sum over T, T′ by the inside algorithm to get the likelihood p(x̄). A similar approach was used by Levy (2008) with a different FST noisy channel.

In this paper we assume that T is observed, allowing us to work with equation (2). This cuts the computation time from O(n³) to O(n).⁹ Whereas the inside algorithm for (1) must consider O(n²) possible constituents of x̄ and O(n) ways of building each, our algorithm for (2) only needs to iterate over the O(n) true constituents of T and the 1 true way of building each.
However, it must still consider the |W_d| puncteme pairs for each constituent.

Footnote 8: Recall that the NOISYCHANNEL model family (§2.2) requires the surface “ before special to appear underlyingly, and also requires the surface ε after special to be empty underlyingly. These hard constraints clash with the ξ = ∞ hard constraint that the punctuation around special must be balanced. The surface ” after place causes a similar problem: no edge can generate the matching underlying “.

Footnote 9: We do O(n) multiplications of N × N matrices, where N = O(# of punc types · max # of punc tokens per slot).

Algorithm 1: The algorithm for scoring a given (T, x) pair. The steps that update E (shown in blue in the original figure) are used during training to get the posterior regularization term in (5).
Input: T, x    ▷ training pair (omits T′, u)
Output: p(x | T), E[c(T′)]
 1: procedure TOTALSCORE(T, x)
 2:   for i = 1 to n do
 3:     compute WFSA (Mᵢ, λᵢ, ρᵢ)
 4:   E ← 0    ▷ expected count of unmatched punctemes
 5:   procedure IN(w)    ▷ w ∈ T
 6:     i, k ← slots at left, right of w's constituent
 7:     j ← slot at right of w's headword
 8:     M_left ← (∏_{w′ ∈ leftkids(w)} IN(w′)) ρ_{j−1}
 9:     M_right ← λ_j⊤ (∏_{w′ ∈ rightkids(w)} IN(w′))
10:     M₀ ← M_left · 1 · M_right    ▷ ℝ^{N_j×1}, ℝ^{1×N_j}
11:     M ← 0    ▷ ℝ^{N_i×N_k}
12:     for (l, r) ∈ W_{d(w)} do
13:       p ← p_θ(l, r | w)
14:       M ← M + p · Mᵢ(l) M₀ M_k(r)
15:       E ← E + p · ⟦l, r have unmatched punc⟧
16:     return M    ▷ ℝ^{N_i×N_k}
17:   M_root ← IN(root(T))
18:   return λ₀⊤ M_root ρ_n, E    ▷ ℝ, ℝ

3.1 Algorithms

Given an input sentence x̄ of length n, our job is to sum over possible trees T′ that are consistent with T and x̄, or to find the best such T′. This is roughly a lattice parsing problem—made easier by knowing T. However, the possible ū values are characterized not by a lattice but by a cyclic WFSA (as |uᵢ| is unbounded whenever |xᵢ| > 0).

For each slot 0 ≤ i ≤ n, transduce the surface punctuation string xᵢ by the inverted PFST for p_φ to obtain a weighted finite-state automaton (WFSA) that describes all possible underlying strings uᵢ.¹⁰ This WFSA accepts each possible uᵢ with weight p_φ(xᵢ | uᵢ). If it has Nᵢ states, we can represent it (Berstel and Reutenauer, 1988) with a family of sparse weight matrices Mᵢ(σ) ∈ ℝ^{Nᵢ×Nᵢ}, whose element at row s and column t is the weight of the s → t arc labeled with σ, or 0 if there is no such arc. Additional vectors λᵢ, ρᵢ ∈ ℝ^{Nᵢ} specify the initial and final weights. (λᵢ is one-hot if the PFST has a single initial state, of weight 1.)

Footnote 10: Constructively, compose the u-to-x PFST (from the end of §2.2) with a straight-line FSA accepting only xᵢ, and project the resulting WFST to its input tape (Pereira and Riley, 1996), as explained at the end of Appendix D.
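For intuition about this encoding, the sketch below builds a toy WFSA directly as such matrices and scores candidate underlying strings by chains of matrix products. All states, symbols, and weights are invented; in the model, the automaton instead comes from inverting the trained PFST against the observed slot xᵢ.

    import numpy as np

    # Toy 3-state WFSA over punctuation tokens, in the matrix encoding of
    # Berstel & Reutenauer: one N x N matrix per symbol, plus initial and final
    # weight vectors.  All numbers here are invented for illustration.
    N = 3
    M = {",": np.zeros((N, N)), "”": np.zeros((N, N))}
    M[","][0, 1] = 0.9            # arc 0 -> 1 reading ","
    M[","][1, 1] = 0.2            # arc 1 -> 1 reading "," (candidate u_i may repeat it)
    M["”"][1, 2] = 0.7            # arc 1 -> 2 reading "”"
    lam = np.array([1.0, 0.0, 0.0])    # initial weights (one-hot: single start state)
    rho = np.array([0.0, 0.1, 1.0])    # final weights

    def string_weight(tokens):
        """Total weight of all accepting paths labeled with `tokens`."""
        v = lam
        for t in tokens:
            v = v @ M[t]
        return float(v @ rho)

    print(string_weight([",", "”"]))        # one candidate underlying string u_i
    print(string_weight([",", ",", "”"]))   # a longer candidate for the same slot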
For any puncteme l (or r) in V, we define Mᵢ(l) = Mᵢ(l₁) Mᵢ(l₂) ··· Mᵢ(l_{|l|}), a product over the 0 or more tokens in l. This gives the total weight of all s →* t WFSA paths labeled with l. The subprocedure IN in Algorithm 1 essentially extends this to obtain a new matrix IN(w) ∈ ℝ^{N_i×N_k}, where the subtree rooted at w stretches from slot i to slot k. Its element IN(w)_{st} gives the total weight of all extended paths in the ū WFSA from state s at slot i to state t at slot k. An extended path is defined by a choice of underlying punctemes at w and all its descendants. These punctemes determine an s-to-final path at i, then initial-to-final paths at i+1 through k−1, then an initial-to-t path at k. The weight of the extended path is the product of all the WFSA weights on these paths (which correspond to transition probabilities in the p_φ PFST) times the probability of the choice of punctemes (from p_θ).

This inside algorithm computes quantities needed for training (§2.3). Useful variants arise via well-known methods for weighted derivation forests (Berstel and Reutenauer, 1988; Goodman, 1999; Li and Eisner, 2009; Eisner, 2016). Specifically, to modify Algorithm 1 to maximize over T′ values (§§6.2–6.3) instead of summing over them, we switch to the derivation semiring (Goodman, 1999), as follows. Whereas IN(w)_{st} used to store the total weight of all extended paths from state s at slot i to state t at slot k, now it will store the weight of the best such extended path. It will also store that extended path's choice of underlying punctemes, in the form of a puncteme-annotated version of the subtree of T that is rooted at w. This is a potential subtree of T′. Thus, each element of IN(w) has the form (r, D) where r ∈ ℝ and D is a tree. We define addition and multiplication over such pairs:

    (r, D) + (r', D') = \begin{cases} (r, D) & \text{if } r > r' \\ (r', D') & \text{otherwise} \end{cases}    (6)

    (r, D) \cdot (r', D') = (r r', D D')    (7)

where DD′ denotes an ordered combination of two trees. Matrix products UV and scalar-matrix products p · V are defined in terms of element addition and multiplication as usual:

    (UV)_{st} = \sum_{r} U_{sr} \cdot V_{rt}    (8)

    (p \cdot V)_{st} = p \cdot V_{st}    (9)

What is DD′? For presentational purposes, it is convenient to represent a punctuated dependency tree as a bracketed string. For example, the underlying tree T′ in Figure 1 would be [ [“ Dale ”] means [“ [ river ] valley ”] ] where the words correspond to nodes of T. In this case, we can represent every D as a partial bracketed string and define DD′ by string concatenation. This presentation ensures that multiplication (7) is a complete and associative (though not commutative) operation, as in any semiring. As base cases, each real-valued element of Mᵢ(l) or M_k(r) is now paired with the string [l or r] respectively,¹¹ and the real number 1 at line 10 is paired with the string w. The real-valued elements of the λᵢ and ρᵢ vectors and the 0 matrix at line 11 are paired with the empty string ε, as is the real number p at line 13.

Footnote 11: We still construct the real matrix Mᵢ(l) by ordinary matrix multiplication before pairing its elements with strings. This involves summation of real numbers: each element of the resulting real matrix is a marginal probability, which sums over possible PFST paths (edit sequences) that could map the underlying puncteme l to a certain substring of the surface slot xᵢ. Similarly for M_k(r).

In practice, the D strings that appear within the matrix M of Algorithm 1 will always represent complete punctuated trees. Thus, they can actually be represented in memory as such, and different trees may share subtrees for efficiency (using pointers). The product in line 10 constructs a matrix of trees with root w and differing sequences of left/right children, while the product in line 14 annotates those trees with punctemes l, r.

To sample a possible T′ from the derivation forest in proportion to its probability (§6.1), we use the same algorithm but replace equation (6) with

    (r, D) + (r', D') = \begin{cases} (r + r', D) & \text{if } u < \frac{r}{r + r'} \\ (r + r', D') & \text{otherwise} \end{cases}

with u ∼ Uniform(0, 1) being a random number.

3.2 Optimization

Having computed the objective (5), we find the gradient via automatic differentiation, and optimize θ, φ via Adam (Kingma and Ba, 2014)—a variant of stochastic gradient descent—with learning rate 0.07, batch size 5, 400 sentences per epoch, and L2 regularization. (These hyperparameters, along with the regularization coefficients ζ and ξ from equation (5), were tuned on dev data (§4) for each language respectively.) We train
We train 11We still construct the real matrix Mi(l) by ordinary ma- trix multiplication before pairing its elements with strings. This involves summation of real numbers: each element of the resulting real matrix is a marginal probability, which sums over possible PFST paths (edit sequences) that could map the underlying puncteme l to a certain substring of the surface slot xi. Similarly for Mk(r). 362 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 the punctuation model for 30 epochs. The initial NOISYCHANNEL parameters (�) are drawn from N(0, 1), and the initial ATTACH parameters (✓) are drawn from N(0, 1) (with one minor excep- tion described in Appendix A). 4 Intrinsic Evaluation of the Model Data. Throughout §§4–6, we will examine the punctuation model on a subset of the Univer- sal Dependencies (UD) version 1.4 (Nivre et al., 2016)—a collection of dependency treebanks across 47 languages with unified POS-tag and de- pendency label sets. Each treebank has designated training, development, and test portions. We ex- periment on Arabic, English, Chinese, Hindi, and Spanish (Table 2)—languages with diverse punc- tuation vocabularies and punctuation interaction rules, not to mention script directionality. For each treebank, we use the tokenization provided by UD, and take the punctuation tokens (which may be multi-character, such as ...) to be the tokens with the PUNCT tag. We replace each straight dou- ble quotation mark " with either “ or ” as appro- priate, and similarly for single quotation marks.12 We split each non-punctuation token that ends in . (such as etc.) into a shorter non-punctuation token (etc) followed by a special punctuation to- ken called the “abbreviation dot” (which is distinct from a period). We prepend a special punctuation mark ˆ to every sentence x̄, which can serve to absorb an initial comma, for example.13 We then replace each token with the special symbol UNK if its type appeared fewer than 5 times in the training portion. This gives the surface sentences. To estimate the vocabulary V of underlying punctemes, we simply collect all surface token se- quences xi that appear at any slot in the training portion of the processed treebank. This is a gener- ous estimate. Similarly, we estimate Wd (§ 2.1) as all pairs (l, r) 2 V2 that flank any d constituent. Recall that our model generates surface punctu- ation given an unpunctuated dependency tree. We train it on each of the 5 languages independently. We evaluate on conditional perplexity, which will be low if the trained model successfully assigns a high probability to the actual surface punctuation in a held-out corpus of the same language. 12For en and en_esl, “ and ” are distinguished by language-specific part-of-speech tags. For the other 4 lan- guages, we identify two " dependents of the same head word, Language Treebank #Token %Punct #Omit #Type Arabic ar 282K 7.9 255 18 Chinese zh 123K 13.8 3 23 English en 255K 11.7 40 35 en_esl 97.7K 9.8 2 16 Hindi hi 352K 6.7 21 15 Spanish es_ancora 560K 11.7 25 16 Table 2: Statistics of our datasets. “Treebank” is the UD treebank identifier, “#Token” is the number of to- kens, “%Punct” is the percentage of punctuation to- kens, “#Omit” is the small number of sentences con- taining non-leaf punctuation tokens (see footnote 19), and “#Type” is the number of punctuation types after preprocessing. 
(Recall from §4 that preprocessing dis- tinguishes between left and right quotation mark types, and between abbreviation dot and period dot types.) Baselines. We compare our model against three baselines to show that its complexity is necessary. Our first baseline is an ablation study that does not use latent underlying punctuation, but generates the surface punctuation directly from the tree. (To implement this, we fix the parameters of the noisy channel so that the surface punctuation equals the underlying with probability 1.) If our full model performs significantly better, it will demonstrate the importance of a distinct underlying layer. Our other two baselines ignore the tree struc- ture, so if our full model performs significantly better, it will demonstrate that conditioning on ex- plicit syntactic structure is useful. These baselines are based on previously published approaches that reduce the problem to tagging: Xu et al. (2016) use a BiLSTM-CRF tagger with bigram topology; Tilk and Alumäe (2016) use a BiGRU tagger with attention. In both approaches, the model is trained to tag each slot i with the correct string xi 2 V⇤ (possibly ✏ or ˆ). These are discriminative proba- bilistic models (in contrast to our generative one). Each gives a probability distribution over the tag- gings (conditioned on the unpunctuated sentence), so we can evaluate their perplexity.14 Results. As shown in Table 3, our full model beats the baselines in perplexity in all 5 languages. Also, in 4 of 5 languages, allowing a trained NOISYCHANNEL (rather than the identity map) replacing the left one with “ and the right one with ”. 13For symmetry, we should also have added a final mark. 14These methods learn word embeddings that optimize conditional log-likelihood on the punctuation restoration training data. They might do better if these embeddings were shared with other tasks, as multi-task learning might lead them to discover syntactic categories of words. 363 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 https://arxiv.org/abs/1906.11298 Attn. CRF ATTACH +NC DIR Arabic 1.4676 1.3016 1.2230 1.1526 L Chinese 1.6850 1.4436 1.1921 1.1464 L English 1.5737 1.5247 1.5636 1.4276 R Hindi 1.1201 1.1032 1.0630 1.0598 L Spanish 1.4397 1.3198 1.2364 1.2103 R Table 3: Results of the conditional perplexity experi- ment (§4), reported as perplexity per punctuation slot, where an unpunctuated sentence of n words has n + 1 slots. Column “Attn.” is the BiGRU tagger with atten- tion, and “CRF” stands for the BiLSTM-CRF tagger. “ATTACH” is the ablated version of our model where surface punctuation is directly attached to the nodes. Our full model “+NC” adds NOISYCHANNEL to trans- duce the attached punctuation into surface punctuation. DIR is the learned direction (§ 2.2) of our full model’s noisy channel PFST: Left-to-right or Right-to-left. Our models are given oracle parse trees T . The best per- plexity is boldfaced, along with all results that are not significantly worse (paired permutation test, p < 0.05). significantly improves the perplexity. 5 Analysis of the Learned Grammar 5.1 Rules Learned from the Noisy Channel We study our learned probability distribution over noisy channel rules (ab 7! b, ab 7! a, ab 7! ab, ab 7!ba) for English. The probability distributions corresponding to six of Nunberg’s English rules are shown in Figure 3. 
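(The rule-by-rule comparison reported below boils down to the following computation, shown here with invented probabilities in place of the trained channel's distributions.)

    # Toy illustration of the rank / mean-reciprocal-rank comparison against
    # Nunberg's rules.  The "learned" distributions below are invented stand-ins
    # for the trained channel's p(output | underlying bigram).
    learned = {(",", ","): {",,": 0.03, ",": 0.95, "": 0.02},
               ("”", "."): {"”.": 0.40, ".”": 0.50, ".": 0.10}}
    nunberg = {(",", ","): ",",          # point absorption
               ("”", "."): ".”"}         # quote transposition

    def rank_of(bigram, output):
        ordered = sorted(learned[bigram].items(), key=lambda kv: -kv[1])
        return 1 + [o for o, _ in ordered].index(output)

    ranks = [rank_of(bg, out) for bg, out in nunberg.items()]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    print(ranks, mrr)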
By comparing the orange and blue bars, observe that the model trained on the en_cesl treebank learned different quotation rules from the one trained on the en treebank. This is because en_cesl follows British style, whereas en has American-style quote transposition.15 We now focus on the model learned from the en treebank. Nunberg’s rules are deterministic, and our noisy channel indeed learned low-entropy rules, in the sense that for an input ab with un- derlying count � 25,16 at least one of the possi- ble outputs (a, b, ab or ba) always has probability > 0.75. The one exception is ”. 7!.” for which the argmax output has probability ⇡ 0.5, because writers do not apply this quote transposition rule consistently. As shown by the blue bars in Fig- ure 3, the high-probability transduction rules are 15American style places commas and periods inside the quotation marks, even if they are not logically in the quote. British style (more sensibly) places unquoted periods and commas in their logical place, sometimes outside the quo- tation marks if they are not part of the quote. 16For rarer underlying pairs ab, the estimated distributions sometimes have higher entropy due to undertraining. Figure 3: Rewrite probabilities learned for English, averaged over the last 4 epochs on en treebank (blue bars) or en_esl treebank (orange bars). The header above each figure is the underlying punctuation string (input to NOISYCHANNEL). The two counts in the fig- ure headers are the number of occurrences of the under- lying punctuation strings in the 1-best reconstruction of underlying punctuation sequences (by Algorithm 1) re- spectively in the en and en_esl treebank. Each bar represents one surface punctuation string (output of NOISYCHANNEL), its height giving the probability. consistent with Nunberg’s hand-crafted determin- istic grammar in Table 1. Our system has high precision when we look at the confident rules. Of the 24 learned edits with conditional probability > 0.75, Nunberg lists 20. Our system also has good recall. Nunberg’s hand-crafted schemata consider 16 punctuation types and generate a total of 192 edit rules, in- cluding the specimens in Table 1. That is, of the 162 = 256 possible underlying punctuation bi- grams ab, 34 are supposed to undergo absorption or transposition. Our method achieves fairly high recall, in the sense that when Nunberg proposes ab 7!�, our learned p(� | ab) usually ranks highly among all probabilities of the form p(�0 | ab). 75 of Nunberg’s rules got rank 1, 48 got rank 2, and the remaining 69 got rank > 2. The mean recipro- cal rank was 0.621. Recall is quite high when we restrict to those Nunberg rules ab 7! � for which our model is confident how to rewrite ab, in the sense that some p(�0 | ab) > 0.5. (This tends to eliminate rare ab: see footnote 5.) Of these 55 Nunberg rules, 38 rules got rank 1, 15 got rank 2, and only 2 got rank worse than 2. The mean recip- rocal rank was 0.836. ¿What about Spanish? Spanish uses inverted question marks ¿ and exclamation marks ¡, which form symmetric pairs with the regular question marks and exclamation marks. If we try to ex- trapolate to Spanish from Nunberg’s English for- 364 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 malization, the English mark most analogous to ¿ is (. Our learned noisy channel for Spanish (not graphed here) includes the high-probability rules ,¿ 7!,¿ and :¿ 7!:¿ and ¿, 7!¿ which match Nunberg’s treatment of ( in English. 
5.2 Attachment Model What does our model learn about how dependency relations are marked by underlying punctuation? ˆ,Earlier, Kerry said ,“...,in fact, answer the question”. ˆEarlier, Kerry said ,“...,in fact, answer the question.” root.,advmod, ,“ccomp” ,nmod, The above example17 illustrates the use of specific puncteme pairs to set off the advmod, ccomp, and nmod relations. Notice that said takes a complement (ccomp) that is symmetrically quoted but also left delimited by a comma, which is indeed how direct speech is punctuated in English. This example also illustrates quotation transposition. The top five relations that are most likely to generate symmetric punctemes and their top (l, r) pairs are shown in Table 4. Section 1 ,2 , ,...7, and 8... Section 1 ,2 ,...7, and 8... ,conj, ,conj, conj cc The above example18 shows how our model han- dles commas in conjunctions of 2 or more phrases. UD format dictates that each conjunct after the first is attached by the conj relation. As shown above, each such conjunct is surrounded by under- lying commas (via the N.,.,.conj feature from Appendix A), except for the one that bears the conjunction and (via an even stronger weight on the C.✏.✏.���!conj.cc feature). Our learned feature weights indeed yield p(` = ✏, r = ✏) > 0.5 for the final conjunct in this example. Some writers omit the “Oxford comma” before the conjunction: this style can be achieved simply by changing “sur- rounded” to “preceded” (that is, changing the N feature to N.,.✏.conj). 6 Performance on Extrinsic Tasks We evaluate the trained punctuation model by us- ing it in the following three tasks. 17[en] Earlier, Kerry said, “Just because you get an honorable discharge does not, in fact, answer that question.” 18[en] Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License. parataxis appos list advcl ccomp 2.38 2.29 1.33 0.77 0.53 , , 26.8 , , 18.8 ✏ ✏ 60.0 ✏ ✏ 73.8 ✏ ✏ 90.8 ✏ ✏ 20.1 : ✏ 18.1 , , 22.3 , , 21.2 “ ” 2.4 ( ) 13.0 - ✏ 15.9 , ✏ 5.3 ✏ , 3.1 , , 2.4 - ✏ 9.7 ✏ ✏ 14.4 < > 3.0 ( ) 0.74 :“ ” 0.9 : ✏ 8.1 ( ) 13.1 ( ) 3.0 ✏ - 0.21 “ ,” 0.8 Table 4: The top 5 relations that are most likely to generate symmetric punctemes, the entropy of their puncteme pair (row 2), and their top 5 puncteme pairs (rows 3–7) with their probabilities shown as percent- ages. The symmetric punctemes are in boldface. 6.1 Punctuation Restoration In this task, we are given a depunctuated sentence d̄19 and must restore its (surface) punctuation. Our model supposes that the observed punctuated sen- tence x̄ would have arisen via the generative pro- cess (1). Thus, we try to find T , T 0, and x̄ that are consistent with d̄ (a partial observation of x̄). The first step is to reconstruct T from d̄. This initial parsing step is intended to choose the T that maximizes psyn(T | d̄).20 This step depends only on psyn and not on our punctuation model (p✓, p�). In practice, we choose T via a dependency parser that has been trained on an unpunctuated treebank with examples of the form (d̄, T).21 Equation (2) now defines a distribution over (T 0, x) given this T . To obtain a single prediction for x, we adopt the minimum Bayes risk (MBR) approach of choosing surface punctuation x̂ that minimizes the expected loss with respect to the unknown truth x⇤. Our loss function is the total edit distance over all slots (where edits operate on punctuation tokens). 
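Concretely, this per-sentence loss (and the average edit distance, AED, reported below) can be computed as in the sketch that follows—an ordinary token-level Levenshtein distance applied slot by slot. This is a paraphrase of the metric, not the actual evaluation script.

    def edit_distance(a, b):
        """Levenshtein distance between two token sequences."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def loss(x, x_star):
        """Total edit distance over all n+1 slots of a sentence."""
        return sum(edit_distance(xi, xi_star) for xi, xi_star in zip(x, x_star))

    # e.g. predicted vs. true punctuation for a 3-word sentence (4 slots):
    print(loss([[], [","], [], ["."]], [[], [], [], [".", "”"]]))   # -> 2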
Finding x̂ exactly would be intractable, so we use a sampling-based approximation and draw m = 1000 samples from the posterior distribution over (T′, x). We then define

    \hat{x} = \operatorname{argmin}_{x \in S(T)} \sum_{x^* \in S(T)} \hat{p}(x^* \mid T) \cdot \mathrm{loss}(x, x^*)    (10)

where S(T) is the set of unique x values in the sample and p̂ is the empirical distribution given by the sample. This can be evaluated in O(m²) time.

Footnote 19: To depunctuate a treebank sentence, we remove all tokens with POS-tag PUNCT or dependency relation punct. These are almost always leaves; else we omit the sentence.

Footnote 20: Ideally, rather than maximize, one would integrate over possible trees T, in practice by sampling many values T_k from p_syn(· | ū) and replacing S(T) in (10) with ∪_k S(T_k).

Footnote 21: Specifically, the Yara parser (Rasooli and Tetreault, 2015), a fast non-probabilistic transition-based parser that uses rich non-local features (Zhang and Nivre, 2011).

We evaluate on Arabic, English, Chinese, Hindi, and Spanish. For each language, we train both the parser and the punctuation model on the training split of that UD treebank (§4), and evaluate on held-out data. We compare to the BiLSTM-CRF baseline in §4 (Xu et al., 2016).²² We also compare to a "trivial" deterministic baseline, which merely places a period at the end of the sentence (or a "|" in the case of Hindi) and adds no other punctuation. Because most slots do not in fact have punctuation, the trivial baseline already does very well; to improve on it, we must fix its errors without introducing new ones.

Our final comparison on test data is shown in the table in Figure 4. On all 5 languages, our method beats (usually significantly) its 3 competitors: the trivial deterministic baseline, the BiLSTM-CRF, and the ablated version of our model (ATTACH) that omits the noisy channel.

Of course, the success of our method depends on the quality of the parse trees T (which is particularly low for Chinese and Arabic). The graph in Figure 4 explores this relationship, by evaluating (on dev data) with noisier trees obtained from parsers that were variously trained on only the first 10%, 20%, . . . of the training data. On all 5 languages, provided that the trees are at least 75% correct, our punctuation model beats both the trivial baseline and the BiLSTM-CRF (which do not use trees). It also beats the ATTACH ablation baseline at all levels of tree accuracy (these curves are omitted from the graph to avoid clutter). In all languages, better parses give better performance, and gold trees yield the best results.

6.2 Punctuation Correction

Our next goal is to correct punctuation errors in a learner corpus. Each sentence is drawn from the Cambridge Learner Corpus treebanks, which provide original (en_esl) and corrected (en_cesl) sentences. All kinds of errors are corrected, such

Footnote 22: We copied their architecture exactly but re-tuned the hyperparameters on our data. We also tried tripling the amount of training data by adding unannotated sentences (provided along with the original annotated sentences by Ginter et al. (2017)), taking advantage of the fact that the BiLSTM-CRF does not require its training sentences to be annotated with trees. However, this actually hurt performance slightly, perhaps because the additional sentences were out-of-domain.
We also tried the BiGRU-with-attention architecture of Tilk and Alumäe (2016), but it was also weaker than the BiLSTM- CRF (just as in Table 3). We omit all these results from Fig- ure 4 to reduce clutter. p 8 ATTACH a-- --a Arabic 0.064 0.064 0.063 0.059 0.053 Chinese 0.110 0.109 0.104 0.102 0.048 English 0.100 0.108 0.092 0.090 0.079 Hindi 0.025 0.023 0.019 0.018 0.013 Spanish 0.093 0.092 0.085 0.078 0.068 Figure 4: Edit distance per slot (which we call average edit distance, or AED) for each of the 5 corpora. Lower is better. The table gives the final AED on the test data. Its first 3 columns show the baseline methods just as in Table 3: the trivial deterministic method, the BiLSTM- CRF, and the ATTACH ablation baseline that attaches the surface punctuation directly to the tree. Column 4 is our method that incorporates a noisy channel, and column 5 (in gray) is our method using oracle (gold) trees. We boldface the best non-oracle result as well as all that are not significantly worse (paired permutation test, p < 0.05). The curves show how our method’s AED (on dev data) varies with the labeled attachment score (LAS) of the trees, where --a at x = 100 uses the oracle (gold) trees, a-- at x < 100 uses trees from our parser trained on 100% of the training data, and the #-- points at x ⌧ 100 use increasingly worse parsers. The p and 8 at the right of the graph show the AED of the trivial deterministic baseline and the BiLSTM-CRF baseline, which do not use trees. as syntax errors, but we use only the 30% of sen- tences whose depunctuated trees T are isomorphic between en_esl and en_cesl. These en_cesl trees may correct word and/or punctuation errors in en_esl, as we wish to do automatically. We assume that an English learner can make mistakes in both the attachment and the noisy channel steps. A common attachment mistake is the failure to surround a non-restrictive relative clause with commas. In the noisy channel step, mistakes in quote transposition are common. Correction model. Based on the assumption about the two error sources, we develop a dis- criminative model for this task. Let x̄e de- note the full input sentence, and let xe and xc denote the input (possibly errorful) and output (corrected) punctuation sequences. We model p(xc | x̄e) = P T P T 0c psyn(T | x̄e) · p✓(T 0c | T, xe) · p�(xc | T 0c). Here T is the depunctu- ated parse tree, T 0c is the corrected underlying tree, T 0 e is the error underlying tree, and we assume p✓(T 0 c | T, xe) = P T 0e p(T 0e | T, xe) · p✓(T 0c | T 0e). 366 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 In practice we use a 1-best pipeline rather than summing. Our first step is to reconstruct T from the error sentence x̄e. We choose T that max- imizes psyn(T | x̄e) from a dependency parser trained on en_esl treebank examples (x̄e, T ). The second step is to reconstruct T 0e based on our punc- tuation model trained on en_esl. We choose T 0e that maximizes p(T 0e | T, xe). We then reconstruct T 0 c by p(T 0c | T 0 e) = Q we2T 0e p(l, r | we) (11) where we is the node in T 0e, and p(l, r | we) is a similar log-linear model to equation (4) with addi- tional features (Appendix C4) which look at we. Finally, we reconstruct xc based on the noisy channel p�(xc | T 0c) in § 2.2. During training, � is regularized to be close to the noisy channel param- eters in the punctuation model trained on en_cesl. We use the same MBR decoder as in § 6.1 to choose the best action. 
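The MBR selection of equation (10), reused here to choose the corrected punctuation, can be sketched as follows. The sample list and the per-slot loss are placeholders; any slot-wise loss, such as the edit distance sketched earlier, can be plugged in.

    from collections import Counter

    def mbr_choice(samples, loss):
        """Pick the sampled punctuation with minimum expected loss under the
        empirical distribution of the samples (equation (10)); O(m^2) pairs."""
        counts = Counter(map(tuple, samples))          # S(T) with empirical probabilities
        total = sum(counts.values())
        def risk(x):
            return sum((c / total) * loss(list(x), list(x_star))
                       for x_star, c in counts.items())
        return list(min(counts, key=risk))

    # Toy usage with a simple per-slot mismatch loss; real use plugs in the
    # slot-wise edit-distance loss.
    slot_loss = lambda x, y: sum(a != b for a, b in zip(x, y))
    samples = [(",", "."), (",", "."), ("", ".")]      # m = 3 sampled punctuations
    print(mbr_choice(samples, slot_loss))              # -> [',', '.']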
We evaluate using AED as in § 6.1. As a second metric, we use the script from the CoNLL 2014 Shared Task on Grammati- cal Error Correction (Ng et al., 2014): it computes the F0.5-measure of the set of edits found by the system, relative to the true set of edits. As shown in Table 5, our method achieves bet- ter performance than the punctuation restoration baselines (which ignore input punctuation). On the other hand, it is soundly beaten by a new BiLSTM-CRF that we trained specifically for the task of punctuation correction. This is the same as the BiLSTM-CRF in the previous section, ex- cept that the BiLSTM now reads a punctuated input sentence (with possibly erroneous punctua- tion). To be precise, at step 0  i  n, the BiL- STM reads a concatenation of the embedding of word i (or BOS if i = 0) with an embedding of the punctuation token sequence xi. The BiLSTM- CRF wins because it is a discriminative model tai- lored for this task: the BiLSTM can extract arbi- trary contextual features of slot i that are corre- lated with whether xi is correct in context. 6.3 Sentential Rephrasing We suspect that syntactic transformations on a sentence should often preserve the underlying punctuation attached to its tree. The surface punc- tuation can then be regenerated from the trans- formed tree. Such transformations include ed- its that are suggested by a writing assistance tool (Heidorn, 2000), or subtree deletions in compres- sive summarization (Knight and Marcu, 2002). p 8 a-- parsed gold 8-corr AED 0.052 0.051 0.047 0.034 0.033 0.005 F0.5 0.779 0.787 0.827 0.876 0.881 0.984 Table 5: AED and F0.5 results on the test split of English-ESL data. Lower AED is better; higher F0.5 is better. The first three columns (markers corre- spond to Figure 4) are the punctuation restoration base- lines, which ignore the input punctuation. The fourth and fifth columns are our correction models, which use parsed and gold trees. The final column is the BiLSTM-CRF model tailored for the punctuation cor- rection task. For our experiment, we evaluate an interesting case of syntactic transformation. Wang and Eis- ner (2016) consider a systematic rephrasing pro- cedure by rearranging the order of dependent sub- trees within a UD treebank, in order to synthesize new languages with different word order that can then be used to help train multi-lingual systems (i.e., data augmentation with synthetic data). As Wang and Eisner acknowledge (2016, foot- note 9), their permutations treat surface punctua- tion tokens like ordinary words, which can result in synthetic sentences whose punctuation is quite unlike that of real languages. In our experiment, we use Wang and Eisner’s (2016) “self-permutation” setting, where the de- pendents of each noun and verb are stochastically reordered, but according to a dependent ordering model that has been trained on the same language. For example, rephrasing a English sentence SCONJ ADJ PUNCT DET NOUN VERB PUNCT If true , the caper failed . mark det punct advcl nsubj punct root under an English ordering model may yield DET NOUN VERB PUNCT SCONJ ADJ PUNCT the caper failed . If true , markdet root nsubj punct advcl punct which is still grammatical except that , and . are wrongly swapped (after all, they have the same POS tag and relation type). Worse, permutation may yield bizarre punctuation such as , , at the start of a sentence. 
Our punctuation model gives a straightforward remedy—instead of permuting the tree directly, we first discover its most likely underlying tree ˆ,If true, the caper failed. det nsubj root. mark ,advcl, by the maximizing variant of Algorithm 1 (§ 3.1). Then, we permute the underlying tree and sample the surface punctuation from the distribution modeled by the trained PFST, yielding 367 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 https://arxiv.org/abs/1906.11298 Punctuation All Base Half Full Base Half Full Arabic 156.0 231.3 186.1 540.8 590.3 553.4 Chinese 165.2 110.0 61.4 205.0 174.4 78.7 English 98.4 74.5 51.0 140.9 131.4 75.4 Hindi 10.8 11.0 9.7 118.4 118.8 91.8 Spanish 266.2 259.2 194.5 346.3 343.4 239.3 Table 6: Perplexity (evaluated on the train split to avoid evaluating generalization) of a trigram language model trained (with add-0.001 smoothing) on differ- ent versions of rephrased training sentences. “Punc- tuation” only evaluates perplexity on the trigrams that have punctuation. “All” evaluates on all the tri- grams. “Base” permutes all surface dependents includ- ing punctuation (Wang and Eisner, 2016). “Full” is our full approach: recover underlying punctuation, per- mute remaining dependents, regenerate surface punc- tuation. “Half” is like “Full” but it permutes the non- punctuation tokens identically to “Base.” The permu- tation model is trained on surface trees or recovered underlying trees T 0, respectively. In each 3-way com- parison, we boldface the best result (always significant under a paired permutation test over per-sentence log- probabilities, p < 0.05). ˆthe caper failed ,If true,. ˆthe caper failed ,If true . det nsubj root. mark ,advcl, We leave the handling of capitalization to future work. We test the naturalness of the permuted sen- tences by asking how well a word trigram lan- guage model trained on them could predict the original sentences.23 As shown in Table 6, our per- mutation approach reduces the perplexity over the baseline on 4 of the 5 languages, often dramati- cally. 7 Related Work Punctuation can aid syntactic analysis, since it signals phrase boundaries and sentence structure. Briscoe (1994) and White and Rajkumar (2008) parse punctuated sentences using hand-crafted constraint-based grammars that implement Nun- berg’s approach in a declarative way. These gram- mars treat surface punctuation symbols as ordi- nary words, but annotate the nonterminal cate- gories so as to effectively keep track of the under- lying punctuation. This is tantamount to crafting a grammar for underlyingly punctuated sentences and composing it with a finite-state noisy channel. 23So the two approaches to permutation yield different training data, but are compared fairly on the same test data. The parser of Ma et al. (2014) takes a differ- ent approach and treats punctuation marks as fea- tures of their neighboring words. Zhang et al. (2013) use a generative model for punctuated sen- tences, leting them restore punctuation marks dur- ing transition-based parsing of unpunctuated sen- tences. Li et al. (2005) use punctuation marks to segment a sentence: this "divide and rule" strat- egy reduces ambiguity in parsing of long Chinese sentences. Punctuation can similarly be used to constrain syntactic structure during grammar in- duction (Spitkovsky et al., 2011). Punctuation restoration (§ 6.1) is useful for tran- scribing text from unpunctuated speech. 
The task is usually treated by tagging each slot with zero or more punctuation tokens, using a traditional sequence labeling method: conditional random fields (Lui and Wang, 2013; Lu and Ng, 2010), re- current neural networks (Tilk and Alumäe, 2016), or transition-based systems (Ballesteros and Wan- ner, 2016). 8 Conclusion and Future Work We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent under- lying left and right punctemes. Surface punctu- ation marks are not directly attached to the syn- tax tree, but are generated from sequences of adja- cent punctemes by a (stochastic) finite-state string rewriting process . Our model is inspired by Nun- berg’s (1990) formal grammar for English punctu- ation, but is probabilistic and trainable. We give exact algorithms for training and inference. We trained Nunberg-like models for 5 lan- guages and L2 English. We compared the English model to Nunberg’s, and showed how the trained models can be used across languages for punctua- tion restoration, correction, and adjustment. In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Cu- lotta and Sorensen, 2004), and coreference reso- lution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underly- ing punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the 368 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 scheme in equation (1). For example, the psyn factor in equation (1) might be a standard re- current neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of T is completed by the REDUCE operation of psyn, the punctuation- augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with p✓ and transduce the subtree-internal slots with p�. In the future, we are also interested in enriching the T 0 representation and making it more differ- ent from T , to underlyingly account for other phe- nomena in T such as capitalization, spacing, mor- phology, and non-projectivity (via reordering). Acknowledgments This material is based upon work supported by the National Science Foundation under Grant Nos. 1423276 and 1718846, including a REU supple- ment to the first author. We are grateful to the state of Maryland for the Maryland Advanced Research Computing Center, a crucial resource. We thank Xiaochen Li for early discussion, Argo lab mem- bers for further discussion, and the three reviewers for quality comments. References Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sen- timent analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38. Miguel Ballesteros and Leo Wanner. 2016. A neu- ral network architecture for multilingual punc- tuation generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1048–1053. Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunika- tionsforschung, 14:143–172. Reprinted in Y. 
8 Conclusion and Future Work

We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent underlying left and right punctemes. Surface punctuation marks are not directly attached to the syntax tree, but are generated from sequences of adjacent punctemes by a (stochastic) finite-state string rewriting process. Our model is inspired by Nunberg's (1990) formal grammar for English punctuation, but is probabilistic and trainable. We give exact algorithms for training and inference.

We trained Nunberg-like models for 5 languages and L2 English. We compared the English model to Nunberg's, and showed how the trained models can be used across languages for punctuation restoration, correction, and adjustment.

In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Culotta and Sorensen, 2004), and coreference resolution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underlying punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the scheme in equation (1). For example, the psyn factor in equation (1) might be a standard recurrent neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of T is completed by the REDUCE operation of psyn, the punctuation-augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with pθ and transduce the subtree-internal slots with pφ. In the future, we are also interested in enriching the T′ representation and making it more different from T, to underlyingly account for other phenomena in T such as capitalization, spacing, morphology, and non-projectivity (via reordering).

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. 1423276 and 1718846, including a REU supplement to the first author. We are grateful to the state of Maryland for the Maryland Advanced Research Computing Center, a crucial resource. We thank Xiaochen Li for early discussion, Argo lab members for further discussion, and the three reviewers for quality comments.

References

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38.
Miguel Ballesteros and Leo Wanner. 2016. A neural network architecture for multilingual punctuation generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1048–1053.
Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172. Reprinted in Y. Bar-Hillel (1964), Language and Information: Selected Essays on their Theory and Application, Addison-Wesley 1964, pages 116–150.
Jean Berstel and Christophe Reutenauer. 1988. Rational Series and their Languages. Springer-Verlag.
Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schasberger. 1995. Bracketing guidelines for Treebank II style: Penn Treebank project. Technical Report MS-CIS-95-06, University of Pennsylvania.
Ted Briscoe. 1994. Parsing (with) punctuation, etc. Technical report, Xerox European Research Laboratory.
Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.
Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper and Row, New York.
Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 625–630.
Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics (TACL), 3:433–447.
Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 664–669.
Brooke Cowan, Ivona Kučerová, and Michael Collins. 2006. A discriminative model for tree-to-tree translation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 232–241.
Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL).
Timothy Dozat and Christopher Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations (ICLR).
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 199–209.
Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), pages 340–345.
Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop. In Proceedings of the EMNLP Workshop on Structured Prediction for NLP.
A. Caracciolo di Forino. 1968. String processing languages and generalized Markov algorithms. In D. G. Bobrow, editor, Symbol Manipulation Languages and Techniques, pages 191–206. North-Holland Publishing Company, Amsterdam.
Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049.
Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 742–750.
Joshua Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.
George Heidorn. 2000. Intelligent writing assistance. In Robert Dale, Herman Moisl, and Harold Somers, editors, Handbook of Natural Language Processing, pages 181–207. Marcel Dekker, New York.
C. Douglas Johnson. 1972. Formal Aspects of Phonological Description. Mouton.
Bernard E. M. Jones. 1994. Exploring the role of punctuation in parsing natural text. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics (TACL), 4:313–327.
Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.
Fang Kong, Guodong Zhou, Longhua Qian, and Qiaoming Zhu. 2010. Dependency-driven anaphoricity determination for coreference resolution. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 599–607.
Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–11.
Albert E. Krahn. 2014. A New Paradigm for Punctuation. Ph.D. thesis, The University of Wisconsin-Milwaukee.
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1381–1391.
Roger Levy. 2008. A noisy-channel model of human sentence comprehension under uncertain input. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 234–243.
Xing Li, Chengqing Zong, and Rile Hu. 2005. A hierarchical parsing approach with punctuation processing for long Chinese sentences. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP).
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 40–51.
Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 177–186.
Marco Lui and Li Wang. 2013. Recovering casing and punctuation using conditional random fields. In Proceedings of the Australasian Language Technology Association Workshop (ALTA), pages 137–141.
Ji Ma, Yue Zhang, and Jingbo Zhu. 2014. Punctuation processing for projective dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 791–796.
Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11:955–984.
Andrey Andreevich Markov. 1960. The theory of algorithms. American Mathematical Society Translations, series 2(15):1–14.
Ilia Markov, Vivi Nastase, and Carlo Strapparava. 2018. Punctuation as native language interference. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3456–3466.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Computing Research Repository (CoRR), arXiv:1301.3781.
Mark-Jan Nederhof and Giorgio Satta. 2003. Probabilistic parsing as intersection. In 8th International Workshop on Parsing Technologies (IWPT), pages 137–148.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.
Joakim Nivre, Željko Agić, Lars Ahrenberg, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Yevgeni Berzak, Riyaz Ahmad Bhat, Eckhard Bick, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Gülşen Cebiroğlu Eryiğit, Giuseppe G. A. Celano, Fabricio Chalub, Çağrı Çöltekin, Miriam Connor, Elizabeth Davidson, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Claudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta Gonzáles Saavedra, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Jan Hajič, Linh Hà Mỹ, Dag Haug, Barbora Hladká, Radu Ion, Elena Irimia, Anders Johannsen, Fredrik Jørgensen, Hüner Kaşıkara, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Jessica Kenney, Natalia Kotsyba, Simon Krek, Veronika Laippala, Lucia Lam, Phuong Lê Hồng, Alessandro Lenci, Nikola Ljubešić, Olga Lyashevskaya, Teresa Lynn, Aibek Makazhanov, Christopher Manning, Cătălina Mărănduc, David Mareček, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Keiko Sophie Mori, Shunsuke Mori, Bohdan Moskalevskyi, Kadri Muischnek, Nina Mustafina, Kaili Müürisep, Luong Nguyễn Thị, Huyền Nguyễn Thị Minh, Vitaly Nikolaev, Hanna Nurmi, Petya Osenova, Robert Östling, Lilja Øvrelid, Valeria Paiva, Elena Pascual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Lauma Pretkalniņa, Prokopis Prokopidis, Tiina Puolakainen, Sampo Pyysalo, Alexandre Rademaker, Loganathan Ramasamy, Livy Real, Laura Rituma, Rudolf Rosa, Shadi Saleh, Baiba Saulīte, Sebastian Schuster, Wolfgang Seeker, Mojgan Seraji, Lena Shakurova, Mo Shen, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Carolyn Spadine, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Larraitz Uria, Gertjan van Noord, Viktor Varga, Veronika Vincze, Lars Wallin, Jing Xian Wang, Jonathan North Washington, Mats Wirén, Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman, and Hanzhi Zhu. 2016. Universal Dependencies 1.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. Data available at http://universaldependencies.org.
Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007a. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932.
Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryigit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007b. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.
Joakim Nivre et al. 2018. Universal Dependencies annotation guidelines. Available at universaldependencies.org.
Geoffrey Nunberg. 1990. The Linguistics of Punctuation. Number 18 in CSLI Lecture Notes. Center for the Study of Language and Information.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Fernando C. N. Pereira and Michael D. Riley. 1996. Speech recognition by composition of weighted finite automata. Computing Research Repository (CoRR), arXiv:cmp-lg/9603001.
Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. Computing Research Repository (CoRR), arXiv:1503.06733 (version 2).
Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50.
Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2011. Punctuation: Making a point in unsupervised dependency parsing. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL '11, pages 19–28.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 1556–1566.
Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In Interspeech, pages 3047–3051.
Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden Markov models. In Proceedings of the Workshop on Structured Prediction for NLP, pages 63–71.
University of Chicago. 2010. The Chicago Manual of Style. University of Chicago Press.
Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics (TACL), 4:491–505.
Michael White and Rajakrishnan Rajkumar. 2008. A more precise analysis of punctuation for broad-coverage surface realization with CCG. In Proceedings of the COLING 2008 Workshop on Grammar Engineering Across Frameworks, pages 17–24.
K. Xu, L. Xie, and K. Yao. 2016. Investigating LSTM for punctuation prediction. In 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5.
Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Annual Conference on Artificial Intelligence, pages 18–32.
Dongdong Zhang, Shuangzhi Wu, Nan Yang, and Mu Li. 2013. Punctuation prediction with transition-based parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 752–760.
Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), pages 188–193.