Transactions of the Association for Computational Linguistics, vol. 7, pp. 357–373, 2019. Action Editor: Trevor Cohn. Submission batch: 10/2018; Revision batch: 1/2019; Published 7/2019. c© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. A Generative Model for Punctuation in Dependency Trees Xiang Lisa Li⇤ and Dingquan Wang⇤ and Jason Eisner Department of Computer Science, Johns Hopkins University xli150@jhu.edu, {wdd,jason}@cs.jhu.edu Abstract Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree’s “true” punctuation marks are not observed (Nunberg, 1990). These latent “underlying” marks serve to delimit or separate constituents in the syn- tax tree. When the tree’s yield is rendered as a written sentence, a string rewriting mech- anism transduces the underlying marks into “surface” marks, which are part of the ob- served (surface) string but should not be re- garded as part of the tree. We formalize this idea in a generative model of punc- tuation that admits efficient dynamic pro- gramming. We train it without observing the underlying marks, by locally maximiz- ing the incomplete data likelihood (simi- larly to the EM algorithm). When we use the trained model to reconstruct the tree’s underlying punctuation, the results appear plausible across 5 languages, and in par- ticular are consistent with Nunberg’s anal- ysis of English. We show that our gener- ative model can be used to beat baselines on punctuation restoration. Also, our recon- struction of a sentence’s underlying punctu- ation lets us appropriately render the surface punctuation (via our trained underlying-to- surface mechanism) when we syntactically transform the sentence. 1 Introduction Punctuation enriches the expressiveness of writ- ten language. When converting from spoken to written language, punctuation indicates pauses or pitches; expresses propositional attitude; and is conventionally associated with certain syntactic constructions such as apposition, parenthesis, quo- tation, and conjunction. In this paper, we present a latent-variable model of punctuation usage, inspired by the rule- based approach to English punctuation of Nun- berg (1990). Training our model on English data ⇤Equal contribution. learns rules that are consistent with Nunberg’s hand-crafted rules. Our system is automatic, so we use it to obtain rules for Arabic, Chinese, Spanish, and Hindi as well. Moreover, our rules are stochastic, which al- lows us to reason probabilistically about ambigu- ous or missing punctuation. Across the 5 lan- guages, our model predicts surface punctuation better than baselines, as measured both by per- plexity (§4) and by accuracy on a punctuation restoration task (§ 6.1). We also use our model to correct the punctuation of non-native writers of English (§ 6.2), and to maintain natural punc- tuation style when syntactically transforming En- glish sentences (§ 6.3). In principle, our model could also be used within a generative parser, al- lowing the parser to evaluate whether a candidate tree truly explains the punctuation observed in the input sentence (§8). Punctuation is interesting In The Linguistics of Punctuation, Nunberg (1990) argues that punctu- ation (in English) is more than a visual counter- part of spoken-language prosody, but forms a lin- guistic system that involves “interactions of point indicators (i.e. 
commas, semicolons, colons, pe- riods and dashes).” He proposes that much as in phonology (Chomsky and Halle, 1968), a gram- mar generates underlying punctuation which then transforms into the observed surface punctuation. Consider generating a sentence from a syntactic grammar as follows: Hail the king [, Arthur Pendragon ,] [, who wields [ “ Excalibur ” ] ,] . Although the full tree is not depicted here, some of the constituents are indicated with brackets. In this underlying generated tree, each appositive NP is surrounded by commas. On the surface, however, the two adjacent commas after Pendragon will now be collapsed into one, and the final comma will be absorbed into the adjacent period. Fur- thermore, in American English, the typographic 357 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 convention is to move the final punctuation inside the quotation marks. Thus a reader sees only this modified surface form of the sentence: Hail the king, Arthur Pendragon, who wields “Excalibur.” Note that these modifications are string transfor- mations that do not see or change the tree. The resulting surface punctuation marks may be clues to the parse tree, but (contrary to NLP convention) they should not be included as nodes in the parse tree. Only the underlying marks play that role. Punctuation is meaningful Pang et al. (2002) use question and exclamation marks as clues to sentiment. Similarly, quotation marks may be used to mark titles, quotations, reported speech, or dubious terminology (University of Chicago, 2010). Because of examples like this, methods for determining the similarity or meaning of syntax trees, such as a tree kernel (Agarwal et al., 2011) or a recursive neural network (Tai et al., 2015), should ideally be able to consider where the un- derlying punctuation marks attach. Punctuation is helpful Surface punctuation re- mains correlated with syntactic phrase structure. NLP systems for generating or editing text must be able to deploy surface punctuation as human writ- ers do. Parsers and grammar induction systems benefit from the presence of surface punctuation marks (Jones, 1994; Spitkovsky et al., 2011). It is plausible that they could do better with a linguisti- cally informed model that explains exactly why the surface punctuation appears where it does. Pat- terns of punctuation usage can also help identify the writer’s native language (Markov et al., 2018). Punctuation is neglected Work on syntax and parsing tends to treat punctuation as an af- terthought rather than a phenomenon governed by its own linguistic principles. Treebank annota- tion guidelines for punctuation tend to adopt sim- ple heuristics like “attach to the highest possi- ble node that preserves projectivity” (Bies et al., 1995; Nivre et al., 2018).1 Many dependency parsing works exclude punctuation from evalua- tion (Nivre et al., 2007b; Koo and Collins, 2010; Chen and Manning, 2014; Lei et al., 2014; Kiper- wasser and Goldberg, 2016), although some others retain punctuation (Nivre et al., 2007a; Goldberg and Elhadad, 2010; Dozat and Manning, 2017). 1http://universaldependencies.org/u/ dep/punct.html Unpunctuated Tree: T Dale means river valley rootnsubj dobj ATTACH tree: T 0 Underlying sequence: u sentence: ū Surface sentence: x̄sequence: x NOISYCHANNEL u0 u1 u2 u3 u4 x0 x1 x2 x3 x4 “ Dale ” means “ river valley ” . “ Dale ” means “ river valley . ” root. nsubj“ ” dobj“ ” Figure 1: The generative story of a sentence. 
Given an unpunctuated tree T at top, at each node w ∈ T, the ATTACH process stochastically attaches a left puncteme l and a right puncteme r, which may be empty. The resulting tree T′ has underlying punctuation u. Each slot's punctuation uᵢ ∈ u is rewritten to xᵢ ∈ x by NOISYCHANNEL.

In tasks such as word embedding induction (Mikolov et al., 2013; Pennington et al., 2014) and machine translation (Zens et al., 2002), punctuation marks are usually either removed or treated as ordinary words (Řehůřek and Sojka, 2010). Yet to us, building a parse tree on a surface sentence seems as inappropriate as morphologically segmenting a surface word. In both cases, one should instead analyze the latent underlying form, jointly with recovering that form. For example, the proper segmentation of English hoping is not hop-ing but hope-ing (with underlying e), and the proper segmentation of stopping is neither stopp-ing nor stop-ping but stop-ing (with only one underlying p). Cotterell et al. (2015, 2016) get this right for morphology. We attempt to do the same for punctuation.

2 Formal Model

We propose a probabilistic generative model of sentences (Figure 1):

    p(\bar{x}) = \sum_{T, T'} p_{\mathrm{syn}}(T) \cdot p_\theta(T' \mid T) \cdot p_\phi(\bar{x} \mid \bar{u}(T'))    (1)

First, an unpunctuated dependency tree T is stochastically generated by some recursive process p_syn (e.g., Eisner, 1996, Model C).² Second, each constituent (i.e., dependency subtree) sprouts optional underlying punctuation at its left and right edges, according to a probability distribution p_θ that depends on the constituent's syntactic role (e.g., dobj for "direct object"). This punctuated tree T′ yields the underlying string ū = ū(T′), which is edited by a finite-state noisy channel p_φ to arrive at the surface sentence x̄.

Footnote 2: Our model could easily be adapted to work on constituency trees instead.

This third step may alter the sequence of punctuation tokens at each slot between words—for example, in §1, collapsing the double comma , , between Pendragon and who. u and x denote just the punctuation at the slots of ū and x̄ respectively, with uᵢ and xᵢ denoting the punctuation token sequences at the ith slot. Thus, the transformation at the ith slot is uᵢ ↦ xᵢ.

Since this model is generative, we could train it without any supervision to explain the observed surface string x̄: maximize the likelihood p(x̄) in (1), marginalizing out the possible T, T′ values. In the present paper, however, we exploit known T values (as observed in the "depunctuated" version of a treebank). Because T is observed, we can jointly train θ, φ to maximize just

    p(x \mid T) = \sum_{T'} p_\theta(T' \mid T) \cdot p_\phi(x \mid u(T'))    (2)

That is, the p_syn model that generated T becomes irrelevant, but we still try to predict what surface punctuation will be added to T. We still marginalize over the underlying punctuation marks u. These are never observed, but they must explain the surface punctuation marks x (§2.2), and they must be explained in turn by the syntax tree T (§2.1). The trained generative model then lets us restore or correct punctuation in new trees T (§6).
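As a concrete (and drastically simplified) illustration of this pipeline, the Python sketch below mimics the two stochastic steps for a couple of constituents. The dependency relations, allowed puncteme pairs, and feature weights are invented for illustration, and the channel is a deterministic stand-in rather than the trained probabilistic FST of §2.2.

    import math, random

    # Toy ATTACH step: p_theta(l, r | w) is a log-linear distribution over the
    # puncteme pairs allowed for the node's dependency relation.  The relations,
    # punctemes, and weights below are hypothetical, not the trained values.
    ALLOWED = {"appos": [("", ""), (",", ","), ("(", ")")],
               "advcl": [("", ""), (",", ",")],
               "root":  [("", ".")]}
    WEIGHT  = {("appos", ",", ","): 1.5, ("appos", "(", ")"): 0.2,
               ("advcl", ",", ","): 1.0, ("root", "", "."): 3.0}

    def sample_pair(dep):
        pairs  = ALLOWED[dep]
        scores = [math.exp(WEIGHT.get((dep, l, r), 0.0)) for (l, r) in pairs]
        return random.choices(pairs, weights=scores)[0]

    # Toy NOISYCHANNEL step at one slot: a deterministic stand-in that collapses
    # a doubled comma and lets a period absorb an adjacent comma.
    def channel(u):
        out = []
        for tok in u:
            if out and (out[-1], tok) in {(",", ","), (",", ".")}:
                out[-1] = tok if tok == "." else out[-1]   # keep the "stronger" token
            else:
                out.append(tok)
        return out

    # Two nested constituents whose right punctemes meet at the same slot,
    # as with the appositive and the relative clause in the §1 example.
    l1, r1 = sample_pair("appos")
    l2, r2 = sample_pair("advcl")
    u_slot = [t for t in (r1, r2) if t]      # underlying tokens at the shared slot
    print("underlying:", u_slot, "-> surface:", channel(u_slot))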
2.1 Generating Underlying Punctuation

The ATTACH model characterizes the probability of an underlying punctuated tree T′ given its corresponding unpunctuated tree T:

    p_\theta(T' \mid T) = \prod_{w \in T} p_\theta(l_w, r_w \mid w)    (3)

where l_w, r_w ∈ V are the left and right punctemes that T′ attaches to the tree node w. Each puncteme (Krahn, 2014) in the finite set V is a string of 0 or more underlying punctuation tokens.³ The probability p_θ(l, r | w) is given by a log-linear model

    p_\theta(l, r \mid w) \propto \begin{cases} \exp(\theta^\top f(l, r, w)) & \text{if } (l, r) \in W_{d(w)} \\ 0 & \text{otherwise} \end{cases}    (4)

where V is the finite set of possible punctemes and W_d ⊆ V² gives the possible puncteme pairs for a node w that has dependency relation d = d(w) to its parent. V and W_d are estimated heuristically from the tokenized surface data (§4). f(l, r, w) is a sparse binary feature vector, and θ is the corresponding parameter vector of feature weights. The feature templates in Appendix A⁴ consider the symmetry between l and r, and their compatibility with (a) the POS tag of w's head word, (b) the dependency paths connecting w to its children and the root of T, (c) the POS tags of the words flanking the slots containing l and r, and (d) surface punctuation already added to w's subconstituents.

Footnote 3: Multi-token punctemes are occasionally useful. For example, the puncteme ... might consist of either 1 or 3 tokens, depending on how the tokenizer works; similarly, the puncteme ?! might consist of 1 or 2 tokens. Also, if a single constituent of T gets surrounded by both parentheses and quotation marks, this gives rise to punctemes (“ and ”). (A better treatment would add the parentheses as a separate puncteme pair at a unary node above the quotation marks, but that would have required T′ to introduce this extra node.)

Table 1: Some of Nunberg's punctuation interaction rules in English, in priority order.
  1. Point Absorption:      ,, ↦ ,    ,. ↦ .    -, ↦ -    -; ↦ ;    ;. ↦ .
  2. Quote Transposition:   ”, ↦ ,”   ”. ↦ .”
  3. Period Absorption:     .? ↦ ?    .! ↦ !    abbv. ↦ abbv
  4. Bracket Absorptions:   ,) ↦ )    -) ↦ )    (, ↦ (    ,” ↦ ”    “, ↦ “
The absorption rules ensure that when there are two adjacent tokens, the "weaker" one is deleted (where the strength ordering is {?, !, (, ), “, ”} > . > {;, :} > - > ,), except that bracketing tokens such as () and “” do not absorb tokens outside the material they bracket.

2.2 From Underlying to Surface

From the tree T′, we can read off the sequence of underlying punctuation tokens uᵢ at each slot i between words. Namely, uᵢ concatenates the right punctemes of all constituents ending at i with the left punctemes of all constituents starting at i (as illustrated by the examples in §1 and Figure 1). The NOISYCHANNEL model then transduces uᵢ to a surface token sequence xᵢ, for each i = 0, . . . , n independently (where n is the sentence length).

Nunberg's formalism. Much like Chomsky and Halle's (1968) phonological grammar of English, Nunberg's (1990) descriptive English punctuation grammar (Table 1) can be viewed computationally as a priority string rewriting system, or Markov algorithm (Markov, 1960; Caracciolo di Forino, 1968). The system begins with a token string u. At each step it selects the highest-priority local rewrite rule that can apply, and applies it as far left as possible. When no more rules can apply, the final state of the string is returned as x.

Footnote 4: The appendices (supplementary material) are available at https://arxiv.org/abs/1906.11298.

[Figure 2 contents—each row shows the partially rewritten string and, at right, the chosen edit:
  abcde    ab ↦ ab
  abcde    bc ↦ b
  a bde    bd ↦ db
  a dbe    be ↦ e
  a d e ]
Figure 2: Editing abcde ↦ ade with a sliding window. (When an absorption rule maps 2 tokens to 1, our diagram leaves blank space that is not part of the output string.) At each step, the left-to-right process has already committed to the green tokens as output; has not yet looked at the blue input tokens; and is currently considering how to (further) rewrite the black tokens. The right column shows the chosen edit.
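To make the control regime concrete, here is a minimal interpreter for such a priority rewriting system. The rule list is a tiny illustrative subset in the spirit of Table 1, not Nunberg's full grammar; higher-priority rules come first.

    # Toy Markov-algorithm interpreter: repeatedly fire the highest-priority rule
    # that matches, at its leftmost match, until no rule applies.
    RULES = [([",", ","], [","]),        # point absorption:    ,,  ->  ,
             ([",", "."], ["."]),        # point absorption:    ,.  ->  .
             (["”", "."], [".", "”"])]   # quote transposition: ”.  ->  .”

    def rewrite(tokens):
        tokens = list(tokens)
        while True:
            for lhs, rhs in RULES:                       # highest priority first
                hits = [i for i in range(len(tokens) - len(lhs) + 1)
                        if tokens[i:i + len(lhs)] == lhs]
                if hits:                                 # apply as far left as possible
                    i = hits[0]
                    tokens[i:i + len(lhs)] = rhs
                    break
            else:
                return tokens                            # no rule applies: done

    print(rewrite([",", ",", "”", "."]))   # -> ['.', '”']

Under these three rules the doubled comma collapses first, the period then transposes inside the closing quote, and finally the period absorbs the remaining comma.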
Simplifying the formalism. Markov algorithms are Turing complete. Fortunately, Johnson (1972) noted that in practice, phonological u ↦ x maps described in this formalism can usually be implemented with finite-state transducers (FSTs). For computational simplicity, we will formulate our punctuation model as a probabilistic FST (PFST)—a locally normalized left-to-right rewrite model (Cotterell et al., 2014). The probabilities for each language must be learned, using gradient descent. Normally we expect most probabilities to be near 0 or 1, making the PFST nearly deterministic (i.e., close to a subsequential FST). However, permitting low-probability choices remains useful to account for typographical errors, dialectal differences, and free variation in the training corpus. Our PFST generates a surface string, but the invertibility of FSTs will allow us to work backwards when analyzing a surface string (§3).

A sliding-window model. Instead of having rule priorities, we apply Nunberg-style rules within a 2-token window that slides over u in a single left-to-right pass (Figure 2). Conditioned on the current window contents ab, a single edit is selected stochastically: either ab ↦ ab (no change), ab ↦ b (left absorption), ab ↦ a (right absorption), or ab ↦ ba (transposition). Then the window slides rightward to cover the next input token, together with the token that is (now) to its left. a and b are always real tokens, never boundary symbols. φ specifies the conditional edit probabilities.⁵

Footnote 5: Rather than learn a separate edit probability distribution for each bigram ab, one could share parameters across bigrams. For example, Table 1's caption says that "stronger" tokens tend to absorb "weaker" ones. A model that incorporated this insight would not have to learn O(|Σ|²) separate absorption probabilities (two per bigram ab), but only O(|Σ|) strengths (one per unigram a, which may be regarded as a 1-dimensional embedding of the punctuation token a). We figured that the punctuation vocabulary Σ was small enough (Table 2) that we could manage without the additional complexity of embeddings or other featurization, although this does presumably hurt our generalization to rare bigrams.

These specific edit rules (like Nunberg's) cannot insert new symbols, nor can they delete all of the underlying symbols. Thus, surface xᵢ is a good clue to uᵢ: all of its tokens must appear underlyingly, and if xᵢ = ε (the empty string) then uᵢ = ε. The model can be directly implemented as a PFST (Appendix D) using Cotterell et al.'s (2014) more general PFST construction.

Our single-pass formalism is less expressive than Nunberg's. It greedily makes decisions based on at most one token of right context ("label bias"). It cannot rewrite ’”. ↦ .’” or ”,. ↦ .” because the . is encountered too late to percolate leftward; luckily, though, we can handle such English examples by sliding the window right-to-left instead of left-to-right. We treat the sliding direction as a language-specific parameter.⁶

Footnote 6: We could have handled all languages uniformly by making ≥ 2 passes of the sliding window (via a composition of ≥ 2 PFSTs), with at least one pass in each direction.
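A rough procedural sketch of this channel follows (sampling only, left-to-right, with invented edit probabilities standing in for the learned parameters φ; the actual model is implemented as a PFST so that it can also be inverted and summed over, as §3 requires).

    import random

    # Sketch of the 2-token sliding-window channel.  For each window ab we sample
    # one of four edits; the probabilities below are invented for illustration.
    def edit_probs(a, b):
        # keep ab, left-absorb (ab -> b), right-absorb (ab -> a), transpose (ab -> ba)
        if (a, b) == (",", ","):
            return {"keep": 0.02, "left": 0.96, "right": 0.01, "swap": 0.01}
        if (a, b) == ("”", "."):
            return {"keep": 0.45, "left": 0.02, "right": 0.03, "swap": 0.50}
        return {"keep": 0.94, "left": 0.02, "right": 0.02, "swap": 0.02}

    def channel_sample(u):
        out = []                          # tokens already committed to the output
        for tok in u:
            if not out:
                out.append(tok)
                continue
            a, b = out[-1], tok           # window: rightmost output token + new token
            probs = edit_probs(a, b)
            edit = random.choices(list(probs), weights=list(probs.values()))[0]
            if edit == "keep":
                out.append(b)
            elif edit == "left":          # ab -> b : a is absorbed
                out[-1] = b
            elif edit == "swap":          # ab -> ba
                out[-1] = b
                out.append(a)
            # "right" (ab -> a): b is absorbed, nothing to append
        return out

    print(channel_sample([",", "”", "."]))   # e.g. [',', '.', '”'] or [',', '”', '.']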
2.3 Training Objective

Building on equation (2), we train θ, φ to locally maximize the regularized conditional log-likelihood

    \Big( \sum_{x,T} \log p(x \mid T) \;-\; \xi \cdot \mathbb{E}_{T'}[c(T')]^2 \Big) \;-\; \zeta \cdot \lVert\theta\rVert^2    (5)

where the sum is over a training treebank.⁷ The expectation E[···] is taken over T′ ∼ p(· | T, x).

Footnote 7: In retrospect, there was no good reason to square the E_{T′}[c(T′)] term. However, when we started redoing the experiments, we found the results essentially unchanged.

This generalized expectation term provides posterior regularization (Mann and McCallum, 2010; Ganchev et al., 2010), by encouraging parameters that reconstruct trees T′ that use symmetric punctuation marks in a "typical" way. The function c(T′) counts the nodes in T′ whose punctemes contain "unmatched" symmetric punctuation tokens: for example, ) is "matched" only when it appears in a right puncteme with ( at the comparable position in the same constituent's left puncteme. The precise definition is given in Appendix B.

In our development experiments on English, the posterior regularization term was necessary to discover an aesthetically appealing theory of underlying punctuation. When we dropped this term (ξ = 0) and simply maximized the ordinary regularized likelihood, we found that the optimization problem was underconstrained: different training runs would arrive at different, rather arbitrary underlying punctemes. For example, one training run learned an ATTACH model that used underlying “. to terminate sentences, along with a NOISYCHANNEL model that absorbed the left quotation mark into the period. By encouraging the underlying punctuation to be symmetric, we broke the ties. We also tried making this a hard constraint (ξ = ∞), but then the model was unable to explain some of the training sentences at all, giving them probability 0. For example, I went to the “ special place ” cannot be explained, because special place is not a constituent.⁸

3 Inference

In principle, working with the model (1) is straightforward, thanks to the closure properties of formal languages. Provided that p_syn can be encoded as a weighted CFG, it can be composed with the weighted tree transducer p_θ and the weighted FST p_φ to yield a new weighted CFG (similarly to Bar-Hillel et al., 1961; Nederhof and Satta, 2003). Under this new grammar, one can recover the optimal T, T′ for x̄ by dynamic programming, or sum over T, T′ by the inside algorithm to get the likelihood p(x̄). A similar approach was used by Levy (2008) with a different FST noisy channel.

In this paper we assume that T is observed, allowing us to work with equation (2). This cuts the computation time from O(n³) to O(n).⁹ Whereas the inside algorithm for (1) must consider O(n²) possible constituents of x̄ and O(n) ways of building each, our algorithm for (2) only needs to iterate over the O(n) true constituents of T and the 1 true way of building each.
However, it must still consider the |W_d| puncteme pairs for each constituent.

Footnote 8: Recall that the NOISYCHANNEL model family (§2.2) requires the surface “ before special to appear underlyingly, and also requires the surface ε after special to be empty underlyingly. These hard constraints clash with the ξ = ∞ hard constraint that the punctuation around special must be balanced. The surface ” after place causes a similar problem: no edge can generate the matching underlying “.

Footnote 9: We do O(n) multiplications of N × N matrices, where N = O(# of punc types · max # of punc tokens per slot).

Algorithm 1: The algorithm for scoring a given (T, x) pair. The steps that update E (shown in blue in the original figure) are used during training to get the posterior regularization term in (5).
Input: T, x    ▷ training pair (omits T′, u)
Output: p(x | T), E[c(T′)]
 1: procedure TOTALSCORE(T, x)
 2:   for i = 1 to n do
 3:     compute WFSA (Mᵢ, λᵢ, ρᵢ)
 4:   E ← 0    ▷ expected count of unmatched punctemes
 5:   procedure IN(w)    ▷ w ∈ T
 6:     i, k ← slots at left, right of w's constituent
 7:     j ← slot at right of w's headword
 8:     M_left ← (∏_{w′ ∈ leftkids(w)} IN(w′)) ρ_{j−1}
 9:     M_right ← λ_j⊤ (∏_{w′ ∈ rightkids(w)} IN(w′))
10:     M₀ ← M_left · 1 · M_right    ▷ ℝ^{N_j×1}, ℝ^{1×N_j}
11:     M ← 0    ▷ ℝ^{N_i×N_k}
12:     for (l, r) ∈ W_{d(w)} do
13:       p ← p_θ(l, r | w)
14:       M ← M + p · Mᵢ(l) M₀ M_k(r)
15:       E ← E + p · ⟦l, r have unmatched punc⟧
16:     return M    ▷ ℝ^{N_i×N_k}
17:   M_root ← IN(root(T))
18:   return λ₀⊤ M_root ρ_n, E    ▷ ℝ, ℝ

3.1 Algorithms

Given an input sentence x̄ of length n, our job is to sum over possible trees T′ that are consistent with T and x̄, or to find the best such T′. This is roughly a lattice parsing problem—made easier by knowing T. However, the possible ū values are characterized not by a lattice but by a cyclic WFSA (as |uᵢ| is unbounded whenever |xᵢ| > 0).

For each slot 0 ≤ i ≤ n, transduce the surface punctuation string xᵢ by the inverted PFST for p_φ to obtain a weighted finite-state automaton (WFSA) that describes all possible underlying strings uᵢ.¹⁰ This WFSA accepts each possible uᵢ with weight p_φ(xᵢ | uᵢ). If it has Nᵢ states, we can represent it (Berstel and Reutenauer, 1988) with a family of sparse weight matrices Mᵢ(σ) ∈ ℝ^{Nᵢ×Nᵢ}, whose element at row s and column t is the weight of the s → t arc labeled with σ, or 0 if there is no such arc. Additional vectors λᵢ, ρᵢ ∈ ℝ^{Nᵢ} specify the initial and final weights. (λᵢ is one-hot if the PFST has a single initial state, of weight 1.)

Footnote 10: Constructively, compose the u-to-x PFST (from the end of §2.2) with a straight-line FSA accepting only xᵢ, and project the resulting WFST to its input tape (Pereira and Riley, 1996), as explained at the end of Appendix D.
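For intuition about this encoding, the sketch below builds a toy WFSA directly as such matrices and scores candidate underlying strings by chains of matrix products. All states, symbols, and weights are invented; in the model, the automaton instead comes from inverting the trained PFST against the observed slot xᵢ.

    import numpy as np

    # Toy 3-state WFSA over punctuation tokens, in the matrix encoding of
    # Berstel & Reutenauer: one N x N matrix per symbol, plus initial and final
    # weight vectors.  All numbers here are invented for illustration.
    N = 3
    M = {",": np.zeros((N, N)), "”": np.zeros((N, N))}
    M[","][0, 1] = 0.9            # arc 0 -> 1 reading ","
    M[","][1, 1] = 0.2            # arc 1 -> 1 reading "," (candidate u_i may repeat it)
    M["”"][1, 2] = 0.7            # arc 1 -> 2 reading "”"
    lam = np.array([1.0, 0.0, 0.0])    # initial weights (one-hot: single start state)
    rho = np.array([0.0, 0.1, 1.0])    # final weights

    def string_weight(tokens):
        """Total weight of all accepting paths labeled with `tokens`."""
        v = lam
        for t in tokens:
            v = v @ M[t]
        return float(v @ rho)

    print(string_weight([",", "”"]))        # one candidate underlying string u_i
    print(string_weight([",", ",", "”"]))   # a longer candidate for the same slot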
For any puncteme l (or r) in V, we define Mᵢ(l) = Mᵢ(l₁) Mᵢ(l₂) ··· Mᵢ(l_{|l|}), a product over the 0 or more tokens in l. This gives the total weight of all s →* t WFSA paths labeled with l. The subprocedure IN in Algorithm 1 essentially extends this to obtain a new matrix IN(w) ∈ ℝ^{N_i×N_k}, where the subtree rooted at w stretches from slot i to slot k. Its element IN(w)_{st} gives the total weight of all extended paths in the ū WFSA from state s at slot i to state t at slot k. An extended path is defined by a choice of underlying punctemes at w and all its descendants. These punctemes determine an s-to-final path at i, then initial-to-final paths at i+1 through k−1, then an initial-to-t path at k. The weight of the extended path is the product of all the WFSA weights on these paths (which correspond to transition probabilities in the p_φ PFST) times the probability of the choice of punctemes (from p_θ).

This inside algorithm computes quantities needed for training (§2.3). Useful variants arise via well-known methods for weighted derivation forests (Berstel and Reutenauer, 1988; Goodman, 1999; Li and Eisner, 2009; Eisner, 2016). Specifically, to modify Algorithm 1 to maximize over T′ values (§§6.2–6.3) instead of summing over them, we switch to the derivation semiring (Goodman, 1999), as follows. Whereas IN(w)_{st} used to store the total weight of all extended paths from state s at slot i to state t at slot k, now it will store the weight of the best such extended path. It will also store that extended path's choice of underlying punctemes, in the form of a puncteme-annotated version of the subtree of T that is rooted at w. This is a potential subtree of T′. Thus, each element of IN(w) has the form (r, D) where r ∈ ℝ and D is a tree. We define addition and multiplication over such pairs:

    (r, D) + (r', D') = \begin{cases} (r, D) & \text{if } r > r' \\ (r', D') & \text{otherwise} \end{cases}    (6)

    (r, D) \cdot (r', D') = (r r', D D')    (7)

where DD′ denotes an ordered combination of two trees. Matrix products UV and scalar-matrix products p · V are defined in terms of element addition and multiplication as usual:

    (UV)_{st} = \sum_{r} U_{sr} \cdot V_{rt}    (8)

    (p \cdot V)_{st} = p \cdot V_{st}    (9)

What is DD′? For presentational purposes, it is convenient to represent a punctuated dependency tree as a bracketed string. For example, the underlying tree T′ in Figure 1 would be [ [“ Dale ”] means [“ [ river ] valley ”] ] where the words correspond to nodes of T. In this case, we can represent every D as a partial bracketed string and define DD′ by string concatenation. This presentation ensures that multiplication (7) is a complete and associative (though not commutative) operation, as in any semiring. As base cases, each real-valued element of Mᵢ(l) or M_k(r) is now paired with the string [l or r] respectively,¹¹ and the real number 1 at line 10 is paired with the string w. The real-valued elements of the λᵢ and ρᵢ vectors and the 0 matrix at line 11 are paired with the empty string ε, as is the real number p at line 13.

Footnote 11: We still construct the real matrix Mᵢ(l) by ordinary matrix multiplication before pairing its elements with strings. This involves summation of real numbers: each element of the resulting real matrix is a marginal probability, which sums over possible PFST paths (edit sequences) that could map the underlying puncteme l to a certain substring of the surface slot xᵢ. Similarly for M_k(r).

In practice, the D strings that appear within the matrix M of Algorithm 1 will always represent complete punctuated trees. Thus, they can actually be represented in memory as such, and different trees may share subtrees for efficiency (using pointers). The product in line 10 constructs a matrix of trees with root w and differing sequences of left/right children, while the product in line 14 annotates those trees with punctemes l, r.

To sample a possible T′ from the derivation forest in proportion to its probability (§6.1), we use the same algorithm but replace equation (6) with

    (r, D) + (r', D') = \begin{cases} (r + r', D) & \text{if } u < \frac{r}{r + r'} \\ (r + r', D') & \text{otherwise} \end{cases}

with u ∼ Uniform(0, 1) being a random number.

3.2 Optimization

Having computed the objective (5), we find the gradient via automatic differentiation, and optimize θ, φ via Adam (Kingma and Ba, 2014)—a variant of stochastic gradient descent—with learning rate 0.07, batch size 5, 400 sentences per epoch, and L2 regularization. (These hyperparameters, along with the regularization coefficients ζ and ξ from equation (5), were tuned on dev data (§4) for each language respectively.) We train
We train 11We still construct the real matrix Mi(l) by ordinary ma- trix multiplication before pairing its elements with strings. This involves summation of real numbers: each element of the resulting real matrix is a marginal probability, which sums over possible PFST paths (edit sequences) that could map the underlying puncteme l to a certain substring of the surface slot xi. Similarly for Mk(r). 362 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 the punctuation model for 30 epochs. The initial NOISYCHANNEL parameters (�) are drawn from N(0, 1), and the initial ATTACH parameters (✓) are drawn from N(0, 1) (with one minor excep- tion described in Appendix A). 4 Intrinsic Evaluation of the Model Data. Throughout §§4–6, we will examine the punctuation model on a subset of the Univer- sal Dependencies (UD) version 1.4 (Nivre et al., 2016)—a collection of dependency treebanks across 47 languages with unified POS-tag and de- pendency label sets. Each treebank has designated training, development, and test portions. We ex- periment on Arabic, English, Chinese, Hindi, and Spanish (Table 2)—languages with diverse punc- tuation vocabularies and punctuation interaction rules, not to mention script directionality. For each treebank, we use the tokenization provided by UD, and take the punctuation tokens (which may be multi-character, such as ...) to be the tokens with the PUNCT tag. We replace each straight dou- ble quotation mark " with either “ or ” as appro- priate, and similarly for single quotation marks.12 We split each non-punctuation token that ends in . (such as etc.) into a shorter non-punctuation token (etc) followed by a special punctuation to- ken called the “abbreviation dot” (which is distinct from a period). We prepend a special punctuation mark ˆ to every sentence x̄, which can serve to absorb an initial comma, for example.13 We then replace each token with the special symbol UNK if its type appeared fewer than 5 times in the training portion. This gives the surface sentences. To estimate the vocabulary V of underlying punctemes, we simply collect all surface token se- quences xi that appear at any slot in the training portion of the processed treebank. This is a gener- ous estimate. Similarly, we estimate Wd (§ 2.1) as all pairs (l, r) 2 V2 that flank any d constituent. Recall that our model generates surface punctu- ation given an unpunctuated dependency tree. We train it on each of the 5 languages independently. We evaluate on conditional perplexity, which will be low if the trained model successfully assigns a high probability to the actual surface punctuation in a held-out corpus of the same language. 12For en and en_esl, “ and ” are distinguished by language-specific part-of-speech tags. For the other 4 lan- guages, we identify two " dependents of the same head word, Language Treebank #Token %Punct #Omit #Type Arabic ar 282K 7.9 255 18 Chinese zh 123K 13.8 3 23 English en 255K 11.7 40 35 en_esl 97.7K 9.8 2 16 Hindi hi 352K 6.7 21 15 Spanish es_ancora 560K 11.7 25 16 Table 2: Statistics of our datasets. “Treebank” is the UD treebank identifier, “#Token” is the number of to- kens, “%Punct” is the percentage of punctuation to- kens, “#Omit” is the small number of sentences con- taining non-leaf punctuation tokens (see footnote 19), and “#Type” is the number of punctuation types after preprocessing. 
(Recall from §4 that preprocessing dis- tinguishes between left and right quotation mark types, and between abbreviation dot and period dot types.) Baselines. We compare our model against three baselines to show that its complexity is necessary. Our first baseline is an ablation study that does not use latent underlying punctuation, but generates the surface punctuation directly from the tree. (To implement this, we fix the parameters of the noisy channel so that the surface punctuation equals the underlying with probability 1.) If our full model performs significantly better, it will demonstrate the importance of a distinct underlying layer. Our other two baselines ignore the tree struc- ture, so if our full model performs significantly better, it will demonstrate that conditioning on ex- plicit syntactic structure is useful. These baselines are based on previously published approaches that reduce the problem to tagging: Xu et al. (2016) use a BiLSTM-CRF tagger with bigram topology; Tilk and Alumäe (2016) use a BiGRU tagger with attention. In both approaches, the model is trained to tag each slot i with the correct string xi 2 V⇤ (possibly ✏ or ˆ). These are discriminative proba- bilistic models (in contrast to our generative one). Each gives a probability distribution over the tag- gings (conditioned on the unpunctuated sentence), so we can evaluate their perplexity.14 Results. As shown in Table 3, our full model beats the baselines in perplexity in all 5 languages. Also, in 4 of 5 languages, allowing a trained NOISYCHANNEL (rather than the identity map) replacing the left one with “ and the right one with ”. 13For symmetry, we should also have added a final mark. 14These methods learn word embeddings that optimize conditional log-likelihood on the punctuation restoration training data. They might do better if these embeddings were shared with other tasks, as multi-task learning might lead them to discover syntactic categories of words. 363 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 https://arxiv.org/abs/1906.11298 Attn. CRF ATTACH +NC DIR Arabic 1.4676 1.3016 1.2230 1.1526 L Chinese 1.6850 1.4436 1.1921 1.1464 L English 1.5737 1.5247 1.5636 1.4276 R Hindi 1.1201 1.1032 1.0630 1.0598 L Spanish 1.4397 1.3198 1.2364 1.2103 R Table 3: Results of the conditional perplexity experi- ment (§4), reported as perplexity per punctuation slot, where an unpunctuated sentence of n words has n + 1 slots. Column “Attn.” is the BiGRU tagger with atten- tion, and “CRF” stands for the BiLSTM-CRF tagger. “ATTACH” is the ablated version of our model where surface punctuation is directly attached to the nodes. Our full model “+NC” adds NOISYCHANNEL to trans- duce the attached punctuation into surface punctuation. DIR is the learned direction (§ 2.2) of our full model’s noisy channel PFST: Left-to-right or Right-to-left. Our models are given oracle parse trees T . The best per- plexity is boldfaced, along with all results that are not significantly worse (paired permutation test, p < 0.05). significantly improves the perplexity. 5 Analysis of the Learned Grammar 5.1 Rules Learned from the Noisy Channel We study our learned probability distribution over noisy channel rules (ab 7! b, ab 7! a, ab 7! ab, ab 7!ba) for English. The probability distributions corresponding to six of Nunberg’s English rules are shown in Figure 3. 
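(The rule-by-rule comparison reported below boils down to the following computation, shown here with invented probabilities in place of the trained channel's distributions.)

    # Toy illustration of the rank / mean-reciprocal-rank comparison against
    # Nunberg's rules.  The "learned" distributions below are invented stand-ins
    # for the trained channel's p(output | underlying bigram).
    learned = {(",", ","): {",,": 0.03, ",": 0.95, "": 0.02},
               ("”", "."): {"”.": 0.40, ".”": 0.50, ".": 0.10}}
    nunberg = {(",", ","): ",",          # point absorption
               ("”", "."): ".”"}         # quote transposition

    def rank_of(bigram, output):
        ordered = sorted(learned[bigram].items(), key=lambda kv: -kv[1])
        return 1 + [o for o, _ in ordered].index(output)

    ranks = [rank_of(bg, out) for bg, out in nunberg.items()]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    print(ranks, mrr)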
By comparing the orange and blue bars, observe that the model trained on the en_cesl treebank learned different quotation rules from the one trained on the en treebank. This is because en_cesl follows British style, whereas en has American-style quote transposition.15 We now focus on the model learned from the en treebank. Nunberg’s rules are deterministic, and our noisy channel indeed learned low-entropy rules, in the sense that for an input ab with un- derlying count � 25,16 at least one of the possi- ble outputs (a, b, ab or ba) always has probability > 0.75. The one exception is ”. 7!.” for which the argmax output has probability ⇡ 0.5, because writers do not apply this quote transposition rule consistently. As shown by the blue bars in Fig- ure 3, the high-probability transduction rules are 15American style places commas and periods inside the quotation marks, even if they are not logically in the quote. British style (more sensibly) places unquoted periods and commas in their logical place, sometimes outside the quo- tation marks if they are not part of the quote. 16For rarer underlying pairs ab, the estimated distributions sometimes have higher entropy due to undertraining. Figure 3: Rewrite probabilities learned for English, averaged over the last 4 epochs on en treebank (blue bars) or en_esl treebank (orange bars). The header above each figure is the underlying punctuation string (input to NOISYCHANNEL). The two counts in the fig- ure headers are the number of occurrences of the under- lying punctuation strings in the 1-best reconstruction of underlying punctuation sequences (by Algorithm 1) re- spectively in the en and en_esl treebank. Each bar represents one surface punctuation string (output of NOISYCHANNEL), its height giving the probability. consistent with Nunberg’s hand-crafted determin- istic grammar in Table 1. Our system has high precision when we look at the confident rules. Of the 24 learned edits with conditional probability > 0.75, Nunberg lists 20. Our system also has good recall. Nunberg’s hand-crafted schemata consider 16 punctuation types and generate a total of 192 edit rules, in- cluding the specimens in Table 1. That is, of the 162 = 256 possible underlying punctuation bi- grams ab, 34 are supposed to undergo absorption or transposition. Our method achieves fairly high recall, in the sense that when Nunberg proposes ab 7!�, our learned p(� | ab) usually ranks highly among all probabilities of the form p(�0 | ab). 75 of Nunberg’s rules got rank 1, 48 got rank 2, and the remaining 69 got rank > 2. The mean recipro- cal rank was 0.621. Recall is quite high when we restrict to those Nunberg rules ab 7! � for which our model is confident how to rewrite ab, in the sense that some p(�0 | ab) > 0.5. (This tends to eliminate rare ab: see footnote 5.) Of these 55 Nunberg rules, 38 rules got rank 1, 15 got rank 2, and only 2 got rank worse than 2. The mean recip- rocal rank was 0.836. ¿What about Spanish? Spanish uses inverted question marks ¿ and exclamation marks ¡, which form symmetric pairs with the regular question marks and exclamation marks. If we try to ex- trapolate to Spanish from Nunberg’s English for- 364 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 malization, the English mark most analogous to ¿ is (. Our learned noisy channel for Spanish (not graphed here) includes the high-probability rules ,¿ 7!,¿ and :¿ 7!:¿ and ¿, 7!¿ which match Nunberg’s treatment of ( in English. 
5.2 Attachment Model What does our model learn about how dependency relations are marked by underlying punctuation? ˆ,Earlier, Kerry said ,“...,in fact, answer the question”. ˆEarlier, Kerry said ,“...,in fact, answer the question.” root.,advmod, ,“ccomp” ,nmod, The above example17 illustrates the use of specific puncteme pairs to set off the advmod, ccomp, and nmod relations. Notice that said takes a complement (ccomp) that is symmetrically quoted but also left delimited by a comma, which is indeed how direct speech is punctuated in English. This example also illustrates quotation transposition. The top five relations that are most likely to generate symmetric punctemes and their top (l, r) pairs are shown in Table 4. Section 1 ,2 , ,...7, and 8... Section 1 ,2 ,...7, and 8... ,conj, ,conj, conj cc The above example18 shows how our model han- dles commas in conjunctions of 2 or more phrases. UD format dictates that each conjunct after the first is attached by the conj relation. As shown above, each such conjunct is surrounded by under- lying commas (via the N.,.,.conj feature from Appendix A), except for the one that bears the conjunction and (via an even stronger weight on the C.✏.✏.���!conj.cc feature). Our learned feature weights indeed yield p(` = ✏, r = ✏) > 0.5 for the final conjunct in this example. Some writers omit the “Oxford comma” before the conjunction: this style can be achieved simply by changing “sur- rounded” to “preceded” (that is, changing the N feature to N.,.✏.conj). 6 Performance on Extrinsic Tasks We evaluate the trained punctuation model by us- ing it in the following three tasks. 17[en] Earlier, Kerry said, “Just because you get an honorable discharge does not, in fact, answer that question.” 18[en] Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License. parataxis appos list advcl ccomp 2.38 2.29 1.33 0.77 0.53 , , 26.8 , , 18.8 ✏ ✏ 60.0 ✏ ✏ 73.8 ✏ ✏ 90.8 ✏ ✏ 20.1 : ✏ 18.1 , , 22.3 , , 21.2 “ ” 2.4 ( ) 13.0 - ✏ 15.9 , ✏ 5.3 ✏ , 3.1 , , 2.4 - ✏ 9.7 ✏ ✏ 14.4 < > 3.0 ( ) 0.74 :“ ” 0.9 : ✏ 8.1 ( ) 13.1 ( ) 3.0 ✏ - 0.21 “ ,” 0.8 Table 4: The top 5 relations that are most likely to generate symmetric punctemes, the entropy of their puncteme pair (row 2), and their top 5 puncteme pairs (rows 3–7) with their probabilities shown as percent- ages. The symmetric punctemes are in boldface. 6.1 Punctuation Restoration In this task, we are given a depunctuated sentence d̄19 and must restore its (surface) punctuation. Our model supposes that the observed punctuated sen- tence x̄ would have arisen via the generative pro- cess (1). Thus, we try to find T , T 0, and x̄ that are consistent with d̄ (a partial observation of x̄). The first step is to reconstruct T from d̄. This initial parsing step is intended to choose the T that maximizes psyn(T | d̄).20 This step depends only on psyn and not on our punctuation model (p✓, p�). In practice, we choose T via a dependency parser that has been trained on an unpunctuated treebank with examples of the form (d̄, T).21 Equation (2) now defines a distribution over (T 0, x) given this T . To obtain a single prediction for x, we adopt the minimum Bayes risk (MBR) approach of choosing surface punctuation x̂ that minimizes the expected loss with respect to the unknown truth x⇤. Our loss function is the total edit distance over all slots (where edits operate on punctuation tokens). 
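Concretely, this per-sentence loss (and the average edit distance, AED, reported below) can be computed as in the sketch that follows—an ordinary token-level Levenshtein distance applied slot by slot. This is a paraphrase of the metric, not the actual evaluation script.

    def edit_distance(a, b):
        """Levenshtein distance between two token sequences."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def loss(x, x_star):
        """Total edit distance over all n+1 slots of a sentence."""
        return sum(edit_distance(xi, xi_star) for xi, xi_star in zip(x, x_star))

    # e.g. predicted vs. true punctuation for a 3-word sentence (4 slots):
    print(loss([[], [","], [], ["."]], [[], [], [], [".", "”"]]))   # -> 2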
Finding x̂ exactly would be intractable, so we use a sampling-based approximation and draw m = 1000 samples from the posterior distribution over (T′, x). We then define

    \hat{x} = \operatorname{argmin}_{x \in S(T)} \sum_{x^* \in S(T)} \hat{p}(x^* \mid T) \cdot \mathrm{loss}(x, x^*)    (10)

where S(T) is the set of unique x values in the sample and p̂ is the empirical distribution given by the sample. This can be evaluated in O(m²) time.

Footnote 19: To depunctuate a treebank sentence, we remove all tokens with POS-tag PUNCT or dependency relation punct. These are almost always leaves; else we omit the sentence.

Footnote 20: Ideally, rather than maximize, one would integrate over possible trees T, in practice by sampling many values T_k from p_syn(· | ū) and replacing S(T) in (10) with ∪_k S(T_k).

Footnote 21: Specifically, the Yara parser (Rasooli and Tetreault, 2015), a fast non-probabilistic transition-based parser that uses rich non-local features (Zhang and Nivre, 2011).

We evaluate on Arabic, English, Chinese, Hindi, and Spanish. For each language, we train both the parser and the punctuation model on the training split of that UD treebank (§4), and evaluate on held-out data. We compare to the BiLSTM-CRF baseline in §4 (Xu et al., 2016).²² We also compare to a "trivial" deterministic baseline, which merely places a period at the end of the sentence (or a "|" in the case of Hindi) and adds no other punctuation. Because most slots do not in fact have punctuation, the trivial baseline already does very well; to improve on it, we must fix its errors without introducing new ones.

Our final comparison on test data is shown in the table in Figure 4. On all 5 languages, our method beats (usually significantly) its 3 competitors: the trivial deterministic baseline, the BiLSTM-CRF, and the ablated version of our model (ATTACH) that omits the noisy channel.

Of course, the success of our method depends on the quality of the parse trees T (which is particularly low for Chinese and Arabic). The graph in Figure 4 explores this relationship, by evaluating (on dev data) with noisier trees obtained from parsers that were variously trained on only the first 10%, 20%, . . . of the training data. On all 5 languages, provided that the trees are at least 75% correct, our punctuation model beats both the trivial baseline and the BiLSTM-CRF (which do not use trees). It also beats the ATTACH ablation baseline at all levels of tree accuracy (these curves are omitted from the graph to avoid clutter). In all languages, better parses give better performance, and gold trees yield the best results.

6.2 Punctuation Correction

Our next goal is to correct punctuation errors in a learner corpus. Each sentence is drawn from the Cambridge Learner Corpus treebanks, which provide original (en_esl) and corrected (en_cesl) sentences. All kinds of errors are corrected, such

Footnote 22: We copied their architecture exactly but re-tuned the hyperparameters on our data. We also tried tripling the amount of training data by adding unannotated sentences (provided along with the original annotated sentences by Ginter et al. (2017)), taking advantage of the fact that the BiLSTM-CRF does not require its training sentences to be annotated with trees. However, this actually hurt performance slightly, perhaps because the additional sentences were out-of-domain.
We also tried the BiGRU-with-attention architecture of Tilk and Alumäe (2016), but it was also weaker than the BiLSTM- CRF (just as in Table 3). We omit all these results from Fig- ure 4 to reduce clutter. p 8 ATTACH a-- --a Arabic 0.064 0.064 0.063 0.059 0.053 Chinese 0.110 0.109 0.104 0.102 0.048 English 0.100 0.108 0.092 0.090 0.079 Hindi 0.025 0.023 0.019 0.018 0.013 Spanish 0.093 0.092 0.085 0.078 0.068 Figure 4: Edit distance per slot (which we call average edit distance, or AED) for each of the 5 corpora. Lower is better. The table gives the final AED on the test data. Its first 3 columns show the baseline methods just as in Table 3: the trivial deterministic method, the BiLSTM- CRF, and the ATTACH ablation baseline that attaches the surface punctuation directly to the tree. Column 4 is our method that incorporates a noisy channel, and column 5 (in gray) is our method using oracle (gold) trees. We boldface the best non-oracle result as well as all that are not significantly worse (paired permutation test, p < 0.05). The curves show how our method’s AED (on dev data) varies with the labeled attachment score (LAS) of the trees, where --a at x = 100 uses the oracle (gold) trees, a-- at x < 100 uses trees from our parser trained on 100% of the training data, and the #-- points at x ⌧ 100 use increasingly worse parsers. The p and 8 at the right of the graph show the AED of the trivial deterministic baseline and the BiLSTM-CRF baseline, which do not use trees. as syntax errors, but we use only the 30% of sen- tences whose depunctuated trees T are isomorphic between en_esl and en_cesl. These en_cesl trees may correct word and/or punctuation errors in en_esl, as we wish to do automatically. We assume that an English learner can make mistakes in both the attachment and the noisy channel steps. A common attachment mistake is the failure to surround a non-restrictive relative clause with commas. In the noisy channel step, mistakes in quote transposition are common. Correction model. Based on the assumption about the two error sources, we develop a dis- criminative model for this task. Let x̄e de- note the full input sentence, and let xe and xc denote the input (possibly errorful) and output (corrected) punctuation sequences. We model p(xc | x̄e) = P T P T 0c psyn(T | x̄e) · p✓(T 0c | T, xe) · p�(xc | T 0c). Here T is the depunctu- ated parse tree, T 0c is the corrected underlying tree, T 0 e is the error underlying tree, and we assume p✓(T 0 c | T, xe) = P T 0e p(T 0e | T, xe) · p✓(T 0c | T 0e). 366 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 In practice we use a 1-best pipeline rather than summing. Our first step is to reconstruct T from the error sentence x̄e. We choose T that max- imizes psyn(T | x̄e) from a dependency parser trained on en_esl treebank examples (x̄e, T ). The second step is to reconstruct T 0e based on our punc- tuation model trained on en_esl. We choose T 0e that maximizes p(T 0e | T, xe). We then reconstruct T 0 c by p(T 0c | T 0 e) = Q we2T 0e p(l, r | we) (11) where we is the node in T 0e, and p(l, r | we) is a similar log-linear model to equation (4) with addi- tional features (Appendix C4) which look at we. Finally, we reconstruct xc based on the noisy channel p�(xc | T 0c) in § 2.2. During training, � is regularized to be close to the noisy channel param- eters in the punctuation model trained on en_cesl. We use the same MBR decoder as in § 6.1 to choose the best action. 
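The MBR selection of equation (10), reused here to choose the corrected punctuation, can be sketched as follows. The sample list and the per-slot loss are placeholders; any slot-wise loss, such as the edit distance sketched earlier, can be plugged in.

    from collections import Counter

    def mbr_choice(samples, loss):
        """Pick the sampled punctuation with minimum expected loss under the
        empirical distribution of the samples (equation (10)); O(m^2) pairs."""
        counts = Counter(map(tuple, samples))          # S(T) with empirical probabilities
        total = sum(counts.values())
        def risk(x):
            return sum((c / total) * loss(list(x), list(x_star))
                       for x_star, c in counts.items())
        return list(min(counts, key=risk))

    # Toy usage with a simple per-slot mismatch loss; real use plugs in the
    # slot-wise edit-distance loss.
    slot_loss = lambda x, y: sum(a != b for a, b in zip(x, y))
    samples = [(",", "."), (",", "."), ("", ".")]      # m = 3 sampled punctuations
    print(mbr_choice(samples, slot_loss))              # -> [',', '.']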
We evaluate using AED as in § 6.1. As a second metric, we use the script from the CoNLL 2014 Shared Task on Grammati- cal Error Correction (Ng et al., 2014): it computes the F0.5-measure of the set of edits found by the system, relative to the true set of edits. As shown in Table 5, our method achieves bet- ter performance than the punctuation restoration baselines (which ignore input punctuation). On the other hand, it is soundly beaten by a new BiLSTM-CRF that we trained specifically for the task of punctuation correction. This is the same as the BiLSTM-CRF in the previous section, ex- cept that the BiLSTM now reads a punctuated input sentence (with possibly erroneous punctua- tion). To be precise, at step 0  i  n, the BiL- STM reads a concatenation of the embedding of word i (or BOS if i = 0) with an embedding of the punctuation token sequence xi. The BiLSTM- CRF wins because it is a discriminative model tai- lored for this task: the BiLSTM can extract arbi- trary contextual features of slot i that are corre- lated with whether xi is correct in context. 6.3 Sentential Rephrasing We suspect that syntactic transformations on a sentence should often preserve the underlying punctuation attached to its tree. The surface punc- tuation can then be regenerated from the trans- formed tree. Such transformations include ed- its that are suggested by a writing assistance tool (Heidorn, 2000), or subtree deletions in compres- sive summarization (Knight and Marcu, 2002). p 8 a-- parsed gold 8-corr AED 0.052 0.051 0.047 0.034 0.033 0.005 F0.5 0.779 0.787 0.827 0.876 0.881 0.984 Table 5: AED and F0.5 results on the test split of English-ESL data. Lower AED is better; higher F0.5 is better. The first three columns (markers corre- spond to Figure 4) are the punctuation restoration base- lines, which ignore the input punctuation. The fourth and fifth columns are our correction models, which use parsed and gold trees. The final column is the BiLSTM-CRF model tailored for the punctuation cor- rection task. For our experiment, we evaluate an interesting case of syntactic transformation. Wang and Eis- ner (2016) consider a systematic rephrasing pro- cedure by rearranging the order of dependent sub- trees within a UD treebank, in order to synthesize new languages with different word order that can then be used to help train multi-lingual systems (i.e., data augmentation with synthetic data). As Wang and Eisner acknowledge (2016, foot- note 9), their permutations treat surface punctua- tion tokens like ordinary words, which can result in synthetic sentences whose punctuation is quite unlike that of real languages. In our experiment, we use Wang and Eisner’s (2016) “self-permutation” setting, where the de- pendents of each noun and verb are stochastically reordered, but according to a dependent ordering model that has been trained on the same language. For example, rephrasing a English sentence SCONJ ADJ PUNCT DET NOUN VERB PUNCT If true , the caper failed . mark det punct advcl nsubj punct root under an English ordering model may yield DET NOUN VERB PUNCT SCONJ ADJ PUNCT the caper failed . If true , markdet root nsubj punct advcl punct which is still grammatical except that , and . are wrongly swapped (after all, they have the same POS tag and relation type). Worse, permutation may yield bizarre punctuation such as , , at the start of a sentence. 
Our punctuation model gives a straightforward remedy—instead of permuting the tree directly, we first discover its most likely underlying tree ˆ,If true, the caper failed. det nsubj root. mark ,advcl, by the maximizing variant of Algorithm 1 (§ 3.1). Then, we permute the underlying tree and sample the surface punctuation from the distribution modeled by the trained PFST, yielding 367 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 https://arxiv.org/abs/1906.11298 Punctuation All Base Half Full Base Half Full Arabic 156.0 231.3 186.1 540.8 590.3 553.4 Chinese 165.2 110.0 61.4 205.0 174.4 78.7 English 98.4 74.5 51.0 140.9 131.4 75.4 Hindi 10.8 11.0 9.7 118.4 118.8 91.8 Spanish 266.2 259.2 194.5 346.3 343.4 239.3 Table 6: Perplexity (evaluated on the train split to avoid evaluating generalization) of a trigram language model trained (with add-0.001 smoothing) on differ- ent versions of rephrased training sentences. “Punc- tuation” only evaluates perplexity on the trigrams that have punctuation. “All” evaluates on all the tri- grams. “Base” permutes all surface dependents includ- ing punctuation (Wang and Eisner, 2016). “Full” is our full approach: recover underlying punctuation, per- mute remaining dependents, regenerate surface punc- tuation. “Half” is like “Full” but it permutes the non- punctuation tokens identically to “Base.” The permu- tation model is trained on surface trees or recovered underlying trees T 0, respectively. In each 3-way com- parison, we boldface the best result (always significant under a paired permutation test over per-sentence log- probabilities, p < 0.05). ˆthe caper failed ,If true,. ˆthe caper failed ,If true . det nsubj root. mark ,advcl, We leave the handling of capitalization to future work. We test the naturalness of the permuted sen- tences by asking how well a word trigram lan- guage model trained on them could predict the original sentences.23 As shown in Table 6, our per- mutation approach reduces the perplexity over the baseline on 4 of the 5 languages, often dramati- cally. 7 Related Work Punctuation can aid syntactic analysis, since it signals phrase boundaries and sentence structure. Briscoe (1994) and White and Rajkumar (2008) parse punctuated sentences using hand-crafted constraint-based grammars that implement Nun- berg’s approach in a declarative way. These gram- mars treat surface punctuation symbols as ordi- nary words, but annotate the nonterminal cate- gories so as to effectively keep track of the under- lying punctuation. This is tantamount to crafting a grammar for underlyingly punctuated sentences and composing it with a finite-state noisy channel. 23So the two approaches to permutation yield different training data, but are compared fairly on the same test data. The parser of Ma et al. (2014) takes a differ- ent approach and treats punctuation marks as fea- tures of their neighboring words. Zhang et al. (2013) use a generative model for punctuated sen- tences, leting them restore punctuation marks dur- ing transition-based parsing of unpunctuated sen- tences. Li et al. (2005) use punctuation marks to segment a sentence: this "divide and rule" strat- egy reduces ambiguity in parsing of long Chinese sentences. Punctuation can similarly be used to constrain syntactic structure during grammar in- duction (Spitkovsky et al., 2011). Punctuation restoration (§ 6.1) is useful for tran- scribing text from unpunctuated speech. 
The task is usually treated by tagging each slot with zero or more punctuation tokens, using a traditional sequence labeling method: conditional random fields (Lui and Wang, 2013; Lu and Ng, 2010), re- current neural networks (Tilk and Alumäe, 2016), or transition-based systems (Ballesteros and Wan- ner, 2016). 8 Conclusion and Future Work We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent under- lying left and right punctemes. Surface punctu- ation marks are not directly attached to the syn- tax tree, but are generated from sequences of adja- cent punctemes by a (stochastic) finite-state string rewriting process . Our model is inspired by Nun- berg’s (1990) formal grammar for English punctu- ation, but is probabilistic and trainable. We give exact algorithms for training and inference. We trained Nunberg-like models for 5 lan- guages and L2 English. We compared the English model to Nunberg’s, and showed how the trained models can be used across languages for punctua- tion restoration, correction, and adjustment. In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Cu- lotta and Sorensen, 2004), and coreference reso- lution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underly- ing punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the 368 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00273 by Carnegie Mellon University user on 06 April 2021 scheme in equation (1). For example, the psyn factor in equation (1) might be a standard re- current neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of T is completed by the REDUCE operation of psyn, the punctuation- augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with p✓ and transduce the subtree-internal slots with p�. In the future, we are also interested in enriching the T 0 representation and making it more differ- ent from T , to underlyingly account for other phe- nomena in T such as capitalization, spacing, mor- phology, and non-projectivity (via reordering). Acknowledgments This material is based upon work supported by the National Science Foundation under Grant Nos. 1423276 and 1718846, including a REU supple- ment to the first author. We are grateful to the state of Maryland for the Maryland Advanced Research Computing Center, a crucial resource. We thank Xiaochen Li for early discussion, Argo lab mem- bers for further discussion, and the three reviewers for quality comments. References Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sen- timent analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38. Miguel Ballesteros and Leo Wanner. 2016. A neu- ral network architecture for multilingual punc- tuation generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1048–1053. Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunika- tionsforschung, 14:143–172. Reprinted in Y. 
8 Conclusion and Future Work

We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent underlying left and right punctemes. Surface punctuation marks are not directly attached to the syntax tree, but are generated from sequences of adjacent punctemes by a (stochastic) finite-state string rewriting process. Our model is inspired by Nunberg's (1990) formal grammar for English punctuation, but is probabilistic and trainable. We give exact algorithms for training and inference.

We trained Nunberg-like models for 5 languages and L2 English. We compared the English model to Nunberg's, and showed how the trained models can be used across languages for punctuation restoration, correction, and adjustment.

In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Culotta and Sorensen, 2004), and coreference resolution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underlying punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the scheme in equation (1). For example, the psyn factor in equation (1) might be a standard recurrent neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of T is completed by the REDUCE operation of psyn, the punctuation-augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with pθ and transduce the subtree-internal slots with pφ. In the future, we are also interested in enriching the T′ representation and making it more different from T, to underlyingly account for other phenomena in T such as capitalization, spacing, morphology, and non-projectivity (via reordering).

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. 1423276 and 1718846, including a REU supplement to the first author. We are grateful to the state of Maryland for the Maryland Advanced Research Computing Center, a crucial resource. We thank Xiaochen Li for early discussion, Argo lab members for further discussion, and the three reviewers for quality comments.

References

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38.
Miguel Ballesteros and Leo Wanner. 2016. A neural network architecture for multilingual punctuation generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1048–1053.
Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172. Reprinted in Y. Bar-Hillel (1964), Language and Information: Selected Essays on their Theory and Application, Addison-Wesley 1964, pages 116–150.
Jean Berstel and Christophe Reutenauer. 1988. Rational Series and their Languages. Springer-Verlag.
Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schasberger. 1995. Bracketing guidelines for Treebank II style: Penn Treebank project. Technical Report MS-CIS-95-06, University of Pennsylvania.
Ted Briscoe. 1994. Parsing (with) punctuation, etc. Technical report, Xerox European Research Laboratory.
Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.
Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper and Row, New York.
Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 625–630.
Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics (TACL), 3:433–447.
Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 664–669.
Brooke Cowan, Ivona Kučerová, and Michael Collins. 2006. A discriminative model for tree-to-tree translation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 232–241.
Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL).
Timothy Dozat and Christopher Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations (ICLR).
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 199–209.
Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), pages 340–345.
Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop. In Proceedings of the EMNLP Workshop on Structured Prediction for NLP.
A. Caracciolo di Forino. 1968. String processing languages and generalized Markov algorithms. In D. G. Bobrow, editor, Symbol Manipulation Languages and Techniques, pages 191–206. North-Holland Publishing Company, Amsterdam.
Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049.
Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 742–750.
Joshua Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.
George Heidorn. 2000. Intelligent writing assistance. In Robert Dale, Herman Moisl, and Harold Somers, editors, Handbook of Natural Language Processing, pages 181–207. Marcel Dekker, New York.
C. Douglas Johnson. 1972. Formal Aspects of Phonological Description. Mouton.
Bernard E. M. Jones. 1994. Exploring the role of punctuation in parsing natural text. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics (TACL), 4:313–327.
Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.
Fang Kong, Guodong Zhou, Longhua Qian, and Qiaoming Zhu. 2010. Dependency-driven anaphoricity determination for coreference resolution. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 599–607.
Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–11.
Albert E. Krahn. 2014. A New Paradigm for Punctuation. Ph.D. thesis, The University of Wisconsin-Milwaukee.
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1381–1391.
Roger Levy. 2008. A noisy-channel model of human sentence comprehension under uncertain input. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 234–243.
Xing Li, Chengqing Zong, and Rile Hu. 2005. A hierarchical parsing approach with punctuation processing for long Chinese sentences. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP).
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 40–51.
Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 177–186.
Marco Lui and Li Wang. 2013. Recovering casing and punctuation using conditional random fields. In Proceedings of the Australasian Language Technology Association Workshop (ALTA), pages 137–141.
Ji Ma, Yue Zhang, and Jingbo Zhu. 2014. Punctuation processing for projective dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 791–796.
Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11:955–984.
Andrey Andreevich Markov. 1960. The theory of algorithms. American Mathematical Society Translations, series 2(15):1–14.
Ilia Markov, Vivi Nastase, and Carlo Strapparava. 2018. Punctuation as native language interference. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3456–3466.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Computing Research Repository (CoRR), arXiv:1301.3781.
Mark-Jan Nederhof and Giorgio Satta. 2003. Probabilistic parsing as intersection. In 8th International Workshop on Parsing Technologies (IWPT), pages 137–148.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.
Joakim Nivre, Željko Agić, Lars Ahrenberg, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Yevgeni Berzak, Riyaz Ahmad Bhat, Eckhard Bick, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Gülşen Cebiroğlu Eryiğit, Giuseppe G. A. Celano, Fabricio Chalub, Çağrı Çöltekin, Miriam Connor, Elizabeth Davidson, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Claudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta Gonzáles Saavedra, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Jan Hajič, Linh Hà Mỹ, Dag Haug, Barbora Hladká, Radu Ion, Elena Irimia, Anders Johannsen, Fredrik Jørgensen, Hüner Kaşıkara, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Jessica Kenney, Natalia Kotsyba, Simon Krek, Veronika Laippala, Lucia Lam, Phuong Lê Hồng, Alessandro Lenci, Nikola Ljubešić, Olga Lyashevskaya, Teresa Lynn, Aibek Makazhanov, Christopher Manning, Cătălina Mărănduc, David Mareček, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Keiko Sophie Mori, Shunsuke Mori, Bohdan Moskalevskyi, Kadri Muischnek, Nina Mustafina, Kaili Müürisep, Luong Nguyễn Thị, Huyền Nguyễn Thị Minh, Vitaly Nikolaev, Hanna Nurmi, Petya Osenova, Robert Östling, Lilja Øvrelid, Valeria Paiva, Elena Pascual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Lauma Pretkalniņa, Prokopis Prokopidis, Tiina Puolakainen, Sampo Pyysalo, Alexandre Rademaker, Loganathan Ramasamy, Livy Real, Laura Rituma, Rudolf Rosa, Shadi Saleh, Baiba Saulīte, Sebastian Schuster, Wolfgang Seeker, Mojgan Seraji, Lena Shakurova, Mo Shen, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Carolyn Spadine, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Larraitz Uria, Gertjan van Noord, Viktor Varga, Veronika Vincze, Lars Wallin, Jing Xian Wang, Jonathan North Washington, Mats Wirén, Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman, and Hanzhi Zhu. 2016. Universal Dependencies 1.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. Data available at http://universaldependencies.org.
Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007a. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932.
Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryigit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007b. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.
Joakim Nivre et al. 2018. Universal Dependencies annotation guidelines. Available at universaldependencies.org.
Geoffrey Nunberg. 1990. The Linguistics of Punctuation. Number 18 in CSLI Lecture Notes. Center for the Study of Language and Information.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Fernando C. N. Pereira and Michael D. Riley. 1996. Speech recognition by composition of weighted finite automata. Computing Research Repository (CoRR), arXiv:cmp-lg/9603001.
Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. Computing Research Repository (CoRR), arXiv:1503.06733 (version 2).
Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50.
Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2011. Punctuation: Making a point in unsupervised dependency parsing. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL '11, pages 19–28.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 1556–1566.
Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In Interspeech, pages 3047–3051.
Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden Markov models. In Proceedings of the Workshop on Structured Prediction for NLP, pages 63–71.
University of Chicago. 2010. The Chicago Manual of Style. University of Chicago Press.
Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics (TACL), 4:491–505.
Michael White and Rajakrishnan Rajkumar. 2008. A more precise analysis of punctuation for broad-coverage surface realization with CCG. In Proceedings of the COLING 2008 Workshop on Grammar Engineering Across Frameworks, pages 17–24.
K. Xu, L. Xie, and K. Yao. 2016. Investigating LSTM for punctuation prediction. In 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5.
Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Annual Conference on Artificial Intelligence, pages 18–32.
Dongdong Zhang, Shuangzhi Wu, Nan Yang, and Mu Li. 2013. Punctuation prediction with transition-based parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 752–760.
Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), pages 188–193.