A Crossing-Sensitive Third-Order Factorization for Dependency Parsing


A Crossing-Sensitive Third-Order Factorization for Dependency Parsing

Emily Pitler∗
Google Research

76 9th Avenue
New York, NY 10011

epitler@google.com

Abstract

Parsers that parametrize over wider scopes are
generally more accurate than edge-factored
models. For graph-based non-projective
parsers, wider factorizations have so far im-
plied large increases in the computational
complexity of the parsing problem. This paper
introduces a “crossing-sensitive” generaliza-
tion of a third-order factorization that trades
off complexity in the model structure (i.e.,
scoring with features over multiple edges)
with complexity in the output structure (i.e.,
producing crossing edges). Under this model,
the optimal 1-Endpoint-Crossing tree can be
found in O(n4) time, matching the asymp-
totic run-time of both the third-order projec-
tive parser and the edge-factored 1-Endpoint-
Crossing parser. The crossing-sensitive third-
order parser is significantly more accurate
than the third-order projective parser under
many experimental settings and significantly
less accurate on none.

1 Introduction

Conditioning on wider syntactic contexts than sim-
ply individual head-modifier relationships improves
parsing accuracy in a wide variety of parsers and
frameworks (Charniak and Johnson, 2005; McDon-
ald and Pereira, 2006; Hall, 2007; Carreras, 2007;
Martins et al., 2009; Koo and Collins, 2010; Zhang
and Nivre, 2011; Bohnet and Kuhn, 2012; Martins
et al., 2013). This paper proposes a new graph-
based dependency parser that efficiently produces

∗The majority of this work was done while at the University
of Pennsylvania.

the globally optimal dependency tree according to a
third-order model (that includes features over grand-
parents and siblings in the tree) in the class of 1-
Endpoint-Crossing trees (that includes all projective
trees and the vast majority of non-projective struc-
tures seen in dependency treebanks).

Within graph-based projective parsing, the third-
order parser of Koo and Collins (2010) has a run-
time of O(n4), just one factor of n more expensive
than the edge-factored model of Eisner (2000). In-
corporating richer features and producing trees with
crossing edges has traditionally been a challenge,
however, for graph-based dependency parsers. If
parsing is posed as the problem of finding the op-
timal scoring directed spanning tree, then the prob-
lem becomes NP-hard when trees are scored with a
grandparent and/or sibling factorization (McDonald
and Pereira, 2006; McDonald and Satta, 2007). For
various definitions of mildly non-projective trees,
even edge-factored versions are expensive, with
edge-factored running times between O(n4) and
O(n7) (Gómez-Rodrı́guez et al., 2011; Pitler et al.,
2012; Pitler et al., 2013; Satta and Kuhlmann, 2013).

The third-order projective parser of Koo and
Collins (2010) and the edge-factored 1-Endpoint-
Crossing parser described in Pitler et al. (2013) have
some similarities: both use O(n4) time and O(n3)
space, using sub-problems over intervals with one
exterior vertex, which are constructed using one
free split point. The two parsers differ in how the
exterior vertex is used: Koo and Collins (2010)
use the exterior vertex to store a grandparent in-
dex, while Pitler et al. (2013) use the exterior ver-
tex to introduce crossed edges between the point and

41

Transactions of the Association for Computational Linguistics, 2 (2014) 41–54. Action Editor: Joakim Nivre.
Submitted 9/2013; Revised 11/2013; Published 2/2014. c©2014 Association for Computational Linguistics.


Projective 1-Endpoint-Crossing

Edge
O(n3) O(n4)
Eisner (2000) Pitler et al. (2013)

CS-GSib
O(n4) O(n4)
Koo and Collins (2010) This paper

Table 1: Parsing time for various output spaces and model
factorizations. CS-GSib refers to the (crossing-sensitive)
grand-sibling factorization described in this paper.

the interval. This paper proposes merging the two
parsers to achieve the best of both worlds – produc-
ing the best tree in the wider range of 1-Endpoint-
Crossing trees while incorporating the identity of
the grandparent and/or sibling of the child in the
score of an edge whenever the local neighborhood
of the edge does not contain crossing edges. The
crossing-sensitive grandparent-sibling 1-Endpoint-
Crossing parser proposed here takes O(n4) time,
matching the runtime of both the third-order pro-
jective parser and of the edge-factored 1-Endpoint-
Crossing parser (see Table 1).

The parsing algorithms of Koo and Collins (2010)
and Pitler et al. (2013) are reviewed in Section 2.
The proposed crossing-sensitive factorization is de-
fined in Section 3. The parsing algorithm that finds
the optimal 1-Endpoint-Crossing tree according to
this factorization is described in Section 4. The
implemented parser is significantly more accurate
than the third-order projective parser in a variety
of languages and treebank representations (Section
5). Section 6 discusses the proposed approach in the
context of prior work on non-projective parsing.

2 Preliminaries

In a projective dependency tree, each subtree forms
one consecutive interval in the sequence of input
words; equivalently (assuming an artificial root node
placed as either the first or last token), when all
edges are drawn in the half-plane above the sen-
tence, no two edges cross (Kübler et al., 2009). Two
vertex-disjoint edges cross if their endpoints inter-
leave. A 1-Endpoint-Crossing tree is a dependency
tree such that for each edge, all edges that cross it
share a common vertex (Pitler et al., 2013). Note
that the class of projective trees is properly included
within the class of 1-Endpoint-Crossing trees.

To avoid confusion between intervals and edges,

g h e

=

g h m

+

h m e

(a) m is the child of h that e is descended from

g h

=

g h

+

hm ss m

(b) The edge ~ehm is added to the tree; s is m’s
adjacent inner sibling

= +

hm h s r+1r msh

(c) r is s’s outermost descendant; r + 1 is m’s
innermost descendant

Figure 1: Algorithm for grand-sibling projective parsing;
the figures replicate Figure 6 in Koo and Collins (2010).

~eij denotes the directed edge from i to j (i.e., i is the
parent of j). Interval notation ((i, j), [i, j], (i, j], or
[i, j)) is used to denote sets of vertices between i and
j, with square brackets indicating closed intervals
and round brackets indicating open intervals.

2.1 Grand-Sibling Projective Parsing

A grand-sibling factorization allows features over
4-tuples of (g, h, m, s), where h is the parent of
m, g is m’s grandparent, and s is m’s adjacent in-
ner sibling. Features over these grand-sibling 4-
tuples are referred to as “third-order” because they
scope over three edges simultaneously (~egh, ~ehs, and
~ehm). The parser of Koo and Collins (2010) pro-
duces the highest-scoring projective tree according
to this grand-sibling model by adding an external
grandparent index to each of the sub-problems used
in the sibling factorization (McDonald and Pereira,
2006). Figure 6 in Koo and Collins (2010) provided
a pictorial view of the algorithm; for convenience, it
is replicated in Figure 1. An edge ~ehm is added to the
tree in the “trapezoid” step (Figure 1b); this allows
the edge to be scored conditioned on m’s grandpar-
ent (g) and its adjacent inner sibling (s), as all four
relevant indices are accessible.

2.2 Edge-factored 1-Endpoint-Crossing
Parsing

The edge-factored 1-Endpoint-Crossing parser of
Pitler et al. (2013) produces the highest scoring 1-

42


* Which cars do Americans
0 1 2 3 4

?daysfavor most these
98765

Figure 2: A 1-Endpoint-Crossing non-projective English
sentence from the WSJ Penn Treebank (Marcus et al.,
1993), converted to dependencies with PennConverter
(Johansson and Nugues, 2007).

do Americans favor

do ?daysfavor most these

* do

* Which cars do favor

Figure 3: Constructing a 1-Endpoint-Crossing tree with
intervals with one exterior vertex (Pitler et al., 2013).

Endpoint-Crossing tree with each edge ~ehm scored
according to Score(Edge(h, m)). The 1-Endpoint-
Crossing property allows the tree to be built up in
edge-disjoint pieces each consisting of intervals with
one exterior point that has edges into the interval.
For example, the tree in Figure 2 would be built up
with the sub-problems shown in Figure 3.

To ensure that crossings within a sub-problem are
consistent with the crossings that happen as a result
of combination steps, the algorithm uses four dif-
ferent “types” of sub-problems, indicating whether
the edges incident to the exterior point may be inter-
nally crossed by edges incident to the left boundary
point (L), the right (R), either (LR), or neither (N).
In Figure 3, the sub-problem over [*, do] ∪{favor}
would be of type R, and [favor, ?]∪{do} of type L.

2.2.1 Naı̈ve Approach to Including
Grandparent Features

The example in Figure 3 illustrates the difficulty of
incorporating grandparents into the scoring of all
edges in 1-Endpoint-Crossing parsing. The vertex
favor has a parent or child in all three of the sub-
problems. In order to use grandparent scoring for
the edges from favor to favor’s children in the other
two sub-problems, we would need to augment the
problems with the grandparent index do. We also

must add the parent index do to the middle sub-
problem to ensure consistency (i.e., that do is in fact
the parent assigned). Thus, a first attempt to score all
edges with grandparent features within 1-Endpoint-
Crossing trees raises the runtime from O(n4) to
O(n7) (all of the four indices need a “predicted par-
ent” index; at least one edge is always implied so
one of these additional indices can be dropped).

3 Crossing-Sensitive Factorization

Factorizations for projective dependency parsing
have often been designed to allow efficient pars-
ing. For example, the algorithms in Eisner (2000)
and McDonald and Pereira (2006) achieve their ef-
ficiency by assuming that children to the left of the
parent and to the right of the parent are independent
of each other. The algorithms of Carreras (2007)
and Model 2 in Koo and Collins (2010) include
grandparents for only the outermost grand-children
of each parent for efficiency reasons. In a similar
spirit, this paper introduces a variant of the Grand-
Sib factorization that scores crossed edges indepen-
dently (as a CrossedEdge part) and uncrossed edges
under either a grandparent-sibling, grandparent, sib-
ling, or edge-factored model depending on whether
relevant edges in its local neighborhood are crossed.

A few auxiliary definitions are required. For any
parent h and grandparent g, h’s children are parti-
tioned into interior children (those between g and h)
and exterior children (the complementary set of chil-
dren).1 Interior children are numbered from closest
to h through furthest from h; exterior children are
first numbered on the side closer to h from closest
to h through furthest, then the enumeration wraps
around to include the vertices on the side closer to g.
Figure 4 shows a parent h, its grandparent g, and a
possible sequence of three interior and four exterior
children. Note that for a projective tree, there would
not be any children on the far side of g.

Definition 1. Let h be m’s parent. Outer(m) is the
set of siblings of m that are in the same subset of h’s
children and are later in the enumeration than m is.

For example, in the tree in Figure 2,

1Because dependency trees are directed trees, each node ex-
cept for the artificial root has a unique parent. To ensure that
grandparent is defined for the root’s children, assume an artifi-
cial parent of the root for notational convenience.

43


e1 e2i1i2i3 hge3 e4

Figure 4: The exterior children are numbered first begin-
ning on the side closest to the parent, then the side closest
to the grandparent. There must be a path from the root to
g, so the edges from h to its exterior children on the far
side of g are guaranteed to be crossed.

Crossed(~ehs) ¬Crossed(~ehs)
¬GProj(~ehm) Edge(h, m) Sib(h, m, s)
GProj(~ehm) Grand(g, h, m) GrandSib(g, h, m, s)

Table 2: Part type for an uncrossed edge ~ehm for
the crossing-sensitive third-order factorization (g is m’s
grandparent; s is m’s inner sibling).

Outer(most) = {days, cars}.
Definition 2. An uncrossed edge ~ehm is GProj if
both of the following hold:

1. The edge ~egh from h’s parent to h is not crossed

2. None of the edges from h to Outer(m) (m’s
outer siblings) are crossed

Uncrossed GProj edges include the grandparent
in the part. The part includes the sibling if the edge
~ehs from the parent to the sibling is not crossed. Ta-
ble 2 gives the factorization for uncrossed edges.

The parser in this paper finds the optimal 1-
Endpoint-Crossing tree according to this factorized
form. A fully projective tree would decompose into
exclusively GrandSib parts (as all edges would be
uncrossed and GProj ). As all projective trees are
within the 1-Endpoint-Crossing search space, the
optimization problem that the parser solves includes
all projective trees scored with grand-sibling fea-
tures everywhere. Projective parsing with grand-
sibling scores can be seen as a special case, as the
crossing-sensitive 1-Endpoint-Crossing parser can
simulate a grand-sibling projective parser by setting
all Crossed(h, m) scores to −∞.

In Figure 2, the edge from do to Americans is
not GProj because Condition (1) is violated, while
the edge from favor to most is not GProj because
Condition (2) is violated. Under this definition, the
vertices do and favor (which have children in mul-
tiple sub-problems) do not need external grandpar-
ent indices in any of their sub-problems. Table 3

CrossedEdge(*,do) Sib(cars, Which, -)
CrossedEdge(favor,cars) Sib(do, Americans, -)
Sib(do, favor, Americans) CrossedEdge(do,?)
Sib(favor, most, -) Sib(favor, days, most)
GSib(favor, days, these, -)

Table 3: Decomposing Figure 2 according to the
crossing-sensitive third-order factorization described in
Section 3. Null inner siblings are indicated with -.

lists the parts in the tree in Figure 2 according to this
crossing-sensitive third-order factorization.

4 Parsing Algorithm

The parser finds the maximum scoring 1-Endpoint-
Crossing tree according to the factorization in Sec-
tion 3 with a dynamic programming procedure rem-
iniscent of Koo and Collins (2010) (for scoring un-
crossed edges with grandparent and/or sibling fea-
tures) and of Pitler et al. (2013) (for including
crossed edges). The parser also uses novel sub-
problems for transitioning between portions of the
tree with and without crossed edges. This formula-
tion of the parsing problem presents two difficulties:

1. The parser must know whether an edge is
crossed when it is added.

2. For uncrossed edges, the parser must use
the appropriate part for scoring according to
whether other edges are crossed (Table 2).

Difficulty 1 is solved by adding crossed and un-
crossed edges to the tree in distinct sub-problems
(Section 4.1). Difficulty 2 is solved by producing
different versions of subtrees over the same sets of
vertices, both with and without a grandparent index,
which differ in their assumptions about the tree out-
side of that set (Section 4.2). The list of all sub-
problems with their invariants and the full dynamic
program are provided in the supplementary material.

4.1 Enforcing Crossing Edges

The parser adds crossed and uncrossed edges in
distinct portions of the dynamic program. Un-
crossed edges are added only through trapezoid sub-
problems (that may or may not have a grandpar-
ent index), while crossed edges are added in non-
trapezoid sub-problems. To add all uncrossed edges

44


in trapezoid sub-problems, the parser (a) enforces
that any edge added anywhere else must be crossed,
and (b) includes transitional sub-problems to build
trapezoids when the edge ~ehm is not crossed, but the
edge to its inner sibling ~ehs is (and so the construc-
tion step shown in Figure 1b cannot be used).

4.1.1 Crossing Conditions
Pitler et al. (2013) included crossing edges by using
“crossing region” sub-problems over intervals with
an external vertex that optionally contained edges
between the interval and the external vertex. An
uncrossed edge could then be included either by a
derivation that prohibited it from being crossed or
a derivation which allowed (but did not force) it to
be crossed. This ambiguity is removed by enforcing
that (1) each crossing region contains at least one
edge incident to the exterior vertex, and (2) all such
edges are crossed by edges in another sub-problem.
For example, by requiring at least one edge between
do and (favor, ?] and also between favor and (*, do),
the edges in the two sets are guaranteed to cross.

4.1.2 Trapezoids with Edge to Inner Sibling
Crossed

To add all uncrossed edges in trapezoid-style sub-
problems, we must be able to construct a trapezoid
over vertices [h, m] whenever the edge ~ehm is not
crossed. The construction used in Koo and Collins
(2010), repeated graphically in Figure 5a, cannot
be used if the edge ~ehs is crossed, as there would
then exist edges between (h, s) and (s, m), making
s an invalid split point. The parser therefore includes
some “transitional glue” to allow alternative ways to
construct the trapezoid over [h, m] when ~ehm is not
crossed but the edge ~ehs to m’s inner sibling is.

The two additional ways of building trapezoids
are shown graphically in Figures 5b and 5c. Con-
sider the “chain of crossing edges” that includes the
edge ~ehs. If none of these edges are in the subtree
rooted at m, then we can build the tree involving
m and its inner descendants separately (Figure 5b)
from the rest of the tree rooted at h. Within the in-
terval [h, e − 1] the furthest edge incident to h (~ehs)
must be crossed: these intervals are parsed choosing
s and the crossing point of ~ehs simultaneously (as in
Figure 4 in Pitler et al. (2013)).

Otherwise, the sub-tree rooted at m is involved in

g h

=

g h

+

hm ss m

(a) Edge from h to inner sibling s is not crossed (re-
peats Figure 1b)

g h

=

hm mh

+

ee−1

(b) ~ehs is crossed, but the chain of crossing edges
involving ~ehs does not include any descendants of m.
e is m’s descendant furthest from m within (h, m).
s ∈ (h, e−1).

h m

+

d
=

mg h h d

(c) ~ehs is crossed, and the chain of crossing edges
involving ~ehs includes descendants of m. Of m’s de-
scendants that are incident to edges in the chain, d is
the one closest to m (d can be m itself). s ∈ (h, d).

Figure 5: Ways to build a trapezoid when the edge ~ehs to
m’s inner sibling may be crossed.

the chain of crossing edges (Figure 5c). The chain
of crossing edges between h and d (m’s descendant,
which may be m itself) is built up first then concate-
nated with the triangle rooted at m containing m’s
inner descendants not involved in the chain.

Chains of crossing edges are constructed by re-
peatedly applying two specialized types of L items
that alternate between adding an edge from the in-
terval to the exterior point (right-to-left) or from
the exterior point to the interval (left-to-right) (Fig-
ure 6). The boundary edges of the chain can
be crossed more times without violating the 1-
Endpoint-Crossing property, and so the beginning
and end of the chain can be unrestricted crossing
regions. These specialized chain sub-problems are
also used to construct boxes (Figure 1c) over [s, m]
with shared parent h when neither edge ~ehs nor ~ehm
is crossed, but the subtrees rooted at m and at s cross
each other (Figure 7).

Lemma 1. The GrandSib-Crossing parser adds all
uncrossed edges and only uncrossed edges in a tree
in a “trapezoid” sub-problem.

Proof. The only part is easy: when a trapezoid is
built over an interval [h, m], all edges are internal to
the interval, so no earlier edges could cross ~ehm. Af-

45


= +

h s k s k

+

s k dh d

di k

x i d di k

= +

k k

+

x i

x i d

= +

k k

+

x idx i

= +

i d x i d

Figure 6: Constructing a chain of crossing edges

h d m

+

h d m h me

=

h s m h s d

=

d e

+

Figure 7: Constructing a box when edges in m and s’s
subtrees cross each other.

ter the trapezoid is built, only the interval endpoints
h and m are accessible for the rest of the dynamic
program, and so an edge between a vertex in (h, m)
and a vertex /∈ [h, m] can never be added. The
Crossing Conditions ensure that every edge added
in a non-trapezoid sub-problem is crossed.

Lemma 2. The GrandSib-Crossing parser con-
siders all 1-Endpoint-Crossing trees and only 1-
Endpoint-Crossing trees.

Proof. All trees that could have been built in Pitler
et al. (2013) are still possible. It can be verified that
the additional sub-problems added all obey the 1-
Endpoint-Crossing property.

4.2 Reduced Context in Presence of Crossings

A crossed edge (added in a non-trapezoid sub-
problem) is scored as a CrossedEdge part. An
uncrossed edge added in a trapezoid sub-problem,
however, may need to be scored according to a
GrandSib, Grand, Sib, or Edge part, depending on
whether the relevant other edges are crossed. In this
section we show that sibling and grandparent fea-
tures are included in the GrandSib-Crossing parser
as specified by Table 2.

do favor most these days

(a) For good contexts

favor most these daysdo

(b) For bad contexts

Figure 8: For each of the interval sub-problems in Koo
and Collins (2010), the parser constructs versions with
and without the additional grandparent index. Figure 8b
is used if the edge from do to favor is crossed, or if there
are any crossed edges from favor to children to the left of
do or to the right of days. Otherwise, Figure 8a is used.

4.2.1 Sibling Features

Lemma 3. The GrandSib-Crossing parser scores an
uncrossed edge ~ehm with a Sib or GrandSib part if
and only if ~ehs is not crossed.

Proof. Whether the edge to an uncrossed edge’s in-
ner sibling is crossed is known bottom-up through
how the trapezoid is constructed, since the inner sib-
ling is internal to the sub-problem. When ~ehs is not
crossed, the trapezoid is constructed as in Figure 5a,
using the inner sibling as the split point. When the
edge ~ehs is crossed, the trapezoid is constructed as in
Figure 5b or 5c; note that both ways force the edge
to the inner sibling to be crossed.

4.2.2 Grandparent Features for GProj Edges
Koo and Collins (2010) include an external grand-
parent index for each of the sub-problems that the
edges within use for scoring. We want to avoid
adding such an external grandparent index to any
of the crossing region sub-problems (to stay within
the desired time and space constraints) or to inter-
val sub-problems when the external context would
make all internal edges ¬GProj .

For each interval sub-problem, the parser con-
structs versions both with and without a grandpar-
ent index (Figure 8). Which version is used de-
pends on the external context. In a bad context, all
edges to children within an interval are guaranteed
to be ¬GProj . This section shows that all boundary
points in crossing regions are placed in bad contexts,
and then that edges are scored with grandparent fea-
tures if and only if they are GProj .

Bad Contexts for Interval Boundary Points For
exterior vertex boundary points, all edges from it to
its children will be crossed (Section 4.1.1), so it does
not need a grandparent index.

46


Lemma 4. If a boundary point i’s parent (call it g)
is within a sub-problem over vertices [i, j] or [i, j]∪
{x}, then for all uncrossed edges ~eim with m in the
sub-problem, the tree outside of the sub-problem is
irrelevant to whether ~eim is GProj .

Proof. The sub-problem contains the edge ~egi, so
Condition (1) is checked internally. m cannot be
x, since ~eim is uncrossed. If g is x, then ~eim is
¬GProj regardless of the outer context. If both g
and m ∈ (i, j], then Outer(m) ⊆ (i, j]: If m is an
interior child of i (m ∈ (i, g)) then Outer(m) ⊆
(m, g) ⊆ (i, j]. Otherwise, if m is an exterior child
(m ∈ (g, j]), by the “wrapping around” definition of
Outer, Outer(m) ⊆ (g, m) ⊆ (i, j]. Thus Condi-
tion (2) is also checked internally.

We can therefore focus on interval boundary
points with their parent outside of the sub-problem.

Definition 3. The left boundary vertex of an inter-
val [i, j] is in a bad context (BadContextL(i, j)) if
i receives its parent (call it g) from outside of the
sub-problem and either of the following hold:

1. Grand-Edge Crossed: ~egi is crossed

2. Outer-Child-Edge Crossed: An edge from i to
a child of i outside of [i, j] and Outer to j will
be crossed (recall this includes children on the
far side of g if g is to the left of i)

BadContextR(i, j) is defined symmetrically regard-
ing j and j’s parent and children.

Corollary 1. If BadContextL(i, j), then for all ~eim
with m ∈ (i, j], ~eim is ¬GProj . Similarly, if
BadContextR(i, j), for all ~ejm with m ∈ [i, j), ~ejm
is ¬GProj .
No Grandparent Indices for Crossing Regions
We would exceed the desired O(n4) run-time if
any crossing region sub-problems needed any grand-
parent indices. In Pitler et al. (2013), LR sub-
problems with edges from the exterior point crossed
by both the left and the right boundary points were
constructed by concatenating an L and an R sub-
problem. Since the split point was not necessar-
ily incident to a crossed edge, the split point might
have GProj edges to children on the side other than
where it gets its parent; accommodating this would
add another factor of n to the running time and space

x k jx i j
= +

kix

Figure 9: For all split points k, the edge from k’s parent
to k is crossed, so all edges from k to children on either
side were ¬GProj . The case when the split point’s parent
is from the right is symmetric.

x i k j

(a) x is Outer to all
children of k in (k, j].

x i k j

(b) x is Outer to all
children of k in [i, k).

Figure 10: The edge ~ekx is guaranteed to be crossed, so
k is in a BadContext for whichever side it does not get
its parent from.

to store the split point’s parent. To avoid this in-
crease in running time, they are instead built up as
in Figure 9, which chooses the split point so that the
edge from the parent of the split point to it is crossed.

Lemma 5. For all crossing region sub-problems
[i, j] ∪ {x} with i’s parent /∈ [i, j] ∪ {x},
BadContextL(i, j). Similarly, when j’s parent /∈
[i, j]∪{x}, BadContextR(i, j).

Proof. Crossing region sub-problems either com-
bine to form intervals or larger crossing regions.
When they combine to form intervals as in Figure
3, it can be verified that all boundary points are in
a bad context. LR sub-problems were discussed
above. Split points for the L/R/N sub-problems by
construction are incident to a crossed edge to a fur-
ther vertex. If that edge is from the split point’s par-
ent to the split point, then the grand-edge is crossed
and so both sides are in a bad context. If the crossed
edge is from the split point to a child, then that child
is Outer to all other children on the side in which it
does not get its parent (see Figure 10).

Corollary 2. No grandparent indices are needed for
any crossing region sub-problem.

Triangles and Trapezoids with and without
Grandparent Indices The presentation that fol-
lows assumes left-headed versions. Uncrossed
edges are added in two distinct types of trapezoids:
(1) TrapG[h, m, g, L] with an external grandpar-
ent index g, scores the edge ~ehm with grandpar-

47


ent features, and (2) Trap[h, m, L] without a grand-
parent index, scores the edge ~ehm without grand-
parent features. Triangles also have versions with
(TriG[h, e, g, L] and without (Tri[h, e, L]) a grand-
parent index. What follows shows that all GProj
edges are added in TrapG sub-problems, and all
¬GProj uncrossed edges are added in Trap sub-
problems.

Lemma 6. For all k ∈ (i, j), if BadContextL(i, j),
then BadContextL(i, k). Similarly, if
BadContextR(i, j), then BadContextR(k, j).

Proof. BadContextL(i, j) implies either the edge
from i’s parent to i is crossed and/or an edge from i
to a child of i outer to j is crossed. If the edge from
i’s parent to i is crossed, then BadContextL(i, k). If
a child of i is outer to j, then since k ∈ (i, j), such a
child is also outer to k.

Lemma 7. All left-rooted triangle sub-problems
Tri[i, j, L] without a grandparent index are in a
BadContextL(i, j). Similarly for all Tri[i, j, R],
BadContextR(i, j).

Proof. All triangles without grandparent indices are
either placed immediately into a bad context (by
adding a crossed edge to the triangle’s root from its
parent, or a crossed edge from the root to an outer
child) or are combined with other sub-trees to form
larger crossing regions (and therefore the triangle is
in a bad context, using Lemmas 5 and 6).

Lemma 8. All triangle sub-problems with a grand-
parent index TriG[h, e, g, L] are placed in a
¬BadContextL(h, e). Similarly, TriG[e, h, g, R]
are only placed in ¬BadContextR(h, e).
Proof. Consider where a non-empty triangle (h 6=
e) with a grandparent index TriG[h, e, g, L] can be
placed in the full dynamic program and what each
step would imply about the rest of the tree.

If the triangle contains exterior children of h (e
and g are on opposite sides of h), then it can either
combine with a trapezoid to form another larger tri-
angle (as in Figure 1a) or it can combine with an-
other sub-problem to form a box with a grandpar-
ent index (Figure 1c or 7). Boxes with a grandpar-
ent index can only combine with another trapezoid
to form a larger trapezoid (Figure 1b). Both cases

force ~egh to not be crossed and prevent h from hav-
ing any outer crossed children, as h becomes an in-
ternal node within the larger sub-problem.

If the triangle contains interior children of h (e
lies between g and h), then it can either form a trape-
zoid from g to h by combining with a triangle (Fig-
ure 5b) or a chain of crossing edges (Figure 5c), or it
can be used to build a box with a grandparent index
(Figures 1c and 7), which then can only be used to
form a trapezoid from g to h. In either case, a trape-
zoid is constructed from g to h, enforcing that ~egh
cannot be crossed. These steps prevent h from hav-
ing any additional children between g and e (since h
does not appear in the adjacent sub-problems at all
whenever h 6= e), so again the children of h in (e, h)
have no outer siblings.

Lemma 9. In a TriG[h, e, g, L] sub-problem, if an
edge ~ehm is not crossed and no edges from i to sib-
lings of m in (m, e] are crossed, then ~ehm is GProj .

Proof. This follows from (1) the edge ~ehm is not
crossed, (2) the edge ~egh is not crossed by Lemma 8,
and (3) no outer siblings are crossed (outer siblings
in (m, e] are not crossed by assumption and siblings
outer to e are not crossed by Lemma 8).

Lemma 10. An edge ~ehm scored with a GrandSib
or Grand part (added through a TrapG[h, m, g, L]
or TrapG[m, h, g, R] sub-problem) is GProj .

Proof. A TrapG can either (1) combine with de-
scendants of m to form a triangle with a grandparent
index rooted at h (indicating that m is the outermost
inner child of h) or (2) combine with descendants
of m and of m’s adjacent outer sibling (call it o),
forming a trapezoid from h to o (indicating that ~eho
is not crossed). Such a trapezoid could again only
combine with further uncrossed outer siblings until
eventually the final triangle rooted at h with grand-
parent index g is built. As ~ehm was not crossed, no
edges from h to outer siblings within the triangle are
crossed, and ~ehm is within a TriG sub-problem, ~ehm
is GProj by Lemma 9.

Lemma 11. An uncrossed edge ~ehm scored with a
Sib or Edge part (added through a Trap[h, m, L] or
Trap[m, h, R] sub-problem) is ¬GProj .

48


Proof. A Trap can only (1) form a triangle without
a grandparent index, or (2) form a trapezoid to an
outer sibling of m, until eventually a final triangle
rooted at h without a grandparent index is built. This
triangle without a grandparent index is then placed
in a bad context (Lemma 7) and so ~ehm is ¬GProj
(Corollary 1).

4.3 Main Results

Lemma 12. The crossing-sensitive third-order
parser runs in O(n4) time and O(n3) space when
the input is an unpruned graph. When the input
to the parser is a pruned graph with at most k in-
coming edges per node, the crossing-sensitive third-
order parser runs in O(kn3) time and O(n3) space.

Proof. All sub-problems are either over intervals
(two indices), intervals with a grandparent index
(three indices), or crossing regions (three indices).
No crossing regions require any grandparent indices
(Corollary 2). The only sub-problems that require
a maximization over two internal split points are
over intervals and need no grandparent indices (as
the furthest edges from each root are guaranteed to
be crossed within the sub-problem). All steps ei-
ther contain an edge in their construction step or in
the invariant of the sub-problem, so with a pruned
graph as input, the running time is the number of
edges (O(kn)) times the number of possibilities for
the other two free indices (O(n2)). The space is not
reduced as there is not necessarily an edge relation-
ship between the three stored vertices.

Theorem 1. The GrandSib-Crossing parser cor-
rectly finds the maximum scoring 1-Endpoint-
Crossing tree according to the crossing-sensitive
third-order factorization (Section 3) in O(n4) time
and O(n3) space. When the input to the parser is
a pruned graph with at most k incoming edges per
node, the GrandSib-Crossing parser correctly finds
the maximum scoring 1-Endpoint-Crossing tree that
uses only unpruned edges in O(kn3) time and
O(n3) space.

Proof. The correctness of scoring follows from
Lemmas 3, 10, and 11. The search space of 1-
Endpoint-Crossing trees was in Lemma 2 and the
time and space complexity in Lemma 12.

The parser produces the optimal tree in a well-
defined output space. Pruning edges restricts the
output space the same way that constraints enforc-
ing projectivity or the 1-Endpoint-Crossing property
also restrict the output space. Note that if the optimal
unconstrained 1-Endpoint-Crossing tree does not in-
clude any pruned edges, then whether the parser uses
pruning or not is irrelevant; both the pruned and un-
pruned parsers will produce the exact same tree.

5 Experiments

The crossing-sensitive third-order parser was imple-
mented as an alternative parsing algorithm within
dpo3 (Koo and Collins, 2010).2 To ensure a fair
comparison, all code relating to input/output, fea-
tures, learning, etc. was re-used from the origi-
nal projective implementation, and so the only sub-
stantive differences between the projective and 1-
Endpoint-Crossing parsers are the dynamic pro-
gramming charts, the parsing algorithms, and the
routines that extract the maximum scoring tree from
the completed chart.

The treebanks used to prepare the CoNLL shared
task data (Buchholz and Marsi, 2006; Nivre et al.,
2007) vary widely in their conventions for repre-
senting conjunctions, modal verbs, determiners, and
other decisions (Zeman et al., 2012). The exper-
iments use the newly released HamleDT software
(Zeman et al., 2012) that normalizes these treebanks
into one standard format and also provides built-in
transformations to other conjunction styles. The un-
normalized treebanks input to HamleDT were from
the CoNLL 2006 Shared Task (Buchholz and Marsi,
2006) for Danish, Dutch, Portuguese, and Swedish
and from the CoNLL 2007 Shared Task (Nivre et al.,
2007) for Czech.

The experiments include the default Prague
style (Böhmová et al., 2001), Mel’čukian style
(Mel’čuk, 1988), and Stanford style (De Marneffe
and Manning, 2008) for conjunctions. Under the
grandparent-sibling factorization, the two words be-
ing conjoined would never appear in the same scope
for the Prague style (as they are siblings on differ-
ent sides of the conjunct head). In the Mel’čukian
style, the two conjuncts are in a grandparent rela-
tionship and in the Stanford style the two conjuncts

2http://groups.csail.mit.edu/nlp/dpo3/

49


are in a sibling relationship, and so we would expect
to see larger gains for including grandparents and
siblings under the latter two representations. The
experiments also include a nearly projective dataset,
the English Penn Treebank (Marcus et al., 1993),
converted to dependencies with PennConverter (Jo-
hansson and Nugues, 2007).

The experiments use marginal-based pruning
based on an edge-factored directed spanning tree
model (McDonald et al., 2005). Each word’s set of
potential parents is limited to those with a marginal
probability of at least .1 times the probability of the
most probable parent, and cut off this list at a max-
imum of 20 potential parents per word. To ensure
that there is always at least one projective and/or 1-
Endpoint-Crossing tree achievable, the artificial root
is always included as an option. The pruning param-
eters were chosen to keep 99.9% of the true edges
on the English development set.

Following Carreras (2007) and Koo and Collins
(2010), before training the training set trees are
transformed to be the best achievable within the
model class (i.e., the closest projective tree or 1-
Endpoint-Crossing tree). All models are trained
for five iterations of averaged structured perceptron
training. For English, the model after the iteration
that performs best on the development set is used;
for all other languages, the model after the fifth iter-
ation is used.

5.1 Results

Results for edge-factored and (crossing-sensitive)
grandparent-sibling factored models for both projec-
tive and 1-Endpoint-Crossing parsing are in Tables
4 and 5. In 14 out of the 16 experimental set-ups,
the third-order 1-Endpoint-Crossing parser is more
accurate than the third-order projective parser. It is
significantly better than the projective parser in 9 of
the set-ups and significantly worse in none.

Table 6 shows how often the 1-EC CS-GSib
parser used each of the GrandSib, Grand, Sib,
Edge, and CrossedEdge parts for the Mel’čukian
and Stanford style test sets. In both representations,

3Following prior work in graph-based dependency parsing
(for example, Rush and Petrov (2012)), English results use au-
tomatically produced part-of-speech tags and results exclude
punctuation, while the results for all other languages use gold
part-of-speech tags and include punctuation.

Model Du Cz Pt Da Sw
Prague

Proj GSib 80.45 85.12 88.85 88.17 85.50
Proj Edge 80.38 84.04 88.14 88.29 86.09
1-EC CS-GSib 82.78 85.90 89.74 88.64 85.70
1-EC Edge 83.33 84.97 89.21 88.19 86.46

Mel’čukian
Proj GSib 82.26 87.96 89.19 90.23 89.59
Proj Edge 82.09 86.18 88.73 89.29 89.00
1-EC CS-GSib 86.03 87.89 90.34 90.50 89.34
1-EC Edge 85.28 87.57 89.96 90.14 88.97

Stanford
Proj GSib 81.16 86.83 88.80 88.84 87.27
Proj Edge 80.56 86.18 88.61 88.69 87.92
1-EC CS-GSib 84.67 88.34 90.20 89.22 88.15
1-EC Edge 83.62 87.13 89.43 88.74 87.36

Table 4: Overall Unlabeled Attachment Scores (UAS) for
all words.3 CS-GSib is the proposed crossing-sensitive
grandparent-sibling factorization. For each data set, we
bold the most accurate model and those not significantly
different from the most accurate (sign test, p < .05). Lan-
guages are sorted in increasing order of projectivity.

Model UAS
Proj GSib 93.10
Proj Edge 92.63
1-EC CS-GSib 93.22
1-EC Edge 92.80

Table 5: English results

the parser is able to score with a sibling context
more often than it is able to score with a grandpar-
ent, perhaps explaining why the datasets using the
Stanford conjunction representation saw the largest
gains from including the higher order factors into the
1-Endpoint-Crossing parser.

Across languages, the third-order 1-Endpoint-
Crossing parser runs 2.1-2.7 times slower than the
third-order projective parser (71-104 words per sec-
ond, compared with 183-268 words per second).
Parsing speed is correlated with the amount of prun-
ing. The level of pruning mentioned earlier is rela-
tively permissive, retaining 39.0-60.7% of the edges
in the complete graph; a higher level of pruning
could likely achieve much faster parsing times with
the same underlying parsing algorithms.

50


Part Used Du Cz Pt Da Sw
Mel’čukian

CrossedEdge 8.5 4.5 3.2 1.4 1.2
GrandSib 81.2 89.1 90.7 95.7 96.2
Grand 1.1 0.5 0.8 0.3 0.2
Sib 9.0 5.8 5.2 2.6 2.3
Edge < 0.1 < 0.1 0 < 0.1 0

Stanford
CrossedEdge 8.4 5.1 3.3 2.0 1.8
GrandSib 81.4 87.8 90.5 94.2 95.2
Grand 1.1 0.5 0.7 0.3 0.3
Sib 8.9 6.5 5.2 3.5 2.6
Edge < 0.1 0.1 0 < 0.1 0

Table 6: The proportion of edges in the predicted output
trees from the CS-GSib 1-Endpoint-Crossing parser that
would have used each of the five part types for scoring.

6 Discussion

There have been many other notable approaches to
non-projective parsing with larger scopes than single
edges, including transition-based parsers, directed
spanning tree graph-based parsers, and mildly non-
projective graph-based parsers.

Transition-based parsers score actions that the
parser may take to transition between different
configurations. These parsers typically use either
greedy or beam search, and can condition on any
tree context that is in the history of the parser’s
actions so far. Zhang and Nivre (2011) signifi-
cantly improved the accuracy of an arc-eager tran-
sition system (Nivre, 2003) by adding several ad-
ditional classes of features, including some third-
order features. Basic arc-eager and arc-standard
(Nivre, 2004) models that parse left-to-right using
a stack produce projective trees, but transition-based
parsers can be modified to produce crossing edges.
Such modifications include pseudo-projective pars-
ing in which the dependency labels encode transfor-
mations to be applied to the tree (Nivre and Nilsson,
2005), adding actions that add edges to words in the
stack that are not the topmost item (Attardi, 2006),
adding actions that swap the positions of words
(Nivre, 2009), and adding a second stack (Gómez-
Rodrı́guez and Nivre, 2010).

Graph-based approaches to non-projective pars-
ing either consider all directed spanning trees or re-
stricted classes of mildly non-projective trees. Di-
rected spanning tree approaches with higher order
features either use approximate learning techniques,

such as loopy belief propagation (Smith and Eis-
ner, 2008), or use dual decomposition to solve relax-
ations of the problem (Koo et al., 2010; Martins et
al., 2013). While not guaranteed to produce optimal
trees within a fixed number of iterations, these dual
decomposition techniques do give certificates of op-
timality on the instances in which the relaxation is
tight and the algorithm converges quickly.

This paper described a mildly non-projective
graph-based parser. Other parsers in this class find
the optimal tree in the class of well-nested, block
degree two trees (Gómez-Rodrı́guez et al., 2011),
or in a class of trees further restricted based on
gap inheritance (Pitler et al., 2012) or the head-split
property (Satta and Kuhlmann, 2013), with edge-
factored running times of O(n5) − O(n7). The
factorization used in this paper is not immediately
compatible with these parsers: the complex cases in
these parsers are due to gaps, not crossings. How-
ever, there may be analogous “gap-sensitive” factor-
izations that could allow these parsers to be extended
without large increases in running times.

7 Conclusion

This paper proposed an exact, graph-based algo-
rithm for non-projective parsing with higher order
features. The resulting parser has the same asymp-
totic run time as a third-order projective parser, and
is significantly more accurate for many experimental
settings. An exploration of other factorizations that
facilitate non-projective parsing (for example, an
analogous “gap-sensitive” variant) may be an inter-
esting avenue for future work. Recent work has in-
vestigated faster variants for third-order graph-based
projective parsing (Rush and Petrov, 2012; Zhang
and McDonald, 2012) using structured prediction
cascades (Weiss and Taskar, 2010) and cube prun-
ing (Chiang, 2007). It would be interesting to extend
these lines of work to the crossing-sensitive third-
order parser as well.

Acknowledgments

I would like to thank Sampath Kannan, Mitch Mar-
cus, Chris Callison-Burch, Michael Collins, Mark
Liberman, Ben Taskar, Joakim Nivre, and the three
anonymous reviewers for valuable comments on ear-
lier versions of this material.

51


References
G. Attardi. 2006. Experiments with a multilanguage

non-projective dependency parser. In Proceedings of
CoNLL, pages 166–170.

A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká. 2001.
The Prague Dependency Treebank: Three-level anno-
tation scenario. In Anne Abeillé, editor, Treebanks:
Building and Using Syntactically Annotated Corpora,
pages 103–127. Kluwer Academic Publishers.

B. Bohnet and J. Kuhn. 2012. The best of both worlds
– a graph-based completion model for transition-based
parsers. In Proceedings of EACL, pages 77–87.

S. Buchholz and E. Marsi. 2006. CoNLL-X shared task
on multilingual dependency parsing. In Proceedings
of CoNLL, pages 149–164.

X. Carreras. 2007. Experiments with a higher-order
projective dependency parser. In Proceedings of the
CoNLL Shared Task Session of EMNLP-CoNLL, pages
957–961.

E. Charniak and M. Johnson. 2005. Coarse-to-fine n-
best parsing and maxent discriminative reranking. In
Proceedings of ACL, pages 173–180.

D. Chiang. 2007. Hierarchical phrase-based translation.
Computational Linguistics, 33(2):201–228.

M. De Marneffe and C. Manning. 2008. Stanford typed
dependencies manual.

J. Eisner. 2000. Bilexical grammars and their cubic-
time parsing algorithms. In Harry Bunt and Anton
Nijholt, editors, Advances in Probabilistic and Other
Parsing Technologies, pages 29–62. Kluwer Academic
Publishers.

C. Gómez-Rodrı́guez and J. Nivre. 2010. A transition-
based parser for 2-planar dependency structures. In
Proceedings of ACL, pages 1492–1501.

C. Gómez-Rodrı́guez, J. Carroll, and D. Weir. 2011. De-
pendency parsing schemata and mildly non-projective
dependency parsing. Computational Linguistics,
37(3):541–586.

K. Hall. 2007. K-best spanning tree parsing. In Proceed-
ings of ACL, pages 392–399.

R. Johansson and P. Nugues. 2007. Extended
constituent-to-dependency conversion for English. In
Proceedings of the 16th Nordic Conference on Com-
putational Linguistics (NODALIDA), pages 105–112.

T. Koo and M. Collins. 2010. Efficient third-order de-
pendency parsers. In Proceedings of ACL, pages 1–11.

T. Koo, A. M. Rush, M. Collins, T. Jaakkola, and D. Son-
tag. 2010. Dual decomposition for parsing with non-
projective head automata. In Proceedings of EMNLP,
pages 1288–1298.

T. Koo. 2010. Advances in discriminative dependency
parsing. Ph.D. thesis, Massachusetts Institute of Tech-
nology.

S. Kübler, R. McDonald, and J. Nivre. 2009. Depen-
dency parsing. Synthesis Lectures on Human Lan-
guage Technologies, 2(1):1–127.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini.
1993. Building a large annotated corpus of En-
glish: The Penn Treebank. Computational Linguistics,
19(2):313–330.

A. F. T. Martins, N. A. Smith, and E. P. Xing. 2009.
Concise integer linear programming formulations for
dependency parsing. In Proceedings of ACL, pages
342–350.

A. Martins, M. Almeida, and N. A. Smith. 2013. Turn-
ing on the turbo: Fast third-order non-projective turbo
parsers. In Proceedings of ACL (Short Papers), pages
617–622.

R. McDonald and F. Pereira. 2006. Online learning of
approximate dependency parsing algorithms. In Pro-
ceedings of EACL, pages 81–88.

R. McDonald and G. Satta. 2007. On the complexity
of non-projective data-driven dependency parsing. In
Proceedings of the 10th International Conference on
Parsing Technologies, pages 121–132.

R. McDonald, F. Pereira, K. Ribarov, and J. Hajič.
2005. Non-projective dependency parsing using span-
ning tree algorithms. In Proceedings of HLT/EMNLP,
pages 523–530.

I. Mel’čuk. 1988. Dependency Syntax: Theory and
Practice. State University of New York Press.

J. Nivre and J. Nilsson. 2005. Pseudo-projective depen-
dency parsing. In Proceedings of ACL, pages 99–106.

J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson,
S. Riedel, and D. Yuret. 2007. The CoNLL 2007
shared task on dependency parsing. In Proceedings
of the CoNLL Shared Task Session of EMNLP-CoNLL,
pages 915–932.

J. Nivre. 2003. An efficient algorithm for projective de-
pendency parsing. In Proceedings of the 8th Interna-
tional Workshop on Parsing Technologies, pages 149–
160.

J. Nivre. 2004. Incrementality in deterministic depen-
dency parsing. In Proceedings of the Workshop on In-
cremental Parsing: Bringing Engineering and Cogni-
tion Together, pages 50–57.

J. Nivre. 2009. Non-projective dependency parsing in
expected linear time. In Proceedings of ACL, pages
351–359.

E. Pitler, S. Kannan, and M. Marcus. 2012. Dynamic
programming for higher order parsing of gap-minding
trees. In Proceedings of EMNLP, pages 478–488.

E. Pitler, S. Kannan, and M. Marcus. 2013. Find-
ing optimal 1-Endpoint-Crossing trees. Transac-
tions of the Association for Computational Linguistics,
1(Mar):13–24.

52


A. Rush and S. Petrov. 2012. Vine pruning for effi-
cient multi-pass dependency parsing. In Proceedings
of NAACL, pages 498–507.

G. Satta and M. Kuhlmann. 2013. Efficient parsing for
head-split dependency trees. Transactions of the As-
sociation for Computational Linguistics, 1(July):267–
278.

D. A. Smith and J. Eisner. 2008. Dependency parsing by
belief propagation. In Proceedings of EMNLP, pages
145–156.

D. Weiss and B. Taskar. 2010. Structured Prediction
Cascades. In AISTATS, pages 916–923.

D. Zeman, D. Mareček, M. Popel, L. Ramasamy,
J. Štěpánek, Z. Žabokrtský, and J. Hajič. 2012. Ham-
leDT: To parse or not to parse? In Proceedings of
the Eight International Conference on Language Re-
sources and Evaluation (LREC’12), pages 2735–2741.

H. Zhang and R. McDonald. 2012. Generalized higher-
order dependency parsing with cube pruning. In Pro-
ceedings of EMNLP, pages 320–331.

Y. Zhang and J. Nivre. 2011. Transition-based depen-
dency parsing with rich non-local features. In Pro-
ceedings of ACL (Short Papers), pages 188–193.

53


54