Transactions of the Association for Computational Linguistics, 1 (2013) 37–48. Action Editor: Ryan McDonald. Submitted 11/2012; Revised 2/2013; Published 3/2013. © 2013 Association for Computational Linguistics.

Branch and Bound Algorithm for Dependency Parsing with Non-local Features

Xian Qian and Yang Liu
Computer Science Department
The University of Texas at Dallas
{qx,yangl}@hlt.utdallas.edu

Abstract

Graph based dependency parsing is inefficient when handling non-local features due to the high computational complexity of inference. In this paper, we propose an exact and efficient decoding algorithm based on the Branch and Bound (B&B) framework, in which non-local features are bounded by a linear combination of local features. Dynamic programming is used to search for the upper bound. Experiments are conducted on the English PTB and Chinese CTB datasets. We achieve competitive Unlabeled Attachment Scores (UAS) when no additional resources are available: 93.17% for English and 87.25% for Chinese. Parsing speed is 177 words per second for English and 97 words per second for Chinese. Our algorithm is general and can be adapted to non-projective dependency parsing or other graphical models.

1 Introduction

For graph based projective dependency parsing, dynamic programming (DP) is popular for decoding due to its efficiency when handling local features. It performs cubic time parsing for arc-factored models (Eisner, 1996; McDonald et al., 2005a) and biquadratic time for higher order models with richer sibling and grandchild features (Carreras, 2007; Koo and Collins, 2010). However, for models with general non-local features, DP is inefficient.

There have been numerous studies on global inference algorithms for general higher order parsing. One popular approach is reranking (Collins, 2000; Charniak and Johnson, 2005; Hall, 2007). It typically has two steps: a low level classifier generates the top k hypotheses using local features, then a high level classifier reranks these candidates using global features. Since the reranking quality is bounded by the oracle performance of the candidates, some work has combined the candidate generation and reranking steps using cube pruning (Huang, 2008; Zhang and McDonald, 2012) to achieve higher oracle performance. These methods parse a sentence in bottom up order and keep the top k derivations for each span using k best parsing (Huang and Chiang, 2005). After merging two spans, non-local features are used to rerank the top k combinations. This approach is very efficient and flexible in handling various non-local features. The disadvantage is that it tends to compute non-local features as early as possible so that the decoder can utilize that information at internal spans; hence it may miss long historical features such as long dependency chains.

Smith and Eisner modeled dependency parsing using Markov Random Fields (MRFs) with global constraints and applied loopy belief propagation (LBP) for approximate learning and inference (Smith and Eisner, 2008). Similar work was done for Combinatorial Categorial Grammar (CCG) parsing (Auli and Lopez, 2011). They used posterior marginal beliefs for inference to satisfy the tree constraint: for each factor, only legal messages (those satisfying the global constraints) are considered in the partition function.

A similar line of research investigated the use of integer linear programming (ILP) based parsing (Riedel and Clarke, 2006; Martins et al., 2009). This method is very expressive.
It can handle arbitrary non-local features determined or bounded by linear inequalities of local features. For local models, LP is less efficient than DP. The reason is that DP works on a small number of dimensions in each recursion, while for LP the popular revised simplex method needs to solve an m dimensional linear system in each iteration (Nocedal and Wright, 2006), where m is the number of constraints, which is quadratic in the sentence length for projective dependency parsing (Martins et al., 2009).

Dual Decomposition (DD) (Rush et al., 2010; Koo et al., 2010) is a special case of Lagrangian relaxation. It relies on standard decoding algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. This method does not need to consider the tree constraint explicitly, as it resorts to dynamic programming, which guarantees its satisfaction. It works well if the sub-problems can be well defined, especially for joint learning tasks. However, for the task of dependency parsing, using various non-local features may result in many overlapping sub-problems, hence it may take a long time to reach a consensus (Martins et al., 2011).

In this paper, we propose a novel Branch and Bound (B&B) algorithm for efficient parsing with various non-local features. B&B (Land and Doig, 1960) is generally used for combinatorial optimization problems such as ILP. The difference between our method and ILP is that the sub-problem in ILP is a relaxed LP, which requires a numerical solution, while ours bounds the non-local features by a linear combination of local features and uses DP both for decoding and for calculating the upper bound of the objective function. An exact solution is achieved if the bound is tight. Though in the worst case the time complexity is exponential in the sentence length, the algorithm is efficient in practice, especially when a pruning strategy is adopted.

Experiments are conducted on the English Penn Tree Bank and Chinese Tree Bank 5 (CTB5) with the standard train/develop/test split. We achieved 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second.

2 Graph Based Parsing

2.1 Problem Definition

Given a sentence x = x_1, x_2, ..., x_n where x_i is the ith word of the sentence, dependency parsing assigns exactly one head word to each word, so that dependencies from head words to modifiers form a tree. The root of the tree is a special symbol denoted by x_0, which has exactly one modifier. In this paper, we focus on unlabeled projective dependency parsing, but our algorithm can be adapted for labeled or non-projective dependency parsing (McDonald et al., 2005b).

The inference problem is to search for the optimal parse tree y*:

    y* = arg max_{y∈Y(x)} ϕ(x,y)

where Y(x) is the set of all candidate parse trees of sentence x. Here ϕ(x,y) is a given score function, which is usually decomposed into small parts

    ϕ(x,y) = ∑_{c⊆y} ϕ_c(x)    (1)

where c is a subset of edges, called a factor. For example, in the all grandchild model (Koo and Collins, 2010), the score function can be represented as

    ϕ(x,y) = ∑_{e_hm∈y} ϕ_{e_hm}(x) + ∑_{e_gh,e_hm∈y} ϕ_{e_gh,e_hm}(x)

where the first term is the sum of the scores of all edges x_h → x_m, and the second term is the sum of the scores of all edge chains x_g → x_h → x_m.
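To make the decomposition in Eq(1) concrete, the following minimal Python sketch scores a fixed candidate tree under the all grandchild factorization. The containers edge_score and grand_score are hypothetical stand-ins for whatever stores the factor scores; they are not part of the paper's implementation.

```python
def tree_score(heads, edge_score, grand_score):
    """Score a parse tree under the all grandchild factorization (Eq. 1).

    heads[m] is the head index of word m (word indices start at 1; index 0
    is the artificial root x_0).  edge_score[(h, m)] holds the score of the
    single-edge factor x_h -> x_m, and grand_score[(g, h, m)] the score of
    the chain factor x_g -> x_h -> x_m.  Missing entries contribute 0.
    """
    total = 0.0
    for m in range(1, len(heads)):
        h = heads[m]
        total += edge_score.get((h, m), 0.0)        # first sum: single edges
        if h != 0:                                   # second sum: grandchild chains
            total += grand_score.get((heads[h], h, m), 0.0)
    return total

# a three-word chain x_0 -> x_1 -> x_2 -> x_3
print(tree_score([0, 0, 1, 2],
                 {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 0.5},
                 {(1, 2, 3): 0.3}))                  # prints 2.3
```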
In discriminative models, the score of a parse tree y is the weighted sum of the fired feature functions, which can be represented as the sum of the factors

    ϕ(x,y) = w^T f(x,y) = ∑_{c⊆y} w^T f(x,c) = ∑_{c⊆y} ϕ_c(x)

where f(x,c) is the feature vector that depends on c. For example, we could define a feature for the grandchild factor c = {e_gh, e_hm}

    f(x,c) = 1  if x_g = would ∧ x_h = be ∧ x_m = happy ∧ c is selected
           = 0  otherwise

2.2 Dynamic Programming for Local Models

In first order models, all factors c in Eq(1) contain a single edge. The optimal parse tree can be derived by DP with running time O(n^3) (Eisner, 1996). The algorithm uses two types of structures: the complete span, which consists of a headword and its descendants on one side, and the incomplete span, which consists of a dependency and the region between the head and modifier. It starts from single word spans and merges spans in bottom up order.

For second order models, the score function ϕ(x,y) adds the scores of siblings (adjacent edges with a common head) and grandchildren

    ϕ(x,y) = ∑_{e_hm∈y} ϕ_{e_hm}(x) + ∑_{e_gh,e_hm∈y} ϕ_{e_gh,e_hm}(x) + ∑_{e_hm,e_hs∈y} ϕ_{e_hm,e_hs}(x)

There are two versions of second order models, used respectively by Carreras (2007) and Koo et al. (2010). The difference is that Carreras' model only considers the outermost grandchildren, while Koo and Collins' allows all grandchild features. Both models permit O(n^4) running time.

Third-order models score edge triples such as three adjacent sibling modifiers, or grand-siblings that score a word, its modifier and its adjacent grandchildren; the inference complexity is O(n^4) (Koo and Collins, 2010).

In this paper, we call all factors/features that can be handled by DP the local factors/features.

3 The Proposed Method

3.1 Basic Idea

For general high order models with non-local features, we propose to use a Branch and Bound (B&B) algorithm to search for the optimal parse tree. A B&B algorithm has two steps: branching and bounding. The branching step recursively splits the search space Y(x) into two disjoint subspaces Y(x) = Y_1 ∪ Y_2 by fixing the assignment of one edge. For each subspace Y_i, the bounding step calculates an upper bound on the optimal parse tree score in the subspace: UB_{Y_i} ≥ max_{y∈Y_i} ϕ(x,y). If this bound is no more than the score of some already obtained parse tree, UB_{Y_i} ≤ ϕ(x,y′), then no parse tree in subspace Y_i is better than y′, and Y_i can be pruned safely.

The efficiency of B&B depends on the branching strategy and the upper bound computation. For example, Sun et al. (2012) used B&B for MRFs, where they proposed two branching strategies and a novel data structure for efficient upper bound computation. Klenner and Ailloud (2009) proposed a variation of Balas' algorithm (Balas, 1965) for coreference resolution, where candidate branching variables are sorted by their weights.

Our bounding strategy is to find an upper bound for the score of each non-local factor c containing multiple edges. The bound is the sum of new scores of the edges in the factor plus a constant

    ϕ_c(x) ≤ ∑_{e∈c} ψ_e(x) + α_c

Based on the new scores {ψ_e(x)} and constants {α_c}, we define the new score of parse tree y

    ψ(x,y) = ∑_{c⊆y} ( ∑_{e∈c} ψ_e(x) + α_c )

Then we have

    ψ(x,y) ≥ ϕ(x,y),  ∀y ∈ Y(x)

The advantage of such a bound is that it is a sum of new edge scores. Hence its optimum tree max_{y∈Y(x)} ψ(x,y) can be found by DP, and this optimum is an upper bound on max_{y∈Y(x)} ϕ(x,y), since ψ(x,y) ≥ ϕ(x,y) for any y ∈ Y(x).
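Since both the local baseline and the bound computation above rely on the same decoder, here is a minimal sketch of the first order Eisner DP of Section 2.2, assuming a dense matrix of arc scores. It returns only the optimal score (back-pointers, which recover the tree itself, are omitted for brevity) and does not enforce the single-modifier constraint on the root.

```python
NEG_INF = float("-inf")

def eisner_best_score(score):
    """First order Eisner DP: best projective tree score.

    score[h][m] is the score of the arc h -> m; index 0 is the artificial root.
    comp/inc hold complete and incomplete spans; direction d = 1 means the
    head of the span is its left endpoint i, d = 0 means it is the right
    endpoint j.
    """
    n = len(score)
    comp = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        comp[i][i][0] = comp[i][i][1] = 0.0
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # attach one new arc between i and j on top of two complete halves
            best_halves = max(comp[i][r][1] + comp[r + 1][j][0] for r in range(i, j))
            inc[i][j][0] = best_halves + score[j][i]   # arc j -> i
            inc[i][j][1] = best_halves + score[i][j]   # arc i -> j
            # extend an incomplete span into a complete one
            comp[i][j][0] = max(comp[i][r][0] + inc[r][j][0] for r in range(i, j))
            comp[i][j][1] = max(inc[i][r][1] + comp[r][j][1] for r in range(i + 1, j + 1))
    return comp[0][n - 1][1]
```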
3.2 The Upper Bound Function

In this section, we derive the upper bound function ψ(x,y) described above. To simplify notation, we drop x throughout the rest of the paper. Let z_c be a binary variable indicating whether factor c is selected in the parse tree. We reformulate the score function in Eq(1) as

    ϕ(y) ≡ ϕ(z) = ∑_c ϕ_c z_c    (2)

Correspondingly, the tree constraint is replaced by z ∈ Z. Then the parsing task is

    z* = arg max_{z∈Z} ∑_c ϕ_c z_c    (3)

Notice that, for any z_c, we have

    z_c = min_{e∈c} z_e

which means that factor c appears in the parse tree if and only if all of its edges {e | e ∈ c} are selected in the tree. Here z_e is short for z_{{e}}, the factor consisting of the single edge e.

Our bounding method is based on the following fact: for a set {a_1, a_2, ..., a_r} (a_j denotes the jth element), its minimum is

    min_j {a_j} = min_{p∈∆} ∑_j p_j a_j    (4)

where ∆ is the probability simplex

    ∆ = {p | p_j ≥ 0, ∑_j p_j = 1}

We discuss the bound for ϕ_c z_c in two cases: ϕ_c ≥ 0 and ϕ_c < 0.

If ϕ_c ≥ 0, we have

    ϕ_c z_c = ϕ_c min_{e∈c} z_e = ϕ_c min_{p_c∈∆} ∑_{e∈c} p_c^e z_e = min_{p_c∈∆} ∑_{e∈c} ϕ_c p_c^e z_e

The second equation comes from Eq(4). For simplicity, let

    g_c(p_c, z) = ∑_{e∈c} ϕ_c p_c^e z_e

with domain dom g_c = {p_c ∈ ∆; z_e ∈ {0,1}, ∀e ∈ c}. Then we have

    ϕ_c z_c = min_{p_c} g_c(p_c, z)    (5)

If ϕ_c < 0, we have two upper bounds. One is commonly used in ILP when all the variables are binary:

    a* = min_j {a_j}_{j=1}^r  ⟺  a* ≤ a_j  and  a* ≥ ∑_j a_j − (r − 1)

According to the last inequality, we have the following upper bound for negatively scored factors

    ϕ_c z_c ≤ ϕ_c ( ∑_{e∈c} z_e − (r_c − 1) )    (6)

where r_c is the number of edges in c. For simplicity, we use the notation

    σ_c(z) = ϕ_c ( ∑_{e∈c} z_e − (r_c − 1) )

The other upper bound when ϕ_c < 0 is simply

    ϕ_c z_c ≤ 0    (7)

Notice that, for any parse tree, one of the two upper bounds must be tight: Eq(6) is tight if c appears in the parse tree (z_c = 1), and otherwise Eq(7) is tight. Therefore

    ϕ_c z_c = min {σ_c(z), 0}

Let

    h_c(p_c, z) = p_c^1 σ_c(z) + p_c^2 · 0

with dom h_c = {p_c ∈ ∆; z_e ∈ {0,1}, ∀e ∈ c}. According to Eq(4), we have

    ϕ_c z_c = min_{p_c} h_c(p_c, z)    (8)

Let

    ψ(p,z) = ∑_{c: ϕ_c≥0} g_c(p_c, z) + ∑_{c: ϕ_c<0} h_c(p_c, z)

Minimizing ψ with respect to p, we have

    min_p ψ(p,z) = min_p [ ∑_{c: ϕ_c≥0} g_c(p_c,z) + ∑_{c: ϕ_c<0} h_c(p_c,z) ]
                 = ∑_{c: ϕ_c≥0} min_{p_c} g_c(p_c,z) + ∑_{c: ϕ_c<0} min_{p_c} h_c(p_c,z)
                 = ∑_{c: ϕ_c≥0} ϕ_c z_c + ∑_{c: ϕ_c<0} ϕ_c z_c
                 = ϕ(z)

The second equation holds because, for any two factors c and c′, g_c (or h_c) and g_{c′} (or h_{c′}) are separable. The third equation comes from Eq(5) and Eq(8). Based on this, we have the following proposition:

Proposition 1. For any p with p_c ∈ ∆, and any z ∈ Z, ψ(p,z) ≥ ϕ(z).

Therefore, ψ(p,z) is an upper bound function of ϕ(z). Furthermore, for fixed p, ψ(p,z) is a linear function of the variables z_e (see Eq(5) and Eq(8)); the variables z_c for larger factors have been eliminated. Hence z′ = arg max_z ψ(p,z) can be solved efficiently by DP. Because

    ψ(p,z′) ≥ ψ(p,z*) ≥ ϕ(z*) ≥ ϕ(z′)

after obtaining z′ we get both an upper bound and a lower bound on ϕ(z*): ψ(p,z′) and ϕ(z′).

The upper bound should be as tight as possible. Using the min-max inequality, we get

    max_{z∈Z} ϕ(z) = max_{z∈Z} min_p ψ(p,z) ≤ min_p max_{z∈Z} ψ(p,z)

so min_p max_{z∈Z} ψ(p,z) provides the tightest upper bound on ϕ(z*) of this form.

Since ψ is not differentiable with respect to p, the projected sub-gradient method (Calamai and Moré, 1987; Rush et al., 2010) is used to search for the saddle point. More specifically, in each iteration we first fix p and search for z using DP, then we fix z and update p by

    p_new = P_∆ ( p + α ∂ψ/∂p )

where α > 0 is the step size in the line search and P_∆(q) denotes the projection of q onto the probability simplex ∆. In this paper, we use the Euclidean projection, that is,

    P_∆(q) = arg min_{p∈∆} ∥p − q∥^2

which can be solved efficiently by sorting (Duchi et al., 2008).
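Both operations used above are straightforward to implement. The sketch below assumes illustrative containers only: local_score is a dense matrix of first order arc scores, factors is a list of (edges, weight) pairs with edges a tuple of (h, m) arcs and weight = ϕ_c, and p holds one simplex vector per factor (one entry per edge if ϕ_c ≥ 0, two entries otherwise); none of these are the paper's actual data structures.

```python
def project_to_simplex(q):
    """Euclidean projection of q onto the probability simplex, computed by
    sorting (the method of Duchi et al., 2008)."""
    u = sorted(q, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        if uj - (cumulative - 1.0) / j > 0:
            theta = (cumulative - 1.0) / j
    return [max(qi - theta, 0.0) for qi in q]

def psi_edge_scores(local_score, factors, p):
    """For fixed p, rewrite psi(p, z) as modified edge scores plus a constant,
    so that max_z psi(p, z) can be computed by the first order DP."""
    mod = [row[:] for row in local_score]
    const = 0.0
    for (edges, w), pc in zip(factors, p):
        if w >= 0:                                  # g_c: spread phi_c over the edges
            for (h, m), pe in zip(edges, pc):
                mod[h][m] += w * pe
        else:                                       # h_c: p_c^1 * sigma_c(z)
            for h, m in edges:
                mod[h][m] += pc[0] * w
            const -= pc[0] * w * (len(edges) - 1)
    return mod, const
```

Running the Eisner sketch above on the modified scores and adding the constant gives max_z ψ(p,z), an upper bound on ϕ(z*) for the current p; the projection keeps each p_c on the simplex after a sub-gradient step.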
3.3 Branch and Bound Based Parsing

As discussed in Section 3.1, the B&B recursive procedure yields a binary tree structure called the Branch and Bound tree. Each node of the B&B tree has some fixed z_e, specifying some must-select edges and some must-remove edges. The root of the B&B tree has no constraints, so it can produce all possible parse trees, including z*. Each node has two children: one adds a constraint z_e = 1 for a free edge e, and the other adds z_e = 0.

[Figure 1: an example Branch and Bound tree. Each level branches on one edge variable (z_e1, z_e2, ...), and each node records its upper bound ψ and score ϕ.]

A node can be pruned only when its upper bound does not exceed the current lower bound LB; to make such pruning effective, we need to choose a good branching variable to generate the child nodes. Let G(z′) = ψ(p′,z′) − ϕ(z′) denote the gap between the upper bound and the lower bound. This gap is the accumulation of the gaps of all factors c. Let G_c be the gap of c:

    G_c = g_c(p′_c, z′) − ϕ_c z′_c   if ϕ_c ≥ 0
    G_c = h_c(p′_c, z′) − ϕ_c z′_c   if ϕ_c < 0

We choose the branching variable heuristically: for each edge e, we define its gap as the sum of the gaps of the factors that contain it

    G_e = ∑_{c: e∈c} G_c

The edge with the maximum gap is selected as the branching variable. Suppose there are N nodes on a level of the B&B tree; correspondingly, we get N branching variables, among which we choose the one with the highest lower bound, as it is likely to reach the optimal value faster.

3.4 Lower Bound Initialization

A large lower bound is critical for efficient pruning. In this section, we discuss an alternative way to initialize the lower bound LB. We apply a similar trick to obtain a lower bound function of ϕ(z).

Similar to Eq(8), for ϕ_c ≥ 0 we have

    ϕ_c z_c = max { ϕ_c ( ∑_{e∈c} z_e − (r_c − 1) ), 0 } = max { σ_c(z), 0 }

Using the fact that

    max_j {a_j} = max_{p∈∆} ∑_j p_j a_j

we have

    ϕ_c z_c = max_{p_c∈∆} p_c^1 σ_c(z) + p_c^2 · 0 = max_{p_c} h_c(p_c, z)

For ϕ_c < 0, we have

    ϕ_c z_c = max_{e∈c} { ϕ_c z_e } = max_{p_c∈∆} ∑_{e∈c} p_c^e ϕ_c z_e = max_{p_c} g_c(p_c, z)

Putting the two cases together, we get the lower bound function

    π(p,z) = ∑_{c: ϕ_c≥0} h_c(p_c, z) + ∑_{c: ϕ_c<0} g_c(p_c, z)

For any p with p_c ∈ ∆ and any z ∈ Z,

    π(p,z) ≤ ϕ(z)

π(p,z) is not concave; however, we can alternately optimize z and p to get a good approximation, which provides a lower bound for ϕ(z*).

Algorithm 1: Branch and Bound based parsing
Require: {ϕ_c}
Ensure: optimal parse tree z*
  Solve p*, z* = arg max_{p,z} π(p,z)
  Initialize 𝒮 = {Z}, LB = π(p*, z*)
  while 𝒮 ≠ ∅ do
    Set 𝒮′ = ∅   {nodes that survive pruning}
    for each S ∈ 𝒮: solve min_p max_{z∈S} ψ(p,z) to get LB_S, UB_S
    LB = max{LB, max_{S∈𝒮} LB_S}; update z*
    for each S ∈ 𝒮: add S to 𝒮′ if UB_S > LB
    Select a branching variable z_e; clear 𝒮 = ∅
    for each S ∈ 𝒮′:
      add S_1 = {z ∈ S | z_e = 1} to 𝒮
      add S_2 = {z ∈ S | z_e = 0} to 𝒮
  end while

3.5 Summary

We summarize our B&B algorithm in Algorithm 1. It is worth pointing out that the description so far has assumed that the backbone DP uses a first order model; however, the backbone DP can also be the second or third order version. The difference is that, for a higher order DP, higher order factors such as adjacent siblings and grandchildren are handled directly as local factors.

In the worst case, all the edges are selected for branching, and the complexity grows exponentially in the sentence length. However, in practice the algorithm is quite efficient, as we will show in the next section.
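Putting the pieces together, a single node of the B&B tree can compute its bounds roughly as follows. This is a sketch only: it assumes the helper functions from the previous sketches, a hypothetical decode_tree(edge_scores) that returns both the best score and the head array (the Eisner DP extended with back-pointers), a simple diminishing step size rather than the line search strategy of Rush et al. (2010), and that the edges fixed by the node are already encoded in local_score (e.g. a forbidden arc scored with −∞).

```python
def node_bounds(local_score, factors, rounds=50, step0=1.0, eps=1e-8):
    """Approximate min_p max_z psi(p, z) for one B&B node.

    Returns (UB, LB, best_heads): the tightest upper bound seen, the best
    lower bound phi(z') over the decoded trees, and the corresponding tree.
    """
    n = len(local_score)
    # initialisation: uniform p_c for non-negative factors, p_c = (0, 1) otherwise
    p = [[1.0 / len(edges)] * len(edges) if w >= 0 else [0.0, 1.0]
         for edges, w in factors]
    ub, lb, best = float("inf"), float("-inf"), None
    for t in range(rounds):
        mod, const = psi_edge_scores(local_score, factors, p)
        dp_score, heads = decode_tree(mod)            # z' = argmax_z psi(p, z)
        ub = min(ub, dp_score + const)                # psi(p, z') >= phi(z*)
        in_tree = {(heads[m], m) for m in range(1, n)}
        true = sum(local_score[heads[m]][m] for m in range(1, n)) + \
               sum(w for edges, w in factors if all(e in in_tree for e in edges))
        if true > lb:                                 # phi(z') <= phi(z*)
            lb, best = true, heads
        if ub - lb < eps:                             # bound is tight
            break
        alpha = step0 / (1.0 + t)
        for i, (edges, w) in enumerate(factors):      # sub-gradient of psi w.r.t. p_c
            if w >= 0:
                grad = [w * (1.0 if e in in_tree else 0.0) for e in edges]
            else:
                sigma = w * (sum(e in in_tree for e in edges) - (len(edges) - 1))
                grad = [sigma, 0.0]
            # projected step on p for the outer minimisation (descent direction)
            p[i] = project_to_simplex([pi - alpha * g for pi, g in zip(p[i], grad)])
    return ub, lb, best
```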
4 Experiments

4.1 Experimental Settings

The datasets we used are the English Penn Tree Bank (PTB) and Chinese Tree Bank 5.0 (CTB5). We use the standard train/develop/test split as described in Table 1. We extracted dependencies using Joakim Nivre's Penn2Malt tool with standard head rules: Yamada and Matsumoto's (Yamada and Matsumoto, 2003) for English, and Zhang and Clark's (Zhang and Clark, 2008) for Chinese. Unlabeled attachment score (UAS) is used to evaluate parsing quality.[1] The B&B parser is implemented in C++. All experiments are conducted on a platform with an Intel Core i5-2500 CPU at 3.30GHz.

[1] For English, we follow Koo and Collins (2010) and ignore any word whose gold-standard POS tag is one of { “ ” : , . }. For Chinese, we ignore any word whose POS tag is PU.

Table 1: Data split in our experiment

           Train                      Develop                  Test
  PTB      sec. 2-21                  sec. 22                  sec. 23
  CTB5     sec. 001-815, 1001-1136    sec. 886-931, 1148-1151  sec. 816-885, 1137-1147

4.2 Baseline: DP Based Second Order Parser

We use the dynamic programming based second order parser (Carreras, 2007) as the baseline. The averaged structured perceptron (Collins, 2002) is used for parameter estimation. We determine the number of iterations on the validation set; it is 6 for both corpora.

For English, we train the POS tagger using a linear chain perceptron on the training set and predict POS tags for the development and test data. The parser is trained using the automatic POS tags generated by 10 fold cross validation. For Chinese, we use the gold standard POS tags.

We use five types of features: unigram features, bigram features, in-between features, adjacent sibling features and outermost grandchild features. The first three types of features were first introduced by McDonald et al. (2005a), and the last two types are used by Carreras (2007). All features are concatenations of surrounding words, lower cased words (English only), word length (Chinese only), prefixes and suffixes of words (Chinese only), POS tags, coarse POS tags derived from POS tags using a simple mapping table, the distance between head and modifier, and the direction of the edge. For English, we used 674 feature templates to generate a large number of features and finally obtained 86.7M non-zero weighted features after training; the baseline parser achieves 92.81% UAS on the test set. For Chinese, we used 858 feature templates and finally obtained 71.5M non-zero weighted features after training; the baseline parser achieves 86.89% UAS on the test set.

4.3 B&B Based Parser with Non-local Features

We use the baseline parser as the backbone of our B&B parser. We tried different types of non-local features, as listed below:

• All grandchild features. Notice that these features can be handled directly by Koo's second order model (Koo and Collins, 2010).

• All great grandchild features.

• All sibling features: all pairs of edges with a common head (see the sketch after this list). An example is shown in Figure 2.

• All tri-sibling features: all 3-tuples of edges with a common head.

• Comb features: for any word with more than 3 consecutive modifiers, the set of all edges from the word to the modifiers forms a comb.[2]

• Hand crafted features: we performed cross validation on the training data using the baseline parser and designed features that may correct the most common errors. We designed 13 hand crafted features for English in total; one example is shown in Figure 3. For Chinese, we did not add any hand crafted features, as the errors in the cross validation results vary a lot and we did not find general patterns to fix them.

[2] In fact, our algorithm can deal with non-consecutive modifiers; however, in such cases, factor detection (detecting regular expressions like x_1.* x_2.* ...) requires the longest common subsequence algorithm (LCS), which is time-consuming if many comb features are generated. Similar problems arise for sub-tree features, which may contain many non-consecutive words.
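The following small sketch shows how the all-sibling factors of a candidate tree could be enumerated; the head-array representation is illustrative only. Unlike the adjacent-sibling parts of second order DP parsers, the two modifiers in a factor may lie on opposite sides of the head, as in Figure 2.

```python
from collections import defaultdict
from itertools import combinations

def all_sibling_factors(heads):
    """Enumerate all-sibling factors: every pair of edges sharing a head.

    heads[m] is the head of word m (index 0 is the root and has no head).
    Each factor is returned as a tuple of two (h, m) edges.
    """
    modifiers = defaultdict(list)
    for m in range(1, len(heads)):
        modifiers[heads[m]].append(m)
    factors = []
    for h, mods in modifiers.items():
        for m1, m2 in combinations(mods, 2):   # both sides of the head are allowed
            factors.append(((h, m1), (h, m2)))
    return factors

# a head (word 2) with modifiers on both sides: heads of words 1..4 are 2, 0, 2, 2
print(all_sibling_factors([0, 2, 0, 2, 2]))
# [((2, 1), (2, 3)), ((2, 1), (2, 4)), ((2, 3), (2, 4))]
```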
4.4 Implementation Details

To speed up the solution of the min-max subproblem, for each node in the B&B tree we initialize p with the optimal solution of its parent node; since a child node fixes only one additional edge, its optimal point is likely to be close to its parent's. For the root node of the B&B tree, we initialize p_c^e = 1/r_c for factors with non-negative weights and p_c^1 = 0 for negatively weighted factors.

The step size α is initialized with max_{c: ϕ_c≠0} {1/|ϕ_c|}, as the vector p is bounded in a unit box. α is updated using the same strategy as Rush et al. (2010). Two stopping criteria are used. One is 0 ≤ ψ_old − ψ_new ≤ ϵ, where ϵ > 0 is a given precision.[3] The other checks whether the bound is tight: UB = LB. Because all features are boolean (note that they could be integer valued), their weights are integers during each perceptron update, hence the scores of parse trees are discrete. The minimal gap between different scores is 1/(N×T) after averaging, where N is the number of training samples and T is the number of iterations of perceptron training. Therefore the upper bound can be tightened to UB = ⌊NTψ⌋/(NT).

[3] We use ϵ = 10^{-8} in our implementation.

During testing, we use the pre-pruning method of Martins et al. (2009) on both datasets to balance parsing quality and speed. This method uses a simple classifier to select the top k candidate heads for each word and excludes the other heads from the search space. In our experiments, we set k = 10.

[Figure 2: An example of all sibling features. Top: a sub-tree (head h with modifiers c0, c1, c2, c3); bottom: the extracted sibling factors. Existing higher order DP systems cannot handle siblings on both sides of the head.]

[Figure 3: An example of a hand crafted feature, illustrated on "regulation occurs through inaction, rather than through ...": for the word sequence A ... rather than A, where A is a preposition, the first A is the head of than, and than is the head of rather and of the second A.]

4.5 Main Result

Experimental results are listed in Table 2. For comparison, we also include results of representative state-of-the-art systems.

Table 2: Comparison between our system and state-of-the-art systems (UAS).

  System                                                  PTB     CTB
  Our baseline                                            92.81   86.89
  B&B
    +all grand-child                                      92.97   87.02
    +all great grand-child                                92.78   86.77
    +all sibling                                          93.00   87.05
    +all tri-sibling                                      92.79   86.81
    +comb                                                 92.86   86.91
    +hand craft                                           92.89   N/A
    +all grand-child +all sibling +comb +hand craft       93.17   87.25
  3rd order re-impl.                                      93.03   87.07
  TurboParser (reported)                                  92.62   N/A
  TurboParser (our run)                                   92.82   86.05
  Koo and Collins (2010)                                  93.04   N/A
  Zhang and McDonald (2012)                               93.06   86.87
  Zhang and Nivre (2011)                                  92.90   86.00
  System integration
    Bohnet and Kuhn (2012)                                93.39   87.5
  Systems using additional resources
    Suzuki et al. (2009)                                  93.79   N/A
    Koo et al. (2008)                                     93.5    N/A
    Chen et al. (2012)                                    92.76   N/A

For the third order parser, we re-implemented Model 1 of Koo and Collins (2010) and removed the longest sentence in the CTB dataset, which contains 240 words, due to the O(n^4) space complexity.[4] For ILP based parsing, we used TurboParser[5], a speed-optimized parser toolkit.
We trained its full models (which use all grandchild features, all sibling features and head bigram features (Martins et al., 2011)) for both datasets using its default settings. We also list the performance reported in its documentation for the English corpus.

[4] In fact, Koo's algorithm requires only O(n^3) space. Our implementation is O(n^4) because we store the feature vectors for fast training.
[5] http://www.ark.cs.cmu.edu/TurboParser/

One observation is that the all-sibling features are the most helpful for our parser, since some good sibling features cannot be encoded in a DP based parser. For example, a matched pair of parentheses are always siblings, but their head may lie between them. Another observation is that the all great grandchild features and the all tri-sibling features slightly hurt the performance, and we excluded them from the final system.

When no additional resource is available, our parser achieved competitive performance: 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second. Higher UAS has been reported by joint tagging and parsing (Bohnet and Nivre, 2012) and by system integration (Bohnet and Kuhn, 2012), which benefits from both transition based parsing and graph based parsing. Previous work shows that the combination of the two parsing techniques can learn to overcome the shortcomings of each non-integrated system (Nivre and McDonald, 2008; Zhang and Clark, 2008). System combination will be an interesting topic for our future research. The highest reported performance on the English corpus is 93.79%, obtained by semi-supervised learning with a large amount of unlabeled data (Suzuki et al., 2009).

4.6 Tradeoff Between Accuracy and Speed

In this section, we study the tradeoff between accuracy and speed under different pre-pruning setups. Table 3 shows the parsing accuracy and inference time in the testing stage with different numbers of candidate heads k in the pruning step. We can see that, on the English dataset, when k ≥ 10 our parser gains a 2-3 times speedup without losing much parsing accuracy. There is a further increase in speed with smaller k, at the cost of some accuracy. Compared with TurboParser, our parser is less efficient but more accurate. Zhang and McDonald (2012) is a state-of-the-art system that adopts cube pruning for efficient parsing. Notice that they did not use pruning, which seems to increase parsing speed with little loss in accuracy; moreover, they did labeled parsing, which also makes their speed not directly comparable.

Table 3: Tradeoff between parsing accuracy (UAS) and speed (words per second, w/s) with different pre-pruning settings. k denotes the number of candidate heads of each word preserved for B&B parsing. †Their speed is not directly comparable as they perform labeled parsing without pruning.

                                 PTB             CTB
  System                         UAS     w/s     UAS     w/s
  Ours (no prune)                93.18   52      87.28   73
  Ours (k = 20)                  93.17   105     87.28   76
  Ours (k = 10)                  93.17   177     87.25   97
  Ours (k = 5)                   93.10   264     86.94   108
  Ours (k = 3)                   92.68   493     85.76   128
  TurboParser (full)             92.82   402     86.05   192
  TurboParser (standard)         92.68   638     85.80   283
  TurboParser (basic)            90.97   4670    82.28   2736
  Zhang and McDonald (2012)†     93.06   220     86.87   N/A

For each node of the B&B tree, our parsing algorithm uses the projected sub-gradient method to find the saddle point, which requires a number of calls to the DP; hence the efficiency of Algorithm 1 is mainly determined by the number of DP calls. Figure 4 and Figure 5 show the averaged parsing time and the number of calls to the DP relative to the sentence length under different pruning settings.
Parsing time grows smoothly when the sentence length is at most 40. There is some fluctuation for long sentences; this is because there are very few sentences of any specific long length (usually 1 or 2 sentences), and the statistics are not stable or meaningful for such small samples.

Without pruning, there are in total 132,161 calls to parse the 2,416 English sentences, that is, 54.7 calls per sentence on average. For Chinese, there are 84,645 calls for 1,910 sentences, i.e., 44.3 calls per sentence on average.

[Figure 4: Averaged parsing time (seconds) relative to sentence length with different pruning settings (k = 3, 5, 10, 20, no prune); k denotes the number of candidate heads of each word in the pruning step. Panels: (a) PTB corpus, (b) CTB corpus.]

[Figure 5: Averaged number of calls to DP relative to sentence length with different pruning settings (k = 3, 5, 10, 20, no prune); k denotes the number of candidate heads of each word in the pruning step. Panels: (a) PTB corpus, (b) CTB corpus.]

5 Discussion

5.1 Polynomial Non-local Factors

Our bounding strategy can handle a family of non-local factors that can be expressed as a polynomial function of local factors. To see this, suppose

    z_c = ∑_i α_i ∏_{e∈E_i} z_e

For each i, we introduce a new variable z_{E_i} = min_{e∈E_i} z_e. Because z_e is binary, z_{E_i} = ∏_{e∈E_i} z_e. In this way, we replace z_c by several z_{E_i} that can be handled by our bounding strategy.

We give two examples of such polynomial non-local factors. The first is the OR of local factors, z_c = max{z_e, z_{e′}}, which can be expressed as z_c = z_e + z_{e′} − z_e z_{e′}. The second is the valency feature (Martins et al., 2009). Let the binary variable v_{ik} indicate whether word i has k modifiers. Given {z_e} for the edges with head i, the values {v_{ik} | k = 1, ..., n−1} can be solved from

    ∑_k k^j v_{ik} = ( ∑_e z_e )^j,   0 ≤ j ≤ n − 1

The left hand side of the equation is a linear function of the v_{ik}, while the right hand side is a polynomial function of the z_e. Hence v_{ik} can be expressed as a polynomial function of the z_e.

5.2 k Best Parsing

Though our B&B algorithm is able to capture a variety of non-local features, it is still difficult to handle some kinds of features, such as the depth of the parse tree. Hence a reranking approach may be useful for incorporating such information: k parse trees are generated first, and then a second pass model reranks these candidates based on more global or non-local features. In addition, k best parsing may be needed in many applications in order to use parse information, and especially to utilize information from multiple candidates to optimize task-specific performance. We have not conducted any experiments on k best parsing, hence we only discuss the algorithm. According to Proposition 1, we have:

Proposition 2. Given p and a subset S ⊆ Z, let z_k denote the kth best solution of max_{z∈S} ψ(p,z). If a parse tree z′ ∈ S satisfies ϕ(z′) ≥ ψ(p,z_k), then z′ is one of the k best parse trees in subset S.

Proof. Since z_k is the kth best solution of ψ(p,z), for any z_j with j > k we have ψ(p,z_k) ≥ ψ(p,z_j) ≥ ϕ(z_j).
Since the set {z_j | j > k} has size |S| − k, there are at least |S| − k parse trees whose scores ϕ(z_j) are no greater than ψ(p,z_k). Because ϕ(z′) ≥ ψ(p,z_k), z′ is at least the kth best parse tree in subset S.

Therefore, we can search for the k best parse trees in the following way: for each sub-problem, we use DP to derive its k best parse trees; each parse tree z with ϕ(z) ≥ ψ(p,z_k) is selected into the k best set. The algorithm terminates when the kth bound is tight.

6 Conclusion

In this paper we proposed a new parsing algorithm based on a Branch and Bound framework. The motivation is to use dynamic programming to search for the bound. Experimental results on the PTB and CTB5 datasets show that our method is competitive in terms of both performance and efficiency. Our method can be adapted to non-projective dependency parsing, as well as to k best parsing using the k best MST algorithm (Hall, 2007) to find the k best candidates.

Acknowledgments

We'd like to thank Hao Zhang, Andre Martins and Zhenghua Li for their helpful discussions. We also thank Ryan McDonald and three anonymous reviewers for their valuable comments. This work is partly supported by DARPA under Contract No. HR0011-12-C-0016 and FA8750-13-2-0041. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References

Michael Auli and Adam Lopez. 2011. A comparison of loopy belief propagation and dual decomposition for integrated CCG supertagging and parsing. In Proc. of ACL-HLT.

Egon Balas. 1965. An additive algorithm for solving linear programs with zero-one variables. Operations Research, 39(4).

Bernd Bohnet and Jonas Kuhn. 2012. The best of both worlds – a graph-based completion model for transition-based parsers. In Proc. of EACL.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proc. of EMNLP-CoNLL.

Paul Calamai and Jorge Moré. 1987. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39(1).

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. of EMNLP-CoNLL.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proc. of ACL.

Wenliang Chen, Min Zhang, and Haizhou Li. 2012. Utilizing dependency language models for graph-based dependency parsing models. In Proc. of ACL.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.

Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In Proc. of ICML.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: an exploration. In Proc. of COLING.

Keith Hall. 2007. K-best spanning tree parsing. In Proc. of ACL.

Liang Huang and David Chiang. 2005. Better k-best parsing. In Proc. of IWPT.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. of ACL-HLT.

Manfred Klenner and Étienne Ailloud. 2009. Optimization in coreference resolution is not needed: A nearly-optimal algorithm with intensional constraints. In Proc. of EACL.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proc. of ACL.
Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proc. of ACL-HLT.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proc. of EMNLP.

Ailsa H. Land and Alison G. Doig. 1960. An automatic method of solving discrete programming problems. Econometrica, 28(3):497–520.

Andre Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proc. of ACL.

Andre Martins, Noah Smith, Mario Figueiredo, and Pedro Aguiar. 2011. Dual decomposition with many overlapping components. In Proc. of EMNLP.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proc. of ACL.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.

Joakim Nivre and Ryan McDonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proc. of ACL-HLT.

Jorge Nocedal and Stephen J. Wright. 2006. Numerical Optimization. Springer, 2nd edition.

Sebastian Riedel and James Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proc. of EMNLP.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proc. of EMNLP.

David Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proc. of EMNLP.

Min Sun, Murali Telaprolu, Honglak Lee, and Silvio Savarese. 2012. Efficient and exact MAP-MRF inference using branch and bound. In Proc. of AISTATS.

Jun Suzuki, Hideki Isozaki, Xavier Carreras, and Michael Collins. 2009. An empirical study of semi-supervised structured conditional models for dependency parsing. In Proc. of EMNLP.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. of IWPT.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proc. of EMNLP.

Hao Zhang and Ryan McDonald. 2012. Generalized higher-order dependency parsing with cube pruning. In Proc. of EMNLP.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proc. of ACL-HLT.