Transactions of the Association for Computational Linguistics, 1 (2013) 37–48. Action Editor: Ryan McDonald. Submitted 11/2012; Revised 2/2013; Published 3/2013. © 2013 Association for Computational Linguistics.

Branch and Bound Algorithm for Dependency Parsing with Non-local Features

Xian Qian and Yang Liu
Computer Science Department
The University of Texas at Dallas
{qx,yangl}@hlt.utdallas.edu

Abstract

Graph based dependency parsing is inefficient when handling non-local features due to the high computational complexity of inference. In this paper, we propose an exact and efficient decoding algorithm based on the Branch and Bound (B&B) framework, in which non-local features are bounded by a linear combination of local features. Dynamic programming is used to search for the upper bound. Experiments are conducted on the English PTB and Chinese CTB datasets. We achieve competitive Unlabeled Attachment Scores (UAS) when no additional resources are available: 93.17% for English and 87.25% for Chinese. Parsing speed is 177 words per second for English and 97 words per second for Chinese. Our algorithm is general and can be adapted to non-projective dependency parsing or other graphical models.

1 Introduction

For graph based projective dependency parsing, dynamic programming (DP) is popular for decoding due to its efficiency when handling local features. It performs cubic time parsing for arc-factored models (Eisner, 1996; McDonald et al., 2005a) and biquadratic time for higher order models with richer sibling and grandchild features (Carreras, 2007; Koo and Collins, 2010). However, for models with general non-local features, DP is inefficient.

There have been numerous studies on global inference algorithms for general higher order parsing. One popular approach is reranking (Collins, 2000; Charniak and Johnson, 2005; Hall, 2007). It typically has two steps: a low level classifier generates the top k hypotheses using local features, then a high level classifier reranks these candidates using global features. Since the reranking quality is bounded by the oracle performance of the candidates, some work has combined the candidate generation and reranking steps using cube pruning (Huang, 2008; Zhang and McDonald, 2012) to achieve higher oracle performance. These methods parse a sentence in bottom up order and keep the top k derivations for each span using k best parsing (Huang and Chiang, 2005). After merging two spans, non-local features are used to rerank the top k combinations. This approach is very efficient and flexible in handling various non-local features. The disadvantage is that it tends to compute non-local features as early as possible so that the decoder can utilize that information at internal spans; hence it may miss long historical features such as long dependency chains.

Smith and Eisner modeled dependency parsing using Markov Random Fields (MRFs) with global constraints and applied loopy belief propagation (LBP) for approximate learning and inference (Smith and Eisner, 2008). Similar work was done for Combinatorial Categorial Grammar (CCG) parsing (Auli and Lopez, 2011). They used posterior marginal beliefs for inference to satisfy the tree constraint: for each factor, only legal messages (those satisfying the global constraints) are considered in the partition function.

A similar line of research investigated the use of integer linear programming (ILP) based parsing (Riedel and Clarke, 2006; Martins et al., 2009). This method is very expressive.
It can handle arbitrary non-local features determined or bounded by linear inequalities of local features. For local models, LP is less efficient than DP. The reason is that DP works on a small number of dimensions in each recursion, while for LP the popular revised simplex method needs to solve an m dimensional linear system in each iteration (Nocedal and Wright, 2006), where m is the number of constraints, which is quadratic in the sentence length for projective dependency parsing (Martins et al., 2009).

Dual Decomposition (DD) (Rush et al., 2010; Koo et al., 2010) is a special case of Lagrangian relaxation. It relies on standard decoding algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. This method does not need to consider the tree constraint explicitly, as it resorts to dynamic programming, which guarantees its satisfaction. It works well if the sub-problems can be well defined, especially for joint learning tasks. However, for the task of dependency parsing, using various non-local features may result in many overlapping sub-problems, hence it may take a long time to reach a consensus (Martins et al., 2011).

In this paper, we propose a novel Branch and Bound (B&B) algorithm for efficient parsing with various non-local features. B&B (Land and Doig, 1960) is generally used for combinatorial optimization problems such as ILP. The difference between our method and ILP is that the sub-problem in ILP is a relaxed LP, which requires a numerical solution, while ours bounds the non-local features by a linear combination of local features and uses DP both for decoding and for calculating the upper bound of the objective function. An exact solution is achieved if the bound is tight. Though in the worst case the time complexity is exponential in the sentence length, the algorithm is efficient in practice, especially when a pruning strategy is adopted.

Experiments are conducted on the English Penn Tree Bank and Chinese Tree Bank 5 (CTB5) with the standard train/develop/test split. We achieved 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second.

2 Graph Based Parsing

2.1 Problem Definition

Given a sentence x = x_1, x_2, ..., x_n where x_i is the ith word of the sentence, dependency parsing assigns exactly one head word to each word, so that dependencies from head words to modifiers form a tree. The root of the tree is a special symbol denoted by x_0, which has exactly one modifier. In this paper, we focus on unlabeled projective dependency parsing, but our algorithm can be adapted for labeled or non-projective dependency parsing (McDonald et al., 2005b).

The inference problem is to search for the optimal parse tree y*:

    y* = arg max_{y∈Y(x)} ϕ(x,y)

where Y(x) is the set of all candidate parse trees of sentence x. Here ϕ(x,y) is a given score function, which is usually decomposed into small parts

    ϕ(x,y) = ∑_{c⊆y} ϕ_c(x)    (1)

where c is a subset of edges, called a factor. For example, in the all grandchild model (Koo and Collins, 2010), the score function can be represented as

    ϕ(x,y) = ∑_{e_hm∈y} ϕ_{e_hm}(x) + ∑_{e_gh,e_hm∈y} ϕ_{e_gh,e_hm}(x)

where the first term is the sum of the scores of all edges x_h → x_m, and the second term is the sum of the scores of all edge chains x_g → x_h → x_m.
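To make the decomposition in Eq(1) concrete, the following minimal Python sketch scores a fixed candidate tree under the all grandchild factorization. The containers edge_score and grand_score are hypothetical stand-ins for whatever stores the factor scores; they are not part of the paper's implementation.

```python
def tree_score(heads, edge_score, grand_score):
    """Score a parse tree under the all grandchild factorization (Eq. 1).

    heads[m] is the head index of word m (word indices start at 1; index 0
    is the artificial root x_0).  edge_score[(h, m)] holds the score of the
    single-edge factor x_h -> x_m, and grand_score[(g, h, m)] the score of
    the chain factor x_g -> x_h -> x_m.  Missing entries contribute 0.
    """
    total = 0.0
    for m in range(1, len(heads)):
        h = heads[m]
        total += edge_score.get((h, m), 0.0)        # first sum: single edges
        if h != 0:                                   # second sum: grandchild chains
            total += grand_score.get((heads[h], h, m), 0.0)
    return total

# a three-word chain x_0 -> x_1 -> x_2 -> x_3
print(tree_score([0, 0, 1, 2],
                 {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 0.5},
                 {(1, 2, 3): 0.3}))                  # prints 2.3
```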
In discriminative models, the score of a parse tree y is the weighted sum of the fired feature functions, which can be represented as the sum of the factors

    ϕ(x,y) = w^T f(x,y) = ∑_{c⊆y} w^T f(x,c) = ∑_{c⊆y} ϕ_c(x)

where f(x,c) is the feature vector that depends on c. For example, we could define a feature for the grandchild factor c = {e_gh, e_hm}

    f(x,c) = 1  if x_g = would ∧ x_h = be ∧ x_m = happy ∧ c is selected
           = 0  otherwise

2.2 Dynamic Programming for Local Models

In first order models, all factors c in Eq(1) contain a single edge. The optimal parse tree can be derived by DP with running time O(n^3) (Eisner, 1996). The algorithm uses two types of structures: the complete span, which consists of a headword and its descendants on one side, and the incomplete span, which consists of a dependency and the region between the head and modifier. It starts from single word spans and merges spans in bottom up order.

For second order models, the score function ϕ(x,y) adds the scores of siblings (adjacent edges with a common head) and grandchildren

    ϕ(x,y) = ∑_{e_hm∈y} ϕ_{e_hm}(x) + ∑_{e_gh,e_hm∈y} ϕ_{e_gh,e_hm}(x) + ∑_{e_hm,e_hs∈y} ϕ_{e_hm,e_hs}(x)

There are two versions of second order models, used respectively by Carreras (2007) and Koo et al. (2010). The difference is that Carreras' model only considers the outermost grandchildren, while Koo and Collins' allows all grandchild features. Both models permit O(n^4) running time.

Third-order models score edge triples such as three adjacent sibling modifiers, or grand-siblings that score a word, its modifier and its adjacent grandchildren; the inference complexity is O(n^4) (Koo and Collins, 2010).

In this paper, we call all factors/features that can be handled by DP the local factors/features.

3 The Proposed Method

3.1 Basic Idea

For general high order models with non-local features, we propose to use a Branch and Bound (B&B) algorithm to search for the optimal parse tree. A B&B algorithm has two steps: branching and bounding. The branching step recursively splits the search space Y(x) into two disjoint subspaces Y(x) = Y_1 ∪ Y_2 by fixing the assignment of one edge. For each subspace Y_i, the bounding step calculates an upper bound on the optimal parse tree score in the subspace: UB_{Y_i} ≥ max_{y∈Y_i} ϕ(x,y). If this bound is no more than the score of some already obtained parse tree, UB_{Y_i} ≤ ϕ(x,y′), then no parse tree in subspace Y_i is better than y′, and Y_i can be pruned safely.

The efficiency of B&B depends on the branching strategy and the upper bound computation. For example, Sun et al. (2012) used B&B for MRFs, where they proposed two branching strategies and a novel data structure for efficient upper bound computation. Klenner and Ailloud (2009) proposed a variation of Balas' algorithm (Balas, 1965) for coreference resolution, where candidate branching variables are sorted by their weights.

Our bounding strategy is to find an upper bound for the score of each non-local factor c containing multiple edges. The bound is the sum of new scores of the edges in the factor plus a constant

    ϕ_c(x) ≤ ∑_{e∈c} ψ_e(x) + α_c

Based on the new scores {ψ_e(x)} and constants {α_c}, we define the new score of parse tree y

    ψ(x,y) = ∑_{c⊆y} ( ∑_{e∈c} ψ_e(x) + α_c )

Then we have

    ψ(x,y) ≥ ϕ(x,y),  ∀y ∈ Y(x)

The advantage of such a bound is that it is a sum of new edge scores. Hence its optimum tree max_{y∈Y(x)} ψ(x,y) can be found by DP, and this optimum is an upper bound on max_{y∈Y(x)} ϕ(x,y), since ψ(x,y) ≥ ϕ(x,y) for any y ∈ Y(x).
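Since both the local baseline and the bound computation above rely on the same decoder, here is a minimal sketch of the first order Eisner DP of Section 2.2, assuming a dense matrix of arc scores. It returns only the optimal score (back-pointers, which recover the tree itself, are omitted for brevity) and does not enforce the single-modifier constraint on the root.

```python
NEG_INF = float("-inf")

def eisner_best_score(score):
    """First order Eisner DP: best projective tree score.

    score[h][m] is the score of the arc h -> m; index 0 is the artificial root.
    comp/inc hold complete and incomplete spans; direction d = 1 means the
    head of the span is its left endpoint i, d = 0 means it is the right
    endpoint j.
    """
    n = len(score)
    comp = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        comp[i][i][0] = comp[i][i][1] = 0.0
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # attach one new arc between i and j on top of two complete halves
            best_halves = max(comp[i][r][1] + comp[r + 1][j][0] for r in range(i, j))
            inc[i][j][0] = best_halves + score[j][i]   # arc j -> i
            inc[i][j][1] = best_halves + score[i][j]   # arc i -> j
            # extend an incomplete span into a complete one
            comp[i][j][0] = max(comp[i][r][0] + inc[r][j][0] for r in range(i, j))
            comp[i][j][1] = max(inc[i][r][1] + comp[r][j][1] for r in range(i + 1, j + 1))
    return comp[0][n - 1][1]
```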
3.2 The Upper Bound Function

In this section, we derive the upper bound function ψ(x,y) described above. To simplify notation, we drop x throughout the rest of the paper. Let z_c be a binary variable indicating whether factor c is selected in the parse tree. We reformulate the score function in Eq(1) as

    ϕ(y) ≡ ϕ(z) = ∑_c ϕ_c z_c    (2)

Correspondingly, the tree constraint is replaced by z ∈ Z. Then the parsing task is

    z* = arg max_{z∈Z} ∑_c ϕ_c z_c    (3)

Notice that, for any z_c, we have

    z_c = min_{e∈c} z_e

which means that factor c appears in the parse tree if and only if all of its edges {e | e ∈ c} are selected in the tree. Here z_e is short for z_{{e}}, the factor consisting of the single edge e.

Our bounding method is based on the following fact: for a set {a_1, a_2, ..., a_r} (a_j denotes the jth element), its minimum is

    min_j {a_j} = min_{p∈∆} ∑_j p_j a_j    (4)

where ∆ is the probability simplex

    ∆ = {p | p_j ≥ 0, ∑_j p_j = 1}

We discuss the bound for ϕ_c z_c in two cases: ϕ_c ≥ 0 and ϕ_c < 0.

If ϕ_c ≥ 0, we have

    ϕ_c z_c = ϕ_c min_{e∈c} z_e = ϕ_c min_{p_c∈∆} ∑_{e∈c} p_c^e z_e = min_{p_c∈∆} ∑_{e∈c} ϕ_c p_c^e z_e

The second equation comes from Eq(4). For simplicity, let

    g_c(p_c, z) = ∑_{e∈c} ϕ_c p_c^e z_e

with domain dom g_c = {p_c ∈ ∆; z_e ∈ {0,1}, ∀e ∈ c}. Then we have

    ϕ_c z_c = min_{p_c} g_c(p_c, z)    (5)

If ϕ_c < 0, we have two upper bounds. One is commonly used in ILP when all the variables are binary:

    a* = min_j {a_j}_{j=1}^r  ⟺  a* ≤ a_j  and  a* ≥ ∑_j a_j − (r − 1)

According to the last inequality, we have the following upper bound for negatively scored factors

    ϕ_c z_c ≤ ϕ_c ( ∑_{e∈c} z_e − (r_c − 1) )    (6)

where r_c is the number of edges in c. For simplicity, we use the notation

    σ_c(z) = ϕ_c ( ∑_{e∈c} z_e − (r_c − 1) )

The other upper bound when ϕ_c < 0 is simply

    ϕ_c z_c ≤ 0    (7)

Notice that, for any parse tree, one of the two upper bounds must be tight: Eq(6) is tight if c appears in the parse tree (z_c = 1), and otherwise Eq(7) is tight. Therefore

    ϕ_c z_c = min {σ_c(z), 0}

Let

    h_c(p_c, z) = p_c^1 σ_c(z) + p_c^2 · 0

with dom h_c = {p_c ∈ ∆; z_e ∈ {0,1}, ∀e ∈ c}. According to Eq(4), we have

    ϕ_c z_c = min_{p_c} h_c(p_c, z)    (8)

Let

    ψ(p,z) = ∑_{c: ϕ_c≥0} g_c(p_c, z) + ∑_{c: ϕ_c<0} h_c(p_c, z)

Minimizing ψ with respect to p, we have

    min_p ψ(p,z) = min_p [ ∑_{c: ϕ_c≥0} g_c(p_c,z) + ∑_{c: ϕ_c<0} h_c(p_c,z) ]
                 = ∑_{c: ϕ_c≥0} min_{p_c} g_c(p_c,z) + ∑_{c: ϕ_c<0} min_{p_c} h_c(p_c,z)
                 = ∑_{c: ϕ_c≥0} ϕ_c z_c + ∑_{c: ϕ_c<0} ϕ_c z_c
                 = ϕ(z)

The second equation holds because, for any two factors c and c′, g_c (or h_c) and g_{c′} (or h_{c′}) are separable. The third equation comes from Eq(5) and Eq(8). Based on this, we have the following proposition:

Proposition 1. For any p with p_c ∈ ∆, and any z ∈ Z, ψ(p,z) ≥ ϕ(z).

Therefore, ψ(p,z) is an upper bound function of ϕ(z). Furthermore, for fixed p, ψ(p,z) is a linear function of the variables z_e (see Eq(5) and Eq(8)); the variables z_c for larger factors have been eliminated. Hence z′ = arg max_z ψ(p,z) can be solved efficiently by DP. Because

    ψ(p,z′) ≥ ψ(p,z*) ≥ ϕ(z*) ≥ ϕ(z′)

after obtaining z′ we get both an upper bound and a lower bound on ϕ(z*): ψ(p,z′) and ϕ(z′).

The upper bound should be as tight as possible. Using the min-max inequality, we get

    max_{z∈Z} ϕ(z) = max_{z∈Z} min_p ψ(p,z) ≤ min_p max_{z∈Z} ψ(p,z)

so min_p max_{z∈Z} ψ(p,z) provides the tightest upper bound on ϕ(z*) of this form.

Since ψ is not differentiable with respect to p, the projected sub-gradient method (Calamai and Moré, 1987; Rush et al., 2010) is used to search for the saddle point. More specifically, in each iteration we first fix p and search for z using DP, then we fix z and update p by

    p_new = P_∆ ( p + α ∂ψ/∂p )

where α > 0 is the step size in the line search and P_∆(q) denotes the projection of q onto the probability simplex ∆. In this paper, we use the Euclidean projection, that is,

    P_∆(q) = arg min_{p∈∆} ∥p − q∥^2

which can be solved efficiently by sorting (Duchi et al., 2008).
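Both operations used above are straightforward to implement. The sketch below assumes illustrative containers only: local_score is a dense matrix of first order arc scores, factors is a list of (edges, weight) pairs with edges a tuple of (h, m) arcs and weight = ϕ_c, and p holds one simplex vector per factor (one entry per edge if ϕ_c ≥ 0, two entries otherwise); none of these are the paper's actual data structures.

```python
def project_to_simplex(q):
    """Euclidean projection of q onto the probability simplex, computed by
    sorting (the method of Duchi et al., 2008)."""
    u = sorted(q, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        if uj - (cumulative - 1.0) / j > 0:
            theta = (cumulative - 1.0) / j
    return [max(qi - theta, 0.0) for qi in q]

def psi_edge_scores(local_score, factors, p):
    """For fixed p, rewrite psi(p, z) as modified edge scores plus a constant,
    so that max_z psi(p, z) can be computed by the first order DP."""
    mod = [row[:] for row in local_score]
    const = 0.0
    for (edges, w), pc in zip(factors, p):
        if w >= 0:                                  # g_c: spread phi_c over the edges
            for (h, m), pe in zip(edges, pc):
                mod[h][m] += w * pe
        else:                                       # h_c: p_c^1 * sigma_c(z)
            for h, m in edges:
                mod[h][m] += pc[0] * w
            const -= pc[0] * w * (len(edges) - 1)
    return mod, const
```

Running the Eisner sketch above on the modified scores and adding the constant gives max_z ψ(p,z), an upper bound on ϕ(z*) for the current p; the projection keeps each p_c on the simplex after a sub-gradient step.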
3.3 Branch and Bound Based Parsing

As discussed in Section 3.1, the B&B recursive procedure yields a binary tree structure called the Branch and Bound tree. Each node of the B&B tree has some fixed z_e, specifying some must-select edges and some must-remove edges. The root of the B&B tree has no constraints, so it can produce all possible parse trees, including z*. Each node has two children: one adds a constraint z_e = 1 for a free edge e, and the other adds z_e = 0.

[Figure 1: an example Branch and Bound tree. Each level branches on one edge variable (z_e1, z_e2, ...), and each node records its upper bound ψ and score ϕ.]

A node can be pruned only when its upper bound does not exceed the current lower bound LB; to make such pruning effective, we need to choose a good branching variable to generate the child nodes. Let G(z′) = ψ(p′,z′) − ϕ(z′) denote the gap between the upper bound and the lower bound. This gap is the accumulation of the gaps of all factors c. Let G_c be the gap of c:

    G_c = g_c(p′_c, z′) − ϕ_c z′_c   if ϕ_c ≥ 0
    G_c = h_c(p′_c, z′) − ϕ_c z′_c   if ϕ_c < 0

We choose the branching variable heuristically: for each edge e, we define its gap as the sum of the gaps of the factors that contain it

    G_e = ∑_{c: e∈c} G_c

The edge with the maximum gap is selected as the branching variable. Suppose there are N nodes on a level of the B&B tree; correspondingly, we get N branching variables, among which we choose the one with the highest lower bound, as it is likely to reach the optimal value faster.

3.4 Lower Bound Initialization

A large lower bound is critical for efficient pruning. In this section, we discuss an alternative way to initialize the lower bound LB. We apply a similar trick to obtain a lower bound function of ϕ(z).

Similar to Eq(8), for ϕ_c ≥ 0 we have

    ϕ_c z_c = max { ϕ_c ( ∑_{e∈c} z_e − (r_c − 1) ), 0 } = max { σ_c(z), 0 }

Using the fact that

    max_j {a_j} = max_{p∈∆} ∑_j p_j a_j

we have

    ϕ_c z_c = max_{p_c∈∆} p_c^1 σ_c(z) + p_c^2 · 0 = max_{p_c} h_c(p_c, z)

For ϕ_c < 0, we have

    ϕ_c z_c = max_{e∈c} { ϕ_c z_e } = max_{p_c∈∆} ∑_{e∈c} p_c^e ϕ_c z_e = max_{p_c} g_c(p_c, z)

Putting the two cases together, we get the lower bound function

    π(p,z) = ∑_{c: ϕ_c≥0} h_c(p_c, z) + ∑_{c: ϕ_c<0} g_c(p_c, z)

For any p with p_c ∈ ∆ and any z ∈ Z,

    π(p,z) ≤ ϕ(z)

π(p,z) is not concave; however, we can alternately optimize z and p to get a good approximation, which provides a lower bound for ϕ(z*).

Algorithm 1: Branch and Bound based parsing
Require: {ϕ_c}
Ensure: optimal parse tree z*
  Solve p*, z* = arg max_{p,z} π(p,z)
  Initialize 𝒮 = {Z}, LB = π(p*, z*)
  while 𝒮 ≠ ∅ do
    Set 𝒮′ = ∅   {nodes that survive pruning}
    for each S ∈ 𝒮: solve min_p max_{z∈S} ψ(p,z) to get LB_S, UB_S
    LB = max{LB, max_{S∈𝒮} LB_S}; update z*
    for each S ∈ 𝒮: add S to 𝒮′ if UB_S > LB
    Select a branching variable z_e; clear 𝒮 = ∅
    for each S ∈ 𝒮′:
      add S_1 = {z ∈ S | z_e = 1} to 𝒮
      add S_2 = {z ∈ S | z_e = 0} to 𝒮
  end while

3.5 Summary

We summarize our B&B algorithm in Algorithm 1. It is worth pointing out that the description so far has assumed that the backbone DP uses a first order model; however, the backbone DP can also be the second or third order version. The difference is that, for a higher order DP, higher order factors such as adjacent siblings and grandchildren are handled directly as local factors.

In the worst case, all the edges are selected for branching, and the complexity grows exponentially in the sentence length. However, in practice the algorithm is quite efficient, as we will show in the next section.
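Putting the pieces together, a single node of the B&B tree can compute its bounds roughly as follows. This is a sketch only: it assumes the helper functions from the previous sketches, a hypothetical decode_tree(edge_scores) that returns both the best score and the head array (the Eisner DP extended with back-pointers), a simple diminishing step size rather than the line search strategy of Rush et al. (2010), and that the edges fixed by the node are already encoded in local_score (e.g. a forbidden arc scored with −∞).

```python
def node_bounds(local_score, factors, rounds=50, step0=1.0, eps=1e-8):
    """Approximate min_p max_z psi(p, z) for one B&B node.

    Returns (UB, LB, best_heads): the tightest upper bound seen, the best
    lower bound phi(z') over the decoded trees, and the corresponding tree.
    """
    n = len(local_score)
    # initialisation: uniform p_c for non-negative factors, p_c = (0, 1) otherwise
    p = [[1.0 / len(edges)] * len(edges) if w >= 0 else [0.0, 1.0]
         for edges, w in factors]
    ub, lb, best = float("inf"), float("-inf"), None
    for t in range(rounds):
        mod, const = psi_edge_scores(local_score, factors, p)
        dp_score, heads = decode_tree(mod)            # z' = argmax_z psi(p, z)
        ub = min(ub, dp_score + const)                # psi(p, z') >= phi(z*)
        in_tree = {(heads[m], m) for m in range(1, n)}
        true = sum(local_score[heads[m]][m] for m in range(1, n)) + \
               sum(w for edges, w in factors if all(e in in_tree for e in edges))
        if true > lb:                                 # phi(z') <= phi(z*)
            lb, best = true, heads
        if ub - lb < eps:                             # bound is tight
            break
        alpha = step0 / (1.0 + t)
        for i, (edges, w) in enumerate(factors):      # sub-gradient of psi w.r.t. p_c
            if w >= 0:
                grad = [w * (1.0 if e in in_tree else 0.0) for e in edges]
            else:
                sigma = w * (sum(e in in_tree for e in edges) - (len(edges) - 1))
                grad = [sigma, 0.0]
            # projected step on p for the outer minimisation (descent direction)
            p[i] = project_to_simplex([pi - alpha * g for pi, g in zip(p[i], grad)])
    return ub, lb, best
```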
4 Experiments

4.1 Experimental Settings

The datasets we used are the English Penn Tree Bank (PTB) and Chinese Tree Bank 5.0 (CTB5). We use the standard train/develop/test split as described in Table 1. We extracted dependencies using Joakim Nivre's Penn2Malt tool with standard head rules: Yamada and Matsumoto's (Yamada and Matsumoto, 2003) for English, and Zhang and Clark's (Zhang and Clark, 2008) for Chinese. Unlabeled attachment score (UAS) is used to evaluate parsing quality.[1] The B&B parser is implemented in C++. All experiments are conducted on a platform with an Intel Core i5-2500 CPU at 3.30GHz.

[1] For English, we follow Koo and Collins (2010) and ignore any word whose gold-standard POS tag is one of { “ ” : , . }. For Chinese, we ignore any word whose POS tag is PU.

Table 1: Data split in our experiment

           Train                      Develop                  Test
  PTB      sec. 2-21                  sec. 22                  sec. 23
  CTB5     sec. 001-815, 1001-1136    sec. 886-931, 1148-1151  sec. 816-885, 1137-1147

4.2 Baseline: DP Based Second Order Parser

We use the dynamic programming based second order parser (Carreras, 2007) as the baseline. The averaged structured perceptron (Collins, 2002) is used for parameter estimation. We determine the number of iterations on the validation set; it is 6 for both corpora.

For English, we train the POS tagger using a linear chain perceptron on the training set and predict POS tags for the development and test data. The parser is trained using the automatic POS tags generated by 10 fold cross validation. For Chinese, we use the gold standard POS tags.

We use five types of features: unigram features, bigram features, in-between features, adjacent sibling features and outermost grandchild features. The first three types of features were first introduced by McDonald et al. (2005a), and the last two types are used by Carreras (2007). All features are concatenations of surrounding words, lower cased words (English only), word length (Chinese only), prefixes and suffixes of words (Chinese only), POS tags, coarse POS tags derived from POS tags using a simple mapping table, the distance between head and modifier, and the direction of the edge. For English, we used 674 feature templates to generate a large number of features and finally obtained 86.7M non-zero weighted features after training; the baseline parser achieves 92.81% UAS on the test set. For Chinese, we used 858 feature templates and finally obtained 71.5M non-zero weighted features after training; the baseline parser achieves 86.89% UAS on the test set.

4.3 B&B Based Parser with Non-local Features

We use the baseline parser as the backbone of our B&B parser. We tried different types of non-local features, as listed below:

• All grandchild features. Notice that these features can be handled directly by Koo's second order model (Koo and Collins, 2010).

• All great grandchild features.

• All sibling features: all pairs of edges with a common head (see the sketch after this list). An example is shown in Figure 2.

• All tri-sibling features: all 3-tuples of edges with a common head.

• Comb features: for any word with more than 3 consecutive modifiers, the set of all edges from the word to the modifiers forms a comb.[2]

• Hand crafted features: we performed cross validation on the training data using the baseline parser and designed features that may correct the most common errors. We designed 13 hand crafted features for English in total; one example is shown in Figure 3. For Chinese, we did not add any hand crafted features, as the errors in the cross validation results vary a lot and we did not find general patterns to fix them.

[2] In fact, our algorithm can deal with non-consecutive modifiers; however, in such cases, factor detection (detecting regular expressions like x_1.* x_2.* ...) requires the longest common subsequence algorithm (LCS), which is time-consuming if many comb features are generated. Similar problems arise for sub-tree features, which may contain many non-consecutive words.
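The following small sketch shows how the all-sibling factors of a candidate tree could be enumerated; the head-array representation is illustrative only. Unlike the adjacent-sibling parts of second order DP parsers, the two modifiers in a factor may lie on opposite sides of the head, as in Figure 2.

```python
from collections import defaultdict
from itertools import combinations

def all_sibling_factors(heads):
    """Enumerate all-sibling factors: every pair of edges sharing a head.

    heads[m] is the head of word m (index 0 is the root and has no head).
    Each factor is returned as a tuple of two (h, m) edges.
    """
    modifiers = defaultdict(list)
    for m in range(1, len(heads)):
        modifiers[heads[m]].append(m)
    factors = []
    for h, mods in modifiers.items():
        for m1, m2 in combinations(mods, 2):   # both sides of the head are allowed
            factors.append(((h, m1), (h, m2)))
    return factors

# a head (word 2) with modifiers on both sides: heads of words 1..4 are 2, 0, 2, 2
print(all_sibling_factors([0, 2, 0, 2, 2]))
# [((2, 1), (2, 3)), ((2, 1), (2, 4)), ((2, 3), (2, 4))]
```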
4.4 Implementation Details

To speed up the solution of the min-max subproblem, for each node in the B&B tree we initialize p with the optimal solution of its parent node; since a child node fixes only one additional edge, its optimal point is likely to be close to its parent's. For the root node of the B&B tree, we initialize p_c^e = 1/r_c for factors with non-negative weights and p_c^1 = 0 for negatively weighted factors.

The step size α is initialized with max_{c: ϕ_c≠0} {1/|ϕ_c|}, as the vector p is bounded in a unit box. α is updated using the same strategy as Rush et al. (2010). Two stopping criteria are used. One is 0 ≤ ψ_old − ψ_new ≤ ϵ, where ϵ > 0 is a given precision.[3] The other checks whether the bound is tight: UB = LB. Because all features are boolean (note that they could be integer valued), their weights are integers during each perceptron update, hence the scores of parse trees are discrete. The minimal gap between different scores is 1/(N×T) after averaging, where N is the number of training samples and T is the number of iterations of perceptron training. Therefore the upper bound can be tightened to UB = ⌊NTψ⌋/(NT).

[3] We use ϵ = 10^{-8} in our implementation.

During testing, we use the pre-pruning method of Martins et al. (2009) on both datasets to balance parsing quality and speed. This method uses a simple classifier to select the top k candidate heads for each word and excludes the other heads from the search space. In our experiments, we set k = 10.

[Figure 2: An example of all sibling features. Top: a sub-tree (head h with modifiers c0, c1, c2, c3); bottom: the extracted sibling factors. Existing higher order DP systems cannot handle siblings on both sides of the head.]

[Figure 3: An example of a hand crafted feature, illustrated on "regulation occurs through inaction, rather than through ...": for the word sequence A ... rather than A, where A is a preposition, the first A is the head of than, and than is the head of rather and of the second A.]

4.5 Main Result

Experimental results are listed in Table 2. For comparison, we also include results of representative state-of-the-art systems.

Table 2: Comparison between our system and state-of-the-art systems (UAS).

  System                                                  PTB     CTB
  Our baseline                                            92.81   86.89
  B&B
    +all grand-child                                      92.97   87.02
    +all great grand-child                                92.78   86.77
    +all sibling                                          93.00   87.05
    +all tri-sibling                                      92.79   86.81
    +comb                                                 92.86   86.91
    +hand craft                                           92.89   N/A
    +all grand-child +all sibling +comb +hand craft       93.17   87.25
  3rd order re-impl.                                      93.03   87.07
  TurboParser (reported)                                  92.62   N/A
  TurboParser (our run)                                   92.82   86.05
  Koo and Collins (2010)                                  93.04   N/A
  Zhang and McDonald (2012)                               93.06   86.87
  Zhang and Nivre (2011)                                  92.90   86.00
  System integration
    Bohnet and Kuhn (2012)                                93.39   87.5
  Systems using additional resources
    Suzuki et al. (2009)                                  93.79   N/A
    Koo et al. (2008)                                     93.5    N/A
    Chen et al. (2012)                                    92.76   N/A

For the third order parser, we re-implemented Model 1 of Koo and Collins (2010) and removed the longest sentence in the CTB dataset, which contains 240 words, due to the O(n^4) space complexity.[4] For ILP based parsing, we used TurboParser[5], a speed-optimized parser toolkit.
We trained its full models (which use all grandchild features, all sibling features and head bigram features (Martins et al., 2011)) for both datasets using its default settings. We also list the performance reported in its documentation for the English corpus.

[4] In fact, Koo's algorithm requires only O(n^3) space. Our implementation is O(n^4) because we store the feature vectors for fast training.
[5] http://www.ark.cs.cmu.edu/TurboParser/

One observation is that the all-sibling features are the most helpful for our parser, since some good sibling features cannot be encoded in a DP based parser. For example, a matched pair of parentheses are always siblings, but their head may lie between them. Another observation is that the all great grandchild features and the all tri-sibling features slightly hurt the performance, and we excluded them from the final system.

When no additional resource is available, our parser achieved competitive performance: 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second. Higher UAS has been reported by joint tagging and parsing (Bohnet and Nivre, 2012) and by system integration (Bohnet and Kuhn, 2012), which benefits from both transition based parsing and graph based parsing. Previous work shows that the combination of the two parsing techniques can learn to overcome the shortcomings of each non-integrated system (Nivre and McDonald, 2008; Zhang and Clark, 2008). System combination will be an interesting topic for our future research. The highest reported performance on the English corpus is 93.79%, obtained by semi-supervised learning with a large amount of unlabeled data (Suzuki et al., 2009).

4.6 Tradeoff Between Accuracy and Speed

In this section, we study the tradeoff between accuracy and speed under different pre-pruning setups. Table 3 shows the parsing accuracy and inference time in the testing stage with different numbers of candidate heads k in the pruning step. We can see that, on the English dataset, when k ≥ 10 our parser gains a 2-3 times speedup without losing much parsing accuracy. There is a further increase in speed with smaller k, at the cost of some accuracy. Compared with TurboParser, our parser is less efficient but more accurate. Zhang and McDonald (2012) is a state-of-the-art system that adopts cube pruning for efficient parsing. Notice that they did not use pruning, which seems to increase parsing speed with little loss in accuracy; moreover, they did labeled parsing, which also makes their speed not directly comparable.

Table 3: Tradeoff between parsing accuracy (UAS) and speed (words per second, w/s) with different pre-pruning settings. k denotes the number of candidate heads of each word preserved for B&B parsing. †Their speed is not directly comparable as they perform labeled parsing without pruning.

                                 PTB             CTB
  System                         UAS     w/s     UAS     w/s
  Ours (no prune)                93.18   52      87.28   73
  Ours (k = 20)                  93.17   105     87.28   76
  Ours (k = 10)                  93.17   177     87.25   97
  Ours (k = 5)                   93.10   264     86.94   108
  Ours (k = 3)                   92.68   493     85.76   128
  TurboParser (full)             92.82   402     86.05   192
  TurboParser (standard)         92.68   638     85.80   283
  TurboParser (basic)            90.97   4670    82.28   2736
  Zhang and McDonald (2012)†     93.06   220     86.87   N/A

For each node of the B&B tree, our parsing algorithm uses the projected sub-gradient method to find the saddle point, which requires a number of calls to the DP; hence the efficiency of Algorithm 1 is mainly determined by the number of DP calls. Figure 4 and Figure 5 show the averaged parsing time and the number of calls to the DP relative to the sentence length under different pruning settings.
Parsing time grows smoothly when the sentence length is at most 40. There is some fluctuation for long sentences; this is because there are very few sentences of any specific long length (usually 1 or 2 sentences), and the statistics are not stable or meaningful for such small samples.

Without pruning, there are in total 132,161 calls to parse the 2,416 English sentences, that is, 54.7 calls per sentence on average. For Chinese, there are 84,645 calls for 1,910 sentences, i.e., 44.3 calls per sentence on average.

[Figure 4: Averaged parsing time (seconds) relative to sentence length with different pruning settings (k = 3, 5, 10, 20, no prune); k denotes the number of candidate heads of each word in the pruning step. Panels: (a) PTB corpus, (b) CTB corpus.]

[Figure 5: Averaged number of calls to DP relative to sentence length with different pruning settings (k = 3, 5, 10, 20, no prune); k denotes the number of candidate heads of each word in the pruning step. Panels: (a) PTB corpus, (b) CTB corpus.]

5 Discussion

5.1 Polynomial Non-local Factors

Our bounding strategy can handle a family of non-local factors that can be expressed as a polynomial function of local factors. To see this, suppose

    z_c = ∑_i α_i ∏_{e∈E_i} z_e

For each i, we introduce a new variable z_{E_i} = min_{e∈E_i} z_e. Because z_e is binary, z_{E_i} = ∏_{e∈E_i} z_e. In this way, we replace z_c by several z_{E_i} that can be handled by our bounding strategy.

We give two examples of such polynomial non-local factors. The first is the OR of local factors, z_c = max{z_e, z_{e′}}, which can be expressed as z_c = z_e + z_{e′} − z_e z_{e′}. The second is the valency feature (Martins et al., 2009). Let the binary variable v_{ik} indicate whether word i has k modifiers. Given {z_e} for the edges with head i, the values {v_{ik} | k = 1, ..., n−1} can be solved from

    ∑_k k^j v_{ik} = ( ∑_e z_e )^j,   0 ≤ j ≤ n − 1

The left hand side of the equation is a linear function of the v_{ik}, while the right hand side is a polynomial function of the z_e. Hence v_{ik} can be expressed as a polynomial function of the z_e.

5.2 k Best Parsing

Though our B&B algorithm is able to capture a variety of non-local features, it is still difficult to handle some kinds of features, such as the depth of the parse tree. Hence a reranking approach may be useful for incorporating such information: k parse trees are generated first, and then a second pass model reranks these candidates based on more global or non-local features. In addition, k best parsing may be needed in many applications in order to use parse information, and especially to utilize information from multiple candidates to optimize task-specific performance. We have not conducted any experiments on k best parsing, hence we only discuss the algorithm. According to Proposition 1, we have:

Proposition 2. Given p and a subset S ⊆ Z, let z_k denote the kth best solution of max_{z∈S} ψ(p,z). If a parse tree z′ ∈ S satisfies ϕ(z′) ≥ ψ(p,z_k), then z′ is one of the k best parse trees in subset S.

Proof. Since z_k is the kth best solution of ψ(p,z), for any z_j with j > k we have ψ(p,z_k) ≥ ψ(p,z_j) ≥ ϕ(z_j).
Since the set {z_j | j > k} has size |S| − k, there are at least |S| − k parse trees whose scores ϕ(z_j) are no greater than ψ(p,z_k). Because ϕ(z′) ≥ ψ(p,z_k), z′ is at least the kth best parse tree in subset S.

Therefore, we can search for the k best parse trees in the following way: for each sub-problem, we use DP to derive its k best parse trees; each parse tree z with ϕ(z) ≥ ψ(p,z_k) is selected into the k best set. The algorithm terminates when the kth bound is tight.

6 Conclusion

In this paper we proposed a new parsing algorithm based on a Branch and Bound framework. The motivation is to use dynamic programming to search for the bound. Experimental results on the PTB and CTB5 datasets show that our method is competitive in terms of both performance and efficiency. Our method can be adapted to non-projective dependency parsing, as well as to k best parsing using the k best MST algorithm (Hall, 2007) to find the k best candidates.

Acknowledgments

We'd like to thank Hao Zhang, Andre Martins and Zhenghua Li for their helpful discussions. We also thank Ryan McDonald and three anonymous reviewers for their valuable comments. This work is partly supported by DARPA under Contract No. HR0011-12-C-0016 and FA8750-13-2-0041. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References

Michael Auli and Adam Lopez. 2011. A comparison of loopy belief propagation and dual decomposition for integrated CCG supertagging and parsing. In Proc. of ACL-HLT.

Egon Balas. 1965. An additive algorithm for solving linear programs with zero-one variables. Operations Research, 39(4).

Bernd Bohnet and Jonas Kuhn. 2012. The best of both worlds – a graph-based completion model for transition-based parsers. In Proc. of EACL.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proc. of EMNLP-CoNLL.

Paul Calamai and Jorge Moré. 1987. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39(1).

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. of EMNLP-CoNLL.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proc. of ACL.

Wenliang Chen, Min Zhang, and Haizhou Li. 2012. Utilizing dependency language models for graph-based dependency parsing models. In Proc. of ACL.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.

Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In Proc. of ICML.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: an exploration. In Proc. of COLING.

Keith Hall. 2007. K-best spanning tree parsing. In Proc. of ACL.

Liang Huang and David Chiang. 2005. Better k-best parsing. In Proc. of IWPT.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. of ACL-HLT.

Manfred Klenner and Étienne Ailloud. 2009. Optimization in coreference resolution is not needed: A nearly-optimal algorithm with intensional constraints. In Proc. of EACL.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proc. of ACL.
Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proc. of ACL-HLT.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proc. of EMNLP.

Ailsa H. Land and Alison G. Doig. 1960. An automatic method of solving discrete programming problems. Econometrica, 28(3):497–520.

Andre Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proc. of ACL.

Andre Martins, Noah Smith, Mario Figueiredo, and Pedro Aguiar. 2011. Dual decomposition with many overlapping components. In Proc. of EMNLP.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proc. of ACL.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.

Joakim Nivre and Ryan McDonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proc. of ACL-HLT.

Jorge Nocedal and Stephen J. Wright. 2006. Numerical Optimization. Springer, 2nd edition.

Sebastian Riedel and James Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proc. of EMNLP.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proc. of EMNLP.

David Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proc. of EMNLP.

Min Sun, Murali Telaprolu, Honglak Lee, and Silvio Savarese. 2012. Efficient and exact MAP-MRF inference using branch and bound. In Proc. of AISTATS.

Jun Suzuki, Hideki Isozaki, Xavier Carreras, and Michael Collins. 2009. An empirical study of semi-supervised structured conditional models for dependency parsing. In Proc. of EMNLP.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. of IWPT.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proc. of EMNLP.

Hao Zhang and Ryan McDonald. 2012. Generalized higher-order dependency parsing with cube pruning. In Proc. of EMNLP.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proc. of ACL-HLT.