Shift-Reduce Constituent Parsing with Neural Lookahead Features

Jiangming Liu and Yue Zhang
Singapore University of Technology and Design, 8 Somapah Road, Singapore, 487372
{jiangming_liu, yue_zhang}@sutd.edu.sg

Abstract

Transition-based models can be fast and accurate for constituent parsing. Compared with chart-based models, they leverage richer features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. On the other hand, during incremental parsing, constituent information on the right-hand side of the current word is not utilized, which is a relative weakness of shift-reduce parsing. To address this limitation, we leverage a fast neural model to extract lookahead features. In particular, we build a bidirectional LSTM model, which leverages full sentence information to predict the hierarchy of constituents that each word starts and ends. The results are then passed to a strong transition-based constituent parser as lookahead features. The resulting parser gives 1.3% absolute improvement on WSJ and 2.3% on CTB compared to the baseline, giving the highest reported accuracies for fully-supervised parsing.

1 Introduction

Transition-based constituent parsers are fast and accurate, performing incremental parsing using a sequence of state transitions in linear time. Pioneering models rely on a classifier to make local decisions, searching greedily for local transitions to build a parse tree (Sagae and Lavie, 2005). Zhu et al. (2013) use a beam search framework, which preserves the linear time complexity of greedy search, while alleviating the disadvantage of error propagation. The model gives state-of-the-art accuracies at a speed of 89 sentences per second on the standard WSJ benchmark (Marcus et al., 1993).

Zhu et al. (2013) exploit rich features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. However, due to the incremental nature of shift-reduce parsing, the right-hand-side constituents of the current word cannot be used to guide the action at each step. Such lookahead features (Tsuruoka et al., 2011) correspond to the outside scores in chart parsing (Goodman, 1998), which have been effective for obtaining improved accuracies.

To leverage such information for improving shift-reduce parsing, we propose a novel neural model to predict the constituent hierarchy related to each word before parsing. Our idea is inspired by the work of Roark and Hollingshead (2009) and Zhang et al. (2010b), which shows that shallow syntactic information gathered over the word sequence can be utilized for pruning chart parsers, improving chart parsing speed without sacrificing accuracies. For example, Roark and Hollingshead (2009) predict constituent boundary information on words as a preprocessing step, and use such information to prune the chart. Since such information is much lighter-weight compared to full parsing, it can be predicted relatively accurately using sequence labellers.

Different from Roark and Hollingshead (2009), we collect lookahead constituent information for shift-reduce parsing, rather than pruning information for chart parsing. Our main concern is improving accuracy rather than improving speed. Accordingly, our model should predict the constituent hierarchy for each word rather than simple boundary information.
For example, in Figure 1(a), the constituent hierarchy that the word "The" starts is "S → NP", and the constituent hierarchy that the word "table" ends is "S → VP → NP → PP → NP".

Figure 1: Example constituent hierarchies for the sentence "The past and present students like this book on the table". (a) parse tree (diagram omitted); (b) constituent hierarchies on words:

  Word       s-type        e-type
  The        [s: S NP]     [e: Ø]
  past       [s: ADJP]     [e: Ø]
  and        [s: Ø]        [e: Ø]
  present    [s: Ø]        [e: ADJP]
  students   [s: Ø]        [e: NP]
  like       [s: VP]       [e: Ø]
  this       [s: NP NP]    [e: Ø]
  book       [s: Ø]        [e: NP]
  on         [s: PP]       [e: Ø]
  the        [s: NP]       [e: Ø]
  table      [s: Ø]        [e: S VP NP PP NP]

For each word, we predict both the constituent hierarchy it starts and the constituent hierarchy it ends, using them as lookahead features.

The task is challenging. First, it is significantly more difficult compared to simple sequence labelling, since two sequences of constituent hierarchies must be predicted for each word in the input sequence. Second, for high accuracies, global features from the full sentence are necessary, since constituent hierarchies contain rich structural information. Third, to retain high speed for shift-reduce parsing, lookahead feature prediction must be executed efficiently. It is highly difficult to build such a model using manual discrete features and structured search.

Fortunately, sequential recurrent neural networks (RNNs) are remarkably effective models for encoding the full input sentence. We leverage RNNs for building our constituent hierarchy predictor. In particular, an LSTM (Hochreiter and Schmidhuber, 1997) is used to learn global features automatically from the input words. For each word, a second LSTM is then used to generate the constituent hierarchies greedily, using features from the hidden layer of the first LSTM, in the same way a neural language model decoder generates output sentences for machine translation (Bahdanau et al., 2015). The resulting model addresses all three challenges raised above. For fully-supervised learning, we learn word embeddings as part of the model parameters.

On the standard WSJ (Marcus et al., 1993) and CTB 5.1 tests (Xue et al., 2005), our parser gives 1.3 F1 and 2.3 F1 improvement, respectively, over the baseline of Zhu et al. (2013), resulting in an accuracy of 91.7 F1 for English and 85.5 F1 for Chinese, which are the best results for fully-supervised models in the literature. We release our code, based on ZPar (Zhang and Clark, 2011; Zhu et al., 2013), at https://github.com/SUTDNLP/LookAheadConparser.
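To make these prediction targets concrete, the short sketch below (an illustrative reimplementation of ours, not the released ZPar code; tree representation and helper names are hypothetical) reads the s-type and e-type hierarchies of Figure 1(b) directly off an unbinarized bracketed tree. POS-level pre-terminals are omitted, matching the hierarchies shown in the figure.

```python
# Illustrative sketch: extracting per-word start (s-type) and end (e-type)
# constituent hierarchies from an unbinarized tree, as in Figure 1(b).
from collections import defaultdict

# (label, [children]); leaves are plain word strings.
TREE = ("S", [
    ("NP", ["The", ("ADJP", ["past", "and", "present"]), "students"]),
    ("VP", ["like",
            ("NP", [("NP", ["this", "book"]),
                    ("PP", ["on", ("NP", ["the", "table"])])])]),
])

def collect(node, start, s_chains, e_chains):
    """Return the end index of `node`'s span; record which constituents
    start/end at each word index, kept in top-down order."""
    if isinstance(node, str):                 # a word
        return start + 1
    label, children = node
    end = start
    for child in children:
        end = collect(child, end, s_chains, e_chains)
    # `label` spans words [start, end); prepend so chains stay top-down.
    s_chains[start].insert(0, label)
    e_chains[end - 1].insert(0, label)
    return end

s_chains, e_chains = defaultdict(list), defaultdict(list)
words = "The past and present students like this book on the table".split()
collect(TREE, 0, s_chains, e_chains)
for i, w in enumerate(words):
    print(w, "s:", s_chains[i] or ["NULL"], "e:", e_chains[i] or ["NULL"])
# e.g. "The"   -> s: ['S', 'NP']    e: ['NULL']
#      "table" -> s: ['NULL']       e: ['S', 'VP', 'NP', 'PP', 'NP']
```

Running this on the tree of Figure 1(a) reproduces the start and end hierarchies listed in Figure 1(b); words that neither start nor end a constituent receive the empty hierarchy.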
2 Baseline System

We adopt the parser of Zhu et al. (2013) for our baseline, which is based on the shift-reduce process of Sagae and Lavie (2005) and the beam search strategy of Zhang and Clark (2011), with global perceptron training.

Figure 2: Deduction system for the baseline shift-reduce parsing process.

  Initial state:  [φ, 0, false, 0]
  Final state:    [S, n, true, m], with 2n ≤ m ≤ 4n
  Induction rules:
    SHIFT:         [S, i, false, k]       ⇒  [S|w, i+1, false, k+1]
    REDUCE-L/R-X:  [S|s1 s0, i, false, k] ⇒  [S|X, i, false, k+1]
    UNARY-X:       [S|s0, i, false, k]    ⇒  [S|X, i, false, k+1]
    FINISH:        [S, n, false, k]       ⇒  [S, n, true, k+1]
    IDLE:          [S, n, true, k]        ⇒  [S, n, true, k+1]

2.1 The Shift-Reduce System

Shift-reduce parsers process an input sentence incrementally from left to right. A stack is used to maintain partial phrase-structures, while the incoming words are ordered in a buffer. At each step, a transition action is applied to consume an input word or construct a new phrase-structure. The set of transition actions is:

• SHIFT: pop the front word off the buffer, and push it onto the stack.

• REDUCE-L/R-X: pop the top two constituents off the stack (L/R means that the head is the left or the right constituent, respectively), combine them into a new constituent with label X, and push the new constituent onto the stack.

• UNARY-X: pop the top constituent off the stack, raise it to a new constituent X, and push the new constituent onto the stack.

• FINISH: pop the root node off the stack and end parsing.

• IDLE: a no-effect action on a completed state, which changes no items on the stack or buffer; it is used to ensure that items in the same beam have the same number of actions (Zhu et al., 2013).

The deduction system for the process is shown in Figure 2, where a state is represented as [stack, buffer front index, completion mark, action index], and n is the number of words in the input. For example, given the sentence "They like apples", the action sequence "SHIFT, SHIFT, SHIFT, REDUCE-L-VP, REDUCE-R-S" gives its syntax "(S They (VP like apples))".

2.2 Search and Training

Beam search is used for decoding, with the k best state items kept in the agenda at each step. During initialization, the agenda contains only the initial state [φ, 0, false, 0]. At each step, each state in the agenda is popped and expanded by applying all valid transition actions, and the top k resulting states are put back onto the agenda (Zhu et al., 2013). The process repeats until the agenda is empty, and the best completed state is taken as the output.

The score of a state is the total score of the transition actions that have been applied to build it:

    C(\alpha) = \sum_{i=1}^{N} \Phi(\alpha_i) \cdot \vec{\theta}    (1)

Here \Phi(\alpha_i) represents the feature vector for the ith action \alpha_i in the state item \alpha, and N is the total number of actions in \alpha.

The model parameter vector \vec{\theta} is trained online using the averaged perceptron algorithm with the early-update strategy (Collins and Roark, 2004).

2.3 Baseline Features

Our baseline features are taken from Zhu et al. (2013). As shown in Table 1, they include the UNIGRAM, BIGRAM and TRIGRAM features of Zhang and Clark (2009) and the extended features of Zhu et al. (2013).

Table 1: Baseline feature templates, where s_i represents the ith item on the top of the stack and q_i denotes the ith item at the front of the buffer. The symbol w denotes the lexical head of an item; c denotes the constituent label of an item; t is the POS of a lexical head; u denotes a unary child; s_i ll denotes the left child of s_i's left child.

  UNIGRAM:  s0tc, s0wc, s1tc, s1wc, s2tc, s2wc, s3tc, s3wc, q0wt, q1wt, q2wt, q3wt,
            s0lwc, s0rwc, s0uwc, s1lwc, s1rwc, s1uwc
  BIGRAM:   s0ws1w, s0ws1c, s0cs1w, s0cs1c, s0wq0w, s0wq0t, s0cq0w, s0cq0t,
            q0wq1w, q0wq1t, q0tq1w, q0tq1t, s1wq0w, s1wq0t, s1cq0w, s1cq0t
  TRIGRAM:  s0cs1cs2c, s0ws1cs2c, s0cs1wq0t, s0cs1cs2w, s0cs1cq0t, s0ws1cq0t,
            s0cs1wq0t, s0cs1cq0w
  Extended: s0llwc, s0lrwc, s0luwc, s0rlwc, s0rrwc, s0ruwc, s0ulwc, s0urwc, s0uuwc,
            s1llwc, s1lrwc, s1luwc, s1rlwc, s1rrwc, s1ruwc
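Returning to the transition system of Section 2.1, the following minimal sketch (ours; it ignores head direction when printing, and omits scoring, beam search and IDLE padding) replays the example derivation for "They like apples", with a final FINISH action appended.

```python
# A minimal sketch of the Section 2.1 transition system on a toy derivation.
def parse(words, actions):
    stack, buffer = [], list(words)
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))            # move the next word onto the stack
        elif act.startswith("REDUCE"):             # REDUCE-L/R-X: combine the top two items
            label = act.split("-", 2)[2]
            right = stack.pop(); left = stack.pop()
            stack.append(f"({label} {left} {right})")
        elif act.startswith("UNARY"):              # UNARY-X: relabel the top item
            label = act.split("-", 1)[1]
            stack.append(f"({label} {stack.pop()})")
        elif act == "FINISH":
            break                                  # IDLE actions would be no-ops here
    return stack[0]

print(parse("They like apples".split(),
            ["SHIFT", "SHIFT", "SHIFT", "REDUCE-L-VP", "REDUCE-R-S", "FINISH"]))
# -> (S They (VP like apples))
```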
Table 2: Lookahead feature templates, where s_i represents the ith item on the top of the stack and q_i denotes the ith item at the front of the buffer. The symbols gs and ge denote the next-level constituent in the s-type hierarchy and the e-type hierarchy, respectively.

  s0gs, s0ge, s1gs, s1ge
  q0gs, q0ge, q1gs, q1ge

3 Global Lookahead Features

The baseline features suffer from two limitations, as mentioned in the introduction. First, they are relatively local to the state, considering only the neighbouring nodes of s0 (the top of the stack) and q0 (the front of the buffer). Second, they do not consider lookahead information beyond s3, or the syntactic structure of the buffer and sequence. We use an LSTM to capture full sentential information in linear time, representing such global information as a constituent hierarchy for each word, which is fed into the baseline parser. Lookahead features are extracted from the constituent hierarchy to provide top-down guidance for bottom-up parsing.

3.1 Constituent Hierarchy

In a constituency tree, each word can start or end a constituent hierarchy. As shown in Figure 1, the word "The" starts a constituent hierarchy "S → NP": it starts a constituent S at the top level, dominating a constituent NP. The word "table" ends a constituent hierarchy "S → VP → NP → PP → NP": it ends a constituent hierarchy with a constituent S at the top level, dominating a VP (starting from the word "like"), then an NP (starting from the noun phrase "this book"), then a PP (starting from the word "on"), and finally an NP (starting from the word "the"). The extraction of constituent hierarchies for each word is based on unbinarized grammars, reflecting the unbinarized trees that the word starts or ends. The constituent hierarchy is empty (denoted as φ) if the corresponding word does not start or end a constituent. The constituent hierarchies are added into the shift-reduce parser as soft features (Section 3.2).

Formally, a constituent hierarchy is defined as

    [type : c_1 → c_2 → ... → c_z],

where c is a constituent label (e.g., NP), "→" represents the top-down hierarchy, and type can be s or e, denoting that the current word starts or ends the constituent hierarchy, respectively, as shown in Figure 1.

Compared with full parsing, the constituent hierarchies associated with each word have no forced structural dependencies between each other, and can therefore be modelled more easily, for each word individually. Being soft lookahead features rather than hard constraints, inter-dependencies are not crucial for the main parser.

3.2 Lookahead Features

The lookahead feature templates are defined in Table 2. In order to ensure parsing efficiency, only simple feature templates are taken into consideration. The lookahead features of a state are instantiated for the top two items on the stack (i.e., s0 and s1) and on the buffer (i.e., q0 and q1). A new function Φ′ is defined to output the lookahead feature vector. The scoring of a state in our model is based on Formula (1), but with a new term Φ′(α_i) · \vec{θ}′:

    C'(\alpha) = \sum_{i=1}^{N} \Phi(\alpha_i) \cdot \vec{\theta} + \Phi'(\alpha_i) \cdot \vec{\theta}'

For each word, the lookahead feature represents the next-level constituent in the top-down hierarchy, which can guide bottom-up parsing. For example, Figure 3 shows two intermediate states during parsing. In Figure 3(a), the s-type and e-type lookahead features of s1 (i.e., the word "The") are extracted from the bottom level of its constituent hierarchies, namely NP and NULL, respectively.
On the other hand, in Figure 3(b), the s-type lookahead feature of s1 is extracted from the s-type constituent hierarchy of the same word "The", but at the current hierarchical level it is S. The e-type lookahead feature, in turn, is extracted from the e-type constituent hierarchy of "students", the end word of the VP constituent, and is NULL at the next level. Lookahead features for items on the buffer are extracted in the same way.

Figure 3: Two intermediate states for parsing the sentence "The past and present students like this book on the table" (diagram omitted). Each item on the stack or buffer has two constituent hierarchies, s-type (left) and e-type (right), in the corresponding box. Note that the e-type constituent hierarchy of the word "students" is incorrectly predicted, yet is used as soft constraints (i.e., features) in our model.

The lookahead features are useful for guiding shift-reduce decisions given the current state. For example, given the intermediate state in Figure 3(a), s0 has an s-type lookahead feature ADJP, and q1 in the buffer has an e-type lookahead feature ADJP. This indicates that the two items are likely to be reduced into the same constituent. Further, s0 cannot end a constituent because of its empty e-type constituent hierarchy. As a result, the final shift-reduce parser will assign a higher score to the SHIFT decision.

4 Constituent Hierarchy Prediction

We propose a novel neural model for constituent hierarchy prediction. Inspired by the encoder-decoder framework for neural machine translation (Bahdanau et al., 2015; Cho et al., 2014), we use one LSTM to capture full-sentence features, and another LSTM to generate the constituent hierarchies for each word. Compared with a CRF-based sequence labelling model (Roark and Hollingshead, 2009), the proposed model has three advantages. First, global features can be represented automatically. Second, it avoids the exponentially large number of labels that would arise if constituent hierarchies were treated as unique labels. Third, the model size is relatively small, and does not have a large effect on the final parser model.

As shown in Figure 4, the neural network consists of three main layers, namely the input layer, the encoder layer and the decoder layer. The input layer represents each word using its characters and token information; the encoder hidden layer uses a bidirectional recurrent neural network structure to learn global features from the sentence; and the decoder layer predicts constituent hierarchies according to the encoder layer features, using the attention mechanism (Bahdanau et al., 2015) to compute the contribution of each hidden unit of the encoder.

4.1 Input Layer

The input layer generates a dense vector representation of each input word. We use character embeddings to alleviate OOV problems of word embeddings (Ballesteros et al., 2015; Santos and Zadrozny, 2014; Kim et al., 2016), concatenating the character embeddings of a word with its word embedding.
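The formal definition is given in the next paragraphs; as a preview, here is a minimal NumPy sketch of this character-plus-word input representation, where the additive scorer standing in for the unspecified non-linear function f, the toy dimensions, and all variable names are our own assumptions rather than the paper's exact choices.

```python
import numpy as np

def char_attention_input(word_vec, char_window_vecs, W, v):
    """Word representation = [word embedding ; attention-pooled character windows].
    W and v parameterize a simple additive scorer standing in for the
    unspecified non-linear function f (our assumption)."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([word_vec, c]))
                       for c in char_window_vecs])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # attention weights over character windows
    c_att = (alpha[:, None] * char_window_vecs).sum(axis=0)
    return np.concatenate([word_vec, c_att])  # x_i = [x_{w_i} ; c_i^{att}]

# Toy dimensions (word vector 50, character window vectors 30):
rng = np.random.default_rng(0)
word_vec = rng.normal(size=50)
char_windows = rng.normal(size=(4, 30))       # one vector per character window
W = rng.normal(size=(64, 80)); v = rng.normal(size=64)
print(char_attention_input(word_vec, char_windows, W, v).shape)   # (80,)
```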
Formally, the input representation x_i of the word w_i is computed by:

    x_i = [x_{w_i} ; c_i^{att}],
    c_i^{att} = \sum_j \alpha_{ij} c'_{ij},

where x_{w_i} is the word embedding vector of w_i according to an embedding lookup table, c_i^{att} is a character-based representation of w_i, c_{ij} is the embedding of the jth character in w_i, c'_{ij} is the character window representation centered at c_{ij}, and \alpha_{ij} is the contribution of c'_{ij} to c_i^{att}, which is computed by:

    \alpha_{ij} = \frac{e^{f(x_{w_i}, c'_{ij})}}{\sum_k e^{f(x_{w_i}, c'_{ik})}},

where f is a non-linear transformation function.

Figure 4: Structure of the constituent hierarchy prediction model (diagram omitted). \overrightarrow{h}_i denotes the left-to-right encoder hidden units; \overleftarrow{h}_i denotes the right-to-left encoder hidden units; s denotes the decoder hidden state vector; and y_{ij} is the jth label of the word w_i.

4.2 Encoder Layer

The encoder first uses a window strategy to represent input nodes with their corresponding local context nodes. Formally, a word window representation takes the form x'_i = [x_{i-win}; ...; x_i; ...; x_{i+win}].

Second, the encoder scans the input sentence and generates hidden units for each input word using a recurrent neural network (RNN), which represents features of the word in the context of the global sequence. Formally, given the windowed input nodes x'_1, x'_2, ..., x'_n for the sentence w_1, w_2, ..., w_n, the RNN layer calculates a hidden node sequence h_1, h_2, ..., h_n.

Long Short-Term Memory (LSTM) mitigates the vanishing gradient problem in RNN training by introducing gates (i.e., input i, forget f and output o) and a cell memory vector c. We use the variation of Graves and Schmidhuber (2008). Formally, the values in the LSTM hidden layers are computed as follows:

    i_i = \sigma(W_1 x'_i + W_2 h_{i-1} + W_3 \odot c_{i-1} + b_1)
    f_i = 1 - i_i
    \tilde{c}_i = \tanh(W_4 x'_i + W_5 h_{i-1} + b_2)
    c_i = f_i \odot c_{i-1} + i_i \odot \tilde{c}_i
    o_i = \sigma(W_6 x'_i + W_7 h_{i-1} + W_8 \odot c_i + b_3)
    h_i = o_i \odot \tanh(c_i),

where \odot is pair-wise (element-wise) multiplication. Further, in order to collect features for x_i from both x'_1, ..., x'_{i-1} and x'_{i+1}, ..., x'_n, we use a bidirectional variation (Schuster and Paliwal, 1997; Graves et al., 2013). As shown in Figure 4, the hidden units are generated by concatenating the corresponding hidden layers of a left-to-right LSTM \overrightarrow{h}_i and a right-to-left LSTM \overleftarrow{h}_i, where \overleftrightarrow{h}_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i] for each word w_i.

4.3 Decoder Layer

The decoder hidden layer uses two different LSTMs to generate the s-type and e-type sequences of constituent labels from each encoder hidden output, respectively, as shown in Figure 4. Each constituent hierarchy is generated bottom-up recurrently. In particular, a sequence of state vectors is generated recurrently, with each state yielding an output constituent label. The process starts with a \vec{0} state vector and ends when a NULL constituent is generated. The recurrent state transition process is achieved using an LSTM model, with the hidden vectors of the encoder layer being used as context features.
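Schematically, the greedy bottom-up generation just described can be organized as the loop below; lstm_step and predict_label are placeholders of ours for the state update and attention-based softmax that are formalized next.

```python
# Control flow of the greedy, bottom-up label generation (names are ours).
def generate_hierarchy(enc_states, i, lstm_step, predict_label, max_depth=32):
    """Generate the constituent hierarchy (bottom-up) for word i.
    enc_states are the bidirectional encoder vectors of the whole sentence."""
    state = None                      # corresponds to the initial all-zero state vector
    labels = []
    for _ in range(max_depth):        # safety cap; generation normally stops at NULL
        state = lstm_step(state, enc_states, i)
        label = predict_label(state)  # softmax over constituent labels plus NULL
        if label == "NULL":
            break
        labels.append(label)
    return labels                     # e.g. ['NP', 'S'] for "The", bottom-up

# A dummy run with stub components, just to show the interface:
demo = generate_hierarchy(enc_states=None, i=0,
                          lstm_step=lambda s, enc, i: (0 if s is None else s + 1),
                          predict_label=lambda s: ["NP", "S", "NULL"][s])
print(demo)   # ['NP', 'S']
```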
Formally, for word w_i, the value of the jth state unit s_{ij} of the LSTM is computed by:

    s_{ij} = f(s_{ij-1}, a_{ij}, \overleftrightarrow{h}_i),

where the context a_{ij} is computed by:

    a_{ij} = \sum_k \beta_{ijk} \overleftrightarrow{h}_k,
    \beta_{ijk} = \frac{e^{f(s_{ij-1}, \overleftrightarrow{h}_k)}}{\sum_{k'} e^{f(s_{ij-1}, \overleftrightarrow{h}_{k'})}}.

Here \overleftrightarrow{h}_k refers to the encoder hidden vector for w_k. The contribution weights \beta_{ijk} are computed using the attention mechanism (Bahdanau et al., 2015). Different from typical MT models (Bahdanau et al., 2015), the chain is predicted sequentially in a feed-forward way, with no feedback of the predictions made; we found that this fast alternative gives similar results.

The constituent labels are generated from each state unit s_{ij}, where each constituent label y_{ij} is the output of a SOFTMAX function:

    p(y_{ij} = l) = \frac{e^{s_{ij}^{\top} W_l}}{\sum_k e^{s_{ij}^{\top} W_k}}.

Here y_{ij} = l denotes that the jth label of the ith word is l (l ∈ L). As shown in Figure 4, the SOFTMAX functions are applied to the state units of the decoder, generating hierarchical labels bottom-up, until the default label NULL is predicted.

4.4 Training

We use two separate models to assign the s-type and e-type labels, respectively. For training each constituent hierarchy predictor, we minimize the following training objective:

    L(\theta) = - \sum_{i}^{T} \sum_{j}^{Z_i} \log p_{ijo} + \frac{\lambda}{2} ||\theta||^2,

where T is the length of the sentence, Z_i is the depth of the constituent hierarchy of the word w_i, p_{ijo} stands for p(y_{ij} = o), which is given by the SOFTMAX function, and o is the gold label. We apply back-propagation, using momentum stochastic gradient descent (Sutskever et al., 2013) with a learning rate of η = 0.01 for optimization and a regularization parameter λ = 10^{-6}.

5 Experiments

5.1 Experiment Settings

Our English data are taken from the Wall Street Journal (WSJ) sections of the Penn Treebank (Marcus et al., 1993). We use sections 2–21 for training, section 24 for system development, and section 23 for final performance evaluation. Our Chinese data are taken from version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We use articles 001–270 and 440–1151 for training, articles 301–325 for system development, and articles 271–300 for final performance evaluation. For both English and Chinese data, we adopt ZPar (https://github.com/SUTDNLP/ZPar) for POS tagging, and use ten-fold jackknifing to assign POS tags automatically to the training data. In addition, we use ten-fold jackknifing to assign constituent hierarchies automatically to the training data, for training the parser with the constituent hierarchy predictor.

Table 3: Hyper-parameter settings.

  Word embedding size          50
  Word window size             2
  Character embedding size     30
  Character window size        2
  LSTM hidden layer size       100
  Character hidden layer size  60

Table 4: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. n-layer denotes an LSTM model with n hidden layers.

            s-type   e-type   parser
  1-layer   93.39    81.50    90.43
  2-layer   93.76    83.37    90.72
  3-layer   93.84    83.42    90.80

We use F1 scores to evaluate constituent hierarchy prediction. For example, if the prediction is "S → S → VP → NP" and the gold hierarchy is "S → NP → NP", the evaluation process matches the two hierarchies bottom-up. The precision is 2/4 = 0.5, the recall is 2/3 ≈ 0.67, and the F1 score is 0.57. A label is counted as correct if and only if it occurs at the correct position.
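Our reading of this evaluation, as a small self-contained check for a single word's hierarchy (the exact corpus-level aggregation is not spelled out here, so the function below only scores one prediction):

```python
# Hierarchies are aligned bottom-up; a label counts only at the correct position.
def hierarchy_prf(pred_topdown, gold_topdown):
    pred = list(reversed(pred_topdown))     # compare bottom-up
    gold = list(reversed(gold_topdown))
    correct = sum(p == g for p, g in zip(pred, gold))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

p, r, f = hierarchy_prf(["S", "S", "VP", "NP"], ["S", "NP", "NP"])
print(round(p, 2), round(r, 2), round(f, 2))   # 0.5 0.67 0.57
```

This reproduces the worked example above: two labels match at the correct bottom-up positions, giving precision 2/4, recall 2/3 and F1 of about 0.57.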
We use EVALB (http://nlp.cs.nyu.edu/evalb) to evaluate parsing performance, including labelled precision (LP), labelled recall (LR), and bracketing F1.

5.2 Model Settings

For training the constituent hierarchy prediction model, gold constituent labels are derived from labelled constituency trees in the training data. The hyper-parameters are chosen according to development tests, and their values are shown in Table 3.

For the shift-reduce constituency parser, we set the beam size to 16 for both training and decoding, which achieves a good tradeoff between efficiency and accuracy (Zhu et al., 2013). The optimal number of training iterations is determined on the development sets.

5.3 Results of Constituent Hierarchy Prediction

Table 4 shows the results of constituent hierarchy prediction, where word and character embeddings are randomly initialized and fine-tuned during training. The third column shows the development parsing accuracies when the labels are used for lookahead features. As Table 4 shows, when the number of hidden layers increases, both s-type and e-type constituent hierarchy prediction improve. The accuracy of e-type prediction is relatively lower due to right-branching in the treebank, which makes e-type hierarchies longer than s-type hierarchies. In addition, a 3-layer LSTM does not give significant improvements compared to a 2-layer LSTM. For a better tradeoff between efficiency and accuracy, we choose the 2-layer LSTM as our constituent hierarchy predictor.

Table 5: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. all denotes the proposed model without ablation; wins denotes input windows; chars denotes character-based attention.

                         s-type   e-type   parser
  all                    93.76    83.37    90.72
  all w/o wins           93.62    83.34    90.58
  all w/o chars          93.51    83.21    90.33
  all w/o chars & wins   93.12    82.36    89.18

Table 5 shows ablation results for constituent hierarchy prediction given by reduced architectures: an architecture without input word windows, an architecture without character embeddings, and an architecture with neither character embeddings nor input windows. The full architecture achieves the highest performance on constituent hierarchy prediction. Removing only the character embeddings has a relatively small influence on constituent hierarchy prediction, and removing only the input word windows has an even smaller influence. Nevertheless, both of these ablations lead to lower parsing accuracies, and the architecture removing both the character embeddings and the input word windows has a relatively low F-score.

Table 6: Comparison of related work on the WSJ test set. * denotes neural parsing; † denotes methods using a shift-reduce framework.

  Parser                           LR     LP     F1
  Fully-supervised
    Ratnaparkhi (1997)             86.3   87.5   86.9
    Charniak (2000)                89.5   89.9   89.5
    Collins (2003)                 88.1   88.3   88.2
    Sagae and Lavie (2005)†        86.1   86.0   86.0
    Sagae and Lavie (2006)†        87.8   88.1   87.9
    Petrov and Klein (2007)        90.1   90.2   90.1
    Carreras et al. (2008)         90.7   91.4   91.1
    Shindo et al. (2012)           N/A    N/A    91.1
    Zhu et al. (2013)†             90.2   90.7   90.4
    Socher et al. (2013)*          N/A    N/A    90.4
    Vinyals et al. (2015)*         N/A    N/A    88.3
    Cross and Huang (2016)*†       N/A    N/A    91.3
    Dyer et al. (2016)*†           N/A    N/A    91.2
    This work                      91.3   92.1   91.7
  Ensemble
    Shindo et al. (2012)           N/A    N/A    92.4
    Vinyals et al. (2015)*         N/A    N/A    90.5
  Rerank
    Charniak and Johnson (2005)    91.2   91.8   91.5
    Huang (2008)                   92.2   91.2   91.7
    Dyer et al. (2016)*†           N/A    N/A    93.3
  Semi-supervised
    McClosky et al. (2006)         92.1   92.5   92.3
    Huang and Harper (2009)        91.1   91.6   91.3
    Huang et al. (2010)            91.4   91.8   91.6
    Zhu et al. (2013)†             91.1   91.5   91.3
    Durrett and Klein (2015)*      N/A    N/A    91.1
5.4 Final Results

For English, we compare the final results with previous related work on the WSJ test set. As shown in Table 6 (we treat methods as semi-supervised if they use pre-trained word embeddings, word clusters such as Brown clusters, or extra resources), our model achieves a 1.3% F1 improvement over the baseline parser with fully-supervised learning (Zhu et al., 2013). Our model outperforms the best previously reported fully-supervised systems (Carreras et al., 2008; Shindo et al., 2012) by 0.6% F1. In addition, our fully-supervised model also catches up with many state-of-the-art semi-supervised models (Zhu et al., 2013; Huang and Harper, 2009; Huang et al., 2010; Durrett and Klein, 2015), achieving 91.7% F1 on the WSJ test set. The size of our model is much smaller than that of the semi-supervised model of Zhu et al. (2013), which contains rich features from a large automatically parsed corpus. In contrast, our model is about the same size as the baseline parser.

Table 7: Comparison of related work on the CTB 5.1 test set. * denotes neural parsing; † denotes methods using a shift-reduce framework; ‡ denotes joint POS tagging and parsing.

  Parser                           LR     LP     F1
  Fully-supervised
    Charniak (2000)                79.6   82.1   80.8
    Bikel (2004)                   79.3   82.0   80.6
    Petrov and Klein (2007)        81.9   84.8   83.3
    Zhu et al. (2013)†             82.1   84.3   83.2
    Wang et al. (2015)‡            N/A    N/A    83.2
    Dyer et al. (2016)*†           N/A    N/A    84.6
    This work                      85.2   85.9   85.5
  Rerank
    Charniak and Johnson (2005)    80.8   83.8   82.3
    Dyer et al. (2016)*†           N/A    N/A    86.9
  Semi-supervised
    Zhu et al. (2013)†             84.4   86.8   85.6
    Wang and Xue (2014)‡           N/A    N/A    86.3
    Wang et al. (2015)‡            N/A    N/A    86.6

We carry out Chinese experiments with the same models, and compare the final results with previous related work on the CTB test set. As shown in Table 7, our model achieves a 2.3% F1 improvement over the state-of-the-art baseline system with fully-supervised learning (Zhu et al., 2013), which is by far the best fully-supervised result in the literature. In addition, our fully-supervised model is also comparable to many state-of-the-art semi-supervised models (Zhu et al., 2013; Wang and Xue, 2014; Wang et al., 2015; Dyer et al., 2016), achieving 85.5% F1 on the CTB test set. Wang and Xue (2014) and Wang et al. (2015) do joint POS tagging and parsing.

5.5 Comparison of Speed

Table 8 shows the running times of various parsers on the test sets, on an Intel 2.2 GHz processor with 16 GB of memory. Our parser is much faster than the related parsers using the same shift-reduce framework (Sagae and Lavie, 2005; Sagae and Lavie, 2006). Compared to the baseline parser, our parser gives a significant improvement in accuracy (90.4% to 91.7% F1) at a speed of 79.2 sentences per second, in contrast to 89.5 sentences per second for the baseline on the standard WSJ benchmark. The constituent hierarchy prediction step is excluded from this figure; it processes an average of 150 sentences per second on a single CPU. Its cost is far less than the cost of parsing, and can be essentially eliminated by pipelining the constituent hierarchy prediction and the shift-reduce decoding: the predictor is launched first, and parsing starts in parallel as soon as the lookahead output is available for the first sentence, since the lookahead will outpace the parsing from that point forward.

Table 8: Comparison of running times on the test set, where the time for loading models is excluded. The running times of related parsers are taken from Zhu et al. (2013).

  Parser                     #Sent/Second
  Ratnaparkhi (1997)         Unk
  Collins (2003)             3.5
  Charniak (2000)            5.7
  Sagae and Lavie (2005)     3.7
  Sagae and Lavie (2006)     2.2
  Petrov and Klein (2007)    6.2
  Carreras et al. (2008)     Unk
  Zhu et al. (2013)          89.5
  This work                  79.2
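The pipelining described above can be sketched as a simple producer-consumer setup; the stage functions, queue size and thread layout below are our own illustrative choices, not part of the released system.

```python
# Schematic of the pipelining: the faster hierarchy predictor feeds the parser
# through a queue, so its cost largely overlaps with parsing.
import queue, threading

def pipeline(sentences, predict_lookahead, parse_with_lookahead):
    q, results = queue.Queue(maxsize=64), []

    def producer():
        for sent in sentences:
            q.put((sent, predict_lookahead(sent)))   # ~150 sentences/second
        q.put(None)                                  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not None:
        sent, lookahead = item
        results.append(parse_with_lookahead(sent, lookahead))   # ~79 sentences/second
    return results

# Toy run with trivial stand-ins for the two stages:
out = pipeline(["They like apples", "The students like this book"],
               predict_lookahead=lambda s: [("s", [])] * len(s.split()),
               parse_with_lookahead=lambda s, la: (s, len(la)))
print(out)   # [('They like apples', 3), ('The students like this book', 5)]
```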
6 Error Analysis

We conduct error analysis by measuring parsing accuracies against different phrase types, constituents of different span lengths, and different sentence lengths.

6.1 Phrase Type

Table 9 shows the accuracies of the baseline parser and the final parser with lookahead features on nine common phrase types. As the results show, while the parser with lookahead features achieves improvements on all of the frequent phrase types, the improvements are relatively higher on VP, S, SBAR and WHNP. The constituent hierarchy predictor has relatively better performance on s-type labels for the constituents VP, WHNP and PP, which are prone to errors by the baseline system; the constituent hierarchy can give guidance to the constituent parser for tackling these cases. Compared to the s-type constituent hierarchy, the e-type constituent hierarchy is relatively more difficult to predict, particularly for constituents with long spans such as VP, S and SBAR. Despite this, the e-type constituent hierarchies, even with relatively low accuracies, also benefit the prediction of constituents with long spans.

Table 9: Comparison between the parsers with and without lookahead features on different phrase types, together with the corresponding constituent hierarchy predictor performance.

                            NP      VP      S       PP      SBAR    ADVP    ADJP    WHNP    QP
  baseline                  92.06   90.63   90.28   87.93   86.93   84.83   74.12   95.03   89.32
  with lookahead features   93.10   92.45   91.78   88.84   88.59   85.64   74.50   96.18   89.63
  improvement               +1.04   +1.82   +1.50   +0.91   +1.66   +0.81   +0.38   +1.15   +0.31
  constituent hierarchy
    s-type                  95.18   97.51   93.37   98.01   92.14   88.94   79.88   96.18   91.70
    e-type                  91.98   76.82   80.72   84.80   66.82   85.01   71.16   95.13   91.02

6.2 Span Length

Figure 5 shows the F1 scores of the two parsers on constituents with different span lengths. As the results show, lookahead features are helpful on both large spans and small spans, and the performance gap between the two parsers grows as the size of the span increases. This reflects the usefulness of the long-range information captured by the constituent hierarchy predictor and the lookahead features.

Figure 5: Comparison with the baseline on spans of different lengths (plot omitted; F1 score as a function of span length, for the baseline and the lookahead parser).

6.3 Sentence Length

Figure 6 shows the F1 scores of the two parsers on sentences of different lengths. As the results show, the parser with lookahead features outperforms the baseline system on both short sentences and long sentences. Also, the performance gap between the two parsers grows as the length of the sentence increases.

Figure 6: Comparison with the baseline on sentences of different lengths (plot omitted). Sentences with length [0, 10) fall in the bin 10.

The constituent hierarchy predictors generate hierarchical constituents for each input word using global information. For longer sentences, the predictors yield deeper constituent hierarchies, offering corresponding lookahead features. As a result, compared to the baseline parser, the performance of the parser with lookahead features decreases more slowly as the length of the sentences increases.

7 Related Work

Our lookahead features are similar in spirit to the pruners of Roark and Hollingshead (2009) and Zhang et al. (2010b), which infer the maximum length of the constituents that a particular word can start or end. However, our method is different in three main ways. First, rather than using a CRF with sparse local word window features, a neural network is used for dense global features over the sentence.
Second, not only the size of constituents but also the constituent hierarchy is identified for each word. Third, the results are added into a transition-based parser as soft features, rather than being used as hard constraints in a chart parser.

Our concept of constituent hierarchies is similar to supertags in the sense that both are shallow parses. For lexicalized grammars such as Combinatory Categorial Grammar (CCG), Tree-Adjoining Grammar (TAG) and Head-Driven Phrase Structure Grammar (HPSG), each word in the input sentence is assigned one or more supertags, which are used to identify the syntactic role of the word and to constrain parsing (Clark, 2002; Clark and Curran, 2004; Carreras et al., 2008; Ninomiya et al., 2006; Dridan et al., 2008; Faleńska et al., 2015). For a lexicalized grammar, supertagging can benefit parsing in both accuracy and efficiency by offering almost-parsing information. In particular, Carreras et al. (2008) used the concept of the spine for TAG (Schabes, 1992; Vijay-Shanker and Joshi, 1988), which is similar to our constituent hierarchy. However, there are three differences. First, the spine is defined to describe the main syntactic tree structure with a series of unary projections, while the constituent hierarchy is defined to describe how words can start or end hierarchical constituents (it can be empty if the word cannot start or end constituents). Second, spines are extracted from gold trees and used to prune the search space of parsing as hard constraints; in contrast, we use constituent hierarchies as soft features. Third, Carreras et al. (2008) use spines to prune chart parsing, while we use constituent hierarchies to improve a linear shift-reduce parser.

For lexicalized grammars, supertags can benefit parsing significantly, since they contain rich syntactic information that amounts to almost parsing (Bangalore and Joshi, 1999). Recently, there has been a line of work on better supertagging. Zhang et al. (2010a) proposed efficient methods to obtain supertags for HPSG parsing using dependency information. Xu et al. (2015) and Vaswani et al. (2016) leverage recurrent neural networks for supertagging for CCG parsing. In contrast, our models predict a constituent hierarchy, instead of a single supertag, for each word in the input sentence.

Our constituent hierarchy predictor is also related to sequence-to-sequence learning (Sutskever et al., 2014), which has been successfully used in neural machine translation (Bahdanau et al., 2015). The neural model encodes the source-side sentence into dense vectors, and then uses them to generate the target side word by word. There has also been work that directly applies sequence-to-sequence models to constituent parsing, generating constituent trees from raw sentences (Vinyals et al., 2015; Luong et al., 2015). Compared to Vinyals et al.
(2015), who predict a full parse tree from the input, our predictors tackle a much simpler task, predicting the constituent hierarchies of each word separately. In addition, the outputs of the predictors are used as soft lookahead features in bottom-up parsing, rather than being taken as output structures directly.

By integrating a neural constituent hierarchy predictor, our parser is related to neural network models for parsing, which have given competitive accuracies for both constituency parsing (Dyer et al., 2016; Cross and Huang, 2016; Watanabe and Sumita, 2015) and dependency parsing (Chen and Manning, 2014; Zhou et al., 2015; Dyer et al., 2015). In particular, our parser is more closely related to neural models that integrate discrete manual features (Socher et al., 2013; Durrett and Klein, 2015). Socher et al. (2013) use neural features to rerank a sparse baseline parser; Durrett and Klein (2015) directly integrate sparse features into the neural layers of a chart parser. In contrast, we integrate neural information into sparse features in the form of lookahead features.

There has also been work on lookahead features for parsing. Tsuruoka et al. (2011) run a baseline parser for a few future steps, and use the output actions to guide the current action. In contrast to their model, our model leverages full sentential information, yet is significantly faster.

Previous work has investigated more efficient parsing without loss of accuracy, which is required by real-time applications such as web parsing. Zhang et al. (2010b) introduced a chart pruner to accelerate a CCG parser. Kummerfeld et al. (2010) proposed a self-training method focusing on increasing the speed of a CCG parser rather than its accuracy.

8 Conclusion

We proposed a novel constituent hierarchy predictor based on recurrent neural networks, aiming to capture global sentential information. The resulting constituent hierarchies are fed to a baseline shift-reduce parser as lookahead features, addressing the limitation of shift-reduce parsers in not leveraging right-hand-side syntax for local decisions, while maintaining the same model size and speed. The resulting fully-supervised parser outperforms the state-of-the-art baseline parser, achieving 91.7% F1 on the standard WSJ evaluation and 85.5% F1 on the standard CTB evaluation.

Acknowledgments

We thank the anonymous reviewers for their detailed and constructive comments, and the co-Editor-in-Chief Lillian Lee for her extremely detailed copy editing. This work is supported by T2MOE 201301 of the Singapore Ministry of Education. Yue Zhang is the corresponding author.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP, pages 349–359.

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265, June.

Daniel M. Bikel. 2004. On the parameter space of generative lexicalized statistical parsing models. PhD thesis, University of Pennsylvania.

Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL, pages 9–16, Morristown, NJ, USA. Association for Computational Linguistics.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In ANLP, pages 132–139.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Stephen Clark and James R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In COLING, pages 282–288, Morristown, NJ, USA, August. University of Edinburgh, Association for Computational Linguistics.

Stephen Clark. 2002. Supertagging for combinatory categorial grammar. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks, pages 101–106, Universita di Venezia.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL, Morristown, NJ, USA. Association for Computational Linguistics.

Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.

James Cross and Liang Huang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In EMNLP.

Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised grammars. In ACL.

Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In ACL, pages 302–312.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In ACL-IJCNLP, pages 334–343.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL, pages 199–209.

Agnieszka Faleńska, Anders Björkelund, Özlem Çetinoğlu, and Wolfgang Seeker. 2015. Stacking or supertagging for dependency parsing – what's the difference? In Proceedings of the 14th International Conference on Parsing Technologies.

Joshua Goodman. 1998. Parsing inside-out. PhD thesis, Harvard University.

Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545–552.

Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 273–278. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November.

Zhongqiang Huang and Mary P. Harper. 2009. Self-training PCFG grammars with latent annotations across languages. In EMNLP, pages 832–841.

Zhongqiang Huang, Mary P. Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In EMNLP, pages 12–22.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586–594.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI.

Jonathan K. Kummerfeld, Jessika Roesner, Tim Dawborn, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adaptation. In ACL, pages 345–355. University of Cambridge, Association for Computational Linguistics, July.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. In ICLR.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In HLT-NAACL, pages 152–159, Morristown, NJ, USA. Association for Computational Linguistics.

Takashi Ninomiya, Takuya Matsuzaki, Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Tsujii. 2006. Extremely lexicalized models for accurate and fast HPSG parsing. In EMNLP, pages 155–163. University of Manchester, Association for Computational Linguistics, July.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, pages 404–411.

Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In EMNLP.

Brian Roark and Kristy Hollingshead. 2009. Linear complexity context-free parsing pipelines via chart constraints. In HLT-NAACL, pages 647–655.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear runtime complexity. In IWPT, pages 125–132, Morristown, NJ, USA. Association for Computational Linguistics.

Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In HLT-NAACL, pages 129–132, Morristown, NJ, USA. Association for Computational Linguistics.

Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In ICML, pages 1818–1826.

Yves Schabes. 1992. Stochastic tree-adjoining grammars. In Proceedings of the Workshop on Speech and Natural Language, pages 140–145. Association for Computational Linguistics.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In ACL, pages 440–448.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL, pages 455–465.

Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. 2013. On the importance of initialization and momentum in deep learning. In ICML, pages 1139–1147.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Kazama. 2011. Learning with lookahead: Can history-based models rival globally optimized models? In CoNLL, pages 238–246.

Ashish Vaswani, Yonatan Bisk, and Kenji Sagae. 2016. Supertagging with LSTMs. In NAACL.

K. Vijay-Shanker and Aravind K. Joshi. 1988. A study of tree adjoining grammars. Citeseer.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Grammar as a foreign language. In NIPS, pages 2773–2781.

Zhiguo Wang and Nianwen Xue. 2014. Joint POS tagging and transition-based constituent parsing in Chinese with non-local features. In ACL, pages 733–742, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zhiguo Wang, Haitao Mi, and Nianwen Xue. 2015. Feature optimization for constituent parsing via neural networks. In ACL-IJCNLP, pages 1138–1147, Stroudsburg, PA, USA. Association for Computational Linguistics.

Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In ACL, pages 1169–1179.

Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. In ACL-IJCNLP, pages 250–255, Stroudsburg, PA, USA. Association for Computational Linguistics.

Naiwen Xue, Fei Xia, Fu-dong Chiou, and Martha Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In IWPT, pages 162–171, Morristown, NJ, USA. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.

Yaozhong Zhang, Takuya Matsuzaki, and Jun'ichi Tsujii. 2010a. A simple approach for HPSG supertagging using dependency information. In NAACL-HLT, pages 645–648. University of Manchester, Association for Computational Linguistics, June.

Yue Zhang, Byung-Gyu Ahn, Stephen Clark, Curt Van Wyk, James R. Curran, and Laura Rimell. 2010b. Chart pruning for fast lexicalised-grammar parsing. In COLING, pages 1471–1479.

Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In ACL, pages 1213–1222.

Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In ACL, pages 434–443.