Optimizing Statistical Machine Translation for Text Simplification

Wei Xu (1), Courtney Napoles (2), Ellie Pavlick (1), Quanze Chen (1) and Chris Callison-Burch (1)
(1) Computer and Information Science Department, University of Pennsylvania
{xwe, epavlick, cquanze, ccb}@seas.upenn.edu
(2) Department of Computer Science, Johns Hopkins University
courtneyn@jhu.edu

Abstract

Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.

1 Introduction

The goal of text simplification is to rewrite an input text so that the output is more readable. Text simplification has applications for reducing input complexity for natural language processing (Siddharthan et al., 2004; Miwa et al., 2010; Chen et al., 2012b) and providing reading aids for people with limited language skills (Petersen and Ostendorf, 2007; Watanabe et al., 2009; Allen, 2009; De Belder and Moens, 2010; Siddharthan and Katsos, 2010) or language impairments such as dyslexia (Rello et al., 2013), autism (Evans et al., 2014), and aphasia (Carroll et al., 1999).

It is widely accepted that sentence simplification can be implemented by three major types of operations: splitting, deletion and paraphrasing (Feng, 2008). The splitting operation decomposes a long sentence into a sequence of shorter sentences. Deletion removes less important parts of a sentence. The paraphrasing operation includes reordering, lexical substitutions and syntactic transformations. While sentence splitting (Siddharthan, 2006; Petersen and Ostendorf, 2007; Narayan and Gardent, 2014; Angrosh et al., 2014) and deletion (Knight and Marcu, 2002; Clarke and Lapata, 2006; Filippova and Strube, 2008; Filippova et al., 2015; Rush et al., 2015; and others) have been intensively studied, there has been considerably less research on developing new paraphrasing models for text simplification: most previous work has used off-the-shelf statistical machine translation (SMT) technology and achieved reasonable results (Coster and Kauchak, 2011a,b; Wubben et al., 2012; Štajner et al., 2015). However, these systems have either treated the MT technology as a black box (Coster and Kauchak, 2011a,b; Narayan and Gardent, 2014; Angrosh et al., 2014; Štajner et al., 2015), or they have been limited to modifying only one aspect of it, such as the translation model (Zhu et al., 2010; Woodsend and Lapata, 2011) or the reranking component (Wubben et al., 2012).

In this paper, we present a complete adaptation of a syntax-based machine translation framework to perform simplification. Our methodology poses text simplification as a paraphrasing problem: given an input text, rewrite it subject to the constraints that the output should be simpler than the input, while preserving as much meaning of the input as possible and maintaining the well-formedness of the text.
Going beyond previous work, we make direct modifications to four key components in the SMT pipeline: [1] 1) two novel simplification-specific tunable metrics; 2) large-scale paraphrase rules automatically derived from bilingual parallel corpora, which are more naturally and abundantly available than manually simplified texts; 3) rich rule-level simplification features; and 4) multiple reference simplifications collected via crowdsourcing for tuning and evaluation. In particular, we report the first study that shows promising correlations of automatic metrics with human evaluation. Our work answers the call made in a recent TACL paper (Xu et al., 2015) to address problems in current simplification research: we amend human evaluation criteria, develop automatic metrics, and generate an improved multiple reference dataset.

[1] Our code and data are made available at: https://github.com/cocoxu/simplification/

Our work is primarily focused on lexical simplification (rewriting words or phrases with simpler versions), and to a lesser extent on syntactic rewrite rules that simplify the input. It largely ignores the important subtasks of sentence splitting and deletion. Our focus on lexical simplification does not affect the generality of the presented work, since deletion or sentence splitting could be applied as pre- or post-processing steps.

2 Background

Xu et al. (2015) laid out a series of problems that are present in current text simplification research, and argued that we should deviate from the previous state-of-the-art benchmarking setup.

First, the Simple English Wikipedia data has dominated simplification research since 2010 (Zhu et al., 2010; Siddharthan, 2014), and is used together with Standard English Wikipedia to create parallel text to train MT-based simplification systems. However, recent studies (Xu et al., 2015; Amancio and Specia, 2014; Hwang et al., 2015; Štajner et al., 2015) showed that the parallel Wikipedia simplification corpus contains a large proportion of inadequate (not much simpler) or inaccurate (not aligned or only partially aligned) simplifications. This is one of the leading reasons that existing simplification systems struggle to generate simplifying paraphrases and leave the input sentences unchanged (Wubben et al., 2012). Previously, researchers attempted some quick fixes by adding phrasal deletion rules (Coster and Kauchak, 2011a) or reranking n-best outputs based on their dissimilarity to the input (Wubben et al., 2012). In contrast, we exploit data with improved quality and enlarged quantity, namely, large-scale paraphrase rules automatically derived from bilingual corpora and a small amount of manual simplification data with multiple references for tuning parameters. We then systematically design new tuning metrics and rich simplification-specific features into a syntactic machine translation model to enforce optimization towards simplicity. This approach achieves better simplification performance without relying on a manually simplified corpus to learn paraphrase rules, which is important given the fact that Simple Wikipedia and the newly released Newsela simplification corpus (Xu et al., 2015) are only available for English.
Second, the evaluation methodology previously used in the simplification literature is uninformative and not comparable across models, due to the complications that arise from mixing the three different operations of paraphrasing, deletion, and splitting. This, combined with the unreliable quality of Simple Wikipedia as a gold reference for evaluation, has been the bottleneck for developing automatic metrics. There exist only a few studies (Wubben et al., 2012; Štajner et al., 2014) on automatic simplification evaluation using existing MT metrics, and these show limited correlation with human assessments. In this paper, we restrict ourselves to lexical simplification, where we believe MT-derived evaluation metrics can best be deployed. Our newly proposed metric is the first automatic metric that shows reasonable correlation with human evaluation on the text simplification task. We also introduce multiple references to make automatic evaluation feasible.

The most related work to ours is that of Ganitkevitch et al. (2013) on sentence compression, in which compression of word and sentence lengths can be more straightforwardly implemented in features and the objective function in the SMT framework. We want to stress that sentence simplification is not a simple extension of sentence compression, but a much more complicated task, primarily because high-quality data is much harder to obtain and the solution space is more constrained by word choice and grammar. Our work is also related to other tunable metrics designed to be very simple and light-weight to ensure fast repeated computation for tuning bilingual translation models (Liu et al., 2010; Chen et al., 2012a). To the best of our knowledge, no tunable metric has been attempted for simplification, except for BLEU. Nor do any evaluation metrics exist for simplification, although several have been designed for other text-to-text generation tasks: grammatical error correction (Napoles et al., 2015; Felice and Briscoe, 2015; Dahlmeier and Ng, 2012), paraphrase generation (Chen and Dolan, 2011; Xu et al., 2012; Sun and Zhou, 2012), and conversation generation (Galley et al., 2015). Another line of related work is lexical simplification, which focuses on finding simpler synonyms of a given complex word (Yatskar et al., 2010; Biran et al., 2011; Specia et al., 2012; Horn et al., 2014).

3 Adapting Machine Translation for Simplification

We adapt the machinery of statistical machine translation to the task of text simplification by making changes in the following four key components:

3.1 Simplification-specific Objective Functions

In the statistical machine translation framework, one crucial element is designing automatic evaluation metrics to be used as training objectives. Training algorithms, such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), then directly optimize the model parameters such that the end-to-end simplification quality is optimal. Unfortunately, previous work on text simplification has only used BLEU for tuning, which is insufficient as we show empirically in Section 4. We instead propose two new light-weight metrics: FKBLEU, which explicitly measures readability, and SARI, which measures it implicitly by comparing against the input and references. Unlike machine translation metrics, which do not compare against the (foreign) input sentence, it is necessary to compare simplification system outputs against the inputs to assess readability changes.
It is also important to keep tunable metrics as simple as possible, since they are repeatedly computed during the tuning process for hundreds of thousands of candidate outputs.

FKBLEU

Our first metric combines a previously proposed metric for paraphrase generation, iBLEU (Sun and Zhou, 2012), and the widely used readability metric, the Flesch-Kincaid Index (Kincaid et al., 1975). iBLEU is an extension of the BLEU metric that measures diversity as well as adequacy of the generated paraphrase output. Given a candidate sentence O, human references R and input text I, iBLEU is defined as:

  iBLEU = \alpha \times \mathrm{BLEU}(O, R) - (1 - \alpha) \times \mathrm{BLEU}(O, I)    (1)

where \alpha is a parameter balancing adequacy and dissimilarity, set to 0.9 empirically as suggested by Sun and Zhou (2012).

Since the text simplification task aims at improving readability, we include the Flesch-Kincaid Index (FK), which estimates the readability of text using cognitively motivated features (Kincaid et al., 1975):

  FK = 0.39 \times \left( \frac{\#\mathrm{words}}{\#\mathrm{sentences}} \right) + 11.8 \times \left( \frac{\#\mathrm{syllables}}{\#\mathrm{words}} \right) - 15.59    (2)

with a lower value indicating higher readability. [2] We adapt FK to score individual sentences and change it so that it counts punctuation tokens as words, and counts each punctuation token as one syllable. This prevents it from arbitrarily deleting punctuation. FK measures readability assuming that the text is well-formed, and is therefore insufficient alone as a metric for generating or evaluating automatically generated sentences. Combining FK and iBLEU captures both a measure of readability and adequacy. The resulting objective function, FKBLEU, is defined as a geometric mean of the iBLEU and the FK difference between input and output sentences:

  FKBLEU = iBLEU(I, R, O) \times FKdiff(I, O)
  FKdiff = \mathrm{sigmoid}(FK(O) - FK(I))    (3)

Sentences with higher FKBLEU values are better simplifications with higher readability.

[2] The FK coefficients were derived via multiple regression applied to the reading comprehension test scores of 531 Navy personnel reading training manuals. These values are typically used unmodified, as we do here.
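To make the definition concrete, here is a minimal sketch of FKBLEU in Python. It is illustrative only: we assume NLTK's smoothed sentence-level BLEU in place of the BLEU implementation used in the paper, and the syllable counter is a crude vowel-cluster heuristic rather than a real syllabifier.

```python
import math
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def count_syllables(token):
    # Crude heuristic: count vowel clusters; a punctuation token counts
    # as one syllable, as the adapted FK in Eq. (2) requires.
    if not any(c.isalnum() for c in token):
        return 1
    return max(1, len(re.findall(r"[aeiouy]+", token.lower())))

def fk(tokens):
    # Flesch-Kincaid for a single sentence (#sentences = 1), with
    # punctuation tokens counted as words (Eq. 2).
    n_words = len(tokens)
    n_syllables = sum(count_syllables(t) for t in tokens)
    return 0.39 * n_words + 11.8 * (n_syllables / n_words) - 15.59

def fkbleu(inp, refs, out, alpha=0.9):
    # inp/out: token lists; refs: list of token lists.
    smooth = SmoothingFunction().method1
    ibleu = (alpha * sentence_bleu(refs, out, smoothing_function=smooth)
             - (1 - alpha) * sentence_bleu([inp], out, smoothing_function=smooth))
    fkdiff = 1.0 / (1.0 + math.exp(-(fk(out) - fk(inp))))  # sigmoid, Eq. (3)
    return ibleu * fkdiff
```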
SARI

We design a second new metric, SARI, that principally compares system output against references and against the input sentence. It explicitly measures the goodness of words that are added, deleted and kept by the systems (Figure 1).

We reward addition operations where an n-gram in the system output O was not in the input I but occurred in any of the references R, i.e. O \cap \bar{I} \cap R. We define n-gram precision p(n) and recall r(n) for addition operations as follows: [3]

  p_{add}(n) = \frac{\sum_{g \in O} \min(\#_g(O \cap \bar{I}),\, \#_g(R))}{\sum_{g \in O} \#_g(O \cap \bar{I})}
  r_{add}(n) = \frac{\sum_{g \in O} \min(\#_g(O \cap \bar{I}),\, \#_g(R))}{\sum_{g \in O} \#_g(R \cap \bar{I})}    (4)

where \#_g(\cdot) is a binary indicator of occurrence of n-gram g in a given set (and is a fractional indicator in some later formulas), and

  \#_g(O \cap \bar{I}) = \max(\#_g(O) - \#_g(I),\, 0)
  \#_g(R \cap \bar{I}) = \max(\#_g(R) - \#_g(I),\, 0)

[3] In the rare case when the denominator is 0 in calculating precision p or recall r, we simply set the value of p and r to 0.

Therefore, in the example below, the addition of the unigram "now" is rewarded in both p_add(n) and r_add(n), while the addition of "you" in OUTPUT-1 is penalized in p_add(n):

INPUT: About 95 species are currently accepted .
REF-1: About 95 species are currently known .
REF-2: About 95 species are now accepted .
REF-3: 95 species are now accepted .
OUTPUT-1: About 95 you now get in .
OUTPUT-2: About 95 species are now agreed .
OUTPUT-3: About 95 species are currently agreed .

The corresponding SARI scores of these three toy outputs are 0.2683, 0.7594 and 0.5890, which match intuitions about their quality. To put this in perspective, the BLEU scores are 0.1562, 0.6435 and 0.6435, respectively. BLEU fails to distinguish between OUTPUT-2 and OUTPUT-3 because matching any one of the references is credited the same. Not all the references are necessarily complete simplifications; e.g., REF-1 doesn't simplify the word "currently", which gives BLEU too much latitude for matching the input.

[Figure 1: Venn diagram of the input, the system output, and the human references. Its regions (input unchanged by the system but absent from the references; input retained in the references but deleted by the system; the overlap of all three; input correctly deleted by the system and replaced by content from the references; potentially incorrect system output) are treated differently by our SARI metric. Metrics that evaluate the output of monolingual text-to-text generation systems can compare system output against references and against the input sentence, unlike MT metrics, which do not compare against the (foreign) input sentence.]

Words that are retained in both the system output and the references should be rewarded. When multiple references are used, the number of references in which an n-gram was retained matters. This accounts for the fact that some words/phrases are considered simple and need not (though they still may) be simplified. We use R' to mark the n-gram counts over R weighted by fractions; e.g., if a unigram (the word "about" in the example above) occurs in 2 out of the total r references, then its count is weighted by 2/r in the computation of precision and recall:

  p_{keep}(n) = \frac{\sum_{g \in I} \min(\#_g(I \cap O),\, \#_g(I \cap R'))}{\sum_{g \in I} \#_g(I \cap O)}
  r_{keep}(n) = \frac{\sum_{g \in I} \min(\#_g(I \cap O),\, \#_g(I \cap R'))}{\sum_{g \in I} \#_g(I \cap R')}    (5)

where

  \#_g(I \cap O) = \min(\#_g(I),\, \#_g(O))
  \#_g(I \cap R') = \min(\#_g(I),\, \#_g(R)/r)

For deletion, we only use precision, because over-deleting hurts readability much more significantly than not deleting:

  p_{del}(n) = \frac{\sum_{g \in I} \min(\#_g(I \cap \bar{O}),\, \#_g(I \cap \bar{R'}))}{\sum_{g \in I} \#_g(I \cap \bar{O})}    (6)

where

  \#_g(I \cap \bar{O}) = \max(\#_g(I) - \#_g(O),\, 0)
  \#_g(I \cap \bar{R'}) = \max(\#_g(I) - \#_g(R)/r,\, 0)

The precision of what is kept also reflects the sufficiency of deletions. The n-gram counts are weighted in R' to compensate for n-grams, such as the word "currently" in the example, that are not considered required simplifications by human editors.

Together, in SARI, we use the arithmetic average of n-gram precisions P_operation and recalls R_operation:

  SARI = d_1 F_{add} + d_2 F_{keep} + d_3 P_{del}    (7)

where d_1 = d_2 = d_3 = 1/3 and

  P_{operation} = \frac{1}{k} \sum_{n=[1,...,k]} p_{operation}(n)
  R_{operation} = \frac{1}{k} \sum_{n=[1,...,k]} r_{operation}(n)
  F_{operation} = \frac{2 \times P_{operation} \times R_{operation}}{P_{operation} + R_{operation}}
  operation \in \{del, keep, add\}

where k is the highest n-gram order, set to 4 in our experiments.
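The following Python sketch implements Eqs. (4)-(7) under one reading of the count definitions: binary n-gram indicators within I, O and each individual reference, with fractional weights over the r references for R'. The released implementation may differ in details, so treat this as illustrative.

```python
def grams(tokens, n):
    return set(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def ratio(num, den):
    return num / den if den else 0.0  # footnote 3: score is 0 when denominator is 0

def sari(inp, out, refs, k=4):
    """Sentence-level SARI sketch. inp/out: token lists; refs: list of token lists."""
    r = len(refs)
    p_add, r_add, p_keep, r_keep, p_del = [], [], [], [], []
    for n in range(1, k + 1):
        I, O = grams(inp, n), grams(out, n)
        ref_grams = [grams(ref, n) for ref in refs]
        R = set().union(*ref_grams) if ref_grams else set()
        frac = {g: sum(g in rg for rg in ref_grams) / r for g in R}  # R' weights
        added = O - I                      # candidate additions (Eq. 4)
        good = added & R
        p_add.append(ratio(len(good), len(added)))
        r_add.append(ratio(len(good), len(R - I)))
        kept = I & O                       # candidate keeps (Eq. 5)
        num = sum(frac.get(g, 0.0) for g in kept)
        p_keep.append(ratio(num, len(kept)))
        r_keep.append(ratio(num, sum(frac.get(g, 0.0) for g in I)))
        deleted = I - O                    # deletions, precision only (Eq. 6)
        num = sum(1.0 - frac.get(g, 0.0) for g in deleted)
        p_del.append(ratio(num, len(deleted)))
    avg = lambda v: sum(v) / len(v)
    return (f1(avg(p_add), avg(r_add))      # F_add
            + f1(avg(p_keep), avg(r_keep))  # F_keep
            + avg(p_del)) / 3.0             # P_del, combined as in Eq. (7)
```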
3.2 Incorporating Large-Scale Paraphrase Rules

Another challenge for text simplification is generating an ample set of rewrite rules that potentially simplify an input sentence. Most early work relied on either hand-crafted rules (Chandrasekar et al., 1996; Carroll et al., 1999; Siddharthan, 2006; Vickrey and Koller, 2008) or dictionaries like WordNet (Devlin et al., 1999; Kaji et al., 2002; Inui et al., 2003). Other more recent studies have relied on the parallel Normal-Simple Wikipedia corpus to automatically extract rewrite rules (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011b; Wubben et al., 2012; Narayan and Gardent, 2014; Siddharthan and Angrosh, 2014; Angrosh et al., 2014). This technique does manage to learn a small number of transformations that simplify. However, we argue that because the Normal-Simple Wikipedia parallel corpus is quite small (108k sentence pairs with 2 million words), the diversity and coverage of patterns that can be learned is actually quite limited.

In this paper we leverage the large-scale Paraphrase Database (PPDB) [4] (Ganitkevitch et al., 2013; Pavlick et al., 2015) as a rich source of lexical, phrasal and syntactic simplification operations. It is created by extracting English paraphrases from bilingual parallel corpora using a technique called "bilingual pivoting" (Bannard and Callison-Burch, 2005). The PPDB is represented as a synchronous context-free grammar (SCFG), which is commonly used as the formalism for syntax-based machine translation (Zollmann and Venugopal, 2006; Chiang, 2007; Weese et al., 2011). Table 1 shows some example paraphrase rules in the PPDB.

[4] http://paraphrase.org

Table 1:
             [RB] solely → only
Lexical      [NN] objective → goal
             [JJ] undue → unnecessary
             [VP] accomplished → carried out
Phrasal      [VP/PP] make a significant contribution → contribute greatly
             [VP/S] is generally acknowledged that → is widely accepted that
             [NP/VP] the manner in which NN → the way NN
Syntactic    [NP] NNP 's population → the people of NNP
             [NP] NNP 's JJ legislation → the JJ law of NNP

Table 1: Example paraphrase rules in the Paraphrase Database (PPDB) that result in simplifications of the input. The rules are synchronous context-free grammar (SCFG) rules where uppercase indicates non-terminal symbols. Non-terminals can be complex symbols like VP/S, which indicates that the rule forms a verb phrase (VP) missing a sentence (S) to its right. The final syntactic rule both simplifies and reorders the input phrase.

PPDB is extracted from 1000 times more data (106 million sentence pairs with 2 billion words) than the Normal-Simple Wikipedia parallel corpus. The English portion of PPDB contains over 220 million paraphrase rules, consisting of 8 million lexical, 73 million phrasal and 140 million syntactic paraphrase patterns. The key difference between the paraphrase rules from PPDB and the transformations learned by the naive application of SMT to the Normal-Simple Wikipedia parallel corpus is that the PPDB paraphrases are much more diverse. For example, PPDB contains 214 paraphrases for "ancient", including "antique", "ancestral", "old", "age-old", "archeological", "former", "antiquated", "longstanding", "archaic", "centuries-old", and so on. However, there is nothing inherent in the rule extraction process to say which of the PPDB paraphrases are simplifications. In this paper, we model the task by incorporating rich features into each rule and letting advances in SMT decoding and optimization determine how well a rule simplifies an input phrase. An alternative way of using PPDB for simplification would be to simply discard any of its rules that do not result in a simplified output, possibly using a simple supervised classifier (Pavlick and Callison-Burch, 2016).
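As a concrete illustration of how such rules feed candidate generation, the sketch below reads lexical and phrasal rules and enumerates single-substitution candidates. It assumes the "|||"-delimited PPDB release format (LHS ||| source ||| target ||| features ||| alignment); a real system applies the full SCFG inside the decoder's chart, with a language model, rather than enumerating candidates this way.

```python
from collections import defaultdict

def load_rules(path):
    """Read a PPDB-style grammar file into {source phrase: [target phrases]}.
    Assumes '|||'-delimited fields; syntactic rules with non-terminal gaps
    in the source side are skipped in this sketch."""
    rules = defaultdict(list)
    with open(path, encoding="utf8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            lhs, source, target = fields[0], fields[1], fields[2]
            if "[" not in source:  # keep terminal-only source sides
                rules[source].append(target)
    return rules

def one_step_candidates(tokens, rules, max_len=4):
    """Yield every sentence reachable by applying a single paraphrase rule."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            source = " ".join(tokens[i:j])
            for target in rules.get(source, []):
                yield tokens[:i] + target.split() + tokens[j:]
```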
3.3 Simplification-specific Features for Paraphrase Rules

Designing good features is an essential aspect of modeling. For each input sentence i and its candidate output sentence j, a vector of feature functions \vec{\varphi} = \{\varphi_1, ..., \varphi_N\} is combined with a weight vector \vec{w} in a linear model to obtain a single score h_{\vec{w}}:

  h_{\vec{w}}(i, j) = \vec{w} \cdot \vec{\varphi}(i, j)    (8)

In SMT, typical feature functions are phrase translation probabilities, word-for-word lexical translation probabilities, a rule application penalty (which governs whether the system prefers fewer longer phrases or a greater number of shorter phrases), and a language model probability. Together, these features are what the model uses to distinguish between good and bad translations. For monolingual translation tasks, previous research suggests that features like paraphrase probability and distributional similarity are potentially helpful for picking out good paraphrases (Chan et al., 2011) and for text-to-text generation (Ganitkevitch et al., 2012b). While these two features quantify how good a paraphrase rule is in general, they do not indicate how good the rule is for a specific task, like simplification.

For each paraphrase rule, we use all 33 features that were distributed with PPDB 1.0 and add 9 new features for simplification purposes: [5] length in characters, length in words, number of syllables, language model scores, and fraction of common English words in each rule. These features are computed for both sides of a paraphrase pattern, for the word with the maximum number of syllables on each side, and for the difference between the two sides, where applicable. We use language models built from the Gigaword corpus and the Simple Wikipedia corpus collected by Kauchak (2013). We also use a list of the 3000 most common US English words compiled by Paul and Bernice Noll. [6]

[5] We release the data with details for each feature.
[6] http://www.manythings.org/vocabulary/lists/l/noll-about.php
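The sketch below illustrates rule-level simplification features of this flavor and the linear score of Eq. (8), reusing count_syllables from the FKBLEU sketch. The feature names, the handling of the common-word list, and the weights are illustrative assumptions; the exact feature set is documented in the released data (footnote 5).

```python
def rule_features(source, target, common_words, syllables=count_syllables):
    """Simplification-oriented features for one paraphrase rule, computed on
    the target side and as differences between sides (Section 3.3 flavor)."""
    src, tgt = source.split(), target.split()
    feats = {
        "tgt_chars": len(target),
        "tgt_words": len(tgt),
        "tgt_syllables": sum(syllables(w) for w in tgt),
        "tgt_max_syllables": max(syllables(w) for w in tgt),
        "tgt_common_frac": sum(w in common_words for w in tgt) / len(tgt),
    }
    feats["diff_chars"] = len(target) - len(source)
    feats["diff_words"] = len(tgt) - len(src)
    feats["diff_syllables"] = feats["tgt_syllables"] - sum(syllables(w) for w in src)
    feats["diff_common_frac"] = (feats["tgt_common_frac"]
                                 - sum(w in common_words for w in src) / len(src))
    return feats

def score(feats, weights):
    # Linear model of Eq. (8): h_w = w . phi
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())
```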
3.4 Creating Multiple References

As with machine translation, where there are many equally good translations, in simplification there may be several ways of simplifying a sentence. Most previous work on text simplification uses only a single reference simplification, often from the Simple Wikipedia. This is undesirable, since the Simple Wikipedia contains a large proportion of inadequate or inaccurate simplifications (Xu et al., 2015).

In this study, we collect multiple human reference simplifications that focus on simplification by paraphrasing rather than deletion or splitting. We first selected the Simple-Normal sentence pairs of similar length (≤ 20% difference in number of tokens) from the Parallel Wikipedia Simplification (PWKP) corpus (Zhu et al., 2010), which are more likely to be paraphrase-only simplifications. We then asked 8 workers on Amazon Mechanical Turk to rewrite a selected sentence from Normal Wikipedia (a subset of PWKP) into a simpler version, preserving its meaning without losing any information or splitting the sentence. We removed bad workers by manually inspecting their first several submissions, on the basis of a recent study on crowdsourcing translation (Gao et al., 2015) which suggests that Turkers' performance stays consistent over time and can be reliably predicted from their first few translations. In total, we collected 8 reference simplifications for 2350 sentences, and randomly split them into 2000 sentences for tuning and 350 for evaluation. Many crowdsourcing workers were able to provide simplifications of good quality and diversity (see Table 2 for an example and Table 4 for the manual quality evaluation). Having multiple references allows us to develop automatic metrics, similar to BLEU, that take advantage of the variation across many people's simplifications. We leave more in-depth investigation of crowdsourcing simplification (Pellow and Eskenazi, 2014a,b) for future work.

3.5 Tuning Parameters

As in statistical machine translation, we set the weights of the linear model \vec{w} in Equation (8) so that the system's output is optimized with respect to the automatic evaluation metric on the 2000-sentence development set. We use the pairwise ranking optimization (PRO) algorithm (Hopkins and May, 2011) implemented in the open-source Joshua toolkit (Ganitkevitch et al., 2012a; Post et al., 2013) for tuning. Specifically, we train the system to distinguish a good candidate output j from a bad candidate j', as measured by an objective function o (Section 3.1), for an input sentence i:

  o(i, j) > o(i, j')
  \iff h_{\vec{w}}(i, j) > h_{\vec{w}}(i, j')
  \iff h_{\vec{w}}(i, j) - h_{\vec{w}}(i, j') > 0
  \iff \vec{w} \cdot \vec{\varphi}(i, j) - \vec{w} \cdot \vec{\varphi}(i, j') > 0
  \iff \vec{w} \cdot (\vec{\varphi}(i, j) - \vec{\varphi}(i, j')) > 0    (9)

Thus, the optimization reduces to a binary classification problem. Each training instance is the difference vector \vec{\varphi}(i, j) - \vec{\varphi}(i, j') of a pair of candidates, and its training label is positive or negative depending on whether the value of o(i, j) - o(i, j') is positive or negative. The candidates are generated according to h_{\vec{w}} at each iteration, and sampled to make training tractable. We use different metrics as objectives: BLEU, FKBLEU and SARI.
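A minimal sketch of one PRO-style update as described above: sample candidate pairs, form difference vectors, and fit a binary classifier whose weights become the new \vec{w}. A simple perceptron stands in for the off-the-shelf classifier used in Joshua, and the sampling scheme is simplified relative to Hopkins and May (2011).

```python
import random

def pro_update(candidates, objective, weights, n_pairs=100, lr=0.1):
    """candidates: list of (feature_dict, candidate_tokens) for one input.
    objective: function scoring candidate tokens (e.g. SARI against refs).
    Returns weights updated by perceptron steps on difference vectors (Eq. 9)."""
    scored = [(objective(c), f) for f, c in candidates]
    for _ in range(n_pairs):
        (o1, f1), (o2, f2) = random.sample(scored, 2)
        if o1 == o2:
            continue
        # Difference vector phi(i,j) - phi(i,j'); label = sign(o(i,j) - o(i,j')).
        diff = {k: f1.get(k, 0.0) - f2.get(k, 0.0) for k in set(f1) | set(f2)}
        label = 1.0 if o1 > o2 else -1.0
        margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())
        if label * margin <= 0:  # misclassified pair: perceptron update
            for k, v in diff.items():
                weights[k] = weights.get(k, 0.0) + lr * label * v
    return weights
```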
4 Experiments and Analyses

We implemented all the proposed adaptations in the open-source syntactic machine translation decoder Joshua (Post et al., 2013), [7] and conducted the experiments with PPDB and the dataset of 2350 sentences collected in Section 3.4. Most recent end-to-end sentence simplification systems use a basic phrase-based MT model trained on parallel Wikipedia data using the Moses decoder (Štajner et al., 2015, and others). One of the best such systems is PBMT-R by Wubben et al. (2012), which reranks Moses' n-best outputs based on their dissimilarity to the input to promote simplification. We also build a baseline by using BLEU as the tuning metric in our adapted MT framework. We conduct both human and automatic evaluation to demonstrate the advantages of the proposed simplification systems, and we show the effectiveness of the two new metrics in tuning and automatic evaluation.

[7] http://joshua-decoder.org/ We augment its latest version to include the text-to-text generation functionality described in this paper.

4.1 Qualitative Analysis

Table 2 shows a representative example of the simplification results.

Table 2:
Normal Wikipedia: Jeddah is the principal gateway to Mecca, Islam's holiest city, which able-bodied Muslims are required to visit at least once in their lifetime.
Simple Wikipedia: Jeddah is the main gateway to Mecca, the holiest city of Islam, where able-bodied Muslims must go to at least once in a lifetime.
Mechanical Turk #1: Jeddah is the main entrance to Mecca, the holiest city in Islam, which all healthy Muslims need to visit at least once in their life.
Mechanical Turk #2: Jeddah is the main entrance to Mecca, Islam's holiest city, which pure Muslims are required to visit at least once in their lifetime.
PBMT-R (Wubben et al., 2012): Jeddah is the main gateway to Mecca, Islam's holiest city, which able-bodied Muslims are required of Muslims at least once in their lifetime.
SBMT (PPDB + BLEU): Jeddah is the main door to Mecca, Islam's holiest city, which sound Muslims are to go to at least once in their life.
SBMT (PPDB + FKBLEU): Jeddah is the main gateway to Mecca, Islam's holiest city, which sound Muslims must visit at least once in their life.
SBMT (PPDB + SARI): Jeddah is the main gateway to Mecca, Islam's holiest city, which sound Muslims have to visit at least once in their life.

Table 2: Example human reference simplifications and automatic simplification system outputs. The bold font highlights the parts of the sentence that are different from the original version in the Normal Wikipedia, and strikethrough denotes deletions.

The PBMT-R model failed to learn any good substitutions for the word "able-bodied" or the phrase "are required to" from the manually simplified corpora of limited size. In contrast, our proposed method can make use of the more abundant paraphrases learned from bilingual texts. This improves the method's applicability to languages other than English, for which no simpler version of Wikipedia is available.

Our proposed approach also provides an intuitive way to inspect the ranking of candidate paraphrases in the translation model: each rule in PPDB is scored by Equation (8) using the weights optimized in the tuning process, as in Table 3.

Table 3:
Paraphrase Rule              Trans. Model Score
principal → key              4.515
principal → main             4.514
principal → major            4.358
principal → chief            3.205
principal → core             3.025
principal → principal        2.885
principal → top              2.600
principal → senior           2.480
principal → lead             2.377
principal → primary          2.171
principal → prime            1.432
principal → keynote          -0.795
able-bodied → valid          6.435
able-bodied → sound          5.838
able-bodied → healthy        4.446
able-bodied → able-bodied    3.372
able-bodied → job-ready      1.611
able-bodied → employable     -0.363
able-bodied → non-disabled   -2.207

Table 3: Qualitative analysis of candidate paraphrases ranked by the translation model in SBMT (PPDB + SARI), showing that the model is optimized towards simplicity in addition to the correctness of paraphrases. The final simplifications (in bold) are chosen in conjunction with the language model to fit the context and to further bias towards more common n-grams.

This shows that our proposed method is capable of capturing the notion of simplicity using a small amount of parallel tuning data. It correctly ranks "key" and "main" as good simplifications for "principal". Its choices are not always perfect, as it prefers "sound" over "healthy" for "able-bodied". The final simplification outputs are generated according to both the translation model and the language model trained on the Gigaword corpus, to take context into account and to further bias towards more common n-grams.
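A ranking in the spirit of Table 3 can be produced with the rule_features/score sketch from Section 3.3 and the rule table from Section 3.2; the weights shown here are hypothetical stand-ins for the values learned by PRO tuning.

```python
# Hypothetical weights; real values come from PRO tuning, not hand-setting.
weights = {"tgt_syllables": -0.8, "diff_chars": -0.2, "tgt_common_frac": 1.5}

def rank_paraphrases(source, rules, common_words, weights):
    """Sort a source phrase's paraphrase candidates by the linear model score."""
    scored = [(score(rule_features(source, t, common_words), weights), t)
              for t in rules.get(source, [])]
    return sorted(scored, reverse=True)
```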
4.2 Quantitative Evaluation of Simplification Systems

For the human evaluation, participants were shown the original English Wikipedia sentence as a reference and asked to judge a set of simplifications displayed in random order. They evaluated a simplification from each system, the Simple Wikipedia version, and a Turker simplification. Judges rated each simplification on two 5-point scales of meaning retention and grammaticality (0 is the worst and 4 is the best). We also asked participants to rate Simplicity Gain (Simplicity+) by counting how many successful lexical or syntactic paraphrases occurred in the simplification. We found that this makes the judgment easier and more informative than rating simplicity directly on a 5-point scale, since the original sentences have very different readability levels to start with. More importantly, using simplicity gain avoids over-punishing errors, which are already penalized for poor meaning retention and grammaticality, and thus reduces the bias towards very conservative models. We collected judgments on these three criteria from five different annotators and report the average scores.

Table 4 shows that our best system, a syntax-based MT system (SBMT) using PPDB as the source of paraphrase rules and tuned towards the SARI metric, achieves better performance on all three simplification measurements than the state-of-the-art system PBMT-R. The relatively small values of simplicity gain, even for the two human references (Simple Wikipedia and Mechanical Turk), clearly show the major challenge of simplification: the need to not only generate paraphrases but also ensure that the generated paraphrases are simpler while fitting the contexts. Although many researchers have noticed this difficulty, PBMT-R is one of the few systems that tried to address it, by promoting outputs that are dissimilar to the input. Our best system is able to make more effective paraphrases (better Simplicity+) while introducing fewer errors (better Grammar and Meaning).

Table 4:
                               Grammar  Meaning  Simplicity+  #tokens  #chars  Edit Dist.
Normal Wikipedia               4.00     4.00     0.00         23       125     0.00
Simple Wikipedia               3.72     3.24     1.03         22       116     6.69
Mechanical Turk                3.70     3.36     1.35         19       104     8.25
PBMT-R (Wubben et al., 2012)   3.18     2.83     0.47         20       108     5.96
SBMT (PPDB + BLEU)             4.00     4.00     0.00         23       125     0.00
SBMT (PPDB + FKBLEU)           3.30     3.05     0.48         21       107     4.03
SBMT (PPDB + SARI)             3.50     3.16     0.65         23       118     3.98

Table 4: Human evaluation (Grammar, Meaning, Simplicity+) and basic statistics of our proposed systems (SBMTs) and baselines. PBMT-R is a reimplementation of the state-of-the-art system by Wubben et al. (2012). The newly proposed metrics FKBLEU and SARI show advantages for tuning.

Table 5 shows the automatic evaluation. An encouraging fact is that the SARI metric ranks all 5 different systems and 3 human references in the same order as the human assessment. Most systems achieve FK readability similar to that of the human editors, using fewer words or words with fewer syllables. Tuning towards BLEU with all 8 references results in no transformation (output identical to the input), as this achieves a near-perfect BLEU score of 99.05 (out of 100).

Table 5:
                               FK     BLEU   iBLEU  FKBLEU  SARI
Normal Wikipedia               12.88  99.05  78.41  62.48   26.05
Simple Wikipedia               11.25  66.75  53.53  61.75   38.42
Mechanical Turk                10.80  100.0  74.31  73.60   43.71
PBMT-R (Wubben et al., 2012)   11.10  63.12  48.91  59.00   33.77
SBMT (PPDB + BLEU)             12.88  99.05  78.41  62.48   26.05
SBMT (PPDB + FKBLEU)           10.75  74.48  58.10  66.68   34.18
SBMT (PPDB + SARI)             10.90  72.36  58.15  66.57   37.91

Table 5: Automatic evaluation of different simplification systems. Most systems achieve FK readability scores similar to the human references. The SARI metric ranks all 5 different systems and 3 human references in the same order as the human assessment. Tuning towards BLEU with all 8 references results in no transformation (output identical to the Normal Wikipedia input), as this achieves a near-perfect BLEU score of 99.05 (out of 100).

Table 6 shows the computation time for the different metrics. SARI is only slightly slower than BLEU but achieves much better simplification quality.

Table 6:
          Time (milliseconds)
BLEU      0.12540908
FKBLEU    1.2527733
SARI      0.15506646

Table 6: Average computation time of different metrics per candidate sentence.
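Per-candidate timings in the spirit of Table 6 can be measured with the standard library; absolute numbers will of course vary with hardware and implementation. The snippet below assumes tokenized inp, out and refs as in the earlier sketches.

```python
import timeit

for name, fn in [("FKBLEU", lambda: fkbleu(inp, refs, out)),
                 ("SARI", lambda: sari(inp, out, refs))]:
    secs = timeit.timeit(fn, number=1000)       # total seconds for 1000 calls
    print(f"{name}: {secs / 1000 * 1e3:.4f} ms per candidate")
```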
4.3 Correlation of Automatic Metrics with Human Judgments

Table 7 shows the correlation of automatic metrics with human judgments.

Table 7:
Metric   ref.       Grammar           Meaning           Simplicity+
FK       none       -0.002 (≈ .976)   0.136 (< .010)    0.147 (< .010)
BLEU     single     0.366 (< .001)    0.459 (< .001)    0.151 (< .005)
BLEU     multiple   0.589 (< .001)    0.701 (< .001)    0.111 (< .050)
iBLEU    single     0.313 (< .001)    0.397 (< .001)    0.149 (< .005)
iBLEU    multiple   0.492 (< .001)    0.609 (< .001)    0.141 (< .010)
FKBLEU   multiple   0.349 (< .001)    0.410 (< .001)    0.235 (< .001)
SARI     multiple   0.342 (< .001)    0.397 (< .001)    0.343 (< .001)

Table 7: Spearman's ρ correlations (and two-tailed p-values) of metrics against the human ratings at sentence level (also see Figure 3). In this work, we propose using multiple (eight) references and two new metrics: FKBLEU and SARI. For all three criteria of simplification quality, SARI correlates reasonably with human judgments. In contrast, previous work used only a single reference. The existing metrics BLEU and iBLEU show higher correlations on grammaticality and meaning preservation when using multiple references, but fail to measure the most important aspect of simplification: simplicity.

There are several interesting observations. First, simplicity is essential in measuring the goodness of simplification, yet none of the existing metrics (i.e., FK, BLEU, iBLEU) demonstrates any significant correlation with the simplicity scores rated by humans, as also noted in previous work (Wubben et al., 2012; Štajner et al., 2014). In contrast, our two new metrics, FKBLEU and SARI, achieve much better correlation with human simplicity judgments while still capturing grammaticality and meaning preservation. This explains why they are more suitable than BLEU for training simplification models. In particular, SARI provides a balanced and integrative measurement of system performance that can assist iterative development. To date, developing advanced simplification systems has been a difficult and time-consuming process, since it is impractical to run a new human evaluation every time a new model is built or parameters are adjusted.

Second, the correlation of automatic metrics with human judgments of grammaticality and meaning preservation is higher than any reported before (Wubben et al., 2012; Štajner et al., 2014). This validates our argument that constraining simplification to paraphrasing alone reduces the complications introduced by deletion and splitting, and thus makes automatic evaluation more feasible. Using multiple references further improves the correlations.
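Sentence-level correlations of this kind can be computed with SciPy, whose spearmanr returns Spearman's ρ together with a two-tailed p-value; the variable names below are placeholders for per-sentence score lists.

```python
from scipy.stats import spearmanr

# metric_scores and human_ratings are parallel lists, one value per test
# sentence (e.g. SARI scores vs. averaged Simplicity+ ratings).
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.3f} (two-tailed p = {p_value:.3g})")
```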
4.4 Why Does BLEU Correlate Strongly with Meaning/Grammar, and SARI with Simplicity?

Here we look more deeply at the correlations of BLEU and SARI with human judgments. Our SARI metric has the highest correlation with human judgments of simplicity, but BLEU exhibits higher correlations on grammaticality and meaning preservation. BLEU was designed to evaluate bilingual translation systems. It measures the n-gram precision of a system's output against one or more references. BLEU ignores recall (and compensates for this with its brevity penalty), so it prefers an output that is not too short and contains only n-grams that appear in some reference. The role of multiple references in BLEU is to capture allowable variations in translation quality. When applied to monolingual tasks like simplification, BLEU does not take into account anything about the differences between the input and the references. In contrast, SARI takes both precision and recall into account, by looking at the difference between the references and the input sentence.

[Figure 2: A scatter plot of BLEU scores vs. SARI scores for the individual sentences in our test set. The metrics' scores for many sentences substantially diverge. Few of the sentences that scored perfectly in BLEU receive a high score from SARI.]

In this work, we use multiple references to capture the many different ways of simplifying the input. Unlike in bilingual translation, the more references that are created for the monolingual simplification task, the more n-grams of the original input will be included in the references. That means that, with more references, outputs that are close or identical to the input will get high BLEU. Outputs with few changes also receive high Grammar/Meaning scores from human judges; but these do not necessarily get a high SARI score, nor are they good simplifications. BLEU therefore tends to favor conservative systems that do not make many changes, while SARI penalizes them. This can be seen in Figure 2, where sentences with a BLEU score of 1.0 receive a range of scores from SARI.

[Figure 3: Scatter plots of automatic metrics against human scores for individual sentences, one panel per pairing: SARI vs. Grammar, SARI vs. Meaning, SARI vs. Simplicity, BLEU vs. Grammar, BLEU vs. Meaning, and BLEU vs. Simplicity.]

The scatter plots in Figure 3 further illustrate the above analysis. These plots emphasize the correlation of high human scores on meaning/grammar with systems that make few changes (which BLEU rewards, but SARI does not). The tradeoff is that conservative outputs with few or no changes do not result in increased simplicity. SARI correctly rewards systems that make changes that simplify the input.

5 Conclusions and Future Work

In this paper, we presented an effective adaptation of statistical machine translation techniques to text simplification. We find the approach promising, and it suggests two new directions: designing tunable metrics that correlate with human judgments, and using simplicity-enriched paraphrase rules derived from larger data than the Normal-Simple Wikipedia dataset. For future work, we think it might be possible to design a universal metric that works for multiple text-to-text generation tasks (including sentence simplification, compression and error correction), using the same idea of comparing system output against multiple references and against the input. The metric could include tunable parameters or weighted human judgments on references to accommodate different tasks. Finally, we are also interested in designing neural translation models for the simplification task.

Acknowledgments

The authors would like to thank Juri Ganitkevitch, Jonny Weese, Kristina Toutanova, Matt Post, and Shashi Narayan for valuable discussions. We also thank action editor Stefan Riezler and three anonymous reviewers for their thoughtful comments. This material is based on research sponsored by the NSF under grant IIS-1430651 and the NSF GRFP under grant 1232825. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of the NSF or the U.S. Government. This research is also supported by the Alfred P. Sloan Foundation, and by Facebook via a student fellowship and a faculty research award.
References

Allen, D. (2009). A study of the role of relative clauses in the simplification of news texts for learners of English. System, 37(4):585–599.

Amancio, M. A. and Specia, L. (2014). An analysis of crowdsourced text simplifications. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR).

Angrosh, M., Nomoto, T., and Siddharthan, A. (2014). Lexico-syntactic text simplification and compression with typed dependencies. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Bannard, C. and Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL).

Biran, O., Brody, S., and Elhadad, N. (2011). Putting it simply: A context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT).

Carroll, J., Minnen, G., Pearce, D., Canning, Y., Devlin, S., and Tait, J. (1999). Simplifying text for language-impaired readers. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Chan, T. P., Callison-Burch, C., and Van Durme, B. (2011). Reranking bilingually extracted paraphrases using monolingual distributional similarity. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (GEMS).

Chandrasekar, R., Doran, C., and Srinivas, B. (1996). Motivations and methods for text simplification. In Proceedings of the 16th Conference on Computational Linguistics (COLING).

Chen, B., Kuhn, R., and Larkin, S. (2012a). PORT: A precision-order-recall MT evaluation metric for tuning. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Chen, D. L. and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Chen, H.-B., Huang, H.-H., Chen, H.-H., and Tan, C.-T. (2012b). A simplification-translation-restoration framework for cross-domain SMT applications. In Proceedings of the 24th International Conference on Computational Linguistics (COLING).

Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Clarke, J. and Lapata, M. (2006). Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL-COLING).

Coster, W. and Kauchak, D. (2011a). Learning to simplify sentences using Wikipedia. In Proceedings of the Workshop on Monolingual Text-To-Text Generation.

Coster, W. and Kauchak, D. (2011b). Simple English Wikipedia: A new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT).

Dahlmeier, D. and Ng, H. T. (2012). Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
De Belder, J. and Moens, M.-F. (2010). Text simplification for children. In Proceedings of the SIGIR Workshop on Accessible Search Systems.

Devlin, S., Tait, J., Canning, Y., Carroll, J., Minnen, G., and Pearce, D. (1999). The application of assistive technology in facilitating the comprehension of newspaper text by aphasic people. Assistive Technology on the Threshold of the New Millennium, page 160.

Evans, R., Orasan, C., and Dornescu, I. (2014). An evaluation of syntactic simplification rules for people with autism. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR).

Felice, M. and Briscoe, T. (2015). Towards a standard evaluation method for grammatical error detection and correction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Feng, L. (2008). Text simplification: A survey. The City University of New York, Technical Report.

Filippova, K., Alfonseca, E., Colmenares, C. A., Kaiser, L., and Vinyals, O. (2015). Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Filippova, K. and Strube, M. (2008). Dependency tree based sentence compression. In Proceedings of the Fifth International Natural Language Generation Conference (INLG).

Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., Mitchell, M., Gao, J., and Dolan, B. (2015). deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL).

Ganitkevitch, J., Cao, Y., Weese, J., Post, M., and Callison-Burch, C. (2012a). Joshua 4.0: Packing, PRO, and paraphrases. In Proceedings of the Seventh Workshop on Statistical Machine Translation (WMT).

Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. (2012b). Monolingual distributional similarity for text-to-text generation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM).

Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. (2013). PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Gao, M., Xu, W., and Callison-Burch, C. (2015). Cost optimization in crowdsourcing translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Hopkins, M. and May, J. (2011). Tuning as ranking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Horn, C., Manduca, C., and Kauchak, D. (2014). Learning a lexical simplifier using Wikipedia. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL).

Hwang, W., Hajishirzi, H., Ostendorf, M., and Wu, W. (2015). Aligning sentences from Standard Wikipedia to Simple Wikipedia. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Inui, K., Fujita, A., Takahashi, T., Iida, R., and Iwakura, T. (2003). Text simplification for reading assistance: A project note. In Proceedings of the 2nd International Workshop on Paraphrasing (IWP).

Kaji, N., Kawahara, D., Kurohash, S., and Sato, S. (2002). Verb paraphrase based on case frame alignment. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Kauchak, D. (2013). Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., and Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical report, Defense Technical Information Center (DTIC) Document.

Knight, K. and Marcu, D. (2002). Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.

Liu, C., Dahlmeier, D., and Ng, H. T. (2010). TESLA: Translation evaluation of sentences with linear-programming-based analysis. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR.

Miwa, M., Saetre, R., Miyao, Y., and Tsujii, J. (2010). Entity-focused sentence simplification for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING).

Napoles, C., Sakaguchi, K., Post, M., and Tetreault, J. (2015). Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL).

Narayan, S. and Gardent, C. (2014). Hybrid simplification using deep semantics and machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL).

Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL).

Pavlick, E., Bos, J., Nissim, M., Beller, C., Van Durme, B., and Callison-Burch, C. (2015). Adding semantics to data-driven paraphrasing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL).

Pavlick, E. and Callison-Burch, C. (2016). Simple PPDB: A paraphrase database for simplification. In The 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Pellow, D. and Eskenazi, M. (2014a). An open corpus of everyday documents for simplification tasks. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR).

Pellow, D. and Eskenazi, M. (2014b). Tracking human process using crowd collaboration to enrich data. In Proceedings of the Second AAAI Conference on Human Computation and Crowdsourcing (HCOMP).

Petersen, S. E. and Ostendorf, M. (2007). Text simplification for language learners: A corpus analysis. In Proceedings of the Workshop on Speech and Language Technology in Education (SLaTE).

Post, M., Ganitkevitch, J., Orland, L., Weese, J., Cao, Y., and Callison-Burch, C. (2013). Joshua 5.0: Sparser, better, faster, server. In Proceedings of the Eighth Workshop on Statistical Machine Translation (WMT).

Rello, L., Baeza-Yates, R. A., and Saggion, H. (2013). The impact of lexical simplification by verbal paraphrases for people with and without dyslexia. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing).

Rush, A. M., Chopra, S., and Weston, J. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Siddharthan, A. (2006). Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109.
Siddharthan, A. (2014). A survey of research on text simplification. Special issue of International Journal of Applied Linguistics, 165(2).

Siddharthan, A. and Angrosh, M. (2014). Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules. In Proceedings of the 25th International Conference on Computational Linguistics (COLING).

Siddharthan, A. and Katsos, N. (2010). Reformulating discourse connectives for non-expert readers. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Siddharthan, A., Nenkova, A., and McKeown, K. (2004). Syntactic simplification for improving content selection in multi-document summarization. In Proceedings of the 20th International Conference on Computational Linguistics (COLING).

Specia, L., Jauhar, S. K., and Mihalcea, R. (2012). SemEval-2012 task 1: English lexical simplification. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval).

Štajner, S., Béchara, H., and Saggion, H. (2015). A deeper exploration of the standard PB-SMT approach to text simplification and its evaluation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL).

Štajner, S., Mitkov, R., and Saggion, H. (2014). One step closer to automatic evaluation of text simplification systems. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR).

Sun, H. and Zhou, M. (2012). Joint learning of a dual SMT system for paraphrase generation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Vickrey, D. and Koller, D. (2008). Sentence simplification for semantic role labeling. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT).

Watanabe, W. M., Junior, A. C., Uzêda, V. R., Fortes, R. P. d. M., Pardo, T. A. S., and Aluísio, S. M. (2009). Facilita: Reading assistance for low-literacy readers. In Proceedings of the 27th ACM International Conference on Design of Communication (SIGDOC).

Weese, J., Ganitkevitch, J., Callison-Burch, C., Post, M., and Lopez, A. (2011). Joshua 3.0: Syntax-based machine translation with the Thrax grammar extractor. In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT).

Woodsend, K. and Lapata, M. (2011). Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Wubben, S., van den Bosch, A., and Krahmer, E. (2012). Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Xu, W., Callison-Burch, C., and Napoles, C. (2015). Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics (TACL), 3:283–297.

Xu, W., Ritter, A., Dolan, B., Grishman, R., and Cherry, C. (2012). Paraphrasing for style. In Proceedings of the 24th International Conference on Computational Linguistics (COLING).
Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C., and Lee, L. (2010). For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Zhu, Z., Bernhard, D., and Gurevych, I. (2010). A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING).

Zollmann, A. and Venugopal, A. (2006). Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation (WMT).