A Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit

Yin-Wen Chang (Google Research, New York) yinwen@google.com
Michael Collins* (Google Research, New York) mjcollins@google.com

*On leave from Columbia University.

Abstract

Decoding of phrase-based translation models in the general case is known to be NP-complete, by a reduction from the traveling salesman problem (Knight, 1999). In practice, phrase-based systems often impose a hard distortion limit that limits the movement of phrases during translation. However, the impact on complexity after imposing such a constraint is not well studied. In this paper, we describe a dynamic programming algorithm for phrase-based decoding with a fixed distortion limit. The runtime of the algorithm is O(n · d! · l · h^{d+1}), where n is the sentence length, d is the distortion limit, l is a bound on the number of phrases starting at any position in the sentence, and h is related to the maximum number of target language translations for any source word. The algorithm makes use of a novel representation that gives a new perspective on decoding of phrase-based models.

1 Introduction

Phrase-based translation models (Koehn et al., 2003; Och and Ney, 2004) are widely used in statistical machine translation. The decoding problem for phrase-based translation models is known to be difficult: the results from Knight (1999) imply that in the general case decoding of phrase-based translation models is NP-complete.

The complexity of phrase-based decoding comes from reordering of phrases. In practice, however, various constraints on reordering are often imposed in phrase-based translation systems. A common constraint is a "distortion limit", which places a hard constraint on how far phrases can move. The complexity of decoding with such a distortion limit is an open question: the NP-hardness result from Knight (1999) applies to a phrase-based model with no distortion limit.

This paper describes an algorithm for phrase-based decoding with a fixed distortion limit whose runtime is linear in the length of the sentence, and for a fixed distortion limit is polynomial in other factors. More specifically, for a hard distortion limit d, and sentence length n, the runtime is O(n · d! · l · h^{d+1}), where l is a bound on the number of phrases starting at any point in the sentence, and h is related to the maximum number of translations for any word in the source language sentence.

The algorithm builds on the insight that decoding with a hard distortion limit is related to the bandwidth-limited traveling salesman problem (BTSP) (Lawler et al., 1985). The algorithm is easily amenable to beam search. It is quite different from previous methods for decoding of phrase-based models, potentially opening up a very different way of thinking about decoding algorithms for phrase-based models, or more generally for models in statistical NLP that involve reordering.

2 Related Work

Knight (1999) proves that decoding of word-to-word translation models is NP-complete, assuming that there is no hard limit on distortion, through a reduction from the traveling salesman problem. Phrase-based models are more general than word-to-word models, hence this result implies that phrase-based decoding with unlimited distortion is NP-complete.
Phrase-based systems can make use of both reordering constraints, which give a hard "distortion limit" on how far phrases can move, and reordering models, which give scores for reordering steps, often penalizing phrases that move long distances. Moses (Koehn et al., 2007b) makes use of a distortion limit, and a decoding algorithm that makes use of bit-strings representing which words have been translated. We show in Section 5.2 of this paper that this can lead to at least 2^{n/4} bit-strings for an input sentence of length n, hence an exhaustive version of this algorithm has worst-case runtime that is exponential in the sentence length. The current paper is concerned with decoding phrase-based models with a hard distortion limit.

Various other reordering constraints have been considered. Zens and Ney (2003) and Zens et al. (2004) consider two types of hard constraints: the IBM constraints, and the ITG (inversion transduction grammar) constraints from the model of Wu (1997). They give polynomial time dynamic programming algorithms for both of these cases. It is important to note that the IBM and ITG constraints are different from the distortion limit constraint considered in the current paper. Decoding algorithms with ITG constraints are further studied by Feng et al. (2010) and Cherry et al. (2012).

Kumar and Byrne (2005) describe a class of reordering constraints and models that can be encoded in finite state transducers. Lopez (2009) shows that several translation models can be represented as weighted deduction problems and analyzes their complexities.¹ Koehn et al. (2003) describe a beam-search algorithm for phrase-based decoding that is in widespread use; see Section 5 for discussion. A number of reordering models have been proposed, see for example Tillmann (2004), Koehn et al. (2007a) and Galley and Manning (2008).

¹An earlier version of this paper states the complexity of decoding with a distortion limit as O(I³ · 2^d) where d is the distortion limit and I is the number of words in the sentence; however (personal communication from Adam Lopez) this runtime is an error, and should be O(2^I), i.e., exponential time in the length of the sentence. A corrected version of the paper corrects this.

DeNero and Klein (2008) consider the phrase alignment problem, that is, the problem of finding an optimal phrase-based alignment for a source-language/target-language sentence pair. They show that in the general case, the phrase alignment problem is NP-hard. It may be possible to extend the techniques in the current paper to the phrase-alignment problem with a hard distortion limit.

Various methods for exact decoding of phrase-based translation models have been proposed. Zaslavskiy et al. (2009) describe the use of traveling salesman algorithms for phrase-based decoding. Chang and Collins (2011) describe an exact method based on Lagrangian relaxation. Aziz et al. (2014) describe a coarse-to-fine approach. These algorithms all have exponential time runtime (in the length of the sentence) in the worst case.

Galley and Manning (2010) describe a decoding algorithm for phrase-based systems where phrases can have discontinuities in both the source and target languages.
The algorithm has some similarities to the algorithm we propose: in particular, it makes use of a state representation that contains a list of disconnected phrases. However, the algorithms differ in several important ways: Galley and Manning (2010) make use of bit string coverage vectors, giving an exponential number of possible states; in contrast to our approach, the translations are not formed in strictly left-to-right ordering on the source side.

3 Background: The Traveling Salesman Problem on Bandwidth-Limited Graphs

This section first defines the bandwidth-limited traveling salesman problem, then describes a polynomial time dynamic programming algorithm for the traveling salesman path problem on bandwidth-limited graphs. This algorithm is the algorithm proposed by Lawler et al. (1985)² with small modifications to make the goal a path instead of a cycle, and to consider directed rather than undirected graphs.

²The algorithm is based on the ideas of Monien and Sudborough (1981) and Ratliff and Rosenthal (1983).

3.1 Bandwidth-Limited TSPPs

The input to the problem is a directed graph G = (V, E), where V is a set of vertices and E is a set of directed edges. We assume that V = {1, 2, ..., n}. A directed edge is a pair (i, j) where i, j ∈ V and i ≠ j. Each edge (i, j) ∈ E has an associated weight w_{i,j}. Given an integer k ≥ 1, a graph is bandwidth-limited with bandwidth k if

  |i − j| ≤ k for all (i, j) ∈ E.

The traveling salesman path problem (TSPP) on the graph G is defined as follows. We will assume that vertex 1 is the "source" vertex and vertex n is the "sink" vertex. The TSPP is to find the minimum cost directed path from vertex 1 to vertex n, which passes through each vertex exactly once.

3.2 An Algorithm for Bandwidth-Limited TSPPs

The key idea of the dynamic-programming algorithm for TSPPs is the definition of equivalence classes corresponding to dynamic programming states, and an argument that the number of equivalence classes depends only on the bandwidth k. The input to our algorithm will be a directed graph G = (V, E), with weights w_{i,j}, and with bandwidth k. We define a 1-n path to be any path from the source vertex 1 to the sink vertex n that visits each vertex in the graph exactly once. A 1-n path is a subgraph (V′, E′) of G, where V′ = V and E′ ⊆ E.

We will make use of the following definition:

Definition 1. For any 1-n path H, define H_j to be the subgraph that H induces on vertices 1, 2, ..., j, where 1 ≤ j ≤ n. That is, H_j contains the vertices 1, 2, ..., j and the edges in H between these vertices.

For a given value for j, we divide the vertices V into three sets A_j, B_j and C_j:

• A_j = {1, 2, ..., (j − k)} (A_j is the empty set if j ≤ k).
• B_j = {1 ... j} \ A_j.³
• C_j = {j + 1, j + 2, ..., n} (C_j is the empty set if j = n).

³For sets X and Y we use the notation X \ Y to refer to the set difference: i.e., X \ Y = {x | x ∈ X and x ∉ Y}.

Note that the vertices in subgraph H_j are the union of the sets A_j and B_j. A_j is the empty set if j ≤ k, but B_j is always non-empty. The following lemma then applies:

Lemma 1. For any 1-n path H in a graph with bandwidth k, for any 1 ≤ j ≤ n, the subgraph H_j has the following properties:

1. If vertex 1 is in A_j, then vertex 1 has degree one.
2. For any vertex v ∈ A_j with v ≥ 2, vertex v has degree two.
3. H_j contains no cycles.

Proof. The first and second properties are true because of the bandwidth limit. Under the constraint of bandwidth k, any edge (u, v) in H such that u ∈ A_j must have v ∈ A_j ∪ B_j.
This follows because if v ∈ C_j = {j + 1, j + 2, ..., n} and u ∈ A_j = {1, 2, ..., j − k}, then |u − v| > k. Similarly any edge (u, v) ∈ H such that v ∈ A_j must have u ∈ A_j ∪ B_j. It follows that for any vertex u ∈ A_j with u > 1, there are edges (u, v) ∈ H_j and (v′, u) ∈ H_j, hence vertex u has degree 2. For vertex u ∈ A_j with u = 1, there is an edge (u, v) ∈ H_j, hence vertex u has degree 1. The third property (no cycles) is true because H_j is a subgraph of H, which has no cycles.

It follows that each connected component of H_j is a directed path, that the start points of these paths are in the set {1} ∪ B_j, and that the end points of these paths are in the set B_j.

We now define an equivalence relation on subgraphs. Two subgraphs H_j and H′_j are in the same equivalence class if the following conditions hold (taken from Lawler et al. (1985)):

1. For any vertex v ∈ B_j, the degree of v in H_j and H′_j is the same.
2. For each path (connected component) in H_j there is a path in H′_j with the same start and end points, and conversely.

The significance of this definition is as follows. Assume that H* is an optimal 1-n path in the graph, and that it induces the subgraph H_j on vertices 1 ... j. Assume that H′_j is another subgraph over vertices 1 ... j, which is in the same equivalence class as H_j. For any subgraph H_j, define c(H_j) to be the sum of edge weights in H_j:

  c(H_j) = Σ_{(u,v) ∈ H_j} w_{u,v}

Then it must be the case that c(H′_j) ≥ c(H_j). Otherwise, we could simply replace H_j by H′_j in H*, thereby deriving a new 1-n path with a lower cost, implying that H* is not optimal.

This observation underlies the dynamic programming approach. Define σ to be a function that maps a subgraph H_j to its equivalence class σ(H_j). The equivalence class σ(H_j) is a data structure that stores the degrees of the vertices in B_j, together with the start and end points of each connected component in H_j.

Next, define ∆ to be a set of 0, 1 or 2 edges between vertex (j + 1) and the vertices in B_j. For any subgraph H_{j+1} of a 1-n path, there is some ∆, simply found by recording the edges incident to vertex (j + 1). For any H_j, define τ(σ(H_j), ∆) to be the equivalence class resulting from adding the edges in ∆ to the data structure σ(H_j). If adding the edges in ∆ to σ(H_j) results in an ill-formed subgraph (for example, a subgraph that has one or more cycles), then τ(σ(H_j), ∆) is undefined. The following recurrence then defines the dynamic program (see Eq. 20 of Lawler et al. (1985)):

  α(j + 1, S) = min_{∆, S′ : τ(S′, ∆) = S} ( α(j, S′) + c(∆) )

Here S is an equivalence class over vertices {1 ... (j + 1)}, and α(j + 1, S) is the minimum score for any subgraph in equivalence class S. The min is taken over all equivalence classes S′ over vertices {1 ... j}, together with all possible values for ∆.
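To make the dynamic program concrete, the following Python sketch implements it with our own compact state encoding rather than Lawler et al.'s exact data structure: a state is a frozenset of path fragments (start, end), which carries the same information as the equivalence classes σ(H_j). The function name and the `weights` dictionary interface are illustrative assumptions, not from the paper.

```python
from collections import defaultdict

def btsp_shortest_path(n, weights, k):
    """Exact TSPP on a bandwidth-k directed graph with vertices 1..n.

    weights: dict mapping a directed edge (u, v) -> cost; the graph is
    assumed bandwidth-limited, i.e. every key satisfies |u - v| <= k.
    A state after processing vertices 1..j is a frozenset of path
    fragments, each stored as a pair (start, end); an isolated vertex v
    is the fragment (v, v).
    """
    INF = float("inf")
    alpha = {frozenset({(1, 1)}): 0.0}   # vertex 1 starts as its own fragment

    def ok(state, j):
        # Lemma 1: once a vertex falls into A_j = {1 .. j-k} it can gain no
        # more edges, so every fragment endpoint must lie in B_j, except
        # that vertex 1 may remain a fragment *start* forever.
        lo = j - k + 1
        for (s, e) in state:
            if s != 1 and s < lo:
                return False
            if e < lo:            # also prunes vertex 1 isolated too long
                return False
        return True

    for j in range(1, n):
        v = j + 1
        new_alpha = defaultdict(lambda: INF)
        for state, cost in alpha.items():
            frags = list(state)
            cands = [(state | {(v, v)}, cost)]               # v stands alone
            for (s, e) in frags:
                if (e, v) in weights:                        # extend an end
                    cands.append(((state - {(s, e)}) | {(s, v)},
                                  cost + weights[(e, v)]))
                if (v, s) in weights and s != 1:             # extend a start
                    cands.append(((state - {(s, e)}) | {(v, e)},
                                  cost + weights[(v, s)]))
            for f1 in frags:                                 # join two fragments
                for f2 in frags:
                    if f1 != f2 and (f1[1], v) in weights \
                            and (v, f2[0]) in weights and f2[0] != 1:
                        cands.append(((state - {f1, f2}) | {(f1[0], f2[1])},
                                      cost + weights[(f1[1], v)]
                                           + weights[(v, f2[0])]))
            for ns, nc in cands:
                ns = frozenset(ns)
                if ok(ns, v) and nc < new_alpha[ns]:
                    new_alpha[ns] = nc
        alpha = dict(new_alpha)
    return alpha.get(frozenset({(1, n)}), INF)   # single 1-n path

# Tiny example: 4 vertices, bandwidth 2; optimal 1-n path is 1->2->3->4.
w = {(1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0,
     (1, 3): 2.0, (3, 2): 1.0, (2, 4): 2.0}
assert btsp_shortest_path(4, w, 2) == 3.0
```

Because validity prunes every state whose fragment endpoints leave the window {j − k + 1, ..., j} (apart from vertex 1), the number of reachable states per position depends only on k, which is the source of the polynomial runtime.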
4 A Dynamic Programming Algorithm for Phrase-Based Decoding

We now describe the dynamic programming algorithm for phrase-based decoding with a fixed distortion limit. We first give basic definitions for phrase-based decoding, and then describe the algorithm.

4.1 Basic Definitions

Consider decoding an input sentence consisting of words x_1 ... x_n for some integer n. We assume that x_1 = <s> and x_n = </s>, where <s> and </s> are the sentence start and end symbols respectively.

A phrase-based lexicon specifies a set of possible translations in the form of phrases p = (s, t, e), where s and t are integers such that 1 ≤ s ≤ t ≤ n, and e is a sequence of m ≥ 1 target-language words e_1 ... e_m. This signifies that words x_s ... x_t in the source language have a translation as e_1 ... e_m in the target language. We use s(p), t(p) and e(p) to refer to the three components of a phrase p = (s, t, e), and e_1(p) ... e_m(p) to refer to the words in the target-language string e(p). We assume that (1, 1, <s>) and (n, n, </s>) are the only translation entries with s(p) ≤ 1 and t(p) ≥ n respectively.

A derivation is then defined as follows:

Definition 2 (Derivations). A derivation is a sequence of phrases p_1 ... p_L such that

• p_1 = (1, 1, <s>) and p_L = (n, n, </s>).
• Each source word is translated exactly once.
• The distortion limit is satisfied for each pair of phrases p_{i−1}, p_i, that is:

  |t(p_{i−1}) + 1 − s(p_i)| ≤ d for all i = 2 ... L,

where d is an integer specifying the distortion limit in the model.

Given a derivation p_1 ... p_L, a target-language translation can be obtained by concatenating the target-language strings e(p_1) ... e(p_L). The scoring function is defined as follows:

  f(p_1 ... p_L) = λ(e(p_1) ... e(p_L)) + Σ_{i=1}^{L} κ(p_i) + Σ_{i=2}^{L} η × |t(p_{i−1}) + 1 − s(p_i)|    (1)

For each phrase p, κ(p) is the translation score for the phrase. The parameter η is the distortion penalty, which is typically a negative constant. λ(e) is a language model score for the string e. We will assume a bigram language model:

  λ(e_1 ... e_m) = Σ_{i=2}^{m} λ(e_i | e_{i−1}).

The generalization of our algorithm to higher-order n-gram language models is straightforward. The goal of phrase-based decoding is to find y* = argmax_{y ∈ Y} f(y), where Y is the set of valid derivations for the input sentence.

Remark (gap constraint): Note that a common restriction used in phrase-based decoding (Koehn et al., 2003; Chang and Collins, 2011) is to impose an additional "gap constraint" while decoding. See Chang and Collins (2011) for a description. In this case it is impossible to have a dynamic-programming state where word x_i has not been translated, and where word x_{i+k} has been translated, for k > d. This limits distortions further, and it can be shown in this case that the number of possible bit-strings is O(2^d) where d is the distortion limit. Without this constraint the algorithm of Koehn et al. (2003) actually fails to produce translations for many input sentences (Chang and Collins, 2011).
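To make Eq. 1 concrete, here is a small Python sketch of the scoring function. The representation of phrases as (s, t, e) triples follows the paper; the names `derivation_score`, `kappa` and `lm`, and the treatment of κ and λ as a given table and function, are our own illustrative assumptions.

```python
def derivation_score(phrases, kappa, lm, eta):
    """Score f(p_1 ... p_L) of a derivation, per Eq. 1.

    phrases: list of (s, t, e) triples, e a tuple of target words,
             beginning with the <s> phrase and ending with the </s> phrase.
    kappa:   dict mapping a phrase triple to its translation score κ(p).
    lm:      function lm(w, prev) returning the bigram score λ(w | prev).
    eta:     distortion penalty η (typically a negative constant).
    """
    total, prev_word = 0.0, None
    for i, (s, t, e) in enumerate(phrases):
        total += kappa[(s, t, e)]          # translation score κ(p_i)
        for w in e:                        # bigram LM over e(p_1) ... e(p_L)
            if prev_word is not None:
                total += lm(w, prev_word)
            prev_word = w
        if i > 0:                          # distortion penalty term
            prev_t = phrases[i - 1][1]
            total += eta * abs(prev_t + 1 - s)
    return total
```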
H_1 = ⟨π_1⟩ = ⟨⟨(1, 1, <s>)⟩⟩
H_3 = ⟨π_1⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must)⟩⟩
H_4 = ⟨π_1⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also)⟩⟩
H_6 = ⟨π_1, π_2⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also)⟩, ⟨(5, 6, these criticisms)⟩⟩
H_7 = ⟨π_1, π_2⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also)⟩, ⟨(5, 6, these criticisms) (7, 7, seriously)⟩⟩
H_8 = ⟨π_1⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also) (8, 8, take) (5, 6, these criticisms) (7, 7, seriously)⟩⟩
H_9 = ⟨π_1⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also) (8, 8, take) (5, 6, these criticisms) (7, 7, seriously) (9, 9, </s>)⟩⟩

Figure 1: Sub-derivations H_j for j ∈ {1, 3, 4, 6, 7, 8, 9} induced by the full derivation H = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also) (8, 8, take) (5, 6, these criticisms) (7, 7, seriously) (9, 9, </s>)⟩⟩. Note that H_j includes the phrases that cover spans ending before or at position j. Sub-derivation H_j is extended to another sub-derivation H_{j+i} by incorporating a phrase of length i.

4.2 The Algorithm

We now describe the dynamic programming algorithm. Intuitively the algorithm builds a derivation by processing the source-language sentence in strictly left-to-right order. This is in contrast with the algorithm of Koehn et al. (2007b), where the target-language sentence is constructed from left to right.

Throughout this section we will use π, or π_i for some integer i, to refer to a sequence of phrases:

  π = ⟨p_1 ... p_l⟩

where each phrase p_i = (s(p_i), t(p_i), e(p_i)), as defined in the previous section. We overload the s, t and e operators, so that if π = ⟨p_1 ... p_l⟩, we have s(π) = s(p_1), t(π) = t(p_l), and e(π) = e(p_1) · e(p_2) · ... · e(p_l), where x · y is the concatenation of strings x and y.

A derivation H consists of a single phrase sequence π = ⟨p_1 ... p_L⟩:

  H = π = ⟨p_1 ... p_L⟩

where the sequence p_1 ... p_L satisfies the constraints in Definition 2. We now give a definition of sub-derivations and complement sub-derivations:

Definition 3 (Sub-derivations and Complement Sub-derivations). For any H = ⟨p_1 ... p_L⟩, for any j ∈ {1 ... n} such that there is some i ∈ {1 ... L} with t(p_i) = j, the sub-derivation H_j and the complement sub-derivation H̄_j are defined as

  H_j = ⟨π_1 ... π_r⟩,  H̄_j = ⟨π̄_1 ... π̄_r⟩

where the following properties hold:

• r is an integer with r ≥ 1.
• Each π_i for i = 1 ... r is a sequence of one or more phrases, where each phrase p ∈ π_i has t(p) ≤ j.
• Each π̄_i for i = 1 ... (r − 1) is a sequence of one or more phrases, where each phrase p ∈ π̄_i has s(p) > j.
• π̄_r is a sequence of zero or more phrases, where each phrase p ∈ π̄_r has s(p) > j. We have zero phrases in π̄_r iff j = n, where n is the length of the sentence.
• Finally, π_1 · π̄_1 · π_2 · π̄_2 · ... · π_r · π̄_r = p_1 ... p_L, where x · y denotes the concatenation of phrase sequences x and y.

Note that for any j ∈ {1 ... n} such that there is no i ∈ {1 ... L} with t(p_i) = j, the sub-derivation H_j and the complement sub-derivation H̄_j are not defined.

Thus for each integer j such that there is a phrase in H ending at point j, we can divide the phrases in H into two sets: phrases p with t(p) ≤ j, and phrases p with s(p) > j. The sub-derivation H_j lists all maximal sub-sequences of phrases with t(p) ≤ j. The complement sub-derivation H̄_j lists all maximal sub-sequences of phrases with s(p) > j.

Figure 1 gives all sub-derivations H_j for the derivation

  H = ⟨⟨p_1 ... p_7⟩⟩ = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also) (8, 8, take) (5, 6, these criticisms) (7, 7, seriously) (9, 9, </s>)⟩⟩

As one example, the sub-derivation H_7 = ⟨π_1, π_2⟩ induced by H has two phrase sequences:

  π_1 = ⟨(1, 1, <s>) (2, 3, we must) (4, 4, also)⟩
  π_2 = ⟨(5, 6, these criticisms) (7, 7, seriously)⟩

Note that the phrase sequences π_1 and π_2 give translations for all words x_1 ... x_7 in the sentence. There are two disjoint phrase sequences because in the full derivation H, the phrase p = (8, 8, take), with t(p) = 8 > 7, is used to form a longer sequence of phrases π_1 p π_2.

For the above example, the complement sub-derivation H̄_7 is as follows:

  π̄_1 = ⟨(8, 8, take)⟩
  π̄_2 = ⟨(9, 9, </s>)⟩

It can be verified that π_1 · π̄_1 · π_2 · π̄_2 = H, as required by the definition of sub-derivations and complement sub-derivations.
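The decomposition in Definition 3 is mechanical to compute for a given derivation and index j. The following sketch (function name and list-based representation are our own) recovers H_j and H̄_j as lists of maximal runs of phrases, and reproduces the H_7 example above.

```python
def sub_derivations(deriv, j):
    """Return (H_j, Hbar_j) per Definition 3, as lists of phrase sequences.

    deriv: list of (s, t, e) triples; defined only if some phrase ends at j.
    """
    assert any(t == j for (_, t, _) in deriv), "no phrase ends at j"
    H, Hbar, run, run_inside = [], [], [], None
    for (s, t, e) in deriv:
        inside = t <= j
        assert inside or s > j      # no phrase straddles j when H_j exists
        if run and inside != run_inside:
            (H if run_inside else Hbar).append(run)
            run = []
        run_inside = inside
        run.append((s, t, e))
    (H if run_inside else Hbar).append(run)
    return H, Hbar

# The running example: H_7 has two phrase sequences, Hbar_7 two more.
H = [(1, 1, ("<s>",)), (2, 3, ("we", "must")), (4, 4, ("also",)),
     (8, 8, ("take",)), (5, 6, ("these", "criticisms")),
     (7, 7, ("seriously",)), (9, 9, ("</s>",))]
H7, Hbar7 = sub_derivations(H, 7)
assert len(H7) == 2 and len(Hbar7) == 2
```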
We now state the following lemma:

Lemma 2. For any derivation H = p_1 ... p_L, for any j such that there is some i with t(p_i) = j, the sub-derivation H_j = ⟨π_1 ... π_r⟩ satisfies the following properties:

1. s(π_1) = 1 and e_1(π_1) = <s>.
2. For all positions i ∈ {1 ... j}, there exists a phrase p ∈ π, for some phrase sequence π ∈ H_j, such that s(p) ≤ i ≤ t(p).
3. For all i = 2 ... r, s(π_i) ∈ {(j − d + 2) ... j}.
4. For all i = 1 ... r, t(π_i) ∈ {(j − d) ... j}.

Here d is again the distortion limit. This lemma is a close analogy of Lemma 1. The proof is as follows:

Proof of Property 1: For all values of j, the phrase p_1 = (1, 1, <s>) has t(p_1) ≤ j, hence we must have π_1 = p_1 ... p_k for some k ∈ {1 ... L}. It follows that s(π_1) = 1 and e_1(π_1) = <s>.

Proof of Property 2: For any position i ∈ {1 ... j}, define the phrase (s, t, e) in the derivation H to be the phrase that covers word i; i.e., the phrase such that s ≤ i ≤ t. We must have s ∈ {1 ... j}, because s ≤ i and i ≤ j. We must also have t ∈ {1 ... j}, because otherwise we have s ≤ j < t, which contradicts the assumption that there is some i ∈ {1 ... L} such that t(p_i) = j. It follows that the phrase (s, t, e) has t ≤ j, and from the definition of sub-derivations it follows that the phrase is in one of the phrase sequences π_1 ... π_r.

Proof of Property 3: This follows from the distortion limit. Consider the complement sub-derivation H̄_j = ⟨π̄_1 ... π̄_r⟩. For the distortion limit to be satisfied, for all i ∈ {2 ... r}, we must have

  |t(π̄_{i−1}) + 1 − s(π_i)| ≤ d

We must also have t(π̄_{i−1}) > j, and s(π_i) ≤ j, by the definition of sub-derivations. It follows that s(π_i) ∈ {(j − d + 2) ... j}.

Proof of Property 4: This follows from the distortion limit. First consider the case where π̄_r is non-empty. For the distortion limit to be satisfied, for all i ∈ {1 ... r}, we must have

  |t(π_i) + 1 − s(π̄_i)| ≤ d

We must also have t(π_i) ≤ j, and s(π̄_i) > j, by the definition of sub-derivations. It follows that t(π_i) ∈ {(j − d) ... j}.

Next consider the case where π̄_r is empty. In this case we must have j = n. For the distortion limit to be satisfied, for all i ∈ {1 ... (r − 1)}, we must have

  |t(π_i) + 1 − s(π̄_i)| ≤ d

We must also have t(π_i) ≤ j, and s(π̄_i) > j, by the definition of sub-derivations. It follows that t(π_i) ∈ {(j − d) ... j} for i ∈ {1 ... (r − 1)}. For i = r, we must have t(π_r) = n, from which it again follows that t(π_r) = n ∈ {(j − d) ... j}.

We now define an equivalence relation between sub-derivations, which will be central to the dynamic programming algorithm. We define a function σ that maps a phrase sequence π to its signature. The signature is a four-tuple:

  σ(π) = (s, w_s, t, w_t)

where s is the start position, w_s is the start word, t is the end position and w_t is the end word of the phrase sequence. We will use s(σ), w_s(σ), t(σ), and w_t(σ) to refer to each component of a signature σ. For example, given a phrase sequence π = ⟨(1, 1, <s>) (2, 2, we) (4, 4, also)⟩, its signature is σ(π) = (1, <s>, 4, also).

The signature of a sub-derivation H_j = ⟨π_1 ... π_r⟩ is defined to be σ(H_j) = ⟨σ(π_1) ... σ(π_r)⟩. For example, with H_7 as defined above, we have

  σ(H_7) = ⟨(1, <s>, 4, also), (5, these, 7, seriously)⟩

Two partial derivations H_j and H′_j are in the same equivalence class iff σ(H_j) = σ(H′_j).
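In code, a signature and the corresponding dynamic-programming state are straightforward. This is a sketch under the same assumed representation as before (phrases as (s, t, e) triples with e a tuple of words; function names ours):

```python
def signature(pi):
    """σ(π) = (s, w_s, t, w_t) for a phrase sequence π (a list of phrases)."""
    (s, _, e_first), (_, t, e_last) = pi[0], pi[-1]
    return (s, e_first[0], t, e_last[-1])

def state_of(H_j, j):
    """The dynamic-programming state (j, {σ(π_1) ... σ(π_r)}) of Section 4.2."""
    return (j, frozenset(signature(pi) for pi in H_j))

pi1 = [(1, 1, ("<s>",)), (2, 3, ("we", "must")), (4, 4, ("also",))]
pi2 = [(5, 6, ("these", "criticisms")), (7, 7, ("seriously",))]
assert signature(pi1) == (1, "<s>", 4, "also")
assert state_of([pi1, pi2], 7)[1] == {(1, "<s>", 4, "also"),
                                      (5, "these", 7, "seriously")}
```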
We can now state the following lemma:

Lemma 3. Define H* to be the optimal derivation for some input sentence, and H*_j to be a sub-derivation of H*. Suppose H′_j is another sub-derivation with j words, such that σ(H′_j) = σ(H*_j). Then it must be the case that f(H*_j) ≥ f(H′_j), where f is the function defined in Section 4.1.

Proof. Define the sub-derivation and complement sub-derivation of H* as

  H*_j = ⟨π_1 ... π_r⟩,  H̄*_j = ⟨π̄_1 ... π̄_r⟩

We then have

  f(H*) = f(H*_j) + f(H̄*_j) + γ    (2)

where f(...) is as defined in Eq. 1, and γ takes into account the bigram language modeling scores and the distortion scores for the transitions π_1 → π̄_1, π̄_1 → π_2, π_2 → π̄_2, etc.

The proof is by contradiction. Define H′_j = π′_1 ... π′_r and assume that f(H*_j) < f(H′_j). Now consider

  H′ = π′_1 π̄_1 π′_2 π̄_2 ... π′_r π̄_r

This is a valid derivation because the transitions π′_1 → π̄_1, π̄_1 → π′_2, π′_2 → π̄_2 have the same distortion distances as π_1 → π̄_1, π̄_1 → π_2, π_2 → π̄_2, hence they must satisfy the distortion limit. We have

  f(H′) = f(H′_j) + f(H̄*_j) + γ    (3)

where γ has the same value as in Eq. 2. This follows because the scores for the transitions π′_1 → π̄_1, π̄_1 → π′_2, π′_2 → π̄_2 are identical to the scores for the transitions π_1 → π̄_1, π̄_1 → π_2, π_2 → π̄_2, because σ(H*_j) = σ(H′_j).

It follows from Eq. 2 and Eq. 3 that if f(H′_j) > f(H*_j), then f(H′) > f(H*). But this contradicts the assumption that H* is optimal. It follows that we must have f(H′_j) ≤ f(H*_j).

This lemma leads to a dynamic programming algorithm. Each dynamic programming state consists of an integer j ∈ {1 ... n} and a set of r signatures:

  T = (j, {σ_1 ... σ_r})

Figure 2 shows the dynamic programming algorithm. It relies on the following functions:

• For any state T, δ(T) is the set of outgoing transitions from state T.
• For any state T, for any transition ∆ ∈ δ(T), τ(T, ∆) is the state reached by transition ∆ from state T.
• For any state T, valid(T) checks if a resulting state is valid.
• For any transition ∆, score(∆) is the score for the transition.

Inputs:
• An integer n specifying the length of the input sequence.
• A function δ(T) returning the set of valid transitions from state T.
• A function τ(T, ∆) returning the state reached from state T by transition ∆ ∈ δ(T).
• A function valid(T) returning TRUE if state T is valid, otherwise FALSE.
• A function score(∆) that returns the score for any transition ∆.

Initialization:
  T_1 = (1, {(1, <s>, 1, <s>)}); α(T_1) = 0
  𝒯_1 = {T_1}; 𝒯_j = ∅ for all j ∈ {2 ... n}

for j = 1, ..., n − 1:
  for each state T ∈ 𝒯_j:
    for each ∆ ∈ δ(T):
      T′ = τ(T, ∆)
      if valid(T′) = FALSE: continue
      score = α(T) + score(∆)
      define t to be the integer such that T′ = (t, {σ_1 ... σ_r})
      if T′ ∉ 𝒯_t:
        𝒯_t = 𝒯_t ∪ {T′}; α(T′) = score; bp(T′) = (∆)
      else if score > α(T′):
        α(T′) = score; bp(T′) = (∆)

Return: the score of the state (n, {(1, <s>, n, </s>)}) in 𝒯_n, and backpointers bp defining the transitions leading to this state.

Figure 2: The phrase-based decoding algorithm. α(T) is the score for state T. The bp(T) variables are backpointers used in recovering the highest scoring sequence of transitions.
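The control flow of Figure 2 translates almost line for line into Python. The sketch below assumes implementations of δ, τ, valid and score (for instance along the lines of the sketches in the following subsections); the dictionary-based bookkeeping and the convention that τ returns None when undefined are our own.

```python
from collections import defaultdict

def decode(n, delta, tau, valid, score):
    """Skeleton of the decoder in Figure 2; returns (best score, backpointers).

    States are (j, frozenset of signatures); signatures are
    (s, w_s, t, w_t) tuples; tau returns None when τ(T, Δ) is undefined.
    """
    T1 = (1, frozenset({(1, "<s>", 1, "<s>")}))
    alpha, bp = {T1: 0.0}, {}
    buckets = defaultdict(set)
    buckets[1].add(T1)
    for j in range(1, n):
        for T in list(buckets[j]):
            for delta_step in delta(T):
                T2 = tau(T, delta_step)
                if T2 is None or not valid(T2):
                    continue
                new_score = alpha[T] + score(delta_step)
                t = T2[0]                    # T2 = (t, {σ_1 ... σ_r})
                if T2 not in alpha or new_score > alpha[T2]:
                    alpha[T2] = new_score
                    bp[T2] = (T, delta_step)
                    buckets[t].add(T2)       # transitions only move right
    goal = (n, frozenset({(1, "<s>", n, "</s>")}))
    return alpha.get(goal), bp
```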
We next give full definitions of these functions.

4.2.1 Definitions of δ(T) and τ(T, ∆)

Recall that for any state T, δ(T) returns the set of possible transitions from state T. In addition τ(T, ∆) returns the state reached when taking transition ∆ ∈ δ(T).

Given the state T = (j, {σ_1 ... σ_r}), each transition is of the form ψ_1 p ψ_2, where ψ_1, p and ψ_2 are defined as follows:

• p is a phrase such that s(p) = j + 1.
• ψ_1 ∈ {σ_1 ... σ_r} ∪ {φ}. If ψ_1 ≠ φ, it must be the case that |t(ψ_1) + 1 − s(p)| ≤ d and t(ψ_1) ≠ n.
• ψ_2 ∈ {σ_1 ... σ_r} ∪ {φ}. If ψ_2 ≠ φ, it must be the case that |t(p) + 1 − s(ψ_2)| ≤ d and s(ψ_2) ≠ 1.
• If ψ_1 ≠ φ and ψ_2 ≠ φ, then ψ_1 ≠ ψ_2.

Thus there are four possible types of transition from a state T = (j, {σ_1 ... σ_r}):

Case 1: ∆ = φ p φ. In this case the phrase p is incorporated as a stand-alone phrase. The new state T′ is equal to (j′, {σ′_1 ... σ′_{r+1}}), where j′ = t(p), where σ′_i = σ_i for i = 1 ... r, and σ′_{r+1} = (s(p), e_1(p), t(p), e_m(p)).

Case 2: ∆ = σ_i p φ for some σ_i ∈ {σ_1 ... σ_r}. In this case the phrase p is appended to the signature σ_i. The new state T′ = τ(T, ∆) is of the form (j′, {σ′_1 ... σ′_r}), where j′ = t(p), where σ_i is replaced by (s(σ_i), w_s(σ_i), t(p), e_m(p)), and where σ′_{i′} = σ_{i′} for all i′ ≠ i.

Case 3: ∆ = φ p σ_i for some σ_i ∈ {σ_1 ... σ_r}. In this case the phrase p is prepended to the signature σ_i. The new state T′ = τ(T, ∆) is of the form (j′, {σ′_1 ... σ′_r}), where j′ = t(p), where σ_i is replaced by (s(p), e_1(p), t(σ_i), w_t(σ_i)), and where σ′_{i′} = σ_{i′} for all i′ ≠ i.

Case 4: ∆ = σ_i p σ_{i′} for some σ_i, σ_{i′} ∈ {σ_1 ... σ_r}, with i′ ≠ i. In this case phrase p is appended to signature σ_i, and prepended to signature σ_{i′}, effectively joining the two signatures together. In this case the new state T′ = τ(T, ∆) is of the form (j′, {σ′_1 ... σ′_{r−1}}), where signatures σ_i and σ_{i′} are replaced by a new signature (s(σ_i), w_s(σ_i), t(σ_{i′}), w_t(σ_{i′})), and all other signatures are copied across from T to T′.

Figure 3 gives the dynamic programming states and transitions for the derivation H in Figure 1. For example, the sub-derivation

  H_7 = ⟨⟨(1, 1, <s>) (2, 3, we must) (4, 4, also)⟩, ⟨(5, 6, these criticisms) (7, 7, seriously)⟩⟩

will be mapped to a state

  T = (7, σ(H_7)) = (7, {(1, <s>, 4, also), (5, these, 7, seriously)})

(1, {σ_1 = (1, <s>, 1, <s>)})
  ∆ = σ_1 (2, 3, we must) φ
(3, {σ_1 = (1, <s>, 3, must)})
  ∆ = σ_1 (4, 4, also) φ
(4, {σ_1 = (1, <s>, 4, also)})
  ∆ = φ (5, 6, these criticisms) φ
(6, {σ_1 = (1, <s>, 4, also), σ_2 = (5, these, 6, criticisms)})
  ∆ = σ_2 (7, 7, seriously) φ
(7, {σ_1 = (1, <s>, 4, also), σ_2 = (5, these, 7, seriously)})
  ∆ = σ_1 (8, 8, take) σ_2
(8, {σ_1 = (1, <s>, 7, seriously)})
  ∆ = σ_1 (9, 9, </s>) φ
(9, {σ_1 = (1, <s>, 9, </s>)})

Figure 3: Dynamic programming states and the transitions from one state to another, using the same example as in Figure 1. Note that σ_i = σ(π_i) for all π_i ∈ H_j.

The transition σ_1 (8, 8, take) σ_2 from this state leads to a new state,

  T′ = (8, {σ_1 = (1, <s>, 7, seriously)})

4.3 Definition of score(∆)

Figure 4 gives the definition of score(∆), which incorporates the language model, phrase scores, and distortion penalty implied by the transition ∆.

∆ = φ p φ:
  resulting signature: (s, e_1, t, e_m)
  score(∆) = ŵ(p)
∆ = σ_i p φ:
  resulting signature: (s(σ_i), w_s(σ_i), t, e_m)
  score(∆) = ŵ(p) + λ(e_1 | w_t(σ_i)) + η × |t(σ_i) + 1 − s|
∆ = φ p σ_i:
  resulting signature: (s, e_1, t(σ_i), w_t(σ_i))
  score(∆) = ŵ(p) + λ(w_s(σ_i) | e_m) + η × |t + 1 − s(σ_i)|
∆ = σ_i p σ_{i′}:
  resulting signature: (s(σ_i), w_s(σ_i), t(σ_{i′}), w_t(σ_{i′}))
  score(∆) = ŵ(p) + λ(e_1 | w_t(σ_i)) + η × |t(σ_i) + 1 − s| + λ(w_s(σ_{i′}) | e_m) + η × |t + 1 − s(σ_{i′})|

Figure 4: Four operations that can extend a state T = (j, {σ_1 ... σ_r}) by a phrase p = (s, t, e_1 ... e_m), and the scores incurred. We define ŵ(p) = κ(p) + Σ_{i=2}^{m} λ(e_i(p) | e_{i−1}(p)). The function ŵ(p) includes the phrase translation model κ and the language model scores that can be computed using p alone. The weight η is the distortion penalty.

4.4 Definition of valid(T)

Figure 5 gives the definition of valid(T). This function checks that the start and end points of each signature are in the set of allowed start and end points given in Lemma 2.

Function valid(T)
Input: state T = (j, {σ_1 ... σ_r})
for i = 1 ... r:
  if s(σ_i) < j − d + 2 and s(σ_i) ≠ 1: return FALSE
  if t(σ_i) < j − d: return FALSE
return TRUE

Figure 5: The valid function.
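Below is a sketch of τ and valid under the same assumed representation as the earlier sketches (signatures as (s, w_s, t, w_t) tuples, transitions as (ψ_1, p, ψ_2) triples with None standing for φ; all names ours). The four cases collapse into two optional merge steps, and score(∆) from Figure 4 can be accumulated alongside in exactly the same case structure.

```python
def tau(state, transition, d, n):
    """τ(T, Δ) for Δ = ψ1 p ψ2 (Cases 1-4 of Section 4.2.1).

    Returns the new state, or None if the transition is not allowed.
    """
    j, sigs = state
    psi1, p, psi2 = transition
    s, t, e = p
    if s != j + 1:                                  # p must start at j + 1
        return None
    new_sigs = set(sigs)
    merged = (s, e[0], t, e[-1])                    # Case 1: stand-alone p
    if psi1 is not None:                            # append p to ψ1
        if psi1 not in new_sigs or abs(psi1[2] + 1 - s) > d or psi1[2] == n:
            return None
        new_sigs.remove(psi1)
        merged = (psi1[0], psi1[1], t, e[-1])
    if psi2 is not None:                            # prepend (ψ1·)p to ψ2
        if psi2 not in new_sigs or abs(t + 1 - psi2[0]) > d or psi2[0] == 1:
            return None                             # also rejects ψ1 = ψ2
        new_sigs.remove(psi2)
        merged = (merged[0], merged[1], psi2[2], psi2[3])
    new_sigs.add(merged)
    return (t, frozenset(new_sigs))

def valid(state, d):
    """Figure 5: endpoints of every signature must obey Lemma 2."""
    j, sigs = state
    return all(not (s < j - d + 2 and s != 1) and not (t < j - d)
               for (s, _, t, _) in sigs)
```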
4.5 A Bound on the Runtime of the Algorithm

We now give a bound on the algorithm's run time. This will be the product of terms N and M, where N is an upper bound on the number of states in the dynamic program, and M is an upper bound on the number of outgoing transitions from any state.

For any j ∈ {1 ... n}, define first(j) to be the set of target-language words that can begin at position j, and last(j) to be the set of target-language words that can end at position j:

  first(j) = {w : ∃ p = (s, t, e) s.t. s = j and e_1 = w}
  last(j) = {w : ∃ p = (s, t, e) s.t. t = j and e_m = w}

In addition, define singles(j) to be the set of phrases that translate the single word at position j:

  singles(j) = {p : s(p) = j and t(p) = j}

Next, define h to be the smallest integer such that for all j, |first(j)| ≤ h, |last(j)| ≤ h, and |singles(j)| ≤ h. Thus h is a measure of the maximal ambiguity of any word x_j in the input. Finally, for any position j, define start(j) to be the set of phrases starting at position j:

  start(j) = {p : s(p) = j}

and define l to be the smallest integer such that for all j, |start(j)| ≤ l. Given these definitions we can state the following result:

Theorem 1. The time complexity of the algorithm is O(n · d! · l · h^{d+1}).

To prove this we need the following definition:

Definition 4 (p-structures). For any finite set A of integers with |A| = k, a p-structure is a set of r ordered pairs {(s_i, t_i)}_{i=1}^{r} that satisfies the following properties: 1) 0 ≤ r ≤ k; 2) for each i ∈ {1 ... r}, s_i ∈ A and t_i ∈ A (both s_i = t_i and s_i ≠ t_i are allowed); 3) for each j ∈ A, there is at most one index i ∈ {1 ... r} such that (s_i = j) or (t_i = j) or (s_i = j and t_i = j).

We use g(k) to denote the number of unique p-structures for a set A with |A| = k. We then have the following lemmas:

Lemma 4. The function g(k) satisfies g(0) = 1, g(1) = 2, and the following recurrence for k ≥ 2:

  g(k) = 2g(k − 1) + 2(k − 1)g(k − 2)

Proof. The proof is in Appendix A.

Lemma 5. Consider the function h(k) = k² × g(k). h(k) is in O((k − 2)!).

Proof. The proof is in Appendix B.

We can now prove the theorem:

Proof of Theorem 1: First consider the number of states in the dynamic program. Each state is of the form (j, {σ_1 ... σ_r}), where the set {(s(σ_i), t(σ_i))}_{i=1}^{r} is a p-structure over the set {1} ∪ {(j − d) ... j}. The number of possible values for {(s(σ_i), t(σ_i))}_{i=1}^{r} is at most g(d + 2).

For a fixed choice of {(s(σ_i), t(σ_i))}_{i=1}^{r}, we will argue that there are at most h^{d+1} possible values for {(w_s(σ_i), w_t(σ_i))}_{i=1}^{r}. This follows because for each k ∈ {(j − d) ... j} there are at most h possible choices: if there is some i such that s(σ_i) = k and t(σ_i) ≠ k, then the associated word w_s(σ_i) is in the set first(k); alternatively, if there is some i such that t(σ_i) = k and s(σ_i) ≠ k, then the associated word w_t(σ_i) is in the set last(k); alternatively, if there is some i such that s(σ_i) = t(σ_i) = k, then the associated words w_s(σ_i), w_t(σ_i) must be the first/last word of some phrase in singles(k); alternatively, there is no i such that s(σ_i) = k or t(σ_i) = k, in which case there is no choice associated with position k in the sentence. Hence there are at most h choices associated with each position k ∈ {(j − d) ... j}, giving h^{d+1} choices in total. Combining these results, and noting that there are n choices of the variable j, implies that there are at most n · g(d + 2) · h^{d+1} states in the dynamic program.

Now consider the number of transitions from any state. A transition is of the form ψ_1 p ψ_2, as defined in Section 4.2.1. For a given state there are at most (d + 2) choices for each of ψ_1 and ψ_2, and l choices for p, giving at most (d + 2)² · l choices in total.

Multiplying the upper bounds on the number of states and number of transitions for each state gives an upper bound on the runtime of the algorithm as O(n · g(d + 2) · h^{d+1} · (d + 2)² · l). Hence by Lemma 5 the runtime is O(n · d! · l · h^{d+1}).
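The recurrence for g(k) is easy to sanity-check against brute-force enumeration of p-structures. The sketch below (function names ours) does exactly that; note it uses the base case g(0) = 1, since the empty set of pairs is the unique p-structure over an empty A.

```python
def g_rec(k):
    """g(k) via the recurrence of Lemma 4, with g(0) = 1 (see Appendix A)."""
    if k == 0:
        return 1
    if k == 1:
        return 2
    return 2 * g_rec(k - 1) + 2 * (k - 1) * g_rec(k - 2)

def g_brute(k):
    """Count p-structures over a k-element set directly, per Definition 4."""
    def count(avail):
        if not avail:
            return 1
        v = min(avail)
        rest = avail - {v}
        total = count(rest)          # v appears in no pair
        total += count(rest)         # the pair (v, v)
        for u in rest:               # the pairs (v, u) and (u, v)
            total += 2 * count(rest - {u})
        return total
    return count(frozenset(range(k)))

assert all(g_rec(k) == g_brute(k) for k in range(8))
```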
The bound g(d + 2) on the number of possible values for {(s(σ_i), t(σ_i))}_{i=1}^{r} is somewhat loose, as the set of p-structures over {1} ∪ {(j − d) ... j} includes impossible values {(s_i, t_i)}_{i=1}^{r} where, for example, there is no i such that s_i = 1. However the bound is tight enough to give the O(d!) runtime.

5 Discussion

We conclude the paper with discussion of some issues. First we describe how the dynamic programming structures we have described can be used in conjunction with beam search. Second, we give more analysis of the complexity of the widely-used decoding algorithm of Koehn et al. (2003).

5.1 Beam Search

Beam search is widely used in phrase-based decoding; it can also be applied to our dynamic programming construction. We can replace the line

  for each state T ∈ 𝒯_j

in the algorithm in Figure 2 with

  for each state T ∈ beam(𝒯_j)

where beam is a function that returns a subset of 𝒯_j, most often the highest scoring elements of 𝒯_j under some scoring criterion.

A key question concerns the choice of scoring function γ(T) used to rank states. One proposal is to define γ(T) = α(T) + β(T), where α(T) is the score used in the dynamic program, and

  β(T) = Σ_{i : w_s(σ_i) ≠ <s>} λ_u(w_s(σ_i)).

Here λ_u(w) is the score of word w under a unigram language model. The β(T) scores allow different states in 𝒯_j, which have different words w_s(σ_i) at the start of signatures, to be comparable: for example, it compensates for the case where w_s(σ_i) is a rare word, which will incur a low probability when the bigram ⟨w, w_s(σ_i)⟩ for some word w is constructed during search. The β(T) values play a similar role to "future scores" in the algorithm of Koehn et al. (2003).

However, in the Koehn et al. (2003) algorithm, different items in the same beam can translate different subsets of the input sentence, making future-score estimation more involved. In our case all items in 𝒯_j translate all words x_1 ... x_j inclusive, which may make comparison of different hypotheses more straightforward.
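A minimal beam function under this proposal, as a sketch; the `unigram_lm` function and the tuple-based states follow the assumptions of the earlier sketches:

```python
def beam(states, alpha, unigram_lm, width):
    """Keep the `width` best states of T_j under γ(T) = α(T) + β(T),
    where β(T) sums unigram scores λ_u(w_s(σ)) over signatures whose
    start word is not <s> (Section 5.1)."""
    def gamma(T):
        _, sigs = T
        beta = sum(unigram_lm(ws) for (_, ws, _, _) in sigs if ws != "<s>")
        return alpha[T] + beta
    return sorted(states, key=gamma, reverse=True)[:width]
```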
5.2 Complexity of Decoding with Bit-string Representations

A common method for decoding phrase-based models, as described in Koehn et al. (2003), is to use beam search in conjunction with a search algorithm that 1) creates the target language string in strictly left-to-right order; 2) uses a bit string with bits b_i ∈ {0, 1} for i = 1 ... n, representing at each point whether word i in the input has been translated. A natural question is whether the number of possible bit strings for a model with a fixed distortion limit d can grow exponentially quickly with respect to the length of the input sentence. This section gives an example that shows that this is indeed the case.

Assume that our sentence length n is such that (n − 2)/4 is an integer. Assume as before x_1 = <s> and x_n = </s>. For each k ∈ {0 ... ((n − 2)/4 − 1)}, assume we have the following phrases for the words x_{4k+2} ... x_{4k+5}:

  (4k + 2, 4k + 2, u_k)
  (4k + 3, 4k + 3, v_k)
  (4k + 4, 4k + 4, w_k)
  (4k + 5, 4k + 5, z_k)
  (4k + 4, 4k + 5, y_k)

Note that the only source of ambiguity is, for each k, whether we use y_k to translate the entire phrase x_{4k+4} x_{4k+5}, or whether we use w_k and z_k to translate x_{4k+4} and x_{4k+5} separately.

With a distortion limit d ≥ 5, the number of possible bit strings in this example is at least 2^{(n−2)/4}. This follows because for any setting of the variables b_{4k+4} ∈ {0, 1} for k ∈ {0 ... ((n − 2)/4 − 1)}, there is a valid derivation p_1 ... p_L such that the prefix p_1 ... p_l, where l = 1 + (n − 2)/4, gives this bit string. Simply choose p_1 = (1, 1, <s>) and, for l′ ∈ {0 ... (n − 2)/4 − 1}, choose p_{l′+2} = (4l′ + 4, 4l′ + 5, y_{l′}) if b_{4l′+4} = 1, and p_{l′+2} = (4l′ + 5, 4l′ + 5, z_{l′}) otherwise. It can be verified that p_1 ... p_l is a valid prefix (there is a valid way to give a complete derivation from this prefix).

As one example, for n = 10, and b_4 = 1 and b_8 = 0, a valid derivation is

  (1, 1, <s>) (4, 5, y_0) (9, 9, z_1) (7, 7, v_1) (3, 3, v_0) (2, 2, u_0) (6, 6, u_1) (8, 8, w_1) (10, 10, </s>)

In this case the prefix (1, 1, <s>) (4, 5, y_0) (9, 9, z_1) gives b_4 = 1 and b_8 = 0. Other values for b_4 and b_8 can be given by using (5, 5, z_0) in place of (4, 5, y_0), and (8, 9, y_1) in place of (9, 9, z_1), with the following phrases modified appropriately.
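The counting argument can be checked directly. The sketch below (function name ours) builds, for every setting of the bits b_{4k+4}, the prefix described above and collects the resulting coverage bit-strings:

```python
def prefix_bitstrings(n):
    """Coverage bit-strings of the prefixes in Section 5.2.

    For each of the 2^((n-2)/4) settings of the bits b_{4k+4}, build the
    prefix p_1 ... p_l (l = 1 + (n-2)/4) and record which source words it
    covers; returns the set of distinct bit-strings.
    """
    assert (n - 2) % 4 == 0
    m = (n - 2) // 4
    out = set()
    for mask in range(2 ** m):
        covered = [False] * (n + 1)      # 1-based source positions
        covered[1] = True                # p_1 = (1, 1, <s>)
        for k in range(m):
            if (mask >> k) & 1:          # y_k covers 4k+4 and 4k+5
                covered[4 * k + 4] = covered[4 * k + 5] = True
            else:                        # z_k covers 4k+5 only
                covered[4 * k + 5] = True
        out.add(tuple(covered[1:]))
    return out

assert len(prefix_bitstrings(10)) == 2 ** 2   # 4 distinct bit-strings
```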
6 Conclusion

We have given a polynomial-time dynamic programming algorithm for phrase-based decoding with a fixed distortion limit. The algorithm uses a quite different representation of states from previous decoding algorithms, is easily amenable to beam search, and leads to a new perspective on phrase-based decoding. Future work should investigate the effectiveness of the algorithm in practice.

A Proof of Lemma 4

Without loss of generality assume A = {1, 2, 3, ..., k}. We have g(0) = 1, because the only p-structure for the empty set is the empty set of pairs, and g(1) = 2, because in this case the valid p-structures are {(1, 1)} and ∅. To calculate g(k) we can sum over four possibilities:

Case 1: There are g(k − 1) p-structures with s_i = t_i = 1 for some i ∈ {1 ... r}. This follows because once s_i = t_i = 1 for some i, there are g(k − 1) possible p-structures for the integers {2, 3, 4, ..., k}.

Case 2: There are g(k − 1) p-structures such that s_i ≠ 1 and t_i ≠ 1 for all i ∈ {1 ... r}. This follows because once s_i ≠ 1 and t_i ≠ 1 for all i, there are g(k − 1) possible p-structures for the integers {2, 3, 4, ..., k}.

Case 3: There are (k − 1) × g(k − 2) p-structures such that there is some i ∈ {1 ... r} with s_i = 1 and t_i ≠ 1. This follows because for the i such that s_i = 1, there are (k − 1) choices for the value for t_i, and there are then g(k − 2) possible p-structures for the remaining integers in the set {1 ... k} \ {1, t_i}.

Case 4: There are (k − 1) × g(k − 2) p-structures such that there is some i ∈ {1 ... r} with t_i = 1 and s_i ≠ 1. This follows because for the i such that t_i = 1, there are (k − 1) choices for the value for s_i, and there are then g(k − 2) possible p-structures for the remaining integers in the set {1 ... k} \ {1, s_i}.

Summing over these possibilities gives the following recurrence:

  g(k) = 2g(k − 1) + 2(k − 1) × g(k − 2)

B Proof of Lemma 5

Recall that h(k) = f(k) × g(k), where f(k) = k². Define k_0 to be the smallest integer such that for all k ≥ k_0,

  2f(k)/f(k − 1) + (2f(k)/f(k − 2)) × (k − 1)/(k − 3) ≤ k − 2    (4)

For f(k) = k² we have k_0 = 9. Now choose a constant c such that for all k ∈ {1 ... (k_0 − 1)}, h(k) ≤ c × (k − 2)!. We will prove by induction that under these definitions of k_0 and c we have h(k) ≤ c(k − 2)! for all integers k, hence h(k) is in O((k − 2)!).

For values k ≥ k_0, we have

  h(k) = f(k)g(k)
       = 2f(k)g(k − 1) + 2f(k)(k − 1)g(k − 2)    (5)
       = (2f(k)/f(k − 1)) h(k − 1) + (2f(k)/f(k − 2)) (k − 1) h(k − 2)
       ≤ (2cf(k)/f(k − 1) + (2cf(k)/f(k − 2)) × (k − 1)/(k − 3)) (k − 3)!    (6)
       ≤ c(k − 2)!    (7)

Eq. 5 follows from g(k) = 2g(k − 1) + 2(k − 1)g(k − 2). Eq. 6 follows by the inductive hypothesis that h(k − 1) ≤ c(k − 3)! and h(k − 2) ≤ c(k − 4)!. Eq. 7 follows because Eq. 4 holds for all k ≥ k_0.

References

Wilker Aziz, Marc Dymetman, and Lucia Specia. 2014. Exact decoding for phrase-based statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Yin-Wen Chang and Michael Collins. 2011. Exact decoding of phrase-based translation models through Lagrangian relaxation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 26–37. Association for Computational Linguistics.

Colin Cherry, Robert C. Moore, and Chris Quirk. 2012. On hierarchical re-ordering and permutation parsing for phrase-based decoding. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 200–209. Association for Computational Linguistics.

John DeNero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 25–28. Association for Computational Linguistics.

Yang Feng, Haitao Mi, Yang Liu, and Qun Liu. 2010. An efficient shift-reduce decoding algorithm for phrased-based machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 285–293. Association for Computational Linguistics.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 848–856. Association for Computational Linguistics.

Michel Galley and Christopher D. Manning. 2010. Accurate non-hierarchical phrase-based translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 966–974. Association for Computational Linguistics.

Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4).

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Amittai Axelrod, Chris Callison-Burch, Miles Osborne, and David Talbot. 2007a. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 224–227, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007b. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 161–168. Association for Computational Linguistics.

Eugene Leighton Lawler, Jan Karel Lenstra, Alexander Hendrik George Rinnooy Kan, and David Bernard Shmoys. 1985. The Traveling Salesman Problem. John Wiley & Sons Ltd.

Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 532–540. Association for Computational Linguistics.

Burkhard Monien and Ivan Hal Sudborough. 1981. Bandwidth constrained NP-complete problems.
In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pages 207–217. ACM.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

H. Donald Ratliff and Arnon S. Rosenthal. 1983. Order-picking in a rectangular warehouse: a solvable case of the traveling salesman problem. Operations Research, 31(3):507–521.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, pages 101–104. Association for Computational Linguistics.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Mikhail Zaslavskiy, Marc Dymetman, and Nicola Cancedda. 2009. Phrase-based statistical machine translation as a traveling salesman problem. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 333–341. Association for Computational Linguistics.

Richard Zens and Hermann Ney. 2003. A comparative study on reordering constraints in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1, pages 144–151. Association for Computational Linguistics.

Richard Zens, Hermann Ney, Taro Watanabe, and Eiichiro Sumita. 2004. Reordering constraints for phrase-based statistical machine translation. In Proceedings of the 20th International Conference on Computational Linguistics, page 205. Association for Computational Linguistics.