Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Melvin Johnson*, Mike Schuster*, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean
Google
{melvinp,schuster}@google.com
*Corresponding authors.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 2017. Action Editor: Colin Cherry. Submission batch: 11/2016; Revision batch: 3/2017; Published 10/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no changes to the model architecture from a standard NMT system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. Using a shared wordpiece vocabulary, our approach enables multilingual NMT systems using a single model. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on WMT'14 and WMT'15 benchmarks, respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. Our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and also show some interesting examples when mixing languages.

1 Introduction

End-to-end Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014) is an approach to machine translation that has rapidly gained adoption in many large-scale settings (Zhou et al., 2016; Wu et al., 2016; Crego et al., 2016). Almost all such systems are built for a single language pair — so far there has not been a sufficiently simple and efficient way to handle multiple language pairs using a single model without making significant changes to the basic NMT architecture.

In this paper we introduce a simple method to translate between multiple languages using a single model, taking advantage of multilingual data to improve NMT for all languages involved. Our method requires no change to the traditional NMT model architecture. Instead, we add an artificial token to the input sequence to indicate the required target language, a simple amendment to the data only. All other parts of the system — encoder, decoder, attention, and shared wordpiece vocabulary as described in Wu et al. (2016) — stay exactly the same. This method has several attractive benefits:

• Simplicity: Since no changes are made to the architecture of the model, scaling to more languages is trivial — any new data is simply added, possibly with over- or under-sampling such that all languages are appropriately represented, and used with a new token if the target language changes. Since no changes are made to the training procedure, the mini-batches for training are simply sampled from the overall mixed-language training data, just as in the single-language case.
Since no a-priori decisions about how to allocate parameters for different languages are made, the system adapts automatically to use the total number of parameters efficiently to minimize the global loss. A multilingual model architecture of this type also simplifies production deployment significantly, since it can cut down the total number of models necessary when dealing with multiple languages. Note that at Google, we support a total of over 100 languages as source and target, so theoretically 100² (about 10,000) models would be necessary for the best possible translations between all pairs, if each model could only support a single language pair. Clearly this would be problematic in a production environment. Even when limiting ourselves to translating to/from English only, we still need over 200 models. Finally, batching together many requests from potentially different source and target languages can significantly improve the efficiency of the serving system. In comparison, an alternative system that requires language-dependent encoders, decoders or attention modules does not have any of the above advantages.

• Low-resource language improvements: In a multilingual NMT model, all parameters are implicitly shared by all the language pairs being modeled. This forces the model to generalize across language boundaries during training. We observe that when language pairs with little available data and language pairs with abundant data are mixed into a single model, translation quality on the low-resource language pair is significantly improved.

• Zero-shot translation: A surprising benefit of modeling several language pairs in a single model is that the model can learn to translate between language pairs it has never seen in this combination during training (zero-shot translation) — a working example of transfer learning within neural translation models. For example, a multilingual NMT model trained with Portuguese→English and English→Spanish examples can generate reasonable translations for Portuguese→Spanish although it has not seen any data for that language pair. We show that the quality of zero-shot language pairs can easily be improved with little additional data of the language pair in question (a fact that has been previously confirmed for a related approach, which is discussed in more detail in the next section).

In the remaining sections of this paper we first discuss related work and explain our multilingual system architecture in more detail. Then, we go through the different ways of merging languages on the source and target side in increasing difficulty (many-to-one, one-to-many, many-to-many), and discuss the results of a number of experiments on WMT benchmarks, as well as on some of Google's large-scale production datasets. We present results from transfer learning experiments and show how implicitly-learned bridging (zero-shot translation) performs in comparison to explicit bridging (i.e., first translating to a common language like English and then translating from that common language into the desired target language) as typically used in machine translation systems. We describe visualizations of the new system in action, which provide early evidence of shared semantic representations (interlingua) between languages. Finally, we show some interesting applications of mixing languages with examples (code-switching on the source side and weighted target language mixing) and suggest possible avenues for further exploration.
2 Related Work

Interlingual translation is a classic method in machine translation (Richens, 1958; Hutchins and Somers, 1992). Despite its distinguished history, most practical applications of machine translation have focused on individual language pairs because it was simply too difficult to build a single system that translates reliably from and to several languages.

Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013) was shown to be a promising end-to-end learning approach in Sutskever et al. (2014); Bahdanau et al. (2015); Cho et al. (2014) and was quickly extended to multilingual machine translation in various ways.

An early attempt is the work of Dong et al. (2015), where the authors modify an attention-based encoder-decoder approach to perform multilingual NMT by adding a separate decoder and attention mechanism for each target language. Luong et al. (2015a) describe multilingual training in a multitask learning setting. This model is also an encoder-decoder network, in this case without an attention mechanism. To make proper use of multilingual data, they extend their model with multiple encoders and decoders, one for each supported source and target language. Caglayan et al. (2016) incorporate multiple modalities other than text into the encoder-decoder framework.

Several other approaches have been proposed for multilingual training, especially for low-resource language pairs. For instance, Zoph and Knight (2016) propose a form of multi-source translation where the model has multiple different encoders and different attention mechanisms for each source language. However, this work requires the presence of a multi-way parallel corpus between all the languages involved, which is difficult to obtain in practice. Most closely related to our approach is Firat et al. (2016a), in which the authors propose multi-way multilingual NMT using a single shared attention mechanism but multiple encoders/decoders for each source/target language. Recently, Lee et al. (2016) proposed a CNN-based character-level encoder which is shared across multiple source languages. However, this approach can only perform translations into a single target language.

Our approach is related to the multitask learning framework (Caruana, 1998). Despite its promise, this framework has seen limited practical success in real-world applications. In speech recognition, there have been many successful reports of modeling multiple languages using a single model (see Schultz and Kirchhoff (2006) for an extensive overview and references therein). Multilingual language processing has also been shown to be successful in domains other than translation (Gillick et al., 2016; Tsvetkov et al., 2016).

There have been other approaches similar to ours in spirit, but used for very different purposes. In Sennrich et al. (2016a), the NMT framework was extended to control the politeness level of the target translation by adding a special token to the source sentence. The same idea was used in Yamagishi et al. (2016) to add the distinction between active and passive voice to the generated target sentence.

Our method has an additional benefit not seen in other systems: it gives the system the ability to perform zero-shot translation, meaning the system can translate from a source language to a target language without having seen explicit examples from this specific language pair during training.
Zero-shot translation was the direct goal of Firat et al. (2016c). Although they were not able to achieve this direct goal, they were able to do what they call "zero-resource" translation by using their pre-trained multi-way multilingual model and later fine-tuning it with pseudo-parallel data generated by the model. It should be noted that the difference between "zero-shot" and "zero-resource" translation is the additional fine-tuning step which is required in the latter approach.

To the best of our knowledge, our work is the first to validate the use of true multilingual translation using a single encoder-decoder model, and it is incidentally also already used in a production setting. It is also the first work to demonstrate the possibility of zero-shot translation, a successful example of transfer learning in machine translation, without any additional steps.

3 System Architecture

The multilingual model architecture is identical to Google's Neural Machine Translation (GNMT) system (Wu et al., 2016), with the optional addition of direct connections between encoder and decoder layers, which we have used for some of our experiments. To be able to make use of multilingual data within a single system, we propose one simple modification to the input data, which is to introduce an artificial token at the beginning of the input sentence to indicate the target language the model should translate to. For instance, consider the following En→Es pair of sentences:

    How are you? -> ¿Cómo estás?

It will be modified to:

    <2es> How are you? -> ¿Cómo estás?

to indicate that Spanish is the target language. Note that we don't specify the source language – the model will learn this automatically.

After adding the token to the input data, we train the model with all multilingual data consisting of multiple language pairs at once, possibly after over- or undersampling some of the data to adjust for the relative ratio of the language data available. To address the issue of translating unknown words and to limit the vocabulary for computational efficiency, we use a shared wordpiece model (Schuster and Nakajima, 2012) across all the source and target data used for training, usually with 32,000 word pieces. The segmentation algorithm used here is very similar (with small differences) to Byte-Pair-Encoding (BPE), which was described in Gage (1994) and was also used in Sennrich et al. (2016b) for machine translation. All training is carried out similarly to Wu et al. (2016) and implemented in TensorFlow (Abadi et al., 2016).

In summary, this approach is the simplest among the alternatives that we are aware of. During training and inference, we only need to add one additional token to each sentence of the source data to specify the desired target language.
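To make the data modification concrete, the following is a minimal sketch of how such training pairs could be prepared, assuming a plain list of (source, target, target language) triples. The <2xx> token format follows the example above, but the function names, data layout, and the German example pair are illustrative assumptions, not the actual production pipeline.

    # Minimal sketch of the data preparation described above: the source side of
    # each training example is prefixed with an artificial token naming the
    # desired target language; source and target text are otherwise unchanged.

    def add_target_token(source, target_lang):
        """Prepend a target-language token such as '<2es>' to a source sentence."""
        return "<2{}> {}".format(target_lang, source)

    def prepare_examples(parallel_data):
        """parallel_data: iterable of (source, target, target_lang) triples."""
        for source, target, target_lang in parallel_data:
            yield add_target_token(source, target_lang), target

    if __name__ == "__main__":
        data = [
            ("How are you?", "¿Cómo estás?", "es"),
            ("How are you?", "Wie geht es dir?", "de"),  # hypothetical second pair
        ]
        for src, tgt in prepare_examples(data):
            print(src, "->", tgt)
        # <2es> How are you? -> ¿Cómo estás?
        # <2de> How are you? -> Wie geht es dir?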
4 Experiments and Results

In this section, we apply our proposed method to train multilingual models in several different configurations. Since we can have models with either single or multiple source/target languages, we test three interesting cases for mapping languages: 1) many to one, 2) one to many, and 3) many to many. As already discussed in Section 2, other models have been used to explore some of these cases, but for completeness we apply our technique to these interesting use cases again to give a full picture of the effectiveness of our approach. We will also show results and discuss the benefits of bringing together many (un)related languages in a single large-scale model trained on production data.

Finally, we will present our findings on zero-shot translation, where the model learns to translate between pairs of languages for which no explicit parallel examples existed in the training data, and show results of experiments where adding additional data improves zero-shot translation quality further.

4.1 Datasets, Training Protocols and Evaluation Metrics

For WMT, we train our models on the WMT'14 En→Fr and the WMT'14 En→De datasets. In both cases, we use newstest2014 as the test set to be able to compare against previous work (Luong et al., 2015c; Jean et al., 2015; Zhou et al., 2016; Wu et al., 2016). For WMT Fr→En and De→En we use newstest2014 and newstest2015 as test sets. Despite training on WMT'14 data, which is somewhat smaller than WMT'15, we test our De→En model on newstest2015, similarly to Luong et al. (2015b). The combination of newstest2012 and newstest2013 is used as the development set.

In addition to WMT, we also evaluate the multilingual approach on some Google-internal large-scale production datasets representing a wide spectrum of languages with very distinct linguistic properties: En↔Japanese (Ja), En↔Korean (Ko), En↔Es, and En↔Pt. These datasets are two to three orders of magnitude larger than the WMT datasets.

Our training protocols are mostly identical to those described in Wu et al. (2016). We find that some multilingual models take a little more time to train than single language pair models, likely because each language pair is seen only for a fraction of the training process. We use larger batch sizes with a slightly higher initial learning rate to speed up the convergence of these models.

We evaluate our models using the standard BLEU score metric; to make our results comparable to previous work (Sutskever et al., 2014; Luong et al., 2015c; Zhou et al., 2016; Wu et al., 2016), we report tokenized BLEU scores as computed by the multi-bleu.pl script, which can be downloaded from the public implementation of Moses (http://www.statmt.org/moses/).

To test the influence of varying amounts of training data per language pair, we explore two strategies when building multilingual models: a) oversampling the data from all language pairs so that each is the same size as the largest language pair, and b) mixing the data as is, without any change. The wordpiece model training is done after the optional oversampling, taking into account all the changed data ratios. For the WMT models we report results using both of these strategies. For the production models, we always balance the data such that the ratios are equal.

One benefit of sharing all the components of the model is that the mini-batches can contain data from different language pairs during training and inference, which are typically just random samples from the final training and test data distributions. This is a simple way of preventing "catastrophic forgetting": the tendency for knowledge of previously learned task(s) (e.g., language pair A) to be abruptly forgotten as information relevant to the current task (e.g., language pair B) is incorporated (French, 1999). Other approaches to multilingual translation require complex update scheduling mechanisms to prevent this effect (Firat et al., 2016b).
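The balancing strategy (a) and the mixed-language mini-batches can be illustrated with a short sketch. It is a simplification under stated assumptions: each corpus is a list of token-prefixed (source, target) pairs, and the helper names below are not part of the actual training code.

    import random

    # Sketch of strategy (a): oversample every language pair to roughly the size
    # of the largest pair, then draw mini-batches from the pooled mixed data.

    def oversample(datasets):
        """datasets: dict mapping a language pair name to a list of examples."""
        largest = max(len(examples) for examples in datasets.values())
        pooled = []
        for examples in datasets.values():
            repeats, remainder = divmod(largest, len(examples))
            pooled.extend(examples * repeats)                   # whole copies of the corpus
            pooled.extend(random.sample(examples, remainder))   # top up to `largest`
        random.shuffle(pooled)
        return pooled

    def minibatches(pooled, batch_size):
        """Yield mini-batches that typically mix several language pairs."""
        for start in range(0, len(pooled), batch_size):
            yield pooled[start:start + batch_size]

Strategy (b) corresponds to skipping the oversampling step and shuffling the concatenated corpora directly.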
4.2 Many to One

In this section we explore having multiple source languages and a single target language — the simplest way of combining language pairs. Since there is only a single target language, no additional source token is required. We perform three sets of experiments:

• The first set of experiments is on the WMT datasets, where De→En and Fr→En are combined to train a multilingual model. Our baselines are two single language pair models, De→En and Fr→En, trained independently. We perform these experiments once with oversampling and once without.

• The second set of experiments is on production data where we combine Ja→En and Ko→En, with oversampling. The baselines are two single language pair models trained independently.

• Finally, the third set of experiments is on production data where we combine Es→En and Pt→En, with oversampling. The baselines are again two single language pair models trained independently.

All of the multilingual and single language pair models have the same total number of parameters as the baseline NMT models trained on a single language pair (using 1024 nodes, 8 LSTM layers, and a shared wordpiece model vocabulary of 32k, for a total of 255M parameters per model). A side effect of this equal choice of parameters is that it is presumably unfair to the multilingual models, as the number of parameters available per language pair is reduced by a factor of N compared to the single language pair models, if N is the number of language pairs combined in the multilingual model. The multilingual model also has to handle the combined vocabulary of all the single models. We chose to keep the number of parameters constant for all models to simplify experimentation. We relax this constraint for some of the large-scale experiments shown further below.

The results are presented in Table 1. For all experiments the multilingual models outperform the baseline single systems despite the above-mentioned disadvantage with respect to the number of parameters available per language pair. One possible hypothesis explaining the gains is that the model has been shown more English data on the target side, and that the source languages belong to the same language families, so the model has learned useful generalizations.

Table 1: Many to One: BLEU scores for single language pair and multilingual models. *: no oversampling.

    Model          Single   Multi   Diff
    WMT De→En      30.43    30.59   +0.16
    WMT Fr→En      35.50    35.73   +0.23
    WMT De→En*     30.43    30.54   +0.11
    WMT Fr→En*     35.50    36.77   +1.27
    Prod Ja→En     23.41    23.87   +0.46
    Prod Ko→En     25.42    25.47   +0.05
    Prod Es→En     38.00    38.73   +0.73
    Prod Pt→En     44.40    45.19   +0.79

For the WMT experiments, we obtain a maximum gain of +1.27 BLEU for Fr→En. Note that the results on both WMT test sets are better than other published state-of-the-art results for a single model, to the best of our knowledge.

4.3 One to Many

In this section, we explore the application of our method when there is a single source language and multiple target languages. Here we need to prepend the input with an additional token to specify the target language. We perform three sets of experiments very similar to those in the previous section.

Table 2 summarizes the results when performing translations into multiple target languages. We see that the multilingual models are comparable to, and in some cases outperform, the baselines, but not always. We obtain a large gain of +0.9 BLEU for En→Es. Unlike the previous set of results, the gains in this setting are less significant. This is perhaps because the decoder has a more difficult time translating into multiple target languages, which may even have different scripts that are combined into a single shared wordpiece vocabulary.
Note that even for languages with entirely different scripts (e.g., Korean and Japanese) there is significant overlap in wordpieces when real data is used, as numbers, dates, names, websites, punctuation, etc. often use a shared script (ASCII).

Table 2: One to Many: BLEU scores for single language pair and multilingual models. *: no oversampling.

    Model          Single   Multi   Diff
    WMT En→De      24.67    24.97   +0.30
    WMT En→Fr      38.95    36.84   -2.11
    WMT En→De*     24.67    22.61   -2.06
    WMT En→Fr*     38.95    38.16   -0.79
    Prod En→Ja     23.66    23.73   +0.07
    Prod En→Ko     19.75    19.58   -0.17
    Prod En→Es     34.50    35.40   +0.90
    Prod En→Pt     38.40    38.63   +0.23

We observe that oversampling helps the smaller language pair (En→De) at the cost of lower quality for the larger language pair (En→Fr). The model without oversampling achieves better results on the larger language pair compared to the smaller one, as expected. We also find that this effect is more prominent on the smaller datasets (WMT) and much less so on our much larger production datasets.

4.4 Many to Many

In this section, we report on experiments with multiple source languages and multiple target languages within a single model — the most difficult setup. Since multiple target languages are given, the input needs to be prepended with the target language token as above.

The results are presented in Table 3. We see that the multilingual production models with the same model size and vocabulary size as the single language models are quite close to the baselines – the average relative loss in BLEU score across all experiments is only approximately 2.5%.

Table 3: Many to Many: BLEU scores for single language pair and multilingual models. *: no oversampling.

    Model          Single   Multi   Diff
    WMT En→De      24.67    24.49   -0.18
    WMT En→Fr      38.95    36.23   -2.72
    WMT De→En      30.43    29.84   -0.59
    WMT Fr→En      35.50    34.89   -0.61
    WMT En→De*     24.67    21.92   -2.75
    WMT En→Fr*     38.95    37.45   -1.50
    WMT De→En*     30.43    29.22   -1.21
    WMT Fr→En*     35.50    35.93   +0.43
    Prod En→Ja     23.66    23.12   -0.54
    Prod En→Ko     19.75    19.73   -0.02
    Prod Ja→En     23.41    22.86   -0.55
    Prod Ko→En     25.42    24.76   -0.66
    Prod En→Es     34.50    34.69   +0.19
    Prod En→Pt     38.40    37.25   -1.15
    Prod Es→En     38.00    37.65   -0.35
    Prod Pt→En     44.40    44.02   -0.38

Although there are some significant losses in quality from training many languages jointly using a model with the same total number of parameters as the single language pair models, these models reduce the total complexity involved in training and productionization.

4.5 Large-scale Experiments

This section shows the results of combining 12 production language pairs, having a total of 3B parameters (255M per single model), into a single multilingual model. A range of multilingual models were trained, starting from the same size as a single language pair model with 255M parameters (1024 nodes) up to 650M parameters (1792 nodes). As above, the input needs to be prepended with the target language token. We oversample the examples from the smaller language pairs to balance the data as explained above.

The results for single language pair models versus multilingual models with increasing numbers of parameters are summarized in Table 4. We find that the multilingual models are on average worse than the single models (about 5.6% to 2.5% relative, depending on size; however, some actually get better), and as expected the average difference gets smaller when going to larger multilingual models. It should be noted that the largest multilingual model we have trained still has about five times fewer parameters than the combined single models.
The multilingual model also requires only roughly 1/12th of the training time (or computing resources) to converge, compared to the combined single models (total training time for all our models is still on the order of weeks). Another important point is that since we only train for a little longer than a standard single model, the individual language pairs can see as little as 1/12th of the data compared to their single language pair models but still produce satisfactory results.

Table 4: Large-scale experiments: BLEU scores for single language pair and multilingual models.

    Model       Single   Multi   Multi   Multi   Multi
    #nodes      1024     1024    1280    1536    1792
    #params     3B       255M    367M    499M    650M
    En→Ja       23.66    21.10   21.17   21.72   21.70
    En→Ko       19.75    18.41   18.36   18.30   18.28
    Ja→En       23.41    21.62   22.03   22.51   23.18
    Ko→En       25.42    22.87   23.46   24.00   24.67
    En→Es       34.50    34.25   34.40   34.77   34.70
    En→Pt       38.40    37.35   37.42   37.80   37.92
    Es→En       38.00    36.04   36.50   37.26   37.45
    Pt→En       44.40    42.53   42.82   43.64   43.87
    En→De       26.43    23.15   23.77   23.63   24.01
    En→Fr       35.37    34.00   34.19   34.91   34.81
    De→En       31.77    31.17   31.65   32.24   32.32
    Fr→En       36.47    34.40   34.56   35.35   35.52
    ave diff    -        -1.72   -1.43   -0.95   -0.76
    vs single   -        -5.6%   -4.7%   -3.1%   -2.5%

In summary, multilingual NMT enables us to group languages with little loss in quality while having the benefits of better training efficiency, a smaller number of models, and easier productionization.

4.6 Zero-Shot Translation

The most straightforward approach to translating between languages for which no or little parallel data is available is to use explicit bridging, meaning to translate to an intermediate language first and then to translate to the desired target language. The intermediate language is often English, as xx→En and En→yy data is more readily available. The two potential disadvantages of this approach are: a) total translation time doubles, and b) there is a potential loss of quality from translating to/from the intermediate language.

An interesting benefit of our approach is that it allows us to perform implicit bridging (zero-shot translation) directly between a language pair for which no explicit parallel training data has been seen, without any modification to the model. Obviously, the model will only be able to do zero-shot translation between languages it has seen individually as source and target languages during training at some point, not for entirely new ones.
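The difference between the two bridging strategies can be summarized in a small sketch. The decode interface below is an assumption made for illustration; the only point is that zero-shot translation is a single decoding pass with the desired target-language token, whereas explicit bridging requires two passes through the intermediate language.

    # Illustrative contrast between explicit and implicit (zero-shot) bridging.
    # `model.decode` stands in for a trained multilingual NMT system that accepts
    # a source sentence prefixed with a target-language token.

    def explicit_bridge(model, source, bridge="en", target="es"):
        """Two decoding passes: source -> bridge language -> target language."""
        intermediate = model.decode("<2{}> {}".format(bridge, source))
        return model.decode("<2{}> {}".format(target, intermediate))

    def zero_shot(model, source, target="es"):
        """One decoding pass: request the target language directly, even if this
        source/target pair was never seen together during training."""
        return model.decode("<2{}> {}".format(target, source))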
To demonstrate this we will use two multilingual models — a model trained with examples from two different language pairs, Pt→En and En→Es (Model 1), and a model trained with examples from four different language pairs, En↔Pt and En↔Es (Model 2). As with the previous multilingual models, both of these models perform comparably to, or even slightly better than, the baseline single models for the language pairs seen explicitly during training. Additionally, we show that both of these models can generate reasonable-quality Pt→Es translations (BLEU scores above 20) without ever having seen Pt→Es data during training. To our knowledge this is the first successful demonstration of true multilingual zero-shot translation.

Table 5 summarizes our results for the Pt→Es translation experiments. Rows (a) and (b) show the performance of the phrase-based machine translation (PBMT) system and the NMT system through explicit bridging (Pt→En, then En→Es). It can be seen that the NMT system outperforms the PBMT system by close to 2 BLEU points. For comparison, we also built a single NMT model on all available Pt→Es parallel sentences (see (c) in Table 5).

Table 5: Portuguese→Spanish BLEU scores using various models.

    Model                                 Zero-shot   BLEU
    (a) PBMT bridged                      no          28.99
    (b) NMT bridged                       no          30.91
    (c) NMT Pt→Es                         no          31.50
    (d) Model 1 (Pt→En, En→Es)            yes         21.62
    (e) Model 2 (En↔{Es, Pt})             yes         24.75
    (f) Model 2 + incremental training    no          31.77

The most interesting observation is that both Model 1 and Model 2 can perform zero-shot translation with reasonable quality (see (d) and (e)), contrary to the initial expectation that this would not work at all. Note that Model 2 outperforms Model 1 by close to 3 BLEU points although Model 2 was trained with four language pairs as opposed to only two for Model 1 (with both models having the same total number of parameters). In this case the addition of Spanish on the source side and Portuguese on the target side helps Pt→Es zero-shot translation (which is the opposite direction of where we would expect it to help). We believe that this unexpected effect is only possible because our shared architecture enables the model to learn a form of interlingua between all these languages. We explore this hypothesis in more detail in Section 5.

Finally, we incrementally train zero-shot Model 2 with a small amount of true Pt→Es parallel data (an order of magnitude less than that used for Table 5 (c)) and obtain the best quality and half the decoding time compared to explicit bridging (Table 5 (b)). The resulting model cannot be called zero-shot anymore since some true parallel data has been used to improve it. Overall this shows that the proposed approach of implicit bridging using zero-shot translation via multilingual models can serve as a good baseline for further incremental training with relatively small amounts of true parallel data in the zero-shot direction. This result is especially significant for non-English low-resource language pairs, where it might be easier to obtain parallel data with English but much harder to obtain parallel data for language pairs where neither the source nor the target language is English. We explore the effect of using parallel data in more detail in Section 4.7.

Since Portuguese and Spanish are of the same language family, an interesting question is how well zero-shot translation works for less related languages. Table 6 shows the results for explicit and implicit bridging from Spanish to Japanese using the large-scale model from Table 4 – Spanish and Japanese can be regarded as quite unrelated. As expected, zero-shot translation works worse than explicit bridging, and the quality drops relatively more (roughly a 50% drop in BLEU score) than for the case of more related languages shown above. Despite the quality drop, this demonstrates that our approach enables zero-shot translation even between unrelated languages.

Table 6: Spanish→Japanese BLEU scores for explicit and implicit bridging using the 12-language-pair large-scale model from Table 4.

    Model                             BLEU
    NMT Es→Ja explicitly bridged      18.00
    NMT Es→Ja implicitly bridged       9.14

4.7 Effect of Direct Parallel Data

In this section, we explore two ways of leveraging available parallel data to improve zero-shot translation quality, similar in spirit to what was reported in Firat et al. (2016c). For our multilingual architecture we consider the following two strategies, sketched below:

• Incrementally training the multilingual model on the additional parallel data for the zero-shot directions.

• Training a new multilingual model with all available parallel data mixed equally.
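The sketch below outlines the two strategies at a high level; the training helpers (train_from_scratch, restore_checkpoint, continue_training) are hypothetical stand-ins, and the corpora are assumed to be lists of token-prefixed (source, target) pairs as in Section 3.

    import random

    def mix_equally(base_data, direct_data):
        # Equal mixing in the spirit of the oversampling described in Section 4.1:
        # repeat the smaller corpus so both contribute comparable amounts of data.
        small, large = sorted([base_data, direct_data], key=len)
        repeated = small * max(1, len(large) // max(1, len(small)))
        mixed = large + repeated
        random.shuffle(mixed)
        return mixed

    def strategy_from_scratch(base_data, direct_data, train_from_scratch):
        """Train a brand-new multilingual model on all available data mixed equally."""
        return train_from_scratch(mix_equally(base_data, direct_data))

    def strategy_incremental(best_checkpoint, base_data, direct_data,
                             restore_checkpoint, continue_training):
        """Restore the best existing checkpoint and briefly continue training on a
        mix that includes the direct parallel data for the zero-shot directions."""
        model = restore_checkpoint(best_checkpoint)
        return continue_training(model, mix_equally(base_data, direct_data))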
For our experiments, we use a baseline model, which we call "Zero-Shot", trained on a combined parallel corpus of English↔{Belarusian (Be), Russian (Ru), Ukrainian (Uk)}. We trained a second model on the above corpus together with additional Ru↔{Be, Uk} data; we call this model "From-Scratch". Both models support four target languages and are evaluated on our standard test sets. As done previously, we oversample the data such that all language pairs are represented equally. Finally, we take the best checkpoint of the "Zero-Shot" model and run incremental training on a small portion of the data used to train the "From-Scratch" model for a short period of time until convergence (in this case 3% of the "Zero-Shot" model's total training time). We call this model "Incremental".

Table 7: BLEU scores for English↔{Belarusian, Russian, Ukrainian} models.

    Direction   Zero-Shot   From-Scratch   Incremental
    En→Be       16.85       17.03          16.99
    En→Ru       22.21       22.03          21.92
    En→Uk       18.16       17.75          18.27
    Be→En       25.44       24.72          25.54
    Ru→En       28.36       27.90          28.46
    Uk→En       28.60       28.51          28.58
    Be→Ru       56.53       82.50          78.63
    Ru→Be       58.75       72.06          70.01
    Ru→Uk       21.92       25.75          25.34
    Uk→Ru       16.73       30.53          29.92

As can be seen from Table 7, all three models show comparable scores for the English↔X directions. On the Ru↔{Be, Uk} directions, the "Zero-Shot" model already achieves relatively high BLEU scores for all directions except one, without any explicit parallel data. This could be because these languages are linguistically related. In the "From-Scratch" column, we see that training a new model from scratch improves the zero-shot translation directions further. However, this strategy has a slightly negative effect on the En↔X directions, because our oversampling strategy reduces the frequency of the data from these directions. In the final column, we see that incremental training with direct parallel data recovers most of the BLEU score difference between the first two columns on the zero-shot language pairs. In summary, our shared architecture models the zero-shot language pairs quite well and hence enables us to easily improve their quality with a small amount of additional parallel data.

5 Visual Analysis

The results of this paper — that training a model across multiple languages can enhance performance at the individual language level, and that zero-shot translation can be effective — raise a number of questions about how these tasks are handled inside the model. Is the network learning some sort of shared representation, in which sentences with the same meaning are represented in similar ways regardless of language? Does the model operate on zero-shot translations in the same way as it treats language pairs it has been trained on?

One way to study the representations used by the network is to look at the activations of the network during translation. A starting point for investigation is the set of context vectors, i.e., the sum of internal encoder states weighted by their attention probabilities per step (Eq. (5) in Bahdanau et al. (2015)).
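For reference, the context vector can be written in the attention notation of Bahdanau et al. (2015); the symbols below (context vector c_t, attention weight α_ts, encoder state h_s, score e_ts, source length S) follow that paper rather than anything defined here:

    c_t = \sum_{s=1}^{S} \alpha_{ts} \, h_s ,
    \qquad
    \alpha_{ts} = \frac{\exp(e_{ts})}{\sum_{s'=1}^{S} \exp(e_{ts'})} ,

where e_ts scores how well the inputs around source position s match the decoder state at output step t.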
A translation of a single sentence generates a sequence of context vectors. In this context, our original questions about shared representation can be studied by looking at how the vector sequences of different sentences relate. We could then ask, for example: Do sentences cluster together depending on the source or target language? Or do sentences with similar meanings cluster instead, regardless of language? We try to find answers to these questions by looking at lower-dimensional representations of internal embeddings of the network that humans can more easily interpret.

5.1 Evidence for an Interlingua

Several trained networks indeed show strong visual evidence of a shared representation. For example, Figure 1 below was produced from a many-to-many model trained on four language pairs, English↔Japanese and English↔Korean. To visualize the model in action, we began with a small corpus of 74 triples of semantically identical cross-language phrases. That is, each triple contained phrases in English, Japanese and Korean with the same underlying meaning. To compile these triples, we searched a ground-truth database for English sentences which were paired with both Japanese and Korean translations.

We then applied the trained model to translate each sentence of each triple into the two other possible languages. This process yielded six new sentences based on each triple, for a total of 74 × 6 = 444 translations with 9,978 steps corresponding to the same number of context vectors. Since context vectors are high-dimensional, we use the TensorFlow Embedding Projector (https://www.tensorflow.org/get_started/embedding_viz) to map them into more accessible 3D space via t-SNE (Maaten and Hinton, 2008).
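The projection step can be reproduced in outline with off-the-shelf tools. The sketch below uses scikit-learn's t-SNE rather than the Embedding Projector itself, and it substitutes randomly generated stand-in data for the real exported context vectors; the array shapes and the 1024-dimensional vector size are assumptions made only for illustration.

    import numpy as np
    from sklearn.manifold import TSNE

    # Stand-in data: in the real setting these would be the per-step context
    # vectors exported from the trained model (one row per decoding step),
    # together with the id of the sentence each step belongs to.
    rng = np.random.default_rng(0)
    context_vectors = rng.normal(size=(444 * 20, 1024)).astype(np.float32)
    sentence_ids = np.repeat(np.arange(444), 20)

    # Project the high-dimensional context vectors into 3D, mirroring the t-SNE
    # view described above.
    projection = TSNE(n_components=3, init="pca", random_state=0).fit_transform(context_vectors)

    # Group projected points by sentence; each group is one "strand" that can be
    # drawn as a connected sequence of decoding steps.
    strands = {sid: projection[sentence_ids == sid] for sid in np.unique(sentence_ids)}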
In the following diagrams, each point represents a single decoding step during the translation process. Points that represent steps for a given sentence are connected by line segments.

Figure 1: A t-SNE projection of the embedding of 74 semantically identical sentences translated across all 6 possible directions, yielding a total of 9,978 steps (dots in the image), from the model trained on English↔Japanese and English↔Korean examples. (a) A bird's-eye view of the embedding, colored by the index of the semantic sentence. Well-defined clusters, each having a single color, are apparent. (b) A zoomed-in view of one of the clusters with the same coloring. All of the sentences within this cluster are translations of "The stratosphere extends from about 10km to about 50km in altitude." (c) The same cluster colored by source language. All three source languages can be seen within this cluster.

Figure 1 shows a global view of all 9,978 context vectors. Points produced from the same original sentence triple are all given the same (random) color. Inspection of these clusters shows that each strand represents a single sentence, and clusters of strands generally represent a set of translations of the same underlying sentence, but with different source and target languages. At right are two close-ups: one of an individual cluster, still colored based on membership in the same triple, and one where we have colored by source language.

5.2 Partially Separated Representations

Not all models show such clean semantic clustering. Sometimes we observed joint embeddings in some regions of space coexisting with separate large clusters which contained many context vectors from just one language pair.

Figure 2: (a) A bird's-eye view of a t-SNE projection of an embedding of the model trained on Portuguese→English (blue) and English→Spanish (yellow) examples with a Portuguese→Spanish zero-shot bridge (red). The large red region on the left primarily contains the zero-shot Portuguese→Spanish translations. (b) A scatter plot of BLEU scores of zero-shot translations versus the average point-wise distance between the zero-shot translation and a non-bridged translation. The Pearson correlation coefficient is −0.42.

For example, Figure 2a shows a t-SNE projection of context vectors from a model that was trained on Portuguese→English (blue) and English→Spanish (yellow) and that performs zero-shot translation from Portuguese→Spanish (red). This projection shows 153 semantically identical triples translated as described above, yielding 459 total translations. The large red region on the left primarily contains zero-shot Portuguese→Spanish translations. In other words, for a significant number of sentences, the zero-shot translation has a different embedding than the two trained translation directions. On the other hand, some zero-shot translation vectors do seem to fall near the embeddings found in other languages, as in the large region on the right.

It is natural to ask whether the large cluster of "separated" zero-shot translations has any significance. A definitive answer requires further investigation, but in this case zero-shot translations in the separated area do tend to have lower BLEU scores. Figure 2b shows a plot of the BLEU score of a zero-shot translation versus the average pointwise distance between it and the same translation from a trained language pair. An interesting area for future research is to find a more reliable correspondence between embedding geometry and model performance, so as to predict the quality of a zero-shot translation during decoding by comparing it to the embedding of the translation through a trained language pair.

6 Mixing Languages

Having a mechanism to translate from a random source language to a single chosen target language using an additional source token made us think about what happens when languages are mixed on the source or target side. In particular, we were interested in the following two experiments: 1) Can a multilingual model successfully handle multi-language input (code-switching) in the middle of a sentence? 2) What happens when a multilingual model is triggered with a linear mix of two target language tokens?

6.1 Source Language Code-Switching

Here we show how multilingual models deal with source language code-switching; an example from a multilingual {Ja,Ko}→En model is given below. Mixing Japanese and Korean in the source produces correct English translations in many cases, showing that this model can handle code-switching, although no such code-switching samples were present in the training data. Note that the model can effectively handle the different typographic scripts since the individual characters/wordpieces are present in the shared vocabulary.

• Japanese: 私は東京大学の学生です。→ I am a student at Tokyo University.
• Korean: 나는도쿄대학의학생입니다. → I am a student at Tokyo University.
• Japanese/Korean: 私は東京大学학생입니다. → I am a student of Tokyo University.

Interestingly, the mixed-language translation is slightly different from both single-source-language translations.

6.2 Weighted Target Language Selection

Here we test what happens when we mix target languages. Using a multilingual En→{Ja, Ko} model, we feed a linear combination (1 − w) <2ja> + w <2ko> of the embedding vectors for "<2ja>" and "<2ko>". Clearly, for w = 0 the model should produce Japanese and for w = 1 it should produce Korean, but what happens in between?
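A minimal sketch of this interpolation is shown below, assuming access to the token-embedding lookup and to a decode routine that accepts already-embedded inputs; those hooks (model.embed, model.decode_from_embeddings) are illustrative names, not actual interfaces of the system.

    # Weighted target-language selection: feed a linear interpolation of the
    # '<2ja>' and '<2ko>' token embeddings in place of a single target token,
    # leaving the rest of the (wordpiece-segmented) source sentence unchanged.

    def mixed_target_token(embed, w):
        """Return (1 - w) * emb('<2ja>') + w * emb('<2ko>')."""
        return (1.0 - w) * embed("<2ja>") + w * embed("<2ko>")

    def translate_with_mix(model, source_wordpieces, w):
        token_vec = mixed_target_token(model.embed, w)
        source_vecs = [model.embed(piece) for piece in source_wordpieces]
        return model.decode_from_embeddings([token_vec] + source_vecs)

    # Sweeping w from 0.0 to 1.0 reproduces the experiment described above:
    # w = 0.0 should yield Japanese output and w = 1.0 Korean output.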
The model may produce some sort of intermediate language ("Japarean"), but the results turn out to be less surprising. Most of the time the output just switches from one language to the other around w = 0.5. In some cases, for intermediate values of w, the model switches languages mid-sentence. A possible explanation for this behavior is that the target language model, implicitly learned by the decoder LSTM, may make it very hard to mix words from different languages, especially when they use different scripts.

Table 8 shows an example of mixed target languages (Ja/Ko), where we can observe an interesting transition in script and grammar. At w_ko = 0.58, the model translates the source sentence into a mix of Japanese and Korean. At w_ko = 0.60, the sentence is translated into full Korean, where all of the source words are captured, but the ordering of the words does not sound natural. When w_ko is increased to 0.7, the model starts to translate the source sentence into a Korean sentence that sounds more natural. (The Korean translations in Table 8 do not contain spaces and use '。' as the punctuation symbol; these are artifacts of applying a Japanese postprocessor.)

Table 8: Gradually mixing target languages Ja/Ko. Source sentence: "I must be getting somewhere near the centre of the earth."

    w_ko   Output
    0.00   私は地球の中心の近くにどこかに行っているに違いない。
    0.40   私は地球の中心近くのどこかに着いているに違いない。
    0.56   私は地球の中心の近くのどこかになっているに違いない。
    0.58   私は지구の中心의가까이에어딘가에도착하고있어야한다。
    0.60   나는지구의센터의가까이에어딘가에도착하고있어야한다。
    0.70   나는지구의중심근처어딘가에도착해야합니다。
    0.90   나는어딘가지구의중심근처에도착해야합니다。
    1.00   나는어딘가지구의중심근처에도착해야합니다。

7 Conclusion

We present a simple solution to multilingual NMT. We show that we can train multilingual NMT models that translate between a number of different languages using a single model in which all parameters are shared, which as a positive side-effect also improves the translation quality of low-resource languages in the mix. We also show that zero-shot translation without explicit bridging is possible, which is the first time to our knowledge that a form of true transfer learning has been shown to work for machine translation. To explicitly improve the zero-shot translation quality, we explore two ways of adding available parallel data and find that small additional amounts are sufficient to reach satisfactory results. In our largest experiment we merge 12 language pairs into a single model and achieve only slightly lower translation quality than for the single language pair baselines, despite the drastically reduced amount of modeling capacity per language in the multilingual model. Visual interpretation of the results shows that these models learn a form of interlingua representation between all involved language pairs. The simple architecture makes it possible to mix languages on the source or target side to yield some interesting translation examples. Our approach has been shown to work reliably in a Google-scale production setting and enables us to scale to a large number of languages quickly.

Acknowledgements

We would like to thank the entire Google Brain Team and Google Translate Team for their foundational contributions to this project. In particular, we thank Junyoung Chung for his insights on the topic and Alex Rudnick and Otavio Good for helpful suggestions. We would also like to thank the TACL Action Editor and the reviewers for their feedback.

References

Martin Abadi, Paul Barham, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Ozan Caglayan, Walid Aransa, et al. 2016. Does multimodality help human and machine for translation and image captioning? In Proceedings of the First Conference on Machine Translation, pages 627–633, Berlin, Germany, August. Association for Computational Linguistics.

Rich Caruana. 1998. Multitask learning. In Learning to Learn, pages 95–133. Springer.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.

Josep Crego, Jungi Kim, et al. 2016. Systran's pure neural machine translation systems. arXiv preprint arXiv:1610.05540.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1723–1732.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 866–875.

Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, and Yoshua Bengio. 2016b. Multi-way, multilingual neural machine translation. Computer Speech and Language.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016c. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas, November. Association for Computational Linguistics.

Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38, February.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296–1306, San Diego, California, June. Association for Computational Linguistics.

William John Hutchins and Harold L. Somers. 1992. An Introduction to Machine Translation, volume 362. Academic Press, London.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Conference on Empirical Methods in Natural Language Processing.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multi-task sequence to sequence learning. In International Conference on Learning Representations.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015c. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9.

Richard H. Richens. 1958. Interlingual machine translation. The Computer Journal, 1(3):144–147.

Tanja Schultz and Katrin Kirchhoff. 2006. Multilingual Speech Processing. Elsevier Academic Press, Amsterdam, Boston, Paris.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 35–40.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1357–1366, San Diego, California, June. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144v2.

Hayahide Yamagishi, Shin Kanouchi, and Mamoru Komachi. 2016. Controlling the voice of a sentence in Japanese-to-English neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation, pages 203–210, Osaka, Japan, December.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 30–34.