Context Gates for Neural Machine Translation

Zhaopeng Tu†  Yang Liu‡  Zhengdong Lu†  Xiaohua Liu†  Hang Li†
†Noah's Ark Lab, Huawei Technologies, Hong Kong
{tu.zhaopeng,lu.zhengdong,liuxiaohua3,hangli.hl}@huawei.com
‡Department of Computer Science and Technology, Tsinghua University, Beijing
liuyang2011@tsinghua.edu.cn

Abstract

In neural machine translation (NMT), generation of a target word depends on both source and target contexts. We find that source contexts have a direct impact on the adequacy of a translation, while target contexts affect its fluency. Intuitively, generation of a content word should rely more on the source context, and generation of a function word should rely more on the target context. Due to the lack of effective control over the influence from source and target contexts, conventional NMT tends to yield fluent but inadequate translations. To address this problem, we propose context gates, which dynamically control the ratios at which source and target contexts contribute to the generation of target words. In this way, we can enhance both the adequacy and fluency of NMT with more careful control of the information flow from contexts. Experiments show that our approach significantly improves upon a standard attention-based NMT system by +2.3 BLEU points.

1 Introduction

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has made significant progress in the past several years. Its goal is to construct and utilize a single large neural network to accomplish the entire translation task. One great advantage of NMT is that the translation system can be completely constructed by learning from data without human involvement (cf., feature engineering in statistical machine translation (SMT)). The encoder-decoder architecture is widely employed (Cho et al., 2014; Sutskever et al., 2014), in which the encoder summarizes the source sentence into a vector representation, and the decoder generates the target sentence word-by-word from that representation. The representation of the source sentence and the representation of the partially generated target sentence (translation) at each position are referred to as the source context and the target context, respectively. The generation of a target word is determined jointly by the source context and the target context.

input   jīnnián qián liǎng yuè guǎngdōng gāoxīn jìshù chǎnpǐn chūkǒu 37.6 yì měiyuán
NMT     in the first two months of this year , the export of new high level technology product was UNK - billion us dollars
▽src    china 's guangdong hi - tech exports hit 58 billion dollars
▽tgt    china 's export of high and new hi - tech exports of the export of the export of the export of the export of the export of the export of the export of the export of …

Table 1: Source and target contexts are highly correlated with translation adequacy and fluency, respectively. ▽src and ▽tgt denote halving the contributions from the source and target contexts when generating the translation, respectively.

Several techniques in NMT have proven to be very effective, including gating (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention (Bahdanau et al., 2015), which can model long-distance dependencies and complicated alignment relations in the translation process.
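As a concrete reference point for the attention mechanism mentioned above, below is a minimal numpy sketch of Bahdanau-style additive attention, which computes the source context as a weighted sum of the encoder annotations. All function names, parameter names, and shapes here are illustrative assumptions for exposition, not the systems' actual code.

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Additive (Bahdanau-style) attention, as an illustrative sketch.

    s_prev : previous decoder state, shape (d,)
    H      : encoder annotations, shape (J, 2d) for J source words
    W_a    : shape (k, d);  U_a : shape (k, 2d);  v_a : shape (k,)
    Returns the source context, a weighted sum of the rows of H.
    """
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # alignment scores, shape (J,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                            # softmax over source positions
    return alpha @ H                                # context vector, shape (2d,)

# Toy shapes only; real systems learn these parameters.
J, d, k = 6, 4, 5
rng = np.random.default_rng(1)
c = attention_context(rng.normal(size=d), rng.normal(size=(J, 2 * d)),
                      rng.normal(size=(k, d)), rng.normal(size=(k, 2 * d)),
                      rng.normal(size=k))
print(c.shape)  # (8,)
```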
Using an encoder-decoder framework that incorporates gating and attention techniques, it has been reported that the performance of NMT can surpass the performance of traditional SMT as measured by BLEU score (Luong et al., 2015).

Despite this success, we observe that NMT usually yields fluent but inadequate translations.¹ We attribute this to a stronger influence of the target context on generation, which results from a stronger language model than that used in SMT. One question naturally arises: what will happen if we change the ratio of influences from the source and target contexts?

¹Fluency measures whether the translation is fluent, while adequacy measures whether the translation is faithful to the original sentence (Snover et al., 2009).

Table 1 shows an example in which an attention-based NMT system (Bahdanau et al., 2015) generates a fluent yet inadequate translation (e.g., missing the translation of "guǎngdōng"). When we halve the contribution from the source context, the result further loses adequacy by also missing the partial translation "in the first two months of this year". One possible explanation is that the target context takes a higher weight and thus the system favors a shorter translation. In contrast, when we halve the contribution from the target context, the result completely loses fluency by repeatedly generating the translation of "chūkǒu" (i.e., "the export of") until the generated translation reaches the maximum length. This example therefore indicates that the source and target contexts in NMT are highly correlated with translation adequacy and fluency, respectively.

In fact, conventional NMT lacks effective control over the influence of the source and target contexts. At each decoding step, NMT treats the two contexts equally, and thus ignores their different roles. For example, content words in the target sentence are more related to translation adequacy, and thus should depend more on the source context. In contrast, function words in the target sentence are often more related to translation fluency (e.g., "of" after "is fond"), and thus should depend more on the target context.

In this work, we propose to use context gates to control the contributions of the source and target contexts to the generation of target words (decoding) in NMT. Context gates are non-linear gating units which dynamically select the amount of context information used in the decoding process. Specifically, at each decoding step, the context gate examines both the source and target contexts, and outputs a ratio between zero and one that determines the percentages of information to utilize from the two contexts. In this way, the system can balance the adequacy and fluency of the translation with regard to the generation of a word at each position.
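To make the mechanism concrete, here is a minimal numpy sketch of one plausible form of such a gate, assuming (as one natural reading of the description above) that it is a sigmoid function of the previous decoder state, the previously generated word, and the attention-derived source context. The parameter names and exact inputs are illustrative, not the paper's precise specification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(y_prev, s_prev, c, W_z, U_z, C_z):
    """A context gate: a sigmoid over the signals available at one step.

    y_prev : embedding of the previously generated target word, shape (m,)
    s_prev : previous decoder state (target context), shape (d,)
    c      : attention-derived source context, shape (2d,)
    Returns z in (0, 1)^d, interpreted as the proportion of source
    information to let through; (1 - z) weights the target side.
    """
    return sigmoid(W_z @ y_prev + U_z @ s_prev + C_z @ c)

# The decoder state update f (a GRU, or even a vanilla RNN cell) then
# consumes the rescaled streams, schematically:
#   s_i = f(s_prev, y_prev, z * source_input, (1 - z) * target_input)
# so z -> 1 emphasizes the source context (adequacy) and z -> 0
# emphasizes the target context (fluency).

d, m = 4, 3
rng = np.random.default_rng(0)
z = context_gate(rng.normal(size=m), rng.normal(size=d), rng.normal(size=2 * d),
                 rng.normal(size=(d, m)), rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d)))
print(z)  # elementwise ratios strictly between 0 and 1
```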
Experimental results show that introducing context gates leads to an average improvement of +2.3 BLEU points over a standard attention-based NMT system (Bahdanau et al., 2015). An interesting finding is that we can replace the GRU units in the decoder with conventional RNN units and in the meantime utilize context gates. The translation performance is comparable with that of the standard NMT system with GRU, but the system enjoys a simpler structure (i.e., it uses only a single gate and half of the parameters) and faster decoding (i.e., it requires only half the matrix computations for decoding).²

2 Neural Machine Translation

[Figure 1: Architecture of decoder RNN.]

Suppose that $\mathbf{x} = x_1, \ldots, x_j, \ldots, x_J$ represents a source sentence and $\mathbf{y} = y_1, \ldots, y_i, \ldots, y_I$ a target sentence. NMT directly models the probability of translating the source sentence into the target sentence word by word:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{I} P(y_i \mid \mathbf{y}_{<i}, \mathbf{x})$$
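To ground the factorization, the following toy numpy sketch scores a target sentence by accumulating per-word conditional probabilities in log space. The `step_distribution` stub is a purely hypothetical stand-in for the real decoder, introduced only for illustration.

```python
import numpy as np

def sentence_log_prob(y_ids, x, step_distribution):
    """Score a translation under P(y|x) = prod_i P(y_i | y_<i, x),
    accumulated in log space for numerical stability.

    y_ids             : target word ids y_1 .. y_I
    x                 : any representation of the source sentence
    step_distribution : callable (prefix, x) -> probability vector over
                        the target vocabulary, standing in for the
                        decoder's softmax output at one step.
    """
    log_p = 0.0
    for i, y_i in enumerate(y_ids):
        p = step_distribution(y_ids[:i], x)  # P(. | y_<i, x)
        log_p += np.log(p[y_i])
    return log_p

# Toy usage with a uniform "decoder" over a 5-word vocabulary:
uniform = lambda prefix, x: np.full(5, 0.2)
print(sentence_log_prob([1, 3, 0], x=None, step_distribution=uniform))
# prints 3 * log(0.2), roughly -4.828
```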