Pushing the Limits of Translation Quality Estimation André F. T. Martins Unbabel Instituto de Telecomunicações Lisbon, Portugal andre.martins@unbabel.com Marcin Junczys-Dowmunt Adam Mickiewicz University in Poznań Poznań, Poland junczys@amu.edu.pl Fabio N. Kepler Unbabel L2F/INESC-ID, Lisbon, Portugal University of Pampa, Alegrete, Brazil kepler@unbabel.com Ramón Astudillo Unbabel L2F/INESC-ID Lisbon, Portugal ramon@unbabel.com Chris Hokamp Dublin City University Dublin, Ireland chokamp@computing.dcu.ie Roman Grundkiewicz Adam Mickiewicz University in Poznań Poznań, Poland romang@amu.edu.pl Abstract Translation quality estimation is a task of growing importance in NLP, due to its poten- tial to reduce post-editing human effort in dis- ruptive ways. However, this potential is cur- rently limited by the relatively low accuracy of existing systems. In this paper, we achieve remarkable improvements by exploiting syn- ergies between the related tasks of word-level quality estimation and automatic post-editing. First, we stack a new, carefully engineered, neural model into a rich feature-based word- level quality estimation system. Then, we use the output of an automatic post-editing sys- tem as an extra feature, obtaining striking re- sults on WMT16: a word-level F MULT1 score of 57.47% (an absolute gain of +7.95% over the current state of the art), and a Pearson correla- tion score of 65.56% for sentence-level HTER prediction (an absolute gain of +13.36%). 1 Introduction The goal of quality estimation (QE) is to evaluate a translation system’s quality without access to ref- erence translations (Blatz et al., 2004; Specia et al., 2013). This has many potential usages: informing an end user about the reliability of translated con- tent; deciding if a translation is ready for publish- ing or if it requires human post-editing; highlighting the words that need to be changed. QE systems are particularly appealing for crowd-sourced and pro- fessional translation services, due to their potential to dramatically reduce post-editing times and to save labor costs (Specia, 2011). The increasing interest in this problem from an industrial angle comes as no surprise (Turchi et al., 2014; de Souza et al., 2015; Martins et al., 2016; Kozlova et al., 2016). In this paper, we tackle word-level QE, whose goal is to assign a label of OK or BAD to each word in the translation (Figure 1). Past approaches to this problem include linear classifiers with handcrafted features (Ueffing and Ney, 2007; Biçici, 2013; Shah et al., 2013; Luong et al., 2014), often combined with feature selection (Avramidis, 2012; Beck et al., 2013), recurrent neural networks (de Souza et al., 2014; Kim and Lee, 2016), and systems that com- bine linear and neural models (Kreutzer et al., 2015; Martins et al., 2016). We start by proposing a “pure” QE system (§3) consisting of a new, carefully en- gineered neural model (NEURALQE), stacked into a linear feature-rich classifier (LINEARQE). Along the way, we provide a rigorous empirical analysis to better understand the contribution of the several groups of features and to justify the architecture of the neural system. A second contribution of this paper is bring- ing in the related task of automatic post-editing (APE; Simard et al. (2007)), which aims to au- 205 Transactions of the Association for Computational Linguistics, vol. 5, pp. 205–218, 2017. Action Editor: Stefan Riezler. Submission batch: 12/2016; Revision batch: 2/2017; Published 7/2017. 
c©2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. Source The Sharpen tool sharpens areas in an image . MT Der Schärfen-Werkezug Bereiche in einem Bild schärfer erscheint . PE (reference) Mit dem Scharfzeichner können Sie einzelne Bereiche in einem Bild scharfzeichnen . QE BAD BAD OK OK OK OK BAD BAD OK ‖ HTER = 66.7% Figure 1: Example from the WMT16 word-level QE training set. Shown are the English source sentence, the German translation (MT), its manual post-edition (PE), and the conversion to word quality labels made with the TERCOM tool (QE). Words labeled as OK are shown in green, and those labeled as BAD are shown in red. We also show the HTER (fraction of edit operations to produce PE from MT) computed by TERCOM. tomatically correct the output of machine transla- tion (MT). We show that a variant of the APE sys- tem of Junczys-Dowmunt and Grundkiewicz (2016), trained on a large amount of artificial “roundtrip translations,” is extremely effective when adapted to predict word-level quality labels (yielding APEQE, §4). We further show that the pure and the APE- based QE system are highly complementary (§5): a stacked combination of LINEARQE, NEURALQE, and APEQE boosts the scores even further, leading to a new state of the art on the WMT15 and WMT16 datasets. For the latter, we achieve an F MULT1 score of 57.47%, which represents an absolute improvement of +7.95% over the previous best system. Finally, we provide a simple word-to-sentence conversion to adapt our system to sentence-level QE. This results in a new state of the art for human- targeted translation error rate (HTER) prediction, where we obtain a Pearson’s r correlation score of 65.56% (+13.36% absolute gain), and for sentence ranking, which achieves a Spearman’s ρ correlation score of 65.92% (+17.62%). We complement our findings with error analysis that highlights the syn- ergies between pure and APE-based QE systems. 2 Datasets and System Architecture Datasets. For developing and evaluating our sys- tems, we use the datasets listed in Table 1. These datasets have been used in the QE and APE tasks in WMT 2015–2016 (Bojar et al., 2015, 2016).1 They span two language pairs (English-Spanish and English-German) and two different domains (news translations and information technology). We used the standard train, development and test splits. Each split contains the source and automatically trans- lated sentences (which we use as inputs), the manu- 1Publicly available at http://www.statmt.org/ wmt15 and http://www.statmt.org/wmt16. ally post-edited sentences (output for the APE task), and a sequence of OK/BAD quality labels, one per each translated word (output for the word-level QE task); see Figure 1. Besides these datasets, for training the APE system we make use of artificial roundtrip translations; this will be detailed in §4. Evaluation. For all experiments, we report the of- ficial evaluation metrics of each dataset’s year. For WMT15, the official metric for the word-level QE task is the F1 score of the BAD labels (F BAD1 ). For WMT16, it is the product of the F1 scores for the OK and BAD labels (denoted F MULT1 ). For sentence- level QE, we report the Pearson’s r correlation for HTER prediction and the Spearman’s ρ correlation score for sentence ranking (Graham, 2015). From post-edited sentences to quality labels. 
In the datasets above, the word quality labels are obtained automatically by aligning the translated and the post-edited sentence with the TERCOM software tool (Snover et al., 2006)2, with the default settings (tokenized, case insensitive, exact matching only, shifts disabled). This tool computes the HTER (the normalized edit distance) between the translated and post-edited sentence. As a by-product, it aligns the words in the two sentences, identifying substitution errors, word deletions (i.e. words omitted by the translation system), and insertions (redundant words in the translation). Words in the MT output that need to be edited are marked by the BAD quality labels.

2http://www.cs.umd.edu/˜snover/tercom.

Dataset          Language pair   # sents   # words
WMT15, Train     En-Es           11,271    257,548
WMT15, Dev       En-Es            1,000     23,207
WMT15, Test      En-Es            1,817     40,899
WMT16, Train     En-De           12,000    210,958
WMT16, Dev       En-De            1,000     19,487
WMT16, Test      En-De            2,000     34,531

Table 1: Datasets used in this work.

The fact that the quality labels are automatically obtained from the post-edited sentences is not just an artifact of these datasets, but a procedure that is highly convenient for developing QE systems in an industrial setting. Manually annotating word-level quality labels is time-consuming and expensive; on the other hand, post-editing translated sentences is commonly part of the workflow of crowd-sourced and professional translation services. Thus, getting quality labels for free from sentences that have already been post-edited is a much more realistic and sustainable process.

This observation suggests that we can tackle word-level QE in two ways:

1. Pure QE: run the TER alignment tool (i.e. TERCOM) on the post-edited data, and then train a QE system directly on the generated quality labels;

2. APE-based QE: train an APE system on the original post-edited data, and at runtime use the TER alignment tool to convert the automatically post-edited sentences to quality labels.

From a machine learning perspective, QE is a sequence labeling problem (i.e., the output sequence has a fixed length and a small number of labels), while APE is a sequence-to-sequence problem (where the output is of variable length and spans a large vocabulary). Therefore, we can regard APE-based QE as a “projection” of a more complex and fine-grained output (APE) into a simpler output space (QE). APE-based QE systems have the potential to be more powerful, since they are trained with this finer-grained information (provided there is enough training data to make them generalize well). We report results in §4 confirming this hypothesis.

Our system architecture, described in full detail in the following sections, consists of state-of-the-art pure QE and APE-based QE systems, which are then combined to yield a new, more powerful QE system.
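Before turning to the individual systems, the following minimal sketch makes the label conversion described above concrete: it derives OK/BAD labels and an HTER value from a tokenized translation and its post-edition. It uses a plain word-level edit-distance alignment rather than the TERCOM tool (a reasonable approximation under the “shifts disabled” setting above); the function name and interface are ours, for illustration only.

    def qe_labels_and_hter(mt, pe):
        """mt, pe: lists of tokens. Returns (OK/BAD labels for mt, HTER estimate)."""
        n, m = len(mt), len(pe)
        # Word-level Levenshtein table (case-insensitive exact matching, no shifts).
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i
        for j in range(1, m + 1):
            cost[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = mt[i - 1].lower() == pe[j - 1].lower()
                cost[i][j] = min(cost[i - 1][j - 1] + (not same),
                                 cost[i - 1][j] + 1, cost[i][j - 1] + 1)
        # Backtrace: MT words kept verbatim are OK, edited or redundant ones are BAD;
        # words missing from the MT output produce no label.
        labels, edits, i, j = ["OK"] * n, 0, n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and mt[i - 1].lower() == pe[j - 1].lower() \
                    and cost[i][j] == cost[i - 1][j - 1]:
                i, j = i - 1, j - 1                    # match
            elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
                labels[i - 1] = "BAD"; edits += 1      # substitution
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                labels[i - 1] = "BAD"; edits += 1      # redundant word in the MT output
                i -= 1
            else:
                edits += 1                             # word omitted by the MT system
                j -= 1
        return labels, edits / max(m, 1)               # HTER ≈ edits / |PE|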
3 Pure Quality Estimation

The best performing system in the WMT16 word-level QE task was developed by the Unbabel team (Martins et al., 2016). It is a pure but rather complex QE system, ensembling a linear feature-based classifier with three different neural networks with different configurations. In this section, we provide a simpler version of their system, by replacing the three ensembled neural components by a single one, which we engineer in a principled way. We evaluate the resulting system on additional data (WMT15 in addition to WMT16), covering a new language pair and a new content type. Overall, we obtain a slightly higher accuracy with a much simpler system.

Below, we describe the linear (§3.1) and neural (§3.2) components of our system, as well as their combination (§3.3).

3.1 Linear Sequential Model

We start with the linear component of our model, a discriminative feature-based sequential model (called LINEARQE), based on Martins et al. (2016). The system receives as input a tuple 〈s, t, A〉, where s = s_1 . . . s_M is the source sentence, t = t_1 . . . t_N is the translated sentence, and A ⊆ {(m, n) | 1 ≤ m ≤ M, 1 ≤ n ≤ N} is a set of word alignments. It predicts as output a sequence ŷ = y_1 . . . y_N, with each y_i ∈ {BAD, OK}. This is done as follows:

    ŷ = argmax_y  Σ_{i=1}^{N} w⊤ φ_u(s, t, A, y_i)  +  Σ_{i=1}^{N+1} w⊤ φ_b(s, t, A, y_i, y_{i−1}).   (1)

Above, w is a vector of weights, φ_u(s, t, A, y_i) are unigram features (depending only on a single output label), φ_b(s, t, A, y_i, y_{i−1}) are bigram features (depending on consecutive output labels), and y_0 and y_{N+1} are special start/stop symbols.

Features. Table 2 shows the unigram and bigram features used in the LINEARQE system. Like the baseline systems provided in WMT15/16, we include features that depend on the target word and its aligned source word, as well as the context surrounding them.3 A distinctive aspect of our system is the inclusion of syntactic features, which we will show to be useful for detecting grammatically incorrect constructions.4 We use features that involve the dependency relation, the head word, and second-order sibling and grandparent structures. Features involving part-of-speech (POS) tags and syntactic information are obtained with TurboTagger and TurboParser (Martins et al., 2013).5

3Features involving the aligned source word are replaced by NIL if the target word is unaligned. If there are multiple aligned source words, they are concatenated into a single feature.
4While syntactic features have been used previously in sentence-level QE (Rubino et al., 2012), they have never been applied to the finer-grained word-level variant tackled here.
5http://www.cs.cmu.edu/˜ark/TurboParser.

Features        Label                  Input (referenced by the i-th target word)
unigram         y_i ∧ . . .            ∗BIAS
                                       ∗WORD, LEFTWORD, RIGHTWORD
                                       ∗SOURCEWORD, SOURCELEFTWORD, SOURCERIGHTWORD
                                       ∗LARGESTNGRAMLEFT/RIGHT, SOURCELARGESTNGRAMLEFT/RIGHT
                                       ∗POSTAG, SOURCEPOSTAG
                                       †WORD+LEFTWORD, WORD+RIGHTWORD
                                       †WORD+SOURCEWORD, POSTAG+SOURCEPOSTAG
simple bigram   y_i ∧ y_{i−1} ∧ . . .  ∗BIAS
rich bigrams    y_i ∧ y_{i−1} ∧ . . .  all above
                y_{i+1} ∧ y_i ∧ . . .  WORD+SOURCEWORD, POSTAG+SOURCEPOSTAG
syntactic       y_i ∧ . . .            DEPREL, WORD+DEPREL
                                       HEADWORD/POSTAG+WORD/POSTAG
                                       LEFTSIBWORD/POSTAG+WORD/POSTAG
                                       RIGHTSIBWORD/POSTAG+WORD/POSTAG
                                       GRANDWORD/POSTAG+HEADWORD/POSTAG+WORD/POSTAG

Table 2: Features used in the LINEARQE system (see Martins et al., 2016 for a detailed description). Features marked with ∗ are included in the WMT16 baseline system. Those marked with † were proposed by Kreutzer et al. (2015).

Training. The feature weights are learned by running 50 epochs of the max-loss MIRA algorithm (Crammer et al., 2006), with regularization constant C ∈ {10^{−k} : k = 1, . . . , 4} and a Hamming cost function placing a higher penalty on false positives than on false negatives (c_FP ∈ {0.5, 0.55, . . . , 0.95}, c_FN = 1 − c_FP), to account for the existence of fewer BAD labels than OK labels in the data. These values are tuned on the development set.
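With only two labels, the argmax in Eq. (1) can be computed exactly by a standard Viterbi recursion over the sentence. The sketch below assumes the unigram and bigram feature scores (the dot products w⊤φ) have been precomputed; the data layout and function name are illustrative rather than the released implementation.

    OK, BAD = 0, 1

    def viterbi_decode(u, b_start, b, b_stop):
        """u[i][y]: unigram score of label y at position i (N x 2);
        b[i][yp][y]: bigram score between positions i and i+1 ((N-1) x 2 x 2);
        b_start[y], b_stop[y]: scores of the boundary bigrams with START/STOP."""
        N = len(u)
        delta = [[0.0, 0.0] for _ in range(N)]   # best score of a prefix ending in y
        back = [[0, 0] for _ in range(N)]        # backpointers
        for y in (OK, BAD):
            delta[0][y] = b_start[y] + u[0][y]
        for i in range(1, N):
            for y in (OK, BAD):
                scores = [delta[i - 1][yp] + b[i - 1][yp][y] for yp in (OK, BAD)]
                back[i][y] = max((OK, BAD), key=lambda yp: scores[yp])
                delta[i][y] = max(scores) + u[i][y]
        # Close with the stop transition and trace back the best label sequence.
        y = max((OK, BAD), key=lambda lab: delta[N - 1][lab] + b_stop[lab])
        labels = [y]
        for i in range(N - 1, 0, -1):
            y = back[i][y]
            labels.append(y)
        labels.reverse()
        return ["OK" if lab == OK else "BAD" for lab in labels]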
Results and feature contribution. Table 3 shows the performance of the LINEARQE system. To help understand the contribution of each group of features, we evaluated different variants of the LINEARQE system on the development sets of WMT15/16. As expected, the use of bigrams improves the simple unigram model, and the syntactic features help even further. The impact of these features is more prominent in WMT16: the rich bigram features lead to scores about 3 points above a sequential model with a single indicator bigram feature, and the syntactic features contribute another 2.5 points. The net improvement exceeds 6 points over the unigram model.

Features            WMT15 (F1^BAD)   WMT16 (F1^MULT)
unigrams only       41.77            40.05
+simple bigram      42.20            40.63
+rich bigrams       42.80            43.65
+syntactic (full)   43.68            46.11

Table 3: Performance on the WMT15 (En-Es) and WMT16 (En-De) development sets of several configurations of LINEARQE. We report the official metric for these shared tasks, F1^BAD for WMT15 and F1^MULT for WMT16.

3.2 Neural System

Next, we describe the neural component of our pure QE system, which we call NEURALQE. In WMT15 and WMT16, the neural QUETCH system (Kreutzer et al., 2015) and its ensemble with other neural models (Martins et al., 2016) were components of the winning systems. However, none of these neural models managed to outperform a linear model when considered in isolation—for example, QUETCH obtained an F1^BAD of 35.27% on the WMT15 test set, far below the 40.84% score of the linear system built by the same team. By contrast, our carefully engineered NEURALQE model attains a performance superior to that of the linear system, as we shall see.

Figure 2: Architecture of our NEURALQE system. [Figure not reproduced: embedding layers (3 × 64-dimensional word and 3 × 50-dimensional POS embeddings for both source and target), two feed-forward layers of size 400, a BiGRU of size 200, two feed-forward layers of size 200, a BiGRU of size 100, feed-forward layers of sizes 100 and 50, and a softmax over OK/BAD.]

Architecture. The architecture of NEURALQE is depicted in Figure 2. We used Keras (Chollet, 2015) to implement our model. The system receives as input the source and target sentences s and t, their word-level alignments A, and their corresponding POS tags obtained from TurboTagger. The input layer follows a similar architecture as QUETCH, with the addition of POS features. A vector representing each target word is obtained by concatenating the embedding of that word with those of the aligned word in the source.6 The immediate left and right contexts for source and target words are also concatenated. We use the pre-trained 64-dimensional Polyglot word embeddings (Al-Rfou et al., 2013) for English, German, and Spanish, and refine them during training. In addition to this, POS tags for each source and target word are also embedded and concatenated. POS embeddings have size 50 and are initialized as described by Glorot and Bengio (2010). A dropout probability of 0.5 is applied to the resulting vector representations.

6For the cases in which there are multiple source words aligned to the same target word, the embeddings are averaged.

The following layers are then applied in sequence:

1. Two feed-forward layers of size 400 with rectified linear units (ReLU; Nair and Hinton (2010));

2. A layer with bidirectional gated recurrent units (BiGRU; Cho et al. (2014)) of size 200, where forward and backward vectors are concatenated, trained with layer normalization (Ba et al., 2016);

3. Two feed-forward ReLU layers of size 200;

4. A BiGRU layer of size 100 with identical configuration to the previous BiGRU;

5. Two more feed-forward ReLU layers of sizes 100 and 50, respectively.

As the output layer, a softmax transformation over the OK/BAD labels is applied.
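The following is a rough sketch of this stack written against the present-day Keras API. Layer sizes follow the description above, but the sketch omits details such as the layer-normalized GRU variant, the Polyglot initialization, masking, and the weighted loss, and all names are illustrative, so it should be read as an approximation rather than the original implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_neural_qe(tgt_vocab, src_vocab, n_pos, word_dim=64, pos_dim=50):
        # Per target position: the word, its aligned source word, and the immediate
        # left/right contexts on both sides (3 tokens per stream), plus POS tags.
        tgt_words = layers.Input(shape=(None, 3), dtype="int32")
        src_words = layers.Input(shape=(None, 3), dtype="int32")
        tgt_pos = layers.Input(shape=(None, 3), dtype="int32")
        src_pos = layers.Input(shape=(None, 3), dtype="int32")

        def embed(x, vocab, dim):
            e = layers.Embedding(vocab, dim)(x)              # (batch, len, 3, dim)
            return layers.TimeDistributed(layers.Flatten())(e)

        h = layers.Concatenate()([embed(tgt_words, tgt_vocab, word_dim),
                                  embed(src_words, src_vocab, word_dim),
                                  embed(tgt_pos, n_pos, pos_dim),
                                  embed(src_pos, n_pos, pos_dim)])
        h = layers.Dropout(0.5)(h)
        h = layers.Dense(400, activation="relu")(h)
        h = layers.Dense(400, activation="relu")(h)
        h = layers.Bidirectional(layers.GRU(200, return_sequences=True))(h)
        h = layers.Dense(200, activation="relu")(h)
        h = layers.Dense(200, activation="relu")(h)
        h = layers.Bidirectional(layers.GRU(100, return_sequences=True))(h)
        h = layers.Dense(100, activation="relu")(h)
        h = layers.Dense(50, activation="relu")(h)
        out = layers.Dense(2, activation="softmax")(h)       # OK/BAD per target word
        return tf.keras.Model([tgt_words, src_words, tgt_pos, src_pos], out)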
The choice for this architecture was dictated by experiments on the WMT16 development data, as we explain next.

Training. We train the model with the RMSProp algorithm (Tieleman and Hinton, 2012) by minimizing the cross-entropy with a linear penalty for BAD word predictions, as in Kreutzer et al. (2015). We set the BAD weight factor to 3.0. All hyperparameters are adjusted based on the development set. Target sentences are bucketed by length and then processed in batches (without any padding or truncation).

Results and architectural choices. The final results are shown in Table 4. Overall, the final NEURALQE model achieves an F1^MULT score of 46.80% on the WMT16 development set, compared with the 46.11% obtained with the LINEARQE system (cf. Table 3). This contrasts with previous neural systems, such as QUETCH (Kreutzer et al., 2015) and any of the three neural systems developed by Martins et al. (2016), which could not outperform a rich feature-based linear classifier.

To justify the most relevant choices regarding the architecture of NEURALQE, we also evaluated several variations of it on the WMT16 development set. The use of recurrent layers yields the largest contribution to the performance of NEURALQE, as the scores drop sharply (by more than 4 points) if they are replaced by feed-forward layers (which would correspond to a mere deeper QUETCH model). The first BiGRU is particularly effective, as scores drop by more than 2 points if it is removed. The use of layer normalization on the recurrent layers also contributes positively (+1.20) to the final score. As expected, the use of POS tags adds another large improvement: everything else staying the same, the model without POS tags as input performs almost 2.5 points worse. Finally, varying the size of the hidden layers and the depth of the network hurts the final model's performance, albeit less severely.

Model                        F1^MULT
NEURALQE                     46.80
No POS tags                  44.41 (−2.39)
Replace BiGRU by FF          42.36 (−4.44)
Only the first BiGRU         45.76 (−1.04)
Only the second BiGRU        44.37 (−2.43)
Remove FF between BiGRUs     46.35 (−0.45)
Narrower layers              45.09 (−1.71)
Broader layers               45.02 (−1.78)
One more layer at the end    46.31 (−0.49)
No layer normalization       45.60 (−1.20)

Table 4: Effect of architectural changes in NEURALQE on the WMT16 development set.

3.3 Stacking Neural and Linear Models

We now stack the NEURALQE system (§3.2) into the LINEARQE system (§3.1) as an ensemble strategy; we call the resulting system STACKEDQE. Stacking architectures (Wolpert, 1992; Breiman, 1996) have proved effective in structured NLP problems (Cohen and de Carvalho, 2005; Martins et al., 2008). The underlying idea is to combine two systems by letting the prediction of the first system be used as an input feature for the second system. During training, it is necessary to jackknife the first system's predictions to avoid overfitting the training set. This is done by splitting the training set into K folds (we set K = 10) and training K different instances of the first system, where each instance is trained on K − 1 folds and makes predictions for the left-out fold. The concatenation of all the predictions yields an unbiased training set for the second classifier.
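A minimal sketch of this jackknifing step is given below; the training interface and names are illustrative (any first-stage system exposing a per-word p(BAD) prediction would fit).

    def jackknife_predictions(examples, train, K=10):
        """Return one out-of-fold prediction per example, usable as unbiased
        stacking features for the second-stage (linear) classifier."""
        folds = [examples[i::K] for i in range(K)]           # K disjoint folds
        out_of_fold = {}
        for k in range(K):
            train_part = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
            model = train(train_part)                        # never sees fold k
            for ex in folds[k]:
                out_of_fold[id(ex)] = model.predict(ex)      # p(BAD) for each word
        return [out_of_fold[id(ex)] for ex in examples]      # original order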
Neural intra-ensembles. We also evaluate the performance of intra-ensembled neural systems. We train independent instances of NEURALQE with different random initializations and different data shuffles, following the approach of Jean et al. (2015) in neural MT. In Tables 5–6, we report the performance on the WMT15 and WMT16 datasets of systems ensembling 5 and 15 of these instances, called respectively NEURALQE-5 and NEURALQE-15. The instances are ensembled by taking the averaged probability of each word being BAD. We see consistent benefits (both for WMT15 and WMT16) in ensembling 5 neural systems and (somewhat surprisingly) some degradation with ensembles of 15.

Model                     F1^BAD dev   F1^BAD test
Best system in WMT15      43.1         43.12
QUETCH+ (2nd best)        –            43.05
LINEARQE                  43.68        42.50
NEURALQE                  43.51        43.35
NEURALQE-5                44.21        43.54
NEURALQE-15               44.11        43.93
STACKEDQE                 44.68        43.70

Table 5: Performance of the pure QE systems on the WMT15 datasets. The best performing system in the WMT15 competition was by Esplà-Gomis et al. (2015), followed by Kreutzer et al. (2015)'s QUETCH+, which is also an ensemble of a linear and a neural system.

Model                        F1^MULT dev   F1^MULT test
Best system in WMT16         49.25         49.52
Unbabel-Linear (2nd best)    45.94         46.29
LINEARQE                     46.11         46.16
NEURALQE                     46.80         47.29
NEURALQE-5                   47.30         48.50
NEURALQE-15                  46.77         47.98
STACKEDQE                    49.16         50.27

Table 6: Performance of the pure QE systems on the WMT16 datasets. The best performing system in the WMT16 competition was by Martins et al. (2016), followed by a linear system developed by the same team (Unbabel-Linear).

Stacking architecture. The individual instances of the neural systems are incorporated into the stacking architecture as different features, yielding STACKEDQE. In total, we have 15 predictions (probability values given by each NEURALQE system) for every word in the training, development and test datasets. These predictions are plugged in as additional features in the LINEARQE model. As unigram features, we used one real-valued feature for every model prediction at each position, conjoined with the label. As bigram features, we used two real-valued features for every model prediction at the two positions, conjoined with the label pair.

The results obtained with this stacked architecture on the WMT15 and WMT16 datasets are shown respectively in Tables 5 and 6. In WMT15, it is unclear whether stacking helps over the best intra-ensembled neural system, with a slight improvement on the development set but a degradation on the test set. In WMT16, however, stacking is clearly beneficial, with a boost of about 2 points over the best intra-ensembled neural system and 3–4 points above the linear system, both in the development and test partitions. For the remainder of this paper, we will take STACKEDQE as our pure QE system.

4 APE-Based Quality Estimation

Now that we have described a pure QE system, we move on to an APE-based QE system (APEQE). Our starting point is the system submitted by the Adam Mickiewicz University (AMU) team to the APE task of WMT16 (Junczys-Dowmunt and Grundkiewicz, 2016). They explored the application of neural translation models to the APE problem and achieved good results by treating different models as components in a log-linear model, allowing for multiple inputs (the source s and the translated sentence t) that were decoded to the same target language (post-edited translation p). Two systems were considered, one using s as the input (s → p) and another using t as the input (t → p).
A simple string-matching penalty integrated within the log-linear model was used to control for higher faithfulness with regard to the raw MT output. The penalty fires if the APE system proposes a word in its output that has not been seen in t.

To overcome the problem of too little training data, Junczys-Dowmunt and Grundkiewicz (2016) generated large amounts of artificial data via roundtrip translations: a large corpus of monolingual sentences is first gathered for the target language in the domain of interest (each sentence is regarded as an artificial post-edited sentence p); then an MT system is run to translate these sentences into the source language (the results are regarded as the source sentences s), and another MT system in the reverse direction translates the latter back to the target language (playing the role of the translations t). The artificial data is filtered to match the HTER statistics of the training and development data for the shared task.7 Their submission improved over the uncorrected baseline on the unseen WMT16 test set by −3.2% TER and +5.5% BLEU and outperformed any other system submitted to the shared task by a large margin.

7The artificial filtered data has been made available by the authors at https://github.com/emjotde/amunmt/wiki/AmuNMT-for-Automatic-Post-Editing.

4.1 Training the APE System

We reproduce the experiments from Junczys-Dowmunt and Grundkiewicz (2016) using Nematus (Sennrich et al., 2016) for training and AmuNMT (Junczys-Dowmunt et al., 2016) for decoding. As stated in §3.3, jackknifing is required to avoid overfitting during the training procedure of the stacked classifiers (§5), therefore we start by preparing four jackknifed models. We perform the following steps:

• We divide the original WMT16 training set into four equally sized parts, maintaining correspondences between different languages. Four new training sets are created by leaving out one part and concatenating the remaining three parts.

• For each of the four new training sets, we train one APE model on a concatenation of a smaller set of artificial data (denoted as “round-trip.n1” in Junczys-Dowmunt and Grundkiewicz (2016), consisting of 531,839 sentence triples) and a 20-fold oversampled new training set. Each of the four newly created APE models has thus not seen a different quarter of the original training data.

• To avoid overfitting, we use scaling dropout8 over GRU steps and input embeddings, with dropout probabilities of 0.2, and over source and target words with probabilities of 0.1 (Sennrich et al., 2016).

• We use Adam (Kingma and Ba, 2014) instead of Adadelta (Zeiler, 2012).

• We train both models (s → p and t → p) until convergence, up to 20 epochs, saving model checkpoints every 10,000 mini-batches.

• The last four model checkpoints of each training run are averaged element-wise (Junczys-Dowmunt et al., 2016), resulting in new single models with generally improved performance (sketched below).

8Currently available in the MRT branch of Nematus at https://github.com/rsennrich/nematus.

System                    WMT15   WMT16
Best system               23.23   21.52
Uncorrected baseline      22.91   24.76
APE t → p                 23.91   22.60
APE s → p                 40.44   28.39
APE TER-tuned             23.29   20.99

Table 7: TER scores on the official WMT15 and WMT16 test sets for the APE task. Lower is better.
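The element-wise checkpoint averaging in the last training step above can be sketched as follows, assuming each checkpoint has been loaded as a dictionary of NumPy parameter arrays (the helper name is ours, not part of Nematus or AmuNMT).

    import numpy as np

    def average_checkpoints(checkpoints):
        """checkpoints: list of {parameter_name: np.ndarray}, all with identical shapes."""
        averaged = {}
        for name in checkpoints[0]:
            averaged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        return averaged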
To verify the quality of the APE system, we ensemble the 8 resulting models (4 times s → p and 4 times t → p) and add the APE penalty described in Junczys-Dowmunt and Grundkiewicz (2016). This large ensemble across folds is only used during test time. For creating the jackknifed training data, only the models from the corresponding fold are used. Since we combine models of different types, we tune weights on the development set with MERT9 (Och, 2003) towards TER, yielding the model denoted as “APE TER-tuned”. Results are listed in Table 7 for the APE shared task (WMT16). For the purely s → p and t → p ensembles, models are weighted equally. We achieve slightly better results in terms of TER, the main task metric, than the original system, using less data.

9We found MERT to work better when tuning towards TER than kb-MIRA, which was used in the original paper.

For completeness, we also apply this procedure to WMT15 data, generating a similar resource of 500K artificial English-Spanish-Spanish post-editing triplets via roundtrip translation.10 The training, jackknifing and ensembling methods are the same as for the WMT16 setting. For the WMT15 APE shared task, results are less persuasive than for WMT16: none of the shared task participants was able to beat the uncorrected baseline, and our system fails at this as well. However, we produced the second strongest system for case-sensitive TER (Table 7, WMT15) and the strongest for case-insensitive TER (22.49 vs. 22.54).

10Our artificially created data might suffer from a higher mismatch between training and development data. While we were able to match the TER statistics of the dev set, BLEU scores are several points lower. The artificial WMT16 data we created in Junczys-Dowmunt and Grundkiewicz (2016) matches both TER and BLEU scores of the respective development set.

                   F1^BAD dev   F1^BAD test
APE t → p          13.46        12.83
APE s → p          41.56        41.57
APE TER-tuned       5.96         4.72
APEQE              46.44        46.05

Table 8: Performance of APE-based QE systems on the WMT15 development and test sets.

                   F1^MULT dev   F1^MULT test
APE t → p          27.46         31.39
APE s → p          51.92         53.70
APE TER-tuned      40.17         41.87
APEQE              54.95         55.68

Table 9: Performance of APE-based QE systems on the WMT16 development and test sets.

4.2 Adaptation to QE and Task-Specific Tuning

As described in §2, APE outputs can be turned into word quality labels using TER-based word alignments. Somewhat surprisingly, among the APE systems introduced above, we observe in Table 9 that the s → p APE system is the so-far strongest stand-alone QE system for the WMT16 task in this work. This system is essentially a retrained neural MT component without any additional features.11 The t → p system and the TER-tuned APE ensemble are much weaker in terms of F1^MULT. This is less surprising in the case of the full ensemble, as it has been tuned towards TER for the APE task specifically. However, we can obtain even better APE-based QE systems for both shared task settings by tuning the full APE ensembles towards F1^MULT, the official WMT16 QE metric, and towards F1^BAD for WMT15.12 With this approach, we produce our new best stand-alone QE systems for both shared tasks, which we denote as APEQE.

11Note that this system resembles other QE approaches which use pseudo-reference features (Albrecht and Hwa, 2008; Soricut and Narsale, 2012; Shah et al., 2013), since the s → p system is essentially an “alternative” MT system.

12Using again MERT and executing 7 iterations on the official development set with an n-best list size of 12.
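For reference, the word-level metric targeted by this tuning (and reported throughout the paper) multiplies the F1 scores of the two classes; a small illustrative implementation, not the official WMT scorer, is shown below.

    def f1_mult(gold, pred):
        """gold, pred: flat lists of 'OK'/'BAD' labels, one entry per word."""
        def f1(cls):
            tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
            fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
            fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
            return 2 * tp / max(2 * tp + fp + fn, 1)
        return f1("OK") * f1("BAD")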
212 F BAD1 dev F BAD 1 test Best system in WMT15 43.1 43.12 LINEARQE 43.68 42.50 NEURALQE 43.51 43.35 STACKEDQE 44.68 43.70 APEQE 46.44 46.05 FULLSTACKEDQE 47.61 47.08 Table 10: Performance of the several word-level QE sys- tems on the WMT15 development and test datasets. The baseline is the best participating system in WMT15, from Esplà-Gomis et al. (2015). 5 Full Stacked System Finally, we consider a larger stacked system where we stack both NEURALQE and APEQE into LIN- EARQE. This will mix pure QE with APE-based QE systems; we call the result FULLSTACKEDQE. The procedure is analogous to that described in §3.3, with one extra binary feature for the APE-based word quality label predictions. For training, we used jackknifing as described in §3.3. 5.1 Word-Level QE The performance of the FULLSTACKEDQE system on the WMT15 and WMT16 datasets are shown in Tables 10–11. We compare with the other systems introduced in this paper, and with the best partici- pating systems at WMT15–16 (Esplà-Gomis et al., 2015; Martins et al., 2016). We can see that the APE-based and the pure QE systems are complementary: the full combination of the linear, neural, and APE-based systems improves the scores with respect to the best individual sys- tem (APEQE) by about 1 point in WMT15 and 2 points in WMT16. Overall, we obtain for WMT16 an F MULT1 score of 57.47%, a new state of the art, and an absolute gain of +7.95% over Martins et al. (2016). This is a remarkable improvement that can pave the way for a wider adoption of word-level QE systems in industrial settings. For WMT15, we also obtain a new state of the art, with a less impres- sive gain of +3.96% over the best previous system. In §6 we analyze the errors made by the pure and the APE-based QE systems to better understand how they complement each other. F MULT1 dev F MULT 1 test Best system in WMT16 49.25 49.52 LINEARQE 46.11 46.16 NEURALQE 46.80 47.29 STACKEDQE 49.16 50.27 APEQE 54.95 55.68 FULLSTACKEDQE 56.80 57.47 Table 11: Performance of the several word-level QE sys- tems on the WMT16 development and test datasets. The baseline is the best participating system in WMT16, from the Unbabel team (Martins et al., 2016). 5.2 Sentence-Level QE Encouraged by the strong results obtained with the FULLSTACKEDQE system in word-level QE, we in- vestigate how we can adapt this system for HTER prediction at sentence level. Prior work (de Souza et al., 2014) incorporated word-level quality pre- dictions as features in a sentence-level QE system, training a feature-based linear classifier. Here, we show that a very simple conversion, which requires no training or tuning, is enough to obtain a substan- tial improvement over the state of the art. For the APE system, it is easy to obtain a predic- tion for HTER: we can simply measure the HTER between the translated sentence t and the predicted corrected sentence p̂. For a pure QE system, we ap- ply the following word-to-sentence conversion tech- nique: (i) run a QE system to obtain a sequence of OK and BAD word quality labels; (ii) use the frac- tion of BAD labels as an estimate for HTER. Note that this procedure, while not requiring any training, is far from perfect. Words that are not in the trans- lated sentence but exist in the reference post-edited sentence do not originate BAD labels, and therefore will not contribute to the HTER estimate. Yet, as we will see, this procedure applied to the STACKEDQE system (i.e. without the APEQE component) is al- ready sufficient to obtain state of the art results. 
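The pure-QE side of this word-to-sentence conversion is a one-liner; the sketch below (names ours, for illustration only) states it explicitly.

    def sentence_hter_from_labels(word_labels):
        """word_labels: list of 'OK'/'BAD' predictions for one translated sentence."""
        if not word_labels:
            return 0.0
        return sum(lab == "BAD" for lab in word_labels) / len(word_labels)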
Fi- nally, to combine the APE and pure QE systems to- ward sentence-level QE, we simply take the average of the two HTER predictions above. Table 12 shows the results obtained with our pure QE system (STACKEDQE), with our APE- based system (APEQE), and with the combination of the two (FULLSTACKEDQE). As baselines, we 213 Pearson dev Pearson test Spearman dev Spearman test WMT15 Best system in WMT15 (ranking) – 39.41 – 36.49 Best system in WMT15 (HTER) – 38.46 – 36.81 STACKEDQE 32.29 36.96 33.22 34.44 APEQE 29.33 40.39 30.80 38.74 FULLSTACKEDQE 36.07 44.99 36.68 42.30 WMT16 Best system in WMT16 (ranking) – 52.5 – – Best system in WMT16 (HTER) – 46.0 – 48.3 STACKEDQE 55.30 54.93 56.46 55.34 APEQE 59.04 61.27 61.06 62.48 FULLSTACKEDQE 64.04 65.56 65.52 65.92 Table 12: Performance of our sentence-level QE systems on the WMT15 an WMT16 datasets, as measured by the WMT16 official evaluation script. The baselines are the best WMT15–16 systems in the HTER prediction track (Bicici et al., 2015; Kozlova et al., 2016) and in the sentence ranking track (Langlois, 2015; Kim and Lee, 2016). report the performance of the two best systems in the sentence-level QE tasks at WMT15 and WMT16 (Bicici et al., 2015; Langlois, 2015; Kozlova et al., 2016; Kim and Lee, 2016). The results are striking: for WMT16, even our weakest system (STACKEDQE) with the simple con- version procedure above is already sufficient to ob- tain state of the art results, outperforming Kozlova et al. (2016) and Kim and Lee (2016) by a considerable margin. The APEQE system gives a very large boost over these scores, which are further increased by the combined FULLSTACKEDQE system. Overall, we obtain absolute gains of +13.36% in Pearson’s r cor- relation score for HTER prediction, and +17.62% in Spearman’s ρ correlation for sentence ranking, a considerable advance over the previous state of the art. For WMT15, we also obtain a new state of the art, with less sharp (but still significant) improve- ments: +5.08% in Pearson’s r correlation score, and +5.81% in Spearman’s ρ correlation. 6 Error Analysis Performance over sentence length. To better un- derstand the differences in performance between the pure QE system (STACKEDQE) and the APE-based system (APEQE), we analyze how the two systems, as well as their combination (FULLSTACKEDQE), perform as a function of the sentence length. Figure 3 shows the averaged number of BAD pre- dictions made by the three systems for different sen- tences lengths, in the WMT16 development set. For comparison, we show also the true average num- ber of BAD words in the gold standard. We ob- serve that, for short sentences (less than 5 words), the pure QE system tends to be too optimistic (i.e., it underpredicts BAD words) and the APE-based sys- tem too pessimistic (overpredicting them). In the range of 5-10 words, the pure QE system matches the proportion of BAD words more accurately than the APE-based system. For medium/long sentences, we observe the opposite behavior (this is partic- ularly clear in the 20-25 word range), with the APE-based system being generally better. On the other hand, the combination of the two systems (FULLSTACKEDQE) manages to find a good bal- ance between these two biases, being much closer to the true proportion of BAD labels for both shorter and longer sentences than any of the individual sys- tems. This shows that the two systems complement each other well in the combination. Illustrative examples. 
Table 13 shows concrete examples of quality predictions on the WMT16 de- velopment data. In the top example, we can see that the APE system correctly replaced Angleichungs- farbe by Mischfarbe, but is under-corrective in other parts. The APEQE system therefore misses several BAD words, but manages to get the correct label (OK) for den. By contrast, the pure QE system er- roneously flags this word as incorrect, but it makes the right decision on Farbton and zu erstellen, be- ing more accurate than APEQE. The combination of the two systems (pure QE and APEQE) leads to 214 Source Combines the hue value of the blend color with the luminance and saturation of the base color to create the result color . MT Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen . PE (Reference) Kombiniert den Farbtonwert der Mischfarbe mit der Luminanz und Sättigung der Grundfarbe . APE Kombiniert den Farbton der Mischfarbe mit der Luminanz und die Sättigung der Grundfarbe , um die Ergebnisfarbe zu erstellen . STACKEDQE Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grund- farbe zu erstellen . APEQE Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grund- farbe zu erstellen . FULLSTACKEDQE Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grund- farbe zu erstellen . Source The Video Preview plug - in supports RGB , grayscale , and indexed images . MT Mit dem Zusatzmodul “ Videovorschau ” unterstützt RGB- , Graustufen- und indizierte Bilder . PE (Reference) Das Zusatzmodul “ Videovorschau ” unterstützt RGB- , Graustufen- und indizierte Bilder . APE Das Dialogfeld “ Videovorschau ” unterstützt RGB- , Graustufen- und indizierte Bilder . STACKEDQE Mit dem Zusatzmodul “ Videovorschau ” unterstützt RGB- , Graustufen- und indizierte Bilder . APEQE Mit dem Zusatzmodul “ Videovorschau ” unterstützt RGB- , Graustufen- und indizierte Bilder . FULLSTACKEDQE Mit dem Zusatzmodul “ Videovorschau ” unterstützt RGB- , Graustufen- und indizierte Bilder . Table 13: Examples on WMT16 validation data. Shown are the source and translated sentences, the gold post-edited sentences, the output of the APE system, and the QE predictions of our pure QE and APE-based QE systems as well as their combination. Words predicted as OK are shown in green, those predicted as BAD are shown in red, and differences between the translated and the post-edited sentences are shown in blue. For both examples, the full stacked system predicts all quality labels correctly. Figure 3: Averaged number of words predicted as BAD by the different systems in the WMT16 gold dev set, for different bins of the sentence length. the correct sequential prediction. In the bottom ex- ample, the pure QE system assigns the correct label to Zusatzmodul, while the APE system mistranslates this word to Dialogfeld, leading to a wrong predic- tion by the APEQE system. On the other hand, pure QE misclassifies unterstützt RGB- as BAD words, while the APEQE gets them right. Overall, the APEQE is more accurate in this example. Again, these decisions complement each other well, as can be seen by the combined QE system which outputs the correct word labels for the entire sentence. 7 Conclusions We have presented new state of the art systems for word-level and sentence-level QE that are consid- erably more accurate than previous systems on the WMT15 and WMT16 datasets. 
First, we proposed a new pure QE system which stacks a linear and a neural system, and is simpler and slighly more accurate than the currently best word-level system. Then, by relating the tasks of APE and word-level QE, we derived a new APE- based QE system, which leverages additional artifi- cial roundtrip translation data, achieving a larger im- provement. Finally, we combined the two systems via a full stacking architecture, boosting the scores even further. Error analysis shows that the pure and APE-based systems are highly complementary. The full system was extended to sentence-level QE by virtue of a simple word-to-sentence conversion, re- 215 quiring no further training or tuning. Acknowledgments We thank the reviewers and the action edi- tor for their insightful comments. This work was partially supported by the the EXPERT project (EU Marie Curie ITN No. 317471), and by Fundação para a Ciência e Tecnolo- gia (FCT), through contracts UID/EEA/50008/2013 and UID/CEC/50021/2013, the LearnBig project (PTDC/EEI-SII/7092/2014), the GoLocal project (grant CMUPERI/TIC/0046/2014), and the Amazon Academic Research Awards program. References Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed Word Representations for Multi- lingual NLP. In Proceedings of the Seventeenth Con- ference on Computational Natural Language Learn- ing, pages 183–192. Joshua Albrecht and Rebecca Hwa. 2008. The role of pseudo references in MT evaluation. In Proceedings of the Third Workshop on Statistical Machine Transla- tion, pages 187–190. Eleftherios Avramidis. 2012. Quality estimation for machine translation output using linguistic analysis and decoding features. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 84–90. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- ton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450. Daniel Beck, Kashif Shah, Trevor Cohn, and Lucia Spe- cia. 2013. SHEF-Lite: When less is more for transla- tion quality estimation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 335–340. Ergun Bicici, Qun Liu, and Andy Way. 2015. Refer- ential translation machines for predicting translation quality and related statistics. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 304–308. Ergun Biçici. 2013. Referential translation machines for quality estimation. In Proceedings of the Eighth Work- shop on Statistical Machine Translation, pages 343– 351. John Blatz, Erin Fitzgerald, George Foster, Simona Gan- drabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proceedings of the Interna- tional Conference on Computational Linguistics, page 315. Ondřej Bojar, Rajan Chatterjee, Christian Federmann, Barry Haddow, Chris Hokamp, Matthias Huck, Var- vara Logacheva, , Philipp Koehn, , Christof Monz, Matteo Negri, Pavel Pecina, Matt Post, Carolina Scar- ton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Transla- tion. In Proceedings of the Tenth Workshop on Statis- tical Machine Translation, pages 1–46. Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Anto- nio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mari- ana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. 
Findings of the 2016 conference on machine translation. In Pro- ceedings of the First Conference on Machine Transla- tion, pages 131–198. Leo Breiman. 1996. Stacked Regressions. Machine Learning, 24:49–64. Kyunghyun Cho, Bart Van Merriënboer, Caglar Gul- cehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations Using RNN Encoder-Decoder for Sta- tistical Machine Translation. In Proceedings of Empir- ical Methods in Natural Language Processing, pages 1724–1734. François Chollet. 2015. Keras. https://github. com/fchollet/keras. William W. Cohen and Vitor R. de Carvalho. 2005. Stacked Sequential Learning. In Proceedings of In- ternational Joint Conference on Artificial Intelligence, pages 671–676. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev- Shwartz, and Yoram Singer. 2006. Online Passive- Aggressive Algorithms. Journal of Machine Learning Research, 7:551–585. José G. C. de Souza, Jesús González-Rubio, Christian Buck, Marco Turchi, and Matteo Negri. 2014. FBK- UPV-UEdin participation in the WMT14 Quality Esti- mation shared-task. In Proceedings of the Ninth Work- shop on Statistical Machine Translation, pages 322– 328. José G. C. de Souza, Marcello Federico, and Has- san Sawaf. 2015. MT Quality Estimation for E- Commerce Data. In Proceedings of MT Summit XV, vol. 2: MT Users’ Track, pages 20–29. Miquel Esplà-Gomis, Felipe Sánchez-Martı́nez, and Mikel Forcada. 2015. UAlacant word-level machine translation quality estimation system at WMT 2015. 216 In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 309–315. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural net- works. In International Conference on Artificial Intel- ligence and Statistics, pages 249–256. Yvette Graham. 2015. Improving evaluation of machine translation quality estimation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1804–1813. Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neu- ral machine translation systems for wmt15. In Pro- ceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140. Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear combinations of monolingual and bilingual neural machine translation models for auto- matic post-editing. In Proceedings of the First Confer- ence on Machine Translation, pages 751–758. Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is neural machine translation ready for deployment? A case study on 30 translation directions. arXiv preprint arXiv:1610.01108. Hyun Kim and Jong-Hyeok Lee. 2016. Recurrent neural network based translation quality estimation. In Pro- ceedings of the First Conference on Machine Transla- tion, pages 787–792. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Anna Kozlova, Mariya Shmatova, and Anton Frolov. 2016. YSDA Participation in the WMT16 Quality Es- timation Shared Task. In Proceedings of the First Con- ference on Machine Translation, pages 793–799. Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. 2015. QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Es- timation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 316–322. David Langlois. 2015. LORIA System for the WMT15 Quality Estimation Shared Task. 
In Proceedings of the Tenth Workshop on Statistical Machine Transla- tion, pages 323–329. Ngoc Quang Luong, Laurent Besacier, and Benjamin Lecouteux. 2014. LIG System for Word Level QE task at WMT14. In Proceedings of the Ninth Work- shop on Statistical Machine Translation, pages 335– 341. André F. T. Martins, Dipanjan Das, Noah A. Smith, and Eric P. Xing. 2008. Stacking Dependency Parsers. In Proceedings of Empirical Methods for Natural Lan- guage Processing, pages 157–166. André F. T Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 617–622. André F. T. Martins, Ramón Astudillo, Chris Hokamp, and Fabio N. Kepler. 2016. Unbabel’s Participation in the WMT16 Word-Level Translation Quality Estima- tion Shared Task. In Proceedings of the First Confer- ence on Machine Translation, pages 806–811. Vinod Nair and Geoffrey E Hinton. 2010. Rectified lin- ear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Ma- chine Learning, pages 807–814. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Annual Meeting on Association for Computational Linguistics, pages 160–167. Raphael Rubino, Jennifer Foster, Joachim Wagner, Jo- hann Roturier, Rasul Samad Zadeh Kaljahi, and Fred Hollowood. 2012. DCU-Symantec submission for the WMT 2012 quality estimation task. In Proceedings of the Seventh Workshop on Statistical Machine Transla- tion, pages 138–144. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376. Kashif Shah, Trevor Cohn, and Lucia Specia. 2013. An investigation on the effectiveness of features for trans- lation quality estimation. In Proceedings of the Ma- chine Translation Summit, volume 14, pages 167–174. Michel Simard, Nicola Ueffing, Pierre Isabelle, and Roland Kuhn. 2007. Rule-based translation with sta- tistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical Machine Transla- tion, pages 203–206. Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin- nea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Associa- tion for Machine Translation in the Americas, pages 223–231. Radu Soricut and Sushant Narsale. 2012. Combining quality prediction and system selection for improved automatic translation output. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 163–170. Lucia Specia, Kashif Shah, Jose G.C. de Souza, and Trevor Cohn. 2013. QuEst - a translation quality estimation framework. In Proceedings of the Annual Meeting of the Association for Computational Linguis- tics: System Demonstrations, pages 79–84. 217 Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceed- ings of the 15th Conference of the European Associa- tion for Machine Translation, pages 73–80. Tijmen Tieleman and Geoffrey Hinton. 2012. Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Ma- chine Learning, 4(2). Marco Turchi, Antonios Anastasopoulos, José GC de Souza, and Matteo Negri. 2014. Adaptive qual- ity estimation for machine translation. 
In Proceedings of the Annual Meeting of the Association for Compu- tational Linguistics, pages 710–720. Nicola Ueffing and Hermann Ney. 2007. Word-level confidence estimation for machine translation. Com- putational Linguistics, 33(1):9–40. D. Wolpert. 1992. Stacked generalization. Neural Net- works, 5(2):241–260. Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701. 218