Learning Confidence for Transformer-based Neural Machine Translation
Yu Lu, Jiali Zeng, Jiajun Zhang, Shuangzhi Wu, Mu Li
2022-03-22

Abstract. Confidence estimation aims to quantify the confidence of a model prediction, providing an expectation of success. A well-calibrated confidence estimate enables accurate failure prediction and proper risk measurement when given noisy samples and out-of-distribution data in real-world settings. However, this task remains a severe challenge for neural machine translation (NMT), where probabilities from the softmax distribution fail to describe when the model is probably mistaken. To address this problem, we propose to learn an unsupervised confidence estimate jointly with the training of the NMT model. We explain confidence as how many hints the NMT model needs to make a correct prediction, where more hints indicate lower confidence. Specifically, the NMT model is given the option to ask for hints to improve translation accuracy at the cost of a slight penalty. We then approximate its level of confidence by counting the number of hints the model uses. We demonstrate that our learned confidence estimate achieves high accuracy on extensive sentence- and word-level quality estimation tasks. Analytical results verify that our confidence estimate can correctly assess the underlying risk in two real-world scenarios: (1) discovering noisy samples and (2) detecting out-of-domain data. We further propose a novel confidence-based instance-specific label smoothing approach built on our learned confidence estimate, which outperforms standard label smoothing.

Confidence estimation has become increasingly critical with the widespread deployment of deep neural networks in practice (Amodei et al., 2016). It aims to measure the model's confidence in its prediction, showing when it is likely to fail. A calibrated confidence estimate can accurately identify failures and further measure the potential risk induced by the noisy samples and out-of-distribution data that are prevalent in real scenarios (Nguyen and O'Connor, 2015; Snoek et al., 2019).

Unfortunately, neural machine translation (NMT) is reported to yield poorly calibrated confidence estimates (Kumar and Sarawagi, 2019; Wang et al., 2020), which is common in the application of modern neural networks (Guo et al., 2017). This implies that the probability a model assigns to a prediction does not reflect its correctness. Even worse, the model often fails silently by providing high-probability predictions while being woefully mistaken (Hendrycks and Gimpel, 2017). We take Figure 1 as an example. The mistranslations are produced with high probabilities (dark green blocks in the dashed box), making it problematic to assess translation quality from prediction probabilities when no references are available.

Confidence estimation on classification tasks is well studied in the literature (Platt, 1999; Guo et al., 2017). Yet research on structured generation tasks like NMT is scarce. Existing work only studies the phenomenon that the generated probability in NMT cannot reflect accuracy (Müller et al., 2019; Wang et al., 2020), while little is known about how to establish a well-calibrated confidence estimate that accurately describes the predictive uncertainty of the NMT model.
To deal with this issue, we aim to learn the confidence estimate jointly with the training process in an unsupervised manner. Inspired by Ask For Hints (DeVries and Taylor, 2018), we explain confidence as how many hints the NMT model needs to make a correct prediction. Specifically, we design a scenario in which ground truth is available to the NMT model as hints for dealing with tricky translations, but each hint is given at the price of a slight penalty. Under this setting, the NMT model is encouraged to translate independently in most cases to avoid penalties, but to ask for hints to ensure a loss reduction when it is uncertain about the decision. More hints mean lower confidence, and vice versa. In practice, we design a confidence network, taking multi-layer hidden states of the decoder as inputs to predict the confidence estimate. Based on this, we further propose a novel confidence-based label smoothing approach, in which a translation that is more challenging to predict receives more smoothing of its labels.

Recall the example in Figure 1. The first phrase "a figure who loves to play" is incorrect, resulting in a low confidence level under our estimation. We notice that the NMT model is also uncertain about the second expression "a national class actor", which is semantically related but has inaccurate wording. The translation accuracy largely agrees with our learned confidence rather than with model probabilities.

We verify our confidence estimate as a well-calibrated metric on extensive sentence/word-level quality estimation tasks, where it proves more representative in predicting translation accuracy than existing unsupervised metrics (Fomicheva et al., 2020). Further analyses confirm that our confidence estimate can precisely detect potential risk caused by distributional shift in two real-world settings: separating noisy samples and identifying out-of-domain data. The model needs more hints to predict fake or tricky translations in these cases, thus assigning them low confidence. Additionally, experimental results show the superiority of our confidence-based label smoothing over the standard label smoothing technique on different-scale translation tasks (WMT14 En⇒De, NIST Zh⇒En, WMT16 Ro⇒En, and IWSLT14 De⇒En).

The contributions of this paper are three-fold:
• We propose a learned confidence estimate to predict the confidence of the NMT output, which is simple to implement without any degradation of translation performance.
• We prove our learned confidence estimate to be a better indicator of translation accuracy on sentence/word-level quality estimation tasks. Furthermore, it enables precise assessment of risk when given noisy data with varying noise degrees and diverse out-of-domain datasets.
• We design a novel confidence-based label smoothing method to adaptively tune the mass of smoothing based on the learned confidence level, which is experimentally proven to surpass the standard label smoothing technique.

In this section, we first briefly introduce a mainstream NMT framework, Transformer (Vaswani et al., 2017), with a focus on how prediction probabilities are generated. Then we present an analysis of the confidence miscalibration observed in NMT, which motivates the ideas discussed afterward.

The Transformer has a stacked encoder-decoder structure. Given a pair of parallel sentences x = {x_1, x_2, ..., x_S} and y = {y_1, y_2, ..., y_T}, the encoder first transforms the input into a sequence of continuous representations h^0 = {h^0_1, h^0_2, ..., h^0_S}, which are then passed to the decoder.
The decoder is composed of a stack of N identical blocks, each of which includes self-attention, cross-lingual attention, and a fully connected feed-forward network. The output of the l-th block, h_t^l, is fed to the successive block. At the t-th position, the model produces the translation probabilities p_t, a vocabulary-sized vector, based on the outputs of the N-th layer:

p_t = softmax(W h_t^N + b).

During training, the model is optimized by minimizing the cross-entropy loss:

L_NMT = - Σ_{t=1}^{T} y_t log p_t,

where {W, b} are trainable parameters and y_t is denoted as a one-hot vector. During inference, we implement beam search by selecting high-probability tokens from the generated probability distribution at each step.

Modern neural networks have been found to yield miscalibrated confidence estimates (Guo et al., 2017; Hendrycks and Gimpel, 2017), meaning that the prediction probability, as used at each inference step, is not reflective of accuracy. The problem is more complex for structured outputs in NMT. We cannot judge a translation as an error simply because it differs from the ground truth, as several semantically equivalent translations exist for the same source sentence. Thus we manually annotate each target word as OK or BAD on 200 Zh⇒En translations. Only definite mistakes are labeled as BAD, while other uncertain translations are overlooked. Figure 2 reports the density function of prediction probabilities on OK and BAD translations. We observe severe miscalibration in NMT: overconfidence accounts for 35.8% of the cases where the model outputs BAD translations, and 24.9% of OK translations are produced with low probabilities. These issues make it challenging to identify model failures and further drive us to establish an estimate that better describes model confidence.

A well-calibrated confidence estimate should be able to tell when the NMT model probably fails. Ideally, we would like to learn a measure of confidence for each target-side translation, but this remains a thorny problem in the absence of ground truth for the confidence estimate. Inspired by Ask For Hints (DeVries and Taylor, 2018) on the image classification task, we define confidence as how many hints the NMT model needs to produce the correct translation. More hints mean lower confidence, and hence a higher possibility of failure.

Motivation. We assume that the NMT model can ask for hints (look at ground-truth labels) during training, but each clue comes at the cost of a slight penalty. Intuitively, a good strategy is to independently make the predictions that the model is confident about and then ask for clues when the model is uncertain about the decision. Under this assumption, we approximate the confidence level of each translation by counting the number of hints used.

To enable the NMT model to ask for hints, we add a confidence estimation network (ConNet) in parallel with the original prediction branch, as shown in Figure 3. The ConNet takes the hidden state of the decoder at the t-th step (h_t) as input and predicts a single scalar between 0 and 1:

c_t = σ(W' h_t + b'),

where θ_c = {W', b'} are trainable parameters and σ(·) is the sigmoid function. If the model is confident that it can translate correctly, it should output c_t close to 1. Conversely, the model should output c_t close to 0 to ask for more hints.
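For concreteness, a minimal PyTorch-style sketch of such a confidence head is given below. The module name, layer choice, and tensor shapes are illustrative assumptions, not taken from the authors' released code; the paper only specifies a scalar in (0, 1) predicted from decoder hidden states.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Sketch of the ConNet branch: a linear projection plus sigmoid that maps
    a decoder hidden state h_t to a scalar confidence c_t in (0, 1)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)  # plays the role of {W', b'} in the text

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, tgt_len, hidden_dim), e.g. a combination of
        # low-layer decoder states as recommended in the implementation details.
        return torch.sigmoid(self.proj(decoder_states)).squeeze(-1)  # (batch, tgt_len)
```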
To offer the model "hints" during training, we adjust the softmax prediction probabilities by interpolating the ground-truth probability distribution y_t (denoted as a one-hot vector) into the original prediction. The degree of interpolation is decided by the generated confidence c_t:

p'_t = c_t · p_t + (1 - c_t) · y_t.

The translation loss is calculated using the modified prediction probabilities p'_t. To prevent the model from minimizing the loss by always setting c_t = 0 (receiving all of the ground truth), we add a log penalty to the loss function:

L_Conf = - Σ_{t=1}^{T} log c_t.

The final loss is the sum of the translation loss and the confidence loss, weighted by the hyper-parameter λ:

L = L_NMT + λ L_Conf.

Under this setting, when c → 1 (the model is quite confident), we can see that p' → p and L_Conf → 0, which is equal to a standard training procedure. In the case where c → 0 (the model is quite unconfident), we see that p' → y (the model obtains the correct labels). In this scenario, L_NMT would approach 0, but L_Conf becomes very large. Thus, the model can reduce the overall loss only when it successfully predicts which outputs are likely to be correct.

Figure 3: Overview of the framework. The NMT model is allowed to ask for hints (the ground-truth translation) during training based on the confidence level predicted by the ConNet. During inference, we use the model prediction p to sample hypotheses. Each translated word comes with a corresponding confidence estimate.

Implementation Details. Due to the complexity of the Transformer architecture, several optimizations are required to prevent the confidence branch from degrading the performance of the translation branch.

Do not provide hints at the initial stage. The early model is fragile, and this stage lays the groundwork for the subsequent optimization. We find that affording hints in an early period leads to a significant performance drop. To this end, we propose to dynamically control the value of λ (the weight in the final loss, Equation 7) by the training step s according to a decaying schedule (Equation 8), where λ_0 and β_0 control the initial value and the declining speed of λ. We expect the weight of the confidence loss to be large at the beginning (c → 1) and hints to be given during the middle and later stages.

Do not use high-layer hidden states to predict confidence. We find that it would add much burden to the highest-layer hidden state if it were used to predict translation and confidence simultaneously. So we suggest feeding a combination of low-layer hidden states h_t^l (the l-th layer hidden state in the decoder) to the confidence branch and leaving the translation branch unchanged (here, the decoder has 6 layers). Other combinations of low-layer hidden states are also viable, e.g., h_t = AVE(h_t^1 + h_t^3).

Do not let the model lazily learn complex examples. We encounter the situation where the model frequently requests hints rather than learning from difficult examples. We follow DeVries and Taylor (2018) and give hints with 50% probability: in practice, we apply the hint interpolation (Equation 4) to only half of the batch.

Smoothing labels is a typical way to prevent the network from miscalibration (Müller et al., 2019). It has been used in many state-of-the-art models and assigns a certain probability mass (ε_0) to non-ground-truth labels (Szegedy et al., 2016). Here we attempt to employ our confidence estimate to improve smoothing. We propose a novel instance-specific confidence-based label smoothing technique, where predictions with greater confidence receive less label smoothing and vice versa. The amount of label smoothing applied to a prediction (ε_t) is scaled according to its confidence level relative to the batch-level average (Equation 10), where ε_0 is the fixed value used for vanilla label smoothing and ĉ is the batch-level average confidence level.
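Putting the pieces of this section together (confidence prediction, hint interpolation, and the log penalty), the sketch below shows one way the training objective could be assembled. It is a simplified illustration under assumed tensor shapes and a hypothetical function name, not the authors' implementation; in particular, λ would follow the decaying schedule mentioned above, and hints are given to only half of the batch.

```python
import torch
import torch.nn.functional as F

def hint_based_loss(logits, confidence, targets, lam, hint_prob=0.5, pad_id=1):
    """Sketch of the hint-based objective.
    logits: (B, T, V) decoder outputs; confidence: (B, T) ConNet outputs in (0, 1);
    targets: (B, T) gold token ids; lam: current weight of the confidence loss."""
    probs = F.softmax(logits, dim=-1)                          # p_t
    one_hot = F.one_hot(targets, probs.size(-1)).float()       # y_t

    # Give hints to roughly half of the batch only, so the model cannot
    # lazily rely on them (c_t is forced to 1 for the other half).
    use_hint = (torch.rand(confidence.size(0), 1, device=confidence.device)
                < hint_prob).float()
    c = confidence * use_hint + (1.0 - use_hint)

    # p'_t = c_t * p_t + (1 - c_t) * y_t
    adjusted = c.unsqueeze(-1) * probs + (1.0 - c.unsqueeze(-1)) * one_hot

    mask = (targets != pad_id).float()
    nmt_loss = -(torch.log(adjusted.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
                           + 1e-9) * mask).sum()               # cross entropy on p'
    conf_loss = -(torch.log(confidence + 1e-9) * mask).sum()   # log penalty on c
    return nmt_loss + lam * conf_loss
```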
This section first presents empirical studies on the quality estimation (QE) task, a primary application of confidence estimation. Then, we report experimental results for our confidence-based label smoothing, an extension of our confidence estimate to better smoothing in NMT.

To evaluate the ability of our confidence estimate to predict mistakes, we experiment on extensive sentence/word-level QE tasks. Supervised QE requires large amounts of parallel data annotated with human evaluations, which is labor-intensive and impractical for low-resource languages. Here, we propose to address QE in an unsupervised way along with the training of the NMT model.

We experiment on the WMT2020 QE shared tasks, including high-resource language pairs (English-German and English-Chinese) and mid-resource language pairs (Estonian-English and Romanian-English). The task provides source-language sentences, corresponding machine translations, and the NMT models used to generate the translations. Each translation is annotated with a direct assessment (DA) by professional translators, ranging from 0 to 100, according to the perceived translation quality. QE performance can then be evaluated in terms of Pearson's correlation with the DA scores.

We compare our confidence estimate with four unsupervised QE metrics (Fomicheva et al., 2020):
• TP: the sentence-level translation probability normalized by the length T.
• Softmax-Ent: the average entropy of the softmax output distribution at each decoding step.
• Sent-Std: the standard deviation of the word-level log-probabilities p(y_1), ..., p(y_T).
• D-TP: the expectation of the set of TP scores obtained by running K stochastic forward passes through the NMT model with model parameters θ̂_k perturbed by Monte Carlo (MC) dropout (Gal and Ghahramani, 2016).

We also report two supervised QE models:
• Predictor-Estimator (Kim et al., 2017): a weak neural approach, which is usually set as the baseline system for supervised QE tasks.
• BERT-BiRNN (Kepler et al., 2019b): a strong QE model using a large-scale dataset for pre-training and quality labels for fine-tuning.

We propose four confidence-based metrics: (1) Conf: the sentence-level confidence estimate averaged by length; (2) Sent-Std-Conf: the standard deviation of the word-level log-confidences c_1, ..., c_T; (3) D-Conf: similar to D-TP, we compute the expectation of Conf by running K forward passes through the NMT model; and (4) D-Comb: the combination of D-TP and D-Conf.

Note that our confidence estimate is produced together with the translations. It is hard to make our model generate exactly the translations provided by WMT, even with a similar configuration. Thus, we train our model on the same parallel sentences used to train the provided NMT models. Then, we employ forced decoding on the given translations to obtain both the existing unsupervised metrics and our estimates. We do not use any human judgment labels for supervision.

Table 1 shows the Pearson's correlation with DA scores for the above QE indicators. We find that our confidence-based metrics substantially surpass probability-based metrics (the first three lines in Table 1). Compared with dropout-based methods (D-TP), our metrics obtain comparable results on mid-resource datasets while yielding better performance on high-resource translation tasks. We note that the benefits brought by the MC dropout strategy are limited for our metrics, whereas they are significant for probability-based methods; this also demonstrates the stability of our confidence estimate.
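For reference, a sketch of how the sentence-level scores compared in Table 1 could be computed from the per-token quantities of a single forced-decoded hypothesis. Function and variable names are illustrative; the dropout-based variants (D-TP, D-Conf, D-Comb) additionally average these quantities over K stochastic forward passes.

```python
import math

def sentence_level_scores(token_logprobs, token_confidences):
    """token_logprobs: list of log p(y_t) for one hypothesis;
    token_confidences: list of ConNet outputs c_t for the same hypothesis."""
    T = len(token_logprobs)
    tp = sum(token_logprobs) / T                 # TP: length-normalized log-probability
    sent_std = math.sqrt(sum((lp - tp) ** 2 for lp in token_logprobs) / T)      # Sent-Std

    conf = sum(token_confidences) / T            # Conf: length-averaged confidence
    log_c = [math.log(c) for c in token_confidences]
    mean_log_c = sum(log_c) / T
    sent_std_conf = math.sqrt(sum((lc - mean_log_c) ** 2 for lc in log_c) / T)  # Sent-Std-Conf

    return {"TP": tp, "Sent-Std": sent_std, "Conf": conf, "Sent-Std-Conf": sent_std_conf}
```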
In addition, the predictive power of MC dropout comes at the cost of computation, as performing repeated forward passes through the NMT model is time-consuming and impractical for large-scale datasets. Our approach outperforms PredEst, a weak supervised method, on three tasks and further narrows the gap on Ro-En. Though existing unsupervised QE methods still fall behind the strong supervised QE model (BERT-BiRNN), exploring unsupervised metrics remains meaningful for real-world deployment where annotated data are limited.

We also validate the effectiveness of our confidence estimate on QE tasks from a more fine-grained view. We randomly select 250 sentences from Zh⇒En NIST03 and obtain NMT translations. Two graduate students are asked to annotate each target word as either OK or BAD. We assess the performance of failure prediction with standard metrics, which are introduced in Appendix A. Experimental results are given in Table 3. We implement competitive failure prediction approaches, including Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017) and Monte Carlo Dropout (MCDropout) (Gal and Ghahramani, 2016). We find that our learned confidence estimate yields a better separation of OK and BAD translations than MSP. Compared with MCDropout, our metrics achieve comparable performance with significant advantages in computational expense.

Overall, the learned confidence estimate is a competitive indicator of translation precision compared with other unsupervised QE metrics. Moreover, the confidence branch added to the NMT system is a lightweight component. It allows each translation to come with a quality measurement without degrading translation accuracy. The performance with the confidence branch is reported in Appendix B.

We extend our confidence estimate to improve smoothing and experiment on different-scale translation tasks: WMT14 English-to-German (En⇒De), LDC Chinese-to-English (Zh⇒En), WMT16 Romanian-to-English (Ro⇒En), and IWSLT14 German-to-English (De⇒En). We use 4-gram BLEU (Papineni et al., 2002) to score the performance. More details about data processing and experimental settings are in Appendix C.

As shown in Table 2, our confidence-based label smoothing outperforms standard label smoothing by adaptively tuning the amount of smoothing for each prediction. For the Zh⇒En task, our method improves the performance over Transformer w/o LS by 1.05 BLEU, which also exceeds standard label smoothing by 0.72 BLEU. We find that the improvements over standard label smoothing differ across the other language pairs (0.35 BLEU on En⇒De, 0.5 BLEU on De⇒En, and 0.79 BLEU on Ro⇒En). This can be attributed to the fact that the severity of miscalibration varies across language pairs and datasets (Wang et al., 2020). Experimental results with a larger search space (i.e., beam size = 30) are also given in Appendix C to support the above findings.
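To make the adaptive smoothing concrete, the sketch below folds a per-token smoothing mass into a label-smoothed cross-entropy. The scaling rule ε_t = ε_0 · (1 − c_t) / (1 − ĉ) is an assumption chosen only to match the behaviour described earlier (more confident tokens receive less smoothing, with the batch average staying near ε_0); it is not necessarily the paper's exact Equation 10, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def confidence_based_ls_loss(logits, targets, confidence, eps0=0.1, pad_id=1):
    """Sketch of instance-specific label smoothing driven by learned confidence.
    logits: (B, T, V); targets: (B, T) gold ids; confidence: (B, T) in (0, 1)."""
    log_probs = F.log_softmax(logits, dim=-1)
    mask = (targets != pad_id).float()

    # Batch-level average confidence (c_bar) and an assumed per-token smoothing mass.
    c_bar = (confidence * mask).sum() / mask.sum()
    eps_t = (eps0 * (1.0 - confidence) / (1.0 - c_bar + 1e-9)).clamp(0.0, 0.5)

    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    smooth = -log_probs.mean(dim=-1)                                # uniform smoothing term
    loss = (1.0 - eps_t) * nll + eps_t * smooth
    return (loss * mask).sum()
```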
Confidence estimation is particularly critical in real-world deployment, where noisy samples and out-of-distribution data are prevalent (Snoek et al., 2019). Given such abnormal inputs, neural network models are prone to be highly confident in misclassification. Thus, we need an accurate confidence estimate that detects potential failures caused by odd inputs by assigning them low confidence. This section explores whether our confidence estimate can accurately measure risk under these two conditions.

We expect that the model requires more hints to fit noisy labels and therefore predicts low confidence on them. To test this point, we experiment on the IWSLT14 De⇒En dataset containing 160k parallel sentences. We build several datasets with progressively more noisy samples by randomly replacing target-side words with other words in the vocabulary. We train on each dataset with the same configuration and plot the learned confidence estimates in Figure 4.

Figure 4: The shade of the colors denotes how many words are corrupted in a sentence (dark orange means a high pollution rate). The dashed line shows the average learned confidence estimate on the whole dataset.

Table 5: Comparison of the model probability and our confidence estimate on out-of-domain data detection tasks. We present the rate of unknown words (UNK) and the average length of input sentences for each dataset (the average input length of the in-domain dataset is 22.47). All scores are shown in percentages and the best results are highlighted in bold. ↑ indicates that higher scores are better, while ↓ indicates that lower scores are better.

The learned confidence estimate appears to make reasonable assessments. (1) It predicts low confidence on noisy samples but high confidence on clean ones. Specifically, the confidence estimate is much lower for examples with a higher pollution degree (darker in color). (2) With increasing noise in the dataset, the NMT model accordingly becomes more uncertain about its decisions. Large amounts of noise also raise a challenge for separating clean and noisy samples.

We also compare our estimate with the model probability by measuring the accuracy of separating clean and noisy examples under varying pollution rates. We set clean data as the positive examples and use the evaluation metrics listed in Appendix A. As shown in Table 4, our confidence estimate obtains better results in all cases, especially at high noise rates. Our metric improves the area under the precision-recall curve (AUPR) from 64.15% to 76.76% and reduces the detection error (DET) from 13.41% to 8.13% at an 80% noise rate. This proves that our confidence estimate is more reliable for detecting potential risks induced by noisy data.

For the in-domain examples, we train an NMT model on the 2.1M-sentence LDC Zh⇒En news dataset and then sample 1k sentences from NIST2004 as the in-domain testbed. We select five out-of-domain datasets and extract 1k samples from each. Most of them are available for download on OPUS, as specified in Appendix D. Regarding the unknown-word (UNK) rate, the average length of input sentences, and domain diversity, the descending order by distance from the in-domain dataset is WMT-news > Tanzil > Tico-19 > TED2013 > News-Commentary. Test sets closer to the in-domain dataset are intuitively harder to tell apart.

We use the sentence-level posterior probability and the confidence estimate of the translation to separate in- and out-of-domain data. Evaluation metrics are in Appendix A. Results are given in Table 5. We find that our approach performs comparably with the probability-based method on datasets with distinct domains (WMT-news and Tanzil). But when cross-domain knowledge is harder to detect (the last three lines in Table 5), our metric yields a better separation of in- and out-of-domain data.
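As a reference for the separation experiments above, the sketch below computes the Appendix A metrics from sentence-level scores with scikit-learn. The positive class carries label 1 (clean or in-domain), and higher scores should indicate the positive class. The detection-error formula with empirical class priors is one common definition and may differ in detail from the one used in the paper; the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def detection_metrics(scores, labels):
    """scores: sentence-level confidence (or length-normalized probability);
    labels: 1 for clean / in-domain samples, 0 for noisy / out-of-domain ones."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels, dtype=float)
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)

    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]              # threshold where FPR == FNR

    pos_rate = labels.mean()
    det = np.min(pos_rate * fnr + (1.0 - pos_rate) * fpr)   # min. misclassification prob.
    return {"AUROC": auroc, "AUPR": aupr, "DET": det, "EER": eer}
```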
To better understand the behaviour of our confidence estimates on out-of-domain data, we visualize word clouds of the most confident/uncertain words ranked by model probability and by our measurements on a medicine dataset (Figure 5). Our metrics correctly separate in- and out-of-domain data from two aspects. (1) Word frequency: the NMT model is certain about frequent words yet hesitates on rare words, as seen in Figure 5(b), whereas the colors in Figure 5(a) are relatively mixed. (2) Domain relation: the most uncertain words ranked by our confidence estimate are domain-related, like "patho" and "syndrome", while the most confident words are domain-unrelated (e.g., punctuation and prepositions). This phenomenon cannot be seen in Figure 5(a), showing that probabilities from softmax fall short in representing model uncertainty for domain-shifted data.

The task of confidence estimation is crucial in real-world conditions, as it helps failure prediction (Corbière et al., 2019) and out-of-distribution detection (Hendrycks and Gimpel, 2017; Snoek et al., 2019; Lee et al., 2018). This section reviews recent research on confidence estimation and related applications to quality estimation for NMT.

Only a few studies have investigated calibration in NMT. Müller et al. (2019) find that the NMT model is well calibrated during training, whereas it is proven to be severely miscalibrated during inference (Wang et al., 2020), especially when predicting the end of a sentence (Kumar and Sarawagi, 2019). Given the complex structure of NMT, exploration of how to fix its miscalibration is scarce. Wang et al. (2019) and Xiao et al. (2020) use Monte Carlo dropout to capture uncertainty in NMT, which is time-consuming and computationally expensive. Unlike them, we are the first to introduce a learned confidence estimate into NMT. Our method is well designed to adapt to the Transformer architecture and NMT tasks, and it is simple yet effective.

QE aims to predict the quality of the translation provided by an MT system at test time without standard references. Recent supervised QE models are resource-heavy and require a large amount of annotated quality labels for training (Wang et al., 2018; Kepler et al., 2019a; Lu and Zhang, 2020), which is labor-intensive and unavailable for low-resource languages. Exploring internal information from the NMT system to indicate translation quality is an alternative. Fomicheva et al. (2020) find that uncertainty quantification is competitive in predicting translation quality and is also complementary to supervised QE models (Wang et al., 2021). However, they rely on repeated Monte Carlo dropout (Gal and Ghahramani, 2016) to assess uncertainty at a high computational cost. Our confidence estimate outperforms existing unsupervised QE metrics and is intuitive and easy to implement.

In this paper, we propose to learn confidence estimates for NMT jointly with the training process. We demonstrate that the learned confidence can better indicate translation accuracy on extensive sentence/word-level QE tasks and precisely measures potential risk induced by noisy samples or out-of-domain data. We further extend the learned confidence estimate to improve smoothing, outperforming the standard label smoothing technique. As our confidence estimate outlines how much the model knows, we plan to apply our work to design a more suitable curriculum during training and to post-edit low-confidence translations in the future.
We let TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives. We use the following metrics for evaluating the accuracy of word-level QE, noisy label identification, and out-of-domain detection:
• AUROC: the Area Under the Receiver Operating Characteristic (ROC) curve, which plots the relation between the true positive rate (TPR) and the false positive rate (FPR).
• AUPR: the Area Under the Precision-Recall (PR) curve. The PR curve is made by plotting precision = TP/(TP+FP) against recall = TP/(TP+FN).
• DET: the Detection Error, the minimum possible misclassification probability over all possible thresholds when separating positive and negative examples.
• EER: the Equal Error Rate, the error rate at the confidence threshold where the FPR equals the false negative rate (FNR) = FN/(TP+FN).

We set OK translations in the word-level QE task, clean samples in the noisy data identification task, and in-domain samples in the out-of-domain data detection task as the positive examples.

The confidence branch added to the NMT system is a lightweight component. It allows each translation to come with a quality measurement without degrading translation accuracy. Translation results with the confidence branch are given in Table 6. We see that the added confidence branch does not affect translation performance. The implementation details in Section 3 are necessary for achieving this. For instance, if we use the highest-layer hidden state to predict confidence and translation together, BLEU scores decline dramatically with a larger beam size, and the drop is more significant than that of the baseline model. For the En⇒De task, the score changes from 27.31 (beam size 4) to 25.6 (beam size 100), while the baseline model even improves by a further 0.5 BLEU with the larger beam size of 100.

We experiment on different-scale translation tasks: WMT14 En⇒De, LDC Zh⇒En, WMT16 Ro⇒En, and IWSLT14 De⇒En.

Datasets. We tokenize the corpora with Moses (Koehn et al., 2007). Byte pair encoding (BPE) (Sennrich et al., 2016) is applied to all language pairs to construct a joint 32k vocabulary, except for Zh⇒En, where the source and target languages are encoded separately. For En⇒De, we train on 4.5M training samples; newstest2013 and newstest2014 are used as validation and test sets. For Zh⇒En, we remove sentences of more than 50 words and collect 2.1M training samples; we use NIST 2002 as the validation set and NIST 2003-2006 and 2008 (MT08) as the testbed. For Ro⇒En, we train on 0.61M training samples and use newsdev2016 and newstest2016 as validation and test sets. For De⇒En, we train on its 160k-sample training set and evaluate on its test set.

Settings. We implement the described model with the fairseq toolkit for training and evaluation. We follow Vaswani et al. (2017) in configuring the models with the base Transformer. The dropout rate of the residual connection is 0.1, except for Zh⇒En (0.3). The experiments last 150k steps for Zh⇒En and En⇒De and 30k steps for the small-scale De⇒En and Ro⇒En. We average the last ten checkpoints for evaluation and adopt beam search (beam size 4/30, length penalty 0.6). We set ε_ls = 0.1 for vanilla label smoothing. The hyper-parameters λ_0 and β_0 (as in Equation 8) control the initial value and the declining speed of λ (as in Equation 7), which decides the number of hints the NMT model can receive. To ensure that no hints are available at the early stage of training, we set λ_0 = 30 and β_0 = 4.5 × 10^4 for Zh⇒En and En⇒De, and β_0 = 1.2 × 10^4 for De⇒En and Ro⇒En.
We set ε_0 = 0.1 (as in Equation 10) for all language pairs.

Results. A common setting with beam size 4 is given in Table 2 in the main body. Here, we experiment with a larger search space, where being over- or under-confident further worsens model performance (Guo et al., 2017). The results with beam size 30 are listed in Table 7. Our confidence-based label smoothing again improves over Transformer w/o LS, exceeding standard label smoothing by 0.58 BLEU. The performance gains can also be found in the other language pairs, showing the effectiveness of our confidence-based label smoothing with a larger beam size.

Table 7: Translation results (beam size 30) for standard label smoothing and our confidence-based label smoothing on NIST Zh⇒En, WMT14 En⇒De (using case-sensitive BLEU for evaluation), IWSLT14 De⇒En, and WMT16 Ro⇒En. "*" indicates gains that are statistically significant over Transformer w/o LS with p < 0.05.

We select five out-of-domain datasets for our tests (extracting 1k samples from each), which are available for download on OPUS. The datasets are:
• WMT-News: a parallel corpus of news test sets provided by WMT for training SMT, which is rich in content, including sports, entertainment, politics, and so on.
• Tanzil: a collection of Quran translations compiled by the Tanzil project.
• Tico-19: a collection of translation memories from the Translation Initiative for COVID-19, which contains many medical terms.
• TED2013: a corpus of TED talk subtitles provided by CASMACAT, which covers personal experiences in informal language.
• News-Commentary: also a dataset provided by WMT, but the extracted test set is all about international politics.

References
Amodei et al. (2016). Concrete problems in AI safety.
Corbière et al. (2019). Addressing failure prediction by learning model confidence.
DeVries and Taylor (2018). Learning confidence for out-of-distribution detection in neural networks.
Fomicheva et al. (2020). Unsupervised quality estimation for neural machine translation.
Gal and Ghahramani (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
Guo et al. (2017). On calibration of modern neural networks.
Hendrycks and Gimpel (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks.
Kepler et al. (2019a). Unbabel's participation in the WMT19 translation quality estimation shared task.
Kepler et al. (2019b). Unbabel's participation in the WMT19 translation quality estimation shared task.
Kim et al. (2017). Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation.
Koehn et al. (2007). Moses: Open source toolkit for statistical machine translation.
Kumar and Sarawagi (2019). Calibration of encoder decoder models for neural machine translation.
Lee et al. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks.
Lu and Zhang (2020). Quality estimation based on multilingual pre-trained language model.
Müller et al. (2019). When does label smoothing help?
Nguyen et al. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.
Nguyen and O'Connor (2015). Posterior calibration and exploratory analysis for natural language processing models.
Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation.
Platt (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
Sennrich et al. (2016). Neural machine translation of rare words with subword units.
Snoek et al. (2019). Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift.
Szegedy et al. (2016). Rethinking the inception architecture for computer vision.
Vaswani et al. (2017). Attention is all you need.
Wang et al. (2018). Alibaba submission for WMT18 quality estimation task.
Wang et al. (2021). Beyond glass-box features: Uncertainty quantification enhanced quality estimation for neural machine translation.
Wang et al. (2019). Improving back-translation with uncertainty-based confidence estimation.
Wang et al. (2020). On the inference calibration of neural machine translation.
Xiao et al. (2020). Wat zei je? Detecting out-of-distribution translations with variational transformers.

Acknowledgments. This work is supported by the Natural Science Foundation of China under Grant No. 62122088, U1836221, and 62006224.