title: Case-Sensitive Neural Machine Translation
authors: Shi, Xuewen; Huang, Heyan; Jian, Ping; Tang, Yi-Kun
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47426-3_51

Although word case is an important piece of lexical information for Latin-script languages, it is often ignored in machine translation. We observe that translation performance drops significantly when case-sensitive evaluation metrics are introduced. In this paper, we propose two types of case-sensitive neural machine translation (NMT) approaches to alleviate this problem: i) adding case tokens into the decoding sequence, and ii) adding case prediction to the conventional NMT model. Our approaches incorporate case information into the NMT decoder by jointly learning target word generation and word case prediction. We compare our approaches with several kinds of baselines, including NMT with naive case-restoration methods, and analyze the impact of various setups. Experimental results on three typical translation tasks (Zh-En, En-Fr, En-De) show that our methods yield improvements of up to 2.5, 1.0 and 0.5 case-sensitive BLEU points, respectively. Further analyses illustrate the inherent reasons why our approaches lead to different improvements on different translation tasks.

In the real world, much natural language text written in a Latin script is case sensitive, for example English, French and German. For many natural language processing (NLP) tasks, case information is an important feature that helps algorithms distinguish sentence structures, identify the part of speech of a word, and recognize named entities. However, most existing machine translation approaches pay little attention to the capitalization correctness of the generated words, which does not meet practical requirements and may introduce noise into downstream NLP applications [9, 20]. In fact, there is a contradiction in preprocessing the training corpus: using a lowercased corpus limits vocabulary growth but discards some morphological information, while keeping the original surface forms enlarges the vocabulary and loses the connection between a word and its lowercase form. Figure 1 gives an example to illustrate this contradiction. Using a true-cased corpus seems to balance the unnecessary vocabulary growth against the missing case information. However, restoring case from a true-cased corpus is not as easy as the reverse process. Table 1 shows that the lowercased and original-case corpora obtain the highest case-insensitive and case-sensitive BLEU scores, respectively, which reflects the difficulty of case restoration.

Fig. 1. "píngguǒ" and "apple" are an aligned word pair that is identical on the source side but written with different case on the target side in our examples. The contradiction is that using the lowercased "apple" in the second example loses the information that it is a proper noun, while using a separate word "Apple" loses the semantic connection with the parallel pair ("píngguǒ", "apple").

Table 1. Case-insensitive/sensitive BLEU scores on Zh-En translation. Δ denotes the BLEU reduction compared with the "insensitive" setting. NRC is a rule-based case-restoring method; more details of the experimental setup are described in Sect. 5.
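To make the evaluation setting concrete, the following minimal sketch computes case-sensitive and case-insensitive BLEU on a toy sentence pair, assuming the sacrebleu package is available; the sentences and the resulting scores are purely illustrative and are not taken from the paper's experiments.

```python
# A minimal sketch of case-sensitive vs. case-insensitive BLEU scoring.
# Assumes the `sacrebleu` package; the example corpus is illustrative only.
import sacrebleu

hyps = ["apple inc. released a new iPhone today ."]
refs = [["Apple Inc. released a new iPhone today ."]]  # one reference stream

# Case-sensitive BLEU: capitalization mismatches are penalized.
sensitive = sacrebleu.corpus_bleu(hyps, refs)

# Case-insensitive BLEU: lowercase both sides before n-gram matching.
insensitive = sacrebleu.corpus_bleu(
    [h.lower() for h in hyps],
    [[r.lower() for r in ref] for ref in refs],
)

print(f"case-sensitive BLEU:   {sensitive.score:.2f}")
print(f"case-insensitive BLEU: {insensitive.score:.2f}")
```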
In this paper, we introduce case-sensitive neural machine translation (NMT) approaches to alleviate the above problems. In our approaches, we apply a lowercased vocabulary to both the source input and the target output of the NMT model, and the model is trained to jointly learn to generate the translation and to determine the capitalization of the generated words. During decoding, the model predicts the case of each output word while generating the translation. Specifically, we propose two kinds of methods to this end: i) mixing case tokens into the lowercased corpus to indicate the true case of the adjacent word; and ii) extending the NMT architecture with an additional network layer that performs case prediction. We evaluate on linguistically disparate corpora in three translation tasks, Chinese-English (Zh-En), English-German (En-De) and English-French (En-Fr), and observe that the proposed techniques improve translation quality in terms of case-sensitive BLEU [16]. We also study model performance on case-restoration tasks, where experimental results show that our methods improve precision (P), recall (R) and F1 scores.

Recently, neural machine translation with the encoder-decoder framework [6] has shown promising results on many language pairs [8, 21], and incorporating linguistic knowledge into NMT has been studied extensively [7, 12, 17]. However, NMT decoding rarely considers the case correctness of the generated words, and some approaches therefore perform case restoration on the machine-generated text [9, 20]. Recent efforts have demonstrated that incorporating linguistic information can be useful in NMT [7, 12, 15, 17, 22, 23]. Since the source sentence is fixed and extra information can easily be attached to it, using source-side features is a straightforward way to improve translation performance [12, 17]. For example, Sennrich and Haddow improve NMT by appending linguistic feature vectors to word embeddings [17], and source-side hierarchical syntactic structures have also been used to achieve promising improvements [7, 12]. In contrast, it is less straightforward to leverage target-side syntactic information, because target words are not available in advance during real decoding. Niehues and Cho apply multi-task learning, training the NMT encoder jointly on auxiliary tasks such as POS tagging and named-entity recognition [15]. There are also works that directly model the syntax of the target sentence during decoding [22-24]. Word case information is a kind of lexical morphology that can be obtained directly from the surface forms, without any additional annotation or parsing of the training corpus. Recently, a joint decoder has been proposed for predicting words and their cases synchronously [25], which shares a similar spirit with part of our approach (see Sect. 4.2). The main distinction of our work is that we propose two families of case-sensitive NMT methods and study various model setups.

Given a source sentence x = {x_1, x_2, ..., x_{T_x}} and a target sentence y = {y_1, y_2, ..., y_{T_y}}, most popular neural machine translation approaches [3, 8, 21] directly model the conditional probability

P(y | x) = ∏_{t=1}^{T_y} P(y_t | y_{<t}, x),    (1)

where y_{<t} = {y_1, ..., y_{t-1}} denotes the partial translation generated before time step t.

In our first approach, we use two special case tokens to indicate capitalized words and abbreviation words in a sequence, respectively. Such a case token can be inserted to the left (LCT) or to the right (RCT) of the cased word.
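To make the case-token idea concrete, the sketch below lowercases a token sequence and inserts a case token to the left of each cased word (the LCT variant); the target-side placement and subword handling are described next. The token strings "<U>" and "<A>" and the helper function are illustrative placeholders, not necessarily the exact symbols used in the paper.

```python
# A minimal sketch of mixing case tokens into a lowercased sequence (LCT variant:
# the case token is inserted to the LEFT of the cased word). The token strings
# "<U>" (capitalized word) and "<A>" (all-caps abbreviation) are placeholders.

def add_case_tokens_lct(tokens):
    """Lowercase every token and prepend a case token where needed."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:   # e.g. "UN"    -> "<A> un"
            out.extend(["<A>", tok.lower()])
        elif tok[:1].isupper():              # e.g. "Apple" -> "<U> apple"
            out.extend(["<U>", tok.lower()])
        else:                                # already lowercase / punctuation
            out.append(tok)
    return out

print(" ".join(add_case_tokens_lct("The UN praised Apple 's new policy .".split())))
# <U> the <A> un praised <U> apple 's new policy .
```

The RCT variant would instead append the case token immediately after the cased word rather than before it.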
For the target sequence, LCT means that the model first predicts the case of a word and then generates the lowercased target word, while RCT applies the opposite order. For a corpus segmented into subword units [11, 19], we insert the LCT to the left of the first subword unit of a cased word and the RCT to the right of its last subword unit. For instance, Fig. 2 shows the sentences modified by adding LCT and RCT, given the original sentence and its subword-encoded version.

In our second approach, we add an additional case prediction output to the decoder of the encoder-decoder NMT model at each decoding step. Given a source sentence x = {x_1, x_2, ..., x_{T_x}}, its target translation y = {y_1, y_2, ..., y_{T_y}}, and the case category sequence of the target sentence c = {c_1, c_2, ..., c_{T_c}}, the goal of this extension is to enable the NMT model to compute the joint probability P(y, c | x). The overall joint model is computed as

P(y, c | x) = ∏_{t=1}^{T_y} P(y_t, c_t | y_{<t}, c_{<t}, x).    (2)

Intuitively, there are three possible assumptions about jointly predicting c_t at time step t: i) predicting c_t before generating the word y_t (CP_pre), ii) predicting c_t after the word y_t has been generated (CP_pos), and iii) predicting c_t and y_t synchronously (CP_syn).

CP_pos: At time step t, the model first predicts y_t and then predicts the case c_t of the known word y_t, which is consistent with most case-restoration processes (as shown in Fig. 3(b)). Under this assumption, the conditional probability in Eq. (2) is factorized into P(y_t | y_{<t}, c_{<t}, x) and P(c_t | y_{≤t}, c_{<t}, x), which are computed from s_t and z_t, respectively, where s_t and z_t are self-attention-based context vectors over the previously generated target words.
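The following is a minimal PyTorch-style sketch of a CP_pos-style decoder output layer: the word is predicted first from one context vector, and the case category is then predicted conditioned on that word and a second context vector. The layer names, dimensions, number of case classes, and the way the chosen word is fed to the case head are illustrative assumptions, not the paper's exact architecture.

```python
# A CP_pos-style output head: predict y_t from s_t, then predict c_t from z_t
# and the chosen word. All names and sizes are illustrative placeholders.
import torch
import torch.nn as nn

class WordAndCaseHead(nn.Module):
    def __init__(self, hidden_size, vocab_size, num_case_classes=3):
        super().__init__()
        self.word_proj = nn.Linear(hidden_size, vocab_size)            # P(y_t | s_t)
        self.word_embed = nn.Embedding(vocab_size, hidden_size)
        # The case head sees both the decoder context and the chosen word.
        self.case_proj = nn.Linear(2 * hidden_size, num_case_classes)  # P(c_t | z_t, y_t)

    def forward(self, s_t, z_t):
        # s_t, z_t: (batch, hidden_size) context vectors from the decoder.
        word_logits = self.word_proj(s_t)
        y_t = word_logits.argmax(dim=-1)                # greedy choice, for illustration
        case_input = torch.cat([z_t, self.word_embed(y_t)], dim=-1)
        case_logits = self.case_proj(case_input)
        return word_logits, case_logits, y_t

# Usage sketch: one decoding step for a batch of 2 with hidden size 8.
head = WordAndCaseHead(hidden_size=8, vocab_size=100)
s_t, z_t = torch.randn(2, 8), torch.randn(2, 8)
word_logits, case_logits, y_t = head(s_t, z_t)
print(word_logits.shape, case_logits.shape, y_t.shape)  # (2, 100) (2, 3) (2,)
```

In training, the word and case heads would be optimized jointly (e.g., by summing cross-entropy losses over y_t and c_t), mirroring the joint objective in Eq. (2).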