key: cord-0268260-ytuarb52 authors: Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao title: Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation date: 2022-01-20 journal: nan DOI: 10.1145/3491065 sha: 0e09f5c6043ea4e40e6af49f4519e5124c612bd2 doc_id: 268260 cord_uid: ytuarb52

In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase structure masking and reordering tasks. Experiments on the ASPEC Japanese--English & Japanese--Chinese, Wikipedia Japanese--Chinese, and News English--Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese--English tasks, up to +7.0 BLEU points for the Japanese--Chinese tasks, and up to +1.3 BLEU points for the English--Korean tasks. Empirical analysis, which focuses on the relationship between the individual parts of JASS and ENSS, reveals the complementary nature of their subtasks. Adequacy evaluation using LASER, human evaluation, and case studies reveal that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge, and that they have a larger positive impact on adequacy than on fluency. We release our code here: https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.

Neural machine translation (NMT) [3, 49] can achieve state-of-the-art performance when large parallel corpora are available for training.
However, this prerequisite for parallel corpora limits its usefulness for several language pairs, such as those involving Japanese, Chinese, and Korean, as well as for domains (e.g., history and COVID) for which such large corpora do not exist. Often, these resource-poor language pairs consist of languages that have resource-rich monolingual corpora. Therefore, it is possible to compensate for the lack of parallel corpora by leveraging large monolingual corpora. One popular approach for this is data augmentation, for instance, through back-translation [12, 42]. Another approach involves pre-training the NMT model on tasks that only require monolingual corpora [37, 47].

As a promising technique for leveraging monolingual corpora, pre-training has experienced a surge in popularity in NLP ever since models such as BERT [8] achieved state-of-the-art results in text understanding. However, BERT-like models were not designed to be used for NMT, in the sense that they are essentially techniques for pre-training encoders rather than sequence-to-sequence models. To address this, Song et al. [47], Lewis et al. [21], and Liu et al. [23] recently proposed self-supervised, language-agnostic sequence-to-sequence pre-training methods for NMT, which have achieved new state-of-the-art results in low-resource scenarios.

Languages that are sufficiently "rich" to have large monolingual corpora often have available tools for linguistic analysis. Meanwhile, a low-resource language pair is usually composed of a resource-rich language and a low-resource language, and the linguistic knowledge of the resource-rich language can be easily extracted. In addition, studies such as Sennrich and Haddow [41] and Murthy et al. [27] have demonstrated that linguistic knowledge can improve NMT without using additional corpora. Therefore, it is natural to use monolingual corpora and linguistic tools in bilingual low-resource scenarios.
However, the manner in which linguistic knowledge should be provided is not always clear, because NMT models are implemented in an end-to-end scheme. From a technical perspective, it is practical to extract linguistic features on the monolingual side. Therefore, monolingual pre-training provides an ideal framework for leveraging monolingual corpora and injecting linguistic information.

In Mao et al. [24], we proposed a linguistically motivated pre-training approach known as Japanese-specific sequence-to-sequence (JASS), which was inspired by masked sequence-to-sequence pre-training (MASS) but focused on syntactic analysis obtained by using a parser. Particularly, we added syntactic constraints to the sentence-masking process of MASS to obtain the bunsetsu-based MASS (BMASS) task. We also proposed bunsetsu reordering-based sequence-to-sequence (BRSS), which is a linguistically motivated reordering task. Several previous studies [21, 39] have provided evidence that "multi-task" pre-training, which combines various styles of self-supervised training tasks, yields significantly superior results for NMT. We proposed JASS as a combination of the above-mentioned two tasks, tailored for NMT involving Japanese.

In this study, we additionally propose linguistically-driven pre-training methods for English to leverage language-specific linguistic information in the pre-training phase. They are referred to as phrase structure-based MASS (PMASS) and head finalization-based sequence-to-sequence (HFSS), and their combination is denoted as English-specific sequence-to-sequence (ENSS). Moreover, unlike the proposed methods for Japanese, the proposed methods for English can be transplanted onto any SVO language. Thus, our proposed ENSS and JASS can be applied to any translation pair involving English or Japanese.
We experimented with ASPEC Japanese-English & Japanese-Chinese [30], Wikipedia Japanese-Chinese [4, 5], and News English-Korean [32] in various pre-training settings for JASS and ENSS. Our results indicate that BMASS, BRSS, and HFSS significantly outperform the state-of-the-art MASS pre-training, whereas PMASS yields marginal improvements. Furthermore, we demonstrate that linguistically-driven multi-task pre-training methods (JASS & ENSS) lead to further improvements in low-resource scenarios of up to +2.9 BLEU points for Japanese to English, +2.7 BLEU points for English to Japanese, +4.3 BLEU points for Japanese to Chinese, +7.0 BLEU points for Chinese to Japanese, +0.5 BLEU points for English to Korean, and +1.3 BLEU points for Korean to English.

Unlike in our previous study, Mao et al. [24], we provide substantial analyses for evaluating the translations generated by JASS and ENSS, which focus on the relationship between different pre-training tasks and on the adequacy and fluency of the corresponding translations. Specifically, we validate the superior translation adequacy improvement of linguistically-driven methods through automatic adequacy evaluation using LASER, human evaluation, and a case study. To confirm the complementary nature of the masked language model and reordering pre-training tasks, we performed an evaluation of the pre-training accuracy. We expect this study to extend the usefulness of linguistically-driven pre-training methods to more low-resource language pairs and to compensate for the shortcomings of Mao et al. [24] in terms of empirical evaluation.

The contributions of this study can be summarized as follows.
(1) BMASS and BRSS: Linguistically-driven novel pre-training methods for NMT involving Japanese.
(2) PMASS and HFSS: Linguistically-driven novel pre-training methods for NMT involving English (can be theoretically implemented on any SVO language).
There are mainly three lines of work related to improving NMT in low-resource situations: cross-lingual transfer, data augmentation, and monolingual pre-training. These approaches are potentially complementary. Our work belongs to the monolingual pre-training category.

Cross-lingual transfer addresses the low-resource issue by using data from different language pairs. One can use a richer language pair [60] or several language pairs simultaneously [7, 9]. Murthy et al. [27] also proposed reordering the assisting languages to be similar to a low-resource language.

Data augmentation involves the creation of synthetic bilingual data from monolingual data. In the popular back-translation approach [10, 12, 42], the source side of the data is synthesized using an MT system to back-translate the target side data. Recently, Zhou et al. [58] proposed the creation of this source side through rule-based reordering via word-to-word translation.

In monolingual pre-training approaches, all or part of a model is first trained on tasks that require only monolingual data. Pre-training has enjoyed significant success in other NLP tasks with the development of GPT [38], BERT [8], and several others [33, 48, 53]. Pre-training schemes such as BERT were designed for natural language understanding (NLU) tasks and are not directly suitable for NMT. Conneau and Lample [6] and Ren et al. [40] proposed multilingual variants. However, they trained the encoder and decoder independently. To address this, Song et al. [47] recently proposed MASS, a new state-of-the-art NMT pre-training task that jointly trains the encoder and decoder. Our approach builds on the initial idea of MASS, but adds more diverse and linguistically-motivated training objectives.

Linguistic information is known to be useful for NMT [41], especially in low-resource scenarios. Outside of pre-training, studies [27, 56, 58] have successfully used a linguistically-motivated reordering similar to that of our BRSS task.
Sun et al. [48] used linguistically-motivated pre-training tasks for text understanding. To the best of our knowledge, there are no studies on linguistically-motivated pre-training tasks for NMT.

After the appearance of BERT [8], several pre-training methods have been proposed to enhance NMT [6, 21-23, 39, 40, 45-47, 51, 52, 54]. Particularly, Song et al. [47] proposed a random span reconstruction task to pre-train a sequence-to-sequence framework for NMT. Wang et al. [51] first proposed using shuffling, deleting, and replacing operations to implement denoising pre-training for NMT systems. Thereafter, Lewis et al. [21] combined the denoising methods with the masked language model pre-training of Song et al. [47], and provided detailed empirical results for a large number of language pairs. mBART [23] is a multilingual sequence-to-sequence model pre-trained through denoising tasks on 25 languages, including Japanese, English, Chinese, Russian, and others, and it can be deemed an extension of Lewis et al. [21]. Other studies focus on leveraging the cross-lingual supervision between languages through word alignment [22], phrase alignment [40], sentence-level alignment [6], code-switching [54], or assisting languages (shared scripts) [46].

Among the above-mentioned pre-training techniques for NMT, we observe that no study has focused on leveraging specific linguistic features for NMT. Syntactic span-masking [59] and semantic-aware BERT [57] have been proposed using linguistically-driven supervision for language understanding tasks. However, linguistically-driven methods for sequence-to-sequence pre-training remain to be considered and explored.

Studies have also focused on improving MASS. Siddhant et al. [45] adapted MASS to multilingual scenarios; Qi et al. [36] proposed using an n-stream self-attention mechanism to enhance MASS for language generation tasks.
No previous study has attempted to enhance MASS from a linguistic perspective, which we explore in this study. Moreover, Wang et al. [52] highlighted that multi-task learning can significantly benefit multilingual NMT. In addition to the MT task, the essential jointly-learned tasks are the masked language model task and the denoising (reconstruction) task, which are the two basic pre-training styles on which we base our linguistically-driven methods.

In this section, we introduce the preliminary backgrounds of pre-training and fine-tuning for NMT and MASS, which serve as the backbone for this study.

Fig. 1. Pre-training and fine-tuning for NMT. "S2S" denotes sequence-to-sequence.

Fig. 2. Sequence-to-sequence structure for MASS. $x_i$ represents a token and $x_3$ to $x_6$ are consecutive tokens to be masked/predicted.

We first introduce the pre-training and fine-tuning pipeline for NMT. As shown in Figure 1, we first utilize monolingual corpora to pre-train the initialized sequence-to-sequence model. Subsequently, we use a parallel corpus of the languages of interest to fine-tune the pre-trained model. The fine-tuned model is the final NMT model. All the experiments in this study are conducted on the basis of this pre-training and fine-tuning pipeline for NMT.

MASS is a pre-training method for NMT proposed by Song et al. [47]. As shown in Figure 2, in MASS pre-training, the input is a sequence of tokens where a part of the sequence is masked, and the output is a sequence where the masking is inverted. We consider $x \in \mathcal{X}$, a sequence of tokens, where $\mathcal{X}$ is a monolingual corpus. Additionally, we consider the token span $C = [p, q]$, where $0 < p \le q \le len(x)$ and $len(x)$ is the number of tokens in sentence $x$. We denote the masked sequence by $x^{C}$, where tokens in positions from $p$ to $q$ in $x$ are replaced by a mask token $[M]$. $x^{!C}$ is the sequence with an inverted mask, that is, where tokens in positions other than the aforementioned fragment are replaced by the mask token $[M]$. In MASS, the pre-training objective is to predict the masked fragment in $x$ using an encoder-decoder model, where $x^{C}$ is the input to the encoder and $x^{!C}$ is the target output of the decoder. The log-likelihood objective function is

$$\mathcal{L}_{mass}(\mathcal{X}) = \sum_{x \in \mathcal{X}} \log P\left(x^{!C} \mid x^{C}; \theta\right), \tag{1}$$

where $\theta$ denotes the model parameters. The number of tokens to be masked is a hyperparameter of MASS. The NMT model is jointly pre-trained with the MASS task for both the source and target languages.

Fig. 3. Word and bunsetsu segmentations for a Japanese sentence with meaning "LoveLive is made of three projects." In word-for-word English translations, "_" represents words with no meaningful translations.

In this section, we describe JASS and ENSS, which are our proposed pre-training techniques. Our methods are based on the ideas of the original MASS and are improved by jointly learning multiple linguistics-aware tasks.

For Japanese, we propose bunsetsu-based MASS (BMASS) pre-training and bunsetsu reordering-based sequence-to-sequence (BRSS) pre-training. Their combination, Japanese-specific sequence-to-sequence (JASS) pre-training, is introduced in the following section.

Bunsetsu is the syntactic component of Japanese sentences [20, 26]. It is equivalent to the concepts of noun phrases or verb phrases in English syntax, and it constitutes a minimal unit of meaning. The concept of "word" is ambiguous for writing systems such as Japanese, where word separators are not used, and Japanese segmenters [20, 26] can segment Japanese sentences either into words or into bunsetsus. Therefore, a bunsetsu is also more likely than a word to correspond to a well-defined entity or concept. Figure 3 illustrates the difference between word-level and bunsetsu-level segmentation. Each bunsetsu contains self-contained information and case markers, which indicate its relation with other bunsetsus.
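To make the MASS input and target construction from the preliminaries concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a pre-tokenized sentence and a fixed span length, and it follows the description given here (the decoder target is the sequence with the inverted mask). The function name `mass_pair` and its parameters are hypothetical.

```python
import random

MASK = "[M]"  # mask token; the symbol is an assumption for illustration


def mass_pair(tokens, span_len, rng=None):
    """Build one MASS example from a tokenized sentence.

    Returns the 1-indexed span (p, q), the encoder input in which
    positions p..q are masked, and the decoder target in which every
    position outside p..q is masked (the inverted mask).
    """
    rng = rng or random.Random()
    n = len(tokens)
    if not 0 < span_len <= n:
        raise ValueError("span_len must be between 1 and len(tokens)")
    p = rng.randint(1, n - span_len + 1)  # randint is inclusive on both ends
    q = p + span_len - 1
    source = [MASK if p <= i <= q else tok
              for i, tok in enumerate(tokens, start=1)]
    target = [tok if p <= i <= q else MASK
              for i, tok in enumerate(tokens, start=1)]
    return (p, q), source, target


if __name__ == "__main__":
    sentence = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
    span, src, tgt = mass_pair(sentence, span_len=4, rng=random.Random(0))
    print(span)
    print(" ".join(src))
    print(" ".join(tgt))
```

Every position is masked in exactly one of the two sequences, which is what lets the encoder-decoder model be trained jointly on a single monolingual sentence.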
Based on the bunsetsu, we introduce our proposed pre-training techniques for Japanese.

We propose BMASS, which leverages syntactically parsed Japanese monolingual data for sequence-to-sequence pre-training. MASS pre-trains an NMT model by making it predict random parts of a sentence given their context, whereas BMASS makes the model predict a set of bunsetsus given the contextual bunsetsus. We expect this to allow the model to learn about bunsetsus and thereby focus on predicting meaningful subsequences instead of random, albeit fluent, subsequences. To perform BMASS, we modify the definition of the mask in Equation 1 to a set of spans:

$$C = \{[p_1, q_1], \ldots, [p_n, q_n]\}, \quad 0 < p_1 \le q_1 < p_2 \le q_2 < \cdots < p_n \le q_n \le len(x).$$

The term $len(x)$ denotes the number of tokens in sentence $x$, and the $i$-th position span from $p_i$ to $q_i$ corresponds to the start and end of a specific bunsetsu in a Japanese sentence. Consequently, we denote the BMASS loss as $\mathcal{L}_{bmass}$.

The main difference between MASS and BMASS is that in MASS we mask random token spans, whereas in BMASS we only mask token spans that are complete bunsetsus. The number of bunsetsus to be masked constitutes a hyperparameter for BMASS. Note that our BMASS pre-training task differs from the entity masking task of ERNIE [48] and the random span masking of SpanBERT [16]. ERNIE and SpanBERT do not use syntactic units, and they are employed in natural language understanding downstream tasks.

4.1.3 BRSS

Japanese sentences are typically in an SOV word order that can be reordered to SVO to reduce the difficulty of translation to languages with SVO order. We first define a simple process for reordering a (typically SOV) Japanese sentence into an "SVO Japanese" pseudo-sentence that will be used in BRSS. There are several previous studies on reordering an SOV-ordered sentence into an SVO-ordered sentence [13, 19].
In our case, to consistently leverage bunsetsu units in Japanese with BMASS, we propose bunsetsu-based reordering, which is able to generate an SVO-ordered Japanese sentence while retaining syntactic information at the bunsetsu level. We first define "chunking signal words" as any punctuation mark or the topic marker "は." The reordering process is as follows:
(1) split the sentence into bunsetsus;
(2) select sequences of bunsetsus bounded by chunking signal words;
(3) simply reverse the order of the bunsetsus in these sequences without using rules.

We can now propose BRSS, which involves a Japanese sentence and its reordered version obtained using the aforementioned procedure. Refer to Figure 4-d for an example of a bunsetsu-reordered sentence. The pre-training objective is a reordering task. We expect that this will allow the system to learn the structure of the Japanese language and prepare it for the reordering operation it will have to perform when translating to a language with a different grammar. Although the BRSS task is constructed by simple rules, the predictions for the bunsetsu boundaries and orders are expected to equip the model with abundant linguistic knowledge. We have two choices: we can make the NMT system predict the original sentence given the reordered sentence (BRSS.F) or vice versa (BRSS.R). We will experiment with both options.

Similar to the proposed methods for Japanese, we propose two linguistically-driven methods for English that are based on the MASS masked language model and the reordering sequence-to-sequence model, respectively. One is phrase structure-based MASS (PMASS), and the other is head finalization-based sequence-to-sequence pre-training (HFSS). Their combination, ENSS, is introduced in the next section.
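The three-step bunsetsu reordering procedure for BRSS described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the input is assumed to be already segmented (for example by a parser), and chunking signal words (punctuation or the topic marker "は") are assumed to appear as standalone elements of the input list, which may differ from how a real parser attaches them to bunsetsus. The names `brss_reorder` and `brss_pair` and the example segmentation are hypothetical.

```python
import unicodedata

TOPIC_MARKER = "は"


def is_chunking_signal(unit):
    """True for the topic marker or for a unit made only of punctuation."""
    if unit == TOPIC_MARKER:
        return True
    return bool(unit) and all(
        unicodedata.category(ch).startswith("P") for ch in unit
    )


def brss_reorder(units):
    """Reverse each maximal run of units bounded by chunking signal words."""
    reordered, run = [], []
    for unit in units:
        if is_chunking_signal(unit):
            reordered.extend(reversed(run))  # step (3): reverse without rules
            run = []
            reordered.append(unit)  # the signal word keeps its position
        else:
            run.append(unit)
    reordered.extend(reversed(run))
    return reordered


def brss_pair(units, forward=True):
    """BRSS.F: reordered -> original; BRSS.R: original -> reordered."""
    reordered = brss_reorder(units)
    original = list(units)
    return (reordered, original) if forward else (original, reordered)


if __name__ == "__main__":
    # Hypothetical segmentation used only for illustration.
    sentence = ["LoveLive", "は", "3つの", "プロジェクトから", "なる", "。"]
    src, tgt = brss_pair(sentence, forward=True)
    print(" ".join(src))
    print(" ".join(tgt))
```

Because the reversal is purely positional, the sketch needs no language-specific rules beyond the list of signal words, which matches the "without using rules" wording of step (3).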
Before introducing our proposed methods for English, we first provide background information on head-driven phrase structure grammar and head finalization, on which our linguistically-driven methods are based.

As opposed to dependency-based grammar, head-driven phrase structure grammar (HPSG) [34, 35] is a lexicalism-based grammar that focuses on generalizing phrase structures. HPSG primarily handles word and phrase signs in a sentence in terms of their syntactic and semantic roles. Thus, HPSG should be an appropriate parsing formalism for extracting phrase structures in sentences and applying the following proposed pre-training techniques. Figure 5 (left) shows an instance of parsing an English sentence using HPSG grammar.

Fig. 5. Example of HPSG parsing result and head finalization. Head finalization reorders an English sentence into a Japanese-like sentence [14]. Blue arrows denote the "head."

Using the above-mentioned HPSG, sentences in any language can be characterized using phrase structures. From the definition of a phrase, the "head" of a phrase is defined as the syntactically determinant part of the phrase. In other words, the "head" determines the syntactic category of the phrase and its "dependents." Particularly, English is referred to as a "head-initial" language because the "head" appears before its "dependents," whereas Japanese is referred to as a "head-final" language because the "head" usually appears after its "dependents" in a phrase.

The detailed phrase structures provided by the HPSG parser are utilized in several scenarios in NLP. Particularly, Isozaki et al. [14] proposed a simple reordering rule for SVO (head-initial) languages by using the phrase structure information provided by the HPSG parser. Figure 5 shows an example of reordering an English sentence into an SOV-like sentence on the basis of the result of HPSG parsing.
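The core idea of head finalization, moving the head of each phrase after its dependents, can be illustrated with a toy sketch. It is not the rule set of Isozaki et al. [14] and it does not call a real HPSG parser: it assumes a hand-built binary tree in which each internal node records which child is the head, and it omits any exceptions or extra tokens the original rule may use. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy binary phrase-structure node; a leaf carries a word."""
    word: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    head: str = "left"  # which child is the head: "left" or "right"


def surface(node: Node) -> List[str]:
    """Original (head-initial) word order: plain left-to-right traversal."""
    if node.word is not None:
        return [node.word]
    return surface(node.left) + surface(node.right)


def head_finalize(node: Node) -> List[str]:
    """Emit the non-head child first so that every head comes last."""
    if node.word is not None:
        return [node.word]
    left = head_finalize(node.left)
    right = head_finalize(node.right)
    return right + left if node.head == "left" else left + right


if __name__ == "__main__":
    # "John lost his wallet": S -> NP VP (head VP), VP -> V NP (head V),
    # NP -> Det N (head N). The tree is built by hand for illustration.
    noun_phrase = Node(left=Node("his"), right=Node("wallet"), head="right")
    verb_phrase = Node(left=Node("lost"), right=noun_phrase, head="left")
    sentence = Node(left=Node("John"), right=verb_phrase, head="right")
    print(" ".join(head_finalize(sentence)))  # SOV-like source side
    print(" ".join(surface(sentence)))        # original sentence as target
```

A reordering pre-training pair in the style described for English could then use the head-finalized sequence as the source and the original sentence as the target.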
By reordering sentences in SVO languages such as English into SOV-like sentences, the performance of statistical machine translation (SMT) is improved. Particularly, Isozaki et al. [14] first proposed head finalization and applied it to English-to-Japanese SMT; Han et al. [11] applied it to Chinese-to-Japanese SMT and obtained significant improvements; more recently, Zhou et al. [58] utilized this reordering technique to generate synthetic parallel sentences in the back-translation phase when translating between SOV and SVO languages. In this study, we utilize this reordering rule in the pre-training phase for NMT (see Section 4.2.4).

In general, we perform PMASS pre-training by limiting the masked tokens in MASS to entire phrase spans. We denote the variant that masks multiple phrase spans as PMASS.P and the variant that masks only a single phrase span as PMASS.S. Particularly, the source and target for PMASS.P and PMASS.S pre-training can be generated using our proposed phrase-masking algorithms described in Appendix A. Inspired by MASS, we force the number of masked tokens to be approximately half of the length of the sentence to guarantee the effectiveness of the sequence-to-sequence masked language model. Examples of PMASS.P and PMASS.S are presented in Figure 6-c. We observe that several phrase spans in PMASS.P and a single long phrase span in PMASS.S are masked. We expect such special masking patterns to force the NMT system to extract more phrase-level syntactic information in the pre-training phase.

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 1, No. 1, Article . Publication date: January 2022.

[Figure 6: source (Src.) and target (Tgt.) examples of the English pre-training tasks for the sentence "John went to the police because Mary lost his wallet."]

We propose HFSS using the head finalization technique [14] for pre-training on English. As shown in Figure 6-d, the pre-training task is also a reordering task that simulates the translation from SOV languages to English.
More precisely, the source sentence for sequence-to-sequence pre-training is the reordered (SOV-like or head-finalized) English sentence, and the target sentence is the original English monolingual sentence. We expect HFSS to help the system learn in advance the word reordering pattern of the translation between head-initial (SVO) and head-final (SOV) languages. According to the prior experiments for Japanese (see Mao et al. [24] and 6.1), BRSS.F consistently outperforms BRSS.R. In addition, BART [21] also reports that reconstructing the original sentence benefits language generation tasks. Therefore, we do not distinguish between HFSS.F and HFSS.R (HFSS.F performs pre-training with the SOV-SVO pattern, whereas HFSS.R performs the reverse pattern). Instead, we directly define HFSS using the pre-training pattern of HFSS.F.

Moreover, HFSS is performed on the basis of head finalization, which utilizes the results from HPSG parsers. This is consistent with PMASS, in which we extract phrases using HPSG parsing results. We develop our proposal for English through head finalization, whereas for SOV languages such as Japanese, it is difficult to reorder SOV sentences into SVO-like sentences [14]. Furthermore, HFSS can be used for all head-initial languages apart from English, as well-developed reordering rules have been proposed and demonstrated to be effective for NMT. However, BRSS can only be implemented for translation pairs involving Japanese, because bunsetsu information is required to establish the source and target sentences for sequence-to-sequence pre-training.

Multi-task pre-training objectives lead to a robust initial state for NMT systems [21, 39]. Because our proposed methods can also be categorized into two groups of pre-training tasks, we propose a multi-task pre-training task for both Japanese and English. We define JASS pre-training, which is a combination of the two previous procedures: BMASS and BRSS.
Our actual pre-training consists of the joint execution of these two pre-training tasks. Therefore, the pre-training objective for JASS is

$$\mathcal{L}_{jass}(\mathcal{X}) = \mathcal{L}_{bmass}(\mathcal{X}) + \mathcal{L}_{brss}(\mathcal{X}),$$

where $\mathcal{X}$ represents the monolingual corpus of Japanese, and $\mathcal{L}_{brss}$ denotes the reordering loss using the forward or reverse variant mentioned in Section 4.1.3. We expect BMASS and BRSS to jointly learn syntactic knowledge, and BRSS to additionally learn word ordering knowledge.

For English, we similarly define ENSS pre-training, which combines PMASS and HFSS. More precisely, the training objective is

$$\mathcal{L}_{enss}(\mathcal{X}) = \mathcal{L}_{pmass}(\mathcal{X}) + \mathcal{L}_{hfss}(\mathcal{X}),$$

where $\mathcal{X}$ denotes the monolingual corpus of English, $\mathcal{L}_{pmass}$ the PMASS.P or PMASS.S loss, and $\mathcal{L}_{hfss}$ the reordering loss of HFSS.

JASS is specifically designed for Japanese, whereas ENSS can theoretically be transplanted onto any SVO language, as long as we can extract the phrase structure information of the corresponding language from an HPSG parser. We also mix JASS pre-training for Japanese with MASS pre-training for the other languages involved in the translation. In practice, JASS pre-training therefore uses Japanese monolingual data with the BMASS and BRSS objectives, along with monolingual data of the "other language" with the MASS objective. Similarly, for English, ENSS pre-training consists of PMASS and HFSS for English and MASS for the "other language" involved in the fine-tuning translation pair. We also consider combining our proposed linguistically-driven methods with a strong baseline pre-training objective, MASS, which we refer to as MASS + JASS (or ENSS) in the subsequent sections. To allow the pre-training model to determine the language and sub-task (MASS, BMASS, BRSS, PMASS, or HFSS) that it should perform, we prepend tags to inputs, similar to those used in Johnson et al. [15] (see Section 5.2 for details).
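The tagging and mixing of pre-training examples described above can be sketched as follows. The tag format (`<ja>`, `<bmass>`, and so on) and the function names are hypothetical, since the exact tags are not given here; the shuffle is one simple way to interleave tasks so that their losses are optimized jointly, as the summed objectives suggest.

```python
import random


def add_tags(tokens, lang, task):
    """Prepend a language tag and a task tag to the encoder input."""
    return [f"<{lang}>", f"<{task}>"] + list(tokens)


def build_multitask_stream(examples_by_task, seed=0):
    """Merge (source, target) pairs from several tasks into one stream.

    examples_by_task maps (lang, task) to a list of (source, target)
    token lists. Only the source side is tagged in this sketch.
    """
    stream = []
    for (lang, task), pairs in examples_by_task.items():
        for source, target in pairs:
            stream.append((add_tags(source, lang, task), list(target)))
    random.Random(seed).shuffle(stream)  # interleave tasks and languages
    return stream


if __name__ == "__main__":
    data = {
        ("ja", "bmass"): [(["a", "[M]", "c"], ["[M]", "b", "[M]"])],
        ("ja", "brss"): [(["c", "b", "a"], ["a", "b", "c"])],
        ("zh", "mass"): [(["[M]", "y"], ["x", "[M]"])],
    }
    for source, target in build_multitask_stream(data, seed=1):
        print(" ".join(source), "->", " ".join(target))
```

With this layout, summing the per-example losses over the shuffled stream corresponds to adding the task-specific losses, under the assumption that every task contributes its examples with equal weight.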
In this section, we evaluate our pre-training methods on simulated low-resource scenarios for ASPEC Japanese-English [30] and Japanese-Chinese translations [28], and on realistic low-resource scenarios for Wikipedia Japanese-Chinese [4, 5] and News English-Korean [32] translations.

We used monolingual data for pre-training and parallel data for fine-tuning. Refer to Table 1 for an overview.

Monolingual data: For pre-training, we use monolingual data of 22M lines each for Japanese, English, Chinese, and Korean, randomly sub-sampled from the Common Crawl data listed in the official WMT monolingual training data. For pre-training in Japanese-English and English-Korean, given that the two languages in each pair have different scripts and thus have few common words, the pre-training objectives for each language work separately, even though they are performed jointly for the two languages. However, Japanese and Chinese share more characters, which indicates that the monolingual pre-training tasks are run in a pseudo-cross-lingual manner. Thus, we also examine whether such pre-training benefits fine-tuning more.

Parallel data: We use the scientific abstracts domain ASPEC parallel corpus for training Japanese-English and Japanese-Chinese models. For Japanese-Chinese fine-tuning, we also utilize the Wikipedia parallel corpus, which is a real low-resource scenario. We use the News parallel corpus for English-Korean, which is a low-resource dataset. For ASPEC, we used the official training, development, and test splits provided by WAT 2019. For Wikipedia, we used the dataset released by Kyoto University. For News, we use the dataset provided by Park et al. [32].

We tokenize the monolingual data by using the Moses tokenizer for English and Korean, Jumanpp for Japanese, and jieba for Chinese. We obtain the bunsetsu information by using KNP and obtain the HPSG parsing results using enju.
Sentences with more than 175 tokens were removed. For each language pair, we constructed a joint vocabulary with 60 000 sub-word units through byte-pair encoding (BPE) [43] on the concatenated monolingual corpora involved during pre-training, whereas 40 000 BPE merge operations were set for Japanese-English. In the multi-task pre-training, each sentence is prepended with a task token that distinguishes between different pre-training objectives and languages. This token is used when monolingual pre-training is conducted jointly over multiple languages and multiple tasks.

In our experiments, we used the open-source OpenNMT [17] implementation of the transformer [50] NMT model (see footnote 21). The hyperparameters are set to the transformer-big setting in OpenNMT. Particularly, our model has a 6-layer encoder and decoder, a hidden size of 1024, a feed-forward hidden layer size of 4096, a batch size of 4096, a dropout rate of 0.3, and 16 attention heads. An ADAM optimizer with a learning rate of $10^{-4}$ was used for both pre-training and fine-tuning. All the pre-training tasks are run until convergence on four TITAN V100 GPU cards, and fine-tuning uses only one GPU. Each pre-training run took approximately two days. Mixed precision training [25] was used for both pre-training and fine-tuning. For multi-task pre-training, data are randomly shuffled such that different pre-training objectives appear even within each mini-batch, corresponding to a real joint pre-training. Our proposed pre-training methods converge within a training time similar to that of MASS.

Pre-training tasks are evaluated using perplexity, and the checkpoint with the lowest pre-training perplexity was selected for fine-tuning. We used BLEU [31] for automatic evaluation, and adequacy and fluency for human evaluation. We performed early stopping using 1-gram accuracy and perplexity on the development set. We evaluated the statistical significance of our BLEU scores through bootstrap resampling [18].
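As an illustration of how a bootstrap significance test for a corpus-level metric can be set up, the following is a minimal sketch of paired bootstrap resampling. It is not the evaluation script used in this study: the paired variant, the number of resamples, and the function names are assumptions, and `unigram_precision` is only a simple stand-in metric so that the example is self-contained (it is not BLEU).

```python
import random
from collections import Counter


def unigram_precision(hyps, refs):
    """Stand-in corpus metric: clipped unigram precision (not BLEU)."""
    matched = total = 0
    for hyp, ref in zip(hyps, refs):
        ref_counts = Counter(ref)
        for token, count in Counter(hyp).items():
            matched += min(count, ref_counts[token])
        total += len(hyp)
    return matched / total if total else 0.0


def paired_bootstrap(metric, hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement, same size as test set.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / n_samples


if __name__ == "__main__":
    refs = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
    sys_a = [["a", "b", "c"], ["d", "x"], ["f", "g", "h"]]
    sys_b = [["a", "x", "y"], ["d", "x"], ["z", "g", "h"]]
    print(paired_bootstrap(unigram_precision, sys_a, sys_b, refs))
```

A win fraction of at least 0.95 is commonly read as significance at p < 0.05 in this kind of test; replacing the stand-in with a real corpus-level BLEU function leaves the resampling logic unchanged.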
In addition to MASS, we employ the "text infilling" objective of BART as another main baseline (see footnote 22). We also define two pre-training baselines for comparison with our proposed methods, named multi-span-based MASS (MultiMASS) and deshuffling. Moreover, joint training with MASS and deshuffling is set as the multi-task pre-training baseline. The baselines are as follows:

Baselines without pre-training. First, we employ the vanilla transformer big as the baseline without pre-training, because all of the pre-training methods are based on this model structure. Moreover, following Araabi and Monz [1], we also present the best performance for low-resource NMT using the Transformer model. Hyperparameter details are shown in Appendix B.

MASS. We use the same settings as in Song et al. [47].

BART (text infilling). Different from MASS, BART (text infilling) replaces each of several token spans within a sentence by a single [M], where span lengths are sampled from a Poisson distribution, and the model is also required to predict the lengths of the masked spans. We use the same settings as in Lewis et al. [21] (see footnote 23).

MultiMASS. MultiMASS is a baseline method added to help demonstrate the effectiveness of masking specific syntactic units in a sentence, such as bunsetsu or phrase spans, as we propose in BMASS and PMASS.

Footnote 21: https://github.com/OpenNMT/OpenNMT-py
Footnote 22: Text infilling has been demonstrated to be the most effective pre-training objective for NMT among several objectives in BART [21].
Footnote 23: In order to conduct fair comparisons with our proposed methods, we only present the most effective sub-task, text infilling, within BART. The combination of text infilling and sentence permutation is proven to be the best practice of BART. With regard to sentence permutation, we do not consider it in this paper because it is mainly designed for document NMT.
When it comes to multi-sentence pre-training, sentence permutation and other possible patterns of multi-sentence linguistically-driven pre-training tasks should be explored and compared in future work.

Fig. 8. Example of source and target for deshuffling with the meaning "LoveLive is made of three projects."

As shown in Figure 7, MultiMASS predicts several randomly masked tokens in a sentence, which differs from the single masked span in MASS, the masked bunsetsu spans in BMASS, the several phrase spans in PMASS.P, and the single phrase span in PMASS.S.

Deshuffling. Deshuffling denotes the pre-training task of reconstructing a sentence from its randomly shuffled version, which is also a crucial pre-training task. We use this pre-training task as another baseline to confirm the effectiveness of reordering syntactic units in BRSS and of the head-finalization-driven reordering in HFSS. A pre-training example is presented in Figure 8.

Multi-task Baseline. The multi-task baseline is the combination of the respective best baseline methods from the masked language model and reordering pre-training. Thus, the multi-task baseline consists of MASS and deshuffling. The baseline objective combines the MASS and deshuffling objectives over X, where X represents the monolingual corpora.

We pre-trained our NMT models by leveraging the monolingual data of the source and target languages. For Japanese, we can use MASS, BMASS, or BRSS, whereas for English, we can use MASS, PMASS, or HFSS. For Chinese and Korean, we use only MASS. Specifically, we pre-trained the different types of models listed in Table 2. Note that we use MASS for ENSS because PMASS underperforms MASS by a significant margin (see 6.1).

We fine-tuned the pre-trained models for Japanese-English, English-Japanese, Japanese-Chinese, Chinese-Japanese, English-Korean, and Korean-English translation. We trained the following NMT models:
(1) Ja-En and En-Ja: Japanese-to-English and English-to-Japanese models using from 3k to 50k parallel sentences randomly sampled from ASPEC for fine-tuning.
(2) Ja-Zh and Zh-Ja: Japanese-to-Chinese and Chinese-to-Japanese models using from 3k to 50k parallel sentences randomly sampled from ASPEC and Wikipedia, respectively, for fine-tuning.
(3) En-Ko and Ko-En: English-to-Korean and Korean-to-English models using 20k (randomly sampled) and 94k (full dataset) parallel sentences from News for fine-tuning.

We compared these models with pre-trained model baselines and vanilla baselines, which are fully-supervised NMT models on the same data settings, but without pre-training. In addition, fine-tuning results under the high-resource scenarios (with more than 50k parallel sentences) are provided and discussed in 6.6.

Other baselines for Japanese:
8 MultiMASS (Ja): Based on MASS pre-training, several random tokens are masked rather than one consecutive span.
9 Deshuffling (Ja): Random shuffling-based original sentence reconstruction.
10 MASS+Deshuffling (Ja): Multi-task pre-training baseline for Japanese.
Similar to MASS, we mask an entire phrase span based on the head-driven phrase structure grammar. We performed the experiments for PMASS.P and PMASS.S, respectively.

Tables 3, 4, 5, and 6 present the BLEU results of our proposed methods for Japanese-English, Japanese-Chinese, and English-Korean translation on various translation domains. Subsequently, we provide an in-depth analysis of translation quality in terms of adequacy by using LASER [2], human evaluation scores, and specific cases for the real low-resource scenario of Wikipedia Ja-Zh. Finally, we investigate the pre-training accuracy to analyze the differences between the pre-trained models and how they complement each other, and we present the results in middle/high-resource scenarios.

Table 3. BLEU scores for simulated low/high-resource settings for Japanese-English ASPEC translation using from 3k to 50k parallel sentences for fine-tuning.
Pre-trained models used for fine-tuning are numbered according to their description in Section 5.5. Results better than MASS with statistical significance p < 0.05 are marked with †. Bold denotes the three top scores.

In Tables 3 and 4, we simulate several low-resource settings for Japanese-English and Japanese-Chinese translation on ASPEC with different pre-training datasets; in Tables 5 and 6, we use realistic low-resource settings for Wikipedia Japanese-Chinese translation and News English-Korean translation. Across these tables, we observe that all settings using pre-training outperform those without pre-training (#0 & #0*), which indicates the importance of pre-training. The results also indicate that JASS (#4) and ENSS (#13) are generally better than MASS (#1). With regard to the two main baselines with pre-training, MASS and BART (text infilling), we observe that MASS outperforms BART (text infilling) in most cases, as shown in Table 3. The results of combining BART (text infilling) with our methods are reported in Appendix C. Without pre-training, we observe that using the optimized transformer (#0*) benefits the low-resource setting, as shown by Araabi and Monz [1] and Sennrich and Zhang [44]. However, pre-training can further improve on the optimized baselines without pre-training.

In particular, for Japanese-English translation, BMASS (#2) is comparable to MASS, whereas BRSS (#3 & #3(R)) and their combination, along with JASS (#5), are significantly better than MASS. However, as summarized in Tables 4 and 5, both Japanese-Chinese parallel corpora, which come from different domains, yield significantly better results with our proposed BMASS and BRSS. We observe that only a few Japanese-to-Chinese settings with BRSS yield lower BLEU results than MASS, whereas the other settings using the proposed methods outperform MASS by significant margins. Although MASS is better than BMASS for Japanese-English translation, the reverse can be observed for Japanese-Chinese translation.
This indicates that the effects of the proposed span-masking techniques might depend on specific translation directions and domains. Span-masking techniques that are insensitive to language pairs and domains are worth exploring in the future. As summarized in Tables 3 and 6, our proposed methods of leveraging linguistic knowledge for English yield significantly higher BLEU results when we perform the reordering pre-training task, HFSS (#12). However, the proposed linguistically-driven masked language models PMASS.P (#11) and PMASS.S (#11*) yielded results comparable to several other baseline methods such as MultiMASS (#14) and deshuffling (#15). This suggests that the syntactic span-based masked language model may work only on head-final languages such as Japanese. Considering the weak performance of PMASS, we combined HFSS with MASS for ENSS. The multi-task pre-trained ENSS yielded the highest results in almost all the low-resource settings. We will explore proper chunking techniques for linguistically-driven span-masking pre-training for languages like English in the future. However, in Table 3, when performing universal linguistically-driven pre-training simultaneously for Japanese and English (#17), we did not achieve further significant BLEU improvements. This can be attributed to the increased dependence of NMT on specific linguistic information on a single language side; the joint pre-training does not allow linguistic knowledge transfer across languages and between dissimilar languages. In addition to the main baseline MASS, we also evaluated several other sequence-to-sequence pre-training baselines: MultiMASS (#8 & #14) and deshuffling (#9 & #15), along with their multi-task combinations (#10, #16 & #18), for Japanese and English.
As summarized in Tables 3, 4, 5, and 6, we observe that the proposed masked-style pre-training task, BMASS, and the reordering pre-training tasks, BRSS and HFSS, outperform these baselines by significant margins, thereby indicating that linguistically-driven methods are superior to self-supervised pre-training that does not leverage linguistic features. Moreover, we investigated the percentage of words whose position changed. For Japanese pre-training, the percentages are 79.58% for BRSS and 94.72% for deshuffling. For English pre-training, the percentages are 91.97% for HFSS and 95.22% for deshuffling. Although there is a gap between the percentages of BRSS and deshuffling, the percentages of deshuffling and HFSS are similar, which suggests that the quality of the linguistically generated reordered sentence is much more important than the percentage.

As summarized in Table 3, BRSS-F (English-order to Japanese-order) yielded slightly better results than BRSS-R (vice versa); thus, we only experimented with BRSS-F for the remaining experiments. We suppose that the reason is that training the decoder with the original sentence is more important than training the encoder with it, which is also the reason why BART pre-training [21] treats the original sentence as the target sentence to be predicted by the decoder. In other words, forcing the decoder to generate a natural sentence leads to a better initialized decoder for NMT. HFSS pre-training is performed in an analogous manner for the same reason.

As mentioned above, JASS yields the best results when we consider only linguistically-driven methods for Japanese. After combining the proposed methods for Japanese with MASS (#5∼#7 in Table 3), we observe that combining MASS and BRSS yields results comparable to those of JASS. This indicates the effects of combining masked-style methods and reordering-style methods.
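To make the position-change percentages reported above concrete, the following sketch builds a deshuffling training pair and measures the share of words that moved. It is an illustration under stated assumptions: the text does not specify how the percentage was computed, so here a word counts as moved when the token at its original index differs after reordering, and all function names are hypothetical.

```python
import random


def deshuffle_pair(tokens, rng):
    """Build one deshuffling example: shuffled source, original target."""
    source = list(tokens)
    rng.shuffle(source)
    return source, list(tokens)


def moved_ratio(original, reordered):
    """Fraction of positions holding a different token after reordering.

    Assumption: a word counts as "moved" when the token at its original
    index differs in the reordered sentence.
    """
    if len(original) != len(reordered):
        raise ValueError("sentences must have the same length")
    if not original:
        return 0.0
    moved = sum(1 for a, b in zip(original, reordered) if a != b)
    return moved / len(original)


def corpus_moved_ratio(pairs):
    """Token-weighted ratio over (original, reordered) sentence pairs."""
    total = sum(len(orig) for orig, _ in pairs)
    moved = sum(moved_ratio(orig, new) * len(orig) for orig, new in pairs)
    return moved / total if total else 0.0
```

The same `moved_ratio` function can be applied to any reordering, for example a bunsetsu-level or head-finalized reordering, as long as the reordered sentence is a permutation of the original tokens.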
In Tables 4 and 5, we believe that BMASS is better than MASS for combining with BRSS because of the significant improvements yielded by BMASS. Moreover, as summarized in Tables 4 and 5, we observe that on the ASPEC domain, JASS improves BLEU by up to 2.2 points, whereas on the Wikipedia domain, JASS achieves improvements of up to 7.0 BLEU points. This demonstrates the promising performance of the proposed methods. Meanwhile, this suggests that the improvements obtained by linguistically-driven pre-training methods increase with the overlap between the pre-training domain and the fine-tuning domain. Finally, by comparing the BLEU results in Table 3 with those reported by Mao et al. [24], we find that the BLEU scores of models pre-trained with News Crawl are better than those pre-trained with the Common Crawl monolingual corpus, which shows that pre-training with a high-quality monolingual dataset leads to superior fine-tuning results.

Table 8. Adequacy and fluency of Wikipedia Japanese-Chinese translations using 10k sentences for fine-tuning.

Reference-free MT evaluation evaluates the translation system without using the target reference. Such an evaluation can help circumvent the noise existing in the references of translation targets. After the emergence of multilingual sentence encoders [2], Yankovskaya et al. [55] proposed the use of multilingual sentence embeddings encoded by LASER to implement reference-free MT evaluation. More precisely, we first apply LASER to encode the source sentence and the translated sentence, respectively. Thereafter, the cosine similarity of those two embeddings is used to evaluate the similarity between the source and the translation. This cosine similarity is thus the metric used to evaluate translation adequacy. This approach has two advantages. The first advantage is that target references are not required, as mentioned above.
The other advantage is that any two translation directions can be compared with each other because language-agnostic embeddings are used for evaluation. We report the adequacies in Table 7. First, we observe that methods with pre-training can yield more semantically correct translations than those without pre-training. Second, our proposed methods obtain significantly higher LASER similarity scores than the MASS baseline, particularly on ASPEC Japanese-English, Wikipedia Chinese-Japanese, and News English-Korean translations. Moreover, we can observe that the adequacy results obtained from the LASER embedding-based cosine similarity scores are consistent with the BLEU results.

Table 9. Adequacy and fluency of ASPEC Japanese-English translations using 10k sentences for fine-tuning (columns: Model; BLEU, Adequacy, and Fluency, each for Ja-En and En-Ja).

Following Nakazawa et al. [29], we performed adequacy and fluency evaluations for the Japanese-Chinese and Japanese-English translations when 10k Wikipedia parallel sentences and 10k ASPEC parallel sentences were used for fine-tuning the pre-trained models. We randomly sampled 100 test-set English sentences and blindly evaluated their translations across various models. Each sentence was scored on a scale of 1 to 5, with 1 representing the worst score. The higher the score, the more adequate (meaningful) or fluent (well-formed) the sentence is. The final score was the average of the scores of the 100 sentences. We did not consider the references, but only the sources, for our evaluation. In Tables 8 and 9, we can observe that NMT models, even without pre-training, are capable of generating rather fluent sentences, and that the lack of parallel sentences (low-resource scenario) mainly affects translation adequacy (refer to the extremely low adequacy of models without pre-training).
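The reference-free adequacy score described earlier in this section (cosine similarity between the sentence embeddings of the source and of its translation) can be sketched as follows. The embeddings are assumed to have been produced beforehand by a multilingual sentence encoder such as LASER; no encoder call is shown, the function names are hypothetical, and averaging the sentence-level scores over the test set is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adequacy(source_embeddings, translation_embeddings):
    """Mean source/translation cosine similarity over a test set.

    Each argument is a list of precomputed sentence embeddings, aligned so
    that index i in both lists refers to the same test sentence.
    """
    if len(source_embeddings) != len(translation_embeddings):
        raise ValueError("both lists must cover the same sentences")
    if not source_embeddings:
        raise ValueError("at least one sentence pair is required")
    scores = [cosine(s, t)
              for s, t in zip(source_embeddings, translation_embeddings)]
    return sum(scores) / len(scores)
```

Because only the source and the system output are embedded, the score needs no target reference and can be computed identically for any translation direction.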
Meanwhile, we can observe that our proposed BMASS, BRSS, JASS, HFSS, and ENSS result in large improvements in adequacy and moderate improvements in fluency for both translation directions, whereas PMASS yielded marginal improvements. The improved adequacy compared with that of MASS demonstrates the effectiveness of linguistically-driven pre-training methods. Moreover, we can observe that the results of human evaluation are largely consistent with those of BLEU.

We conducted case studies on Japanese-to-English translation fine-tuned using 10k ASPEC parallel sentences and Chinese-to-Japanese translation fine-tuned using 10k Wikipedia parallel sentences to make the improvements indicated by the BLEU scores visible. As summarized in Tables 10 and 11, we find that the vanilla NMT system trained using 10k parallel sentences without pre-training can hardly produce a meaningful translation. With regard to models with pre-training, we observed that MASS and other baseline models generated several semantically incorrect tokens, even though the entire sentence seemed fluent. In contrast, our proposed methods generate sentences with superior adequacy and fluency, in which fewer keywords are missing.

Pre-training accuracy is the accuracy of the monolingual pre-training tasks, and it can be an indicator of task complexity and pre-training objective performance. Tables 12 and 13 summarize the component-wise and overall pre-training accuracies for various models on the ASPEC Japanese and English development set sentences, respectively. Regarding individual component methods, it can be observed that MASS and PMASS are the harder tasks, given their low accuracy.

# Reference-Ja: 水の性質の多様性について,まず,水分子同士の間に働く力である水素結合と,そのネットワーク構造について解説した。
Various properties of water were explained on hydrogen bonds in which the force works among the water molecules and the network structure.
0 w/o pre-training: This paper introduces the outline of the development of the system, and it is described.
The network structure of the water, hydrogen combination as the power of the water, and the network structure are explained.
On the basis of the water properties, hydrogen coupling and the network structure are explained in the first stage of water.
On the formation of the water, this paper explains hydrogen bond and hydrogen bond, which is connected between the water vapor man fellows.
This paper explains the development of the properties of water and hydrogen combination, which is the power between the moisture man fellows and the network structure.
8 MultiMASS (Ja): This paper explains the rich characteristics of the water and also explains the network structure of the hydrogen joining with the network structure.
9 Deshuffling (Ja): This paper explains the potential of the water in the water, and the network structure that is connected between the water and hydrogen joining.
10 Multi-task baseline (Ja): The active properties of water are explained, and hydrogen combination that is connected to the network structure and the power of the water are explained.
The importance of the properties of water and the network structure, which is the active component of the water, are explained.
The formation of the properties of water is first explained, then hydrogen combination and the network structure between the moisture man.
The importance of the property of the water is first explained: hydrogen combination and the network structure, which is the power for the entire body of the water.
14 MultiMASS (En): The growth of the water is explained, and the network structure and structure are explained through the hydrogen combination and network structure.
15 Deshuffling (En): The network structure of the water properties is explained, and the network structure with hydrogen in the water is described.
16 Multi-task baseline (En): This paper explains the growth of the water properties, and it also explains hydrogen bonding and its network structure with the ability to develop between the water molecules.

Table 13. Component-wise and overall pre-training accuracies on ASPEC English development sentences. Column names "MASS," "PMASS," and "HFSS" denote the pre-training components in the respective model. Note the boost of the HFSS accuracy in multi-task settings, although the opposite could have been expected.

For Japanese, the accuracy of BRSS improves when coupled with BMASS, whereas for English, the accuracies of HFSS and MASS improve when they are combined with each other. Cross-referencing these accuracies with the BLEU scores in Table 3, we observe that an increase in BLEU scores has no significant relationship with the pre-training accuracy. However, masked language model-based pre-training methods (MASS & BMASS) seem to act as an accuracy-improving catalyst for BRSS and HFSS, and this in turn has a positive impact on the translation quality. One possible reason for this is that multi-task training of different pre-training methods helps boost the performance of individual methods. This is in accordance with several previous studies on multi-task training for NMT [9, 21, 23, 39]. Therefore, we recommend such an analysis of multi-objective pre-training methods, as it can help isolate the importance of individual pre-training objectives. Nevertheless, our analyses reveal that the components of JASS (BMASS and BRSS) and the components of ENSS (MASS and HFSS) are responsible for improving translation quality for language pairs involving Japanese or English.

Table 14 reports the BLEU results in middle/high-resource scenarios. Fine-tuning is performed with more than 200k parallel sentences for the respective language pair and domain.
First, by comparing with models without pre-training, we find that pre-training can still lead to some improvements, but much smaller ones than those in low-resource scenarios. Second, we observe that most pre-training methods obtained comparable BLEU results regardless of whether they were linguistically-driven methods or not. This indicates that in middle/high-resource scenarios, the benefit of our proposed methods might be limited, which also shows that linguistically-driven supervision can be utilized to compensate for the lack of parallel sentences.

Table 14. BLEU scores in middle/high-resource scenarios. "ASP" and "Wiki" denote the ASPEC and Wikipedia parallel corpora, respectively.

In this study, we proposed the JASS and ENSS pre-training methods, which leverage information from the syntactic structures of sentences on the basis of language-agnostic pre-training schemes such as MASS for NMT. Our work leveraged abundant monolingual data and syntactic analysis such that the pre-training phase became aware of specific language structures. Our experiments on ASPEC Japanese-English, Japanese-Chinese, Wikipedia Japanese-Chinese, and News English-Korean translations demonstrated that JASS and ENSS outperform MASS and other language-agnostic pre-training methods in most low-resource settings. This demonstrates the importance of injecting language-specific information into the pre-training objective, as well as the benefit of multi-task pre-training with masked-style and reordering objectives. Our adequacy evaluation through LASER, human evaluation, and case study also demonstrated that our methods resulted in a significant improvement in terms of the adequacy and fluency of translations. The analyses of pre-training accuracy reveal the complementary nature of individual tasks within JASS and ENSS. Our future work will focus on implementing linguistically-aware multilingual pre-training using more languages for more robust pre-trained models. We also note that Raffel et al.
[39] demonstrated that several NLP tasks such as text understanding can be reformulated as text-to-text tasks. This broadens the domain of usefulness of sequence-to-sequence pre-training tasks, including ours, and we are interested in evaluating our approach on various NLP tasks.

In this section, we introduce Algorithms 1 and 2 for PMASS.S and PMASS.P, respectively. We utilize the HPSG parsing result (Figure 5 (left)) to detect phrase spans to be masked. For PMASS.S, we can directly detect an entire phrase span to be masked. For PMASS.P, we start from the root of the HPSG parse tree and stochastically mask the left child or the right child; we then move to the unmasked child node to find the next masking candidate. We implement this in a recursive manner.

Following Araabi and Monz [1], we use the hyperparameter settings shown in Table 15 for training the optimized transformer on different parallel data settings. Although optimized hyperparameter settings can significantly improve low-resource NMT, they require a laborious grid search for the optimal setting, whereas fine-tuning NMT based on pre-trained models does not.

In Tables 16, 17, and 18, we report the results of combining BART with our proposed methods for Japanese-English and Japanese-Chinese translations. We observe that BART (text infilling) cannot further improve our proposed methods, which indicates that BART (text infilling) is not complementary to our linguistically-driven multi-task pre-training methods.

Table 18. BLEU scores compared with BART for simulated low-resource settings for Japanese-Chinese Wikipedia translation using from 3k to 50k parallel sentences for fine-tuning. Results better than MASS with statistical significance p < 0.05 are marked with †.
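The recursive PMASS.P procedure described earlier in this appendix (choose the left or right child at random, mask it, then continue in the unmasked child) can be sketched as follows. This is not the authors' Algorithm 2: the nested-tuple tree representation, the function names, and the `budget` stopping rule that caps the number of masked tokens are all assumptions introduced for illustration.

```python
import random


def leaves(node):
    """Return the tokens under a node; a leaf is a str, a phrase is a pair."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def pmass_p_spans(node, start, rng, budget):
    """Collect (start, end) token spans to mask, walking down a binary tree.

    At each phrase node one child is picked at random as the masking
    candidate; it is masked only if it fits the remaining budget
    (hypothetical stopping rule), and the walk continues in the other child.
    """
    if isinstance(node, str):
        return []
    left, right = node
    n_left, n_right = len(leaves(left)), len(leaves(right))
    if rng.random() < 0.5:
        cand_start, cand_len = start, n_left
        keep, keep_start = right, start + n_left
    else:
        cand_start, cand_len = start + n_left, n_right
        keep, keep_start = left, start
    spans = []
    if cand_len <= budget:
        spans.append((cand_start, cand_start + cand_len))
        budget -= cand_len
    return spans + pmass_p_spans(keep, keep_start, rng, budget)


def apply_mask(tokens, spans, mask="[MASK]"):
    """Replace every token inside the given spans with the mask symbol."""
    out = list(tokens)
    for begin, end in spans:
        for i in range(begin, end):
            out[i] = mask
    return out


# Toy binary phrase structure standing in for an HPSG parse.
tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
tokens = leaves(tree)
spans = pmass_p_spans(tree, 0, random.Random(0), budget=3)
masked = apply_mask(tokens, spans)
```

Since the masked child and the child that is visited next never overlap, the returned spans are disjoint, and the total number of masked tokens never exceeds `budget`.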
[1] Optimizing Transformer for Low-Resource Neural Machine Translation
[2] Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
[3] Neural Machine Translation by Jointly Learning to Align and Translate
[4] Constructing a Chinese-Japanese Parallel Corpus from Wikipedia
[5] Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia
[6] Cross-lingual Language Model Pretraining
[7] Exploiting Multilingualism through Multistage Fine-Tuning for Low-Resource Neural Machine Translation
[8] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[9] Multi-Task Learning for Multiple Language Translation
[10] Understanding Back-Translation at Scale
[11] Head Finalization Reordering for Chinese-to-Japanese Machine Translation
[12] Iterative Back-Translation for Neural Machine Translation
[13] Two-Stage Pre-ordering for Japanese-to-English Statistical Machine Translation
[14] Head Finalization: A Simple Reordering Rule for SOV Languages
[15] Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
[16] SpanBERT: Improving Pre-training by Representing and Predicting Spans
[17] OpenNMT: Open-Source Toolkit for Neural Machine Translation
[18] Statistical Significance Tests for Machine Translation Evaluation
[19] Phrase reordering for statistical machine translation based on predicate-argument structure
[20] Improvements of Japanese morphological analyzer JUMAN
[21] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[22] Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information
[23] Multilingual Denoising Pre-training for Neural Machine Translation
[24] JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
[25] Mixed Precision Training
[26] Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model
[27] Addressing Word-order Divergence in Multilingual Neural Machine Translation for Extremely Low Resource Languages
[28] Overview of the 2nd Workshop on Asian Translation
[29] Overview of the 5th Workshop on Asian Translation
[30] ASPEC: Asian Scientific Paper Excerpt Corpus
[31] Bleu: a Method for Automatic Evaluation of Machine Translation
[32] Korean Language Resources for Everyone
[33] Deep Contextualized Word Representations
[34] Fundamentals. Center for the Study of Language and Information
[35] Head-Driven Phrase Structure Grammar
[36] ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
[37] When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation
[38] Improving Language Understanding by Generative Pre-Training
[39] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[40] Explicit Cross-lingual Pre-training for Unsupervised Machine Translation
[41] Linguistic Input Features Improve Neural Machine Translation
[42] Improving Neural Machine Translation Models with Monolingual Data
[43] Neural Machine Translation of Rare Words with Subword Units
[44] Revisiting Low-Resource Neural Machine Translation: A Case Study
[45] Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
[46] Pre-training via Leveraging Assisting Languages for Neural Machine Translation
[47] MASS: Masked Sequence to Sequence Pre-training for Language Generation
[48] ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding
[49] Sequence to Sequence Learning with Neural Networks
[50] Attention is All you Need
[51] Denoising based Sequence-to-Sequence Pre-training for Text Generation
[52] Multi-task Learning for Multilingual Neural Machine Translation
[53] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[54] CSP: Code-Switching Pre-training for Neural Machine Translation
[55] Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings
[56] Exploiting Source-side Monolingual Data in Neural Machine Translation
[57] Semantics-Aware BERT for Language Understanding
[58] Handling Syntactic Divergence in Low-resource Machine Translation
[59] LIMIT-BERT: Linguistics Informed Multi-Task BERT
[60] Transfer Learning for Low-Resource Neural Machine Translation

We sincerely thank Dr. Raj Dabre, Dr. Fabien Cromieres, and Mr. Haiyue Song for their support and insightful comments on the JASS part of this work. This work was supported by Grant-in-Aid for Young Scientists #19K20343, JSPS and Information/AI/Data Science Doctoral Fellowship of Kyoto University.