key: cord-0207912-an5zfm0l
authors: Zhou, Zhong; Waibel, Alex
title: Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages
date: 2021-08-16
journal: nan
DOI: nan
sha: a24506d8998166a4e858f022840f21170b9599a4
doc_id: 207912
cord_uid: an5zfm0l

We translate a closed text that is known in advance and available in many languages into a new and severely low resource language. Most human translation efforts adopt a portion-based approach to translate consecutive pages/chapters in order, which may not suit machine translation. We compare the portion-based approach, which optimizes coherence of the text locally, with the random sampling approach, which increases coverage of the text globally. Our results show that the random sampling approach performs better. When training on a seed corpus of ∼1,000 lines from the Bible and testing on the rest of the Bible (∼30,000 lines), random sampling gives a performance gain of +11.0 BLEU using English as a simulated low resource language, and +4.9 BLEU using Eastern Pokomchi, a Mayan language. Furthermore, we compare three ways of updating machine translation models with an increasing amount of human post-edited data through iterations. We find that adding newly post-edited data to training after a vocabulary update, without self-supervision, performs the best. We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low resource language.

1 Introduction

Machine translation has flourished ever since the first computer was made (Hirschberg and Manning, 2015; Popel et al., 2020). Over the years, human translation has been assisted by machine translation to compensate for human bias and translation capacity limitations (Koehn and Haddow, 2009; Li et al., 2014; Savoldi et al., 2021; Bowker, 2002; Bowker and Fisher, 2010; Koehn, 2009). By learning human translation taxonomy and post-editing styles, machine translation borrows many ideas from human translation to improve performance through active learning (Settles, 2012; Carl et al., 2011; Denkowski, 2015). We propose a workflow that brings human translation and machine translation together seamlessly in the translation of a closed text into a severely low resource language, as shown in Figure 1 and Algorithm 1.

Given a closed text that has many existing translations in different languages, we are interested in translating it well into a severely low resource language. Researchers have recently shown achievements in translation using very small seed parallel corpora in low resource languages (Lin et al., 2020; Qi et al., 2018; Zhou et al., 2018a). Construction methods for such seed corpora are therefore pivotal to translation performance. Historically, this construction has mostly been determined by field linguists' experiential and intuitive discretion. Many human translators employ a portion-based strategy when translating large texts. For example, translation of the book "The Little Prince" may be divided into smaller tasks of translating 27 chapters, or even smaller translation units like a few consecutive pages. Each translation unit contains consecutive sentences. Consequently, machine translation often uses seed corpora that are chosen based on human translators' preferences, which may not be optimal for machine translation. We propose to use a random sampling approach to build seed corpora when resources are extremely limited. In other words, when field linguists have limited time and resources, which lines should be given priority?
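To make the contrast concrete, below is a minimal sketch (our illustration, not the authors' code) of how a seed corpus could be selected under the two strategies; names such as sample_seed_corpus and the ~31,000-line total are assumptions used only for this example.

```python
import random

def sample_seed_corpus(num_lines, seed_size=1000, strategy="random", portion_start=0):
    """Return the indices of the lines a human translator would translate first.

    strategy="random"  : spread seed_size lines over the whole text,
                         maximizing global coverage (the approach favored here).
    strategy="portion" : take seed_size consecutive lines, mimicking the
                         traditional chapter-by-chapter workflow that preserves
                         local coherence.
    The same indices can then be used to pull the corresponding rows from every
    source language of the massively parallel text.
    """
    if strategy == "random":
        return sorted(random.sample(range(num_lines), seed_size))
    return list(range(portion_start, min(portion_start + seed_size, num_lines)))

# Illustrative usage on a text of ~31,000 lines:
# seed_ids = sample_seed_corpus(31000, seed_size=1000, strategy="random")
```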
Given a closed text, we propose that it would be beneficial if field linguists first translate ∼1,000 randomly sampled lines, obtain a first machine-translated draft of the whole text, and then post-edit each portion iteratively to produce the final translation, as shown in Algorithm 1. We recognize that portion-based translation is very helpful in producing quality translation with formality, cohesion and contextual relevance. Thus, we do not propose to replace the portion-based approach, but rather to get the best of both worlds and to expedite the translation process, as shown in Figure 1. The main difference between the two approaches is that the portion-based approach focuses on preserving coherence of the text locally, while the random sampling approach focuses on increasing coverage of the text globally. Our results show that the random sampling approach performs better. When training on a seed corpus of ∼1,000 lines from the Bible and testing on the rest of the Bible (∼30,000 lines), random sampling beats the portion-based approach by +8.5 BLEU using English as a simulated low resource language, training on a family of languages built on the distortion measure, and by +1.9 BLEU using a Mayan language, Eastern Pokomchi, training on a family of languages based on the linguistic definition. Using random sampling, machine translation is able to produce a high-quality first draft of the whole text that expedites the subsequent iterations of translation efforts.

Moreover, we compare three different ways of incorporating incremental post-edited data during the translation process. We find that self-supervision using the whole translation draft affects performance adversely and is best avoided. We also show that adding the newly post-edited text to training with a vocabulary update performs the best.

Algorithm 1: Proposed joint human machine translation sequence for a given closed text.
Input: A text of N lines consisting of multiple books/portions, parallel in L source languages
Output: A full translation in the target low resource language, l
0. Initialize translation size n = 0, vocabulary size v = 0, vocabulary update size Δv = 0;
1. Randomly sample S (∼1,000) sentences with vocabulary size v_S for human translators to produce the seed corpus; update n = S, v = v_S;
2. Rank and pick a family of close-by languages by linguistic, distortion or performance metric;
while n < N do
    if Δv > 0 then
        3. Pretrain on the full texts of neighboring languages;
    4. Train on the n sentences of all languages in multi-source multi-target configuration;
    5. Train on the n sentences of all languages in multi-source single-target configuration;
    6. Combine translations from all source languages using the centeredness measure;
    7. Review all books/portions of the translation draft;
    8. Pick a book/portion with Δn lines and Δv more vocabulary;
    9. Complete human post-editing of the portion chosen; update n = n + Δn, v = v + Δv;
return full translation co-produced by human (Steps 1, 7-9) and machine (Steps 0, 2-6) translation;

2 Related Works

Machine translation began about the same time as the first computer (Hirschberg and Manning, 2015; Popel et al., 2020). Over the years, human translators have had different reactions to machine translation advances, mixed with doubt or fear (Hutchins, 2001). Some researchers study human translation taxonomy so that machines can better assist human translation and post-editing efforts (Carl et al., 2011; Denkowski, 2015).
Human translators benefit from machine assistance, as individual human bias and translation capacity limitations are compensated for by large-scale machine translation (Koehn and Haddow, 2009; Li et al., 2014; Savoldi et al., 2021; Bowker, 2002; Bowker and Fisher, 2010; Koehn, 2009). On the other hand, machine translation benefits from professional human translators' context-relevant and culturally appropriate translation and post-editing efforts (Hutchins, 2001). Severely low resource translation is a fitting ground for close human machine collaboration (Zong, 2018; Carl et al., 2011; Martínez, 2003). Many use multiple rich-resource languages to translate to a low resource language using multilingual methods (Johnson et al., 2017; Ha et al., 2016; Firat et al., 2016; Adams et al., 2017; Gillick et al., 2016; Zhou et al., 2018a,b). Some use data selection for active learning (Eck et al., 2005). Some use as few as ∼4,000 lines (Lin et al., 2020; Qi et al., 2018) or ∼1,000 lines (Zhou and Waibel, 2021) of data. Some do not use low resource data at all (Neubig and Hu, 2018; Karakanta et al., 2018). Active learning has long been used in machine translation (Settles, 2012; Ambati, 2012; Eck et al., 2005; Haffari and Sarkar, 2009; González-Rubio et al., 2012; Miura et al., 2016; Gangadharaiah et al., 2009). Random sampling and data selection have been successful (Kendall and Smith, 1938; Knuth, 1991; Clarkson and Shor, 1989; Sennrich et al., 2015; Hoang et al., 2018; He et al., 2016; Gu et al., 2018). The mathematician Donald Knuth uses the population of Menlo Park to illustrate the value of random sampling (Knuth, 1991).

We train our models using a state-of-the-art multilingual transformer by adding language labels to each source sentence (Johnson et al., 2017; Ha et al., 2016; Zhou et al., 2018a,b). We borrow the order-preserving named entity translation method, replacing each named entity with a __NE token (Zhou et al., 2018b), using a multilingual lexicon table that covers 124 source languages and 2,939 named entities (Zhou and Waibel, 2021). For example, the sentence "Somchai calls Juan" is transformed to "__opt_src_en __opt_tgt_ca __NE0 calls __NE1" to translate to Chuj. We use families of close-by languages constructed by ranking the 124 source languages by distortion measure (FAMD), performance measure (FAMP) and linguistic family (FAMO+); the distortion measure ranks languages by decreasing probability of zero distortion, while the performance measure incorporates an additional probability of fertility equalling one (Zhou and Waibel, 2021). Using the families constructed, we pretrain our model first on the whole text of nearby languages, then train on the ∼1,000 lines of low resource data and the corresponding lines in the other languages in a multi-source multi-target fashion. We finally train on the ∼1,000 lines in a multi-source single-target fashion (Zhou and Waibel, 2021).

We combine the translations from all source languages into one. Let the N translations be t_i, i = 1, ..., N, and let the similarity between translations t_i and t_j be S_ij. We rank the translations according to how centered each one is with respect to the others by summing its similarities to the rest, Σ_j S_ij, for i = 1, ..., N. For every sentence, we take the most centered translation, i.e. the one achieving max_i Σ_j S_ij, to build the combined translation output. The expectation of the combined score is higher than that of any of the source languages (Zhou and Waibel, 2021).
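As a concrete reading of this combination step, the sketch below picks, for a single sentence, the candidate translation that maximizes Σ_j S_ij. The paper does not spell out the similarity S_ij in this passage, so a simple character-overlap ratio is used here as a stand-in; any pairwise similarity could be plugged in.

```python
from difflib import SequenceMatcher
from typing import List

def most_centered_translation(candidates: List[str]) -> str:
    """Combine per-sentence outputs from several source languages by choosing
    the candidate with the largest total similarity to all others,
    i.e. the one achieving max_i sum_j S_ij."""
    def sim(a: str, b: str) -> float:
        # Stand-in similarity; the paper's exact S_ij may differ.
        return SequenceMatcher(None, a, b).ratio()

    totals = [sum(sim(t_i, t_j) for t_j in candidates) for t_i in candidates]
    return candidates[totals.index(max(totals))]

# Example with three hypothetical system outputs for one sentence:
# most_centered_translation(["__NE0 calls __NE1", "__NE0 call __NE1", "__NE0 calls to __NE1"])
```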
Our work differs from past research in that we put low resource translation into the broad collaborative scheme of human machine translation. We compare the portion-based approach with the random sampling approach in building seed corpora. We also compare three methods of updating models with an increasing amount of human post-edited data. We add the newly post-edited data to training in three ways: with vocabulary update, without vocabulary update, or additionally incorporating the whole translation draft in a self-supervised fashion. For best performance, we build the seed corpus by random sampling, update the vocabulary iteratively, and add newly post-edited data to training without self-supervision. We also use a larger test set: we test on ∼30,000 lines rather than the ∼678 lines used in existing research.

We propose a joint human machine translation workflow in Algorithm 1 (a minimal code sketch of this loop is given at the end of this section). After pretraining on neighboring languages in Step 3, we iteratively train on the randomly sampled seed corpus of low resource data in Steps 4 and 5. The reason we include both Steps 4 and 5 in our algorithm is that training on both iteratively performs better than training on either one alone (Zhou and Waibel, 2021). Our model produces a translation draft of the whole text. Since the portion-based approach has the advantage in formality, cohesion and contextual relevance, human translators may pick and post-edit the draft portion by portion iteratively. The newly post-edited data, with updated vocabulary, is added to the machine translation models without self-supervision. In this way, machine translation systems rely on quality parallel corpora that are incrementally produced by human translators, and human translators lean on machine translation for a quality translation draft to expedite translation. This creates a synergistic collaboration between human and machine.

We work on the Bible in 124 source languages (Mayer and Cysouw, 2014), and run experiments on English, a simulated low resource language, and Eastern Pokomchi, a Mayan language. We train on ∼1,000 lines of low resource data and on the full texts of all the other languages. We aim to translate the rest of the text (∼30,000 lines) into the low resource language. In pretraining, we use an 80%/10%/10% split for training, validation and testing. In training, we use a 3.3%/0.2%/96.5% split for training, validation and testing. Our test set is thus more than 29 times the size of the training set. We use the book of Luke for the portion-based approach, as suggested by many human translators.

Training ∼100 million parameters on a GeForce RTX 2080 Ti, we employ a 6-layer encoder and a 6-layer decoder with 512 hidden states, 8 attention heads, 512 word vector size, 2,048 hidden units, 6,000 batch size, 0.1 label smoothing, 2.5 learning rate, 0.1 dropout and attention dropout, an early stopping patience of 5 after 190,000 steps, "BLEU" validation metric, "adam" optimizer and "noam" decay method (Klein et al., 2017; Papineni et al., 2002). We increase the patience to 25 for larger data in the second stage of training in Figure 2a and 2b.

Table 3: Performance training on 1,086 lines of Eastern Pokomchi data on FAMO+, FAMD and FAMP. We train using the portion-based approach in Luke, and using random sampling in Rand. During testing, Best is the book with the highest BLEU score, and All is the performance on ∼29,000 lines of test data. We observe that random sampling performs better than the portion-based approach.
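As referenced above, here is a minimal sketch of the Algorithm 1 control flow. It is our own simplification: the three callables stand in for the machine translation stack (Steps 3-6) and the human-in-the-loop steps (Steps 7-9), and their names are illustrative, not from the paper's code.

```python
import random
from typing import Callable, List, Sequence, Set

def joint_translation_loop(
    n_lines: int,
    seed_size: int,
    translate_draft: Callable[[Set[int]], List[str]],                # Steps 3-6: (re)train, then decode the whole text
    choose_portion: Callable[[List[str], Set[int]], Sequence[int]],  # Steps 7-8: human review picks the next book/portion
    post_edit: Callable[[List[str], Sequence[int]], None],           # Step 9: human post-edits that portion
) -> Set[int]:
    """Sketch of the joint human-machine loop: start from a random seed corpus,
    let the machine draft everything, and grow the post-edited data portion by
    portion until the whole text is covered."""
    done = set(random.sample(range(n_lines), seed_size))  # Step 1: random seed corpus
    while len(done) < n_lines:
        draft = translate_draft(done)          # machine drafts all lines using the post-edited lines so far
        portion = choose_portion(draft, done)  # human picks the next book/portion to finalize
        post_edit(draft, portion)              # human post-editing turns those draft lines into gold data
        done |= set(portion)
    return done
```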
In Tables 2 and 3, random sampling gives a performance gain of +8.5 BLEU for English on FAMD and +1.9 BLEU for Eastern Pokomchi on FAMO+. The performance gain for Eastern Pokomchi may be lower because Mayan languages are morphologically rich, complex, isolated and opaque (Aissen et al., 2017; Clemens et al., 2015; England, 2011). English is closely related to many languages due to colonization and globalization, even though it is artificially constrained in size here (Bird, 2020). This may explain why Eastern Pokomchi benefits less.

To simulate human translation efforts in Steps 7 and 8 of Algorithm 1, we rank the 66 books of the Bible by BLEU scores on English's FAMD and Eastern Pokomchi's FAMO+. We assume that the BLEU ranking is available to us to simulate human judgment. In reality, this step is realized by human translators skimming through the translation draft and comparing the performance of different books by intuition and experience. In Section 6, we discuss the limitation of this assumption. The performance ranking of the simulated low resource language may differ from that of the actual low resource language, but the top few books may coincide because of the nature of the text, independent of the language. In our results, we observe that narrative books perform better than philosophical or poetic books. The book of 1 Chronicles performs best for both English and Eastern Pokomchi with random sampling. A possible explanation is that the book of 1 Chronicles is mainly narrative, and contains many named entities that are translated well by the order-preserving lexiconized model. We include the BLEU scores of the best-performing book in Tables 2 and 3. Note that only the "All" scores are comparable between experiments trained on the book of Luke and those trained by random sampling, as they evaluate on the same test set. The best-performing book is the book of 1 Chronicles for random sampling, but the book of Mark or the book of Matthew for experiments trained on the book of Luke; thus, we cannot compare BLEU scores for the best-performing books across experiments. We include them in the tables to show the quality of the translation draft human translators will work on if they proceed to translate the best-performing book.

In Table 4, we compare three different ways of updating the machine translation models by adding a newly post-edited book that human translators produced. We call the baseline without the addition of the new book Seed. Updated-Vocab adds the new book to training with an updated vocabulary, while Old-Vocab skips the vocabulary update. Self-Supervised adds the whole translation draft of ∼30,000 lines to pretraining in addition to the new book; self-supervision here refers to using the model trained on the small seed corpus to translate the rest of the text, which is subsequently used to train the model. We observe that Self-Supervised performs the worst among the three; indeed, it performs even worse than the baseline Seed. This shows that quality is much more important than quantity in severely low resource translation, and that it is better not to add the whole translation draft to pretraining, as it affects performance adversely. On the other hand, both Updated-Vocab and Old-Vocab perform better than Seed and Self-Supervised, and Updated-Vocab performs better than Old-Vocab. An explanation could be that Updated-Vocab has more expressive power with its updated vocabulary. Therefore, in our proposed algorithm, we prefer a vocabulary update in each iteration.
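The three update strategies compared in Table 4 can be summarized schematically as below. This is our own simplification, not the authors' code: the returned flags merely indicate whether the vocabulary is rebuilt (which, per Algorithm 1, triggers renewed pretraining) and whether the full machine draft is added to pretraining.

```python
def assemble_update(seed_corpus, new_post_edited, full_draft, mode="updated-vocab"):
    """Return (training_data, rebuild_vocab, extra_pretraining_data) for one
    iteration, following the three settings of Table 4."""
    training_data = seed_corpus + new_post_edited
    if mode == "updated-vocab":
        return training_data, True, []          # extend the vocabulary with the new book; best setting
    if mode == "old-vocab":
        return training_data, False, []         # keep the old vocabulary (pretraining can be skipped)
    if mode == "self-supervised":
        return training_data, True, full_draft  # also pretrain on the ~30,000-line machine draft; hurts quality
    raise ValueError(f"unknown mode: {mode}")
```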
If the vocabulary has not increased, we may skip pretraining to expedite the process. We show how the algorithm is put into practice for English and Eastern Pokomchi in Figures 2a and 2b. We take the worst-performing 11 books as the held-out test set, and divide the other 55 books of the Bible into 5 portions of 11 books each. We translate the text by using the randomly sampled ∼1,000 lines of seed corpus first, and then proceed with joint human machine translation following Algorithm 1 in 5 iterations with an increasing number of post-edited portions. For English, we observe that philosophical books like "Proverbs" and poetry books like "Song of Solomon" perform very badly in the beginning, but begin to achieve BLEU scores above 20 after adding 11 books of training data. This reinforces our earlier result that ∼20% of the text is sufficient for achieving high-quality translation (Zhou et al., 2018a). However, some books like "Titus" remain difficult to translate even after adding 33 books of training data. This shows that adding data may benefit some books more than others. A possible explanation is that the Bible has multiple authors, and books differ from each other in style and content. Some books are closely related to each other and may benefit from translations of other books, but some are very different and benefit much less. For Eastern Pokomchi, although the performance on the most difficult 11 books never reaches BLEU scores in the 20s as in the English experiments, all books have steadily increasing BLEU scores. Challenges remain for Eastern Pokomchi, a Resource 0 language (Joshi et al., 2020). We hope to work with native Mayan speakers to see how we may improve the results.

We propose to use random sampling to build seed parallel corpora instead of using the portion-based approach in severely low resource settings. Training on ∼1,000 lines, the random sampling approach outperforms the portion-based approach by +8.5 BLEU for English's FAMD, and by +1.9 BLEU for Eastern Pokomchi's FAMO+. We also compare three different ways of updating the machine translation models by adding newly post-edited data iteratively. We find that the vocabulary update is necessary, but that self-supervision by pretraining on the whole translation draft is best avoided.

One limitation of our work is that in real-life scenarios, we do not have the reference text in the low resource language to produce the BLEU scores that decide the post-editing order. Consequently, field linguists need to skim through the draft and decide the post-editing order based on intuition. However, computational models can still help. One potential way to tackle this is to train on ∼1,000 lines from another language with an available text and test on the 66 books. Since our results show that literary genre plays an important role in the performance ranking, it would be reasonable to determine the order using such a "held-out language" and then reuse that order for the target low resource language. In the future, we would like to work with human translators who understand and speak low resource languages. Another concern human translators may have is the creation of randomly sampled seed corpora. To gauge the amount of interest or inertia, we have interviewed some human translators, and many are interested. However, it is unclear whether the human translation quality of randomly sampled data differs from that of the traditional portion-based approach.
We hope to work with human translators closely to determine whether the translation quality difference is manageable. We are also curious how our model will perform on large literary works like "Lord of the Rings" and "Les Misérables", and whether it will translate their philosophical depth and literary complexity well. However, these books often have copyright issues and are not as easily available as the Bible data. We are interested in collaboration with teams who have multilingual data for large texts, especially multilingual COVID-19 data.

References

Cross-lingual word embeddings for low-resource language modeling
The Mayan languages
Active learning and crowdsourcing for machine translation in low resource scenarios
Decolonising speech and language technology
Computer-aided translation technology: A practical introduction
Computer-aided translation. Handbook of translation studies
A taxonomy of human translation styles
Applications of random sampling in computational geometry, II. Discrete & Computational Geometry
Ergativity and the complexity of extraction: A view from Mayan
The alchemist. HarperOne
El Principito: The Little Prince
Machine translation for human translators
Low cost portability for statistical machine translation based on n-gram frequency and TF-IDF
A grammar of Mam, a Mayan language
Multi-way, multilingual neural machine translation with a shared attention mechanism
Active learning in example-based machine translation
Multilingual language processing from bytes
Active learning for interactive machine translation
Meta-learning for low-resource neural machine translation
Toward multilingual neural machine translation with universal encoder and decoder
Active learning for multilingual statistical machine translation
Dual learning for machine translation
Advances in natural language processing
Iterative back-translation for neural machine translation
Les Misérables. C. Lassalle
Machine translation and human translation: in competition or in complementation
Google's multilingual neural machine translation system: Enabling zero-shot translation
The state and fate of linguistic diversity and inclusion in the NLP world
Neural machine translation for low-resource languages without parallel corpora
Randomness and random sampling numbers
OpenNMT: Open-source toolkit for neural machine translation
3:16 Bible texts illuminated
A process study of computer-aided translation
Interactive assistance to human translators using statistical machine translation methods
Dao de jing
Comparison of Google translation with human translation
Pre-training multilingual neural machine translation by leveraging alignment information
Human Translation Versus Machine Translation and Full Post-editing of Raw Machine Translation Output
Creating a massively parallel Bible corpus
Selecting syntactic, non-redundant segments in active learning for machine translation
Rapid adaptation of neural machine translation to new languages
BLEU: a method for automatic evaluation of machine translation
Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals
When and why are pre-trained word embeddings useful for neural machine translation
Harry Potter. The 100 Greatest Literary Characters
Gender bias in machine translation
Improving neural machine translation models with monolingual data
Active learning. Synthesis lectures on artificial intelligence and machine learning
It's in our hands: a rapid, international initiative to translate a hand hygiene song during the COVID-19 pandemic
The Lord of the Rings: One Volume
World bibliography of translation
Dream of the Red Chamber
Massively parallel cross-lingual learning in low-resource target language translation
Paraphrases as foreign languages in multilingual neural machine translation
Family of origin and family of choice: Massively parallel lexiconized iterative pretraining for severely low resource text-based translation
Research on the relations between machine translation and human translation
Multi-source neural translation
Transfer learning for low-resource neural machine translation