Ojha, Atul Kr.; Liu, Chao-Hong; Kann, Katharina; Ortega, John; Shatam, Sheetal; Fransen, Theodorus. Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-resource Languages. 2021-08-14.

We present the findings of the LoResMT 2021 shared task, which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The task was organized as part of the fourth workshop on technologies for machine translation of low-resource languages (LoResMT). Parallel corpora are presented and made publicly available for the following directions: English↔Irish, English↔Marathi, and Taiwanese Sign Language↔Traditional Chinese. The training data consist of 8,112, 20,933 and 128,608 segments, respectively. Additional monolingual data sets for Marathi and English consist of 21,901 segments. The results presented here are based on entries from a total of eight teams. Three teams submitted systems for English↔Irish, while five teams submitted systems for English↔Marathi. Unfortunately, there were no system submissions for the Taiwanese Sign Language↔Traditional Chinese task. The highest system performance, measured in BLEU, was 36.0 for English-Irish, 34.6 for Irish-English, 24.2 for English-Marathi, and 31.3 for Marathi-English. The workshop on technologies for machine translation of low resource languages (LoResMT) 1 is a yearly workshop that focuses on scientific research topics and technological resources for machine translation (MT) of low-resource languages.
Building on the success of its three predecessors (Liu, 2018; Karakanta et al., 2019, 2020), the fourth LoResMT workshop introduces a shared task based on COVID-19 and sign language data as part of its research objectives. The hope is to provide assistance with translation for low-resource languages where it could be needed most during the COVID-19 pandemic. To trace the trajectory of the LoResMT shared tasks, a summary of the previous tasks follows. The first LoResMT shared task (Karakanta et al., 2019) took place in 2019. There, monolingual and parallel corpora for Bhojpuri, Magahi, Sindhi, and Latvian were provided as training data for two types of machine translation systems: neural and statistical. As an extension of the first shared task, a second shared task was presented in 2020, which focused on zero-shot approaches for MT systems. This year, the shared task introduces a new objective focused on MT systems for COVID-related texts and sign language. Participants for this shared task were asked to submit novel MT systems for the following language pairs:
• English↔Irish
• English↔Marathi
• Taiwanese Sign Language↔Traditional Chinese
The low-resource languages presented in this shared task were found to have sufficient data for baseline systems to perform translation of the latest COVID-related texts and sign language. Irish, Marathi, and Taiwanese Sign Language can be considered low-resource languages and are translated to or from their high-resource counterparts, English and Traditional Chinese. The rest of our work is organized as follows. Section 2 presents the setup and schedule of the shared task. Section 3 presents the data sets used for the competition. Section 4 describes the approaches used by participants in the competition, and Section 5 presents and analyzes the results obtained by the competitors. Lastly, in Section 6 a conclusion is presented along with potential future work. This section describes how the shared task was organized along with the systems.
Registered participants were sent links to the training, development, and/or monolingual data (refer to Section 3 for more details). They were allowed to use additional data to train their systems, on the condition that any additional data used be made publicly available. Participants were moreover allowed to use publicly available pre-trained word embeddings and linguistic models. To indicate which data sets were used during training, participants were given the following markers:
• "-a": only the provided development, training, and monolingual corpora.
• "-b": any provided corpora, plus publicly available corpora for the language and pre-trained/linguistic models (e.g. systems using pre-trained word2vec, UDPipe, etc.).
• "-c": any provided corpora, plus any external publicly available monolingual corpora.
Each team was allowed to submit any number of systems for evaluation; their best 3 systems were included in the final ranking presented in this report. Each submitted system was evaluated on standard automatic MT evaluation metrics: BLEU (Papineni et al., 2002), CHRF (Popović, 2015) and TER (Post, 2018). The schedule for the release of training and test data, along with notification and submission deadlines, can be found in Table 1.
May 10, 2021: Release of training data
July 01, 2021: Release of test data
July 13, 2021: Submission of the systems
July 20, 2021: Notification of results
July 27, 2021: Submission of shared task papers
August 01, 2021: Camera-ready
In this section, we present background information about the languages and data sets featured in the shared task, along with an itemized view of the linguistic families and number of segments in Table 2.
• English↔Irish
Irish (also known as Gaeilge) has around 170,000 L1 speakers, and "1.85 million (37%) people across the island (of Ireland) claim to be at least somewhat proficient with the language". In the Republic of Ireland, it is the national and first official language.
It is also one of the official languages of the European Union and a recognized minority language in Northern Ireland, with ISO code ga. 2 English-Irish bilingual COVID sentences/documents were extracted and aligned from the following sources: (a) Gov.ie 3 (search for services or information), (b) Ireland's Health Services 4 (HSE.ie), (c) Revenue Irish Taxes and Customs 5, and (d) the European Union 6. In addition, the Irish bilingual training data was built from monolingual data using back-translation (Sennrich et al., 2016). English and Irish monolingual data was compiled from Wikipedia pages and newspapers such as The Irish Times 7, RTE 8 and COVID-19 pandemic in the Republic of Ireland 9. Back-translated and crawled data were cross-validated for accuracy by language experts, leaving approximately 8,112 Irish parallel sentences for the training data set.
• English↔Marathi
Marathi, which has the ISO code mr, is predominantly spoken in India's Maharashtra state. It has around 83,026,680 speakers 10 and belongs to the Indo-Aryan language family. English-Marathi parallel COVID sentences were extracted from Government of India websites and online newspapers such as PMIndia 11, myGOV 12, Lokasatta 13, and BBC Marathi and English 14. After pre-processing and manual validation, approximately 20,933 parallel training sentences were left. Additionally, English and Marathi monolingual sentences were crawled from the online newspapers and Wikipedia (see Table 2).
• Taiwanese Sign Language↔Traditional Chinese
According to the UN, there are "72 million deaf people worldwide... they use more than 300 different sign languages." 15 In Taiwan, Taiwanese Sign Language is a recognized national language, with a population of fewer than thirty thousand "speakers". Taiwanese Sign Language (like Korean Sign Language) evolved from Japanese Sign Language, and the two share about 60% of their "words".
The sign language data set was prepared from press conferences for the COVID-19 response, which were held daily or weekly depending on the pandemic situation in Taiwan. Fig. 1 shows a sample video of sign language and its translations in Traditional Chinese (excerpted from the corpus) and English. As with the training data, the development and test data sets for the English↔Irish and English↔Marathi language pairs were crawled from bilingual and/or monolingual websites. Additionally, some parallel segments and terminology were taken from the Translation Initiative for COVID-19 (Anastasopoulos et al., 2020), a manually translated and validated data set created by professional translators and native speakers of the target languages. Participants in the shared task were provided with these manual translations, of which 502 Irish and 500 Marathi segments were used for development, while 250 (Irish-English), 500 (English-Irish), 500 (English-Marathi) and 500 (Marathi-English) manually translated segments were used for testing. Participants in the Taiwanese Sign Language↔Traditional Chinese task were provided with 3,071 segments and videos for development and 7,053 videos for sign language testing. Detailed statistics of the data set for each language are provided in Table 2. The complete shared task data sets are publicly available 16. A total of 12 teams registered for the shared task: 5 teams registered for all language pairs, 5 teams registered only for English↔Marathi, one team registered for Taiwanese Sign Language↔Mandarin (Traditional Chinese), and one team registered for English↔Irish. Of these, a total of 6 teams submitted systems for the COVID tasks, while none submitted a system for sign language. Of the submitted systems, two teams participated in both the English↔Irish and English↔Marathi tasks, one team participated in English-Irish only, and three teams participated in English↔Marathi only (see Table 3).
All teams who submitted systems were invited to submit system description papers describing their experiments. Next, we give a short description of the approaches used by each team to build their systems. More details about the approaches can be found in the respective teams' papers in the accompanying proceedings.
• IIITT (Puranik et al., 2021) used IndicTrans, a pre-trained fairseq model, for English-Marathi. It consists of two models that translate from Indic languages to English and vice versa. The model covers 11 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu, and is pre-trained on the Samanantar data set, the largest data set for Indic languages at the time of submission. The model was fine-tuned on the training data set provided by the organizers and on a parallel Bible corpus for Marathi, which the team took from a previous task (the MultiIndicMT task at WAT 2020). After conducting various experiments, the best checkpoint was recorded and used for prediction. For Irish, the team fine-tuned an Opus MT model from Helsinki-NLP on the training data set and then produced predictions. After careful experimentation, the team observed that the Opus MT model outperformed their other models, making it their highest-scoring model.
• oneNLP-IIITH (Mujadia and Sharma, 2021) used a sequence-to-sequence neural model with a transformer network (4 to 8 layers) with label smoothing and dropout to reduce overfitting for English-Marathi and Marathi-English. The team explored the use of different linguistic features, such as part-of-speech and morphology, on sub-word units for both directions. In addition, the team explored forward and backward translation using web-crawled monolingual data.
• A3108 (Yadav and Shrivastava, 2021) built a statistical machine translation (SMT) system in both directions for the English↔Marathi language pair.
Their initial baseline experiments used various tokenization schemes to train models. Using the optimal tokenization schemes, the team created synthetic data and trained additional statistical models on the augmented data set. The team also reordered English syntax to match Marathi syntax and trained a further set of baseline and data-augmented models using various tokenization schemes.
• CFILT-IITBombay (Jain et al., 2021) built three different neural machine translation (NMT) systems: a baseline English-Marathi system, a baseline Marathi-English system, and an English-Marathi system based on back-translation. The team explored the performance of NMT systems between English and Marathi, as well as the performance of back-translation using data obtained from NMT systems trained on a very small amount of data. From their experiments, the team observed that back-translation helped improve MT quality over the baseline for English-Marathi.
• UCF (Chen and Fazio, 2021) used transfer learning, uni-gram and sub-word segmentation methods for English-Irish, Irish-English, English-Marathi and Marathi-English. The team conducted their experiments using an OpenNMT LSTM system, constrained to transfer learning and sub-word segmentation on small amounts of training data. Their models achieved BLEU scores of 13.5, 21.3, and 17.9 on the constrained English-Irish, Irish-English, and Marathi-English tracks, respectively.
• adapt dcu (Lankford et al., 2021) used a transformer training approach carried out with OpenNMT-py and sub-word models for English-Irish. The team also explored domain adaptation techniques using a COVID-adapted generic 55k corpus; fine-tuning, mixed fine-tuning, and combined data set approaches were compared with models trained on an extended in-domain data set. As discussed, participants were allowed to use data sets other than those provided.
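As a reference point for the ranking metric, corpus-level BLEU can be sketched in plain Python. This is an illustrative, simplified implementation only (uniform 4-gram weights, a single reference per segment, whitespace tokenization, no smoothing); the shared task itself reported scores using the standard tooling cited above (Papineni et al., 2002; Post, 2018), whose tokenization and signature handling differ.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with uniform n-gram weights, a single
    reference per segment, whitespace tokenization, and no smoothing."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:   # any zero precision collapses unsmoothed BLEU to 0
        return 0.0
    log_precision = sum(math.log(clipped[i] / totals[i]) for i in range(max_n)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)

# A perfect match scores 100 under this scheme.
print(corpus_bleu(["the vaccine is now available"], ["the vaccine is now available"]))
```

Note that without smoothing a hypothesis sharing no 4-gram with its reference scores 0, which is one reason sentence-level comparisons typically use smoothed variants while shared tasks report corpus-level scores.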
The best three results for the English-Irish, Irish-English, English-Marathi and Marathi-English language pairs are presented in Tables 4 and 5. The complete results of the submitted systems are publicly available 17. We have reported the findings of the LoResMT 2021 Shared Task on COVID and sign language translation for low-resource languages, held as part of the fourth LoResMT workshop. All submissions used neural machine translation except for the one from A3108. We conclude that, in our shared tasks, transfer learning, domain adaptation, and back-translation achieve the best results when the data sets are domain-specific and small. Our findings also show that transfer learning with uni-gram segmentation yields comparatively low results on the BLEU, CHRF and TER metrics. The highest BLEU scores achieved are 36.0 for English-to-Irish, 34.6 for Irish-to-English, 24.2 for English-to-Marathi, and 31.3 for Marathi-to-English. In future iterations of the LoResMT shared tasks, extended corpora for the three language pairs will be provided for training and evaluation. Human evaluation of system results will also be conducted. For sign language MT, the tasks will be fine-grained and evaluated separately.
References
Anastasopoulos et al. (2020). TICO-19: the Translation Initiative for COvid-19.
Chen and Fazio (2021). The UCF Systems for the LoResMT 2021 Machine Translation Shared Task.
Jain et al. (2021). Evaluating the Performance of Back-translation for Low Resource English-Marathi Language Pair: CFILT-IITBombay @ LoResMT 2021.
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages.
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages.
Lankford et al. (2021). Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021.
Liu (2018). Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages.
Mujadia and Sharma (2021). English-Marathi Neural Machine Translation for LoResMT 2021.
Findings of the LoResMT 2020 shared task on zero-shot for low-resource languages.
Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation.
Popović (2015). chrF: character n-gram F-score for automatic MT evaluation.
Post (2018). A call for clarity in reporting BLEU scores.
Puranik et al. (2021). Attentive fine-tuning of Transformers for Translation of low-resourced languages @ LoResMT 2021.
Sennrich et al. (2016). Improving neural machine translation models with monolingual data.
Yadav and Shrivastava (2021). A3-108 Machine Translation System for LoResMT Shared Task @ MT Summit 2021 Conference.
Acknowledgements
This publication has emanated from research in part supported by Cardamom (Comparative Deep Models of Language for Minority and Historical Languages), funded by the Irish Research Council under the Consolidator Laureate Award scheme (grant number IRCLA/2017/129), and we are grateful to them for providing English↔Irish parallel and monolingual COVID-related texts. We would like to thank Panlingua Language Processing LLP and Potamu Research Ltd for providing English↔Marathi parallel and monolingual COVID data and Taiwanese Sign Language↔Traditional Chinese linguistic data, respectively.