title: Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
authors: Nekoto, Wilhelmina; Marivate, Vukosi; Matsila, Tshinondiwa; Fasubaa, Timi; Kolawole, Tajudeen; Fagbohungbe, Taiwo; Akinola, Solomon Oluwole; Muhammad, Shamsuddee Hassan; Kabongo, Salomon; Osei, Salomey; Freshia, Sackey; Niyongabo, Rubungo Andre; Macharm, Ricky; Ogayo, Perez; Ahia, Orevaoghene; Meressa, Musie; Adeyemi, Mofe; Mokgesi-Selinga, Masabata; Okegbemi, Lawrence; Martinus, Laura Jane; Tajudeen, Kolawole; Degila, Kevin; Ogueji, Kelechi; Siminyu, Kathleen; Kreutzer, Julia; Webster, Jason; Ali, Jamiil Toure; Abbott, Jade; Orife, Iroro; Ezeani, Ignatius; Dangana, Idris Abdulkabir; Kamper, Herman; Elsahar, Hady; Duru, Goodness; Kioko, Ghollah; Murhabazi, Espoir; Biljon, Elan van; Whitenack, Daniel; Onyefuluchi, Christopher; Emezue, Chris; Dossou, Bonaventure; Sibanda, Blessing; Bassey, Blessing Itoro; Olabiyi, Ayodele; Ramkilowan, Arshath; Oktem, Alp; Akinfaderin, Adewale; Bashir, Abdallah
date: 2020-10-05

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

Language prevalence in societies is directly bound to the people and places that speak this language. Consequently, resource-scarce languages in an NLP context reflect the resource scarcity in the society from which the speakers originate (McCarthy, 2017). Through the lens of a machine learning researcher, "low-resourced" identifies languages for which few digital or computational data resources exist, often classified in comparison to another language (Gu et al., 2018; Zoph et al., 2016).
However, to the sociolinguist, "low-resourced" can be broken down into many categories: low density, less commonly taught, or endangered, each carrying slightly different meanings (Cieri et al., 2016). In this complex definition, the "low-resourced"-ness of a language is a symptom of a range of societal problems, e.g. that authors oppressed by colonial governments have been imprisoned for writing novels in their languages, impacting publication in those languages (Wa Thiong'o, 1992), or that fewer PhD candidates come from oppressed societies due to low access to tertiary education (Jowi et al., 2018). This results in fewer linguistic resources and fewer researchers from those regions working on NLP for their languages. Therefore, the problem of "low-resourced"-ness relates not only to the available resources for a language, but also to the lack of geographic and language diversity of NLP researchers themselves. The NLP community has awakened to the fact that it has a diversity crisis in terms of limited geographies and languages (Caines, 2019; Joshi et al., 2020): research groups are extending NLP research to low-resourced languages (Guzmán et al., 2019; Hu et al., 2020; Wu and Dredze, 2020), and workshops have been established (Haffari et al., 2018; Axelrod et al., 2019). We scope the rest of this study to machine translation (MT) using parallel corpora only, and refer the reader to Joshi et al. (2019) for an assessment of low-resourced NLP in general. We diagnose the problems of MT systems for low-resourced languages by reflecting on what agents and interactions are necessary for a sustainable MT research process. We identify which agents and interactions are commonly omitted from existing low-resourced MT research, and assess the impact that their exclusion has on the research. To involve the necessary agents and facilitate required interactions, we propose participatory research to build sustainable MT research communities for low-resourced languages. The feasibility and scalability of this method are demonstrated with a case study on MT for African languages, where we present its implementation and outcomes, including novel translation datasets, benchmarks for over 30 target languages contributed and evaluated by language speakers, and publications authored by participants without formal training as scientists.

Cross-lingual Transfer. With the success of deep learning in NLP, language-specific feature design has become rare, and cross-lingual transfer methods have come into bloom (Upadhyay et al., 2016; Ruder et al., 2019) to transfer progress from high-resourced to low-resourced languages (Adams et al., 2017; Wang et al., 2019; Kim et al., 2019). The most diverse benchmark for multilingual transfer by Hu et al. (2020) allows measurement of the success of such transfer approaches across 40 languages from 12 language families. However, the inclusion of languages in the set of benchmarks depends on the availability of monolingual data for representation learning and of previously annotated resources. The content of the benchmark tasks is English-sourced, and human performance estimates are taken from English. Most cross-lingual representation learning techniques are Anglo-centric in their design (Anastasopoulos and Neubig, 2019).

Approaches. Multilingual MT (Dong et al., 2015; Firat et al., 2016a,b; Wang et al., 2020) addresses the transfer of MT from high-resourced to low-resourced languages by training multilingual models for all languages at once.
Aharoni et al. (2019) and Arivazhagan et al. (2019) train models to translate between English and 102 languages, using private data for the 10 most high-resourced African languages and public TED talks (Qi et al., 2018) otherwise. Multilingual training often outperforms bilingual training, especially for low-resourced languages. However, with multilingual parallel data also being Anglo-centric, the capabilities to translate from English versus into English vastly diverge (Zhang et al., 2020). Another recent approach, mBART (Liu et al., 2020), leverages both monolingual and parallel data and also yields improvements in translation quality for lower-resourced languages such as Nepali, Sinhala and Gujarati. While this provides a solution for small quantities of training data or monolingual resources, the extent to which standard BLEU evaluations reflect translation quality is not yet clear, since human evaluation studies are missing.

Targeted Resource Creation. Guzmán et al. (2019) develop evaluation datasets for low-resourced MT between English and Nepali, Sinhala, Khmer and Pashto. They highlight many problems with low-resourced translation, such as tokenization, content selection, and translation verification; they illustrate the increased difficulty of translating from English into low-resourced languages, and the ineffectiveness of accepted state-of-the-art techniques on morphologically rich languages. Despite involving all agents of the MT process (Section 3), the study does not involve data curators or evaluators who understand the languages involved, and resorts to standard MT evaluation metrics. Additionally, how this effort-intensive approach would scale to more than a handful of languages remains an open question.

We reflect on what enables a sustainable process for MT research on parallel corpora in terms of the required agents and interactions, visualized in Figure 1. Content creators, translators, and curators form the dataset creation process, while the language technologists and evaluators are part of the model creation process. Stakeholders (not displayed) create demand for both processes. Stakeholders are people impacted by the artifacts generated by each agent in the MT process, and can typically speak and read the source or the target languages. To benefit from MT systems, the stakeholders need access to technology and electricity. Content Creators produce content in a language, where content is any digital or non-digital representation of language. For digital content, content creators require keyboards and access to technology. Translators, including crowd-workers, researchers, or translation professionals, translate the original content. They must understand the language of the content creator and the target language. A translator needs content to translate, provided by content creators. For digital content, the translator requires keyboards and technology access. Curators are defined as individuals involved in the content selection for a dataset (Bender and Friedman, 2018), requiring access to content and translations. They should understand the languages in question for quality control and encoding information. Language Technologists are defined as individuals using datasets and computational linguistic techniques to produce MT models between language pairs. Language technologists require language preprocessors, MT toolkits, and access to compute resources.
Evaluators are individuals who measure and analyse the performance of an MT model, and therefore need knowledge of both source and target languages. To report on the performance of models, evaluators require quality metrics as well as evaluation datasets. Evaluators provide feedback to the Language Technologists for improvement.

If we place a high-resourced MT pair such as English-to-French into the process defined above, we observe that each agent nowadays has the necessary resources and historical stakeholder demand to perform their role effectively. A "virtuous cycle" emerged where available content enabled the development of MT systems that in turn drove more translations, more tools, more evaluation and more content, which cycled back to improving MT systems. By contrast, parts of the process for existing low-resourced MT are constrained. Historically, many low-resourced languages had low demand from stakeholders for content creation and translation (Wa Thiong'o, 1992). Due to missing keyboards or limited access to technology, content creators were not empowered to write digital content (Adam, 1997; van Esch et al., 2019). This is a chicken-or-egg problem, where existing digital content in a language would attract more stakeholders, which would incentivize content creators (Kaffee et al., 2018). As a result, primary data sources for NLP research, such as Wikipedia, often have only a few hundred articles for low-resourced languages despite large speaker populations; see Table 1. Due to limited demand, existing translations are often domain-specific and small in size, such as the JW300 corpus (Agić and Vulić, 2019), whose content was created for missionary purposes. When data curators are not part of the societies where these languages originate, they are often unable to identify data sources or translators for the languages, prohibiting them from checking the validity of the created resource. This creates problems in encoding, orthography or alignment, resulting in noisy or incorrect translation pairs (Taylor et al., 2015). This is aggravated by the fact that many low-resourced languages do not have a long written history to draw from, and therefore might be less standardized and may use multiple scripts. In collaboration with content creators, data curators can contribute to standardization, or at least recognize potential issues for data processing further down the line. As discussed in Section 1, language technologists are fewer in low-resourced societies. Furthermore, the techniques developed in high-resourced societies might be inapplicable due to compute, infrastructure or time constraints. Aside from the problem of education and complexity, existing techniques may not apply due to linguistic and morphological differences in the languages, or the scale, domain, or quality of the data (Hu et al., 2020; Pires et al., 2019). Evaluators usually resort to potentially unsuitable automatic metrics due to time constraints or missing connections to stakeholders (Guzmán et al., 2019). The main evaluators of the low-resourced NLP developed today typically cannot use human evaluation because they do not speak the languages, or because reliable crowdsourcing infrastructure is lacking, identified as one of the core weaknesses of previous approaches (Section 2).
In summary, many agents in the MT process for low-resourced languages are missing either invaluable language and societal knowledge, or the necessary technical resources, knowledge, connections, and incentives to form interactions with other agents in the process.

We propose one way to overcome the limitations in Section 3.1: ensuring that the agents in the MT process originate from the countries where the low-resourced languages are spoken or can speak the low-resourced languages. Where this condition cannot be satisfied, at least a knowledge transfer between agents should be enabled. We hypothesize that using a participatory approach will allow researchers to improve the MT process by iterating faster and more effectively. Participatory research, unlike conventional research, emphasizes the value of research partners in the knowledge-production process, where the research process itself is defined collaboratively and iteratively. The "participants" are individuals involved in conducting research without formal training as researchers. Participatory research describes a broad set of methodologies, organised in terms of the level of participation. At the lowest level is crowd-sourcing, where participants are involved solely in data collection. The highest level, extreme citizen science, involves participation in the problem definition, data collection, analysis and interpretation (English et al., 2018). Crowd-sourcing has been applied to low-resourced language data collection (Guevara-Rukoz et al., 2020; Millour and Fort, 2018), but existing studies highlight how the disconnect between the data creation process and the model creation process causes challenges. In seeking to create cross-disciplinary teams that emphasize values in a societal context, a participatory approach which involves participants in every part of the scientific process appears pertinent to solving the problems for low-resourced languages highlighted in Section 3.1.

To show how more involved participatory research can benefit low-resourced language translation, we present a case study in MT for African languages. African languages account for a small fraction of available language resources, and NLP research rarely considers African languages. In the taxonomy of Joshi et al. (2020), African languages are assigned categories ranging from "The Left Behinds" to "The Rising Stars", with most languages not having any annotated data. Even monolingual resources are sparse, as shown in Table 1. In addition to a lack of NLP datasets, the African continent lacks NLP researchers. In 2018, only five of the 2695 affiliations at the five major NLP conferences were from African institutions (Caines, 2019). ∀ et al. (2020) attribute this to a culmination of circumstances, in particular the societal embedding of researchers (Alexander, 2009) and socio-economic factors, hindering participation in research activities and events and leaving researchers disconnected and distributed across the continent. Consequently, existing data resources are harder to discover, especially since these are often published in closed journals or are not digitized (Mesthrie, 1995). For African languages, the implementation of a standard crowd-sourcing pipeline, such as that used for collecting task annotations for English, is currently infeasible due to the challenges outlined in Section 3 and above.
Additionally, no standard MT evaluation set exists for all of the languages in focus, nor are there prior published systems against which we could compare all models for a more insightful human evaluation. We therefore resort to intrinsic evaluation, and rely on this work becoming the first benchmark for future evaluations. We invite the reader to adopt a meta-perspective on this case study as an empirical experiment: the hypothesis is that participatory research can facilitate low-resourced MT development; the experimental methodology is the set of strategies and tools employed to bring together distributed participants, enabling each language speaker to train, contribute, and evaluate their models. The experiment is evaluated in terms of the quantity and diversity of participants and languages, the variety of research artifacts (benchmarks, human evaluations, and publications), and the overall health of the community. While a set of novel human evaluation results is presented, it serves as a demonstration of the value of a participatory approach rather than as the empirical focus of the paper.

To overcome the challenge of recruiting participants, a number of strategies were employed. Starting from local demand at a machine learning school (the Deep Learning Indaba (Engelbrecht, 2018)), meetups and universities, distant connections were made through Twitter, conference workshops, and eventually press coverage and research publications. To account for the limited tertiary education enrollments in Sub-Saharan Africa (Jowi et al., 2018), no prerequisites were placed on researchers joining the project. For the agents outlined in Section 3, no fixed roles are imposed onto participants. Instead, they join with a specific interest, background, or skill aligning them best to one or more agent roles. To obtain cross-disciplinarity, we focus on the communication and interaction between participants to enable knowledge transfer across missing connections (identified in Section 3.1), allowing a fluidity of agent roles. For example, someone who initially joined with the interest of using machine translation for their local language (as a stakeholder) to translate education material might turn into a junior language technologist when equipped with tools, introductory material and mentoring, and guide content creation more specifically towards resources needed for MT.

To bridge large geographical divides, the community lives online. Communication occurs on GitHub and Slack, with weekly video conference meetings and reading groups. Meeting notes are shared openly so that continuous participation is not required and time commitment can be organized individually. Sub-interest groups have emerged in Slack channels to allow focused discussions. Agendas for meetings and reading groups are public and democratically voted upon. In this way, the research questions evolve based on stakeholder demands rather than being imposed by external forces. The lack of compute resources and prior exposure to NLP is overcome by providing tutorials for training a custom-sized Transformer model with JoeyNMT (Kreutzer et al., 2019) on Google Colab. International researchers were not prohibited from joining. As a result, mutual mentorship relations emerged, whereby international researchers with more language technology experience guided research efforts and enabled data curators or translators to become language technologists.
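The Colab tutorials follow one basic recipe: a small Transformer is described in a YAML configuration and trained through JoeyNMT's command-line interface. The following is a minimal sketch of that recipe in Python; the file paths, the placeholder target-language code "xx", the shared vocabulary file and the hyperparameter values are illustrative assumptions rather than the community's exact settings, and configuration keys can differ between JoeyNMT versions.

```python
# Minimal sketch: write a small JoeyNMT Transformer configuration and launch
# training via the JoeyNMT command-line interface, as in the Colab tutorials.
# All paths, the "xx" language code and the hyperparameters are placeholders.
import subprocess
import yaml  # pip install pyyaml

config = {
    "name": "en_xx_transformer",
    "data": {
        "src": "en",                    # source language code
        "trg": "xx",                    # placeholder target language code
        "train": "data/train.bpe",      # BPE-encoded parallel data
        "dev": "data/dev.bpe",
        "test": "data/test.bpe",
        "level": "bpe",
        "lowercase": False,
        "max_sent_length": 100,
        "src_vocab": "data/vocab.txt",  # shared source/target vocabulary
        "trg_vocab": "data/vocab.txt",
    },
    "training": {
        "random_seed": 42,
        "optimizer": "adam",
        "learning_rate": 0.0003,
        "batch_size": 4096,
        "batch_type": "token",
        "epochs": 30,
        "model_dir": "models/en_xx_transformer",
    },
    "model": {
        # A deliberately small Transformer so that it trains on a free Colab GPU.
        "encoder": {"type": "transformer", "num_layers": 6, "num_heads": 4,
                    "embeddings": {"embedding_dim": 256},
                    "hidden_size": 256, "ff_size": 1024, "dropout": 0.3},
        "decoder": {"type": "transformer", "num_layers": 6, "num_heads": 4,
                    "embeddings": {"embedding_dim": 256},
                    "hidden_size": 256, "ff_size": 1024, "dropout": 0.3},
    },
}

with open("transformer_en_xx.yaml", "w") as f:
    yaml.safe_dump(config, f)

# JoeyNMT is invoked as a module: `python -m joeynmt train <config>`.
subprocess.run(["python", "-m", "joeynmt", "train", "transformer_en_xx.yaml"],
               check=True)
```

Keeping the whole experiment in a single configuration file of this kind is also what makes it practical to publish each benchmark with its configuration and preprocessing pipeline for reproducibility, as described below.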
In return, African researchers introduced the international language technologists to African stakeholders, languages and context.

Participants. A growth to over 400 participants of diverse disciplines, from at least 20 countries, has been achieved within the past year, suggesting that the participant recruitment process was effective. Appendix A contains detailed demographics of a subset of participants from a voluntary survey in February 2020. 86.5% of participants responded positively when asked if the community helped them find mentors or collaborators, indicating that the health of the community is positive. This is also reflected in joint research publications by new groups of collaborators.

Research Artifacts. As a result of mentorship and knowledge exchange between agents of the translation process, our implementation of participatory research has produced artifacts for NLP research, namely datasets, benchmarks and models, which are publicly available online. Additionally, over 10 participants have gone on to publish work addressing language-specific challenges at conference workshops (Dossou and Emezue, 2020; Orife, 2020; Orife et al., 2020; Öktem et al., 2020; Van Biljon et al., 2020; Martinus et al., 2020; Marivate et al., 2020).

Dataset Creation. The dataset creation process is ongoing, with new initiatives still emerging. We showcase a few initiatives below to demonstrate how bridging connections between agents facilitates the MT process.
1. Participants, driven by the internal demand to ensure that accessible and representative data of their culture is used to train models, are translating their own writings, including personal religious stories and undergraduate theses, into Yoruba and Igbo.
2. A Namibian participant, driven by a passion to preserve the culture of the Damara, is hosting collaborative sessions with Damara speakers to collect and translate phrases that reflect Damara culture around traditional clothing, songs, and prayers (https://github.com/masakhaneio/masakhane-khoekhoegowab).
3. A connection between a translator in South Africa's parliament and a language technologist has enabled the process of data curation, allowing access to data from the parliament in South Africa's languages (which are public but obfuscated behind internal tools) (http://bit.ly/raw-parliamentarytranslations).
These stories demonstrate the value of including curators, content creators, and translators as participants.

Benchmarks. We publish 45 benchmarks for neural translation models from English into 32 distinct African languages, and from French into two additional languages, as well as from English into three different languages; benchmark scores can be found in Appendix C. Most were trained on the JW300 corpus (Agić and Vulić, 2019), supplemented with data from OPUS (Tiedemann, 2012) such as Tatoeba (https://tatoeba.org/), and with data translated or curated by participants. Language pairs were selected based on the individual demands of each of the 32 participants, who voluntarily contributed the benchmarks they valued most. 16 of the selected target languages are categorized as "Left-behind" and 11 as "Scraping by" in the taxonomy of Joshi et al. (2020). The benchmarks are hosted publicly, including model weights, configurations and preprocessing pipelines, for full reproducibility. The benchmarks are submitted by individual participants or groups of participants in the form of a GitHub pull request. In this way, we ensure that the benchmark contributors can be contacted and that they experience ownership of their contributions.
To our knowledge, there is no prior research on human evaluation specifically for machine translation of low-resourced languages. Until now, NLP practitioners were left with the hope that successful evaluation methodologies for high-resourced languages would transfer well to low-resourced languages. This lack of study is due to the missing connections between the community of speakers (content creators and translators) and the language technologists. MT evaluations by humans are often done either within a group of researchers from the same lab or field (e.g. for WMT evaluations), or via crowdsourcing platforms (Post et al., 2012). Speakers of low-resourced languages are traditionally underrepresented in these groups, which makes such studies even harder (Joshi et al., 2019; Guzmán et al., 2019). One might argue that human evaluation should not be attempted before reaching a viable state of quality, but we found that early evaluation results in an improved understanding of the individual challenges of the target languages, strengthens the network of the community, and, most importantly, improves the connection and knowledge transfer between language technologists, content creators and curators. The "low-resourced"-ness of the addressed languages poses challenges for evaluation beyond interface design or the recruitment of evaluators proficient in the target language. For the example of Igbo, evaluators had to find solutions for typing diacritics without a suitable keyboard. In addition, Igbo has many dialects and variations of which the MT model is uninformed. Medical or technical terminology (e.g., "data") is difficult to translate, and whether to use loan words required discussion. Target-language news websites were found to be useful for resolving standardization or terminology questions. Solutions for each language were shared and were often also applicable to other languages.

Data. The models are trained on JW300 data. To gain real-world quality estimates beyond the religious context, we assess the models' out-of-domain generalization by translating an English COVID-19 survey with 39 questions and statements regarding the pandemic, where the human-corrected and approved translations can directly serve the purpose of gathering responses. The domain is challenging as it contains medical terms and new vocabulary. Furthermore, we evaluate a subset of the Multitarget TED test data (Duh, 2018). The obtained translations enrich the TED datasets, adding new languages for which no prior translations exist. The size of the TED evaluations varies from 30 to 120 sentences. Details are given in Table 3, Appendix B.

Evaluators. Eleven community participants volunteered to evaluate translations in their language(s), often involving family or friends to determine the most correct translations. The evaluator role is therefore taken by both stakeholders and language technologists. Within only 10 days, we gathered a total of 707 evaluated translations covering Igbo (ig), Nigerian Pidgin (pcm), Shona (sn), Luo (luo), Hausa (ha, twice by two different annotators), Kiswahili (sw), Yoruba (yo), Fon (fon) and Dendi (ddn). We did not impose prescriptions in terms of the number of sentences to evaluate or the time to spend, since this was voluntary work, and guidelines or estimates for the evaluation of translations into these languages are non-existent.
Evaluation Technique. Instead of the direct assessment (Graham et al., 2013) often used in benchmark MT evaluations (Barrault et al., 2019; Guzmán et al., 2019), we opt for post-editing. Post-edits are grounded in actions that can be analyzed, e.g. in terms of error types, for further investigation, while direct assessments require expensive calibration (Bentivogli et al., 2018). Embedded in the community, these post-edit evaluations create an asset for the interaction of various agents: for the language technologists for domain adaptation, or for the content creators, curators, or translators for guidance in standardization or domain choice.

Results. Table 2 reports evaluation results in terms of BLEU evaluated on the benchmark test set from JW300, and human-targeted TER (HTER) (Snover et al., 2006), BLEU (Papineni et al., 2002) and ChrF (Popović, 2015) against human-corrected model translations. For ha we find modest agreement between evaluators: Spearman's ρ = 0.56 for sentence-BLEU measurements of the post-edits compared to the original hypotheses. Generally, we observe that the JW300 score is misleading, overestimating model quality (except for yo). Training data size appears to be a more reliable predictor of generalization abilities, illustrating the danger of chasing a single benchmark. However, ig and yo both have comparable amounts of training data and JW300 scores, and both carry diacritics, yet they exhibit very different evaluation performances, in particular on COVID. This can be explained by the large variation within ig discussed above: training data and model output are not consistent with respect to one dialect, while the evaluator had to decide on one. We also find differences in performance across domains, with the TED domain appearing easier for pcm and ig, while the yo model performs better on COVID.

We proposed a participatory approach as a solution to sustainably scaling NLP research to low-resourced languages. Having identified key agents and interactions in the MT development process, we implemented a participatory approach to build a community for African MT. In the process, we discovered successful strategies for distributed growth and communication, knowledge sharing and model building. In addition to publishing benchmarks and datasets for previously understudied languages, we show how the participatory design of the community enables us to conduct a human evaluation study of model outputs, which has been one of the limitations of previous approaches to low-resourced NLP. The sheer volume and diversity of participants, languages and outcomes, and the fact that for many of the featured languages this paper constitutes the first time human evaluation of an MT system has been performed, are evidence of the value of participatory approaches for low-resourced MT. For future work, we will (1) continue to iterate, analyze and widen our benchmarks and evaluations, (2) build richer and more meaningful datasets that reflect the priorities of the stakeholders, (3) expand the focus of the existing community for African languages to other NLP tasks, and (4)

Figure 2 shows the demographics for a subset of participants from a voluntary survey conducted in February 2020. Between then and now (May 2020), the community has grown by 30%, so these figures have to be seen as a snapshot. Nevertheless, we can see that the educational background and occupation are fairly diverse, with a majority of undergraduate students (not necessarily in Computer Science).
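As a complement to the evaluation study above, the following is a minimal sketch of how post-editing-based scores (BLEU, ChrF and HTER against human-corrected translations) and the sentence-level agreement between two evaluators could be reproduced with standard tools, here sacrebleu 2.x and scipy. The sentence lists are illustrative placeholders, and this is an assumed reconstruction of the scoring setup rather than the paper's exact evaluation scripts.

```python
# Minimal sketch of post-editing-based MT evaluation with sacrebleu (>=2.0) and scipy.
# In practice the hypotheses and post-edits would be read from the evaluation files.
import sacrebleu
from scipy.stats import spearmanr

hypotheses = ["model output 1", "model output 2", "model output 3"]              # MT output
post_edits = ["corrected output 1", "corrected output 2", "corrected output 3"]  # human post-edits

# Corpus-level scores against the post-edits. Computing TER against post-edits of
# the system's own output (rather than independent references) is what makes it
# human-targeted TER (HTER).
bleu = sacrebleu.corpus_bleu(hypotheses, [post_edits]).score
chrf = sacrebleu.corpus_chrf(hypotheses, [post_edits]).score
hter = sacrebleu.corpus_ter(hypotheses, [post_edits]).score
print(f"BLEU={bleu:.1f}  ChrF={chrf:.1f}  HTER={hter:.1f}")

# Agreement between two evaluators (as for Hausa, which was evaluated twice):
# score each hypothesis against each evaluator's post-edit with sentence-BLEU,
# then correlate the two score vectors with Spearman's rho.
post_edits_a = post_edits
post_edits_b = ["other correction 1", "other correction 2", "other correction 3"]

bleu_a = [sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hypotheses, post_edits_a)]
bleu_b = [sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hypotheses, post_edits_b)]
rho, _ = spearmanr(bleu_a, bleu_b)
print(f"Spearman's rho between evaluators: {rho:.2f}")
```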
Table 3 reports the number of sentences that were post-edited in the human evaluation study reported in Section 4. The data comes tokenized with Polyglot. The table also features the target categories according to Joshi et al. (2020) as of 28 May 2020.

References.
Content and the web for African development
Cross-lingual word embeddings for low-resource language modeling
JW300: A wide-coverage parallel corpus for low-resource languages
Massively multilingual neural machine translation
Evolving African approaches to the management of linguistic diversity: The ACALAN project
Can crowds build parallel corpora for machine translation systems?
Active learning-based elicitation for semi-supervised word alignment
Should all cross-lingual embeddings speak English?
Massively multilingual neural machine translation in the wild: Findings and challenges
Proceedings of the 2019 Workshop on Widening NLP
Findings of the 2019 conference on machine translation (WMT19)
Data statements for natural language processing: Toward mitigating system bias and enabling better science
Machine translation human evaluation: an investigation of evaluation based on post-editing and its relation with direct assessment
The geographic diversity of NLP conferences
WIT3: Web inventory of transcribed and translated talks
Selection criteria for low resource language programs
The SAWA corpus: A parallel corpus English - Swahili
Multi-task learning for multiple language translation
FFR v1.0: Fon-French neural machine translation
The multitarget TED talks task
Crowdsourcing Latin American Spanish for low-resource text-to-speech
The FLoRes evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English
Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP
XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization
Unsung challenges of building and deploying language technologies for low resource language communities
The state and fate of linguistic diversity and inclusion in the NLP world
Building PhD capacity in sub-Saharan Africa
Mind the (language) gap: generation of multilingual Wikipedia summaries from Wikidata for ArticlePlaceholders
Effective cross-lingual transfer of neural machine translation models without shared vocabularies
Joey NMT: A minimalist NMT toolkit for novices
Multilingual denoising pre-training for neural machine translation
Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi
Neural machine translation for South Africa's official languages
The new digital divide: Language is the impediment to information access
An English to Xitsonga statistical machine translation system for the government domain
Language and social history: Studies in South African sociolinguistics
Toward a lightweight solution for less-resourced languages: Creating a POS tagger for Alsatian using voluntary crowdsourcing
Tigrinya neural machine translation with transfer learning for humanitarian response
Towards neural machine translation for Edoid languages
Improving Yorùbá diacritic restoration
BLEU: a method for automatic evaluation of machine translation
How multilingual is multilingual BERT?
chrF: character n-gram F-score for automatic MT evaluation
A call for clarity in reporting BLEU scores
Constructing parallel corpora for six Indian languages via crowdsourcing
When and why are pre-trained word embeddings useful for neural machine translation?
A survey of cross-lingual word embedding models
A study of translation edit rate with targeted human annotation
Data-in-place: Thinking through the relations between data and community
Parallel data, tools and interfaces in OPUS
Cross-lingual models of word embeddings: An empirical comparison
On optimal transformer depth for low-resource language translation
Decolonising the mind: The politics of language in African literature
Balancing training for multilingual neural machine translation
Cross-lingual BERT transformation for zero-shot dependency parsing
Are all languages created equal in multilingual BERT?
Improving massively multilingual neural machine translation and zero-shot translation
Transfer learning for low-resource neural machine translation

We would like to thank Benjamin Rosman and Richard Klein for their invaluable feedback, as well as the anonymous EMNLP reviewers, and George Foster and Daan van Esch. We would also like to thank Google Cloud for the grant that enabled us to build baselines for languages with larger datasets.