Facilitating Access to Multilingual COVID-19 Information via Neural Machine Translation

Way, Andy; Haque, Rejwanul; Xie, Guodong; Gaspari, Federico; Popovic, Maja; Poncelas, Alberto

date: 2020-05-01

Abstract: Every day, more people are becoming infected and dying from exposure to COVID-19. Some countries in Europe like Spain, France, the UK and Italy have suffered particularly badly from the virus. Others such as Germany appear to have coped extremely well. Both health professionals and the general public are keen to receive up-to-date information on the effects of the virus, as well as treatments that have proven to be effective. In cases where language is a barrier to access to pertinent information, machine translation (MT) may help people assimilate information published in different languages. Our MT systems trained on COVID-19 data are freely available for anyone to use to help translate information published in German, French, Italian and Spanish into English, as well as in the reverse direction.

The COVID-19 virus was first reported in China in late December 2019, and the World Health Organization declared its outbreak a public health emergency of international concern on 30 January 2020, and subsequently a pandemic on 11 March 2020. 1 Despite initial doubts as to whether it could be passed from human to human, the virus very quickly took hold, with over three million people now infected worldwide, and over 200,000 dying from exposure to it. Some countries responded quickly to the onset of COVID-19, imposing strict barriers on human movement in order to "flatten the curve" and avoid as much transmission of the virus as possible.
Regrettably, others did not take advantage of the lessons learned initially in the Far East, with significant delays in implementing social distancing and concomitant rises in infection and death, particularly among the elderly. Because the disease is airborne, different countries were exposed to the virus at different times; some countries, like China, Austria and Denmark, are starting to relax social-distancing constraints and are allowing certain people back to work and school. Nonetheless, the virus is rampant in many countries, particularly the UK and the US. There is still time to absorb the lessons learned in other regions to try to keep infection and associated deaths at the lower end of projections. However, much salient information online appears in a range of languages, so that access to information is restricted by people's language competencies.

It has long been our view that in the 21st century, language cannot be a barrier to access to information. Accordingly, we decided to build a range of MT systems to facilitate access to multilingual content related to COVID-19. Given that Spain, France and Italy suffered particularly badly in Europe, it was important to include Spanish, French and Italian in our plans, so that lessons learned in those countries could be rolled out in other jurisdictions. Equally, Germany appears to have coped particularly well, so we wanted information written in German to be more widely available. The UK and the US are suffering particularly badly at the time of writing, so English obviously had to be included. Accordingly, this document describes our efforts to build 8 MT systems -FIGS 2 to/from English -using cutting-edge techniques aimed specifically at making available health and related information (e.g. promoting good practice for symptom identification, prevention, treatment, recommendations from health authorities, etc.) concerning the COVID-19 virus both to medical experts and the general public.
In the current situation, the volume of information to be translated is huge and is relevant (hopefully) only for a relatively short span of time; relying wholly on human professional translation is not an option, both in the interest of timeliness and due to the enormous costs that would be involved. By making the engines publicly available, we are empowering individuals to access multilingual information that otherwise might be denied them; by ensuring that the MT quality is comparable to that of well-known online systems, we are allowing users to avail of high-quality MT with none of the usual privacy concerns associated with using online MT platforms. Furthermore, we note interesting strengths and weaknesses of our engines following a detailed comparison with online systems.

The remainder of the paper explains what data we sourced to train these engines, how we trained and tested them to ensure good performance, and our efforts to make them available to the general public. It is our hope that these engines will be helpful in the global fight against this pandemic, so that fewer people are exposed to this disease and its deadly effects.

There are of course a number of challenges and potential dangers involved in the rapid development of MT systems for use by naive and vulnerable users to access potentially sensitive and complex medical information. There have been alarmingly few attempts to provide automatic translation services for use in crisis scenarios. The best-known example is Microsoft's rapid response to the 2010 Haiti earthquake (Lewis, 2010), which in turn led to a cookbook for MT in crisis scenarios (Lewis et al., 2011). In that paper, Lewis et al. (2011) begin by stating that "MT is an important technology in crisis events, something that can and should be an integral part of a rapid-response infrastructure . . . If done right, MT can dramatically increase the speed by which relief can be provided".
They go on to say the following: "We strongly believe that MT is an important technology to facilitate communication in crisis situations, crucially since it can make content in a language spoken or written by a local population accessible to those that do not know the language" [p.501]. They also note [pp.503-504] that "While translation is not [a] widely discussed aspect of crisis response, it is a perennial hidden issue (Disaster 2.0, 2011): 'Go and look at any evaluation from the last ten or fifteen years. Recommendation: make effective information available to the government and the population in their own language. We didn't do it . . . It is a consistent thing across emergencies.' Brendan McDonald, UN OCHA in Disaster 2.0 (2011)."

While it is of course regrettable that natural disasters continue to occur, these days we are somewhat better prepared to respond when humanitarian crises such as COVID-19 occur, thanks to work on translation in crisis situations such as INTERACT. 3 Indeed, Federici et al. (2019) issue a number of recommendations within that project which we have tried to follow in this work. While they apply mainly to human translation provision in crisis scenarios, they can easily be adapted to the use of MT, as in this paper. The main relevant recommendation 4 is that "Emergency management communication policies should include provision for translation", which we take as an endorsement of our approach here. The provision of MT systems has the potential to help: Federici et al. (2019) note that "the right to translated information in managing crises must be a part of living policy and planning documents that guide public agency actions" [2b], and that people have a "right to translated information across all phases of crisis and disaster management" [4a]. We do not believe that either of these has been afforded to the public at large during the current crisis, but our provision of MT engines has the potential to facilitate both of these requirements.
Recommendation [7a] notes that "Translating in one direction is insufficient. Two-way translated communication is essential for meeting the needs of crisis and disaster-affected communities." By allowing translation in both directions (FIGS-to-English as well as English-to-FIGS), we are facilitating two-way communication, which would of course be essential in a patient-carer situation, for example. By making translation available via MT rather than via human experts, we are helping avoid the need for "training for translators and interpreters . . . includ[ing] aspects of how to deal with traumatic situations" [8a], as well as avoiding translators being exposed to traumatic situations altogether.

Finally, as we describe below, we have taken all possible steps to ensure that the quality of our MT systems is as good as it can be at the time of writing. Using leading online MT systems as baselines, we demonstrate comparable performance -and in some cases improvements over some of the online systems -and document a number of strengths and weaknesses of our models as well as of the online engines. We deliberately decided to release our systems as soon as good performance was indicated by both automatic and human measures of translation quality; aiming for fully optimized performance would have defeated the purpose of making the developed MT systems publicly available as soon as possible to try and mitigate the adverse effects of the ongoing international COVID-19 crisis.

Neural MT (NMT: Bahdanau et al. (2014)) is acknowledged nowadays as the leading paradigm in MT system-building. However, compared to its predecessor (Statistical MT: Koehn et al. (2007)), NMT requires even larger amounts of suitable training data in order to ensure good translation quality.
It is well-known in the field that optimal performance is most likely to be achieved by sourcing large amounts of training data which are as closely aligned with the intended domain of application as possible (Pecina et al., 2014; Zhang et al., 2016). Accordingly, we set out to assemble large amounts of good-quality data dedicated to COVID-19 and the wider health area in the language pairs of focus.

There have been suggestions (e.g. on Twitter) that the data found in some of the above-mentioned corpora, especially those that were recently released to support rapid development initiatives such as the one reported in this paper, is of variable and inconsistent quality, especially for some language pairs. For example, a quick inspection of samples of the TAUS Corona Crisis Corpus for English-Italian revealed that there are identical sentence pairs repeated several times, and numerous segments that do not seem to be directly or explicitly related to the COVID-19 infection per se, but appear to be about medical or health-related topics in a very broad sense; in addition, occasional irrelevant sentences about computers being infected by software viruses come up, which may point to unsupervised text collection from the Web. Nonetheless, we are of course extremely grateful for the release of the valuable parallel corpora that we were able to use effectively for the rapid development of the domain-specific MT systems that we make available to the public as part of this work; we note these observations here as they may be relevant to other researchers and developers who may be planning to make use of such resources for other corpus-based applications.

This section describes how we built a range of NMT systems using the data described in the previous section. We evaluate the quality of the systems using BLEU (Papineni et al., 2002) and chrF (Popović, 2015).
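The kind of corpus inspection mentioned above, such as spotting repeated sentence pairs in a parallel corpus, is easy to automate. The following is a minimal sketch; the function name and the toy sentence pairs are illustrative, not drawn from the actual TAUS data:

```python
from collections import Counter

def duplicate_report(pairs):
    """Count exact-duplicate (source, target) pairs in a parallel corpus.

    `pairs` is a list of (source_sentence, target_sentence) tuples.
    Returns (number of unique pairs, number of surplus duplicate copies).
    """
    counts = Counter(pairs)
    unique = len(counts)
    surplus = sum(c - 1 for c in counts.values() if c > 1)
    return unique, surplus

# A toy corpus in which one pair appears three times:
corpus = [
    ("Lavarsi spesso le mani.", "Wash your hands often."),
    ("Lavarsi spesso le mani.", "Wash your hands often."),
    ("Lavarsi spesso le mani.", "Wash your hands often."),
    ("Restare a casa.", "Stay at home."),
]
unique, surplus = duplicate_report(corpus)
print(unique, surplus)  # 2 2  (two unique pairs, two surplus copies)
```

In practice one would run such a check before training and deduplicate, since repeated pairs bias both the model and any test set drawn from the same corpus.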
Both are match-based metrics (so the higher the score, the better the quality of the system) where the MT output is compared against a human reference. The way this is typically done is to 'hold out' a small part of the training data as test material, 9 where the human-translated target sentence is used as the reference against which the MT hypothesis is compared. BLEU does this by computing n-gram overlap, i.e. how many words and 2-, 3- and 4-word sequences are contained in both the reference and hypothesis. 10 There is some evidence (Shterionov et al., 2018) that word-based metrics such as BLEU are unable to sufficiently demonstrate the difference in performance between MT systems, 11 so in addition we use chrF (Popović, 2015), a character-based metric which is more discriminative: instead of matching word n-grams, it matches character n-grams (up to 6).

9 Including test data in the training material will unduly influence the quality of the MT system, so it is essential that the test set is disjoint from the data used to train the MT engines.
10 Longer n-grams carry more weight, and a penalty is applied if the MT system outputs translations which are 'too short'.
11 For some of the disadvantages of using string-based metrics, we refer the reader to Way (2018).

In order to build our NMT systems, we used the MarianNMT (Junczys-Dowmunt et al., 2018) toolkit. The NMT systems are Transformer models (Vaswani et al., 2017). In our experiments we followed the recommended best set-up from Vaswani et al. (2017). The tokens of the training, evaluation and validation sets are segmented into sub-word units using the byte-pair encoding technique of Sennrich et al. (2016b); we performed 32,000 join operations. Our training set-up is as follows. Both the encoder and the decoder comprise 6 layers. As in Vaswani et al. (2017), we employ residual connections around layers (He et al., 2015), followed by layer normalisation (Ba et al., 2016).
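Returning briefly to the evaluation metrics introduced above: the character n-gram matching at the heart of chrF can be sketched in a few lines. This is a toy illustration only; the official chrF implementation handles tokenisation, whitespace and n-gram weighting more carefully:

```python
from collections import Counter

def char_ngram_f(hypothesis, reference, max_n=6, beta=2.0):
    """Toy chrF-style score: average F-score over character n-grams, n=1..max_n.

    beta > 1 weights recall more heavily than precision, as in chrF.
    Illustrative sketch only, not the reference implementation.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # string shorter than n: skip this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

print(round(char_ngram_f("the cat sat", "the cat sat"), 2))  # 1.0 for identical strings
```

Because matching happens at the character level, partial credit is given for near-miss word forms (e.g. an inflection error), which is one reason chrF discriminates between systems better than word-level BLEU.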
The weight matrix between embedding layers is shared, similar to Press and Wolf (2016). Dropout (Gal and Ghahramani, 2016) between layers is set to 0.1. We use mini-batches of size 64 for updates. The models are trained with the Adam optimizer (Kingma and Ba, 2014), with the learning rate set to 0.0003 and the training corpora reshuffled for each epoch. As in Vaswani et al. (2017), we also use the learning-rate warm-up strategy for Adam. Validation on the development set is performed using three cost functions: cross-entropy, perplexity and BLEU. The early-stopping criterion is based on cross-entropy; however, the final NMT system is selected as per the highest BLEU score on the validation set. The beam size for search is set to 12.

The performance of our engines is described in Table 2. Testing on a set of 1,000 held-out sentence pairs, we obtain a BLEU score of 50.28 for IT-to-EN. For the other language pairs, we also see good performance, with all engines bar EN-to-DE -a notoriously difficult language pair -obtaining BLEU scores in the range of 44-50, already indicating that these rapidly-developed MT engines could be deployed 'as is' with some benefit to the community. On separate smaller test sets specifically on COVID-19 recommendations (cf. Table 3, where the 'Reco' test data ranges from 73-105 sentences), performance is still reasonable. For IT-to-EN and EN-to-IT, the performance even increases a little according to both BLEU and chrF, but in general translation quality drops off by 10 BLEU points or more, notwithstanding the fact that this test set is much shorter than the TAUS set. Note that, as expected, scores for translation into English are consistently higher compared to translation from English. This is because English is relatively morphologically poorer than the other four languages, and it is widely recognised that translating into morphologically-rich languages is a (more) difficult task.

How Do We Compare Against Leading Online MT Systems?
In order to examine the competitiveness of our engines against leading online systems, we also translated the Reco test sets using Google Translate, 12 Bing Translator 13 and Amazon Translate. 14 The results for the 'into-English' use-cases appear in Table 4. As can be seen, for DE-to-EN, in terms of BLEU score, Bing is better than Google, and 2.1 points (6% relative) better than Amazon. In terms of chrF, Amazon is still clearly in 3rd place, but this time Google's translations are better than those of Bing. For both IT-to-EN and FR-to-EN, Google outperforms both Bing and Amazon according to both BLEU and chrF, with Bing in 2nd place. However, for ES-to-EN, Amazon's systems obtain the best performance according to both BLEU and chrF, followed by Google and then Bing.

If we compare the performance of our engines against these online systems, in general, for all four language pairs, our performance is worse; for DE-to-EN, the quality of our MT system in terms of BLEU is better than Amazon's, but for the other three language pairs we are anything from 0.5 to 10 BLEU points worse. This demonstrates clearly how strong the online baseline engines are on this test set. However, in the next section, we run a set of experiments which show improved performance of our systems, in some cases comparable to that of these online engines.

The previous sections demonstrate clearly that, with some exceptions, on the whole our baseline engines significantly underperform compared to the major online systems. However, in an attempt to improve the quality of our engines, we extracted parallel sentences from large bitexts that are similar in style to the texts we aim to translate, and used them to fine-tune our baseline MT systems. For this, we followed the state-of-the-art sentence-selection approach of Axelrod et al. (2011), which extracts 'pseudo in-domain' sentences from large corpora using bilingual cross-entropy difference over each side of the corpus (source and target).
The bilingual cross-entropy difference is computed by querying in- and out-of-domain (source and target) language models. Since our objective is to facilitate translation services concerning COVID-19 for the general public, non-experts and medical practitioners, we wanted our in-domain language models to be built on such data. Accordingly, we crawled sentences from a variety of government websites (e.g. the HSE, 15 the NHS, 16 Italy's Ministry for Health and National Institute of Health, 17 the Ministry of Health of Spain 18 and the French National Public Health Agency 19 ) that offer recommendations and information on COVID-19. In Table 5, we report the statistics of the crawled corpora used for in-domain language-model training. From now on, we refer to the large general-domain bitext used for sentence selection as the "ParaWiki Corpus".

First, we chose the Italian-to-English translation direction to test our data-selection strategy on the EMEA and ParaWiki corpora. Once the best set-up had been identified, we would use the same approach to prepare improved MT systems for the other translation pairs. We report the BLEU and chrF scores obtained by a range of different Italian-to-English NMT systems on both the TAUS and Reco test sets in Table 6. The second row in Table 6 shows the Italian-to-English baseline NMT system fine-tuned on the training data from the TAUS Corona Crisis Corpus along with 300K sentence pairs from the EMEA Corpus selected via the aforementioned sentence-selection strategy. We see that this strategy brings about moderate gains on both test sets over the baseline; however, these gains are not statistically significant (Koehn, 2004).

Currey et al. (2017) generated synthetic parallel sentences by copying target sentences to the source. This method was found to be useful where scripts are identical across the source and target languages.
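The copying strategy of Currey et al. (2017) amounts to little more than the following sketch; the helper name and the toy sentences are illustrative:

```python
def copy_augment(parallel_pairs, target_monolingual):
    """Augment parallel data with synthetic pairs whose 'source' side is a
    verbatim copy of the target sentence (after Currey et al., 2017).
    Useful when terms such as 'COVID-19' appear identically in both languages.
    """
    synthetic = [(t, t) for t in target_monolingual]
    return parallel_pairs + synthetic

# Toy data: one genuine pair plus one monolingual English sentence.
parallel = [("Lavarsi le mani.", "Wash your hands.")]
mono_en = ["COVID-19 is caused by the SARS-CoV-2 virus."]
augmented = copy_augment(parallel, mono_en)
print(len(augmented))  # 2
```

The appeal of this method in a rapid-response setting is precisely its cheapness: unlike back-translation, it requires no extra model and no decoding pass over the monolingual corpus.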
In our case, since the Sketch Engine Corpus is prepared from the COVID-19 Open Research Dataset (CORD-19), 23 it includes keywords and terms related to COVID-19, which are often used verbatim across the languages of study in this paper, so we contend that this strategy can help translate terms correctly. Accordingly, we carried out an experiment in which we added sentences of the Sketch Engine Corpus (an English monolingual corpus) to the TAUS Corona Crisis Corpus following the method of Currey et al. (2017), and fine-tuned the model on this training set. The scores for the resulting NMT system are shown in the third row of Table 6. While this method also brings about a moderate improvement on the Reco test set, it is not statistically significant. Interestingly, this approach also significantly lowers the system's performance on the TAUS test set, according to both metrics. Nonetheless, given the urgency of the situation in which we found ourselves, where the MT systems needed to be built as quickly as possible, the approach of Currey et al. (2017) can be viewed as a worthwhile alternative to the back-translation strategy of Sennrich et al. (2016a), which requires a high-quality back-translation model to be built and the target corpus to be translated into the source language, both of which are time-consuming tasks.

In our next experiment, we took five million low-scoring (i.e. similar to the in-domain corpora) sentence pairs from ParaWiki, added them to the training data, and fine-tuned the baseline model on it. As can be seen from row 4 in Table 6 (i.e. "2 + ParaWiki"), the use of ParaWiki sentences for fine-tuning has a positive impact on the system's performance, as we obtain a statistically significant 3.95 BLEU point absolute gain (corresponding to a 7.74% relative improvement) on the Reco test set over the baseline. This is corroborated by the chrF score.
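The 'low-scoring' criterion used here is the cross-entropy difference of Axelrod et al. (2011): a sentence scores H_in(s) - H_gen(s), and the lowest-scoring sentences are the most 'pseudo in-domain'. The following is a minimal monolingual sketch using add-one-smoothed unigram language models; the bilingual variant sums the same quantity over the source and target sides, and real systems use much stronger n-gram models over real corpora:

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Build an add-one-smoothed unigram model from a list of sentences."""
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    """Per-word cross-entropy of `sentence` under unigram model `lm`."""
    words = sentence.split()
    return -sum(math.log(lm(w)) for w in words) / len(words)

def select_pseudo_in_domain(candidates, in_domain, general, k):
    """Rank candidates by H_in(s) - H_gen(s); lower = more in-domain (Axelrod et al., 2011)."""
    lm_in, lm_gen = unigram_lm(in_domain), unigram_lm(general)
    scored = sorted(candidates,
                    key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_gen, s))
    return scored[:k]

# Toy corpora: health-advice sentences vs. general news.
in_domain = ["wash your hands often", "cover your cough and sneeze"]
general = ["the stock market fell today", "the film opens next week"]
candidates = ["wash your hands before meals", "the market opens next week"]
print(select_pseudo_in_domain(candidates, in_domain, general, 1))
# the hand-washing sentence ranks as most in-domain
```

Subtracting the general-domain cross-entropy penalises sentences that are merely common in any text, so the selected sentences are those that are distinctively in-domain rather than just frequent.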
We then built a training set from all data sources (the EMEA, Sketch Engine and ParaWiki Corpora), and fine-tuned the baseline model on the combined training data. This brings about a further statistically significant gain on the Reco test set of 6.75 BLEU points absolute, corresponding to a 13.2% relative improvement (cf. the fifth row of Table 6, "3 + 4").

Our next experiment involved adding a further three million sentences from ParaWiki. This time, the model trained on the augmented training data performs similarly to the model without the additional data (cf. the sixth row of Table 6, "3 + 4*"), which is disappointing. Our final model is built by ensembling all eight models sampled from the training run of the "3 + 4" system (cf. the fifth row of Table 6), with model selection based on the highest BLEU score on the validation set. This brings about a further slight improvement in terms of BLEU on the Reco test set. Altogether, the improvement of our best model ('Ensemble' in Table 6, row 7) over our Baseline model is 7.14 BLEU points absolute, a 14% relative improvement. More importantly, while we cannot beat the online systems in terms of BLEU score, we are now in the same ballpark. More encouragingly still, in terms of chrF, our score is higher than both Amazon's and Bing's, although still a little way off that of Google Translate.

Given these encouraging findings, we used the same set-up to build improved engines for the other translation pairs. The results in Table 7 show improvements over the Baseline engines in Table 2 for almost all language pairs on the Reco test sets: for DE-to-EN, the score dips a little, but FR-to-EN improves by 1.42 BLEU points (a 4% relative improvement), and ES-to-EN by 2.09 BLEU points (a 6.7% relative improvement).
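The statistical-significance judgements reported throughout follow Koehn (2004). A sketch of paired bootstrap resampling over hypothetical per-sentence quality scores follows; the real test resamples the test set and recomputes corpus-level BLEU for each system on each resample, but the resampling logic is the same:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling (after Koehn, 2004), sketched over
    per-sentence scores. Returns the fraction of resampled test sets on
    which system A outscores system B; values near 1.0 suggest A's
    advantage is statistically significant."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentence indices with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# Hypothetical per-sentence scores for two systems on a 5-sentence test set.
a = [0.60, 0.70, 0.80, 0.65, 0.75]  # system A
b = [0.50, 0.55, 0.60, 0.50, 0.58]  # system B
print(paired_bootstrap(a, b))  # 1.0: A wins on every resample here
```

Because the same resampled indices are scored for both systems, the comparison is paired: test-set composition varies, but both systems always face identical sentences.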
While we still largely underperform compared to the scores for the online MT engines in Table 4, we are now not too far behind; indeed, for FR-to-EN, our performance is now actually better than Amazon's, by 0.9 BLEU points (a 2.5% relative improvement) and 1.29 chrF points (a 2.4% relative improvement). For the reverse direction, EN-to-DE also drops a little, but EN-to-IT improves by 1.79 BLEU points (3.8% relatively better), EN-to-ES by 1.02 BLEU points (3.2% relatively better), and EN-to-FR by 2.22 BLEU points (6.5% relatively better). These findings are corroborated by the chrF scores.

Using automatic evaluation metrics such as BLEU and chrF allows MT system developers to rapidly test the performance of their engines, experimenting with the data sets available and fine-tuning the parameters to achieve the best-performing set-up. However, it is well known (Way, 2018) that, where possible, human evaluation ought to be conducted in order to verify that the level of quality indicated by the automatic metrics is broadly reliable. Accordingly, for all four 'into-English' language pairs, we performed a human evaluation on 100 sentences from the test set, comparing our system against Google Translate and Bing Translator; we also inspected Amazon's output, but its quality was generally slightly lower, so in the interest of clarity it is not included in the comparisons discussed here.

                    de-en   it-en   es-en   fr-en
better than both     19%      5%      5%      4%
same                 50%     69%     50%     68%
worse than either    15%     26%     18%     17%
not fully clear      16%     15%     27%     11%

Table 8: Percentage of sentences translated by our system which are adjudged to be better, worse or of the same quality compared to Google Translate and Bing Translator.

Translation examples for German-to-English appear in Table 9. Despite the discrepancies in terms of automatic evaluation scores, our system often makes better lexical choices than the online ones (examples 1, 2 and 3).
In many cases, our system outperforms one of the online systems, but not the other (example 4). The main advantage of the online systems is fluency, as shown in example 5, where our system treated the German verb "werden" as a future-tense auxiliary rather than a passive-voice auxiliary. Example 6 represents a very rare case where our engine omitted one part of the source sentence altogether, thus not conveying all of the original meaning in the target translation.

Table 10 shows translation examples for French-to-English. Again, lexical choice is often better in our MT hypotheses (example 1). Sometimes, our system outperforms one of the online systems, but not both (examples 2 and 3). The online systems outperform ours mainly in terms of fluency (examples 4, 5 and 6). In example 4, the online systems select a better preposition; in example 5, our system failed to generate the imperative form and produced a gerund instead; and the output of our system for example 6 is overly literal.

In Table 11 we present examples of Spanish sentences translated into English by the different systems. In the first sentence, we observe that "gravedad" (which in this context should be translated as "serious" when referring to symptoms or illness) is incorrectly translated as "gravity" by Google and Bing, whereas our system proposes the more accurate translation "feel serious for any other symptoms". In the second example, the generated translations for "entra en contacto" are either "come in contact" or "come into contact". However, this structure requires a prepositional phrase (i.e. "with" plus a noun), which the systems do not produce. Note also that in Google's translation there is a mismatch in pronoun agreement (i.e. "hands" is plural and "it" is singular). Our system also produces an additional mistake, as it translates the word "usado" as "used", whereas in this context (i.e. "wear gloves") the term "wear" would be more accurate.
The last two rows of Table 11 present translations where both our system and Google produce translations that are similar to the references. In the third sentence, our system produces the term "guidance" as a translation of "orientaciones", as in the reference, and Google's system generates a similar term, "guidelines". In contrast, Bing produces the word "directions", which can be more ambiguous. In the last example, Bing's system generates the phrase "Please note that", which is not present in the source sentence.

Overall, based on the manual inspection of 100 sentences from the test set, our Italian-to-English MT system shows generally accurate and mostly fluent output with only a few exceptions, e.g. due to the style being occasionally a bit dry. The overall meaning of the sentences is typically preserved and clearly understandable in the English output. In general, the output of our system compares favourably with the online systems used for comparison, and does particularly well as far as correct translation of the specialised terminology is concerned, even though the style of the online MT systems tends to be better overall. The examples for Italian-English are shown in Tables 12 and 13, which include the Italian input sentence, the English human reference translation, and then the output in English provided by Google Translate, Bing and our final system. In example 1, even though the style of our MT system's English output is somewhat cumbersome (e.g.
with the repetition of "cells"), the clarity of the message is preserved, and the translation of all the technical terminology such as "epithelial cells" and "respiratory and gastrointestinal tracts" is correct; interestingly, the MT output of our system pluralizes the noun "tracts", which is singular in the other outputs as well as in the reference human translation, but this is barely relevant to accuracy and naturalness of style.

1) source: Unter den übermittelten COVID-19-Fällen
1) reference: Of notified cases with a COVID-19 infection
   google: Among the transmitted COVID-19 cases
   bing: Among the COVID-19 cases submitted
   dcu (best): Among the reported COVID-19 cases
2) source: in einer unübersehbaren Anzahl von Regionen weltweit.
2) reference (shifted): in many other, not always well-defined regions
   google: in an unmistakable number of regions worldwide.
   bing: in an unmistakable number of regions worldwide.
   dcu (best): in an innumerable number of regions worldwide.
3) source: Bei einem Teil der Fälle sind die Krankheitsverläufe schwer, auch tödliche Krankheitsverläufe kommen vor.
3) reference: Severe and fatal courses occur in some cases
   google: In some of the cases, the course of the disease is difficult, and fatal course of the disease also occurs.
   bing: In some cases, the disease progressions are difficult, and fatal disease histories also occur.
   dcu (best): In some cases, the course of the disease is severe, including fatal cases.
4) source: COVID-19 ist inzwischen weltweit verbreitet.
4) reference: Due to pandemic spread, there is a global risk of acquiring COVID-19.
   google (best): COVID-19 is now widespread worldwide.
   bing (worse): COVID-19 is now widely used worldwide.
   dcu: COVID-19 is now widely distributed worldwide.
5) source: Änderungen werden im Text in Blau dargestellt
5) reference: Changes are marked blue in the text
   google: Changes are shown in blue in the text
   bing: Changes are shown in blue in the text
   dcu (worst): Changes will appear in blue text
6) source: Bei 46.095 Fällen ist der Erkrankungsbeginn nicht bekannt bzw. diese Fälle sind nicht symptomatisch erkrankt
6) reference (missing part): In 46,095 cases, onset of symptoms is unknown {MISSING}
   google: In 46,095 cases, the onset of the disease is unknown or these cases are not symptomatically ill
   bing (best): In 46,095 cases, the onset of the disease is not known or these cases are not symptomatic
   dcu (worst): In 46,095 cases, the onset or absence of symptomatic disease is not known

Table 9: Translation examples for German-to-English

4) source: {MISSING}
4) reference: What to do about the first signs?
   dcu (worst): what to do with the first signs ?
5) source: Tousser ou éternuer dans son coude ou dans un mouchoir
5) reference: Cough or sneeze into your sleeve or a tissue
   google: Cough or sneeze into your elbow or into a tissue
   bing: Cough or sneeze in your elbow or in a handkerchief
   dcu (worst): Coughing or sneezing in your elbow or in a tissue
6) source: C'est le médecin qui décide de faire le test ou non.
6) reference: It is up to a physician whether or not to perform the test.
   google: It's the doctor who decides whether or not to take the test.
   bing: It is the doctor who decides whether or not to take the test.
   dcu (literal): It is the doctor who decides to take the test or not.

Table 10: Translation examples for French-to-English (fragment)

3) (fragment): . . . follow the guidance presented above.
4) source: Tenga en la habitación productos de higiene de manos.
4) reference: Keep hand hygiene products in your room
   google (best): Keep hand hygiene products in the room.
   bing: Please note that hand hygiene products are in the room.
   dcu (best): keep hand hygiene products in the room.

Table 11: Translation examples for Spanish-to-English
Similarly, the style of the MT output of our system is not particularly natural in example 2, where the Italian source has a marked cleft construction that fronts the main action verb but omits part of its subsequent elements, as is frequent in newspaper articles and press releases; this is why the English MT output of our system wrongly contains a seemingly final clause that gives rise to an incomplete sentence, a calque of the Italian syntactic structure, even though the technical terminology is translated correctly (i.e. "strain" for "ceppo"), and the global meaning can still be grasped with a small amount of effort. By comparison, the meaning of Bing's output is very obscure and potentially misleading, and the verb tense used in Google Translate's output is also problematic and potentially confusing.

In the sample of 100 sentences that were manually inspected for Italian-English, only very minor nuances of meaning were occasionally lost in the output of our system, as shown by example 3. Leaving aside the minor stylistic issue of the missing definite article before the noun "infection" in the English output (which our system shares with Bing's output, the rest being identical across the three MT systems), the translation "severe acute respiratory syndrome" in the MT output (the full form of the infamous acronym SARS) for the Italian "sindrome respiratoria acuta grave" seems preferable, and more relevant, than the underspecified rendition given in the reference, "acute respiratory distress syndrome", which is in fact a slightly different condition.
Finally, in example 3, a seemingly minor nuance of meaning is lost in the MT output "in severe cases", as the beginning of the Italian input can be literally glossed as "in the more/most serious cases"; the two forms of the comparative and superlative adjective are formally indistinguishable in Italian, so it is unclear on what basis the reference human translation opts for the comparative form, as opposed to the superlative. However, even in such a case the semantic difference seems minor, and the overall message is clearly preserved in the MT output.
1) source Le cellule bersaglio primarie sono quelle epiteliali del tratto respiratorio e gastrointestinale.
1) reference Epithelial cells in the respiratory and gastrointestinal tract are the primary target cells.
google The primary target cells are the epithelial cells of the respiratory and gastrointestinal tract.
bing The primary target cells are epithelial cells of the respiratory and gastrointestinal tract.
dcu primary target cells are epithelial cells of the respiratory and gastrointestinal tracts.
2) source A indicare il nome è un gruppo di esperti incaricati di studiare il nuovo ceppo di coronavirus.
2) reference The name was given by a group of experts specially appointed to study the novel coronavirus.
google The name is indicated by a group of experts in charge of studying the new coronavirus strain.
bing The name is a group of experts tasked with studying the new strain of coronavirus.
dcu to indicate the name a group of experts in charge of studying the new strain of coronavirus.
3) source Nei casi più gravi, l'infezione può causare polmonite, sindrome respiratoria acuta grave, insufficienza renale e persino la morte.
3) reference In more serious cases, the infection can cause pneumonia, acute respiratory distress syndrome, kidney failure and even death.
google In severe cases, the infection can cause pneumonia, severe acute respiratory syndrome, kidney failure and even death.
bing In severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney failure and even death.
dcu in severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney failure, and even death.
4) source Alcuni Coronavirus possono essere trasmessi da persona a persona, di solito dopo un contatto stretto con un paziente infetto, ad esempio tra familiari o in ambiente sanitario.
4) reference Some Coronaviruses can be transmitted from person to person, usually after close contact with an infected patient, for example, between family members or in a healthcare centre.
google Some Coronaviruses can be transmitted from person to person, usually after close contact with an infected patient, for example between family members or in a healthcare setting.
bing Some Coronaviruses can be transmitted from person to person, usually after close contact with an infected patient, for example among family members or in the healthcare environment.
dcu some coronavirus can be transmitted from person to person, usually after close contact with an infected patient, for example family members or in a healthcare environment.
5) source Il periodo di incubazione rappresenta il periodo di tempo che intercorre fra il contagio e lo sviluppo dei sintomi clinici.
5) reference The incubation period is the time between infection and the onset of clinical symptoms of disease.
google The incubation period represents the period of time that passes between the infection and the development of clinical symptoms.
bing The incubation period represents the period of time between contagion and the development of clinical symptoms.
dcu the incubation period represents the period of time between the infection and the development of clinical symptoms.
6) source Qualora la madre sia paucisintomatica e si senta in grado di gestire autonomamente il neonato, madre e neonato possono essere gestiti insieme.
6) reference Should the mother be asymptomatic and feel able to manage her newborn independently, mother and newborn can be managed together.
google If the mother is symptomatic and feels able to manage the infant autonomously, mother and infant can be managed together.
bing If the mother is paucysy and feels able to manage the newborn independently, the mother and newborn can be managed together.
dcu if the mother is paucisymptomatic and feels able to manage the newborn independently, mother and newborn can be managed together.
As far as example 4 is concerned, the MT outputs are very similar and correspond very closely to the meaning of the input, as well as to the human reference translation. Interestingly, our system's output misses the plural form of the first noun ("some coronavirus" instead of "Some Coronaviruses" as given by the other two MT systems, which is more precise), which gives rise to a slight inaccuracy, even though the overall meaning is still perfectly clear. Another interesting point is the preposition corresponding to the Italian "tra familiari", which is omitted by our system, but translated by Google as "between family members" and by Bing as "among". Overall, omitting the preposition does not alter the meaning, and these differences seem irrelevant to the clarity of the message. Finally, the translation of "ambiente sanitario" (a very vague, underspecified phrase, literally "healthcare environment") is interesting, with both our system and Bing giving "environment", and Google choosing "setting". Note that all three MT outputs seem better than the human reference "healthcare centre", which appears to be unnecessarily specific. With regard to example 5, the three outputs are very similar and equally correct.
The minor differences concern the equivalent of "contagio", which alternates between "infection" (our system and Google's) and the more literal "contagion" (Bing's, which seems equally valid overall, but omits the definite article, so the style suffers a little). Interestingly, Google presents a more direct translation of the original with "time that passes", while our system and Bing's omit the relative clause, which does not add any meaning. All things considered, the three outputs are very similar and, on balance, of equivalent quality. Finally, example 6 shows an instance where the performance of our system is better than that of the other two systems, which is occasionally the case especially with regard to very specialized terminology concerning the COVID-19 disease. The crucial word for the meaning of the sentence in example 6 is the Italian "paucisintomatica", which here describes a mother who has recently given birth; this is a highly technical and relatively rare term, which literally translates as "with/showing few symptoms". Our system translates this correctly with the English equivalent "if the mother is paucisymptomatic", while Google gives the opposite meaning by dropping the prefix, i.e. "If the mother is symptomatic", which would cause a serious misunderstanding, and Bing invents a word with "if the mother is paucysy", which is clearly incomprehensible. Interestingly, the human reference English translation for example 6 gives an overspecified (and potentially incorrect) rendition with "Should the mother be asymptomatic", which is not quite an accurate translation of the Italian original, which refers to mothers showing few symptoms. The remaining translations in example 6 are broadly interchangeable, e.g. with regard to rendering "neonato" (literally "newborn") with "infant" (Google Translate) or "newborn" (our system and Bing's); the target sentences are correct and perfectly clear in all cases.
We expose our MT services as webpages which can be visited by users worldwide. As there are 8 language pairs, in order to ensure sufficiently responsive translation speed we deploy our engines in a distributed manner. The high-level architecture of our online MT systems is shown in Figure 1. The system is composed of a webserver and a separate GPU MT server for each language pair. The webserver provides the user interface and distributes translation tasks. Users visit the online website and submit their translation requests to the webserver, which dispatches each task to the appropriate MT server. That MT server carries out the translation and returns the result to the webserver, which then displays it to the user. Figure 2 shows the current interface to the 8 MT systems. The source and target languages can be selected from drop-down menus. Once the text is pasted into the source panel and the Translate button clicked, the translation is instantaneously retrieved and appears in the target-language pane. We exemplify the system's performance with a sentence taken from Die Welt on 20th March, a sentence dear to all our hearts at the onset of the outbreak of the virus. The translation process is portrayed in more detail in Figure 3. The DNT/TAG/URL module first replaces DNT ('do not translate') items and URLs with placeholders. It also takes care of tags that may appear in the text. The sentence-splitter splits longer sentences into smaller chunks. The tokeniser is an important module in the pipeline, as it is responsible for tokenising words and separating tokens from punctuation symbols. The segmenter module segments words for specific languages. The compound splitter splits compound words for languages such as German, which fuses constituent parts together to form longer distinct words. The character normaliser is responsible for normalising characters.
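The placeholder handling performed by the DNT/TAG/URL module can be sketched as follows. This is a minimal illustration only: the function names and the `__URL0__`-style placeholder scheme are our assumptions, not the actual ADAPT implementation, which also handles tags and other 'do not translate' items.

```python
import re

# Matches web addresses that should pass through translation untouched.
URL_RE = re.compile(r"https?://\S+")

def mask_dnt(text):
    """Replace each URL with a numbered placeholder; return the masked
    text plus a lookup table for later reinstatement."""
    table = {}
    def repl(match):
        key = f"__URL{len(table)}__"
        table[key] = match.group(0)
        return key
    return URL_RE.sub(repl, text), table

def unmask_dnt(text, table):
    """Reinstate the original URLs in the translated output."""
    for key, url in table.items():
        text = text.replace(key, url)
    return text

masked, table = mask_dnt("Siehe https://www.who.int für Details.")
# masked == "Siehe __URL0__ für Details."
# ... the MT engine translates the masked sentence ...
translated = "See __URL0__ for details."
print(unmask_dnt(translated, table))  # See https://www.who.int for details.
```

Masking before translation prevents the NMT engine from mangling or "translating" the URL; the same table then restores the original string verbatim after decoding.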
The lowercaser and truecaser module is responsible for lowercasing or truecasing words. The spellchecker checks for typographical errors and corrects them if necessary. Corresponding to these modules, we need a set of tools which perform the opposite operations: a detruecaser/recaser for detruecasing or recasing words after translation; a character denormaliser for certain languages; a compound rejoiner to produce compound words from subwords when translating into German; a desegmenter for producing a sequence from a set of segments; a detokeniser, which reattaches punctuation marks to words; and the DNT/TAG/URL module, which reinstates 'do not translate' entities, tags and URLs in the output translations. This paper has described the efforts of the ADAPT Centre MT team at DCU to build a suite of 8 NMT systems for use both by medical practitioners and the general public to efficiently access multilingual information related to the COVID-19 outbreak in a timely manner. Using freely available data only, we built MT systems for French, Italian, German and Spanish to/from English using a range of state-of-the-art techniques. In testing these systems, we demonstrated similar, and sometimes better, performance compared to popular online MT engines. In a complementary human evaluation, we demonstrated the strengths and weaknesses of our engines compared to Google Translate and Bing Translator. Finally, we described how these systems can be accessed online, with the intention that people can quickly access important multilingual information that otherwise might have remained hidden from them because of language barriers.
The ADAPT Centre for Digital Content Technology is funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106) and is co-funded under the European Regional Development Fund.