key: cord-0617443-ulfjo3t7
title: Global Readiness of Language Technology for Healthcare: What Would It Take to Combat the Next Pandemic?
authors: Mondal, Ishani; Ahuja, Kabir; Jain, Mohit; O'Neil, Jacki; Bali, Kalika; Choudhury, Monojit (* equal contribution)
date: 2022-04-06
sha: 5099f9d053eb53f55bd73db5b46b8b1836456242
doc_id: 617443
cord_uid: ulfjo3t7

Abstract: The COVID-19 pandemic has brought out both the best and the worst of language technology (LT). On one hand, conversational agents for information dissemination and basic diagnosis have seen widespread use and, arguably, played an important role in combating the pandemic. On the other hand, it has also become clear that such technologies are readily available only for a handful of languages, and the vast majority of the Global South is completely bereft of these benefits. What is the state of LT, especially conversational agents, for healthcare across the world's languages? And what would it take to ensure global readiness of LT before the next pandemic? In this paper, we try to answer these questions through a survey of existing literature and resources, as well as through a rapid chatbot-building exercise for 15 Asian and African languages with varying amounts of resource availability. The study confirms the pitiful state of LT even for languages with large speaker bases, such as Sinhala and Hausa, and identifies the gaps that could help us prioritize research and investment strategies in LT for healthcare.

The world witnessed one of its worst pandemics in early 2020: COVID-19, which has infected over 250 million people globally. Scientists and technologists from various fields joined hands, lending support to deal with this global crisis. Language Technology (LT) played a crucial role in combating the pandemic through the development of healthcare chatbots that facilitate information dissemination (Li et al., 2020; Maniou and Veglis, 2020) and early disease screening (Judson et al., 2020a; Martin et al., 2020b). Nevertheless, today practically useful chatbots and other benefits of LT are available in only a handful of languages (Joshi et al., 2020).

Despite the impressive gains made by massively multilingual Transformer-based language models (MMLMs) (Devlin et al., 2019; Lample and Conneau, 2019; Aharoni et al., 2019; Conneau et al., 2020; Xue et al., 2021) on standard NLP benchmark tasks (Pan et al., 2017; Conneau et al., 2018; Yang et al., 2019; Hu et al., 2020), the real-world implications of such advancements remain largely unexplored. Joshi et al. (2020) highlighted this disparity and classified the world's languages into six classes based on their resource availability, where class 5 represents the most resource-rich languages, for which the benefits of LT are readily available, and class 0 denotes the most under-resourced languages. In this paper, we ask the following two questions: (1) Today, in which languages can we build practically useful LT systems, especially chatbots, that could help us combat a pandemic? (2) How should we prioritize research and resource-building investments so that LT is globally ready before the next pandemic?

In order to answer these questions, we review the existing literature and resources on COVID-19 chatbots and classify them based on the languages they support and the solutions they provide. Quite unsurprisingly, the survey reveals a strong disparity in LT solutions between resource-rich and resource-poor languages.
In order to quantify this gap and measure the pandemic-readiness of various languages today, we select 15 Asian and African languages (in addition to English) with varying degrees of resource availability, and attempt to build COVID-19 FAQ bots for them. Since building an end-to-end chatbot is a substantial engineering effort, we scope the problem down to building an intent classifier for these languages, which forms the core of the Natural Language Understanding (NLU) unit. We also experiment with entity recognition for a subset of these languages. Our study shows that despite using the best available commercial multilingual chatbot frameworks (e.g., Google Dialogflow and Microsoft Bot Framework (MS Bot)), advanced Machine Translation (MT) systems (e.g., https://www.bing.com/translator), and state-of-the-art massively multilingual language models (mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020)), there is a 20-30% drop in performance for class 0-2 languages as compared to English. The drop is large for all the African languages (e.g., Hausa and Somali) and some of the Asian languages (e.g., Marathi and Sinhala). Note that our experiments were limited to languages that are supported by at least one of the chatbot frameworks, MT systems, or MMLMs; there are thousands of other languages which are supported by none.

We extrapolate our findings to a global scale and construct a global LT readiness map for pandemic response and healthcare. Based on this map, as well as error analysis of the chatbot experiments, we identify a set of research problems and resource-prioritization strategies which we believe are key to ensuring global LT readiness before the next pandemic.

The rest of the paper is organized as follows: Sec 2 presents the literature survey of the LT response to COVID-19, focusing on chatbots built for the pandemic; Sec 3 describes the chatbot-building experiments, where Sec 3.1 motivates the choice of languages, and Secs 3.2 and 3.3 discuss the intent and entity detection experiments, respectively; Sec 4 presents the global LT readiness map; and Sec 5 concludes with our recommendations.

In recent years, NLP for healthcare has witnessed a major uptake, and an impressive volume of work has pushed research forward by developing sophisticated domain-specific language models (Alsentzer et al., 2019; Lee et al., 2020; Ji et al., 2021). These models have been adopted to serve different axes of healthcare, such as patient-provider communication (Min et al., 2020; Si et al., 2020), information dissemination (Maniou and Veglis, 2020), and self-care management and therapy (Morris et al., 2018; Kadariya et al., 2019; Park et al., 2019; Kamita et al., 2019). The role of healthcare chatbots becomes crucial along all these axes because of the recent adoption of telehealth technology services (Bhat et al., 2021).

Chatbots have received considerable interest during the recent COVID-19 pandemic. Given the worldwide spread and severity of the virus and the subsequent global response, we believe that the study of COVID-19 chatbots can provide an accurate picture of the global readiness of LT. We, therefore, surveyed COVID-19 chatbots that are mentioned in the literature and/or deployed in the real world.
From the survey, two primary use-cases of COVID-19 chatbots emerge: (1) information dissemination, i.e., answering pandemic-related questions asked by users (Li et al., 2020; Desai, 2021; Prasannan et al., 2020; Mehfooz et al., 2020; Trang and Shcherbakov, 2021), and (2) symptom screening, i.e., assessing risk factors associated with the symptoms provided by the user for quick diagnosis (Ferreira et al., 2020; Martin et al., 2020a; Judson et al., 2020b; Quy Tran et al., 2021). Commercial frameworks such as DialogFlow, Watson Assistant, and MS Bot have been used to build the majority of these chatbots (Li et al., 2020; Sophia and Jacob, 2021). However, open-source bot frameworks like Rasa (Quy Tran et al., 2021; Nguyen and Shcherbakov, 2021) have also been gaining traction in the community. The built-in NLU engines supported by these frameworks make chatbot development easy; hence there has been a significant uptake in utilizing them to develop new chatbots. Pre-trained LMs have also been leveraged for COVID symptom identification (Oniani and Wang, 2020) and question answering (Park et al., 2020).

Which languages are supported by these COVID bots? Of the 20 COVID-related bots mentioned in the existing literature and 34 others deployed by different countries to combat the pandemic, 26 (≈50%) are exclusively in English, followed by German with 10 deployed bots. In Figure 1 we show the distribution of the chatbots by the language classes defined in Joshi et al. (2020). As expected, in all cases we observe that chatbots were available primarily, and almost exclusively, for languages in class 5. We observe a slightly higher presence of class 4 and 3 languages in research papers on COVID chatbots (Fig. 1a); for instance, there are three research papers each for Hindi and Vietnamese, both class 4 languages. To the best of our knowledge, we could not find any publication or deployed bot for class 0 languages. This skew is more prominent when we consider the coverage of languages of different classes, i.e., the fraction of languages in each class for which at least one COVID-19 conversational system was developed (Figures 1b and 1d). This lack of attention to a large number of languages has also been highlighted by Anastasopoulos et al. (2020), who strongly advocated for the development of language resources for improving access to COVID-19 related information in 26 lesser-resourced languages, particularly from Africa and South and South-East Asia.

How quickly can one build a pandemic-response chatbot in a language based on the best publicly available systems? In order to answer this question, we have to understand the pandemic-readiness of various languages. To do this, we made an attempt to build chatbots for answering frequently asked questions about COVID-19 using Google Dialogflow and Microsoft Bot Framework (MS Bot), as well as two of the most popular massively multilingual language models (MMLMs), mBERT and XLM-R. Since building an end-to-end chatbot is complex, we chose to conduct rapid prototyping experiments for intent recognition in 16 languages and entity recognition in 3 languages. For our experiments, we chose a few languages from each language class (as defined by Joshi et al. (2020))
such that at least one language per class is supported by at least one of the two commercial chatbot frameworks, leading to the following set: English and Chinese from class 5; Hindi and Korean from class 4; Bengali and Malay from class 3; Swahili, Hausa, Marathi, Amharic, and Zulu from class 2; Assamese, Gujarati, Kikuyu, and Somali from class 1; and Sinhala from class 0.

Intent recognition is an essential component of conversational systems: given a user query, the task is to classify it into one of the pre-defined intent categories (Braun et al., 2017). For training and evaluation, we curate a set of 147 queries, each categorized into one of 14 intents: 1) Airborne (how COVID spreads by air), 2) ClarifyCovid (difference between COVID and other diseases), 3) Country (country-wise infection statistics), 4) CovidTwice (possibility of reinfection), 5) ExplainSymptom (COVID symptoms), 6) Incubation (how many days of incubation are required), 7) Length (longevity of infection), 8) Mask (ways of wearing a mask), 9) Protection (ways to protect against infection), 10) Quarantine (quarantine requirements in the US), 11) Spread (how COVID spreads), 12) Testing (available COVID tests), 13) Medication (drugs to protect against COVID), and 14) Treatment (treatments or therapies related to COVID). Examples and definitions of each intent are presented in Appendix 6.1.

We refer to the FAQs provided by the UN (Department of Operational Support, 2020) and the user queries in the dataset released by Anastasopoulos et al. (2020) to identify the 14 types of questions that a user may ask. We manually paraphrase the questions to generate queries for each intent in English (mean = 10.5, S.D. = 4.36 queries per intent). Two annotators with native English proficiency independently classified these queries; the inter-annotator agreement (κ) was 0.89. We then asked a few native speakers of each of the selected languages to translate these 147 queries manually. The dataset is split into train and test sets using a stratified split over the intents, giving 76 and 71 queries in the train and test sets, respectively. We consider three training and inference strategies, emulating the possible scenarios for developing such chatbots in practical settings (details in Appendix 6.2).

Train on English Data: In this strategy, we develop our bots by training them on the English queries, and evaluate the intent detection performance in different languages by automatically translating the test queries into English (e.g., similar to Gupta et al. (2021)).

Train on MT Translations: Here we build target-language intent classifiers from training data obtained by automatically translating the English training data into each target language. The classifier is then tested on the manually translated test data in the corresponding target language. A similar method was adopted by Balahur and Turchi (2012) for sentiment analysis.

Train on Manual Translations: Here the classifier for a target language is trained directly on the manually translated training queries in that language, and tested on the manually translated test queries.

Pre-trained MMLMs: We evaluate two popular MMLMs, namely mBERT (bert-base-multilingual-cased) and XLM-R (xlm-roberta-base), for our intent detection experiments. XLM-R supports all our languages but Kikuyu, Somali, and Zulu, while mBERT supports all but Amharic, Assamese, Hausa, Kikuyu, and Zulu. For these models, we only evaluate the Train on Manual Translations setup. We experiment with two different approaches for building intent classifiers with these models: i) using the pre-trained MMLM embeddings with a lightweight classifier, and ii) training an end-to-end classifier by fine-tuning the MMLM.
Since our dataset is small, training an end-to-end classifier might be prone to overfitting; hence we use pre-trained embeddings to fit a k-nearest neighbors model, as done in Caron et al. (2021). We report the best scores out of these two setups for both MMLMs (details in Appendix 6.3).

Evaluation: We report the relative accuracy drop δ_l for each target language l from English (en), defined as δ_l = (A_en − A_l) / A_en × 100, where A_l is the accuracy of intent classification for l on the held-out test set. Thus, the lower the value of δ_l, the better the state of LT for language l.

Table 1: δ_l for each language on the intent recognition task using the three different strategies. × indicates that the framework does not support end-to-end chatbot development for that language. Drops that lead to accuracy below 67% are marked by †, indicating that the bot mis-recognizes 1 out of every 3 queries. *Owing to the non-availability of standard MT for Kikuyu, we used the Safarini app from the Android Play Store for translation. Note: values in parentheses indicate a relative gain instead of a drop.

Table 1 presents the intent classification results. While the relative drop δ_l is reported, we also mark with a † those values for which the absolute accuracy A_l falls below 67% (denoted A_UX). Such a classifier is not useful for real-world deployment, as it will misclassify every third query. As expected, we observe high δ_l for languages belonging to class 3 or lower, with most of the accuracies below A_UX.

Comparison across the three setups: We observe that for classes 4 and 5, Train on English Data performs on par with, or even better than, the most expensive Train on Manual Translations setup. This may be because the MT translations from these languages into English are highly accurate. On the other hand, for languages belonging to class 3 or lower, Train on Manual Translations leads to better performance, arguably due to the poorer performance of the MT systems. Unfortunately, the Train on Manual Translations method is the most expensive in terms of data curation cost, and hence may be the hardest to implement in the midst of a pandemic. The problem becomes worse because the majority of class 3 and lower languages are not supported by current chatbot frameworks. Even when supported, their performance is below the acceptable limit (e.g., Marathi, Gujarati). One of the reasons is the difficulty in correctly identifying technical intents like Airborne and Incubation in such low-resource languages (Figure 2).

Since a few of these low-resource languages are present in the pre-training data of mBERT and XLM-R, we can evaluate them in the Train on Manual Translations setup. There is a similar pattern of accuracy drop for the MMLMs; however, the accuracies begin to fall below the acceptable limit (67%) from class 4 languages onward. There is a remarkable drop in mBERT's accuracy for Sinhala (class 0). In general, we find mBERT to outperform XLM-R, except for Swahili and Sinhala, which may be due to the higher representation of these languages in the pre-training corpus of XLM-R (the CommonCrawl corpus). This strongly indicates the importance of pre-training dataset size for developing LT, both in terms of absolute size and of size relative to other languages (Wu and Dredze, 2020).
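To make this setup concrete, the following is a minimal sketch of the embeddings-plus-kNN intent classifier and the δ_l metric described above, using mBERT via the Hugging Face transformers library and a scikit-learn k-NN (the choice of toolkit for the k-NN is an assumption). The train/test variables are hypothetical stand-ins for the 76/71-query splits, and acc_en stands for the accuracy obtained by the same procedure on English; these names are illustrative, not taken from any released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neighbors import KNeighborsClassifier

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    # Mean-pool the last-layer token representations (no fine-tuning).
    with torch.no_grad():
        enc = tokenizer(texts, padding=True, truncation=True,
                        max_length=32, return_tensors="pt")
        hidden = model(**enc).last_hidden_state          # (batch, seq_len, dim)
        mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical stand-ins for the stratified train/test splits in one target language.
train_texts, train_intents = ["..."], ["Airborne"]
test_texts, test_intents = ["..."], ["Airborne"]

knn = KNeighborsClassifier(n_neighbors=1)                # k = 1, as in Appendix 6.3
knn.fit(embed(train_texts), train_intents)
acc_l = 100 * knn.score(embed(test_texts), test_intents)

acc_en = 95.0                                            # placeholder: same pipeline run on English
delta_l = (acc_en - acc_l) / acc_en * 100                # relative accuracy drop, as in Table 1
```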
Returning to Table 1, the performance in the Train on MT Translations setup is, as expected, the worst among the three; except for Korean in LUIS, all values lie below A_UX, which could be a compounded effect of poor translation quality and inferior NLU solutions. To conclude, all languages in classes 3-5 had at least one solution yielding an acceptable accuracy, while all languages in classes 0-2, except Gujarati, Sinhala, and Zulu, had no acceptable solution.

Table 2 shows intent misclassification errors caused by errors in the MT translations. The manual translation in the target language corresponds to the 'Actual Example' in English, and the phrase translated back into English for the Train on English Data setup is reported under the 'Misclassified Translated Example'. We categorize the translation errors as Terminology Mismatch, Fluency, and Relevance (Li et al., 2020). We find that domain-specific terms often get translated incorrectly into English (Table 2). In a few cases, the translations result in unnatural queries and a loss of fluency, such as "Does SARS-CoV-2 sit in the air?". All these factors lead to the poor performance of the Train on English Data setup for low-resource languages. We find that Terminology Mismatch is the most common issue affecting performance. Interestingly, technical terms like incubation do not exist in a few of our target languages, so the manually written test queries in these languages simply had the English term written in that language's script. In such cases, we found a smaller performance drop than for languages where equivalent vocabulary exists. For example, the high drops in F1-score for the Quarantine and Incubation intents in Hausa (76% and 100%, respectively) and Amharic (56% and 100%) illustrate this, whereas for Zulu, where the human translator used the English terms in their queries, the drops in F1-score were much lower (20% and 0%). See the appendix for intent-wise F1 scores for the different languages.

Implications: Based on our experimental results, we wish to explore how to prioritize resource-investment strategies to push the state of current LT forward. Resource-poor languages mostly underperform across all three setups, so should we invest more in developing better translation systems, or focus on improving current NLU solutions for different languages? We observe that a good-quality translation system can support building bots for a new language from scratch, often performing on par with the Train on Manual Translations setup for high-resource languages (e.g., Korean, Hindi) and sometimes even for low-resource languages (e.g., Gujarati). Building a bot from scratch in a new language is otherwise resource-intensive and requires rapid prototyping, which may be infeasible during a crisis. Therefore, a generic way to ensure pandemic-readiness in a language is to ensure reasonably accurate MT systems, similar to those available for class 4 languages. Improving the representation of low-resource languages in the pre-training datasets of existing multilingual models (specifically with domain-specific corpora, as done by Gu et al. (2021)) is yet another way to ensure preparedness.
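As an illustration of the Train on English Data inference path discussed above, the sketch below translates a target-language query into English and classifies it with an English-trained model (for instance, the kNN classifier from the earlier sketch wrapped as a callable). The translate() function is a hypothetical stand-in for whichever MT service is used (e.g., Google or Bing Translator); it is not a real API call.

```python
def translate(text, source_lang, target_lang="en"):
    # Hypothetical placeholder: call an MT service of your choice here.
    raise NotImplementedError

def classify_query(query, source_lang, english_classifier):
    """Train-on-English-Data inference: translate the user query into English,
    then apply an intent classifier trained only on English queries.
    english_classifier is any callable mapping an English query to an intent label."""
    return english_classifier(translate(query, source_lang))

def evaluate(test_queries, test_intents, source_lang, english_classifier):
    # Accuracy of this setup on a manually translated target-language test set.
    preds = [classify_query(q, source_lang, english_classifier) for q in test_queries]
    return 100 * sum(p == g for p, g in zip(preds, test_intents)) / len(test_intents)
```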
We also evaluate the developed chatbots on another core NLU task, entity recognition (Ali, 2020), for English, Hindi, and Bengali. To train and evaluate the different COVID bots on this task, we use a set of 200 user queries (obtained by augmenting the existing dataset of 147 queries). Entity types were identified from a subset of labels in the CORD-19 NER dataset (Lu et al.), and the queries were tagged accordingly by two native speakers of Bengali and Hindi. Overall, our dataset had a mix of medical and non-medical entities. The final set of medical entity types consists of: Covid (COVID-related entities), PhysicalScience (technical terms related to the bio-molecular mechanism of the disease), and Disease (any form of illness or symptoms). The non-medical entity types are: BodyPart (name of a body part), Country (country name), Duration (length in days), Protection (ways to protect against COVID, such as 'mask' or 'gloves'), and InfoSource (source of information). For generating the equivalent translations, we manually aligned the entity tags in two languages: Hindi (supported by DialogFlow and LUIS) and Bengali (supported by DialogFlow).

In the majority of cases, we observe that domain-specific entities such as incubation, ACE-2 cells, and biochemical assays are hard for these models to predict in languages other than English. For instance, for the Covid entity type, we observe significant F1-score drops of 24.6% for Hindi and 42.9% for Bengali. However, for non-medical entities these models perform comparatively better; e.g., the drop in F1-score on the Country tag was 5.2% for Hindi and 8.9% for Bengali.

Although our current work focuses on analyzing the pandemic-preparedness of only 16 languages, we now try to generalize our findings to other languages by introducing a readiness score for every language, which empirically measures the preparedness of current LT to serve its speakers in a pandemic-like emergency. The definition of readiness assumes that one has access to the best available LT, by considering the highest intent detection accuracy A*_l for a language across the different frameworks and training setups. We then define the readiness of a language l as its relative accuracy with respect to English, r_l = (A*_l − A_random) / (A*_en − A_random), where A*_en denotes the best-case accuracy on English and A_random is the accuracy of a random classifier, A_random = 100 / numberOfIntents.

We would like to interpolate r_l for all the languages of the world, and hence need more training examples than the 16 languages we currently have. We select a set of 11 proxy languages (details in Appendix 6.5), ensuring coverage of the major language families of the world. For these languages, we compute proxy accuracies Ã*_l by building and evaluating chatbots on MT-translated data. We then train a Gaussian Process Regression model for predicting readiness scores, with the r_l values for these 27 languages as our training set. We use geographical and genetic features from the URIEL database (Littell et al., 2017) to represent the languages. The predictive model, which has an average absolute prediction error of 5%, is then used to estimate the readiness scores of 116 further languages supported by major MMLMs (mBERT and XLM-R) and/or translators (Google and Microsoft). For all other languages, we set r_l = 0, as one can expect near-random performance without any LT, as we saw for Kikuyu (Table 1). The estimated final r_l scores for all languages were then used to extrapolate the pandemic-readiness of each country c, as follows. We use country-wise language and speaker demographic data to calculate the country-wise readiness (similar to Blasi et al. (2021)) as
r_c = Σ_{l ∈ L_c} s_{c,l} · r_l, where L_c is the set of languages spoken in country c, and s_{c,l} is the fraction of c's population who are native speakers of language l. The r_c values were clustered into five classes (Extremely ill-prepared: 0-0.43, Ill-prepared: 0.43-0.76, Moderately prepared: 0.76-0.86, Well prepared: 0.86-0.92, Fully prepared: 0.92-1) using Jenks' natural breaks optimization (Jenks, 1967). These classes were used to generate a readiness heatmap of the world (Figure 3).

Observations: From Figure 3, one can observe that South and East African, South Asian, and East European countries fall into the Extremely ill-prepared category, due to the dominance of low-resource languages in these geographical regions. For instance, people in Zambia primarily speak Bemba, Chewa, and Lozi, all of which are severely under-resourced. As pointed out by Anastasopoulos et al. (2020), these regions might also be among the worst hit in a pandemic situation and therefore require immediate attention. For Ill-prepared regions such as Bolivia and Paraguay in South America, and Guatemala in Latin America, r_c values are slightly better due to the abundance of Spanish speakers; however, there is a sizeable population speaking under-served languages such as Q'eqchi and Guarani. Countries that fall within the Fully to Moderately prepared categories typically have a large native-speaker population of one or more class 5 languages (English, French, Chinese, Arabic) and/or well-supported languages (e.g., Korean, Bengali, Malay).

It is important to note that, while approximating the readiness of a language, we assumed the same value for all of its diverse linguistic variants and dialects, which in certain cases results in an overestimation of r_c. For instance, the high r_c for North and Central African countries (e.g., Libya, Egypt, and Sudan) might be due to the sizeable population speaking a resource-rich language, Arabic. However, Arabic has several dialects, which differ from Modern Standard Arabic at various linguistic levels, and consequently the performance of LT systems on such dialects also varies considerably (Zbib et al., 2012; Alsharhan and Ramsay, 2020).

From our chatbot development experience, we uncover a set of interesting insights and arrive at the following recommendations, which can improve the state of preparedness of languages to combat the next pandemic.

- Our experiments showed that low-resource Indian languages (such as Marathi and Bengali) benefited from the presence of a geographically and/or linguistically closely related well-resourced language (Hindi). The notion of such "bridge" languages has been explored before in the context of MT (Paul et al., 2013) and of zero/few-shot transfer in MMLMs (Lauscher et al., 2020). We recommend that the community target bridge languages for the regions that are currently poorly prepared from an LT perspective.

- Drawing on the brittleness of MT for domain-specific terms (airborne, incubation) or newly coined terms (COVID), we believe that commercial and open-source bot frameworks can benefit from domain adaptation techniques (Chu and Wang, 2018), or from techniques to inject new terms into existing solutions.

Our study confirms that, apart from English, only a few European and Asian languages push forward the state-of-the-art research in LT for healthcare. Our preliminary investigation suggests that instead of demographic demand, it is the economic prowess of the users of a language that drives investment towards developing sophisticated LT solutions for that language. For instance, Swahili, even though considered the lingua franca of Africa, is still under-served by commercial chatbot frameworks. Similar trends were observed for Hausa, which has a considerably larger speaker base than Dutch (resource-rich). We believe that these findings will play a crucial role in making the community aware of the disparity that needs to be addressed before the next pandemic hits.
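Before the appendix material that follows, here is a minimal sketch of the readiness-score extrapolation used in Section 4 (and detailed in Appendix 6.5): a Gaussian Process regressor with an RBF-plus-White kernel, fit on the seed languages and validated leave-one-out using scikit-learn. The feature matrix and targets below are random placeholders; in practice X would hold the URIEL geographic/genetic feature vectors (obtainable, for example, via the lang2vec package) and y the corresponding r_l values.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import LeaveOneOut

# Placeholder data: 27 seed languages, 10-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.random((27, 10))   # URIEL-style geographic/genetic features (placeholder)
y = rng.random(27)         # readiness scores r_l (placeholder)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)  # L-BFGS-B is the default optimizer

# Leave-one-out validation over the seed languages.
errors = []
for train_idx, val_idx in LeaveOneOut().split(X):
    gpr.fit(X[train_idx], y[train_idx])
    errors.append(abs(float(gpr.predict(X[val_idx])[0]) - y[val_idx][0]))
print("mean absolute error:", np.mean(errors))

# Fit on all seed languages, then predict r_l for the remaining supported languages.
gpr.fit(X, y)
# r_new = gpr.predict(X_new)  # X_new: features for the 116 additional languages
```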
For our experiments with multilingual pre-trained Transformers, we consider mBERT (bert-base-multilingual-cased) and XLM-R (xlm-roberta-base) for training intent classifiers. As mentioned in the main text, we explore two methodologies to train and evaluate these MMLMs; a detailed description with hyperparameters is given below.

1. KNN using Pre-trained Embeddings: Since the scale of our data is on the lower side, training an end-to-end classifier might be prone to over-fitting. We therefore fit a k-Nearest Neighbors (KNN) classifier on the sentence embeddings obtained using the pre-trained model for the queries in the training data. At test time, we similarly obtain the representation of the user query and find its nearest neighbors among the training queries to predict its intent. The optimal value of k was empirically found to be 1, and for sentence embeddings we take the average of the representations of the tokens of the sentence in the last layer of the MMLM. We also tried fine-tuning the pre-trained model on the training queries using a Masked Language Modelling (MLM) objective. Additionally, we fine-tuned on a much larger English COVID-19 query dataset, COQB (Li et al., 2020), along with our training queries, which, as pointed out by Lauscher et al. (2020), can be an effective strategy for few-shot transfer. We fine-tune the models for 3 epochs with a learning rate of 5e-5 and the AdamW optimizer (Loshchilov and Hutter, 2019). A masking probability of 15% was used during MLM training, and the maximum sequence length was set to 32.

2. End-to-end Fine-tuning: We also try fine-tuning the MMLMs end-to-end by adding a classification head on top of the pre-trained network to classify the input query into one of the 14 intents. We adapt the sequence classification scripts for the GLUE benchmark (Wang et al., 2018) provided by Hugging Face (https://huggingface.co/transformers/) to our dataset. We fine-tune the classifier for 20 epochs with a batch size of 8, using the same learning rate and optimizer as for the MLM fine-tuning above.

For every language, we use the best accuracy obtained from either of these two strategies (technically four, as in the KNN case we consider no fine-tuning, fine-tuning on the training queries, and fine-tuning on the training and COQB queries). All experiments were run on 4 NVIDIA V-100 GPUs with 32 GB memory.
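As a concrete reference for the end-to-end fine-tuning configuration above (14 intents, 20 epochs, batch size 8, learning rate 5e-5, AdamW), the following is a minimal sketch using the Hugging Face transformers library with a plain training loop instead of the GLUE scripts; train_texts and train_labels are hypothetical stand-ins for the training queries and their intent ids.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=14)

# Hypothetical stand-ins for the 76 training queries and their intent ids (0-13).
train_texts = ["How does COVID spread through the air?"]
train_labels = [0]

enc = tokenizer(train_texts, padding=True, truncation=True,
                max_length=32, return_tensors="pt")
labels = torch.tensor(train_labels)

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(20):                           # 20 epochs
    for i in range(0, len(labels), 8):            # batch size 8
        batch = {k: v[i:i + 8] for k, v in enc.items()}
        loss = model(**batch, labels=labels[i:i + 8]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```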
Results and Analysis: We first plot the readiness measure of each language used in our training data in the scatter plot in Figure 5, with the language class on the x-axis and the readiness measure on the y-axis. It clearly shows that African languages such as Somali, Amharic, Hausa, and Zulu are below the trend line in terms of readiness. In fact, some European languages, such as Icelandic, Hungarian, Estonian, and Finnish, also require attention. Primarily, we observe from this plot that the readiness measure is not a direct function of the language class: even though the majority of class 4 and 5 languages lie near the trend line, the observation is similar for class 1 as well. Therefore, we also examine how well the trend holds for the language families of these languages. We approximate each language family by taking the average score of the languages belonging to that family, and plot these in Figure 6. It is interesting to observe that language families such as Austroasiatic, Koreanic, and Sino-Tibetan are well served and consequently lie above the trend line. Overall, the Indo-European family lies close to the trend line, while the resource-poor families are Afroasiatic, Niger-Congo, and Uralic, the worst served being Afroasiatic.

Table 5: Relative drop in entity-type-wise F1-score on the entity recognition task using DialogFlow (DF) and LUIS.

In Section 4, we discussed the estimation of readiness values for different languages. We first extended the 16 languages considered in our intent recognition experiments with proxy scores for an additional 11 languages, namely French (fr), Arabic (ar), German (de), Spanish (es), Portuguese (pt), Vietnamese (vi), Hungarian (hu), Finnish (fi), Czech (cs), Estonian (et), and Icelandic (is). This set covers six primary language families of the world: 1) Indo-European, 2) Sino-Tibetan, 3) Afroasiatic, 4) Niger-Congo, 5) Koreanic, and 6) Austroasiatic. To estimate the readiness values of the remaining 116 languages supported by the translators (Google and Bing) and MMLMs (mBERT and XLM-R), we used the available readiness data for these 27 languages to build a regression model. We used Gaussian Processes to model the readiness prediction problem, owing to their effectiveness on small datasets. A Radial Basis Function (RBF) kernel with an added per-instance noise level (White kernel) was used, and the RBF length scale and noise level were tuned using the L-BFGS algorithm with 5 restarts of the optimizer. Model selection was done using a leave-one-out strategy, where we move one language to the validation set and train on the remaining languages, repeating this for all languages and measuring the average accuracy. Besides Gaussian Process Regression (GPR), we also experimented with Linear Regression, Lasso Regression, and XGBoost (Chen and Guestrin, 2016), but observed inferior validation accuracies.

In Section 4 of the paper, we described how the speaker-base values are taken into account while calculating the readiness score of each country, where the final r_l scores obtained for all languages are used to extrapolate the readiness of each country c. We also experimented with a variant in which all the languages spoken in a country are weighted equally while calculating its readiness. This is similar to the linguistic utility defined by Blasi et al. (2021); for a country c we calculate the linguistic readiness r_c^ling as r_c^ling = (1 / |L_c|) Σ_{l ∈ L_c} r_l. The r_c^ling values are plotted in Figure 7. Based on these values, we make the following observations highlighting the difference between the demographic and linguistic readiness of different countries.

Observations: The map in Figure 7 provides an idea of how effectively each country in the world would be able to combat the pandemic by leveraging LT solutions when a uniform speaker base is assumed for each language spoken in the country.
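To make the two weighting schemes concrete, the following is a minimal sketch of country-level readiness under demographic weighting (r_c) and uniform weighting (r_c^ling). The example numbers are illustrative placeholders, except Hindi's readiness score and speaker share, which are quoted in the observations below.

```python
def country_readiness(speaker_share, readiness, uniform=False):
    """speaker_share: {language: fraction of the country's native speakers};
    readiness: {language: r_l in [0, 1]}; languages without LT support get r_l = 0."""
    langs = list(speaker_share)
    if uniform:
        # Linguistic readiness: every language spoken in the country counts equally.
        weights = {l: 1.0 / len(langs) for l in langs}
    else:
        # Demographic readiness: weight each language by its speaker share.
        weights = speaker_share
    return sum(weights[l] * readiness.get(l, 0.0) for l in langs)

# Illustrative placeholders (only Hindi's values are taken from the text below).
india_shares = {"Hindi": 0.4619, "Bengali": 0.08, "Marathi": 0.07, "Other": 0.39}
r_l = {"Hindi": 0.9536, "Bengali": 0.78, "Marathi": 0.55}

print(country_readiness(india_shares, r_l))                # demographic weighting
print(country_readiness(india_shares, r_l, uniform=True))  # uniform weighting
```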
Overall, it can be observed that some Asian countries, like India, now fall in the ill-prepared zone, whereas they were initially treated as moderately prepared; for India, the earlier rating was largely driven by the class 4 language Hindi (readiness score 0.9536) with a considerably high speaker base (46.19%), which no longer dominates under uniform weighting. A similar trend is observed for Canada (home to speakers of various languages such as English, French, Punjabi, Italian, Spanish, German, Cantonese, Arabic, and Tagalog).

References:
- Massively multilingual neural machine translation
- Medical chatbot for novel covid-19
- Upstage: Unsupervised context augmentation for utterance classification in patient-provider communication
- Towards an artificially empathic conversational agent for mental health applications: System design and user perceptions
- Enhancing rasa nlu model for vietnamese chatbot
- A qualitative evaluation of language models on automatic question-answering for covid-19
- UN Department of Operational Support. 2020. Covid-19 frequently asked questions
- Crosslingual name tagging and linking for 282 languages
- Classification of covid-19 symptom for chatbot using bert
- Designing a chatbot for a brief motivational interview on stress management: Qualitative case study
- How to choose the best pivot language for automatic translation of low-resource languages
- A chatbot in Malayalam using hybrid approach
- Fu covid-19 ai agent built on attention algorithm using a combination of transformer, albert model, and rasa framework
- Extracting situational information from microblogs during disaster events: A classification-summarization approach
- Students need more attention: Bert-based attention model for small data with application to automatic patient message triage
- Edubot: a chatbot for education in covid-19 pandemic and vqabot comparison
- LORELEI language packs: Data, tools, and resources for technology development in low resource languages
- Ask diana: A keyword-based chatbot system for water-related disaster management
- GLUE: A multi-task benchmark and analysis platform for natural language understanding
- Are all languages created equal in multilingual BERT?
- Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer
- PAWS-X: A cross-lingual adversarial dataset for paraphrase identification
- Machine translation of Arabic dialects
- Multi-stage pretraining for low-resource domain adaptation