title: A Multilingual African Embedding for FAQ Chatbots
authors: Mabrouk, Aymen Ben Elhaj; Hmida, Moez Ben Haj; Fourati, Chayma; Haddad, Hatem; Messaoudi, Abir
date: 2021-03-16

Searching for available, reliable, official, and understandable information is not a trivial task: information is scattered across the internet, and there is a lack of governmental communication channels that communicate in African dialects and languages. In this paper, we introduce an Artificial Intelligence powered chatbot for crisis communication that is omnichannel, multilingual, and multidialectal. We present our work on a modified StarSpace embedding tailored to African dialects for the question-answering task, along with the architecture of the proposed chatbot system and a description of its layers. English, French, Arabic, Tunisian, Igbo, Yorùbá, and Hausa are the supported languages and dialects. Quantitative and qualitative evaluation results are reported for our deployed Covid-19 chatbot. They show that users are satisfied and that conversations with the chatbot meet their needs.

Nowadays, African people have easier access to the internet, and particularly to digital information, through various platforms. However, African public and private institutions interact with citizens through outdated websites that communicate in the official language, mainly French or English. Most African citizens, however, speak and express themselves in their local dialects, which makes this information hard to access and to find. They also struggle with the geographic distance separating them from these institutions. In addition, the phone lines of African institutions are overloaded and unavailable, especially during a crisis such as the Covid-19 pandemic, when everyone is asking for information about the disease or how to prevent it. Faced with the Covid-19 crisis, African institutions needed to improve communication across digital channels and struggled to make information as reliable as possible, in the official language and in local dialects. In daily life, Tunisian people, for instance, communicate and type in the Tunisian dialect. It comprises "Tunisian Arabic", also called "Tounsi" or "Derja", which uses the Arabic alphabet but differs from Modern Standard Arabic (MSA), and "TUNIZI", introduced in (Fourati et al., 2020) as "the Romanized alphabet used to transcribe informal Arabic for communication by the Tunisian Social Media community". In fact, the Tunisian dialect features Arabic vocabulary spiced with words and phrases from Tamazight, French, Turkish, Italian, and other languages (Stevens, 1983). A study of the Tunisian dialect on social media (Younes et al., 2015) reports that 81% of Tunisian comments on Facebook use the Romanized alphabet. Another study of Tunisian social media comments (Abidi, 2019) reports the following statistics: out of 1.2M comments containing 16M words and 1M unique words, 53% of the comments use the Romanized alphabet, 34% use the Arabic alphabet, and 13% switch scripts; 87% of the Romanized comments are Tunizi, while the rest are French and English.
A work 1 presents statistics about Nigerian languages and dialects and indicates that "Nigeria is one of the most linguistically diverse countries in the world, with over 500 languages spoken". Nigeria's official language is English; however, it is spoken less frequently in rural areas and among people with lower educational levels. Major languages spoken in Nigeria include Hausa, Yorùbá, Igbo, and Fulfulde. Statistics 2 indicate that about 54.8 million people worldwide speak Hausa as their mother tongue. Other statistics 3 report that the "Igbo people number between 20 and 25 million, and they live all over Nigeria". Another work 4 indicates that the most spoken African languages by number of native speakers are Arabic, Berber (Amazigh), Hausa, Yorùbá, Oromo, Fulani, Amharic, and Igbo.

To properly answer a huge number of Covid-19 related questions asked at the same time, we decided to develop a chatbot. However, two main problems were encountered. First, no previously deployed chatbot covered the Covid-19 lexical field. Second, no pretrained language models exist for dialects, particularly African ones. Hence, creating our own domain-specific embedding model was crucial. In this paper, we introduce a chatbot for African institutions that is omnichannel, multilingual, and multidialectal. This solution addresses digital communication with citizens in both the official governmental language and local dialects, avoids fake news, and understands and replies in the language or dialect relevant to the user. Services would be digitized, and hence illiterate and vulnerable citizens, such as rural women and people in need of social inclusion and resilience, would have easier access to reliable information.

The open-domain Question-Answering (QA) task has been intensively studied to evaluate the performance of Language Understanding models. This task takes a textual question as input and looks for corresponding answers within a large textual dataset. In (Mozannar et al., 2019), two MSA QA datasets were proposed. In 2016, the first Arabic dialect chatbot, called BOTTA, was developed; it converses in the Egyptian Arabic dialect (Abu Ali and Habash, 2016). It represents a female character that converses with users for entertainment. BOTTA was developed using AIML and launched on the Pandorabots platform. In (Al-Ghadhban and Al-Twairesh, 2020), the authors developed a social chatbot that converses with students of the Information Technology (IT) department at King Saud University (KSU) in the Saudi Arabic dialect. They collected 248 inputs/outputs from the KSU IT students, then preprocessed and classified the collected data into several text files to build the dialogue dataset. To the best of our knowledge, no such work has previously been done for African dialects.

Creating our chatbot involved three steps:
• First, we collected data from official sources. For instance, Tunisian information was provided by the Ministry of Health and Nigerian information by local Non-Governmental Organizations (NGOs).
• Second, the collected data was augmented and enriched with chitchat such as greetings and jokes.
• Finally, the collected data was divided into two main categories: Frequently Asked Questions (FAQ) and chitchat (a schematic example of this organization follows this list).
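To make this organization concrete, the following is a minimal, hypothetical sketch of how collected entries could be stored and split into the two categories. The field names, intent labels, and example values are our own illustrative assumptions, not the authors' actual data format.

```python
# Illustrative sketch (not the authors' actual schema): one possible way to
# organize the collected FAQ and chitchat entries before training.
from dataclasses import dataclass

@dataclass
class Entry:
    question: str   # user question, in any supported language or dialect
    answer: str     # reply, following the language/dialect policy of the bot
    intent: str     # purpose of the question, e.g. "faq_protection" (hypothetical name)
    language: str   # e.g. "en", "fr", "msa", "tunizi", "ig", "yo", "ha"
    category: str   # "faq" or "chitchat"

corpus = [
    Entry("How to protect myself from the covid-19?",
          "Wash your hands well, wear a mask and respect physical distancing.",
          "faq_protection", "en", "faq"),
    Entry("Hello", "Hello! How can I help you?",
          "chitchat_greeting", "en", "chitchat"),
]

# The two categories described above.
faq = [e for e in corpus if e.category == "faq"]
chitchat = [e for e in corpus if e.category == "chitchat"]
```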
To collect reliable and official Tunisian and Nigerian data, Covid-19 related questions and answers were provided by the Tunisian Ministry of Health and by Nigerian NGOs, respectively. However, such information is available only in the official languages: Tunisian resources are written in French and Arabic, and Nigerian ones in English. Hence the need to include the local dialects spoken by citizens on a daily basis. To further augment our data and include local dialects, each question written in French was translated into the Tunisian dialect using the Tunizi way of writing, and each question collected in MSA was translated into the Tunisian dialect expressed in Arabic letters. Every question was augmented in the Tunisian dialect by five Tunisian native speakers. Table 1 presents the translation of the question "How to protect myself from the covid-19?" into French, MSA, Tunizi, Tunisian Arabic, and English. Tunisian questions written in Tunizi or French are answered in French, whereas questions written in Tunisian Arabic or MSA are answered in MSA. Since the Nigerian questions were collected only in English, all questions and answers were translated into Igbo, Yorùbá, and Hausa. This work was done by seven Nigerian native speakers. Hence, a question asked in one of the three dialects is answered in that same dialect. Table 2 presents the question "What is the Current Status of COVID-19 Testing in Nigeria?" expressed and answered in English and in all available Nigerian dialects.

After collecting and augmenting the needed data in all languages and dialects, we grouped it into two main categories:
• FAQs, containing all official and reliable Covid-19 related questions and answers.
• Chitchat, containing informal conversations such as salutations, jokes, greetings, etc.
Chitchat answers are given in the local dialect of the question. Table 3 presents examples of Tunisian chitchat in MSA, French, Tunizi, and Tunisian Arabic. Table 4 presents examples of Nigerian chitchat in English, Yorùbá, Hausa, and Igbo. The number of answers in each category, for each language and dialect, is presented in Table 5.

The proposed chatbot system is based on a multilayered architecture, as shown in Figure 1. The architecture is composed of four layers: Presentation Layer, NLP Layer, NLU Layer, and Data Layer.
• The Presentation Layer allows users to interact with the chatbot by posing questions and receiving responses. A simple web browser is enough to load the UI components and maintain a discussion with the chatbot. Alternatively, the interaction is also possible via messaging platforms: the Messaging Platforms Connector provides built-in connectors to common messaging platforms, such as Facebook Messenger.
• The NLP Layer is composed of a data processing module and a data formatting module. The data processing module is responsible for data cleaning. The data formatting module receives HTML-encoded questions from the Presentation Layer and transforms them into the JSON (JavaScript Object Notation) format required by the NLU Layer. Conversely, JSON responses received from the NLU Layer are formatted into adequate HTML to be presented to users.
• The NLU Layer is the core of the chatbot system. It is composed of two modules: NLU Services and External Services Connector. First, the NLU Services module embeds the text of the question into a numerical vector. Second, the embedded question is mapped to an intent with a confidence score. If the score is higher than a predefined threshold, an answer related to the matched intent is determined. In specific cases, the answer requires an external service, such as weather or news; the External Services Connector then consumes an external API to retrieve the answer. When the confidence score is lower than the threshold, the question is considered ambiguous: it remains unanswered and is logged by the lower layer, and a default message is sent to the user stating that the chatbot has no specific answer or asking the user to reformulate the question. The threshold is determined empirically over a validation dataset. A sketch of this intent-matching step is given after this list.
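The following is a minimal sketch of the confidence-thresholded intent matching described in the NLU Layer. The 0.7 threshold, the fallback wording, and all function names are our own hypothetical assumptions; in the deployed system the vectors come from the trained embedding model and the threshold is tuned on the validation set.

```python
import numpy as np

# Hypothetical values, for illustration only.
CONFIDENCE_THRESHOLD = 0.7
FALLBACK_MESSAGE = ("Sorry, I have no specific answer to that question. "
                    "Could you please reformulate it?")

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def respond(question_vec, intent_vecs, answers_by_intent, unanswered_log):
    """Map an embedded question to an intent; fall back below the threshold."""
    intent, score = max(
        ((name, cosine(question_vec, vec)) for name, vec in intent_vecs.items()),
        key=lambda pair: pair[1],
    )
    if score < CONFIDENCE_THRESHOLD:
        # Ambiguous question: keep it for later analysis and send a default reply.
        unanswered_log.append((intent, score))
        return FALLBACK_MESSAGE
    return answers_by_intent[intent]
```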
Since no previous work exists on Tunisian and Nigerian dialect language models, and in particular no chatbot answering Covid-19 questions, we trained a domain-specific language model dedicated to this task. We modeled our data as document-tag pairs: the documents represent the questions and the tags represent the intents. An intent defines the purpose of a user's question. Dialects, and particularly African ones, have no writing standards: people write words in any format. For example, the word "good" in Tunizi can be written as "behi", "bahi", "behii", "behy", etc. Therefore, we used a bag of character n-grams representation for documents, with n between two and four, and one-hot encoding for tags. We trained a multilingual and multidialectal classifier using the StarSpace neural model (Wu et al., 2017), which embeds inputs and intent labels into the same space and is trained by maximizing the similarity between them. We used two hidden layers and an embedding layer that returns a twenty-dimensional vector, with the default hyperparameters and the cosine similarity function. The classifier representation is shown in Figure 2.
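As an illustration of the bag-of-character-n-grams representation described above (n from two to four), here is a minimal sketch. It is our own simplified featurization, not the StarSpace implementation, and the helper name char_ngrams is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Bag of character n-grams with 2 <= n <= 4, as described above."""
    text = text.lower()
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            bag[text[i:i + n]] += 1
    return bag

# Spelling variants of the Tunizi word for "good" share many character
# n-grams, so their bag-of-n-grams representations remain close to each other.
variants = ["behi", "bahi", "behii", "behy"]
reference = char_ngrams(variants[0])
for v in variants[1:]:
    overlap = sum((char_ngrams(v) & reference).values())
    print(v, "shared n-grams with 'behi':", overlap)
```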
The chatbot was deployed on two different channels: websites (the official government website and an NGO website) and Facebook Messenger (a Facebook page was created for the chatbot, where users can interact with it through Messenger). From March 23, 2020 to June 18, 2020, 34,800 users, excluding bot traffic, interacted with the chatbot, as summarized in Table 6:
• Number of unique users: 34,800
• Number of received questions: 300,000
• Minimum questions per conversation: 1
• Maximum questions per conversation: 32
• Average questions per conversation: 7.4
On the Messenger channel, 8,153 users interacted with our chatbot, with a minimum of 89 and a maximum of 903 users per day. The stickiness rate, calculated as the ratio of Daily Active Users to Monthly Active Users, reached 16.85%. Nevertheless, the results show that, in a crisis context, users prefer to retrieve information from official websites rather than from non-official channels like Messenger. Table 7 presents the distribution of Messenger users by age and gender from April 1, 2020 to June 18, 2020. These statistics demonstrate the high interaction rate with our chatbot. We can conclude that people had real conversations, with an average of 7.4 messages per conversation.

To measure the chatbot's quality, we used the Sensibleness and Specificity Average (SSA) metric, a human evaluation metric proposed by the Google Brain team (Adiwardana et al., 2020). SSA measures the human-likeness of the chatbot by answering the following questions: does the response make sense (Sensibleness), and is it specific (Specificity)? For example, if a user asks "How to protect myself from covid-19?" and the chatbot responds with "Be safe", the answer should be marked as "not specific", since such a reply could be used in various contexts. However, if the chatbot responds "You must wash your hands well, wear a mask and respect physical distancing", the answer should be marked as "specific", since it relates closely to what is being discussed. The sensibleness metric alone is not sufficient: responses can be sensible but vague, and vague responses do not encourage users to ask more questions. For example, if the question is not Covid-19 related, the chatbot replies with a generic apology. Hence, the second dimension of the SSA metric assesses the specificity of the response. We asked human judges to label an answer as sensible when it has meaning, and as specific when it is close to the discussed topic. Answers labeled not sensible are considered not specific. The sensibleness and specificity rates are calculated as the percentage of responses labeled sensible and specific, respectively; the SSA metric is the average of the two. SSA is a proxy for human-likeness that also penalizes chatbots with generic responses. We randomly selected 100 discussions with 20 user interactions (questions) each, submitted them to 3 human judges, and asked whether each response, given the context, was sensible and specific. We consider this evaluation static, since the context is fixed: Covid-19. The obtained results are presented in Table 8: Sensibleness reached 68% and Specificity 60%, i.e., an SSA of 64%.
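To make the SSA computation concrete, here is a minimal sketch with made-up judge labels (the list below is illustrative, not the paper's evaluation data). It follows the rule above that a response labeled not sensible is also counted as not specific.

```python
# Each judged response carries two boolean labels from the human judges.
labels = [
    {"sensible": True,  "specific": True},
    {"sensible": True,  "specific": False},
    {"sensible": False, "specific": False},
]

def ssa(labels):
    """Return (sensibleness rate, specificity rate, SSA = their average)."""
    n = len(labels)
    sensibleness = sum(l["sensible"] for l in labels) / n
    # A response that is not sensible cannot be specific.
    specificity = sum(l["sensible"] and l["specific"] for l in labels) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2

sens, spec, score = ssa(labels)
print(f"Sensibleness={sens:.0%}, Specificity={spec:.0%}, SSA={score:.0%}")
```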
In this work, we presented the first Covid-19 related chatbot for African languages and dialects. Its particularity is that it answers questions in English, French, Arabic, Tunisian, Igbo, Yorùbá, and Hausa without any predefined scenarios. Our solution digitizes the services of African institutions; hence, illiterate and vulnerable citizens, such as rural women and people in need of social inclusion and resilience, have easier access to reliable, official, and understandable information 24/7. The Artificial Intelligence powered chatbot for crisis communication is omnichannel, multilingual, and multidialectal. We modified the StarSpace embedding and its architecture to tailor them to African dialects. Quantitative and qualitative evaluations were performed; the results show that users are satisfied and that conversations with the chatbot meet their needs. Indeed, the statistics demonstrate a high interaction rate with the chatbot. The SSA metric and its individual factors (sensibleness and specificity) showed reasonable performance. The architecture can be tailored to any tokenized African dialect, making it possible to build chatbots for other underrepresented African languages and dialects. As future work, we plan to build a voicebot that understands vocal requests and replies with voice messages. This problem will be addressed using Speech-To-Text (STT) and Text-To-Speech (TTS) technologies for African dialects.

References:
• Abidi (2019). La construction automatique de ressources multilingues à partir des réseaux sociaux : application aux données dialectales du Maghreb (Automatic building of multilingual resources from social networks: application to Maghrebi dialects).
• Abu Ali and Habash (2016). Botta: An Arabic dialect chatbot.
• Adiwardana et al. (2020). Towards a Human-like Open-Domain Chatbot.
• Al-Ghadhban and Al-Twairesh (2020). Nabiha: An Arabic dialect chatbot.
• Fourati et al. (2020). TUNIZI: A Tunisian Arabizi sentiment analysis dataset.
• Mozannar et al. (2019). Neural Arabic question answering.
• Stevens (1983). Ambivalence, modernisation and language attitudes: French and Arabic in Tunisia.
• Wu et al. (2017). StarSpace: Embed all the things! arXiv preprint.
• Younes et al. (2015). Constructing linguistic resources for the Tunisian dialect using textual user-generated contents on the social web.