A Machine Learning Application for Raising WASH Awareness in the Times of Covid-19 Pandemic
Rohan Pandey; Vaibhav Gautam; Kanav Bhagat; Tavpritesh Sethi
2020-03-16

A proactive approach to raising awareness while preventing misinformation is a modern-day challenge in all domains, including healthcare. Such awareness and sensitization approaches to prevention and containment are important components of a strong healthcare system, especially in the times of outbreaks such as the ongoing Covid-19 pandemic. However, there is a fine balance between continuous awareness-raising through new information and the risk of misinformation. In this work, we address this gap by creating a life-long learning application that delivers authentic information to users in Hindi, the most widely used local language in India. It does this by matching sources of verified and authentic information, such as WHO reports, against daily news using machine learning and natural language processing, and it delivers the narrated content in Hindi using state-of-the-art text-to-speech engines. Finally, the approach allows user input for continuous improvement of news feed relevance on a daily basis. We demonstrate a focused application of this approach to Water, Sanitation and Hygiene, which is critical to the containment of the currently raging Covid-19 pandemic, through the WashKaro Android application. Thirteen combinations of pre-processing strategies, word embeddings and similarity metrics were evaluated by eight human users via calculation of agreement statistics. The best performing combination achieved a Cohen's Kappa of 0.54 and was deployed in the WashKaro application back-end. Interventional studies evaluating the effectiveness of the WashKaro application for preventing WASH-related diseases are planned in the Mohalla clinics, which provided 3.5 million consultations in Delhi, India in 2019. Additionally, the application also features human-curated and vetted information delivered to the community as audio-visual content in local languages.

Raising healthcare awareness for primary prevention of diseases is a challenge all across the globe. Hygiene promotion is the most cost-effective health intervention if accurate content is delivered effectively, and a majority of preventable diseases result from unhygienic practices. Water, Sanitation and Hygiene (WASH) measures such as hand-washing are also important in limiting the spread of pandemics such as the currently raging Covid-19. Further, awareness-raising content is often not available to those who need it the most, nor in a format that they can easily understand, and this gap has profoundly wide socio-economic impacts. In 2017, around 55% of the global population did not make use of a safely managed sanitation service, owing in part to a lack of awareness in addition to the lack of facilities at home [2]. Around 827,000 people in low- and middle-income countries die each year as a result of inadequate water, sanitation and hygiene. A significant proportion of these deaths can be averted through dissemination of information about WASH practices and their critical role in preventing diseases, by delivering authentic information content in local languages.
This cuts across Sustainable Development Goals 3 (Good Health and Well-Being for All) and 6 (Adequate Sanitation and Hygiene for All). India is the second-most populous country in the world, with more than 1 billion citizens, of whom a staggering 344 million lack hygienic defecation facilities [1]. The World Health Organisation states that more than 500 children under the age of five die each day from diarrhoea in India alone [1] and estimates that 21 per cent of communicable diseases in India are linked to unsafe water and the lack of hygiene practices [1]. Ironically, India is also one of the largest and fastest-growing markets for digital consumers, with 560 million internet subscribers in 2018 [4], and about 60% of Indian users anticipate that m-Health technologies will improve healthcare within the next three years [5]. This offers a unique opportunity to bridge the gap in information availability through m-Health technologies, reaching out to those who need it the most in a medium that they understand best, e.g. audio delivered in local languages, thus narrowing the divide between these resources and the masses. The recent pandemic outbreak of Coronavirus (Covid-19) has demonstrated the need for proactive containment and prevention measures, including repeated hand-washing. Every single day of delay in proactive intervention has an exponential impact, and countries that acted early were able to contain the disease effectively, thus saving thousands of lives and dollars [9]. Therefore, there exists a dire need for proactive information, in addition to proactive testing, while preventing the spread of misinformation. In this work, we demonstrate an awareness-raising solution, WashKaro, that uses NLP approaches, machine learning and m-Health to combine authentic sources of information with daily news and delivers these in Hindi, the most widely understood local language across India. The application also hosts human-curated and vetted information delivered to the community as audio-visual content in local languages.

We have validated our approach using the following datasets. The WHO Guidelines dataset comprises guidelines obtained from publicly available WHO reports, with special emphasis on Water, Sanitation and Hygiene (WASH). It contains more than 400 WHO articles scraped manually from individual reports owing to the varied format of each report, and the reports span several broad categories. Each entry consists of the title of the guideline, the guideline text, the category it belongs to, and the URL of the published WHO report. The News Articles datasets comprise news articles scraped from publicly available sources; each entry consists of the article headline, the article text, the URL of the article and the date of publication. We have maintained the following news article datasets. The English News Article dataset consists of news articles extracted from 'The Hindu', filtered using the keywords 'Handwash', 'Hygiene', 'sanitation' and 'health'. The Hindi News Article dataset consists of news articles extracted from 'Jagran', filtered using the keywords 'svachta', 'safai', 'haath dhona', 'saaf' and 'haath ragad', which are Hindi translations of 'cleanliness', 'handwash', 'hygiene' and 'sanitation'.
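As a concrete illustration of the keyword-based filtering described above, the following is a minimal sketch in Python. The record fields, helper name and keyword list are assumptions for illustration and are not the authors' code.

```python
# Minimal sketch of keyword-based filtering of scraped news articles.
# Field names ("headline", "text", ...) and the keyword list are illustrative assumptions.

WASH_KEYWORDS = {"handwash", "hygiene", "sanitation", "health"}

def filter_articles(articles):
    """Keep only articles whose headline or body mentions a WASH keyword."""
    relevant = []
    for article in articles:
        combined = (article["headline"] + " " + article["text"]).lower()
        if any(keyword in combined for keyword in WASH_KEYWORDS):
            relevant.append(article)
    return relevant

# Toy usage with a single made-up record.
sample = [{"headline": "Handwash drive launched in schools",
           "text": "The campaign promotes hygiene among children.",
           "url": "https://example.com/article",
           "date": "2020-03-16"}]
print(len(filter_articles(sample)))  # -> 1
```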
In this section, we present our proposed public healthcare intervention workflow, designed and centred around imparting healthcare information effectively. Our methodology is represented in Figure 2 and further explained in the following sections.

The datasets described in section 2 need to be pre-processed in order to transform text from human language into a machine-readable format for further processing. The following pre-processing steps are applied. The entire text is converted into lowercase, which significantly helps with consistency of the expected output. Tokenization is the process of splitting the given text into smaller pieces called tokens; words, numbers, punctuation marks and others can be considered as tokens. Stop words are the most common words in a language, such as "the", "a", "on", "is" and "all"; these words do not carry important meaning and are usually removed from texts. Stemming is a process of reducing words to their word stem, base or root form (for example, books - book, looked - look). The two main algorithms are the Porter stemming algorithm (which removes common morphological and inflexional endings from words [14]) and the Lancaster stemming algorithm (a more aggressive stemming algorithm). We have deployed the Porter stemming algorithm.

The pipeline takes in news articles and WHO reports and constructs a two-level sentence similarity between the titles and the full texts to compute a relevance score. The relevance score thresholds are continuously tuned as more user data are collected. Finally, the relevant texts are translated and converted to speech for consumption in the local language (Hindi).

An embedding transforms text into a form that a machine can process and is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document, its semantic and syntactic similarity, and its relation to other words. An embedding is thus a learned representation of text in which words with similar meaning have similar representations, bridging the human understanding of language and that of a machine. Our methodology employs the following embeddings.

Word2Vec [7] is a statistical method for efficiently learning a standalone word embedding from a text corpus. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words sharing common contexts in the corpus are located in close proximity to one another. Two different learning models can be used as part of the word2vec approach: the Continuous Bag-of-Words (CBOW) model, which learns the embedding by predicting the current word from its context, and the continuous skip-gram model, which learns by predicting the surrounding words given the current word. The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings (more dimensions) to be learned from much larger corpora of text (billions of words).
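The pre-processing steps and the Word2Vec-based word representations described above could be sketched as follows; this is a minimal illustration using NLTK and gensim with a toy corpus and assumed parameters (vector size, window), not the authors' implementation. Sentence vectors are obtained here by simply averaging word vectors.

```python
# Minimal sketch: pre-processing (lowercasing, tokenization, stop-word removal,
# Porter stemming) and averaged Word2Vec sentence vectors.
# Assumes the NLTK "punkt" and "stopwords" resources have been downloaded.
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, tokenize, drop stop words and punctuation, apply Porter stemming."""
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

# Toy corpus standing in for WHO guideline and news article texts.
corpus = ["Wash your hands with soap and water.",
          "Hand hygiene prevents the spread of disease."]
tokenized = [preprocess(doc) for doc in corpus]

# Small CBOW Word2Vec model; real corpora would use far more text and dimensions.
model = Word2Vec(sentences=tokenized, vector_size=50, window=5, min_count=1, sg=0)

def sentence_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens into a single sentence vector."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

print(sentence_vector(tokenized[0], model).shape)  # (50,)
```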
GloVe (Global Vectors for Word Representation) [8] is an extension of the word2vec method for efficiently learning word vectors. Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA), which do a good job of using global text statistics but are not as good as learned methods like word2vec at capturing meaning and demonstrating it on tasks such as computing analogies. GloVe is an approach that marries the global statistics of matrix factorization techniques like LSA with the local context-based learning of word2vec. Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus, resulting in a learning model that may yield generally better word embeddings.

Google Sentence Encoder [3] encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs, on a variety of data sources and tasks, with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. The universal-sentence-encoder-large model is trained with a Transformer encoder.

Tf-idf stands for term frequency-inverse document frequency; the tf-idf weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus. The weight is composed of two parts: TF, which measures how frequently a term occurs in a document, tf_i = N_i / N, where N_i is the number of times term i appears in the document and N is the total number of terms in the document; and IDF, which measures how important a term is by giving higher weight to words occurring in only a few documents, idf_i = log(D / df_i), where D is the total number of documents and df_i is the number of documents in which word i occurs.

In order to generate a similarity score between a news article and a WHO guideline, the following similarity metrics have been used.

Cosine similarity is a metric used to measure the similarity between two documents and is independent of the size of the documents. It calculates similarity by measuring the cosine of the angle between two vectors: for document vectors A and B, cos(θ) = (A · B) / (||A|| ||B||). Mathematically speaking, cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together; the smaller the angle, the higher the cosine similarity.
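A minimal sketch of a TF-IDF weighted cosine similarity between a news article and a WHO guideline, using scikit-learn, might look as follows; the example texts are invented and the exact weighting scheme of the deployed models is not reproduced here.

```python
# Minimal sketch: TF-IDF vectors and cosine similarity between two short texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

news_article = "City launches handwashing campaign to curb infections."
who_guideline = "Wash hands frequently with soap and water to prevent infection."

# Fit a shared TF-IDF vocabulary over both texts, then compare the two vectors.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform([news_article, who_guideline])

score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"relevance score: {score:.3f}")
```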
Word Mover's Distance (WMD) builds on the observation that distances between embedded word vectors are to some degree semantically meaningful. It exploits this property of word vector embeddings and treats text documents as a weighted point cloud of embedded words. The WMD between two text documents is the minimum cumulative distance that the embedded words of one document need to "travel" to reach the embedded words of the other document. This distance metric can be cast as an instance of the Earth Mover's Distance, a well-studied transportation problem for which several highly efficient solvers have been developed. WMD therefore enables us to assess the "distance" between two documents in a meaningful way even when they have no words in common, capturing the (dis)similarity between the two sentences. The method also uses the bag-of-words representation of the documents (simply put, the word frequencies in the documents). The intuition behind the method is that we find the minimum "travelling distance" between documents, in other words the most efficient way to "move" the distribution of document 1 to the distribution of document 2.

A baseline cut-off similarity score is maintained. All news articles are mapped against the WHO guidelines, which results in the generation of a similarity score; all pairs above the cut-off score are accepted and published to the user, ensuring that only relevant pairs are delivered. User feedback is obtained at the end of each pairing: the current cut-off score increases by a small margin when a user marks a pair as irrelevant and decreases when a user marks a pair as relevant.

The news article and matched WHO guideline with a relevance score greater than the cut-off are converted into the local language (Hindi) using the Google Cloud Platform. Google's pretrained neural machine translation delivers fast and dynamic translation results, and Google Cloud's Text-to-Speech converts text into human-like speech.

As our entire methodology is aimed at efficiently providing healthcare information to the masses, our success needs to be measured by its acceptability to the masses. Inter-rater reliability is the extent to which two or more raters (or observers, coders, examiners) agree; it addresses the consistency of the implementation of a rating system. Inter-rater reliability can be evaluated using a number of different statistics; high values indicate a high degree of agreement between two examiners, and low values a low degree of agreement. We have evaluated our performance using the following commonly accepted statistics.

Percentage agreement amongst raters is calculated as the number of agreement scores divided by the total number of scores. With multiple raters this becomes more involved: if there are n raters, n(n-1)/2 pairwise combinations have to be checked. It also does not take chance agreement into account and therefore overestimates the level of agreement, so it is not reliable on its own. Cohen's kappa coefficient [6] is a statistic used to measure inter-rater reliability for qualitative (categorical) items. It is generally considered a more robust measure than the simple percentage agreement calculation, since kappa takes chance agreement into account. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The coefficient is given by kappa = (p_o - p_e) / (1 - p_e), where p_o is the relative observed agreement among raters (identical to accuracy) and p_e is the hypothetical probability of chance agreement, computed from the observed data as the probability of each observer randomly assigning each category.
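These two agreement statistics can be computed for multiple raters by averaging over all pairwise combinations, as in the following minimal sketch using scikit-learn; the toy ratings (three raters, five pairs) are invented, whereas the study itself used eight raters.

```python
# Minimal sketch: pairwise percentage agreement and Cohen's kappa for several raters.
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy binary relevance labels (1 = relevant, 0 = irrelevant) from three raters
# over five news/guideline pairs.
ratings = np.array([[1, 0, 1, 1, 0],
                    [1, 0, 1, 0, 0],
                    [1, 1, 1, 1, 0]])

pairs = list(combinations(range(ratings.shape[0]), 2))  # n(n-1)/2 rater pairs

percent_agreement = np.mean([np.mean(ratings[i] == ratings[j]) for i, j in pairs])
mean_kappa = np.mean([cohen_kappa_score(ratings[i], ratings[j]) for i, j in pairs])

print(f"percentage agreement: {percent_agreement:.3f}")
print(f"mean Cohen's kappa:   {mean_kappa:.3f}")
```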
The first step of our experimentation involved the creation of the datasets mentioned in section 2. The WHO Guidelines dataset was scraped manually, and a total of nearly 400 WHO articles were stored in a local CSV file. English and Hindi news articles of the current day were also scraped and stored in their respective datasets. In the first set of experiments, the WHO Guidelines and English News Articles datasets were used. These datasets were pre-processed by removing unwanted characters, as described in section 3.1. After pre-processing, pairs were generated in which the current day's news articles are mapped against the entire WHO guidelines database. Pair generation is followed by evaluating the sentence similarity between the news article text and the WHO article text. After the generation of pairs, the set of models listed in Table 1 is applied to each set of pairs. Table 1 specifies the following technical details:
• the embedding used for each model,
• whether pre-processing was done, and
• whether TF-IDF weighting was done.
After running these models on each of the four sets of pairs generated, the resulting relevance scores are obtained. The text similarity models work as shown in Figure 3, and the scores differ from one model to another. The obtained relevance scores are sorted, and the top 5 pairs obtained from each similarity model on a particular set of pairs are filtered out. These top pairs were presented to reviewers to classify as relevant or irrelevant (binary classification). A reviewer classifies a pair as relevant if the news article and the WHO guideline are related and can be clubbed together, and as irrelevant if they have nothing in common. Since the notion of relevance is subjective, a total of 8 reviewers were used. The reviewers were given a set of 65 stories in total (the top 5 stories of each model) and were asked to press 1 if they thought the news article and the WHO report were relevant and 0 otherwise. To find the best model, the percentage agreement and Kappa score are calculated for each of the models, and the model with the highest Kappa score is selected; the best model selected on the basis of these scores is also used for future tasks. After the model is selected, we translate both the WHO dataset and the news articles into Hindi, and the translated text is converted into speech for ease of access. The methodology incorporates a feedback system wherein, for each news article and WHO report presented to the user, the user can classify the matching as relevant or irrelevant. After learning from this feedback, only the stories that have a similarity score above the learned baseline (cut-off score) are provided to the end-user. This decision boundary improves with each piece of feedback and ensures delivery of effective content: with each relevant/irrelevant judgement given by the user, the value of the threshold changes, thus pushing only the relevant stories. The same workflow is followed for the Hindi news articles, which are translated into English before running the sentence similarity models.
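The feedback-driven cut-off described above can be pictured as a simple update rule; the following is a minimal sketch in which the step size and bounds are illustrative assumptions rather than values reported for the deployed system.

```python
# Minimal sketch of the feedback-driven relevance cut-off.
class RelevanceCutoff:
    """Adaptive cut-off for publishing news/WHO-guideline pairs.

    The cut-off rises when a delivered pair is marked irrelevant and falls
    when it is marked relevant. Step size and bounds are illustrative assumptions.
    """

    def __init__(self, cutoff=0.5, step=0.01, lower=0.1, upper=0.9):
        self.cutoff = cutoff
        self.step = step
        self.lower = lower
        self.upper = upper

    def should_publish(self, relevance_score):
        return relevance_score > self.cutoff

    def update(self, marked_relevant):
        delta = -self.step if marked_relevant else self.step
        self.cutoff = min(self.upper, max(self.lower, self.cutoff + delta))

# Example: a user flags a delivered pair as irrelevant, slightly raising the bar.
policy = RelevanceCutoff()
policy.update(marked_relevant=False)
print(round(policy.cutoff, 2))  # 0.51
```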
As noted above, the resulting pairs are provided to the users in the form of Hindi text and speech. To evaluate the performance of our healthcare intervention methodology, we relied on the inter-rater reliability metrics described in section 4. After calculating the similarity scores for a set of news articles and WHO guidelines using the models defined in Table 1, 8 different users were asked to classify the matched pairs as relevant or irrelevant. At this stage, the cut-off was set to 0.5, and all pairs with a relevance score greater than the cut-off were provided to these users for rating. We calculated the percentage agreement and Kappa score for these 8 raters on the pairs of news articles and WHO guidelines; the results are shown in Table 2 and Table 3 for the Hindi and English news article datasets described in section 2.

Table 1: Combinations of approaches tested. These included pre-processing, word-embedding models (Word2Vec, GloVe, Google Sentence Encoder), TF-IDF weighting and similarity metrics (cosine similarity, Word Mover Distance) for evaluation by eight human users. The model yielding the highest agreement among humans was selected for deployment.

It is seen that Cohen's Kappa provided better discrimination among the models and was therefore used for model selection. In both cases, model 2 (pre-processing + Word2Vec embedding + cosine similarity) gave the best results, with a Kappa score of 0.54912 and a percentage agreement of 79.2420%. As seen in Fig. 4 and Fig. 5, the Hindi news article dataset provided better results in terms of both metrics. Hence, model 2 was chosen for the purpose of this app.

Our Android application WashKaro is available for free download on the Google Play Store at https://play.google.com/store/apps/details?id=inspire2connect.inspire2connect. The application feed displays various news articles related to sanitation and hygiene, and every news article has a corresponding WHO guideline matched with it. The entire news article is provided as text and as an audio file in the local language. Users can switch between the news article and the matched WHO guideline and, after going through the article, can review the corresponding matching as relevant or irrelevant. The feedback from the user is used to improve the cut-off score. As this application is targeted towards the less educated sections of society, an onboarding section is added to help first-time users. This is followed by an optional questionnaire comprising a few basic questions on patient demographics to help us understand the prevalence and impact of the planned interventions.

To the best of our knowledge, this is the first application that demonstrates the use of state-of-the-art machine learning and m-Health technologies to specifically address the issue of ongoing WASH awareness in a local language in India. It is a daily-learning platform that allows user feedback on the relevance of content. The results of the technical approach taken in this work were evaluated by a panel of eight human users to choose the most appropriate model. However, this study has several limitations. The models have been trained on a relatively small corpus, and we have only implemented the approach in Hindi, which is the most widely understood language in India. We do plan to incorporate more languages and local context into the application.
All the humans evaluating the models were from a similar educational background; we hope to overcome this limitation through the feedback obtained from users of the WashKaro app. We also plan to devise a ranking score for feedback providers based upon their reputation score for public health, published via an accompanying website. Finally, the most important limitation is the lack of assessment of the interventional impact of this application. We plan to address this through a phased roll-out with the primary health clinics in Delhi and appropriate partnerships delivering digital health interventions on the ground. Regardless, our current work highlights the potential of machine learning, m-Health and natural language processing in addressing primary health challenges and provides a framework for replicating such studies across a variety of public health challenges, including the Covid-19 pandemic.

References
[1] India's water and sanitation crisis.
[2] Progress on household drinking water, sanitation and hygiene 2000-2017: special focus on inequalities. New York: United Nations Children's Fund (UNICEF) and World Health Organization.
[3] Universal sentence encoder.
[4] Digital India: technology to transform a connected nation.
[5] Emerging mHealth: paths for growth.
[6] Interrater reliability: the kappa statistic.
[7] Distributed representations of words and phrases and their compositionality.
[8] GloVe: global vectors for word representation.
[9] Response to Covid-19 in Taiwan: big data analytics, new technology, and proactive testing.