A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19
David Oniani and Yanshan Wang
2020-06-19

ABSTRACT
COVID-19 has resulted in an ongoing pandemic and, as of 12 June 2020, has caused more than 7.4 million cases and over 418,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there are always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain to meet information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied four different approaches, namely tf-idf, BERT, BioBERT, and USE, to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks. Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online.

1 INTRODUCTION
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [22]. As of 12 June 2020, more than 7.4 million cases have been recorded, resulting in over 418,000 deaths [44]. The sudden global outbreak of COVID-19 left millions of people quarantined due to social distancing measures. Additionally, the COVID-19 pandemic caused a historic rise in mental health problems, such as depression, post-traumatic stress disorder, and suicide, due to statewide quarantines. People are isolated and stressed, and may develop long-term psychological consequences that persist beyond the quarantine period [36][16][21]. As a result, people largely rely on online and web-based resources for news and updates concerning COVID-19. Given that many web sources currently do not hold accurate information about the pandemic and misinformation campaigns are running rampant [39], it is critically important that people and patients receive accurate, up-to-date, and useful information regarding COVID-19. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there are always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately.
To address these issues, we propose to develop a chatbot enhanced by neural language models that is able to automatically answer questions related to COVID-19 through conversational interactions. A conversational chatbot is software that can conduct a conversation via text and/or other means. There are different taxonomies for conversational chatbots. Based on how the natural language conversations are generated, there are two main categories: script chatbots and intelligent chatbots. The entire interaction in a script chatbot is based on a pre-determined model that determines what the chatbot can and cannot do. The "script" is usually a decision tree, manually crafted by domain experts, that determines which specific path to take given the response to a question. Developing such conversation decision trees is very labor-expensive, and the results rarely generalize. The intelligent chatbot is built using Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques that automatically generate natural language on the back end. With the advancements in AI and NLP, the functionality and the performance of modern chatbots have improved dramatically. However, these techniques are rarely applied and evaluated in the healthcare domain to meet information needs with accurate, up-to-date, and interactive healthcare information. The outbreak of COVID-19 has motivated us to develop a chatbot with advanced NLP techniques and to evaluate the approach in automatically answering questions related to COVID-19. To the best of our knowledge, this is the first study of its kind. Our contributions are:
• We applied and compared the performance of four embedding generation approaches, namely tf-idf (Term Frequency - Inverse Document Frequency) [18], Bidirectional Encoder Representations from Transformers (BERT) [42], BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) [28], and Universal Sentence Encoder (USE) [14], for refining the automatically generated answers.
• We proposed a qualitative evaluation guideline for automatic question-answering for COVID-19.
• We assessed the performance of the proposed "hybrid" approach for automatic question-answering for COVID-19.
• We built a web-based chatbot using the language model to facilitate question-answering for users.
This paper is organized as follows. We discuss the related work in Section 2. Section 3 is dedicated to materials and Section 4 to the proposed approach. We report the chatbot evaluation strategy and the experimental results in Sections 5 and 6, respectively. Finally, we discuss a web-based chatbot built with the proposed model and future work in Section 7, and conclude the work in Section 8.

2 RELATED WORK
Recent neural language models for dialogue generation offer great promise for generating responses for intelligent chatbots. The LSTM (Long Short-Term Memory) sequence-to-sequence (seq2seq) model is one type of neural generation model that maximizes the probability of generating a response given the previous dialogue turn [37][19][41]. XLNet uses the context of a word to predict the next word, where the context is constrained to two directions (backward or forward) [45]. SAM (Self-Attentive Associative Memory) is a technique where two memories are wired into a single sequential model capable of both memorization and relational reasoning [26].
In the GPT-2 domain, Lee and Hsiang [29] fine-tuned GPT-2 for generating patent claims. Klein and Nabi [25] applied GPT-2 in conjunction with BERT for automatic question generation. Zhang et al. developed DialoGPT, a large and tunable neural conversational model based on GPT-2 [48]. Lee et al. developed RecipeGPT for automatic generation of cooking recipes by fine-tuning GPT-2 on a large cooking recipe dataset [27]. We are unaware of prior work applying the GPT-2 model to CORD-19 for transfer learning. As for work comparing pretrained AI models, Jin et al. conducted probing experiments comparing BERT, ELMo [31], and BioBERT. Sharma and Daniel [40] compared the performance of BERT networks to that of FLAIR [12]. In the general AI-based chatbot domain, Serban et al. [38] applied deep reinforcement learning to build a conversational AI chatbot. Adiwardana et al. [11] developed a multi-turn open-domain chatbot trained end-to-end on data mined from social media conversations. Yin et al. [47] developed a deep learning based chatbot for psychological therapy purposes. Semantic similarity of texts, on the other hand, has been studied for a long time, and recent breakthroughs have allowed for the development of new models such as BERT, BioBERT, and Universal Sentence Encoder (USE).

Today, one of the state-of-the-art conversational AI models is GPT-2. GPT-2 is a pretrained model, so we applied transfer learning, utilizing CORD-19 for retraining. The resulting chatbot gave irregularly long responses that would not be typical of a human. We therefore decided to further filter the responses by applying embedding generation algorithms and models such as tf-idf, BERT, BioBERT, and USE, and then using semantic similarity approaches such as cosine similarity and inner product. In other words, we first let a human ask a question and make GPT-2 come up with an answer. We then further processed the response with additional filters and, ultimately, applied an embedding generation model to find the sentences most relevant to the question. Cosine similarity is one of the most commonly used approaches for calculating the semantic similarity of texts, and many NLP applications need to compute the semantic similarity between two short texts. Its flexibility allows one to apply it under virtually any settings, as long as documents can be represented as vectors. Moreover, computing cosine similarity is usually not time-consuming, so it is also commonly used for benchmarking purposes [49]. Our study has produced a chatbot that is both performant and extensible. The additional layer of filters has shown success in selecting relevant sentences. The chatbot can also be retrained and readjusted on new data, in case there are new discoveries or scientific achievements related to COVID-19. Furthermore, the chatbot responses were annotated by medical experts and the results were consistent across the annotators.

3 MATERIALS
The White House Office of Science and Technology Policy, alongside a coalition of leading research groups, has released a machine-readable COVID-19 dataset, the COVID-19 Open Research Dataset (CORD-19) [2].
It consisted of over 128,000 scholarly articles regarding COVID-19, SARS-CoV-2, and related coronaviruses, including over 59,000 with full text, and called on researchers globally to develop text and data mining tools for finding answers to questions within this content, in support of the ongoing COVID-19 response efforts worldwide [30]. We used CORD-19 to train a language model that would automatically answer questions related to COVID-19. The chatbot would not only help improve information acquisition, but also serve as a knowledge base for COVID-19. We harvested the data from the initial commercial use subset of CORD-19, containing 9,000 scholarly articles in the form of JSON files. We extracted the abstract and the main body of the article from every JSON file, combined them, and used the result as a corpus for retraining the language model.

4 PROPOSED APPROACH
We applied a hybrid approach for generating responses: GPT-2 was used to generate the answer to the question, then an additional filtering step was applied to prune irrelevant sentences from the answer, and subsequently, semantic similarity methods were employed to retain the sentences that are most semantically similar to the question. This hybrid approach to response generation produced high-quality answers to COVID-19-related questions. Figure 1 illustrates the pipeline of the proposed approach. GPT-2 has a Transformer-based [43] architecture which, in many ways, is similar to the OpenAI GPT model [34][33]. OpenAI released a total of four GPT-2 models, with 124 million (124M), 355 million (355M), 774 million (774M), and 1.5 billion (1.5B) parameters [4]. While the model with 1.5 billion parameters showed the best results in the original paper [34], in our experiments we found it difficult to fine-tune and use for transfer learning. Moreover, training was unbearably slow, even when run on the TPUs (Tensor Processing Units) provided by Google Colaboratory [20], which we used as our training environment. We therefore utilized the 774M model and ran transfer learning for 2,500 iterations with a batch size of 8. After 2,000 iterations the loss was no longer decreasing, so we let the language model train for an additional 500 iterations and then stopped the training. The batch size of 8 was chosen due to the memory limitations of Google Colaboratory. As for the optimizer, we used Adam [23] and set the learning rate to 0.0001 (1e-4). Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [23]. It is highly memory-efficient and has shown good results in retraining our chatbot. We also tried SGD [24], but Adam showed better performance and hence we released the Adam-based retrained model. The original GPT-2 was written in TensorFlow [10], and this is the version we used; for retraining, we applied the TPU-trainable version of GPT-2 [32]. As for the hardware, Google Colaboratory provided us with cloud TPUs and training capabilities. It came with 25 GB of RAM, and since we connected Colab to Google Drive [20], we had enough storage to do transfer learning. The link for downloading the model is available on our GitHub page [17].
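As a rough illustration of this training configuration, the sketch below uses the gpt-2-simple wrapper rather than the exact TPU-trainable fork we used [32]; the dataset path is a hypothetical stand-in for the concatenated CORD-19 corpus.

```python
# A minimal fine-tuning sketch, assuming the gpt-2-simple wrapper;
# "cord19.txt" is a hypothetical path to the extracted CORD-19 corpus.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="774M")  # fetch the 774M pretrained weights

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="cord19.txt",   # abstracts + article bodies from the JSON files
    model_name="774M",
    steps=2500,             # 2,500 iterations, as described above
    batch_size=8,
    learning_rate=1e-4,     # Adam is the library's default optimizer
    save_every=500,
)
```

The GPT-2 responses are usually very lengthy, and for the most part the answer is not relevant to the question.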
To prune the responses generated from GPT-2, we first chunked the answer into a list of sentences using Python's built-in module for regular expressions (re [6]) and then, for each answer in the list of answers, performed the following regex/string operations:
(1) Eliminated redundant spaces
(2) Eliminated extra punctuation marks (specifically, ". ", "!", and "?")
(3) Removed redundant parentheses and square brackets
(4) Further split the sentence into separate sentences if it contained a period (". ")
Steps 2 and 3 employed the re module, while for steps 1 and 4 the built-in string operations were sufficient (hence, no external module was needed). These operations significantly improved the quality of the answers and allowed us to pass them directly to the pretrained models for generating embeddings.

Semantic similarity is a metric that quantifies the degree to which two texts or text documents are similar to each other. The two approaches we used are cosine similarity and inner product. The difference between the two is that cosine similarity considers only the angle between the vectors, while the inner product takes into account both the angle and the magnitude. That said, for normalized data, the two approaches are nearly equivalent. To put each sentence in a vector representation, we tested and applied four different approaches for generating embeddings:
• tf-idf [7]: a simple, tf-idf based embedding-generation method.
• BERT [42]: embeddings generated by the pretrained BERT model.
• BioBERT [28]: embeddings generated by BERT pretrained on biomedical text.
• USE [14][8]: embeddings generated by the Universal Sentence Encoder.
In all cases, a similar strategy was applied for filtering sentences. The following equation defines the embedding generation process:

E = emb(S), where S = (s_1, s_2, ..., s_n, q)

Here S denotes the list of sentences obtained by the regex/string-based split, plus the question; emb denotes one of the embedding generation approaches (i.e., tf-idf, BERT, BioBERT, or USE); q denotes the question being asked; and E denotes the embeddings generated for the list of sentences. In other words, the list of sentences is passed to the embedding generation function and a feature matrix is returned. Once the embeddings were generated, we applied cosine similarity or the inner product to rank the sentences. The inner product was used only with USE, since its embeddings are approximately normalized. The following equation shows the ranking process:

M_ij = sim(e_i, e_j), where e_1, e_2, ..., e_n, e_q ∈ E

Here sim is a function that computes the cosine similarity or the inner product between two embeddings. For every sentence embedding e_1, e_2, ..., e_n, the approach calculates the similarity to the question embedding e_q. M is a similarity matrix where every column is the similarity vector for a particular sentence: it contains the similarity values of that sentence with respect to all the sentences (including itself). The last column of the matrix holds the similarity measures with respect to the question embedding. Finally, we ranked the sentences based on their similarity scores to the question, selected the top five most similar sentences using Python's numpy [5] package, and concatenated them into a single, unified answer.
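To make the pipeline concrete, the following sketch strings the steps together for the tf-idf variant; the function names are our own, not the released code, and BERT, BioBERT, or USE would replace the tf-idf encoder in practice.

```python
# A minimal sketch of the filtering pipeline (our own naming; tf-idf
# variant shown -- BERT, BioBERT, or USE would replace the encoder).
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_and_split(response):
    """Steps 1-4: normalize spacing/punctuation, then split into sentences."""
    response = re.sub(r"\s+", " ", response)           # (1) redundant spaces
    response = re.sub(r"([.!?])\1+", r"\1", response)  # (2) repeated punctuation
    response = re.sub(r"[()\[\]]", "", response)       # (3) parentheses/brackets
    return [s.strip() for s in response.split(". ") if s.strip()]  # (4) split

def top_k_sentences(question, response, k=5):
    sentences = clean_and_split(response)
    # E = emb(S): embed the sentences plus the question as the last row.
    E = TfidfVectorizer().fit_transform(sentences + [question])
    # The last column of the similarity matrix M scores sentences against q.
    sims = cosine_similarity(E)[:-1, -1]
    top = np.argsort(sims)[::-1][:k]   # rank, keep the top-k sentences
    # sorted(top) restores the original sentence order before concatenation.
    return " ".join(sentences[i] for i in sorted(top))
```

Since cosine similarity ignores vector magnitude, the tf-idf rows need no explicit normalization here; with USE's approximately normalized embeddings, the inner product would behave almost identically.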
We show example responses in the form of a Human-Bot conversation generated by the GPT-2 language model below.

Human: What do we know about vaccines and therapeutics of COVID-19?
Bot: Because of their critical role in viral pathogenesis and host defense against infections, several drugs have been developed against COVID-19. With this in mind, we present evidence that drugs that inhibit pro-inflammatory responses (e.g., IL-6) have the potential to inhibit COVID-19 infection and viral replication in nonhuman primates after administration up to 12 days following infection. Because of their critical role in viral pathogenesis and host defense against infections, several drugs have been developed against COVID-19.

Human: What do we know about COVID-19 risk factors?
Bot: However, COVID-19 is now a threat to global health security. According to the latest World Health Organization report, a large number of travel is seen as one of the major causes of COVID-19 spread.

Human: What is known about transmission, incubation, and environmental stability of COVID-19?
Bot: For example, a number of reports have suggested that animal and human hosts may have provided the necessary environment for the transmission of COVID-19 in people and poultry. In fact, all known animal hosts that have been associated with animal-to-human transmission of COVID-19 are wild birds, including chickens, geese, mule deer, herons, and turkey. In fact, all known animal hosts that have been associated with animal-to-human transmission of COVID-19 are wild birds, including chickens, geese, mule deer, herons, and turkey.

The dialogue shows a consistent pattern in generating responses. The response to the first question, regarding vaccines and therapeutics, states that "several drugs have been developed against COVID-19." Furthermore, it says that "drugs that inhibit pro-inflammatory responses (e.g., IL-6) have the potential to inhibit COVID-19 infection and viral replication in nonhuman primates." The response to the second question, concerning risk factors, addressed the question directly by stating that "a large number of travel is seen as one of the major causes of COVID-19 spread." As for the third question, about transmission, incubation, and environmental stability of COVID-19, the response suggests that animal and human hosts "may have provided the necessary environment for the transmission of COVID-19" and additionally talks about "animal-to-human transmission." In all cases, sentences were highly readable and understandable. That said, in some cases the same sentences were repeated due to how the hybrid approach was implemented. This can be avoided, as we discuss in Section 7.

5 EVALUATION
In order to evaluate the performance of the proposed approaches as well as the overall performance of the chatbot, it is crucial to have a set of questions that are both frequently asked and related to COVID-19. For this purpose, we decided to use 12 questions from Kaggle's COVID-19 Open Research Dataset Challenge (CORD-19) [15]. Most of the questions included the term "COVID-19"; to those that did not, we appended the term to the end of the question. Table 1 presents all 12 questions. For each of the 12 questions, we generated five different answers with each of the four embedding generation techniques, resulting in a total of 240 answers. That is, the response to every question was generated exactly five times by each technique. This ensured a fair and consistent distribution of both the questions and the approaches across the dataset. We made all of the answers publicly available on GitHub [1].
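As a quick illustration of how the 240-answer evaluation set follows from this design, the sketch below enumerates the combinations; generate_answer is a hypothetical stand-in for the GPT-2-plus-filtering pipeline, not the paper's code.

```python
# Sketch of assembling the evaluation set: 12 questions x 4 approaches
# x 5 samples = 240 answers. generate_answer is a hypothetical stub.
from itertools import product

questions = [f"Question {i}" for i in range(1, 13)]   # 12 CORD-19 questions
approaches = ["tf-idf", "BERT", "BioBERT", "USE"]

def generate_answer(question, approach):
    return f"[answer to {question!r} filtered with {approach}]"  # stub

answers = [
    (q, a, generate_answer(q, a))
    for q, a in product(questions, approaches)
    for _ in range(5)
]
assert len(answers) == 240
```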
We then asked two experienced medical experts to evaluate the quality of these responses by assigning relevance scores according to the categories in Table 2. Having five categories allowed for flexibility and diversity of judgments, as well as a broad range of scores, which ultimately gave us a better way to evaluate our approaches. The evaluation was done primarily by averaging the scores for a particular approach.

[Table 1: the 12 evaluation questions, including "What has been published about information sharing and inter-sectoral collaboration of COVID-19?", "What do we know about vaccines and therapeutics of COVID-19?", and "What do we know about non-pharmaceutical interventions of COVID-19?"]

Our annotation process had two phases. In the first phase, we let the annotators evaluate a test subset of the responses generated by the language model, comprising 20 responses. We then computed the IAA (Inter-Annotator Agreement), which was approximately equal to 0.389. Because the responses are rated on a five-point scale, we used the Pearson correlation coefficient to compute the IAA (as opposed to Cohen's kappa, etc.). This low correlation led us to hold a meeting with both annotators, where we discussed why they had given different scores to particular responses. Finally, both annotators reached an agreement and gave the same scores for every item in the test subset of 20. Once the agreement was reached, we let the annotators evaluate the remaining 220 responses. Note that we evaluated our model on all 240 responses, including the initial subset on which both annotators had agreed; this was done for the sake of fairness and consistency.

Table 2. Relevance categories used for scoring.
• Relevant (5 points): the answer partially or fully answers the question and/or makes clear attempts to do so, and is related to the question.
• Well-formed (4 points): the answer makes logical sense and is somewhat related to both the question and COVID-19, yet it does not (partially or fully) answer the question.
• Informative (3 points): the answer is not related to the question, but provides some information about COVID-19 and makes logical sense.
• Acceptable (2 points): the answer makes some logical sense and is weakly related to the question or COVID-19, but is mostly difficult to understand.
• Poor (1 point): the answer is totally unrelated to the question or COVID-19 and/or does not make logical sense.

6 EMPIRICAL RESULTS
6.1 Performance by Approach
Table 3 lists the evaluation results of the different approaches. It shows the approach, the average scores for each annotator, and the overall average across the annotators. The first annotator rated BERT as the best approach, with an average score of 4.167. BioBERT showed slightly worse performance than BERT, with a score of 4.133. The tf-idf approach performed well with a score of 3.967, yet it could not outperform either BERT or BioBERT. USE had the worst performance of all the embedding generation techniques, with a score of 3.683 out of 5. The second annotator similarly gave the highest average score to BERT (4.283). USE was second best with a score of 4.083, followed by BioBERT with the approximately equal score of 4.067. The tf-idf approach yielded the worst results, rated 3.8. In general, the results are consistent between the two annotators, with an inter-annotator agreement of 0.521, again calculated using the Pearson correlation.
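As a small illustration of this agreement computation (the score vectors below are made up for the example; the paper reports r of roughly 0.389 on the pilot subset and 0.521 overall):

```python
# A minimal sketch of the IAA computation via Pearson correlation.
# The score vectors below are illustrative, not the paper's data.
from scipy.stats import pearsonr

annotator_1 = [5, 4, 3, 5, 2, 4, 5, 3]   # relevance scores on a 1-5 scale
annotator_2 = [4, 4, 3, 5, 3, 4, 5, 2]

iaa, p_value = pearsonr(annotator_1, annotator_2)
print(f"Inter-annotator agreement (Pearson r): {iaa:.3f}")
```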
Models from the BERT family showed the best performance in automatically answering COVID-19 questions, with BERT (average score 4.225) slightly outperforming BioBERT (4.100). The tf-idf approach and USE show roughly similar performance (3.884 vs. 3.883), yet both are inferior to BERT and BioBERT. All four approaches, on average, can be considered to fall in the "Well-formed" category, with BERT and BioBERT being close to the "Relevant" category. The overall average was 4.023 (Well-formed).

6.2 Performance by Question
It should be noted that the three questions with the highest average scores seem to be rather short in length. The responses to question #7, on the other hand, had the worst average score and, interestingly, both annotators gave it the same score of 2.667. That said, this question is also one of the shortest, so length does not always seem to correlate with the score (and hence with performance). To further analyze why responses to question #7 had the lowest average score, we determined whether the terms of the question are present in the dataset. The terms "inter-sectoral collaboration" and "information sharing" were both present in CORD-19. Therefore, it is likely that the issue stems from the model itself and not the dataset. According to the scores, we also find that the terms in the question have some correlation with the score. For example, the questions that featured words strongly linked to COVID-19, such as "virus" in question #3, "vaccine" in question #11, and "risk" in question #9, had higher average response scores than those that did not (e.g., question #10).

7 DISCUSSION
The project had several limitations. First, due to hardware constraints and the difficulty of fine-tuning, we did not use the larger 1.5B GPT-2 model, which could potentially yield better results in generating responses. Second, the question pool was limited, comprising only 12 questions. Additionally, we tried only four specific embedding generation approaches, which might not be a fair representation of all such techniques in the domains of AI and NLP.

In order to make the language model more accessible to the general audience for automating response generation, we built a web-based chatbot using the trained GPT-2 with options for the tf-idf, BERT, BioBERT, and USE approaches. Please find the released code on our GitHub. The application is powered by Python's Flask [9] package and provides a simple, user-friendly interface for interactive communication with the chatbot. Please note that the health information generated by the chatbot is for general research purposes only. It is not a diagnostic tool, nor is it a substitute for medical advice or treatment for specific conditions.

Although our work has demonstrated the feasibility of using language models for automatically answering COVID-19 questions, much can be done in further research. First, we would like to explore why certain questions received higher scores than others. Second, other approaches for generating sentence embeddings, such as BioWordVec [46], could potentially improve the performance of the chatbot and are another avenue for exploration. From the dialogue presented in Section 4, it is clear that GPT-2 can generate duplicate sentences that may be irrelevant to the question. In that case, the same sentence might be repeated in the final answer. One could incorporate an additional simple step of eliminating duplicate sentences, which could potentially improve the quality of the answers (see the sketch below). Adding an additional, third layer of filtering could also be tested to check whether it improves the quality of the responses.
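A minimal sketch of such a duplicate-elimination step (our own illustration, not part of the released implementation):

```python
# Drop repeated sentences while preserving their original order.
def deduplicate(sentences):
    seen = set()
    unique = []
    for s in sentences:
        key = s.strip().lower()   # normalize so identical repeats match
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

print(deduplicate(["A fact.", "Another fact.", "A fact."]))
# -> ['A fact.', 'Another fact.']
```

Such a step would have removed the repeated sentence in the first and third example answers shown earlier.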
Additionally, the GPT-2 model can always be further retrained on a new corpus, which could potentially improve the results. The 1.5B GPT-2 model could also be applied for retraining. Finally, given that a larger GPT language model (GPT-3) was recently released [13], we believe it is feasible, with capable hardware, for the chatbot to explore the possibilities of this model, which could also evolve into interesting future work.

8 CONCLUSION
In this paper, we applied the GPT-2 language model to automatically answer questions related to COVID-19 and qualitatively evaluated the proposed approach. To refine the responses generated by GPT-2, we compared four different embedding generation techniques, namely tf-idf, BERT, BioBERT, and USE. We utilized the corpus collected from the CORD-19 task to retrain the GPT-2 model, and evaluated the automatically generated answers on twelve questions from CORD-19. The results were rated by two medical experts and, in general, were consistent between the two annotators. The empirical results show that BERT achieved the best performance in automatically answering COVID-19 questions. We also built a web-based chatbot using the trained GPT-2 model and open-sourced the code.

ACKNOWLEDGMENTS
This work was supported by NIH grant R01LM11934, the Mayo Clinic Center for Health Equity and Community Engagement Research Award, and the Mayo Clinic Office of Patient Education. The funders had no role in the design of the study; in the collection, analysis, and interpretation of data; or in the preparation of the manuscript. The views presented in this report are not necessarily representative of the funders' views and belong solely to the authors.

REFERENCES
[1] 2020. Retrieved 15-05-2020 from
[2] 2020. Retrieved 21-06-2020 from
[5] 2020. Retrieved 21-06-2020 from https://numpy.org/
[6] 2020. re - Regular expression operations
[7] 2020. TfidfVectorizer.html
[8] 2020. Tensorflow Hub: Universal Sentence Encoder (Version 3, Large)
[9] 2020. Welcome to Flask - Flask Documentation
[10] TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems
[11] Towards a Human-like Open-Domain Chatbot
[12] Contextual String Embeddings for Sequence Labeling
[13] Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners
[14] Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal Sentence Encoder
[16] 2020. A Novel Approach of Consultation on 2019 Novel Coronavirus (COVID-19)-Related Psychological and Mental Problems: Structured Letter Therapy
[17] GitHub: COVID-19 Chatbot
[18] A Vector Space Model for Automatic Indexing
[19] Intent Classification in Question-Answering Using LSTM Architectures
[21] Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science
[23] Adam: A Method for Stochastic Optimization
[24] Stochastic Estimation of the Maximum of a Regression Function
[25] Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds
[26] Self-Attentive Associative Memory
[27] RecipeGPT: Generative Pre-training Based Cooking Recipe Generation and Evaluation System
[28] BioBERT: a pre-trained biomedical language representation model for biomedical text mining
[29] Patent Claim Generation by Fine-Tuning OpenAI GPT-2
[30] The White House Office of Science and Technology Policy. 2020. Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset
[31] Deep contextualized word representations
[33] Improving Language Understanding by Generative Pre-Training
[34] Language Models are Unsupervised Multitask Learners
[36] COVID-19 and mental health: a review of the existing literature
[37] Long Short-Term Memory
[38] Sarath Chandar, et al. A Deep Reinforcement Learning Chatbot
[39] An Exploratory Study of COVID-19 Misinformation on Twitter
[40] BioFLAIR: Pretrained Pooled Contextualized Embeddings for Biomedical Sequence Labeling Tasks
[41] Sequence to Sequence Learning with Neural Networks
[42] Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
[44] World Health Organization (WHO). 2020. Coronavirus disease 2019 (COVID-19)
[45] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[46] BioWordVec, improving biomedical word embeddings with subword information and MeSH
[47] A Deep Learning Based Chatbot for Campus Psychological Therapy
[48] DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
[49] Correlation Coefficients and Semantic Textual Similarity