key: cord-0265038-ezujwgx8
authors: Luu, Son T.; Bui, Mao Nguyen; Nguyen, Loi Duc; Tran, Khiem Vinh; Nguyen, Kiet Van; Nguyen, Ngan Luu-Thuy
title: Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts
date: 2021-05-04
journal: nan
DOI: 10.1007/978-3-030-88113-9_44
sha: a8485619efcb1236afc9f974368a3f78531df159
doc_id: 265038
cord_uid: ezujwgx8

Machine reading comprehension (MRC) is a sub-field of natural language processing that aims to help computers understand unstructured texts and answer questions about them. In practice, conversation is an essential way to communicate and transfer information. To help machines understand conversational texts, we present UIT-ViCoQA, a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles. We then evaluate several baseline approaches for conversational machine comprehension on the UIT-ViCoQA corpus. The best model obtains an F1 score of 45.27%, which is 30.91 points behind human performance (76.18%), indicating that there is ample room for improvement. Our dataset is available on our website for research purposes: http://nlp.uit.edu.vn/datasets/

Conversation is a standard way for people to communicate, and it plays an important role in daily life. The process of asking a question and responding with an answer conveys helpful information about a specific domain.

Healthcare is a major concern for many people. Healthcare news has a wide readership, and people frequently discuss health and medicine. Thus, based on conversations about healthcare, we constructed a corpus named UIT-ViCoQA for conversational question answering on healthcare texts in Vietnamese. UIT-ViCoQA contains 2,000 conversations and 10,000 questions built from Vietnamese health news articles. The corpus is used to train machines to understand a conversation and to give the right answers to users' questions based on the conversation context. In addition, we implement neural models for conversational question answering on the UIT-ViCoQA corpus, including DrQA [1], GraphFlow [2], FlowQA [8], and SDNet [22], and we evaluate their performance on the dataset.

The main contributions of this paper are a corpus for conversational machine comprehension of healthcare texts in Vietnamese and an evaluation of baseline MRC models on the dataset. The paper is structured as follows. Section 2 reviews the literature on conversational machine comprehension corpora and models. Section 3 gives an overview of the UIT-ViCoQA dataset. Section 4 introduces available state-of-the-art approaches for the conversational machine comprehension task. Section 5 presents our empirical results and an error analysis of question-answering models on the UIT-ViCoQA corpus. Finally, Section 6 concludes our work.

Machine reading comprehension (MRC) is a challenging task in natural language processing (NLP) that requires machines to understand a reading text and answer questions about it [16]. Many MRC corpora have been constructed for English on both specific and open domains, such as SQuAD [16] (extractive MRC) on Wikipedia articles, RACE [11] (multiple-choice MRC) on English exams for high school students, and NarrativeQA [9] (abstractive MRC) on books and stories. For Vietnamese, UIT-ViQuAD [14] (Wikipedia domain) and UIT-ViNewsQA [21] (health news domain) are two extractive MRC corpora. In addition, ViMMRC [13] is a multiple-choice reading comprehension corpus built from Vietnamese primary school textbooks.

Machine reading comprehension applied in question-answering (QA) systems poses a further challenge: MRC models have to understand both the given text and the conversational context before answering the relevant questions. These questions are often paraphrased or contain coreferences, and their answers can be text spans or free-form text. This type of MRC is called Conversational Machine Comprehension (CMC) [7]. CoQA [17] and QuAC [3] are two CMC corpora in English. Following the CoQA work, we constructed UIT-ViCoQA for automated reading comprehension on health news articles in Vietnamese.

According to Gupta et al. [7], there are two approaches to CMC models: attention-based reasoning with sequence models and the FLOW mechanism. DrQA [1] and PGNet [19] are two neural attention-based models that have been applied to the CoQA corpus. SDNet [22] is another attention-based model that combines inter-attention and self-attention to comprehend the conversation context. Finally, FlowQA [8] and GraphFlow [2] are two flow-based models that are used to carry contextual information through sequences.

Our data creation process consists of three phases, as described in Figure 1. In the first phase, we collect health news articles from VnExpress, the most read online newspaper in Vietnam, using Scrapy, a web crawling tool. In the second phase, we build an annotation tool for creating conversational data. The tool allows two annotators to create a conversation based on a given article. Finally, in the third phase, we hire a team of annotators to create data with our annotation tool.

The annotation process works as follows. For each conversation (C), we hire two different annotators, who act as the questioner and the answerer, respectively. The questioner goes first by asking a question (Q), which is then sent to the answerer. After receiving the question, the answerer selects a span of text from the article (S) and then submits the natural answer (A). Next, the annotation system compares the answer given by the answerer with the question asked by the questioner at the character level. If the answer matches the question by about 70%, it is a valid answer, and the two annotators move on to the next turn. Otherwise, the answerer must give another answer. Each article has a total of five question-answer turns.
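The validity check in the annotation loop can be sketched as follows. The text does not name the matching function, so this sketch assumes the similarity ratio of Python's `difflib.SequenceMatcher`; the function names, the lowercasing step, and the exact 0.70 cutoff are illustrative assumptions rather than the annotation tool's actual implementation. The two strings passed in are whichever pair the system compares, as described above.

```python
from difflib import SequenceMatcher

# "about 70%" in the text; the exact cutoff used by the tool is an assumption.
MATCH_THRESHOLD = 0.70


def char_match_ratio(text_a: str, text_b: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric, not the authors')."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()


def is_valid_turn(text_a: str, text_b: str, threshold: float = MATCH_THRESHOLD) -> bool:
    """Return True when the two strings match closely enough to accept the turn."""
    return char_match_ratio(text_a, text_b) >= threshold
```

If `is_valid_turn` returns `False`, the answerer would be asked to submit another answer before the conversation moves to the next of its five turns.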

In the data creation process, we set the following requirements for questioners and answerers: (1) The answers must be extracted from the article; questions that cannot be answered from the article are not allowed. (2) Questioners are encouraged to ask questions using synonyms, antonyms, and coreference. (3) The answers should be short and should limit the use of words that do not appear in the article. Moreover, the answerers need to give complete answers with correct syntax and punctuation.

Table 1: An example of a conversation in the UIT-ViCoQA corpus.

Article: Trạng thái "ngủ" là cách các tế bào ngay lập tức thay đổi để kháng lại phương pháp điều trị. Các phương pháp điều trị ung thư vú thường thành công, tuy nhiên một số trường hợp ung thư tái phát và tiên lượng xấu hơn. Ông Luca Magnani, Khoa Dược, Đại học Hoàng Gia London, Anh, cho biết phương pháp điều trị bằng hormone hiện được sử dụng cho phần lớn bệnh nhân ung thư vú ... (The status of "sleep" is the way the cell changes immediately to resist treatment. The treatment methods for breast cancer are often successful. However, some cases of cancer recur, and the prognosis worsens. Mr. Luca Magnani, Faculty of Medicine, Imperial College London, says that the treatment method using hormones is used for a large share of breast cancer patients ... )

Q1: Phương pháp thường được sử dụng để chữa trị ung thư vú là gì ? (What is the treatment method usually used for breast cancer treatment?)
S1: Ông Luca Magnani, Khoa Dược, Đại học Hoàng Gia London, Anh, cho biết phương pháp điều trị bằng hormone hiện được sử dụng cho phần lớn bệnh nhân ung thư vú . (Mr. Luca Magnani, Faculty of Medicine, Imperial College London, says that the treatment method using hormones is used for a large share of breast cancer patients.)
A1: điều trị bằng hormone (using hormones)

Q2: Các bác sĩ có lo ngại gì về phương pháp này? (What are doctors concerned about regarding this treatment?)
S2: Từ lâu, các nhà khoa học đã đặt câu hỏi, liệu pháp này thực chất có tiêu diệt được các tế bào ung thư vú không, hay chỉ là chuyển các tế bào sang trạng thái "ngủ yên". (Scientists have long questioned whether this therapy actually kills breast cancer cells, or just puts the cells in an "inactive" state.)
A2: nó đưa các tế bào ung thư sang trạng thái "ngủ yên" (This treatment puts the cells in an "inactive" state)

Q3: Vậy những nghiên cứu này có ý nghĩa như thế nào? (What is the benefit of these studies?)
S3: cũng giải thích rằng những phát hiện hiện tại sẽ mở ra lộ trình mới cho việc nghiên cứu chữa trị ung thư. (explaining that the current findings can open up new research on cancer treatments)
A3: mở ra lộ trình mới cho việc nghiên cứu chữa trị ung thư (Opening up new research on cancer treatments)

The UIT-ViCoQA corpus contains 2,000 conversations. Each conversation consists of a reading article and five question-answer pairs. We follow the structure of CoQA [17] for our dataset. In the example in Table 1, to answer question Q2, the answerer needs to read the passage and look back at question Q1 and answer A1 to retrieve the relevant information. Similarly, for question Q3, the answerer needs to read the passage and the two previous question-answer pairs (Q1, A1) and (Q2, A2) to extract the answer A3. The chain of question-answer pairs Q1-A1, Q2-A2 is the history of the conversation.

Table 2 provides an overview of the UIT-ViCoQA corpus and compares it with the CoQA corpus. Although the number of questions and answers in UIT-ViCoQA is lower than in CoQA, the average number of words in the UIT-ViCoQA dataset is larger. This is because interrogative words in English are usually single words (e.g., who, when, and why), while their Vietnamese counterparts may consist of two words. For example, "who" is "ai", "when" is "khi nào", and "why" is "tại sao". Besides, UIT-ViCoQA is constructed on a specific domain; hence it is not as diverse as the CoQA corpus.

In Vietnamese, the process of interaction consists of statements exchanged between two people. Each statement contains two functional elements: the negotiatory, which carries the argument of the statement through the conversation, and the remainder, which holds the rest of the information in the statement [20]. The negotiatory is an essential part of a statement in the conversation. It comprises interrogative particles, element interrogative items, and imperative particles. Interrogatives are characteristic of questions. Table 3 shows the kinds of questions in Vietnamese that are commonly used in daily life; the interrogative words are marked in bold.

According to Table 3, the "What" type accounts for the highest ratio in the UIT-ViCoQA corpus (32.6%). Next, we randomly divide our corpus into training, development, and test sets with proportions of 70%, 15%, and 15%, respectively. Then, we randomly take 100 articles from the development set to analyze and evaluate the corpus; following [17], we call this the analysis set. We segment the texts in the corpus with the Underthesea framework.
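The random 70%/15%/15% split and the 100-article analysis set can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the fixed seed are hypothetical, and the source does not say how the shuffling was actually done.

```python
import random


def split_corpus(conversations, train=0.70, dev=0.15, seed=42):
    """Shuffle and split into (train, dev, test); the seed is an arbitrary assumption."""
    items = list(conversations)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_dev = round(len(items) * dev)
    # Whatever remains after the training and development parts is the test set.
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]


def sample_analysis_set(dev_set, size=100, seed=42):
    """Draw the analysis set at random from the development set, without replacement."""
    return random.Random(seed).sample(list(dev_set), size)
```

With 2,000 conversations, this yields 1,400 for training and 300 each for development and test, so the analysis set covers a third of the development set.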

According to Gupta et al. [7], a Conversational Machine Comprehension (CMC) model answers a question by extracting information not only from the reading text but also from the conversational history. Therefore, the main linguistic phenomena in UIT-ViCoQA concern two relationships: between questions and the reading passage, and between questions and the conversation history. Table 4 displays the linguistic phenomena in the UIT-ViCoQA corpus.

For the relationship between questions and the reading text, there are three types of phenomena: lexical match, paraphrasing, and pragmatic. Lexical match indicates that the question contains the same words as the reading text. Paraphrasing indicates that the question uses synonyms of words in the reading text, and pragmatic means that the question uses words that do not relate to the reading text. The proportions of lexical match, paraphrasing, and pragmatic phenomena in the UIT-ViCoQA corpus are 47.6%, 48.0%, and 4.4%, respectively, as shown in Table 4.

In addition, for the relationship between questions and the conversation history, there are three types of relational phenomena: no coreference, explicit coreference, and implicit coreference. The percentages of no coreference, explicit coreference, and implicit coreference in the UIT-ViCoQA corpus are 73.6%, 20.6%, and 5.8%, respectively, according to Table 4 .

According to Gupta et al. [7], a typical conversational reading comprehension task consists of a reading passage as context (C), a conversation history (H) made up of multiple question-answer pairs, and the generated answers (A). Therefore, this task combines two models: a machine reading comprehension model that encodes the questions and context into vectors in a neural space, and a question-answering model that generates answers and decodes them into natural language.

For the machine reading comprehension model, the Document Reader (DrQA) introduced by Chen et al. [1] is a strong model on various machine reading comprehension corpora such as SQuAD [16], TextWorldsQA [10], and UIT-ViQuAD [14]. The DrQA model consists of two modules: Document Retriever and Document Reader. We use the Document Reader of DrQA to extract the answer spans for the questions.

Besides, in the conversational comprehension task, answers are generated not only from the reading passage but also from the conversation history. The model treats the conversation history as a special context for generating new answers. The SDNet model [22] is a contextual attention-based model that builds on the idea of DrQA and adds a special mechanism to extract the context of the conversation.
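One simple way to expose the conversation history to a span-extraction reader is to prepend the previous question-answer pairs to the current question. The sketch below illustrates only this general idea; it is not the mechanism used inside SDNet, FlowQA, or GraphFlow, and the class name, function name, marker tokens (`<Q>`, `<A>`), and history window are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Conversation:
    context: str  # reading passage (C)
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs


def question_with_history(conv: Conversation, turn_index: int, max_history: int = 2) -> str:
    """Build the reader input for one turn by prepending up to max_history earlier turns."""
    start = max(0, turn_index - max_history)
    parts = [f"<Q> {q} <A> {a}" for q, a in conv.turns[start:turn_index]]
    current_question = conv.turns[turn_index][0]
    parts.append(f"<Q> {current_question}")
    return " ".join(parts)
```

For the third turn of a conversation, the input would contain Q1, A1, Q2, and A2 followed by Q3, which mirrors the history chain described for Table 1.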

Furthermore, the FLOW mechanism enables MRC models to encode the conversation history comprehensively. Hence, this mechanism integrates the latent semantics of the conversation history well. FlowQA [8] and GraphFlow [2] are two flow-based neural models that grasp the conversational history context to generate answers.

We pre-process the data before feeding it to the models with the following steps: (1) removing special characters and stop words, (2) segmenting sentences into words with the Underthesea tool, and (3) transforming the texts into vectors with the Vietnamese fastText word embeddings provided by Grave et al. [6]. The dimension of the fastText word embeddings is 300.
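The three steps can be sketched in plain Python as follows. Several parts are stand-ins and should be read as assumptions: whitespace splitting replaces the Underthesea segmenter, the stop-word list is a tiny hypothetical sample, lowercasing is added for convenience, and `vectors` is any token-to-vector mapping standing in for the fastText embeddings. Stop words are filtered after segmentation here, which is an implementation choice of this sketch.

```python
import re
import unicodedata

# Hypothetical stop-word sample; the source does not name the list it used.
STOP_WORDS = {"là", "và", "của", "thì"}


def clean_text(text: str) -> str:
    """Normalize to NFC, replace non-word characters with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> list:
    """Stand-in for the Underthesea word segmenter: plain whitespace splitting."""
    return text.split()


def preprocess(text: str, stop_words=STOP_WORDS) -> list:
    """Clean, lowercase, segment, and drop stop words."""
    tokens = segment(clean_text(text).lower())
    return [token for token in tokens if token not in stop_words]


def embed(tokens, vectors, dim=300):
    """Look up each token; unknown tokens map to a zero vector (an assumption)."""
    zero = [0.0] * dim
    return [vectors.get(token, zero) for token in tokens]
```

A real pipeline would swap `segment` for the Underthesea tokenizer, which groups multi-syllable Vietnamese words such as "khi nào" into single tokens, something whitespace splitting cannot do.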

We evaluate the performance of the models by comparing the generated answers with the reference answers using the F1-score and the Exact Match (EM) score. The F1-score measures the overlap between the predicted answers and the correct answers. The EM score measures whether the predicted answers exactly match the original answers [16].
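A minimal sketch of the two metrics is given below, assuming the token-overlap formulation commonly used for SQuAD-style evaluation. The normalization here is only lowercasing and whitespace tokenization; the exact normalization used in the experiments is not stated in the text, so this part is an assumption.

```python
from collections import Counter


def _tokens(text: str) -> list:
    # Assumed normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(_tokens(prediction) == _tokens(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = _tokens(prediction)
    ref_tokens = _tokens(reference)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference plus extra words scores 0 on EM but can still score well on F1, which is one way a large gap between the two metrics arises.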

The FLOW models give promising results on the UIT-ViCoQA corpus. According to Table 5, FlowQA obtains the highest F1-score on both the development and test sets, while SDNet gives the highest EM score. However, there is a large gap between the F1 and EM scores, as well as between the performance of the CMC models and human performance.

Table 6 shows the answers predicted by the four models: DrQA, SDNet, FlowQA, and GraphFlow. In general, FlowQA and GraphFlow give the answers closest to the original answer. For example, for question Q3, "What does the enterprise think about?", the reader needs to look back at the previous question-answer pairs Q1-A1 and Q2-A2 to infer the context of the "affected cases of COVID-19" (Q1) and the "details of the affected cases" (Q2). GraphFlow and FlowQA offer more relevant answers than DrQA for question Q3. For question Q5, GraphFlow provides the most relevant answer about the person mentioned in the reading passage, while the other models give answers containing redundant information compared with the original answer.

For question Q4, none of the four models gives the exact answer. This is due to the ambiguity of Vietnamese interrogative words, which can be written in a genuine or a non-genuine form. For example, question Q2, "Cụ thể?", can be understood as "What is the detail?" or "How did it happen?". Likewise, question Q4, "Nguy cơ là gì?", can be understood as "What is the risk?" or "How bad is the risk?". This is known as MOOD in Vietnamese. The interrogative clause in Vietnamese consists of two main elements: the negotiatory and the remainder. The negotiatory carries the core of the interaction. This aspect of Vietnamese interrogatives is described in detail by Thai [20].

In addition, we study the ability of the models to retrieve correct answers by question type on the development set. Figure 2 shows the ratio of correct answers by question type in the UIT-ViCoQA corpus. A predicted answer is counted as correct if its F1-score is greater than 70%. According to Figure 2, the question type "What" has the highest ratio, at 35.12%.
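The per-type analysis can be sketched as below. The text does not define the ratio precisely, so this sketch assumes it is the share of all correct answers that belong to each question type; the function name and the input format, a list of (question type, F1) pairs, are hypothetical.

```python
from collections import Counter

CORRECT_F1_THRESHOLD = 0.70  # an answer counts as correct when its F1 exceeds this value


def correct_answer_share(records, threshold=CORRECT_F1_THRESHOLD):
    """Return, per question type, its share of all correct answers.

    records: iterable of (question_type, f1) pairs, with f1 in [0, 1].
    """
    correct = Counter(qtype for qtype, f1 in records if f1 > threshold)
    total = sum(correct.values())
    if total == 0:
        return {}
    return {qtype: count / total for qtype, count in correct.items()}
```

Under this reading, a type's share of correct answers can be compared directly with its share of all questions in the corpus.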

Besides, the question type "What" accounts for 32.6% of the corpus, as described in Table 3. Therefore, the models mostly give correct answers to this kind of question. Furthermore, the question types "How many" and "Who" also have a high ratio. Finally, we analyze the predicted answers on the development set. According to Table 7, the models give three types of answers, and most predicted answers are of the free-form type, which accounts for 59.93%. This explains the considerable difference between the F1 and EM scores described in Table 5.

In general, most prediction errors are due to the limited number of questions and variety of answers, as well as to linguistic phenomena. Therefore, it is necessary to increase the number of questions and question types and to enrich the answers to make the corpus more diverse.

In this paper, we present a dataset for conversational machine reading comprehension on healthcare texts in Vietnamese. The dataset includes 2,000 health articles with 10,000 questions. We also conduct experiments on several baseline models, and the best F1-score is 45.27%. Nevertheless, the difference between the F1 and EM scores is large. This is due to linguistic phenomena related to Vietnamese interrogative particles and to the limited variety of answers. Therefore, further research should increase the number of questions and answers and make them more diverse. Besides, enabling CMC models to capture and understand the contextual meaning of the conversation history remains a challenging task in conversational machine reading comprehension research.

In the future, we plan to increase the quantity and quality of the UIT-ViCoQA corpus and to conduct further experiments with deep learning and transfer learning using pre-trained language models [4, 5, 12, 18] to enhance the performance of CMC models on the corpus. Inspired by the conversational question answering system [15], we suggest using this model and UIT-ViCoQA to build Vietnamese conversational question answering systems.

[1] Reading Wikipedia to answer open-domain questions

[2] GraphFlow: Exploiting conversation flow with graph neural networks for conversational machine comprehension

[3] QuAC: Question answering in context

[4] Unsupervised cross-lingual representation learning at scale

[5] BERT: Pre-training of deep bidirectional transformers for language understanding

[6] Learning word vectors for 157 languages

[7] Conversational machine comprehension: A literature review

[8] FlowQA: Grasping flow in history for conversational machine comprehension

[9] The NarrativeQA reading comprehension challenge

[10] Multi-relational question answering from narratives: Machine reading and reasoning in simulated worlds

[11] RACE: Large-scale ReAding comprehension dataset from examinations

[12] PhoBERT: Pre-trained language models for Vietnamese

[13] Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension

[14] A Vietnamese dataset for evaluating machine reading comprehension

[15] Open-retrieval conversational question answering

[16] SQuAD: 100,000+ questions for machine comprehension of text

[17] CoQA: A conversational question answering challenge

[18] A primer in BERTology: What we know about how BERT works

[19] Get to the point: Summarization with pointer-generator networks

[20] Metafunctional profile of the grammar of Vietnamese. Language typology: A functional perspective 253

[21] New Vietnamese corpus for machine reading comprehension of health news articles

[22] SDNet: Contextualized attention-based deep network for conversational question answering

Acknowledgements We would like to express our thanks to reviewers for their valuable comments to help improve our work. Besides, we would like to thank our annotators for their cooperation.