DialMed: A Dataset for Dialogue-based Medication Recommendation
Zhenfeng He, Yuqiang Han, Zhenqiu Ouyang, Wei Gao, Hongxu Chen, Guandong Xu, Jian Wu
Date: 2022-02-22

Medication recommendation is a crucial task for intelligent healthcare systems. Previous studies mainly recommend medications with electronic health records (EHRs). However, some details of the interactions between doctors and patients may be left out of EHRs, and these details are essential for automatic medication recommendation. Therefore, we make the first attempt to recommend medications from the conversations between doctors and patients. In this work, we construct DialMed, the first high-quality dataset for the medical dialogue-based medication recommendation task. It contains 11,996 medical dialogues related to 16 common diseases from 3 departments and 70 corresponding common medications. Furthermore, we propose a Dialogue structure and Disease knowledge aware Network (DDN), in which a graph attention network is utilized to model the dialogue structure and a knowledge graph is used to introduce external disease knowledge. Extensive experimental results demonstrate that the proposed method is a promising solution for recommending medications with medical dialogues. The dataset and code are available at https://github.com/Hhhhhhhzf/DialMed.

The outbreak of COVID-19 has challenged healthcare systems and left millions of patients facing delays in diagnosis and treatment. As an essential complement to traditional face-to-face medicine, telemedicine has relieved the therapeutic stress caused by the diversion of medical resources. According to the report of WeDoctor, an online health consultation platform in China, about 1.2 million patients conducted online medical consultations during the COVID-19 pandemic. Telemedicine can increase the availability of medical treatment, reduce healthcare costs, and improve the quality of care. Consequently, it has attracted increasing attention due to its vast application potential. On the basis of telemedicine data, many researchers focus on Medical Dialogue Systems (MDSs), which aim to communicate with patients and give diagnoses. Existing work addresses critical sub-tasks in MDSs, including automatic diagnosis [Wei et al., 2018; Xu et al., 2019], dialogue generation [Lin et al., 2021], and information extraction. Medication recommendation based on dialogues has not yet received much attention, even though it is also an important task to be solved. Our study found that around 31% of online consultations are about what medications the patients should take given their current conditions. Figure 1 shows a typical medication consultation dialogue. The patient first reports the health issues along with some personal information, such as gender and age. The doctor then asks for further information (e.g., symptoms and disease history) about the patient. Finally, the doctor provides medication advice based on the gathered information and clinical experience. Existing studies on medication recommendation are primarily based on EHRs [Zhang et al., 2017; Shang et al., 2019b], which are accumulated through the diagnostic procedures in clinics. However, doctors may omit from EHRs some details of their interactions with patients, and these details are essential for automatic medication recommendation.
Compared with EHRs, medical dialogues contain complex interactions between doctors and patients, carrying richer but noisier information. Hence, medical dialogue-based medication recommendation is a promising yet challenging task. Therefore, in this work, we study this new task, namely dialogue-based medication recommendation. Due to the lack of available datasets, we first construct a high-quality online medical dialogue dataset (DIALMED) for this task. It contains 11,996 consultation dialogues covering 16 diseases from 3 different departments and 70 related common medications. More detailed statistics of the dataset can be found in Section 3. Then, to further advance research on this task, we propose a Dialogue structure and Disease knowledge aware Network (DDN). In DDN, a pre-trained language model is first utilized to extract the semantic information of each utterance of the input dialogue, and a dialogue graph is then constructed to model the structural features; a graph attention network over this graph yields the dialogue embedding. For the input disease, its name is used to query the corresponding entity in the medical knowledge graph CMeKG (http://cmekg.pcl.ac.cn/), and the dialogue embedding is incorporated into a graph attention network over the knowledge graph to obtain a contextual disease embedding. The two embeddings are fused to make the medication prediction. Finally, we conduct extensive experiments to show that the proposed method can effectively recommend medications from medical dialogues. Our contributions can be summarized as follows: 1) We construct the first high-quality human-annotated dialogue dataset for the dialogue-based medication recommendation task. 2) We propose a novel medication recommendation framework utilizing both dialogue structure and external disease knowledge. 3) We conduct extensive experiments to demonstrate that DDN can effectively extract the essential information to make medication recommendations accurately.

Medication Recommendation. Existing medication recommendation methods are mainly based on EHRs and can be categorized into instance-based and longitudinal-based methods [Shang et al., 2019b]. Instance-based methods rely on the current health conditions extracted from the most recent visit [Zhang et al., 2017; Wang et al., 2019a]. For example, [Zhang et al., 2017] proposed a multi-instance multi-label learning framework to predict medication combinations based on a patient's current diagnoses. Longitudinal-based methods leverage the temporal dependencies among clinical events [Choi et al., 2016; Le et al., 2018; Shang et al., 2019b; Shang et al., 2019a; He et al., 2020; Wang et al., 2021]. Among them, [Shang et al., 2019a] combined the power of graph neural networks and BERT for medication recommendation, and another line of work proposed a drug-drug interaction (DDI)-controllable drug recommendation model that leverages drugs' molecular structures and models DDIs explicitly. Unlike the work mentioned above, the dialogue-based medication recommendation task is more challenging in practice due to noisy and sparse data. Because of privacy issues, it is difficult to obtain a patient's historical dialogues on online consultation platforms, so we perform medication recommendation solely based on the current medical dialogue.

Graph Neural Networks. Graph neural networks have attracted a lot of attention for processing data with graph structures in various domains [Zhou et al., 2020]. For example, [Kipf and Welling, 2017] proposed graph convolutional networks (GCN).
With the integration of attention mechanisms, graph attention networks (GAT) [Veličković et al., 2018] have become one of the most popular graph neural network methods. Recently, some works have applied GAT to dialogue modeling. For instance, [Chen et al., 2020] used graph attention and recurrent graph attention networks to fully encode dialogue utterances, schema graphs, and previous dialogue states for dialogue state tracking. [Qin et al., 2020] proposed a co-interactive graph attention layer to jointly solve the dialogue act recognition and sentiment classification tasks. In this work, we utilize GAT to model the intra- and inter-speaker correlations and propagate semantic information over the dialogue graph, and we extend GAT to a knowledge graph to introduce external knowledge.

In this section, we introduce the construction details and statistics of DIALMED. Our dataset is collected from a popular Chinese medical consultation website, Chunyu-Doctor. The conversations between doctors and patients contain rich but complex information, mainly related to the patients' current conditions. Both the diagnosed diseases and the symptoms are indispensable for accurate medication recommendation. Considering the complexity of symptom descriptions, we utilize explicit disease labels and implicit symptom information in this paper. Therefore, we annotate the diagnosed diseases and replace the recommended medications with a mask token to keep the original dialogue structure. For the example in Figure 1, we annotate the disease Upper Respiratory Tract Infection and replace the medications Shuanghuanglian Oral Liquid and Pudilan Oral Liquid with the special token [MASK]. The annotation process is as follows. First, we select 16 common diseases and the corresponding common medications from 3 departments (i.e., respiratory, gastroenterology, and dermatology) with the guidance of a doctor. These diseases can be consulted online and have abundant medication consultations. Then, three annotators with relevant medical backgrounds are involved in the annotation. Each dialogue is annotated by at least two annotators and is further adjudicated by a third annotator if there is any inconsistency. The annotation consistency, i.e., the Cohen's kappa coefficient [Fleiss and Cohen, 1973] of the labelled dialogues, is 0.88, which indicates strong agreement between annotators and demonstrates the reliability of our annotation approach. After the annotation, we normalize the diseases and medications to improve the quality of the dataset. We keep the original names of the compound medicines and normalize the non-compound ones. Specifically, we group different brands of drugs that are suitable for the same disease into one cluster and rename them with the common name from the DXY Drugs Database. This reduces the bias caused by doctors' preferences for particular brands, which is more practical in the medication consultation scenario. Refer to Appendix A.2 for more normalization results. The top of Table 1 summarizes the statistics of DIALMED. Medical dialogues and EHRs differ significantly, since the former scenario is similar to an outpatient procedure while the data for the latter often come from intensive care units. For example, in the EHR data from MIMIC-III [Johnson et al., 2016], commonly used in EHR-based medication recommendation, the number of medications is 145, the average number of medications per visit is 8.80, and the average number of diagnoses per visit is 10.51. By contrast, the data in medical dialogues are more sparse and noisy.
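As a reference point for the agreement figure reported above, inter-annotator consistency of this kind can be computed with a standard library. The snippet below is a minimal sketch; the label values are illustrative placeholders, not DIALMED annotations.

```python
# Minimal sketch: Cohen's kappa for two annotators' disease labels.
# The label lists below are illustrative placeholders, not DialMed data.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["URTI", "Gastritis", "URTI", "Eczema", "Gastritis"]
annotator_b = ["URTI", "Gastritis", "Eczema", "Eczema", "Gastritis"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above 0.8 indicate strong agreement
```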
The frequency of medications is shown in Figure 2(a). We can see that the medication frequencies follow a long-tail distribution, even though all of them are common medications for the related diseases. This is because some medications are effective for several diseases, while others are prescribed for only one or two. We also present the frequency of diseases in Figure 2(b), where a similar long-tail distribution is observed, showing that some diseases are more common among patients than others. Refer to Appendix A.1 for corpus comparison details. All these statistics demonstrate that DIALMED reflects real consultation scenarios and is appropriate for dialogue-based medication recommendation; they also show that dialogue-based medication recommendation is a challenging task.

In this section, we first introduce the dialogue-based medication recommendation task, and then describe the proposed DDN in detail. In the online medical dialogue setting, each dialogue consists of a sequence of utterances from the patient and the doctor. Formally, each dialogue can be represented as $D_n = \{u_1, u_2, ..., u_{|D_n|}\}$, where $n \in \{1, 2, ..., N\}$, $N$ denotes the total number of dialogues in the dataset, and $|D_n|$ represents the number of turns in dialogue $D_n$. Each utterance can be represented as $u_i = \{w_i^1, w_i^2, ..., w_i^{|u_i|}\}$, where $w_i^j$ is a word and $|u_i|$ denotes the number of words in $u_i$. We collect all the diseases and medications mentioned in the dataset to construct a disease corpus $\mathcal{S}$ and a medication corpus $\mathcal{M}$. To avoid notation clutter, we hereinafter drop the subscript $n$, as we only consider a single dialogue instance. Formally, given the consultation dialogue $D$ and the diagnosed disease $d$, dialogue-based medication recommendation aims to recommend the potential treatment medications in $\mathcal{M}$, represented as a label vector $y \in \{0, 1\}^{|\mathcal{M}|}$.

The proposed end-to-end framework is presented in Figure 3 and consists of two main parts: (1) a dialogue encoder, which encodes the medical dialogue between patient and doctor by comprehensively capturing the semantic information and the dialogue structure; and (2) a disease encoder, which incorporates external medical knowledge based on the disease information from the dialogue. Dialogues contain two types of important information, i.e., the rich semantic information within utterances and the strong structural correlations between utterances.

Utterance Encoding. Pre-trained language models (e.g., RoBERTa) are utilized to capture the semantic information in utterances. First, the special tokens [CLS] (capturing the utterance representation) and [SEP] (separating different utterances) are inserted at the beginning and end of each utterance token sequence. Then the position embedding of each token in an utterance is calculated. In addition, two types of speaker embeddings (i.e., Doctor and Patient) are introduced to make the model aware of the speaker role of each utterance. The model takes the sum of the token, position, and speaker embeddings as input and outputs the representation of [CLS] as the utterance embedding $h$, so a dialogue $D$ can be represented as $h_D = \{h_1, h_2, ..., h_{|D|}\}$.

Dialogue Structure Representation. In medical conversations, the interactions between doctors and patients are relatively complicated. For example, in Figure 3, the doctor asks two questions in $u_2$ and $u_3$, and the patient gives the answers in $u_4$, so the questions and answers are not adjacent. This makes the structure essential for understanding dialogues; simply combining utterances may result in information loss or misunderstanding.
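Before turning to the structural modelling, the utterance-level encoding described above can be illustrated with a minimal sketch. It assumes a HuggingFace-style Chinese RoBERTa checkpoint and one simple way of injecting a learned speaker embedding; the authors' exact implementation may differ.

```python
# Minimal sketch of the utterance encoder: RoBERTa [CLS] representation plus a
# learned speaker embedding (Doctor / Patient). The checkpoint name and the way
# speaker information is added are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class UtteranceEncoder(nn.Module):
    def __init__(self, model_name="hfl/chinese-roberta-wwm-ext"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.speaker_emb = nn.Embedding(2, hidden)  # 0 = patient, 1 = doctor

    def forward(self, utterance: str, speaker_id: int) -> torch.Tensor:
        enc = self.tokenizer(utterance, return_tensors="pt", truncation=True)
        # Sum the token embeddings with the speaker embedding; position embeddings
        # are added inside the encoder as usual.
        tok_emb = self.encoder.embeddings.word_embeddings(enc["input_ids"])
        spk = self.speaker_emb(torch.tensor([speaker_id]))
        inputs_embeds = tok_emb + spk.unsqueeze(1)
        out = self.encoder(inputs_embeds=inputs_embeds,
                           attention_mask=enc["attention_mask"])
        return out.last_hidden_state[:, 0]  # [CLS] vector as the utterance embedding

# h_D for a dialogue is then the stack of per-utterance [CLS] vectors:
# h_D = torch.cat([encoder(u, s) for u, s in zip(utterances, speakers)], dim=0)
```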
Therefore, we propose to model each dialogue as an undirected graph $G_D$, where each utterance is represented as a vertex. We define consecutive utterances of the same speaker as a block; for example, $u_2$ and $u_3$ constitute a block, and $u_4$ forms another block on its own. The edges are then defined as follows: 1) Within a block, each utterance is connected to all the other utterances in that block. This represents the intra-speaker correlation and ensures information flow from the same speaker within a local context. 2) Between two adjacent blocks, each utterance in one block is connected to every utterance in the other block. This represents the inter-speaker correlation and ensures information flow between the two speakers within a local context.

Dialogue Encoding. GAT is employed to automatically aggregate semantic and structural features on the dialogue graph. In particular, the $l$-th layer representation of a vertex can be computed as
$$h_i^{(l)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W_h h_j^{(l-1)}\Big),$$
where $\mathcal{N}_i$ is the set of first-order neighbors of vertex $i$, $W_h \in \mathbb{R}^{d_l \times d_{l-1}}$ is a trainable weight matrix, and $\sigma$ is a nonlinear activation function. The weight $\alpha_{ij}$, which determines the relatedness between two vertices, is calculated following [Veličković et al., 2018]:
$$\alpha_{ij} = \frac{\exp\big(\sigma\big(a^{\top}[W_h h_i \,\|\, W_h h_j]\big)\big)}{\sum_{k \in \mathcal{N}_i}\exp\big(\sigma\big(a^{\top}[W_h h_i \,\|\, W_h h_k]\big)\big)},$$
where $a \in \mathbb{R}^{2d_l}$ is a trainable weight vector, $\|$ denotes concatenation, and $\sigma$ is the LeakyReLU activation function. Considering that the dialogue graph is relatively small (about 11 nodes on average according to the data statistics), we do not use multi-head attention as in the original GAT, since it may make the final node embeddings too smooth. Finally, we apply mean pooling over the node embeddings to obtain the dialogue representation $h_D$.

Disease knowledge is crucial for delivering accurate medication recommendations. In this paper, we incorporate knowledge from CMeKG, a high-quality Chinese medical knowledge graph. TransR [Wang et al., 2019b] is utilized to obtain the initial entity embeddings. Given a disease $d$, we first identify the corresponding entity in CMeKG and then use GAT to obtain its embedding under the dialogue context. Here, the attention score $\alpha_{ij}$ fuses the entity, relation, and dialogue information: the embeddings $e_i$ and $e_j$ of nodes $i$ and $j$, the embedding $\phi$ of their relation, and the dialogue representation $h_D$ are transformed by the learnable weights $W$, $W_r$, and $W_D$ (for node, relation, and dialogue embeddings, respectively), combined, passed through the LeakyReLU function $\sigma$, and normalized over the neighbors of $i$. The $l$-th layer disease embedding is then obtained by aggregating the neighboring entity embeddings with these attention weights, analogous to the dialogue graph encoding above. The final contextual disease embedding is denoted as $s_d$.

The dialogue representation $h_D$ and the disease representation $s_d$ are fused to make the prediction. In this work, we concatenate them and feed the result into a decoder to make the medication prediction:
$$\hat{y} = \sigma\big(W_o\,[h_D \,\|\, s_d] + b_o\big),$$
where $W_o \in \mathbb{R}^{|\mathcal{M}| \times 2d}$ and $b_o \in \mathbb{R}^{|\mathcal{M}|}$ are the trainable weight matrix and bias of the decoder, and $\sigma$ is the sigmoid activation function. We keep all candidates whose predicted probability exceeds the threshold of 0.5 as the recommended treatment medication combination. Since medication combination recommendation is treated as a multi-label classification task [Shang et al., 2019b], we utilize the binary cross-entropy loss as the objective function:
$$\mathcal{L} = -\sum_{i=1}^{|\mathcal{D}|}\sum_{j=1}^{|\mathcal{M}|}\Big[y_j^{(i)}\log \hat{y}_j^{(i)} + \big(1 - y_j^{(i)}\big)\log\big(1 - \hat{y}_j^{(i)}\big)\Big],$$
where $|\mathcal{D}|$ is the number of dialogues in the training set and $|\mathcal{M}|$ is the number of medications. $y_j^{(i)}$ is the ground-truth label, which equals 1 if medication $j$ is prescribed by the doctor in dialogue $i$ and 0 otherwise, and $\hat{y}_j^{(i)}$ is the probability of recommending medication $j$ for dialogue $i$ predicted by our model.

Dataset. In our experiments, we divide the data into train/development/test dialogue sets as shown in Table 1.
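As a concrete illustration of the block-based dialogue graph and the single-head attention aggregation described above, the following PyTorch sketch builds the edges and performs one round of attention-weighted neighbor aggregation. It is a simplified rendering of the idea under stated assumptions, not the authors' implementation.

```python
# Sketch: block-based dialogue graph construction and one single-head
# GAT-style aggregation step. Simplified for illustration; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_dialogue_edges(speakers):
    """speakers: list like ['P', 'D', 'D', 'P'] (one entry per utterance).
    Returns directed edge pairs following the two rules in the paper:
    full connections inside a block, and full connections between adjacent blocks."""
    blocks, edges = [], set()
    for i, s in enumerate(speakers):
        if blocks and speakers[blocks[-1][-1]] == s:
            blocks[-1].append(i)          # same speaker -> extend current block
        else:
            blocks.append([i])            # speaker change -> new block
    for b in blocks:                       # intra-speaker edges
        edges |= {(i, j) for i in b for j in b if i != j}
    for b1, b2 in zip(blocks, blocks[1:]):  # inter-speaker edges (adjacent blocks)
        edges |= {(i, j) for i in b1 for j in b2}
        edges |= {(j, i) for i in b1 for j in b2}
    return sorted(edges)

class SingleHeadGATLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # W_h
        self.a = nn.Linear(2 * d_out, 1, bias=False)  # attention vector a

    def forward(self, h, edges):
        z = self.W(h)                                  # (num_nodes, d_out)
        n = h.size(0)
        scores = torch.full((n, n), float("-inf"))
        for i, j in edges:
            scores[i, j] = F.leaky_relu(self.a(torch.cat([z[i], z[j]]))).squeeze()
        alpha = torch.softmax(scores, dim=1)           # attention over neighbors
        alpha = torch.nan_to_num(alpha)                # guard for isolated nodes
        return torch.relu(alpha @ z)                   # aggregated node embeddings

# Example: utterances u1..u4 with speakers P, D, D, P
# edges = build_dialogue_edges(['P', 'D', 'D', 'P'])
# h_dialogue = SingleHeadGATLayer(768, 256)(h_D, edges).mean(dim=0)  # mean pooling
```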
The average number of medications per dialogue is approximately the same across the splits, as are the average lengths of utterances and dialogues, indicating that the data distribution is relatively consistent among the three sets.

Implementation Details. The pre-trained model we use is the Chinese RoBERTa-base model. The learning rate and the batch size are set to $2 \times 10^{-5}$ and 8, respectively, and the Adam optimizer is utilized to optimize the model. All methods are implemented and trained using PyTorch on GeForce RTX 3090 GPUs. The reported results are the mean of five training runs.

Baselines. Since there are no standard baselines for this task, we implement several methods, including a statistics-based method (TF-IDF), LSTM-based methods (LSTM-flat and LSTM-hier), and EHR-based methods (RETAIN, HiTANet, and LSAN). RETAIN, HiTANet, and LSAN are strong baselines for EHR-based medication or risk prediction. Among them, LSTM-hier takes the dialogue structure into consideration, and LSAN is modified to incorporate disease knowledge. Refer to Appendix B.1 for more details.

Evaluation Metrics. We adopt two commonly used metrics, namely the Jaccard similarity score and average F1, to evaluate the medication recommendation performance. For both metrics, larger values indicate better performance.

Table 2 shows the performance of all methods in terms of Jaccard and F1 on the four datasets. The results clearly indicate that DDN achieves the best performance among all methods. In particular, DDN improves Jaccard by 24.25%, 23.55%, 14.21%, and 13.32% over the second-best method (i.e., LSAN) on the respective datasets. Furthermore, RETAIN and LSTM-hier outperform LSTM-flat, demonstrating that dialogue structure is important for dialogue understanding, and LSAN outperforms HiTANet, indicating that disease knowledge is also essential for dialogue modeling. Our model DDN considers both factors and achieves the best performance. In addition, it is worth noting that the performance varies across the three departments, which may be attributed to the considerable differences in medication and disease frequencies between departments.

Figure 4 summarizes the contributions of the dialogue graph and the disease knowledge to our model. We notice that removing the dialogue graph (the variant DDN w/o DG) causes a considerable performance decrease in both Jaccard and F1 compared with DDN, especially on the three department-specific datasets. This demonstrates that the dialogue graph structure is critical for extracting medical information in the dialogue-based medication recommendation task. Similarly, removing the knowledge graph module (DDN w/o KG) leads to a similar performance decrease, indicating that disease knowledge improves the medication recommendation performance. This is reasonable and accords with actual medication consultation situations.

Table 3: Statistics of error types on the test set, where P denotes the predicted medication set and T the ground-truth set.
#1 P = ∅: 65 (7.20%)
#2 P ⊂ T and P ≠ ∅: 58 (6.42%)
#3 T ⊂ P: 182 (20.16%)
#4 T ⊄ P, P ⊄ T, and P ∩ T ≠ ∅: 299 (33.11%)
#5 T ⊄ P, P ⊄ T, and P ∩ T = ∅: 299 (33.11%)
Total: 903

To prove the feasibility of dialogue-based medication recommendation, we feed incomplete dialogues to DDN during inference to explore whether the dialogue can provide the necessary medical information. Figure 5 shows the model performance under different proportions of the dialogue. We can see that as the available portion of the dialogue increases, the performance improves, especially within the first 20% and the last 20%. This may be because the first and last parts of a dialogue contain many patient complaints and symptoms that are closely related to the medications.
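For reference, the two metrics used above can be computed per dialogue and then averaged. The sketch below is one straightforward reading of the Jaccard similarity score and sample-averaged F1 for multi-label predictions; the paper's exact averaging may differ.

```python
# Sketch: Jaccard similarity score and average F1 for multi-label medication
# prediction, computed per dialogue and averaged. One plausible reading of the
# metrics; the paper's exact implementation may differ.
def jaccard_and_f1(pred_sets, true_sets):
    jaccards, f1s = [], []
    for pred, true in zip(pred_sets, true_sets):
        pred, true = set(pred), set(true)
        inter = len(pred & true)
        union = len(pred | true)
        jaccards.append(inter / union if union else 1.0)
        precision = inter / len(pred) if pred else 0.0
        recall = inter / len(true) if true else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    n = len(pred_sets)
    return sum(jaccards) / n, sum(f1s) / n

# Example: two dialogues with predicted vs. ground-truth medication sets
preds = [{"Omeprazole", "Mosapride"}, {"Cetirizine"}]
golds = [{"Omeprazole", "Mosapride", "Glutamine"}, {"Loratadine"}]
print(jaccard_and_f1(preds, golds))  # approximately (0.33, 0.40)
```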
The results demonstrate that recommending medications based on medical dialogues is feasible. Although we have carefully designed a model for the task, the results are still not fully satisfactory, so we conduct a detailed analysis of the error cases in the test set. Table 3 summarizes the statistics of the five types of errors we define. We can see that (1) 86.38% of the cases (#3, #4, #5) predict wrong medications, mainly because DDN fails to distinguish medications with similar effects, and (2) 7.20% of the cases predict no labels at all, which can be attributed to these dialogues providing little disease-related information. We further provide a case study to illustrate the superiority of DDN. Figure 6 shows a medical dialogue together with the medications recommended by all baselines and our method. The baselines either miss some medications (e.g., LSTM-flat, RETAIN, HiTANet, LSAN) or give wrong drugs (e.g., TF-IDF, LSTM-hier). DDN takes full account of the duodenitis-related information from the dialogue (e.g., the symptoms in the chief complaint and past medical history) and the external knowledge graph. It recommends Omeprazole (inhibiting gastric acid secretion) and Mosapride (promoting gastric motility), as well as Glutamine, which is omitted by all baselines.
Figure 6: The sample is extracted from the DIALMED test set. "Missed" means the medication is in the gold labels but is not predicted, and the underlined drugs in red represent predicted medications that are not in the ground truth.

In this paper, we studied a new task, namely dialogue-based medication recommendation. First, we presented DIALMED, the first high-quality medical dialogue dataset for this task. We then implemented several baselines and designed a dialogue structure and external disease knowledge aware model. Experimental results show that medication recommendation quality can be enhanced with the help of dialogue structure and external disease knowledge.

The data in DIALMED are publicly collected from Chunyuyisheng, and personal information (e.g., usernames) is removed during preprocessing. The annotation process is as described in Section 3. Furthermore, to ensure the quality of the dataset, we paid the annotators 1 yuan ($0.16 USD) per label. Applications of machine learning in medical treatment inevitably raise ethical concerns, but research on AI in medicine should not be halted by this, since the purpose of such research is to make machines better serve human beings. We have seen many advanced achievements [Lin et al., 2021; Xu et al., 2019; Wei et al., 2018] in this field. For this study, the main ethical concern is that erroneous recommendations may occur in practical applications. However, the impact of individual errors can be reduced by keeping doctors responsible for decisions while machines are used as assistants.

First, diseases and related medications were identified in each dialogue. Second, we selected and annotated the dialogues containing drugs in our medication list. To speed up the tagging process, we built an annotation tool for this task. For each raw medical dialogue, the annotators needed to label the patient's disease and the medications recommended by the doctor. We believe that the context after the doctor recommends the drugs is not informative for drug inference, so, as shown in Figure 7, the context after the recommendation was removed from DIALMED.
Due to the emergence of new medications during the labeling process and the existence of ambiguity in some recommendations, two additional annotation passes were carried out. Next, we describe the processing of diseases and medications.

Disease Processing. With the guidance of a doctor, we select 16 diseases from 3 departments (i.e., respiratory, gastroenterology, and dermatology) for the following reasons: (1) they are common diseases, so research on them has more practical value; (2) they can be consulted online and have abundant medication consultations. As described in Section 3, we normalize the diseases to improve the quality of DIALMED; e.g., chronic gastritis and acute gastritis are mapped to gastritis. Dialogues without explicit disease information, or with diseases outside our scope, were marked as None or Others. For patients with more than one disease, we mark a single disease according to the chief complaint, because patients have only one chief complaint in most diagnostic scenarios.

... hydrochloride, dextromethorphan hydrobromide, and chlorpheniramine maleate. Due to space constraints, more normalization results for diseases and medications can be found in our repository.

• TF-IDF. This is a traditional bag-of-words model for text classification. We view each dialogue as a text and the corresponding medications as labels, and train a classification model based on TF-IDF word features (see the sketch at the end of this appendix).
• LSTM-flat. This is an LSTM-based method. It concatenates all the sentences in a dialogue into one long sequence and feeds it into a BiLSTM to obtain the dialogue embedding for medication prediction.
• LSTM-hier. This is also an LSTM-based method. Different from LSTM-flat, it uses a hierarchical BiLSTM: the words of each utterance are fed into a BiLSTM to obtain the utterance embedding, and the utterance embeddings are then fed into another BiLSTM to obtain the final dialogue embedding. It captures both word-level and utterance-level dependencies.
• RETAIN. This is an RNN-based EHR medication recommendation method that uses a two-level neural attention network to detect influential past visits. In the current scenario, it is used to model the dialogues.
• HiTANet. This is a Transformer-based risk prediction approach for EHRs, which models time information in local and global stages. We adapt this method to model the hidden temporal information in medical dialogues.
• LSAN. This is also a Transformer-based risk prediction approach, which models the hierarchical structure of EHR data. We modify this method to model the hierarchical structure of medical dialogues and add the disease module of DDN to encode the external knowledge.
• DDN. This is our proposed model. It utilizes the dialogue structure and external disease knowledge to enhance dialogue-based medication recommendation performance.

In essence, drug recommendation is a sub-task of automatic diagnosis. At present, most automatic diagnosis work is based on reinforcement learning, which gives the optimal response according to the patient's questions and the current state. Even if the response contains drug information, it is not produced intentionally by the model. In addition, drug recommendation is treated only as a secondary task in many medical dialogue datasets, even though they may contain drug descriptions. As described in Appendix A.2, leaving colloquial or trade names in the dialogues would lead to a large number of meaningless labels.
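As a concrete rendering of the TF-IDF baseline listed above, the following sketch trains a one-vs-rest multi-label classifier over bag-of-words dialogue features. The toy data, tokenization, and classifier choice are illustrative assumptions, not the authors' configuration.

```python
# Sketch of the TF-IDF baseline: each dialogue is treated as one text and the
# prescribed medications as a multi-label target. Tokenization and classifier
# are illustrative choices; for Chinese text a word segmenter (e.g., jieba)
# would normally be applied first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: pre-segmented dialogue texts and medication labels.
dialogues = ["咳嗽 三 天 咽痛 发热", "胃 胀 反酸 上腹 疼痛"]
labels = [["Shuanghuanglian Oral Liquid"], ["Omeprazole", "Mosapride"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(dialogues, y)

pred = model.predict(["胃 反酸 疼痛"])
print(mlb.inverse_transform(pred))
```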
Above all, future intelligent healthcare systems should not only give a disease diagnosis but also make a good treatment plan for patients; DIALMED is a step in this direction. Medical treatment includes a number of steps: registration, examination, image reading, report interpretation, diagnosis, prescription, and so on. AI medicine could help optimize resource allocation and improve efficiency in all aspects of healthcare. To this end, there are two kinds of computer-aided diagnosis systems: image-based diagnosis and text-based diagnosis. Due to the higher barrier of diagnosis from text, current research leans more toward image analysis, and there is still much room for development in text-based diagnosis. Conversations in outpatient clinics are rarely recorded and involve more severe data privacy implications, so dialogue-based drug recommendation is mainly oriented toward telemedicine. A medical dialogue system, acting as a doctor's assistant, could give auxiliary drug suggestions based on the ongoing conversation between the doctor and the patient.

The ratio of patients consulting for medications is calculated with regular expressions. First, 100,000 different medical conversations are randomly sampled from our dialogue corpus. For every dialogue, we apply regular expressions (e.g., "[Ww]hat (medication|drug|medicine) should I (take|eat)") to the utterances spoken by the patient and treat the dialogue as a case of consulting for drugs if any expression matches. The regular expressions are collected based on our observation and understanding of the data; more of them can be found in our repository. The frequencies of all diseases and medications are shown in Figures 8 and 9.

References
Schema-guided multi-domain dialogue state tracking with graph attention neural networks.
The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement.
Attention and memory-augmented networks for dual-view sequential learning.
Semi-supervised classification with graph convolutional networks.
Zhumin Chen, Miao Fan, Jun Ma, and Maarten de Rijke. Semi-supervised variational reasoning for medical dialogue generation.
Enhancing dialogue symptom diagnosis with global attention and symptom graph.
Graph-evolving meta-learning for low-resource medical dialogue generation.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.
SeqMed: Recommending medication combination with sequence generative adversarial nets.
LSAN: Modeling long-term dependencies and short-term correlations with hierarchical attention for risk prediction.
LEAP: Learning to prescribe effective and safe treatment combinations for multimorbidity.
MIE: A medical information extractor towards medical dialogues.
Graph neural networks: A review of methods and applications.

We also compare the collected dataset with medical dialogue datasets for other related tasks in Table 4. We can see that DIALMED has more dialogues and diseases than the other four human-labeled medical dialogue datasets. DIALMED is much closer to the realistic online medication consultation scenario and is more suitable for training medication recommendation models.
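Finally, the regular-expression filtering used in the appendix above to estimate the share of medication consultations can be sketched as follows. The single pattern shown is the illustrative English example from the text; the actual (Chinese) patterns live in the authors' repository.

```python
# Sketch: estimating the share of dialogues in which the patient asks what
# medication to take. The pattern below is the illustrative example from the
# paper; the real patterns are Chinese and kept in the authors' repository.
import re

PATTERNS = [re.compile(r"[Ww]hat (medication|drug|medicine) should I (take|eat)")]

def is_medication_consultation(patient_utterances):
    return any(p.search(u) for u in patient_utterances for p in PATTERNS)

# Hypothetical sampled corpus: each item is the list of patient utterances.
sampled = [
    ["I have had a cough for three days.", "What medicine should I take?"],
    ["My stomach hurts after meals."],
]
ratio = sum(is_medication_consultation(d) for d in sampled) / len(sampled)
print(f"medication-consultation ratio: {ratio:.0%}")  # 50% on this toy sample
```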