key: cord-0109520-83ky1rdw
authors: Yan, Guojun; Pei, Jiahuan; Ren, Pengjie; Ren, Zhaochun; Xin, Xin; Liang, Huasheng; Rijke, Maarten de; Chen, Zhumin
title: ReMeDi: Resources for Multi-domain, Multi-service, Medical Dialogues
date: 2021-09-01
journal: nan
DOI: nan
sha: 03229d15414f82417d1862f69a835e560611977e
doc_id: 109520
cord_uid: 83ky1rdw

Medical dialogue systems (MDSs) aim to assist doctors and patients with a range of professional medical services, i.e., diagnosis, treatment and consultation. The development of MDSs is hindered by a lack of resources. In particular: (1) there is no dataset with large-scale medical dialogues that covers multiple medical services and contains fine-grained medical labels (i.e., intents, actions, slots, values), and (2) there is no set of established benchmarks for MDSs for multi-domain, multi-service medical dialogues. In this paper, we present ReMeDi, a set of resources for medical dialogues. ReMeDi consists of two parts, the ReMeDi dataset and the ReMeDi benchmarks. The ReMeDi dataset contains 96,965 conversations between doctors and patients, including 1,557 conversations with fine-grained labels. It covers 843 types of diseases, 5,228 medical entities, and 3 types of medical services across 40 domains. To the best of our knowledge, the ReMeDi dataset is the only medical dialogue dataset that covers multiple domains and services and has fine-grained medical labels. The second part of the ReMeDi resources consists of a set of state-of-the-art models for (medical) dialogue generation. The ReMeDi benchmark has the following methods: (1) pretrained models (i.e., BERT-WWM, BERT-MED, GPT2, and MT5) trained, validated, and tested on the ReMeDi dataset, and (2) a self-supervised contrastive learning (SCL) method to expand the ReMeDi dataset and enhance the training of the state-of-the-art pretrained models. We describe the creation of the ReMeDi dataset and the ReMeDi benchmarking methods, and establish experimental results using the ReMeDi benchmarking methods on the ReMeDi dataset for future research to compare against. With this paper, we share the dataset, implementations of the benchmarks, and evaluation scripts.

Medical research with AI-based techniques is growing rapidly [10, 23, 66, 79]. Medical dialogue systems (MDSs) promise to increase access to healthcare services and to reduce medical costs [32, 76, 78]. MDSs are more challenging than common task-oriented dialogue systems (TDSs) for, e.g., ticket or restaurant booking [34, 51, 70], in that they require a great deal of expertise. For instance, they involve many more professional terms, which are often expressed in colloquial language [62]. Recently, extensive efforts have been made towards building data for MDS research [35, 62]. Despite these important advances, limitations persist: (1) Currently available datasets lack complete diagnosis and treatment procedures. A practical medical dialogue is usually a combination of consultation, diagnosis and treatment, as shown in Figure 1. To the best of our knowledge, no previous study considers all three medical services simultaneously [40, 68, 74, 76]. (2) In currently available datasets, labels are not comprehensive enough. Most datasets only provide the slot-value pairs for each utterance. Intent labels and medical knowledge triples related to each utterance are rarely provided. For example, there is one utterance in [80]: "Patient: Doctor, could you please tell me is it premature beat?"
It only has the slot-value label "Symptom: Cardiopalmus", without the intent label "Inquire" and the required knowledge triple relating "premature beat" to "cardiopalmus". (3) In currently available datasets, labels are not fine-grained enough. Composite utterances, which contain more than one intent/action, are common in practice. For example, for the third utterance in Figure 1, the patient says "Ten days. Yes. What is the disease?", which contains three kinds of intents: informing time, informing symptom status, and inquiring about diseases. Previous studies usually provide a single coarse-grained label for the whole composite utterance, which might mislead the training of models and/or lead to inaccurate evaluation. In addition, we find that the values defined in previous work can hardly convey complex information accurately. Instead, we provide main-subordinate values, each of which includes a main value and a subordinate value.

Figure 1: A practical medical dialogue involving diagnosis, treatment and consultation; the three services are interdependent. Combined with the knowledge triple in the upper right corner, we can better infer the related diseases. The lower right part is our annotation example, including intent/action, slot and value.

For example, for the labeling "Value=duration, ten days" of the second user utterance in Figure 1, the main value is "duration" and the subordinate value is "ten days". Main-subordinate values have a stronger capacity to convey complex information: (a) the negation status of an entity, e.g., not experiencing the symptom sore throat; (b) the specific value of an entity, e.g., a specific blood pressure reading; (c) the relationship between entities, e.g., the side effect of a medicine. (4) Besides the limitations above, some datasets only involve a limited number of medical entities. For example, MedDG [78], a very recent medical dialogue dataset, only contains 12 diseases. To address the lack of a suitable dataset, our first contribution in this paper is the introduction of the resources for medical dialogues (ReMeDi) dataset. The ReMeDi dataset has the following features: (1) medical dialogues for consultation, diagnosis and treatment, as well as their mixture; (2) comprehensive and fine-grained labels, e.g., intent-slot-value triples for sub-utterances; and (3) 843 diseases, 20 slots, and 5,228 medical entities are covered. Moreover, we ground the dialogues with medical knowledge triples by mapping utterances to medical entities. Our second contribution in this paper is a set of medical dialogue models for benchmarking against the ReMeDi dataset. Recent work considers MDSs as a kind of TDS [35, 68, 74] by decomposing an MDS into well-known sub-tasks, e.g., natural language understanding (NLU) [62], dialogue policy learning (DPL) [68], and natural language generation (NLG). There is, however, no comprehensive analysis of performance when all of the above tasks are addressed and/or evaluated simultaneously. To establish a shared benchmark that addresses all three NLU, DPL, and NLG tasks in an MDS setting, we adopt causal language modeling, use several pre-trained language models (i.e., BERT-WWM, BERT-MED, MT5 and GPT2), and fine-tune them on the ReMeDi dataset. In addition, we provide a pseudo labeling algorithm and a natural perturbation method to expand the proposed dataset, and enhance the training of state-of-the-art pretrained models using self-supervised contrastive learning.
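To make the fine-grained label format concrete, the sketch below shows how the composite utterance from Figure 1 could be represented as sub-utterance-level intent-slot-value triples with main-subordinate values. The field names and the "Time" slot are illustrative assumptions, not the released schema.

```python
# A minimal, hypothetical sketch of the sub-utterance-level annotation for
# "Ten days. Yes. What is the disease?" from Figure 1. Each sub-utterance
# receives its own intent-slot-value triple, and a value can carry a main
# part and a subordinate part (e.g., "duration" / "ten days").
composite_utterance_labels = [
    {"sub_utterance": "Ten days.",
     "intent": "Informing",
     "slot": "Time",                      # illustrative slot name
     "value": {"main": "duration", "subordinate": "ten days"}},
    {"sub_utterance": "Yes.",
     "intent": "Informing",
     "slot": "Symptom",
     "value": {"main": "status", "subordinate": "confirmed"}},
    {"sub_utterance": "What is the disease?",
     "intent": "Inquiring",
     "slot": "Disease",
     "value": None},                      # a question carries no value
]
```

A single coarse-grained label for the whole utterance would collapse these three distinct intents into one, which is exactly the problem fine-grained annotation avoids.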
In the remainder of the paper, we detail the construction of the ReMeDi dataset and the definition of the ReMeDi benchmarks, and evaluate the ReMeDi benchmarks against the ReMeDi dataset on the NLU, DPL, and NLG tasks, thereby establishing a rich set of resources for medical dialogue systems to facilitate future research. Details on obtaining the resources are included in the appendix of this paper. We survey related work in terms of datasets, models and contrastive learning. Most medical dialogue datasets contain only one domain, e.g., Pediatrics [35, 68, 74], COVID-19 [76], Cardiology [80], Gastroenterology [40], and/or one medical service, e.g., Diagnosis [37, 38], Consultation [62]. However, in a complete medical aid procedure, contextual information from other services and/or domains is often needed but overlooked. For example, in Figure 1, the symptom "sore throat" mentioned in the diagnosis service has a long-term effect on the suggestion "talk less" in the follow-up consultation service. To this end, we provide medical dialogues for consultation, diagnosis and treatment, as well as their mixture, in the ReMeDi dataset. Although a few datasets [32, 78] contain multiple medical services in multiple domains, they target NLG only, without considering NLU and DPL. In contrast, ReMeDi contains the necessary labels for NLU, DPL and NLG. Another challenge of existing datasets is the medical label insufficiency problem. The majority of datasets only provide a small set of medical labels for slots or actions, e.g., a single slot [14, 38, 62] or single values [32, 37, 40]. Moreover, their labels are too coarse to distinguish multiple intents or actions in one utterance. Unlike all datasets above, our dataset provides comprehensive and fine-grained intent/action labels for the constituents of an utterance. To sum up, ReMeDi is the first multi-domain, multi-service medical dialogue dataset with fine-grained medical labels and large-scale entities, and it compares favorably with the datasets mentioned above in terms of 9 aspects (i.e., domain, service, task, intent, slot, action, entity, disease, dialogue). A summary can be found in Table 1. Large language models have achieved state-of-the-art performance for TDSs [1, 81]. BERT [13] is widely used as a benchmark for TDSs [42, 82] and has been shown to be effective for understanding and tracking dialogue states [5, 27]. In terms of dialogue generation, BERT is usually used in a selective way (e.g., TOD-BERT [71]). The GPT family of language models [53, 54] serves as a competitive and common benchmark in recent work on TDSs [3, 21, 77]. GPT is a promising backbone of recent research on generating responses [4, 43, 71] and actions [33]. MT5 [55] is the current benchmark for TDSs, because it inherits T5 [56]'s powerful text generation capabilities and supports multilingual settings [39, 41, 83]. Large neural language models are data-hungry, and data acquisition for TDSs is expensive [57]. An effective method for alleviating this issue is contrastive learning (CL). CL contrasts similar and dissimilar examples by sampling positive/negative data pairs and defining contrastive training objectives [24, 28]. Most studies work on re-optimizing the representation space based on contrastive word [20, 22, 44, 50] or sentence [16, 18, 26, 61, 67] pairs. Some also focus on sampling negative data pairs [29, 64, 65, 73].
Research has also explored different contrastive training objectives based on single [11, 19] or multiple [9, 60, 63] positive and negative pairs, along with their complex relations [17, 46, 59]. In this work, we share several pretrained language models based on BERT, GPT2, and MT5 as benchmarks for ReMeDi. To alleviate the data-hunger problem, we enhance the pretrained language models with contrastive learning. Similar to TDSs [7], an MDS can be divided into several sub-tasks, e.g., NLU, DPL, and NLG. NLU aims to understand user utterances through intent detection [68] and slot filling [8, 52, 69]. Du et al. [14, 15] formulate NLU as a sequence labeling task and use a Bi-LSTM to capture contextual representations for filling entities and their relations into slots. Lin et al. [38] improve entity filling with global attention and a symptom graph. Shi et al. [62] propose a label-embedding attentive multi-label classifier and improve the model with weak supervision from responses. Dialogue state tracking (DST) tracks changes in user intent [45]. Zhang et al. [80] employ a deep matching network, which uses a matching-aggregate module to model turn-level interactions among utterances encoded by a Bi-LSTM. In this work, we integrate DST into vanilla NLU to generate intents and updated slot values simultaneously. DPL decides system actions given a set of slot-value dialogue states and/or a dialogue context [7]. Wei et al. [68] first adopt reinforcement learning (RL) to extract symptoms as actions for disease diagnosis. Xu et al. [74] apply a deep Q-network based on a medical knowledge graph to track topic transitions. Xia et al. [72] improve RL-based DPL using generative adversarial learning with regularized mutual information. Liao et al. [35] use a hierarchical RL model to alleviate the large action-space problem. In contrast, we generate system actions as sequences of general tokens, fully avoiding the action-space exploration required by these RL models. NLG generates system responses given the outputs from NLU and DPL [48]. Yang et al. [76] apply several pretrained language models (i.e., Transformer, GPT, and BERT-GPT) to generate doctors' responses for COVID-19 medical services. Liu et al. [40] provide several NLG baselines based on sequence-to-sequence models (i.e., Seq2Seq, HRED) and pretrained language models (i.e., GPT2 and DialoGPT). Li et al. [31] use pretrained language models to predict entities and generate responses. Recently, meta-learning [37] and semi-supervised variational Bayesian inference [32] have been adopted for low-resource medical response generation. The ReMeDi dataset is built following the pipeline shown in Figure 2: (1) We collect raw medical dialogues and a knowledge base from online websites; (2) We clean the dialogues with a set of reasonable rules, and sample dialogues by considering the proportions of disease categories; (3) We define annotation guidelines and incrementally improve them based on feedback from dry-run annotation until standard annotation guidelines are agreed upon by the annotators; (4) We conduct human annotation following the standard annotation guidelines. Note that we provide two versions of the dataset: a labeled ReMeDi-base (1,557 dialogues) and an unlabeled ReMeDi-large (95,408 dialogues). The former is for evaluating the performance of the benchmark models and the latter is for improving the training of large models (see details in §4.4.3). We collect 95,408 natural multi-turn conversations between doctors and patients from ChunYuYiSheng, 1 a Chinese online medical community.
All information from the website is open to the public and has been processed with ethical considerations by the website, e.g., sensitive information about patients, such as names, has been anonymized. To further ensure data privacy, we anonymize additional potentially sensitive information, e.g., the names of doctors, the addresses of hospitals, etc. These raw dialogues constitute a large-scale unlabeled dataset, called ReMeDi-large. It covers 40 domains (e.g., pediatrics), 3 services (i.e., diagnosis, consultation, and treatment), 843 diseases (e.g., upper respiratory tract infection), and 5,228 medical entities. We crawled 2.6M medical triplets from CMeKG2.0, 2 a Chinese medical knowledge base. For example, the triplet (paracetamol, relieves, headache) denotes that paracetamol can relieve headache. The entities involve about 901 diseases, 920 drugs, 688 symptoms, and 200 diagnosis and treatment technologies. The number of relation types is 125. We conduct the following steps to obtain a set of dialogues for human annotation: (1) Filtering out noisy dialogues. First, we filter out short dialogues with fewer than 8 utterances, because such short dialogues usually do not contain much information. Next, we filter out dialogues containing images or audio and keep only dialogues with textual utterances. Finally, we filter out dialogues in which too few medical entities from the crawled knowledge triplet set appear. (2) Anonymizing sensitive information. We use special tokens to replace sensitive information in the raw dialogues, e.g., "[HOSPITAL]" is used to anonymize the specific name of a hospital. (3) Sampling dialogues by disease categories. In order to balance the distribution of diseases, we extract the same proportion of dialogues from each disease to form ReMeDi-base for annotation. We hire 15 annotators with relevant medical backgrounds to carry out the annotation. We define 5 intents, 7 actions, and 20 slots, and design a set of preliminary annotation guidelines. First, each annotator is asked to annotate 5 dialogues and then to report unreasonable, confusing, or ambiguous guidelines together with the corresponding utterances. Second, we summarize the confusing issues and improve the guidelines until high agreement among annotators is reached. We repeat the above two steps for three rounds and obtain a set of standard annotation guidelines. To make the annotation more convenient, we build a web-based labeling system similar to [58], which is available online. 3 In the system, each annotator is assigned 5 dialogues per round and is asked to label all utterances following the standard annotation guidelines. To assure annotation quality, we provide: (1) Detailed guidelines. For each data sample, we introduce the format of the data, the specific labeling task, examples of the various types of labels, and detailed system operations. (2) A real-time feedback paradigm. We maintain a shared document to track problems and solutions in real time. All annotators can raise questions in it; dialogues with ambiguous labels are discussed with experts, who make the final decision. (3) A semi-automatic quality judgment paradigm. We adopt a rule-based quality judgment model to assist annotators in re-labeling suspect annotations. (4) An entity standardization paradigm. We use the Levenshtein distance ratio [30] to compute the similarity between an annotation and an entity in the medical knowledge triplet set. If the maximum similarity score is in [0.9, 1], we ask the annotator to replace the annotation with the standard entity from the medical knowledge triplet set.
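As an illustration of this entity standardization check, here is a minimal sketch that assumes a plain Levenshtein-distance-based similarity ratio; the function names and the knowledge-base interface are hypothetical, while the [0.9, 1] threshold follows the text above.

```python
# A minimal sketch of the entity-standardization check: compute a
# Levenshtein-based similarity ratio between an annotation and each entity
# in the knowledge triplet set, and suggest the closest standard entity
# when the ratio falls in [0.9, 1].

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def suggest_standard_entity(annotation: str, kb_entities: list) -> str:
    """Return the KB entity the annotator should substitute, if any."""
    best = max(kb_entities, key=lambda e: similarity(annotation, e))
    return best if 0.9 <= similarity(annotation, best) <= 1.0 else None
```

A ratio of 1.0 means the annotation already matches a standard entity exactly; scores just below 1.0 typically indicate typos or colloquial variants that the annotator is asked to normalize.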
Table 2 shows the data statistics. ReMeDi contains 95,408 unlabeled dialogues and 1,557 dialogues with sub-utterance-level semantic labels in the format of intent-slot-value or action-slot-value. ReMeDi-large is used for training, and ReMeDi-base is randomly divided into 657/100/800 dialogues for fine-tuning, validation, and testing, respectively. ReMeDi-large has 40 domains and 843 diseases. ReMeDi-base has 30 domains and 491 diseases. In ReMeDi, about 70% of the dialogues involve multiple services. Figure 3 shows the number of utterances in ReMeDi-base distributed over the different types of intents/actions and slots. In the left chart, there are 5 patient intents (i.e., "Informing", "Inquiring", "Chitchat", "QA" and "Others") and 7 doctor actions (the 5 intent types plus "Recommendation" and "Diagnosis"). These cover 25,446 utterances in total, and an utterance might contain multiple intents/actions. "Informing" accounts for the largest proportion (63%), while "Diagnosis" takes up the smallest (2%). This shows that patients have a huge demand for online medical consultations, while doctors are very cautious about making online diagnoses. The right chart contains 20 types of slots covering 4,825 entities in total. "Symptom" (19%) has the largest proportion of entities, followed by "Medicine" (19%), "Treatment" (10%) and "Disease" (10%). In addition, 16% of the labels have a subordinate value. In this section, we unify all tasks as a context-to-text generation task (§4.1). Then we introduce two types of benchmarks, i.e., a causal language model (§4.2) and a conditional causal language model (§4.3). Last, we introduce how to enhance the models with CL to build the state-of-the-art benchmarks (§4.4). We view an MDS as a context-to-text generation problem [21, 49] and deploy a unified framework called SeqMDS. Formally, given a dialogue context $C$, an MDS aims to generate a system response $R$ that maximizes the generation probability $P(R \mid C)$. Specifically, the sub-tasks are defined as follows. The NLU part of SeqMDS aims to generate a list of intent-slot-value triplets $I_t$:

$$I_t = \text{SeqMDS}(H_t),$$

where the dialogue history $H_t = [U_1, R_1, \ldots, U_t]$ consists of all previous utterances. $I_t$ can then be used to retrieve a set of related knowledge triplets $K_t$ from the knowledge base. The DPL part of SeqMDS generates the action-slot-value triplets $A_t$ given $H_t$, $I_t$, and $K_t$ as input:

$$A_t = \text{SeqMDS}([H_t; I_t; K_t]).$$

The NLG part of SeqMDS generates a response based on all previous information:

$$R_t = \text{SeqMDS}([H_t; I_t; K_t; A_t]).$$

SeqMDS in the above equations can be implemented by either a causal language model (§4.2) or a conditional causal language model (§4.3). For the causal language model, we consider the concatenation $[H_t; I_t; K_t; A_t; R_t]$ as a sequence of tokens $x_{1:T} = (x_1, x_2, \ldots, x_T)$. The $i$-th element can be an intent token (in intent-slot-value triplets), an action token (in action-slot-value triplets), or a general token (in utterances from patients or doctors). The goal is to learn the joint probability $p(x_{1:T})$ as:

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}).$$

The cross-entropy loss is employed to learn the parameters $\theta$:

$$\mathcal{L}_{CE} = -\sum_{b=1}^{B} \sum_{t=1}^{T_b} \log p_\theta(x^b_t \mid x^b_{<t}),$$

where $B$ denotes the batch size and $T_b$ denotes the length of the $b$-th sequence. In this work, we implement the causal language model based on GPT2 [54]. For the conditional causal language model, we treat the context (e.g., $[H_t; I_t; K_t; A_t]$) as an input sequence $x$ and the generation target (e.g., $R_t$) as an output sequence $y_{1:T'} = (y_1, \ldots, y_{T'})$, and model the conditional probability $p(y_{1:T'} \mid x) = \prod_{t=1}^{T'} p(y_t \mid y_{<t}, x)$. Similarly, the model can be learned by minimizing the cross-entropy loss:

$$\mathcal{L}_{CE} = -\sum_{b=1}^{B} \sum_{t=1}^{T'_b} \log p_\theta(y^b_t \mid y^b_{<t}, x^b).$$

In this work, we implement the conditional causal language model based on MT5 [75].
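To make the unified context-to-text formulation concrete, the sketch below linearizes the three sub-tasks into plain input/target sequence pairs that a causal or conditional causal language model can be trained on. The separator token, field names, and example strings are illustrative assumptions rather than the released serialization.

```python
# A sketch of how SeqMDS could linearize the three sub-tasks into plain
# context-to-text pairs. The "[SEP]" separator and field names are
# hypothetical; H = dialogue history, I = intent triples, K = knowledge
# triples, A = action triples, R = system response, as defined above.

def join(segments):
    """Concatenate a list of text segments with a separator token."""
    return " [SEP] ".join(segments)

def make_examples(history, intents, knowledge, actions, response):
    H, I, K, A, R = (join(history), join(intents),
                     join(knowledge), join(actions), response)
    return {
        # NLU: generate intent-slot-value triples from the history alone.
        "nlu": {"input": H, "target": I},
        # DPL: generate action-slot-value triples given H, I, and retrieved K.
        "dpl": {"input": join([H, I, K]), "target": A},
        # NLG: generate the response given everything produced so far.
        "nlg": {"input": join([H, I, K, A]), "target": R},
    }

examples = make_examples(
    history=["Patient: I have had a sore throat for ten days."],
    intents=["Informing-Symptom-sore throat"],
    knowledge=["(pharyngitis, has_symptom, sore throat)"],
    actions=["Inquiring-Symptom-fever"],
    response="Doctor: Do you also have a fever?",
)
```

For a causal language model such as GPT2, input and target are concatenated into one token stream; for a conditional model such as MT5, the input feeds the encoder and the target feeds the decoder.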
To extend the ReMeDi benchmark approaches introduced so far and enhance model training with augmented data, we describe a self-supervised contrastive learning (SCL) approach. First, we generate data by two heuristic data augmentation approaches, i.e., pseudo labeling (§4.4.1) followed by natural perturbation (§4.4.2). Then, we adopt contrastive learning (§4.4.3) to make the models aware that the augmented data is similar to the original data. We propose a pseudo labeling algorithm to extend the unlabeled dialogues. As shown in Algorithm 1, we decompose the labeled dialogues and unlabeled dialogues into utterance sets $S_l$ and $S_u$, respectively. Each element of $S_l$ contains an utterance (from user or system) and its corresponding semantic label (in the format of intent-slot-value or action-slot-value). $R$ is a set of predefined rules, e.g., if "take orally" is mentioned in an utterance, then the action is "Recommendation" and the slot is "Medicine". The output is $S_u$ with pseudo labels $Y_u$. The main procedure is as follows. For each utterance $u$ in $S_u$, we calculate the similarities between $u$ and all labeled utterances in $S_l$ to get the maximum similarity $s$ and the corresponding label $y$. If $s > \gamma$ (with $\gamma = 0.8$), $y$ is assigned as the pseudo label of $u$. Otherwise, each rule in $R$ is applied to $u$ to update $Y_u$ gradually. The similarity is computed based on the Levenshtein distance [30], which considers both the overlap rate and the order of characters. We use three natural perturbation strategies to extend the labeled dialogues: (1) Alias substitution. If an utterance contains a drug with an alias, the drug is replaced with its alias to obtain a new data sample. For example, people from different regions may use different names for the same drug. (2) Back-translation. Chinese utterances are first translated into English and then back into Chinese to form new data samples. Patients often use colloquial expressions, which motivates us to adopt back-translation to produce formal utterances from informal ones. (3) Random modification. We randomly add, delete, or replace a character of several medical entities in utterances. This simulates a common situation: typographical errors in online medical communities. We adopt an effective contrastive learning method to further increase the gains of natural perturbation. Following the CL framework of [9], we efficiently learn an enhanced representation by contrasting positive pairs with negative pairs within a mini-batch. Given an observed data sample, we randomly sample one natural perturbation strategy to obtain its augmented counterpart. Let $h \in \mathbb{R}^d$ denote the sentence representation with dimension $d$. We treat the representations of an observed sample and its augmented counterpart as a positive pair $(h_i, h_{i^+})$, and the representations of the other samples within the mini-batch of $2N$ samples as negatives. We compute the pairwise contrastive loss between the observed and augmented data:

$$\ell(i, i^+) = -\log \frac{\exp(\mathrm{sim}(h_i, h_{i^+})/\tau)}{\sum_{j=1}^{2N} \mathbb{1}_{[j \neq i]} \exp(\mathrm{sim}(h_i, h_j)/\tau)},$$

where $\mathbb{1}_{[j \neq i]} \in \{0, 1\}$ is an indicator function evaluating to 1 iff $j \neq i$, $\tau$ denotes the temperature parameter, and the representations are produced by the model's encoder and decoder, whose parameters $\theta$ are the model parameters. The function $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ computes cosine similarity. For one batch, we minimize the contrastive loss across positive pairs, for both $(i, i^+)$ and $(i^+, i)$:

$$\mathcal{L}_{CL} = \frac{1}{2N} \sum_{i=1}^{N} \left[ \ell(i, i^+) + \ell(i^+, i) \right].$$

We jointly learn the CL loss with the task-specific cross-entropy loss, and the final loss function is defined as:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{CL},$$

where $\lambda$ is the coefficient that balances the two training losses.
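For concreteness, here is a minimal PyTorch sketch of the in-batch contrastive loss reconstructed above, assuming the observed and augmented representations have already been pooled into fixed-size vectors; the helper name is hypothetical, and the temperature 0.5 matches the setting reported in the appendix.

```python
# A minimal sketch of the in-batch contrastive (NT-Xent-style) loss over
# N observed representations `h` and their N augmented counterparts `h_aug`.
import torch
import torch.nn.functional as F

def contrastive_loss(h: torch.Tensor, h_aug: torch.Tensor,
                     tau: float = 0.5) -> torch.Tensor:
    n = h.size(0)
    z = F.normalize(torch.cat([h, h_aug], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / tau                # cosine similarities scaled by 1/tau
    sim.fill_diagonal_(float("-inf"))    # implements the 1[j != i] mask
    # Row i's positive is its augmented twin: i <-> i + n.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(h.device)
    # Cross-entropy over each row = -log softmax at the positive index,
    # averaged over all 2N rows, i.e., both (i, i+) and (i+, i) directions.
    return F.cross_entropy(sim, targets)

# Example usage with random features standing in for encoder outputs:
obs, aug = torch.randn(8, 512), torch.randn(8, 512)
loss = contrastive_loss(obs, aug)       # added to the cross-entropy loss,
                                        # weighted by the coefficient lambda
```

Using cross-entropy over the similarity matrix is a standard, numerically stable way to realize the per-pair loss above: masking the diagonal removes self-similarity from the denominator, and the target index points each row at its positive pair.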
In this section, we first describe our evaluation settings, which include 3 dialogue tasks, 5 benchmark models, 8 automatic metrics, and 2 human evaluation metrics. Then we report on the results and a detailed analysis of the ReMeDi benchmarks. The ReMeDi benchmarks address three tasks, NLU, DPL and NLG: NLU aims to generate a list of intent-slot-value triplets given a dialogue context. DPL aims to generate a list of action-slot-value triplets given a dialogue context and a list of intent-slot-value triplets. NLG aims to generate a response given a dialogue context, intent-slot-value triplets, and action-slot-value triplets. We employ several pretrained models as benchmarks: BERT-WWM is a BERT [13] pre-trained on a Chinese Wikipedia corpus [12]. BERT-MED is a BERT pre-trained on a Chinese medical corpus. 4 GPT2 is used as a transformer decoder for causal language modeling; we use one pre-trained on Chinese chitchat dialogues [54]. 5 MT5 is used as a transformer encoder-decoder model for conditional causal language modeling; we use the one pre-trained on the multilingual C4 dataset [75]. 6 MT5+CL is an extension of MT5 with contrastive learning. We consider two types of evaluation: automatic (for the NLU and DPL tasks) and human (for the NLG task). For the automatic evaluation, we use 4 metrics to evaluate the NLU and DPL tasks: Micro-F1 is the intent/action/slot F1 regardless of categories. Macro-F1 denotes the weighted average of the F1 scores of all categories; in this work, we use the proportion of data in each category as the weight. BLEU indicates how similar the generated values of intent/action slots are to the golden ones [6]. Combination is defined as 0.5 * Micro-F1 + 0.5 * BLEU; it measures the overall performance in terms of both the intent/action/slot labels and the generated values. We use 4 metrics to evaluate the NLG task: BLEU1 and BLEU4 denote the uni-gram and 4-gram precision, indicating the fraction of overlapping n-grams out of all n-grams in the responses [6]. ROUGE1 refers to uni-gram recall, indicating the fraction of overlapping uni-grams out of all uni-grams in the responses [2]. METEOR measures the overall performance, i.e., the harmonic mean of uni-gram precision and recall [36]. For the NLG task, we sample 300 context-response pairs to conduct the human evaluation. We ask annotators to evaluate each response by choosing a score from {0, 1, 2}, denoting bad, neutral, and good, respectively. Each data sample is labeled by 3 annotators. We define 2 human evaluation metrics: Fluency, which measures whether a response reads smoothly, and Specialty, which measures whether a response provides accurate medical expertise. In this section, we report the results of the ReMeDi benchmark models (§5.2) on the NLU, DPL, and NLG tasks, respectively. Note that BERT treats NLU and DPL as classification tasks; however, it is inapplicable to the NLG task. Table 3 shows the performance of all models on the NLU task. First, for intent label identification, MT5 achieves the best Micro-F1 of 75.32%, followed by GPT2 with 73.32%. MT5 outperforms BERT-WWM/BERT-MED by 3.56%/3.85%, and GPT2 does so by 1.56%/1.85%. Thus, MT5 and GPT2 can generate more accurate intent labels than the BERT models. Second, for intent-slot label identification, the BERT models outperform the others by large margins in terms of both Micro-F1 and Macro-F1. BERT-MED achieves 2.01%/8.41% higher Micro-F1 and 5.65%/12.45% higher Macro-F1 than MT5 and GPT2. We believe one reason is that BERT predicts over the label space rather than the whole vocabulary (as GPT2 and MT5 do), which makes the task easier; for the same reason, however, the BERT models cannot predict slot values.
Another reason is that, unlike intent identification, the training samples for intent-slot identification are insufficient and imbalanced (see Figure 3), so the generation models (e.g., MT5 and GPT2) can hardly beat the classification models (e.g., BERT-WWM and BERT-MED). Third, for value generation, MT5 significantly outperforms GPT2 by 10.04% in terms of BLEU, and the BERT models are unable to generate values. This shows that the conditional causal language model is better suited for value generation. Fourth, MT5 outperforms the others in terms of overall performance, i.e., Combination. We conducted an ablation study and find that pseudo labeling, natural perturbation, and historical utterances all have a positive effect on the overall performance. Specifically, historical utterances have the largest influence (−1.02%), followed by natural perturbation (−0.62%) and pseudo labeling (−0.40%). All scores decrease except the BLEU score of MT5 without natural perturbation. This is because the meaning of entities might become ambiguous after modification, e.g., "azithromycin" is replaced by its common name "泰力特" (Tylett), which is hard to distinguish from "力比泰" (Alimta) in Chinese. CL improves the performance of MT5 in terms of most metrics. Especially for NLU, it increases Macro-F1 by 2.76%, although it slightly decreases Micro-F1. CL performs better on the types of slots that account for a larger proportion of the data (e.g., "Medicine" and "Symptom" in Figure 3). Table 4 shows the performance of all models, and the ablation study of MT5 (oracle), on the DPL task. First, MT5 (oracle) outperforms all the other models on all metrics. Specifically, it outperforms BERT-WWM by 0.59% and 1.35% in Micro-F1 for action and action-slot label identification, respectively. This reveals that MT5 can beat the BERT models when given more information in the input, especially the results from NLU. Besides, it achieves 2.86% higher BLEU and 7.11% higher Combination than GPT2 (oracle), which indicates that conditional causal language modeling is more effective in this case. Second, we explore the joint learning performance of MT5 and GPT2, where the prediction from NLU is used as input to DPL. MT5 still outperforms GPT2 by 2.97% in terms of Combination, specifically by 2.99% for action label identification, 4.12% for action-slot label identification, and 1.83% for value generation. Third, we conducted an ablation study on MT5 and find that pseudo labeling, natural perturbation, historical utterances, and external knowledge are still helpful. Specifically, external knowledge has the largest influence (−4.13%), followed by historical utterances (−1.58%), pseudo labeling (−0.97%) and natural perturbation (−0.38%). All scores generally decrease. One exception is that BLEU increases by 0.17% without natural perturbation; similar to the case in NLU, some modified entities may cause ambiguity. CL is beneficial in terms of all evaluation metrics. Specifically, CL increases Combination by 1.05%, while improving the generation of actions by 4.71%/3.59% and action-slots by 1.91%/1.98% in terms of Micro/Macro-F1. Thus CL helps the DPL task more than it helps the NLU task. Table 5 shows the automatic evaluation of GPT2 and MT5, and the ablation study of MT5 (oracle), on the NLG task. First, MT5 (oracle) outperforms GPT2 (oracle). Table 6 shows the human evaluation on the NLG task.
We did not consider the jointly-learned GPT2 and MT5, as they contain accumulated errors from the upstream tasks, which would influence the evaluation of NLG. First, MT5 (oracle) performs better than GPT2 (oracle) on Fluency and Specialty. This indicates that MT5 can generate more fluent responses that provide more accurate medical knowledge than GPT2, which is consistent with the results of the automatic evaluation. Second, the Fluency score is higher than the Specialty score for both GPT2 and MT5. This is because Specialty is harder to achieve: generating responses with extensive and accurate expertise is more challenging. Third, the average pairwise Cohen's kappa coefficient is larger than 0.6 for all metrics, which indicates good annotation agreement. In this section, we analyze one of the strongest performing ReMeDi benchmarks, MT5, and reflect on the ReMeDi dataset creation process in terms of dataset size and data acquisition types. 5.5.1 Dataset size. The ReMeDi benchmarks achieve a solid performance for future research to compare against. Would an increase in the ReMeDi dataset size have helped to make the benchmarks even more challenging? To answer this question, we simulate the situation of feeding MT5 with more data via pseudo labeling. We investigate the performance on the NLU, DPL, and NLG tasks with increasing volumes of training dialogues, as shown in Figure 4.

Figure 4: Analysis of data-hunger tolerance on the NLU, DPL, and NLG tasks w.r.t. different sizes of training dialogues. The x-axis is the number of training dialogues; the 0 point denotes no pseudo-labeled dialogues. The left y-axis is the Combination score on the NLU and DPL tasks, and the right y-axis is BLEU1 on the NLG task.

We see that feeding more simulated data has a positive effect on all three tasks, as the overall trends of the lines are upward. Specifically, NLG improves by 0.95% in BLEU1, followed by DPL (+0.79% Combination) and NLU (+0.69% Combination). However, the improvement has an upper bound. For example, when the number of dialogues is increased to 90K, NLU and DPL do not improve and even decrease slightly compared with the performance at 70K. This shows to what extent the current volume of dialogues suffices to approach the upper-bound performance, which helps with the pains-gains trade-off of data acquisition. 5.5.2 Data acquisition types. What types of data should we expand to enlarge the gains of data acquisition? As shown in Figure 5, we compare the influence of the different natural perturbation strategies on all three tasks. The overall influence of diverse data with extensive perturbation is positive. The mixture of "all" strategies significantly outperforms using "none" of the strategies on all tasks. NLG improves most (+3.16%), followed by NLU (+0.62%) and DPL (+0.38%). Besides, different strategies have different influences on different tasks. The alias strategy improves NLU most; this might be because adding alias entities to data samples helps with entity recognition. The random strategy has the largest effect on DPL, as it improves robustness given more varied input. The back-translation (trans) strategy achieves the best results on NLG, as it can generate larger-scale dialogues than the other two strategies. Therefore, adding data that improves the diversity of ReMeDi is more than welcome. In this paper, we have introduced key resources for medical dialogues (ReMeDi): a dataset and benchmarks. The ReMeDi dataset is a multi-domain, multi-service dataset with fine-grained labels for medical dialogue systems.
We focus on providing the community with a new test set for evaluation and a small fine-tuning set, to encourage low-resource generalization without large, monolithic, labeled training sets. We consider NLU, DPL and NLG in a unified SeqMDS framework, based on which we deploy several state-of-the-art pretrained language models, with contrastive learning, as the ReMeDi benchmarks. We have also evaluated the ReMeDi benchmarks against the ReMeDi dataset. Both the ReMeDi dataset and benchmarks are available online; please see the Appendix for details. The resources released with this work have broader implications in that: (1) The fine-grained labels provided with ReMeDi can help research on the interpretability of medical dialogue systems. (2) The performance of the baseline models is far from satisfactory; therefore, we hope that the ReMeDi resources facilitate and encourage research on low-resource medical dialogue systems. One limitation of the ReMeDi dataset is that we do not provide explicit boundaries between dialogue sessions with different service types. This makes it challenging to explicitly model relationships among multiple services. As to future work, on the one hand, we will extend ReMeDi with service boundary labels to facilitate research on dialogue context modeling among multiple services. On the other hand, we will extend ReMeDi with more languages to help study multilingual MDSs. Last but not least, we call for studies to improve the benchmark performance, as well as to conduct underexplored research, e.g., on dialogue tasks for rare diseases under extremely low-resource settings. All resources presented in this paper, the dataset, code for the baselines, and evaluation scripts, are shared at https://github.com/yanguojun123/Medical-Dialogue. The shared resources are organized in multiple folders: (1) The folder "annotation_guide" contains the code and guidelines for the labeling system. (2) The folder "data" contains the ReMeDi-large and ReMeDi-base datasets. Each dialogue consists of multiple turns of utterances from doctors or patients, identified by a unique dialogue id. Each utterance in a dialogue turn is provided with: a turn id, a dialogue role, the utterance text, and a label list. Each label consists of the sub-sentence text, the start/end positions of the sub-sentence, and the intent-slot-value or action-slot-value labels (a schematic example is sketched below). (3) The folder "data_process" contains the code for processing the crawled raw data, the pseudo labeling, and the natural perturbation. (4) The folder "model" contains the code of the benchmark models based on BERT, GPT2, and MT5. (5) The folder "evaluate" contains the code for automatic evaluation in terms of all metrics. All resources are licensed under the MIT license. BERT-WWM and BERT-MED use 12 transformer blocks with 12 attention heads and a hidden size of 768; the maximum length of input tokens is 512 and the learning rate is 2e-5. GPT2 uses 10 transformer decoder blocks with 12 attention heads and a hidden size of 768. MT5 uses 8 transformer encoder blocks followed by 8 decoder blocks with 12 attention heads and a hidden size of 512. For GPT2 and MT5, the maximum length of input tokens is 800 and the learning rate is 1.5e-4. We set the coefficient $\lambda$ to 0.8 and the temperature $\tau$ to 0.5 for contrastive learning. We fine-tune the models on three training sets produced by pseudo labeling, natural perturbation, and human annotation, respectively. We use AdamW [25] as the optimization algorithm. The maximum number of training epochs is set to 30.
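Returning to the data format described above, the following schematic shows how one labeled record could look; all field names and values are hypothetical placeholders, and the released files in the "data" folder should be consulted for the exact schema.

```python
# A schematic, hypothetical example of one labeled ReMeDi-base record,
# following the fields described above: dialogue id, turns with role and
# text, and a label list with sub-sentence spans and intent/action-slot-value
# triples (values as [main, subordinate] pairs).
dialogue_record = {
    "dialogue_id": "remedi_000001",          # unique per dialogue
    "turns": [
        {"turn_id": 0,
         "role": "patient",
         "utterance": "Ten days. Yes. What is the disease?",
         "labels": [
             {"sub_sentence": "Ten days.",
              "span": [0, 9],                # start/end character positions
              "intent": "Informing",
              "slot": "Time",
              "value": ["duration", "ten days"]},
         ]},
        {"turn_id": 1,
         "role": "doctor",
         "utterance": "It looks like pharyngitis.",
         "labels": [
             {"sub_sentence": "It looks like pharyngitis.",
              "span": [0, 26],
              "action": "Diagnosis",
              "slot": "Disease",
              "value": ["pharyngitis", None]},
         ]},
    ],
}
```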
We implemented the benchmark models in PyTorch [47]. Our models are trained on 4 Nvidia TITAN RTX GPUs with 20 GB of memory. The results reported in this work can be reproduced with the random seed fixed.

References
Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey.
METEOR: An automatic metric for MT evaluation with improved correlation with human judgments.
Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems.
Jointly improving language understanding and generation with quality-weighted weak supervision of automatic labeling.
BERT-DST: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer.
A systematic comparison of smoothing techniques for sentence-level BLEU.
A survey on dialogue systems: Recent advances and new frontiers.
WAIS: Word attention for joint intent detection and slot filling.
A simple framework for contrastive learning of visual representations.
Medically aware GPT-3 as a data generator for medical dialogue summarization.
Learning a similarity metric discriminatively, with application to face verification.
Pre-training with whole word masking for Chinese BERT.
BERT: Pre-training of deep bidirectional transformers for language understanding.
Extracting symptoms and their status from clinical conversations.
Learning to infer entities, properties and their relations from clinical conversations.
CERT: Contrastive self-supervised learning for language understanding.
Analyzing and improving representations with the soft nearest neighbor loss.
SimCSE: Simple contrastive learning of sentence embeddings.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.
A simple language model for task-oriented dialogue.
Incorporating prior knowledge into word embedding for Chinese word similarity measurement.
Context-aware symptom checking for disease diagnosis using hierarchical reinforcement learning.
Supervised contrastive learning.
Adam: A method for stochastic optimization.
Skip-thought vectors.
A simple but effective BERT model for dialog state tracking on resource-limited systems.
Contrastive representation learning: A framework and review.
Contrastive learning with adversarial perturbations for conditional text generation.
Binary codes capable of correcting deletions, insertions, and reversals.
More but correct: Generating diversified and entity-revised medical responses.
Semi-supervised variational reasoning for medical dialogue generation.
Language model pre-training improves generalization in policy learning.
End-to-end task-completion neural dialogue systems.
Weijian Sun, and Xuanjing Huang. 2020. Task-oriented dialogue system for automatic disease diagnosis via hierarchical reinforcement learning.
ROUGE: A package for automatic evaluation of summaries.
Graph-evolving meta-learning for low-resource medical dialogue generation.
Enhancing dialogue symptom diagnosis with global attention and symptom graph.
BiToD: A bilingual multi-domain dataset for task-oriented dialogue modeling.
MedDG: A large-scale medical consultation dataset for building medical dialogue system.
Cross-lingual dialogue dataset creation via outline-based generation.
DialoGLUE: A natural language understanding benchmark for task-oriented dialogue.
Generation-distillation for efficient natural language understanding in low-data settings.
Counter-fitting word vectors to linguistic constraints.
Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Deep metric learning via lifted structured feature embedding.
PyTorch: An imperative style, high-performance deep learning library.
A modular task-oriented dialogue system using a neural mixture-of-experts.
Retrospective and prospective mixture-of-generators for task-oriented dialogue response generation.
Combining word embedding and semantic lexicon for Chinese word similarity computation.
Adversarial advantage actor-critic model for task-completion dialogue policy learning.
A stack-propagation framework with token-level intent detection for spoken language understanding.
Improving language understanding by generative pre-training.
Language models are unsupervised multitask learners.
Exploring the limits of transfer learning with a unified text-to-text transformer.
Crossing the conversational chasm: A primer on multilingual task-oriented dialogue systems.
Wizard of search engine: Access to information through conversations with search engines.
Learning a nonlinear embedding by preserving class neighbourhood structure.
FaceNet: A unified embedding for face recognition and clustering.
A simple but tough-to-beat data augmentation approach for natural language understanding and generation.
Understanding medical conversations with scattered keyword attention and weak supervision from responses.
Improved deep metric learning with multi-class n-pair loss objective.
CLINE: Contrastive learning with semantic negative examples for natural language understanding.
SNCSE: Contrastive learning for unsupervised sentence embedding with soft negative samples.
Coding electronic health records with adversarial reinforcement path generation.
EDA: Easy data augmentation techniques for boosting performance on text classification tasks.
Task-oriented dialogue system for automatic diagnosis.
A survey of joint intent detection and slot-filling models in natural language understanding.
A network-based end-to-end trainable task-oriented dialogue system.
TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue.
Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis.
Approximate nearest neighbor negative contrastive learning for dense text retrieval.
End-to-end knowledge-routed relational dialogue system for automatic diagnosis.
Aditya Barua, and Colin Raffel. 2021. MT5: A massively multilingual pre-trained text-to-text transformer.
On the generation of medical dialogues for COVID-19.
UBAR: Towards fully end-to-end task-oriented dialog systems with GPT-2.
MedDialog: A large-scale medical dialogue dataset.
SMedBERT: A knowledge-enhanced pre-trained language model with structured semantics for medical text mining.
MIE: A medical information extractor towards medical dialogues.
Recent advances and challenges in task-oriented dialog systems.
CrossWOZ: A large-scale Chinese cross-domain task-oriented dialogue dataset.
AllWOZ: Towards multilingual task-oriented dialog systems for all.