title: More but Correct: Generating Diversified and Entity-revised Medical Response
authors: Li, Bin; Chen, Encheng; Liu, Hongru; Weng, Yixuan; Sun, Bin; Li, Shutao; Bai, Yongping; Hu, Meiling
date: 2021-08-03

Medical Dialogue Generation (MDG) aims to build a medical dialogue system for intelligent consultation that can communicate with patients in real time, thereby improving the efficiency of clinical diagnosis, with broad application prospects. This paper presents our framework for the Chinese MDG task organized by the 2021 China Conference on Knowledge Graph and Semantic Computing (CCKS) competition, which requires generating context-consistent and medically meaningful responses conditioned on the dialogue history. We propose a pipeline system composed of entity prediction and entity-aware dialogue generation, adding the predicted entities to the dialogue model through a fusion mechanism so as to utilize information from different sources. At the decoding stage, we propose a new decoding mechanism named Entity-revised Diverse Beam Search (EDBS) to improve entity correctness and promote the length and quality of the final response. The proposed method wins both the CCKS competition and the International Conference on Learning Representations (ICLR) 2021 Workshop on Machine Learning for Preventing and Combating Pandemics (MLPCP) Track 1 Entity-aware MED competition, which demonstrates the practicality and effectiveness of our method.

During the COVID-19 epidemic, problems such as shortages of medical resources, heavy burdens on doctors, and long waiting times for patients have arisen in China. As a result, building an automatic-response medical dialogue system is beneficial for improving the efficiency of clinical consultation and reducing the burden on doctors. To promote research on Chinese medical dialogue generation, the 15th China Conference on Knowledge Graph and Semantic Computing (CCKS 2021) sets Task 11, Entity-containing Medical Dialogue Generation, where participants are required to build a dialogue generation model based on a doctor-patient dialogue corpus in gastroenterology.

In recent years, medical dialogue generation has attracted more and more attention due to its wide range of applications [1-5]. However, to deploy such systems in real applications, that is, to make the model imitate a real doctor, two problems need to be solved urgently. One is that the model needs to give reasonable responses, which often involve correct medical entity information [1, 2]. The other is that the model needs to imitate human conversational habits, which often means generating long responses [3, 4]. To achieve this, we propose a pipeline system that contains two parts: entity prediction and entity-aware dialogue response generation. Specifically, our contributions in this work can be summarized as follows:

- We build a framework for medical dialogue generation, adopting a pipeline structure with strong flexibility to obtain high-quality responses.
- An encoding fusion module is developed to adaptively control the encodings from different sources, making full use of the medical entity information in the dialogue.
- We propose Entity-revised Diverse Beam Search (EDBS), which improves the diversity of final responses while keeping the complete predicted entity information.
Medical dialogue generation has made great progress in recent years. Early research mainly focused on task-oriented dialogue systems [1, 2], which emphasize automatic disease diagnosis with high accuracy. However, the final response is often template-based, which requires vast human labor to design. Subsequently, many studies began to explore automatic response generation. Zeng et al. [3] develop a dialogue system for COVID-19. Liu et al. [4] construct a dialogue system in the field of gastroenterology. Nonetheless, it is not easy for the model to imitate the doctor. The simple Seq2Seq [6] structure cannot make good use of the knowledge information in the reasoned entities. Besides, Beam Search [7] tends to generate short responses during decoding, while the predefined conditions tend to shift with Diverse Beam Search (DBS) [8].

Different from previous work, we improve the original Seq2Seq dialogue generation model with an encoding fusion mechanism. Specifically, we add contextual encodings and predicted-entity encodings to the dialogue generation model through encoding fusion, thereby making full use of information from different sources. At the same time, we propose a decoding mechanism, Entity-revised Diverse Beam Search (EDBS), to improve the overall quality of the final response in terms of F1 and BLEU scores.

In this section, we first illustrate the framework of our method, then depict each component in detail. Finally, we present the fusion strategy for improvement. The framework is described in Figure 1, where we adopt a pipeline method. In the upstream, the best set of predicted entities is derived from optimal F1 threshold searching. In the downstream, the input tokens together with the predicted entities are sent to the entity-aware dialogue generation model. As a result, the final responses are obtained via the entity-revised diverse beam search.

Our medical entity prediction model is shown in Figure 2, where we choose different pre-trained models, including BERT [9], RoBERTa [10], PCL-MedBERT, RoBERTa-wwm-ext [11], etc., as the backbone. For the pre-trained models from general domains, we apply MacBERT's pre-training method [12] with online medical data for continued pre-training [13], in order to improve the generalization of the model on medical-domain tasks. We extract the feature H_0 by concatenating the CLS vectors of the last three layers. Then, the concatenated vector is passed through an attention layer to utilize the information between different layers. The final predicted entity distribution is obtained via multi-sample dropout, a fully connected layer, and a sigmoid function. Specifically, we concatenate multiple rounds of dialogue history with the history entities, and introduce a [SAP] token as a separator between the history dialogues and the history entities. As a result, the task is formulated as multi-label classification with the loss function L_p(X, T), which is defined as a per-category weighted binary cross-entropy:

L_p(X, T) = -Σ_k w_k [ t_k log p(x_k) + (1 - t_k) log(1 - p(x_k)) ],    (1)

where w_k is the optimal weight obtained with the F1 threshold search on the validation set, t_k is the target entity label, and x_k is the input feature. As the categories are imbalanced and the result obtained with the plain cross-entropy loss is not globally optimal in terms of the F1 index, the optimal weight w_k is determined with an optimal F1 threshold search. We treat each category of the multi-label problem as a binary classification problem, and a reasonable threshold for each category can be obtained by threshold search. More precisely, we obtain the optimal threshold by adjusting the threshold from 0.3 to 0.6 through grid search, with a step of 0.001.
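The following minimal sketch illustrates how such a per-class threshold search could be implemented; the function name and the use of scikit-learn's f1_score are illustrative assumptions rather than part of the released system.

```python
import numpy as np
from sklearn.metrics import f1_score

def search_thresholds(probs, labels, lo=0.3, hi=0.6, step=0.001):
    """Grid-search one decision threshold per entity category on validation data.

    probs:  (n_samples, n_classes) sigmoid outputs of the entity predictor
    labels: (n_samples, n_classes) binary ground-truth entity labels
    Returns per-class thresholds that maximize the per-class F1 score.
    """
    n_classes = probs.shape[1]
    thresholds = np.full(n_classes, 0.5)
    grid = np.arange(lo, hi + step, step)
    for k in range(n_classes):
        best_f1, best_t = -1.0, 0.5
        for t in grid:
            f1 = f1_score(labels[:, k], (probs[:, k] >= t).astype(int),
                          zero_division=0)
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[k] = best_t
    return thresholds

# At inference time, the predicted entity set for a dialogue consists of the
# categories whose predicted probability exceeds their class-specific threshold.
```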
The entity-aware dialogue generation model is presented in Figure 3. We adopt the encoder-decoder architecture as the backbone, where the dialogue context and the predicted entities are fused through the Masked Multi-head Cross Attention mechanism (MMCA) [6], so that the predicted entities serve as the condition for generating the final response auto-regressively. The following parts introduce each component of the proposed model.

Context embedding module. The context embedding module is presented in Figure 4, where the token embedding, the position embedding, and the entity embedding of the input are added together to form the dialogue context embedding. Note that E_avg is the average embedding of the entire sentence (divided by the sentence length), and the other part, such as Ent_k, is the embedding of the entities contained in the corresponding sentence k, which is obtained through a two-layer linear perception projection.

Predicted entities embedding. The predicted entities are concatenated together, separated by [SEP], and mapped into tokens by the tokenizer. The entity embedding is then obtained by adding the token embeddings and the position embeddings.

Encoding fusion mechanism. We design an encoding fusion mechanism, where the encoding of the dialogue context E_C, the encoding of the predicted entities E_ent, and the shifted-right output encoding E_prev are sent together to the Masked Multi-head Cross Attention module:

O_ent = MMCA(E_prev, E_ent, E_ent),    (2)
O_C = MMCA(E_prev, E_C, E_C),    (3)

where O_ent and O_C represent the encodings attending to the predicted entities and to the dialogue context, respectively. In order to attend to the information of the already decoded tokens, the previous decoded encoding O_prev is obtained through

O_prev = MMCA(E_prev, E_prev, E_prev).    (4)

After obtaining O_ent, O_C, and O_prev, the three encodings are averaged:

O = (O_ent + O_C + O_prev) / 3.    (5)

Dialogue generation. The dialogue generation is processed via auto-regressive decoding, with the loss function

L_g(θ) = -Σ_i log P(y_i | x_0, ..., x_{i-1}; θ),

where i denotes the i-th word generated by the decoder, and x_0, ..., x_{i-1}, y_i is a sequence of words from the generated response. Likewise, the input of the decoder can also be represented as the mean fused encoding.

Auxiliary tasks. Note that there are two main gaps in our framework: one is the gap between the data used in the pre-training and fine-tuning stages, and the other is the gap between the predicted entities and the real entities. As a result, we design two auxiliary tasks to fill these gaps.

1. Language model task:

L_LM(ϕ) = Σ_i log P(x_i | x_{i-k}, ..., x_{i-1}; ϕ),

where ϕ represents the parameters of the encoder, k is the size of the context window, and x_{i-k}, ..., x_{i-1}, x_i is a sequence of tokens sampled from the training corpus.

2. Hierarchical entity prediction task, where t_i is the ground-truth label, L_{T-5}(ϕ) denotes the loss over the 5 domain types, and L_{T-160}(ϕ) denotes the loss over the 160 entity types.

As a result, the final loss function can be written as

L(ϕ, θ) = L_g(θ) + µ L_LM(ϕ) + ν L_{T-5}(ϕ) + λ L_{T-160}(ϕ),

where L(ϕ, θ) is the total loss and all the tasks share the same parameters. To facilitate training, we set the weights µ, ν, λ to 1.
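Before turning to decoding, the encoding fusion step can be made concrete with a short sketch. The following is a simplified PyTorch illustration of one fused decoder block following Equations (2)-(5); the module names, dimensions, and the use of torch.nn.MultiheadAttention are illustrative assumptions, not the exact implementation used in our system.

```python
import torch
import torch.nn as nn

class FusedDecoderLayer(nn.Module):
    """One decoder block that fuses the previously decoded encoding, the dialogue
    context, and the predicted-entity encoding by averaging three attention
    outputs, as in Equations (2)-(5). A simplified sketch, not the original code."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn_prev = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_ent = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, e_prev, e_ctx, e_ent, causal_mask):
        # Eq. (4): masked self-attention over the shifted-right outputs.
        o_prev, _ = self.attn_prev(e_prev, e_prev, e_prev, attn_mask=causal_mask)
        # Eq. (3): cross-attention to the dialogue-context encoding.
        o_ctx, _ = self.attn_ctx(e_prev, e_ctx, e_ctx)
        # Eq. (2): cross-attention to the predicted-entity encoding.
        o_ent, _ = self.attn_ent(e_prev, e_ent, e_ent)
        # Eq. (5): average the three sources, then a standard feed-forward block.
        fused = (o_prev + o_ctx + o_ent) / 3.0
        h = self.norm1(e_prev + fused)
        return self.norm2(h + self.ffn(h))

# Usage with dummy tensors (batch of 2):
layer = FusedDecoderLayer()
e_prev = torch.randn(2, 16, 768)   # shifted-right response tokens
e_ctx = torch.randn(2, 128, 768)   # dialogue-context encoding
e_ent = torch.randn(2, 10, 768)    # predicted-entity encoding
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
out = layer(e_prev, e_ctx, e_ent, mask)  # shape (2, 16, 768)
```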
Diverse Beam Search is used to generate diversified responses. However, due to the lack of conditional information guidance, the results of the original DBS are often uncontrollable, though diversified. Therefore, we design EDBS, which generates responses under the predicted entity conditions, adopting an entity modification method as guidance so that the results do not shift away from the condition entities. Specifically, we consider the absence of true entities, the presence of incorrect entities, and redundant predicted entities during the generation process. (The full procedure is given as an algorithm in the original paper; the recoverable steps are: when a diversity score Ω falls below a threshold θ, a multinomial distribution is used to increase the sampling diversity of the true entities; θ is decayed by a factor of 0.9 each iteration so that the final result converges; the entity-revised method is then performed, in which each beam b in B is divided into a list G at sentence granularity, the predicted entities E are mapped into a sentence list S, and the entities in each sentence are normalized to form the entity list R.) To sum up, we consider the three cases of conditional shifting in DBS, and use the edit distance together with the entity conditions to correct them until the final result is obtained.

We adopt a curriculum 5-fold training strategy to fine-tune the Seq2Seq model, and use a bagging fusion mechanism to fuse the features of different models. Both types of fusion improve the final performance and, to a certain extent, alleviate the problem of exposure bias, increasing the quality of the final response.

In this section, we introduce the dataset, evaluation, implementation, and results of the experiments. The dataset statistics are shown in Table 1. The task is based on a medical dialogue dataset with entity annotations. The Ref set is the test-set sample announced online in advance. Note that the number of entities and the sentence length in the test set are larger than in the training set, which requires the model to generate correct and longer sentences.

There are two evaluation indicators, the BLEU-avg [14] score and Entity-F1. The BLEU-avg score measures response generation quality, while the Entity-F1 score measures entity correctness.

Table 2 reports the entity prediction results (F1) with different backbones:

- BERT-base-chinese [9]: 31.23
- RoBERTa-wwm-ext [11]: 31.68
- RoBERTa-large [10]: 33.23
- Mac-BERT-large [12]: 33.64
- PCL-BERT-wwm: 34.68
- PCL-BERT-wwm-Post: 35.71

For the entity prediction training, we adopt a stratified learning rate with an attenuation strategy. Specifically, we set a larger learning rate for the layers on top of the backbone and a smaller learning rate inside the pre-trained model, and the closer a layer is to the bottom, the smaller its learning rate. We also adopt FGM adversarial training [15], mixed-precision training, and a moving-average strategy to train the entity prediction model.

For the dialogue generation part, we design a curriculum boosting learning method to train the Seq2Seq model, which can be divided into three steps: 1. The pre-trained Seq2Seq model is used to initialize the parameters of the encoder and decoder, and is fine-tuned on the cleaned data; we use the boost method to train 4 epochs for each of the 5 folds. 2. We use the dialogues with entities from all doctors for training, so that the generated responses contain the common features of doctors; we again use the boost method to train 4 epochs for each of the 5 folds. 3. We further select the dialogues with entities from doctors whose length is greater than 11 (counted on the Ref set) to train the model; as these dialogues have more entity characteristics, it is easier for the model to adapt to generating longer sentences. We train 2 epochs for each of the 5 folds.
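Before moving to the experimental results, the entity-revision idea behind EDBS can be illustrated with a small self-contained sketch. This is a simplified, hypothetical post-processing pass over a single decoded candidate rather than the full algorithm: for every predicted entity that is absent from the candidate, it either corrects a near-match found by edit distance or appends the missing entity.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein (edit) distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def revise_candidate(candidate: str, predicted_entities, max_dist: int = 1) -> str:
    """Toy entity revision for one decoded candidate (not the full EDBS)."""
    revised = candidate
    for ent in predicted_entities:
        if ent in revised:
            continue  # the predicted entity is already covered
        n = len(ent)
        # search for a near-match of the same length anywhere in the candidate
        spans = [(edit_distance(revised[i:i + n], ent), i)
                 for i in range(max(len(revised) - n + 1, 0))]
        if spans:
            dist, i = min(spans)
            if dist <= max_dist:
                revised = revised[:i] + ent + revised[i + n:]  # correct near-match
                continue
        revised = revised + ent  # no close match: append the missing entity
    return revised

# Example: the drug name in the candidate is slightly wrong and gets corrected.
print(revise_candidate("考虑是胃炎，建议服用奥美拉坐。", ["胃炎", "奥美拉唑"]))
# -> 考虑是胃炎，建议服用奥美拉唑。
```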
In this section, we present our experimental results and the online leaderboard results. As shown in Table 2, the suffix -Post denotes continued pre-training with the MacBERT pre-training method on the collected data. BERT-base-chinese and RoBERTa-wwm-ext achieve similar results. After the backbone is replaced by RoBERTa-large, the improvement is about 1.62. We finally choose PCL-BERT-wwm as our baseline backbone, which gains a further improvement of 1.03 after continued pre-training. We also try different model structures with the best backbone, as shown in Table 3. The results show that the model structure with the concatenated features of the last three layers, the attention mechanism, and multi-sample dropout is the most competitive.

The results of different dialogue generation models are shown in Table 4. The performance of the original Transformer [16] is relatively poor, while the limited encoding length restricts GPT2's [17] ability (with its small vocabulary). We carry out curriculum boost training for BertGPT [6], whose average score is 2.62 higher than that of the original model. As the scale of the pre-trained model increases, each test score shows an upward trend. As a result, the fine-tuned CPM2-prompt [19] reaches the highest average score of 18.21 among the single models. It can also be found that the proposed context encoding module is beneficial to the BLEU score, as history entities are equally important for response generation. The encoding fusion module further improves F1 by 0.62 and BLEU by 0.2.

We also compare different decoding methods with the proposed EDBS. As shown in Table 5, the EDBS method has significant advantages in entity accuracy and response quality compared with the other methods. The detailed online results are shown in Table 6 and Table 7: our method wins both the CCKS and the ICLR Workshop MLPCP Track 1 competitions. These results demonstrate that the proposed method is effective and solid.

In this paper, we propose a pipeline framework for Chinese medical dialogue generation, which consists of two parts: medical entity prediction and entity-aware dialogue generation. In the upstream, the entity prediction model is optimized with F1 threshold search; in the downstream, the predicted entities are utilized with the proposed encoding fusion mechanism, which controls the information from different sources. We also improve the original DBS with the entity-revised method, which proves effective for improving the quality of the final response. We achieve the best results in both the CCKS and the ICLR Workshop MLPCP Track 1 competitions, which demonstrates the effectiveness and practicality of our proposed method. In the future, we will consider using a knowledge graph to infer the predicted entities, and will try different fusion strategies during generation, to further improve the correctness and quality of the generated responses.
References

[1] Task-oriented dialogue system for automatic diagnosis.
[2] End-to-end knowledge-routed relational dialogue system for automatic diagnosis.
[3] MedDialog: A large-scale medical dialogue dataset.
[4] A large-scale medical consultation dataset for building medical dialogue system.
[5] Graph-evolving meta-learning for low-resource medical dialogue generation.
[6] Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
[7] Empirical analysis of beam search performance degradation in neural sequence models.
[8] Diverse beam search: Decoding diverse solutions from neural sequence models.
[9] Pre-training of deep bidirectional transformers for language understanding.
[10] A robustly optimized BERT pretraining approach.
[11] Pre-training with whole word masking for Chinese BERT.
[12] Revisiting pre-trained models for Chinese natural language processing.
[13] Don't stop pretraining: Adapt language models to domains and tasks.
[14] A systematic comparison of smoothing techniques for sentence-level BLEU.
[15] Adversarial training methods for semi-supervised text classification.
[16] Attention is all you need. In: Advances in Neural Information Processing Systems.
[17] Language models are unsupervised multitask learners.
[18] Exploring the limits of transfer learning with a unified text-to-text transformer.
[19] Large-scale cost-effective pre-trained language models.
[20] The curious case of neural text degeneration.