BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

Hongyi Yuan, Zheng Yuan, Ruyi Gan, Jiaxing Zhang, Yutao Xie, Sheng Yu
(* Contributed equally. † Corresponding author.)
2022-04-08

Abstract

Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance, yet remain understudied. Approaching natural language understanding (NLU) tasks as NLG achieves satisfying performance in the general domain through constrained language generation or language prompting. We emphasize that the lack of in-domain generative language models and the unsystematic treatment of generative downstream benchmarks in the biomedical domain hinder the development of the research community. In this work, we introduce the generative language model BioBART, which adapts BART to the biomedical domain. We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition. BioBART, pretrained on PubMed abstracts, has enhanced performance compared to BART and sets strong baselines on several tasks. Furthermore, we conduct ablation studies on the pretraining tasks for BioBART and find that sentence permutation has negative effects on downstream tasks.

1 Introduction

Since the advent of ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), the pretrain-then-finetune paradigm has brought great performance improvements and dominated methodological research in natural language processing (NLP). Previous research has illustrated that pretraining language models on domain-specific corpora can further improve model performance on domain-specific tasks (Gururangan et al., 2020). With the large-scale, publicly accessible corpora from PubMed, researchers have already proposed biomedical pretrained language models such as BioBERT and PubMedBERT (Gu et al., 2022) to aid later research.

Natural language generation (NLG) tasks such as dialogue systems (Chao et al., 2017) and question answering are of critical importance for biomedical artificial intelligence research, and there is also a trend in the general domain to approach natural language understanding as NLG tasks (Sun et al., 2021; Yan et al., 2021). For example, an entity retrieval task can be solved by constrained natural language generation (Cao et al., 2021). However, two gaps remain in biomedical NLG research. On the one hand, the architectures of existing biomedical pretrained language models are almost all encoder-only transformers. Such architectures are incapable of generating natural language auto-regressively; a decoder is necessary for language generation (Liu and Lapata, 2019). On the other hand, there are very few in-domain generative language models for biomedicine (Phan et al., 2021). Models pretrained on biomedical corpora may further enhance the performance of current biomedical NLG methods. To bridge these gaps, we propose BioBART, a biomedical auto-regressive generative language model pretrained on biomedical corpora.
In our work, we adopt BART (Bidirectional and Auto-Regressive Transformers), a generative pretrained language model that achieves state-of-the-art results on various NLG tasks in the general domain (Lewis et al., 2020a). We continuously pretrain BART on PubMed abstracts to achieve biomedical domain adaptation, using only the text-infilling task. We also collate existing biomedical NLG tasks and evaluate BioBART on them. The in-domain BioBART outperforms BART and sets strong baselines for several NLG tasks. The main contributions of our work are summarized as follows:

1. To aid research on biomedical NLG, we collate existing biomedical NLG tasks along with their corresponding data and experimental settings. The archived biomedical tasks will be released.

2. We further analyze the influence of BART's sentence permutation pretraining task and find that it degrades performance on biomedical NLG tasks.

3. We evaluate our BioBART models on various NLG tasks and demonstrate superior performance over BART. We will release the code and weights to help reproduce our results.

2 Related Work

Most prominent language models, such as BERT and RoBERTa, are autoencoding transformers. The encoder-only architecture prevents direct application to sequence-to-sequence language generation. Several generative auto-regressive language models have been proposed to mitigate this problem. The GPT series of models (Radford and Narasimhan, 2018; Radford et al., 2019; Brown et al., 2020) adopts a decoder-only transformer architecture, i.e., a left-to-right language model, pretrained by auto-regressively predicting the next word of a sentence. UniLM1 (Dong et al., 2019) and UniLM2 (Bao et al., 2020) apply attention masks to the transformer encoder to achieve unidirectional language modeling, and pretrain their models with a mixture of masked language modeling and auto-regressive language generation. T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) apply the full transformer architecture: the encoder encodes the input sequence and the decoder generates language. Both T5 and BART are pretrained by denoising corrupted corpora. Such models achieve many state-of-the-art results on various NLG tasks and some NLU tasks.

Existing work has shown that pretraining language models on domain-specific corpora brings better transferability to the corresponding downstream tasks (Gururangan et al., 2020), and there have been many efforts to adapt language models to specific domains. BioBERT continues pretraining the BERT model on biomedical corpora from PubMed abstracts and PubMed Central (PMC) full-text articles. BlueBERT (Peng et al., 2020) and clinicalBERT (Alsentzer et al., 2019) add electronic medical record (EMR) corpora from MIMIC-III to the pretraining data. Instead of continuing training from the general BERT checkpoint, SciBERT (Beltagy et al., 2019) and PubMedBERT (Gu et al., 2022) are trained from scratch using scientific papers from Semantic Scholar (Ammar et al., 2018) and PubMed articles, respectively. Shin et al. (2020) release BioMegatron, a larger BERT-style language model pretrained on PubMed abstracts, PMC, and MIMIC-III. All of the aforementioned work uses the BERT architecture; other researchers have explored different language models. BioELMo (Jin et al., 2019) is pretrained on biomedical corpora based on the stacked bidirectional LSTM language model ELMo.
BioELECTRA (Kanakarajan et al., 2021) applies an adversarial-style training scheme consisting of a discriminator and a generator, using PubMed abstracts and PMC articles as in-domain pretraining corpora. BioMed-RoBERTa (Gururangan et al., 2020) is initialized from RoBERTa, with additional training on scientific papers from Semantic Scholar. Bio-lm (Lewis et al., 2020b) is pretrained on data from PubMed, PMC, and MIMIC-III based on the RoBERTa model. Ke-BioLM (Yuan et al., 2021) uses the Entities as Experts (Févry et al., 2020) model to inject biomedical entity knowledge into the language model, starting from the weights of PubMedBERT. Coder (Yuan et al., 2022b) and SapBERT take advantage of the synonym resources in the biomedical knowledge base UMLS (Bodenreider, 2004) and enhance the models with entity knowledge through contrastive pretraining.

Due to the nature of their architecture, encoder-only language models have limited performance on NLG tasks such as summarization and question answering. In recent research, SciFive (Phan et al., 2021) is proposed for biomedical NLP tasks. SciFive is pretrained on PubMed abstracts and PMC articles based on the T5 architecture. While T5 is available for NLG tasks, SciFive focuses on evaluating NLU tasks. Compared to SciFive, we choose BART as our model backbone and evaluate more extensively on NLG tasks to leverage the power of the decoder.

In the biomedical domain, most NLP tasks are natural language understanding (NLU) tasks, and there are well-archived benchmarks for evaluating biomedical NLU, such as BLURB (Gu et al., 2022) and CBLUE. NLG tasks are relatively less studied. The CovidDialog dataset collects dialogues between patients and doctors and forms a benchmark for COVID-19-related dialogue systems. MEDIQA (Ben Abacha et al., 2021) is an annual biomedical NLP competition containing NLG tasks such as medical question (or answer) summarization and figure captioning. Moreover, with the success of GPT-3, there is a trend toward unifying all NLP tasks as NLG tasks (McCann et al., 2018; Brown et al., 2020): traditional NLU tasks can be approached by constrained language generation, and much attention has recently been paid to NLG methods. In the biomedical domain, entities are of primary concern. GENRE (Cao et al., 2021), Yuan et al. (2022a), and BARTNER (Yan et al., 2021) reach new state-of-the-art results with auto-regressive language models on entity linking and named entity recognition tasks. Such methods can be adapted to the biomedical domain.

BART is a sequence-to-sequence model with a bi-directional encoder and a left-to-right auto-regressive decoder. The model architecture is consistent with the Transformer (Vaswani et al., 2017), except that the ReLU activation functions are replaced with GeLUs (Hendrycks and Gimpel, 2016). BART is pretrained by denoising corrupted input documents. The original work ablates five different types of corruption noise: text masking, text deletion, text infilling, sentence permutation, and document rotation. As a result, the pretraining documents are corrupted in two ways: 1) Text Infilling: for each document, a number of token spans are sampled, and each sampled span is replaced with a single mask token. 2) Sentence Permutation: a document is split into sentences, and the sentences are shuffled into a random order. The pretraining objective is to minimize the negative log-likelihood of the original documents. Prior work has shown that continuously pretrained models can achieve results competitive with models trained from scratch (Gu et al., 2022).
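To make the two corruption operations concrete, the following is a minimal sketch under stated assumptions, not the authors' or BART's actual preprocessing code. The function names are our own, the sampling procedure is simplified, and the 30% mask budget and Poisson(λ=3) span lengths follow the pretraining settings reported later in this paper; the model is then trained to reconstruct the original document from the corrupted one.

```python
# Toy sketch of BART-style corruption noise (not the authors' implementation).
import random
import numpy as np

MASK = "<mask>"

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0, seed=0):
    """Replace sampled token spans with a single <mask> token each."""
    rng = np.random.default_rng(seed)
    budget = int(round(mask_ratio * len(tokens)))  # total tokens to remove
    out, i = [], 0
    while i < len(tokens):
        # start a span with probability ~ mask_ratio / mean span length,
        # so roughly mask_ratio of the tokens fall inside masked spans
        if budget > 0 and rng.random() < mask_ratio / poisson_lambda:
            span = min(budget, max(1, int(rng.poisson(poisson_lambda))))
            out.append(MASK)  # the whole span collapses into one mask token
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out

def sentence_permutation(sentences, seed=0):
    """Shuffle sentence order; the noise BioBART drops after its ablation."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    return sentences

doc = "the drug binds the receptor . this activates downstream signaling .".split()
print(" ".join(text_infilling(doc)))
print(sentence_permutation(["Sentence one.", "Sentence two.", "Sentence three."]))
```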
In our work, we continuously pretrain BART on biomedical domain corpora and revisit the methods for corrupting input texts. BART keeps the sentence permutation noise because of its significant performance gain on summarization, although this noise may lead to slight degradation on other tasks. We run further ablation studies on various biomedical NLG tasks and show that the model pretrained without sentence permutation performs better; further details are given in Section 5.5. Therefore, we only use the text infilling task to corrupt input texts when pretraining BioBART.

In this section, we introduce the generative downstream tasks in the biomedical domain. We conduct experiments on these tasks to illustrate the performance of the domain-specific BioBART.

A medical dialogue system aims to imitate a human doctor communicating with patients in a natural way. With a BART-style model, the patient's initial description and the dialogue history are used as the input, and the model auto-regressively generates the reply. The task is trained and evaluated in a sequence-to-sequence fashion.

Summarization is a classical NLP task, and concisely summarizing knowledge-rich biomedical documents is important for healthcare. Technically, there are abstractive and extractive approaches to producing summaries. With the help of large pretrained language models, abstractive summarization methods outperform extractive methods in summary diversity and conciseness (Zhang et al., 2020a; Dou et al., 2021). Abstractive summarization is naturally an NLG task. We follow BART (Lewis et al., 2020a) and evaluate BioBART on the biomedical summarization tasks in the same fashion: the input documents are encoded by the model encoder, and the summaries are generated auto-regressively by the decoder.

Entity linking maps entity mentions in text to their standard entity concepts. Traditional entity linking methods use language models to encode entity concepts from knowledge bases (e.g., UMLS) and mentions into the same dense space and disambiguate mentions by vector similarity. The large memory footprint and difficult model training hinder the development of such methods. Cao et al. (2021) propose GENRE, which uses generative language models to disambiguate entity mentions by auto-regressively generating the standard concept names conditioned on the inputs. Yuan et al. (2022a) achieve state-of-the-art performance on various biomedical entity linking datasets with generative methods. We include this leading-edge method to show the superior performance of BioBART.

Named entity recognition (NER) is a critical task in the biomedical NLP community that extracts biomedical entities from text. Nested and discontinuous entities widely exist in biomedical papers and EMRs due to multi-granularity semantics and complex syntactic structures (Yuan et al., 2020). The widely used sequence labelling framework for NER (Lample et al., 2016) does not directly fit nested and discontinuous NER (Finkel and Manning, 2009). Yan et al. (2021) propose BARTNER, which models nested and discontinuous NER as a seq2seq task by taking sentences as input and outputting entities with their entity types one by one. The generative approach of BARTNER achieves state-of-the-art performance on nested and discontinuous NER datasets, and we use it to evaluate whether our proposed BioBART can further enhance performance.
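To illustrate the seq2seq view of NER, the sketch below linearizes a sentence's (possibly overlapping or discontinuous) entities into a flat target string. This is a simplified, word-level illustration with made-up entity type names and a made-up example; BARTNER itself generates pointer indices over the source BPE tokens rather than surface words.

```python
# Simplified linearization of NER targets for a seq2seq model (illustrative only).
from typing import List, Tuple

def linearize_entities(words: List[str],
                       entities: List[Tuple[List[int], str]]) -> str:
    """Turn a sentence plus its (possibly nested/discontinuous) entities into a
    flat target string. Each entity is a (word_indices, type) pair."""
    parts = []
    for indices, ent_type in entities:
        surface = " ".join(words[i] for i in indices)  # works across gaps too
        parts.append(f"{surface} <{ent_type}>")
    return " ; ".join(parts)

words = "abdominal pain and cramps were reported".split()
# overlapping / discontinuous entities: "abdominal pain" and "abdominal cramps"
entities = [([0, 1], "AdverseEvent"), ([0, 3], "AdverseEvent")]
print(linearize_entities(words, entities))
# -> abdominal pain <AdverseEvent> ; abdominal cramps <AdverseEvent>
```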
Pretraining Corpora There are two main sources of biomedical corpora: PubMed abstracts and PMC articles. In prior work (Gu et al., 2022), training on both corpora surprisingly leads to a slight degradation in performance compared to training solely on PubMed abstracts. Therefore, we only use PubMed abstracts as the pretraining corpora. The corpora contain about 41 GB of biomedical research paper abstracts from PubMed.

Pretraining Setup We continuously pretrain both the large and base versions of BART for 120k steps with a batch size of 2560, using the same vocabulary as BART to tokenize the texts. Although the input length limit of BART is 1024, the tokenized PubMed abstracts rarely exceed 512 tokens. Therefore, for training efficiency, we truncate all input texts to a maximum length of 512. We mask 30% of the input tokens, and the masked span length is determined by sampling from a Poisson distribution (λ = 3), as in BART. We use a learning rate schedule with a 0.02 warm-up ratio and linear decay, with the learning rate set to 1e-4. We train the base version of BioBART on 2 DGX nodes with 16 40GB A100 GPUs for about 100 hours, and the large version on the same devices for 168 hours, with the help of the open-source framework DeepSpeed (Rajbhandari et al., 2020).
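A minimal sketch of the optimization schedule described above (120k steps, learning rate 1e-4, 2% linear warm-up, then linear decay) is shown below. It is not the authors' training script: the optimizer is assumed to be AdamW (the excerpt does not specify it), the checkpoint name is a stand-in general-domain BART, and the batch construction, text-infilling collator, and DeepSpeed integration are omitted.

```python
# Sketch of the reported learning-rate schedule (assumptions noted above).
import torch
from transformers import BartForConditionalGeneration, get_linear_schedule_with_warmup

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # stand-in checkpoint

total_steps = 120_000                       # reported number of pretraining steps
warmup_steps = int(0.02 * total_steps)      # 0.02 warm-up ratio -> 2,400 steps

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# Inside the training loop (data handling omitted):
#   loss = model(input_ids=corrupted_ids, labels=original_ids).loss
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```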
CovidDialog Concerning the widespread Coronavirus disease 2019 (COVID-19) pandemic, the CovidDialog dataset is proposed to facilitate the development of dialogue systems providing COVID-related consultations to people. The dataset is collected from online healthcare forums. It contains 603 consultations about COVID-19 and other related pneumonia, with 1,232 utterances in total. Each consultation starts with a description of the patient's medical condition, followed by the conversation between a doctor and a patient.

iCliniq, HealthCareMagic Both datasets are extracted from the MedDialog (Zeng et al., 2020) dataset, collected from online healthcare platforms. iCliniq contains 31,062 samples and HealthCareMagic contains 226,405 samples. Each sample comprises a summary and the corresponding dialogue between a patient and a doctor. HealthCareMagic's summaries are more abstractive and are written in a formal style, unlike iCliniq's patient-written summaries. We follow previous work (Mrini et al., 2021) for the training, development, and test splits of both datasets.

MeQSum (Ben Abacha and Demner-Fushman, 2019) The dataset is created for medical question summarization, because patients' original questions are verbose and cause difficulty for question-answering systems. The dataset contains 1,000 patient health questions selected from a collection distributed by the U.S. National Library of Medicine (Kilicoglu et al., 2018). Each question is annotated with a summary by medical experts.

MEDIQA-ANS (Savery et al., 2020) When feeling discomfort, people may turn to the internet for answers to their medical questions, and the raw search results may be obscure even for medical experts. The dataset is proposed to emphasize the need for medical answer summarization systems that help people better understand biomedical materials. It consists of 156 health questions, corresponding answers to these questions, and expert-created summaries (both abstractive and extractive) of these answers. Following the paper, we use BioASQ (Tsatsaronis et al., 2015).

MedMentions (Mohan and Li, 2019) MedMentions is a large-scale biomedical entity recognition dataset. The commonly used St21pv subset contains 4,392 PubMed abstracts, and over 350,000 mentions are linked to concepts of 21 selected semantic types in UMLS (Bodenreider, 2004).

MEDIC is a medical dictionary that merges the disease concepts, synonyms, and definitions in MeSH and OMIM, and is composed of 9,700 unique diseases. We follow BioSyn (Sung et al., 2020) to process the data and construct dataset splits.

COMETA (Basaldella et al., 2020) COMETA is derived from publicly available and anonymous online health discussions on Reddit. It consists of 20k English biomedical entity mentions expert-annotated with concepts from SNOMED CT. We use the "stratified (general)" split and follow the training and evaluation procedures of SapBERT and ResCNN (Lai et al., 2021).

AskAPatient (Limsopatham and Collier, 2016) It contains 8,662 phrases from social media. Each phrase can be mapped to one of 1,036 medical concepts from SNOMED-CT and AMT (the Australian Medicines Terminology). The samples in AskAPatient do not include contextual information. We follow Sung et al. (2020) and Limsopatham and Collier (2016) for data pre-processing and apply the 10-fold evaluation protocol.

ShARe13, ShARe14, CADEC These three datasets annotate discontinuous adverse drug event entities. The main difference is that the annotated data of the ShARe tasks (Pradhan et al., 2013; Mowery et al., 2014) come from MIMIC-II, while CADEC (Karimi et al., 2015) comes from social media. There is only one entity type in these datasets. We follow Yan et al. (2021) for dataset preprocessing.

GENIA (Kim et al., 2003) GENIA annotates 2,000 MEDLINE abstracts with biological entities, which can be nested within one another. We follow Lin et al. (2019) to combine the fine-grained entity types into 5 coarse-grained entity types and to construct dataset splits.

All the aforementioned datasets are in English. A statistical overview of these datasets is listed in Table 1.

Dialogue We use BioBART as the dialogue system model. The dialogue history is fed into the encoder, and the decoder generates the response auto-regressively. We apply the negative log-likelihood of the reference dialogue response as the training objective. We fine-tune the model for 20 epochs with a learning rate of 5e-5 for the base version and 1e-5 for the large version. We run evaluations on the validation set at the end of each epoch and use the checkpoint with the best validation performance for testing. During inference, we use beam search with beam size 5 to generate responses. We use Rouge-1/2/L (Lin, 2004), BLEU (Papineni et al., 2002), and BERTScore (Zhang et al., 2020b) as our evaluation metrics; RoBERTa-large is used as the scorer in BERTScore.

Summarization Similarly, for summarization, the encoder takes the documents as input and the decoder generates the corresponding summaries. We minimize the negative log-likelihood objective to fine-tune the model and apply beam search for inference. Across the different summarization datasets, the beam size is set to 5 and we use no length penalty. We fine-tune the model for 6 epochs with a learning rate of 5e-5 for the base version and 1e-5 for the large version. We run evaluations on the validation set at the end of each epoch and use the checkpoint with the best validation performance for testing. We apply the commonly used Rouge-1/2/L and BERTScore as evaluation metrics; the large version of RoBERTa is used as the scorer in BERTScore.
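For concreteness, the following is a minimal inference sketch of the decoding settings just described (beam search with beam size 5, no length penalty). It is not the authors' released script: the checkpoint name is a stand-in general-domain BART from the Hugging Face hub (to be swapped for a BioBART checkpoint where available), the generation length cap is arbitrary, and the input text is made up.

```python
# Illustrative summarization inference with the reported decoding settings.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/bart-base"  # stand-in for a BioBART checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

document = "Patient reports persistent dry cough and mild fever for five days."
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    **inputs,
    num_beams=5,         # beam size used across the summarization datasets
    length_penalty=1.0,  # i.e., no length penalty
    max_length=128,      # arbitrary cap for the sketch
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```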
Entity Linking We follow the method and experimental settings in Yuan et al. (2022a) to implement the generative model for biomedical entity linking tasks, but we do not apply the knowledge-base-guided pretraining proposed in Yuan et al. (2022a). Documents with the positions of mentions marked are fed into the encoder, and the decoder directly outputs the corresponding synonyms in the knowledge base. We use top-1 and top-5 recall (Recall@1 and Recall@5) as the evaluation metrics.

NER We use BARTNER (Yan et al., 2021) as our model. The target type for BARTNER is "word" (i.e., the model outputs the first BPE of each word in the entities). We use the hyperparameters selected by Yan et al. (2021) for all pretrained models and fine-tune for 30 epochs. Entity-level F1 is used as the metric.

In this section, we present the base and large versions of BioBART on various generation tasks. We compare our in-domain BioBART with BART to illustrate the effectiveness of domain adaptation, and we also compare with the existing state-of-the-art results on each dataset to shed light on the superior performance of BioBART. The experimental results are shown in Tables 2-5. The best and second-best scores are highlighted in bold and underlined, respectively.

Dialogue We evaluate biomedical dialogue response generation on CovidDialog. For both the base and large versions, BioBART shows improvement on the automatic Rouge metrics. The large BioBART outperforms BART by 1.71 on Rouge-2 and 0.03 on Rouge-L. Our results surpass the current state-of-the-art BLEU score by 4.45.

Summarization We present broad experimental results on the biomedical summarization datasets. From Table 3, BioBART has competitive or even superior performance on the task. Except for iCliniq and HealthCareMagic, we see consistent improvement on the different datasets for both sizes of BioBART. For MeQSum, BioBART-large exceeds BART-large by 1.93/1.31/2.1 on Rouge-1/2/L and even outperforms the current state-of-the-art. A possible reason that biomedical in-domain pretraining fails on iCliniq and HealthCareMagic is that both datasets are built upon a clinical corpus, so there still exists a domain shift for BioBART, which is pretrained on biomedical scientific articles from PubMed. On the dialogue and summarization tasks, there are only minor changes in BERTScore across models. This is possibly because the metric is calculated with another pretrained language model: the RoBERTa scorer may itself suffer from biomedical domain shift and cannot quantify model performance accurately.

Entity Linking The results on the biomedical entity linking tasks are shown in Table 4. For NER, the gap with the current state-of-the-art NER method is still salient.

In this section, we test pretraining with and without the sentence permutation task. We pretrain BART-base following the same pretraining settings, except that we reduce the number of training steps to 40k for efficiency. We then fine-tune the pretrained models on the downstream tasks. The ablation results are shown in Table 6, which illustrates that the model pretrained with the text infilling task alone performs best. The sentence permutation task degrades the model's performance even on the generative summarization and dialogue tasks.
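As a brief aside on the entity linking evaluation used above, Recall@k simply checks whether the gold concept appears among the top-k generated candidates (e.g., the k highest-scoring beam hypotheses). The sketch below uses made-up mention and concept strings and omits details such as synonym matching and surface-form normalization.

```python
# Toy Recall@k computation for generative entity linking (illustrative only).
from typing import List

def recall_at_k(gold: List[str], candidates: List[List[str]], k: int) -> float:
    """Fraction of mentions whose gold concept name appears in the top-k candidates."""
    hits = sum(1 for g, cands in zip(gold, candidates) if g in cands[:k])
    return hits / len(gold)

gold = ["myocardial infarction", "type 2 diabetes mellitus"]
candidates = [  # beam hypotheses ordered by score, one list per mention
    ["myocardial infarction", "myocardial ischemia", "cardiac arrest"],
    ["diabetes mellitus", "type 2 diabetes mellitus", "type 1 diabetes mellitus"],
]
print(recall_at_k(gold, candidates, 1))  # 0.5
print(recall_at_k(gold, candidates, 5))  # 1.0
```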
Here we demonstrate BioBART's performance qualitatively. In Table 7, we present three generated examples from CovidDialog, MeQSum, and MEDIQA-ANS, respectively. In the first example, BART generates an erroneous instruction about the influence of diabetes, while BioBART, injected with domain knowledge, gives the correct response. In the second, BART misunderstands the document, in which sugar alcohol is not the cause of dry mouth, whereas BioBART generates an accurate and concise summary. In the final example, the MEDIQA-ANS document is rather long and BART fails to extract complete information (colored in red in the table). From these examples, we conclude that BioBART improves biomedical common sense and document understanding.

In this work, we pretrain BioBART, a biomedical generative language model. We also collect various publicly available benchmarks for biomedical generative tasks to prompt future research. Our experimental results show that continued pretraining on PubMed abstracts helps the model with domain adaptation. BioBART shows clear improvements on different benchmarks and achieves competitive or superior results over the current state-of-the-art methods. We also release our pretraining and fine-tuning code to facilitate reproducibility and future research. In future studies, we will explore pretraining generative language models 1) with in-domain vocabularies and from scratch, and 2) on clinical corpora such as the EMRs in MIMIC-III (Johnson et al., 2016) or PMC-Patients (Zhao et al., 2022).

References

Bridging the gap between consumers' medication questions and trusted answers.
Publicly available clinical BERT embeddings.
Construction of the literature graph in Semantic Scholar.
Clustering-based inference for biomedical entity linking.
UniLMv2: Pseudo-masked language models for unified language model pre-training.
COMETA: A corpus for medical entity linking in the social media.
SciBERT: A pretrained language model for scientific text.
On the summarization of consumer health questions.
Overview of the MEDIQA 2021 shared task on summarization in the medical domain.
The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32 (Database issue).
NCBI disease corpus: A resource for disease name recognition and concept normalization.
Unified language model pre-training for natural language understanding and generation.
GSum: A general framework for guided neural abstractive summarization.
Entities as experts: Sparse memory access with entity supervision.
Nested named entity recognition.
Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2022. Domain-specific language model pretraining for biomedical natural language processing.
Don't stop pretraining: Adapt language models to domains and tasks.
Gaussian error linear units (GELUs).
Probing biomedical embeddings from language models.
Biomedical question answering: A survey of approaches and challenges.
MIMIC-III, a freely accessible critical care database.
CovidDialog: Medical dialogue datasets about COVID-19.
BioELECTRA: Pretrained biomedical text encoder using discriminators.
CADEC: A corpus of adverse drug event annotations.
Semantic annotation of consumer health questions.
GENIA corpus: A semantically annotated corpus for bio-textmining.
BERT might be overkill: A tiny but effective biomedical entity linker based on residual convolutional neural networks.
Neural architectures for named entity recognition.
Domain adaptation with pre-trained transformers for query focused abstractive text summarization.
BioBERT: A pre-trained biomedical language representation model for biomedical text mining.
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art.
BioCreative V CDR task corpus: A resource for chemical disease relation extraction.
Unified named entity recognition as word-word relation classification.
Normalising medical concepts in social media texts by learning semantic representation.
ROUGE: A package for automatic evaluation of summaries.
Sequence-to-nuggets: Nested entity mention detection via anchor-region networks.
Self-alignment pretraining for biomedical entity representations.
Text summarization with pretrained encoders.
RoBERTa: A robustly optimized BERT pretraining approach.
The natural language decathlon: Multitask learning as question answering.
MedMentions: A large biomedical corpus annotated with UMLS concepts.
Task 2: ShARe/CLEF eHealth evaluation lab.
Emilia Farcas, and Ndapa Nakashole. 2021. A gradually soft multi-task and data-augmented approach to medical question understanding.
BLEU: A method for automatic evaluation of machine translation.
An empirical study of multi-task learning on BERT for biomedical text mining.
Deep contextualized word representations.
Erol Bahadroglu, Alec Peltekian, and Grégoire Altan-Bonnet. 2021. SciFive: A text-to-text transformer model for biomedical literature.
Task 1: ShARe/CLEF eHealth evaluation lab.
Improving language understanding by generative pre-training.
Language models are unsupervised multitask learners.
Exploring the limits of transfer learning with a unified text-to-text transformer.
ZeRO: Memory optimizations toward training trillion parameter models.
Question-driven summarization of answers to consumer health questions.
BioMegatron: Larger biomedical domain language model.
Xipeng Qiu, and Xuanjing Huang. 2021. Paradigm shift in natural language processing.
Biomedical entity representations with synonym marginalization.
Éric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition.
Cross-domain data integration for named entity disambiguation in biomedical text.
Attention is all you need.
A unified generative framework for various NER subtasks.
Generative biomedical entity disambiguation via knowledge base-guided pre-training and synonyms-aware fine-tuning.
Improving biomedical pretrained language models with knowledge.
Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition.
CODER: Knowledge-infused cross-lingual medical term embedding for term normalization.
MedDialog: Large-scale medical dialogue datasets.
PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization.
CBLUE: A Chinese biomedical language understanding evaluation benchmark.
BERTScore: Evaluating text generation with BERT.
PMC-Patients: A large-scale dataset of patient notes and relations extracted from case reports in PubMed Central.
On the generation of medical dialogs for COVID-19.

Acknowledgments

We thank the three anonymous reviewers for their helpful comments. This work was supported by the National Natural Science Foundation of China (Grant No. 12171270) and the Natural Science Foundation of Beijing Municipality (Grant No. Z190024).