title: Pre-trained Language Models in Biomedical Domain: A Systematic Survey
authors: Wang, Benyou; Xie, Qianqian; Pei, Jiahuan; Tiwari, Prayag; Li, Zhao; Fu, Jie
date: 2021-10-11
abstract: Pre-trained language models (PLMs) have become the de facto paradigm for most natural language processing (NLP) tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science (CS) communities have proposed various PLMs trained on biomedical datasets, e.g., biomedical text, electronic health records, protein sequences, and DNA sequences, for various biomedical tasks. However, the cross-disciplinary character of biomedical PLMs hinders their dissemination across communities; some existing works are isolated from each other without comprehensive comparison and discussion. This calls for a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. In this paper, we summarize the recent progress of pre-trained language models in the biomedical domain and their applications in biomedical downstream tasks. In particular, we discuss the motivations behind biomedical PLMs and propose a taxonomy of existing models. Their applications in biomedical downstream tasks are discussed exhaustively. Finally, we illustrate various limitations and future trends, which we hope can provide inspiration for future research in the community.

Fig. 1. Existing data that involve biomedical information. The levels reflect the extent to which humans have abstracted biomedical knowledge from the data.

As the principal method of communication, humans usually record information and knowledge in the form of token sequences, resulting in languages such as natural language, constructed language, and programming language. For biomedical information and knowledge, the tokens in these sequences can be of various types, including words, disease codes, amino acids, and DNA. Tremendous amounts of biomedical information and knowledge from nature and human history are implicitly encapsulated in such naturally occurring token sequences (a.k.a. data). Based on the degree of abstraction of biomedical knowledge, we arrange some of these data as a pyramid in Fig. 1. Data at the top level are high-level information that explicitly conveys biomedical knowledge and are usually small in scale; see biomedical knowledge bases and EHR data (possibly multi-modal). Data at the lower levels, such as protein and DNA sequences, may not directly convey biomedical knowledge, e.g., one can hardly tell what a short sequence means for humans, and more effort is needed for abstraction. Fortunately, data at the lower levels are usually available in tremendous quantities. At the current stage, existing work has paid more attention to data at the top and middle levels, which are relatively small in scale but easily understood by humans. We argue that biomedical knowledge at all scales deserves attention. To capture and mine biomedical information and knowledge from data at various scales, there has recently been growing interest in the biomedical NLP community in adopting pre-trained language models (PLMs), since PLMs can leverage these plain token sequences without human annotations (sometimes called 'self-supervision').
Biomedical NLP is a cross-disciplinary research direction spanning communities such as bioinformatics, medicine, and computer science (especially a major frontier of artificial intelligence, namely natural language processing, a.k.a. NLP). The computational biology community [110] and the biomedical informatics community [38] have made substantial efforts to use NLP tools for information mining and extraction from widely adopted electronic health records, medical scientific publications, medical WIKI pages, etc. For decades, the NLP community has investigated various biomedical tasks [37, 39] such as classification, information extraction, question answering, and drug discovery. Meanwhile, approaches in the NLP community are changing rapidly, as one can witness from the exponentially increasing research output in recent years.

• We discuss the limitations and future trends. This will be beneficial for beginners from both the computer science and bioinformatics fields.

Organization. The paper is organized as follows: Sec. 2 introduces general pre-trained language models, and Sec. 3 introduces the basic methodology for applying pre-trained language models in the biomedical domain and proposes a taxonomy. Sec. 4 discusses the training process of language models in the pre-training phase, and their applications in the fine-tuning phase are illustrated in Sec. 5. Further discussion of limitations and future directions is in Sec. 6. We conclude in Sec. 7.

Pre-trained language models (PLMs) have been widely used in computer vision, natural language processing, etc., to effectively capture the linguistic information and knowledge inherent in natural languages. In this paper, we mainly discuss pre-trained language models over NLP tokens. One can read the review paper on PLMs [190] for more details. Previously, there were many typical methods to build token representations (e.g., word vectors) from plain corpora. For example, [156, 180] build a one-to-one mapping between words and their vectors, which is called 'static word embedding' since it is static and not related to word context. However, it is well known that words often express different meanings in different contexts. Inspired by [183], many pre-trained language models adopt 'contextualized word embedding' to model words in a specific context: the vector for a word depends on its specific usage in context. For example, the meanings of 'bank' in 'river bank' and in 'money bank' are supposed to differ. Contextualized word embeddings largely improve the quality of word representations in various tasks [45]. In this section, we introduce three basic ingredients of biomedical pre-trained models: the training objective with self-supervised tasks in Sec. 2.1, basic neural network models in Sec. 2.2, and the training paradigm in Sec. 2.3. Language models can be considered an instance of self-supervision. Compared to data-hungry supervised learning, which usually needs annotations from humans, language models can make use of massive amounts of cheap plain corpora from the internet, books, etc. In language models, the next word is a natural label for a context sentence (the next-word prediction task), or one can artificially mask a known word and then predict it.
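To make these two self-supervised objectives concrete, the following minimal, library-free Python sketch turns a plain token sequence into (input, label) training pairs; the 15% masking rate mirrors the rate commonly used by BERT-style models, and the helper names are ours for illustration rather than part of any model described in this survey.

```python
import random

MASK = "[MASK]"

def next_word_examples(tokens):
    """Causal-LM style self-supervision: each prefix is an input, the following token is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_word_examples(tokens, mask_prob=0.15, seed=0):
    """Masked-LM style self-supervision: hide ~15% of tokens and keep the originals as labels."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)      # the model must recover this token
        else:
            corrupted.append(tok)
            labels.append(None)     # no prediction needed at this position
    return corrupted, labels

sentence = "the protein binds to the receptor".split()
print(next_word_examples(sentence)[:2])   # e.g. [(['the'], 'protein'), (['the', 'protein'], 'binds')]
print(masked_word_examples(sentence))
```

In both cases the labels come from the text itself, which is exactly why no human annotation is needed.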
The paradigm of using the unstructured data itself to generate labels (for example, the next word or the masked word in language models) and training supervised models (language models) to predict those labels is called 'self-supervised learning'. Language modeling is therefore also referred to as an 'auxiliary task' or 'pre-training task', in which the representations learned by the language model can be used as an initial model for various downstream supervised tasks. Neural language models [17, 156] had been proposed decades before the term self-supervision was coined, and their variants [156] later became the backbones of modern NLP, providing pre-trained word features. An overview of neural language models is shown in Table 1. Skip-Gram, CBOW, and GloVe [156, 180] typically use a linear architecture to perform calculations between word vectors, resulting in efficient training. Recently, BERT [45] and GPT [192] proposed to use multiple layers of Transformers as the basic architecture, which leads to many more parameters. Thus, subwords (a limited set of common sub-word units, also called 'wordpieces' [266]) are used to reduce the size of the word vocabulary and therefore save parameters. Notably, this is also beneficial for handling rare words.

Table 1. Typical approaches to word vectors and language models. S = {w1, w2, w3, w4, w5} is an example text sequence. ELMO, BERT, and GPT usually work on much longer sequences than neural language models (NLMs), Skip-gram, and CBOW. For example, GPT [192] produces contextualized representations with Transformers and is trained by predicting the next word, i.e., by minimizing −∑_{t=1}^{n} log P(w_t | w_1, …, w_{t−1}) for an input sequence (w_1, …, w_n).

The training corpora for pre-trained language models (like BERT and GPT) include: 1) online text such as Wikipedia, 2) digitized books such as BooksCorpus [303], and 3) crawled online corpora. Pre-trained language models trained on these corpora are usually able to capture common-sense knowledge of the general domain. Therefore, some effort is needed to tailor them to a specific domain like the biomedical domain so that they capture domain knowledge. In this paper, we mainly discuss pre-trained language models such as BERT and GPT. The main difference between them is that BERT is a textual encoder that encodes a given document, while GPT is a textual decoder that generates a new document. This can also be seen as the difference between discriminative and generative models in machine learning. BERT is mainly used for discriminative prediction/inference over a given text, e.g., information extraction, text classification, named entity recognition, relation extraction, and non-generative question answering, as shown in Sec. 5. The latter is used to generate text, for example in text summarization, text completion, generative question answering, and translation. The success of pre-trained language models is also attributed to the development of their backbone networks, from LSTM [84] to Transformer. Before the Transformer [242] was invented, LSTM was widely used and became the first backbone architecture of pre-trained language models (e.g., ELMO). However, because of its recurrent structure, it is computationally expensive to scale LSTM up to deeper architectures. To this end, the Transformer was proposed and has become the backbone of modern NLP.
The advantages of the Transformer architecture can be attributed to: 1) efficiency: a recurrence-free architecture that can process individual tokens in parallel; 2) effectiveness: attention allows interactions across tokens that dynamically depend on the input itself. In this section, we briefly introduce the two typical architectures used in pre-trained language models, namely LSTM and Transformer.

LSTM. Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture for sequential modeling. Unlike standard feed-forward neural networks that process single data points (such as images), LSTM can deal with entire sequences of data (such as text, speech, or video). A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell maintains hidden states over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. LSTM networks are well suited for time series data and were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. [183] adopted the long short-term memory network (LSTM) for pre-trained language models; LSTMs naturally process tokens sequentially.

Transformer. The backbone of most pre-trained language models (e.g., BERT and GPT) is a neural network called the 'Transformer', built upon self-attention networks (SANs) and feed-forward networks (FFNs). The SAN facilitates interaction between tokens, while the FFN refines the token representations using non-linear transformations. Since the Transformer has become the de facto backbone, replacing recurrent and convolutional units, BERT and GPT adopt the Transformer as their backbone network. The Transformer is superior in terms of capacity and scalability thanks to 1) discarding recurrent units and processing tokens more efficiently in parallel with position embeddings [249, 250], and 2) relieving the saturation of expressive power on large-scale data and very deep layers, owing to a well-designed architecture including residual connections, layer normalization, etc. Interestingly, AlphaFold2 [97] also borrows some of these insights to design the so-called 'Evoformer' as the core component of its architecture. In Table 2, we introduce some typical pre-trained language models in the general NLP domain, based on these two backbone neural networks.

Table 2. Representative pre-trained language models in the general domain. NSP means the next sentence prediction task.
Model | Objective | Backbone network | Comments
ELMO [183] | bidirectional LM | Bi-LSTM | the first contextualized word representation
BERT [45] | masked LM, NSP | Transformer (Encoder) | the most commonly used pre-trained language model
RoBERTa [144] | masked LM | Transformer (Encoder) | a longer-trained BERT variant using more data
ALBERT [118] | masked LM, NSP | Transformer (Encoder) | a BERT variant with shared weights and a factorized word embedding
XLNet [282] | generalized autoregressive pretraining | Transformer (Encoder) | a generalized autoregressive pretraining using bidirectional contexts
Electra [35] | replaced token prediction | Transformer (Encoder) | a pre-trained LM trained by replaced token prediction
GPT [192] | autoregressive language model | Transformer (Decoder) | a pre-trained LM for autoregressive generation
T5 [193] | Seq2Seq | Transformer (En-Decoder) | a pre-trained LM for seq2seq generation

Figure: Representative pre-trained language models.
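To ground the description of self-attention above, here is a minimal single-head sketch in NumPy; it deliberately omits multi-head attention, residual connections, layer normalization, and the FFN, and all names and shapes are illustrative rather than taken from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token interacts with every other token in parallel."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # pairwise, input-dependent interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over the sequence dimension
    return weights @ V                                        # contextualized token representations

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                                       # e.g. five tokens of a biomedical sentence
X = rng.normal(size=(seq_len, d_model))                       # token embeddings (plus position embeddings)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)      # (5, 8)
```

Because the attention weights are recomputed for every input, the resulting token representations are context-dependent, which is exactly the property that distinguishes contextualized embeddings from static word vectors.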
Based on whether the input and output are textual sequences or labels, pre-trained language models are mainly divided into three categories: Encoder-only, Decoder-only, and En-Decoder. Encoder-only pre-trained models are mainly for text classification and sequence labeling tasks, while pre-trained models equipped with a decoder can handle generation-related tasks like translation, summarization, and language modeling. See Fig. 4 for the difference: an Encoder model predicts labels for each input token (the brownish yellow color), a Decoder model generates a sequence of tokens according to a probability distribution (the blue color), and an En-Decoder model predicts a new sequence conditioned on a given sequence (a.k.a. Seq2Seq) (the grey color). One challenge in using PLMs for downstream tasks is that there are two gaps between PLMs and downstream tasks: the task gap and the domain gap. The task gap means that the meta-task used in PLMs (usually the masked language model in BERT or the causal language model in GPT) usually cannot be directly tailored to most downstream tasks (e.g., sentiment classification). The domain gap refers to the difference between the corpora a PLM was trained on and the domain required by a specific downstream task. Bridging both the task gap and the domain gap is crucial.

Adaptation. To use a pre-trained language model in a downstream task, it is suggested to adapt both the domain and the task [67, 71, 204, 293]; see Table 3 for the difference. Domain adaptation means continuing to train models pre-trained on a general domain in the target domain, e.g., the biomedical domain. Task adaptation refers to fine-tuning on similar downstream tasks. In this paper, unless otherwise specified, we mainly discuss domain-adapted pre-trained models in various downstream tasks; task adaptation is not the main concern of this review. Take BERT as an example: BERT is first trained with next sentence prediction (NSP) and masked language modeling in the pre-training phase. The pre-trained BERT is then used as the initial feature extractor, and BERT with an additional classifier layer is fine-tuned to optimize the objective of downstream tasks (like MNLI [263], NER [235], and SQuAD [194]). Recently, pre-trained language models have been widely applied to various NLP tasks and have achieved significant performance improvements. Pre-trained language models are widely used because: 1) pre-training on a huge text corpus can learn universal language representations and help with downstream tasks; 2) pre-training provides a better model initialization, which usually leads to better generalization and speeds up convergence on the target task; 3) pre-training can be regarded as a kind of regularization to avoid overfitting on small data [190]. Pre-trained language models are first trained on general plain corpora from the Internet, like Wikipedia or crawled webpages. Beyond the general domain, [57] trains CodeBERT on programming languages and [16] trains SciBERT on scientific publications; language models have also been trained on biological sequences. This paper aims to discuss pre-trained language models in the biomedical domain, where the motivations for using them are manifold.

• First, the biomedical domain involves many token sequences (such as biomedical corpora and electronic health record histories) that usually lack annotations. However, these sequential data were previously thought difficult to model.
Thanks to pre-trained language models, it has been empirically demonstrated that such sequential data can be effectively exploited in a self-supervised manner. This opens a new door for biomedical pre-trained language models.

• Second, annotated data in the biomedical domain is usually limited in scale. Some extreme cases in machine learning are called 'zero-shot' or 'few-shot'. More recently, GPT-3 has shown that language models have the potential for few-shot and even zero-shot learning [27]. Therefore, a well-trained pre-trained language model is all the more crucial as a richer feature extractor, which may slightly reduce the dependence on annotated data.

• Plus, the biomedical domain is more knowledge-intensive than the general domain, since some tasks may need domain expert knowledge, while pre-trained language models can serve as an easy-to-use soft knowledge base [184] that captures implicit knowledge from large-scale plain documents without human annotations. More recently, GPT-3 has been shown to have the potential to 'remember' much complicated common knowledge [27].

• Lastly, other than text, various types of biological sequence data exist in the biomedical domain, like protein sequences and DNA sequences. Using such data to train language models has shown great success in traditional biological tasks like protein structure prediction. Therefore, pre-trained language models are expected to solve more challenging problems in biology.

The pre-trained language model [45] introduces a two-stage paradigm for natural language tasks. In the first phase, it trains the language model with a self-supervised meta-task on task-agnostic corpora (e.g., masked language modeling or causal language modeling), and then in the second phase, it fine-tunes the pre-trained language model on (usually small-scale) specific downstream tasks. The trivial way to use a pre-trained language model in the biomedical domain is to fine-tune it with the domain data. However, additional adaptation is usually adopted to transfer the learned domain knowledge and task characteristics to the target domain and task. In this paper, we group the usage of pre-trained language models into the categories in Fig. 5. The adaptation is basically two-fold: transferring domain characteristics or task characteristics. The former concerns how to transfer a general pre-trained language model to the specific biomedical domain. The main challenge in the biomedical domain is that medical jargon and abbreviations include many terms composed of Latin and/or Greek parts, which leads to a gap between the general and medical domains. Moreover, clinical notes have different syntax and grammar than books or encyclopedias. This calls for a different vocabulary, which means that existing pre-trained models with other vocabularies (like general BERT or GPT) probably cannot be reused directly, and training from scratch becomes necessary.

Continue Training or from scratch. One may reuse a pre-trained language model from the general domain (e.g., general Wikipedia pages or Google Books) and then continue pre-training for a few epochs in the new (target) domain (i.e., the biomedical domain). When the corpora in the target domain are sufficiently large, one can also directly train the model from scratch, since there is no need to reuse the general knowledge. Based on well-trained models, one then has to adapt them to downstream tasks.
This is typically implemented by replacing the masked language model prediction head and the next sentence prediction head with a downstream prediction head, e.g., a classification head or a sequence labeling head. Since downstream tasks usually have much less training data than is used in pre-training, fine-tuning is an unstable process. [228] investigate different fine-tuning methods of BERT on text classification tasks. [163] argue that the fine-tuning instability is due to vanishing gradients. [152] observe that fine-tuning mainly modifies the top layers of BERT. Unfortunately, the solutions proposed in those papers (e.g., hyper-parameters such as which layers to fine-tune) cannot be easily translated to other settings. To automate this process, automatic hyper-parameter tuning (e.g., Bayesian optimization [26, 238]) can help. To better standardize the field of biomedical pre-trained language models, we propose a systematic taxonomy from multiple perspectives (data, task, and approach) to explain the usage of pre-trained language models in the biomedical domain.

Data. The data resources for PLMs range over electronic health records, scientific literature, social media, online knowledge bases, and biomedical sequences. Details of the data resources are in Sec. 4.1. In these resources, token types can be text, proteins, DNA, etc. For textual data, most of the data are in English.

Models. The model architecture can be an encoder, a decoder, or a combination of both. The base model is typically a Transformer, while some earlier works adopt an LSTM. Finally, one can either reuse an existing model (e.g., BERT or BioBERT) and continue training, or train a new model from scratch; the choice between the two strategies depends on the specific scenario.

Applications. The downstream tasks of biomedical PLMs include typical NLP applications such as information extraction, text classification, sentence similarity, question answering, dialogue systems, summarization, and natural language inference. See Sec. 5 for more details. We also discuss their future trends and limitations in Sec. 6.2.

Pre-trained language models are effective partially because they are not hungry for labeled data (sometimes called annotated data), which is essential for supervised learning. Self-supervised learning, which pre-trained language models rely on, usually adopts plain unstructured corpora in the form of token sequences. It is believed that a pre-trained language model can always benefit from more training corpora. To achieve better performance on domain-specific downstream tasks, it is also intuitive that in-domain pre-training data is necessary. In the biomedical domain, the in-domain data can be text from electronic health records, scientific literature, and online social media, or biological sequences (e.g., DNA fragments), which will be introduced in Sec. 4.1. Next, in Sec. 4.2, we introduce existing pre-trained models in the biomedical domain, which are pre-trained on the in-domain data introduced in Sec. 4.1. We give an overview of these models and explain some differences among them. We hope this helps readers from both the bioinformatics and computer science communities quickly get up to speed on biomedical domain-specific pre-trained language models.
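As a minimal illustration of continued (domain-adaptive) pre-training on in-domain text, the sketch below runs masked language modeling over a local file of biomedical abstracts with the Hugging Face Transformers and Datasets libraries; the file name, checkpoint, and hyper-parameters are placeholders of our own choosing, not the recipe of any specific model in this survey.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from a general-domain checkpoint and continue masked-LM training on in-domain text.
checkpoint = "bert-base-uncased"                       # general-domain starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# "pubmed_abstracts.txt" is a placeholder: one biomedical abstract (or sentence) per line.
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="biomedical-dapt", per_device_train_batch_size=16,
                         num_train_epochs=1, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

Training from scratch would differ only in that the model (and possibly the vocabulary) is randomly initialized instead of loaded from a general-domain checkpoint.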
Unstructured plain data for pre-trained language models mainly include electronic health records, scientific publications, social media text, and other biological sequences such as proteins. An overview of EHR mining can be found in [53, 273], and [64] discusses both health records and social media text. One can also check [99] for a systematic overview of biomedical textual corpora. An electronic health record (EHR) is a collection of patient and population health information stored electronically in a digital format, which may include demographics, medical history, medications and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics like age and weight, and billing information. One can check [219, 261] for details on EHRs with deep learning. Access to such records may be restricted to a limited set of organizations, which hinders their availability to the public; the reasons often involve privacy issues.

CPRD. The Clinical Practice Research Datalink (CPRD) [83] is a primary care database of anonymized medical records from 674 general practitioner (GP) practices in the UK, covering over 11.3 million patients. It consists of data on demographics, symptoms, tests, diagnoses, therapies, and health-related behaviors. It is also linked to secondary care (i.e., hospital episode statistics, or HES) and other health and administrative databases (e.g., the Office for National Statistics' death registration). With 4.4 million active (alive, currently registered) patients meeting quality criteria, approximately 6.9% of the UK population is included, and patients are broadly representative of the UK general population in terms of age, sex, and ethnicity. As a result, CPRD has been widely used across countries and has spawned a great deal of scientific research output.

Publications. Scientific publications are another source for biomedical pre-trained language models, since we expect biomedical knowledge to be encapsulated in scientific publications. Moreover, such knowledge may not be limited to traditional common knowledge, but may also involve state-of-the-art research output reported in recent literature.

BREATHE. The Biomedical Research Extensive Archive To Help Everyone (BREATHE) is a large and diverse collection of biomedical research articles from leading medical archives. It contains titles, abstracts, and full-body texts. The dataset was collected with public APIs where available. The primary advantage of the BREATHE dataset is its source diversity: BREATHE draws from nine sources including BMJ, arXiv, medRxiv, bioRxiv, CORD-19, Springer Nature, NCBI, JAMA, and BioASQ [30]. BREATHE v1.0 contains more than 6M articles and about 4 billion words; BREATHE v2.0 is the most recent version.

Reddit. Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site, such as links, text posts, images, and videos, which are then voted up or down by other members. Posts are organized by subject into user-created boards called "communities" or "subreddits", which cover a variety of topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image sharing. Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough up-votes, ultimately on the site's front page.
Despite strict rules prohibiting harassment, Reddit's administrators have to moderate the communities and, on occasion, close them. The COMETA corpus [15] was crawled from health-themed forums on Reddit using Pushshift (Baumgartner et al., 2020) and Reddit's own APIs.

Tweets. Twitter is an American micro-blogging and social networking service on which users post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets. Tweets were originally restricted to 140 characters, but the limit was doubled to 280 for non-CJK languages in November 2017. Audio and video tweets remain limited to 140 seconds for most accounts. COVID-twitter-BERT [165] is trained on a corpus of 160M tweets about the coronavirus collected through the Crowdbreaks platform [166] during the period from January 12 to April 16, 2020.

Other than unstructured text, there are some well-organized online medical knowledge sources. For example, UMLS provides biomedical concepts that may benefit biomedical pre-trained language models.

UMLS. The Unified Medical Language System (UMLS) [21] (http://umlsks.nlm.nih.gov) is a repository of biomedical vocabularies developed by the US National Library of Medicine. The UMLS has over 2 million names for 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts. These vocabularies include the NCBI taxonomy, the Medical Subject Headings (MeSH), the Gene Ontology, OMIM, and the Digital Anatomist Symbolic Knowledge Base. The UMLS knowledge sources are updated every quarter. All vocabularies are freely available for research purposes within an institution if a license agreement is signed.

Other than text, there are various types of biomedical token sequences, for example, amino acids for proteins. The structure of each protein is fully determined by its sequence of amino acids [12]. These amino acids come from a limited-size amino acid vocabulary, of which 20 are commonly observed. This is similar to how text is composed of words from a lexical vocabulary. In this subsection, we introduce a protein dataset called 'Pfam' and a DNA sequence dataset from the Human Genome Project.

Pfam Protein Dataset. The Pfam database (http://pfam.xfam.org/) is a large collection of protein families; related entries can be further grouped into higher-level 'clans' based on sequence similarity, structure, or profile-HMMs.

Human Genome Dataset. In 2003, an accurate and complete human genome sequence was finished two years ahead of schedule and at a cost less than the original estimated budget. [90] uses the reference human genome GRCh38.p13 primary assembly from GENCODE (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/). The total sequence length is about 3 billion base pairs.

Based on the types of training corpora in the biomedical domain introduced in Sec. 4.1, we mainly introduce two groups of biomedical pre-trained language models: biomedical textual language models and protein language models. Since BERT was released, various biomedical pre-trained language models have been proposed, either via continued training on in-domain corpora starting from the BERT model or via training from scratch. Table 4 presents existing pre-trained language models with the corpora used, size, release date, and related web pages. We introduce some representative pre-trained language models, including encoder-only models like BioBERT, ClinicalBERT, SciBERT, and COVID-twitter-BERT, a decoder-only model, MedGPT, and an encoder-decoder model, SCIFIVE.

• BioBERT [121] is initialized with the general BERT model and pre-trained on PubMed abstracts and PMC full-text articles.
It is further fine-tuned for biomedical text mining tasks such as named entity recognition (NER), question answering, and relation extraction.

• ClinicalBERT [87] is trained on clinical text from approximately 2M notes in the MIMIC-III database [96], a publicly available dataset of clinical notes.

• SciBERT [16] is trained on a large collection of scientific papers from multiple domains, based on BERT. The training corpus is a random sample of 1.14M full-text papers from Semantic Scholar, of which 82% are from the biomedical domain.

Table 4. Existing textual biomedical pre-trained models. The base setting has about 0.1B parameters and the large setting about 0.3B parameters. Dates are based on the arXiv submission or the publication date of the journal or conference proceedings.
Model | Corpora | Size | Date | URL
[211] | PubMed abstracts | large | 2019.08 | https://github.com/zalandoresearch/flair
RadBERT [150] | RadCore radiology reports | - | 2019.12 | -
EhrBERT [125] | MADE corpus | base | 2019.12 | https://github.com/umassbento/ehrbert
Clinical XLNet [88] | EHR (MIMIC-III) | base | 2019.12 | https://github.com/lindvalllab/clinicalXLNet
CT-BERT [165] | Tweets about the coronavirus | large | 2020.05 | https://github.com/digitalepidemiologylab/covid-twitter-bert
Med-BERT [199] | Cerner Health Facts (general EHR) | - | 2020.05 | https://github.com/ZhiGroup/Med-BERT
ouBioBERT [245] | PubMed | base | 2020.05 | https://github.com/sy-wada/blue_benchmark_with_transformers
Bio-ELECTRA [173] | PubMed | base | 2020.05 | https://github.com/SciCrunch/bio_electra
BERT-XML | Anonymous institution EHR system | small and base | 2020.06 | -
PubMedBERT [66] | PubMed | base | 2020.07 | https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
MCBERT [292] | Chinese social media, wiki and EHR | base | 2020.08 | https://github.com/alibaba-research/ChineseBLUE
BioALBERT [168] | PubMed | - | - | -
[287] | UMLS Metathesaurus | base | 2020.11 | https://github.com/GanjinZero/CODER
bert-for-radiology [25] | daily clinical reports | - | 2020.11 | https://github.com/rAIdiance/bert-for-radiology
BioMedBERT [30] | BREATHE | large | 2020.12 | https://github.com/BioMedBERT/biomedbert
LBERT [257] | PubMed | base | 2020.12 | https://github.com/warikoone/LBERT
ELECTRAMED [158] | PubMed abstracts | base | 2021.04 | https://github.com/gmpoli/electramed
SCIFIVE [185] | PubMed Abstract and PMC | 220/770M | 2021.06 | https://github.com/justinphan3110/SciFive
MedGPT [114] | King's College Hospital and MIMIC-III | customized | 2021.07 | https://pypi.org/project/medgpt/

• COVID-twitter-BERT [165] is a natural language model for analyzing COVID-19 content on Twitter. The COVID-twitter-BERT model is trained on a corpus of 160M tweets about the coronavirus collected through the Crowdbreaks platform during the period from January 12 to April 16, 2020.

• MedGPT [114] is a GPT-like language model trained on patients' medical histories in the form of electronic health records (EHRs). Given the sequence of past events, MedGPT aims to predict future events such as the diagnosis of a new disorder or a complication of an existing disorder.

• SCIFIVE [185] is a domain-specific T5 model pre-trained on large biomedical corpora. Like T5, SCIFIVE follows the typical Seq2Seq paradigm of transforming an input sequence into an output sequence.
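As a hedged sketch of how such encoder-only checkpoints are typically consumed downstream, the snippet below attaches a sequence classification head on top of a biomedical BERT with the Hugging Face Transformers library; the checkpoint identifier, the toy sentences, and the binary labels are illustrative placeholders rather than part of any model's official recipe.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint identifier; any encoder-style model from Table 4 could be substituted.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The pre-training heads are dropped and a freshly initialized classification head is attached.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["Aspirin reduces the risk of myocardial infarction.",
         "The patient denies chest pain."]
labels = torch.tensor([1, 0])                               # toy binary labels for illustration

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()                                     # one gradient step of ordinary fine-tuning
print(outputs.logits.shape)                                 # torch.Size([2, 2])
```

The same pattern, with a token classification head instead of a sequence classification head, underlies most of the BioNER systems discussed in Sec. 5.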
Here, we discuss the listed models along several dimensions.

Training corpora: EHR, literature, social media, etc., or a hybrid? Most pre-trained language models are based on scientific publications (e.g., PubMed) and EHR notes. Note that EHR datasets are usually relatively smaller than scientific publication datasets or Wikipedia; hence pre-trained language models trained only on EHR datasets are typically initialized from well-trained BERT [10, 87], XLNet [88], etc. Furthermore, some pre-trained language models (e.g., BioRoBERTa [123]) adopt both scientific publications and EHRs. A few models such as CT-BERT and BioReddit-BERT [15, 165] adopt social media, including Twitter and Reddit.

Extra features. Unlike typical text, EHR data usually come with extra features, for example, disease codes and patients' personal information such as age and gender. Such extra features can be embedded as dense vectors in models such as Med-BERT and BEHRT [131, 199], analogous to the word, position, and segment embeddings used in the input layer of the Transformer.

Training from scratch or continued training. The standard approach to obtaining a biomedical pre-trained model is to conduct continual pre-training from a general-domain pre-trained model like BERT [45], as in BioBERT [285]. Specifically, this approach initializes the model with the standard BERT model, including its word vocabulary, which is pre-trained on general Wikipedia and BookCorpus. Besides, some literature demonstrates that training from scratch may make full use of in-domain data and reduce the negative effect of out-of-domain corpora, which may be beneficial for downstream tasks, as in PubMedBERT [66].

Reusing an existing vocabulary or building a new one. To make use of well-trained general pre-trained language models like BERT [45], one has to reuse their vocabulary. However, biomedical NLP is more challenging than general NLP because jargon and abbreviations prevail: clinical notes have different syntax and grammar than books or encyclopedias. A totally new vocabulary necessarily leads to training from scratch, which may be more computationally expensive.

Model size. Typically, bigger models have a larger capacity and need more data for training. However, the biomedical domain usually does not have as many corpora as the general domain; thus, biomedical pre-trained language models are relatively smaller than general pre-trained language models. Another reason is that most of them are based on BERT or BERT-like encoder models, while pre-trained models with a decoder architecture (e.g., GPT, T5) can be bigger than encoder-based pre-trained models. To the best of our knowledge, the biggest model is BioMegatron [215] with 1.2B parameters. Note that bigger models take longer for inference, which is unfriendly for researchers without sufficient computing resources.

Thanks to the open-source tradition of computer science, most models have web pages for download and documentation for usage. Some of them publish standardized models on Hugging Face (https://huggingface.co), which greatly helps their dissemination. However, some models are not available to the public due to privacy issues, even though the data might have been anonymized [122].

Biomedical pre-trained language models in other languages. Most biomedical pre-trained language models are in English.
However, there is an increasing need for biomedical pre-trained language models in other languages. There are typically two solutions: a multilingual solution or a purely second-language solution. The former may be beneficial for low-resource languages, while the latter is usually used for rich-resource languages such as Chinese [292].

There are various biological sequences, such as proteins and DNA, which can also be treated like linguistic tokens in natural language. Therefore, many existing works have explored training language models for these biological sequences. The most crucial difference between language models for biological sequences and their counterparts for natural language is tokenization (introduced in Sec. 4.3.1), which leads to vocabularies that differ from textual vocabularies. In Sec. 4.3.2, we summarize the existing language models for these biological sequences.

4.3.1 Tokenization for Proteins/DNAs. Like words in text, biological sequences such as proteins and DNA sequences can also be modeled by language models, which typically aim to predict the next token in a sequence. In contrast to words, which come from a relatively big vocabulary (typically 10k-100k), the vocabularies for biological sequences are usually small.

Tokenization in Proteins. Since the structure of a protein is fully determined by its amino acid sequence [12], one can represent a protein by its amino acid sequence. Roughly 500 amino acids have been identified in nature; however, only 20 amino acids make up the proteins in the human body. The vocabulary of protein sequences consists of these 20 typical amino acids.

Tokenization in DNAs. Work on DNA language models (e.g., DNABERT [90]) usually adopts a so-called 'k-mer' representation for DNA sequences, which provides richer contextual information for DNA. By doing so, the vocabulary size grows to 4^k, which is exponential in k, plus five special tokens, i.e., 4^k + 5 in total (a minimal tokenization sketch is given below).

Protein language models. The commonly found categories of amino acids are relatively few, namely 20; thus, some early work applied character-level language models to proteins to handle this limited-size amino acid vocabulary. In the beginning, there were many efforts to train RNN-based language models [9, 18] for protein sequences. [79, 80] train a deep bi-directional ELMo model for proteins. Beyond the raw protein sequences, protein language models often adopt additional features, e.g., global structural similarity between proteins and pairwise residue contact maps for each protein [18]. Later, [197] introduced the Tasks Assessing Protein Embeddings (TAPE), a suite of biologically relevant semi-supervised learning tasks; the authors also train language models based on LSTM, Transformer, and ResNet on the protein sequences. Bepler et al. [19] also proposed a novel LSTM-based framework for learning protein sequence embeddings and make their embeddings publicly available. Rives et al. [202] train a contextual Transformer-based language model on 250 million protein sequences. The representations learned by this LM encode multi-level information, spanning from biochemical properties of amino acids to remote homology of proteins. Different from the above line of approaches, the MSA Transformer [198] fits a model separately to each family of proteins. ProtTrans [54] trains a variety of LMs with thousands of GPUs and also makes the trained models publicly available. ProGen [146] is a generative LM trained on 280M protein sequences conditioned on taxonomic and keyword tags.
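Before continuing with the remaining protein and DNA language models, the sketch below illustrates the k-mer tokenization scheme described in Sec. 4.3.1 in plain Python; the particular special-token set is an assumption following BERT-style conventions, and the function names are ours.

```python
from itertools import product

def kmer_tokenize(dna: str, k: int = 6, stride: int = 1):
    """Split a DNA sequence into overlapping k-mers, as in k-mer-based DNA language models."""
    dna = dna.upper()
    return [dna[i:i + k] for i in range(0, len(dna) - k + 1, stride)]

def kmer_vocabulary(k: int = 6,
                    specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """All 4^k k-mers over {A, C, G, T} plus five special tokens, i.e. 4^k + 5 entries in total."""
    return list(specials) + ["".join(p) for p in product("ACGT", repeat=k)]

print(kmer_tokenize("ATGCGTAC", k=3))   # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(len(kmer_vocabulary(k=3)))        # 4**3 + 5 = 69
```

Protein sequences, by contrast, are usually tokenized character by character over the 20 common amino acids, so no such combinatorial expansion is needed.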
ProteinLM [269] was recently proposed; it trains a large-scale pre-trained model for evolutionary-scale protein sequences, and the trained model is publicly available. AlphaFold2 [97] also makes it feasible to exploit unlabelled protein data: in detail, AlphaFold2 adopts an auxiliary BERT-like loss to predict pre-masked residues in multiple sequence alignments (MSAs).

DNA language models. Proteins are translated from DNA through the genetic code. There are 20 natural amino acids used to build the proteins that DNA encodes; therefore, amino acids cannot be mapped one-to-one from only four nucleotides. Some work has also explored the potential of building language models on DNA sequences. DNABERT [90] is a bidirectional encoder pre-trained on genomic DNA sequences with upstream and downstream nucleotide contexts. Yamada and Hamada [275] pre-train a BERT on RNA sequences and RNA-binding protein sequences. All these LMs remain largely the same as those used for human language data; designing new architectures and pipelines tailored to protein/DNA sequences is a promising direction.

Table 5. Representative applications of pre-trained language models in biomedical downstream tasks.
Named entity recognition: BioBERT [121], BlueBERT [179], SciBERT [16], BioELMo [93], PubMedBERT [66], BioMegatron [215], Yuan et al. [286], Alsentzer et al. [10], Singh et al. [206], Zhu et al. [301], Si et al. [216], Sheikhshab et al. [213], Khan et al. [103], Giorgi et al. [63], Naseem [168], Gao et al. [59], Poerner et al. [186], Sun et al. [229]; for Spanish [6, 75, 159], for Chinese [43, 91, 129, 130, 252, 272], for French [41], for Korean [109], for Russian [154], for Arabic [23], for Italian [28].
Relation and event extraction: BioBERT [121], BlueBERT [179], SciBERT [16], PubMedBERT [66], BioMegatron [215], Ramponi et al. [196], Wang et al. [253].
Text classification: BioBERT [121], BlueBERT [179], SciBERT [16], PubMedBERT [66], BioMegatron [215], Yuan et al. [286], Wei et al. [258], Gao et al. [58], Mascio et al. [148], Guo et al. [69], [73], Sont et al. [221].
Sentence similarity: BioBERT [121], BlueBERT [179], SciBERT [16], PubMedBERT [66], BioMegatron [215], Yuan et al. [286], [258], Chen et al. [33], [31], [32], Yang et al. [281], Li et al. [132].
Text summarization: Extractive: [220]; Abstractive: Wallace et al. [247], Gharebagh et al. [62]; Both: DeYoung et al. [46], Guo et al. [70], Kieuvongngam et al. [105].
Natural language inference: BioELMo [93], BlueBERT [179], Sharma et al. [212], He et al. [78], Zhu et al. [302].
Protein and DNA sequences: [9], [80], [197], [202], MSA Transformer [198], ProtTrans [54], ProGen [146], DNABERT [90], [275].

Similar to the general domain, pre-trained language models have been widely used in many biomedical downstream tasks, as shown in Table 5. In this section, we introduce recent efforts to apply PLMs to various biomedical downstream tasks, and we also introduce some competitions and challenges from related workshops and conferences in which pre-trained language models may help. We first introduce the downstream applications of PLMs in various classical tasks of the biomedical domain. Similar to the general domain, to evaluate the effectiveness of biomedical pre-trained language models and facilitate research development, the Biomedical Language Understanding Evaluation (BLUE) benchmark has been proposed [179]. BLUE includes five text mining tasks in biomedical natural language processing: sentence similarity, named entity recognition, relation extraction, text classification, and inference. However, BLUE has some limitations.
It does not include some important biomedical application tasks such as question answering, and it mixes applications on clinical data and biomedical literature. To improve on it, [66] proposed a novel benchmark, the Biomedical Language Understanding & Reasoning Benchmark (BLURB). It includes named entity recognition (NER), evidence-based medical information extraction (PICO), relation extraction, sentence similarity, document classification, and question answering. Moreover, some works have proposed benchmarks in other languages, such as Chinese [292]. In the following, we introduce the recent progress of PLMs on these tasks and other critical tasks in the biomedical domain.

Information extraction plays a key role in automatically extracting structured biomedical information (entities, concepts, relations, and events) from unstructured biomedical text data, ranging from biomedical literature and electronic health records (EHR) to biomedicine-related social media corpora; one can check the review in [256]. It plays an important role in intelligent healthcare applications such as clinical decision support, biocuration assistance, and health monitoring. In the biomedical community, it generally refers to several important sub-tasks: named entity or concept recognition, which aims to identify common biomedical concept mentions or entity names (such as genes, drug names, adverse effects, metabolites, and diseases) in biomedical texts; relation extraction, which determines the relationships among biomedical entities, concepts, and attributes; and event extraction. Similar to the general domain, methods for biomedical information extraction have recently advanced rapidly based on pre-trained language models. We next introduce the progress of methods based on pre-trained language models for each sub-task.

[66] argued that domain-specific pre-training from scratch is more effective than domain-specific fine-tuning and mixed-domain pre-training as in the aforementioned methods. They proposed the PubMedBERT model, which conducts domain-specific pre-training from scratch on a purely biomedical dataset: PubMed texts. It has achieved better performance than the aforementioned methods on several biomedical NLP tasks, including biomedical NER. Yuan et al. [286] proposed a biomedical pre-trained language model, KeBioLM, that explicitly incorporates knowledge from the Unified Medical Language System (UMLS) knowledge bases; it outperforms other PLMs on the biomedical NER task. Similar to PubMedBERT, Shin et al. [215] further presented a larger BioMegatron model trained on a larger biomedical domain corpus, which has achieved state-of-the-art (SOTA) results on standard biomedical NLP benchmarks, including named entity recognition. For the BioNER task, these models generally add an LSTM- or conditional random field (CRF)-based token classification layer on top of the Transformer structure of the pre-trained language model to predict the tag of each token. Besides the biomedical pre-trained language models applied to various biomedical NLP tasks, there are also many works fine-tuning pre-trained language models specifically for the biomedical NER task. Singh et al. [206] first investigated pre-trained language models for improving the performance of biomedical NER.
They proposed to make use of unlabeled data (the PubMed abstract dataset) to pre-train the weights of the NER model and then fine-tuned the model with supervised NER training data. Zhu et al. [301] trained a domain-specific ELMo model on a mixture of clinical reports and relevant Wikipedia pages and then utilized it for clinical concept extraction. Si et al. [216] presented an analysis of advanced word embedding methods (including ELMo and BERT) and investigated their effects on clinical concept extraction tasks. Similar to [216], Sheikhshab et al. [213] also investigated contextual embeddings from the ELMo language model for improving biomedical named entity recognition. Khan et al. [103] presented a multi-task transformer-based neural architecture for slot tagging and applied it to the biomedical domain (MT-BioNER); it utilizes BioBERT as the transformer encoder and multiple datasets in the task-specific layers. With the development of general-domain language models, Naseem [168] proposed an effective domain-specific language model, BioALBERT, trained on biomedical domain corpora (PubMed abstracts and PMC full-text articles) for biomedical named entity recognition. Giorgi et al. [63] proposed an end-to-end model for jointly extracting named entities and their relations using the pre-trained language model BERT; it achieved better performance on the biomedical NER task than [216] and BlueBERT. Gao et al. [59] explored the effectiveness of transfer learning based on pre-trained language models and semi-supervised self-training to improve the performance of biomedical NER with very limited labeled data. To save the memory and time cost of domain adaptation, Poerner et al. [186] proposed to train Word2Vec on biomedical domain text and incorporate it into the general-domain BERT to improve performance on the biomedical NER task. Instead of treating BioNER as a sequence labeling problem, Sun et al. [229] proposed to cast BioNER as a machine reading comprehension (MRC) problem based on BERT. Besides English, there is much work exploring pre-trained language models for BioNER in other languages, including Chinese [43, 91, 129, 130, 252, 272], Spanish [6, 75, 159], French [41], Korean [109], Russian [154], Arabic [23], and Italian [28]. In Table 6, we summarize the commonly used datasets for the BioNER task.

Relation extraction is another important information extraction task for biomedical texts. It aims to identify the relationship (semantic correlation) between biomedical entities mentioned in texts (such as genes, proteins, and diseases), which has a variety of biomedical applications such as clinical outcome prediction, protein structure prediction, and clinical diagnosis. It is generally transformed into a classification problem: predicting the possible relation type between two identified entities in a given sentence. In the past, [234] adapted SciBERT to BioRE by fine-tuning the representation of the classification token (CLS). However, this only utilizes partial information from the last layer, since mainly the classification token is fine-tuned. To further explore the potential of utilizing the full information of the last layer to improve performance, Su et al. [226] proposed to utilize the whole last layer when fine-tuning the BERT model on the BioRE task. To learn more general entity representations and further improve performance, Su et al. [225] propose to employ contrastive learning to improve the BERT model for biomedical relation extraction.
Xue et al. [272] proposed to fine-tune BERT for joint entity and relation extraction in Chinese medical text, using BERT as a shared encoder and focused attention to fuse information from the NER and RE tasks. Chen et al. [34] combined BERT with a one-dimensional convolutional neural network (1d-CNN) and fine-tuned it for medical relation extraction. Lin et al. [135, 136] investigated the BERT model for clinical temporal relation extraction. Guan et al. [68] investigated several pre-trained language models, including BERT, RoBERTa, ALBERT, XLNet, BioBERT, and ClinicalBERT, for predicting the relationships between clinical events and temporal expressions; they found that RoBERTa generally has the best performance. To perform biomedical relation extraction at the document level, where long-distance dependencies and complex semantics arise, Liu et al. [143] proposed to use a pre-trained self-attention structure together with an entity replacement method. In many real application scenarios, training a medical relation extraction model requires collecting and storing privacy-sensitive data, which may conflict with privacy protection. To solve this problem, Sui et al. [227] proposed a privacy-preserving medical relation extraction method, FedED, based on BERT and federated learning. Moreover, similar to BioNER, several advanced biomedical pre-trained language models mentioned in the last section, such as BioBERT [121], BlueBERT [179], SciBERT [16], PubMedBERT [66], and BioMegatron [215], have achieved great performance on the BioRE task. They are typically adapted to the BioRE task with an extra classification layer (a linear layer or MLP).

Event extraction is another important task for mining structured knowledge from biomedical data. It aims to extract interactions between biological components (such as proteins, genes, metabolites, drugs, and diseases) and the consequences or effects of these interactions [11]. It has wide applications in biomedicine, ranging from supporting information retrieval to knowledge base enrichment and pathway curation. An event generally consists of event triggers and their arguments (event participants). The triggers are signal words (generally verbs or nouns) that indicate the appearance and type of an event; the arguments are biomedical entities. Events are finally formulated as graph structures, in which triggers are connected to the appropriate arguments along the paths. Commonly used datasets are summarized in Table 8. Many efforts have recently explored the application of pre-trained language models to biomedical event extraction. Trieu et al. [236] proposed a neural nested event extraction model called DeepEventMine with a BERT-based encoder, in which task-specific layers are added on top of the BERT encoder to detect nested entities and triggers, roles, and nested events. Wadden et al. [246] explored combining the BERT model and graph propagation for event extraction in a variety of domains, including the biomedical domain, to capture both contextual information within a sentence and cross-sentence dependencies. They showed that the long-range dependencies captured by graph propagation can improve performance over the BERT-based model alone. Zhang et al. [298] investigated transfer learning with the BERT model for Chinese clinical event detection. Moreover, some works consider modeling the biomedical event extraction task as other NLP tasks.
Ramponi et al. [196] modeled biomedical event extraction as a sequence labeling problem. They proposed a neural event extraction model called BEESL, which converts event structures into a sequence labeling format and utilizes the BERT model as the encoder. Wang et al. [253] proposed to formulate biomedical event extraction as a multi-turn question answering problem and utilized a question answering system based on the domain-specific language model SciBERT to achieve strong performance.

Text classification is an essential task in biomedical natural language processing. It aims to classify biomedical texts into pre-defined categories, which plays an important role in the statistical analysis, data management, and retrieval of biomedical data. Compared with the general domain, text classification in the biomedical domain faces more challenges, such as data imbalance, semantic ambiguity, and irregular data. Fine-tuning pre-trained language models for biomedical text classification has attracted great attention recently; the commonly used biomedical text classification datasets are summarized in Table 9. Gao et al. [58], for example, fine-tuned pre-trained language models for biomedical text classification.

Capturing the semantic similarity of sentences plays an important role in the information extraction and text mining of biomedical data, which is beneficial for many biomedical applications and downstream tasks such as biomedical search, evidence sentence retrieval, classification, and question answering. It is generally formulated as a regression problem to predict the similarity score of each sentence pair. Recent works have focused on fine-tuning various pre-trained language models for this task. Existing pre-trained language models in the biomedical and general domains, such as BERT, RoBERTa, BioBERT, SciBERT, ClinicalBERT, BlueBERT, PubMedBERT, and BioMegatron, have achieved great performance on the task, among which BioMegatron yields the best performance due to domain-specific pre-training and task-specific fine-tuning. However, these models are pre-trained to yield token-level embeddings with contextual information. To better capture semantic information at the sentence level, Chen et al. [33] proposed the pre-trained sentence embedding BioSentVec, which performs better on sentence similarity and text classification tasks by better capturing sentence semantics. In [31], they empirically compared the performance of traditional machine learning and deep learning methods, such as random forests, RNNs, and CNNs, with the pre-trained models BERT and BioBERT, showing that pre-trained language models are more effective at capturing sentence semantics since they are pre-trained on large-scale corpora. Moreover, in [32], they further employed the pre-trained sentence embedding BioSentVec to improve traditional models, namely random forests and an encoder network, for finding similar sentences. Yang et al. [281] explored three pre-trained models, BERT, XLNet, and RoBERTa, for the clinical semantic textual similarity task, among which XLNet achieves the best performance. Li et al. [132] proposed to integrate BERT models and a bidirectional recurrent neural network (Bi-RNN) to capture both contextual semantics and semantic textual similarity. The commonly used sentence similarity datasets are shown in Table 10.

Biomedical question answering (BioQA) is a very important task for information retrieval and knowledge acquisition from large amounts of unstructured data, and it can serve professionals in the biomedical domain, such as doctors, as well as general users.
BioQA aims to extract or generate natural language answers to given questions. It is a challenging task due to the lack of large-scale annotated data: in the biomedical domain, annotation generally requires domain expertise, which is time-consuming and expensive. To facilitate the development of biomedical question answering, several competitions and datasets have been proposed, such as BioASQ [237] and MEDIQA 2019 (https://sites.google.com/view/mediqa2019). Recently, motivated by the success of unsupervised pre-training, fine-tuning and transfer learning of pre-trained language models have been widely explored for this task. Most approaches are formulated as machine reading comprehension, which predicts the answer span given a question and a passage containing the answer. Much effort has been devoted to investigating BERT and BioBERT for question answering over biomedical literature datasets. Yoon et al [284] applied BioBERT to answer biomedical questions of factoid, list, and yes/no types. To address the problem of limited training data in the biomedical domain, they first fine-tuned BioBERT on the general-domain question answering datasets SQuAD and SQuAD 2.0 and then further fine-tuned it on the task dataset, BioASQ. Instead of general-domain question answering datasets, Jeong et al [89] proposed to transfer knowledge from natural language inference (NLI) with BioBERT to improve performance. Inspired by BioBERT, Chakraborty et al [30] proposed BioMedBERT, a biomedical domain pre-trained language model for question answering (QA) and information retrieval (IR). To help the prevention of COVID-19, several works [55, 120, 170, 201] built question answering and information retrieval systems based on BioBERT and BERT. Besides methods for biomedical literature corpora, other works have proposed question answering models to acquire knowledge from unstructured electronic health records (EHRs). Soni and Roberts [222] investigated the performance of various pre-trained language models, including BERT, BioBERT, ClinicalBERT, and XLNet, on clinical question answering, exploring fine-tuning with different datasets drawn from general-domain, biomedical, and clinical corpora. Mairittha et al [147] explored fine-tuning BERT to construct a personalized EHR question answering system. The commonly used datasets in BioQA are summarized in Table 11.

Table 11. Datasets used in the biomedical question answering task.
Dataset | Text type | Data size
PubMedQA [94] | PubMed abstracts | 1,000
BioASQ [169] | MEDLINE articles | 885
CliCR [231] | Clinical case reports | 100,000
emrQA [176] | Clinical notes | 400,000
cMedQA [294] | Question-answer pairs from an online community | 61,343
COVID-19 Questions [232] | Literature review | 124

Dialogue systems (DSs) in the biomedical domain have attracted continuous attention due to their useful applications (e.g., virtual health consultants and therapists) [119]. The aim of a dialogue system is to produce a proper response, in either a selective [262, 296] or a generative [142, 288, 297] way, given a dialogue context for the biomedical goals of a user. The context includes historical utterances from users and systems, biomedical knowledge bases, electronic health records of users, etc. The format of a response can vary, e.g., a set of structured user goal data [259], a distribution over biomedical labels for diagnosis [138, 296], or natural language utterances [288].
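For the generative setting just described, a decoder-only PLM can produce the next utterance conditioned on the dialogue history. The minimal sketch below is in the spirit of DialoGPT [297], discussed in the following; the checkpoint name and the sampling settings are illustrative assumptions, and a medical DS would be further fine-tuned on in-domain dialogues.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "microsoft/DialoGPT-small"  # example general-domain dialogue checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The dialogue history is flattened into one token sequence ending with an EOS marker.
history = "I have had a sore throat and mild fever for two days."
input_ids = tokenizer.encode(history + tokenizer.eos_token, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=input_ids.shape[-1] + 40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens, i.e., the system response.
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)  # a generic reply unless the model is fine-tuned on medical dialogues
```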
For different types of dialogue contexts and responses, recent work focuses on end-to-end DSs [270, 288] or on parts of the four typical DS modules, i.e., Natural Language Understanding (NLU) [50, 214], Dialogue State Tracking (DST) [138, 259], Dialogue Policy Learning (DPL) [259, 268], and Natural Language Generation (NLG) [288]. PLMs are well known for natural language modeling, but it is nontrivial to pre-train on task datasets that are biased toward a specific domain [262]. Pre-training DS models for biomedicine can be seen as a task-specific pre-training problem [72]. It involves two essential aspects, i.e., biomedical-domain adaptation and dialogue-task adaptation. To adapt to the medical domain, the dominant solution is to pre-train a language model on a large-scale general or medical corpus and then fine-tune the model on a medical dialogue dataset. BERT-WWM and BERT-MED [277] first pre-train a BERT on Chinese Wikipedia and on a medical corpus, respectively, and then fine-tune it on the M^2-MedDialog dataset for understanding the intents and slots of patients. BioBERT and MIMIC-BERT [259] are pre-trained on PubMed articles and the MIMIC-III dataset [216], respectively, followed by fine-tuning on the MZ dataset for predicting diagnosis actions. Zeng et al. [288] pre-train Transformer, BERT-GPT, and GPT on dialogue datasets and other large-scale texts, and then fine-tune the models on the Chinese MedDialog dataset to generate clinically correct and human-like medical responses. To adapt to DS tasks, the challenges are complex dialogue context modeling [178] and enrichment with external knowledge (e.g., knowledge bases and user profiles) [177]. PLMs are naturally well suited to these challenges because: (i) PLMs are based on the transformer architecture, which can capture longer-term dependencies for learning complex dialogues effectively and efficiently [81, 242]. Yan et al. [277] unify the NLU, DPL, and NLG tasks into one context-to-response generation framework and use pre-trained GPT-2 and MT5 to model complex contexts for generating responses; their empirical study demonstrates the positive impact of PLMs on the M^2-MedDialog dataset. (ii) PLMs can incorporate external knowledge by pre-training on large-scale corpora in the general domain [98, 127]. Shi et al. [214] use a pre-trained BERT as a fixed feature extractor to obtain well-initialized word embeddings from a transfer learning perspective. DialoGPT [297], pre-trained on top of GPT-2 [192] with a large in-domain dialogue dataset, is able to generate more relevant, informative, and coherent responses. Li et al. [127] pre-train with language modeling objectives (e.g., MLM and NSP) on a large-scale general corpus and then fine-tune on medical datasets with task-specific training objectives. Although recent studies have deployed PLMs in medical DS tasks, medical DSs are still under-explored, so we summarize the available biomedical dialogue datasets in Table 12 for future research. Automatic text summarization is an effective way to conduct information extraction and retrieval over an ever-growing amount of biomedical text [160]. It aims to automatically summarize the key information of single or multiple documents into shorter, fluent text, which greatly reduces the time required to acquire important information. Similar to the general domain, existing methods can generally be classified into two categories: extractive and abstractive summarization methods.
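The abstractive variant is typically treated as conditional text generation with an encoder-decoder PLM, as elaborated in the following. The sketch below illustrates this with BART; the general-domain checkpoint and the toy input are illustrative assumptions, whereas the biomedical works discussed below fine-tune on in-domain data such as PubMed.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"  # example general-domain summarization checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A (truncated) biomedical abstract would go here; this toy input is only a placeholder.
document = (
    "Background: ... Methods: ... Results: the intervention group showed a "
    "statistically significant reduction in HbA1c compared with controls. "
    "Conclusions: ..."
)
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)

# Beam search generates a short summary conditioned on the encoded document.
summary_ids = model.generate(**inputs, num_beams=4, max_length=80, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```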
Extractive methods select salient sentences from the given documents and concatenate them into the final summary, whereas abstractive methods generate new sentences based on the information in the given documents. Extractive summarization is therefore generally formulated as a binary classification problem, in which the model predicts whether each sentence should be included in the summary, while abstractive summarization can be viewed as a conditional text generation problem. Pre-trained language models have been well explored for text summarization in the general domain. To exploit them for summarization in the biomedical domain, existing methods incorporate domain knowledge via in-domain fine-tuning. For biomedical extractive summarization, Du et al [51] proposed BioBERTSum, which uses a domain-aware pre-trained language model as the encoder and then fine-tunes it on the biomedical extractive summarization task. Gharebagh et al [62] utilized domain knowledge, i.e., salient medical ontological terms, to guide content selection in a SciBERT-based clinical abstractive summarization model. Moradi et al [162] proposed an unsupervised method for biomedical extractive summarization: a hierarchical clustering algorithm groups the contextual sentence embeddings produced by a BERT encoder, and the most informative sentence is selected from each group to form the final summary. Padmakumar et al [174] also proposed an unsupervised extractive summarization model, which uses GPT-2 to encode sentences and pointwise mutual information (PMI) to calculate the semantic similarity between sentences and documents; the proposed method outperforms other similarity-based models on a medical journal dataset. Kanwal et al [101] proposed to fine-tune BERT on MIMIC-III discharge notes labeled with International Classification of Diseases (ICD-9) codes for the extractive summarization of electronic health records. For abstractive summarization, Wallace et al [247] utilized the Bidirectional and Auto-Regressive Transformers (BART) model for multi-document summarization that generates narrative summaries of the evidence reported in randomized controlled trials. To facilitate the development of methods for generating plain-language summaries for the general public, Guo et al [70] proposed the novel task of plain language summarization of biomedical scientific reviews and constructed a dataset containing 7,805 high-quality abstract pairs. They explored the BART model for both extractive and abstractive summarization on this dataset; further pre-training it on the general-domain CNN/DM dataset and the in-domain PubMed dataset achieved the best performance. For information acquisition from COVID-19 related scientific literature, Kieuvongngam et al [105] proposed BERT- and GPT-2-based models for both extractive and abstractive summarization of COVID-19 research literature. There are also works that build multi-document summarization systems for the information retrieval of COVID-19 research literature with Siamese-BERT [55], BioBERT, and XLNet [44]. Besides biomedical literature and EHRs, there are works proposing BERT-based summarization methods [220] for understanding medical conversations between patients and doctors. Natural language inference (NLI, also known as text entailment) is a basic task for the natural language understanding of biomedical texts.
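NLI is commonly handled as sentence-pair classification with a softmax over three labels, as detailed in the following. The hedged sketch below uses a general-domain NLI checkpoint purely for illustration; the checkpoint name and the clinical example pair are assumptions, and the biomedical works below instead fine-tune domain-specific encoders such as BioBERT or ClinicalBERT.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"  # general-domain NLI checkpoint, used only as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The patient was started on metformin for type 2 diabetes."
hypothesis = "The patient is receiving an antidiabetic medication."

# Premise and hypothesis are encoded jointly as one sentence pair.
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Label names are read from the model config rather than assumed.
for label_id, p in enumerate(probs.tolist()):
    print(model.config.id2label[label_id], round(p, 3))  # entailment should score highest
```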
NLI aims to infer the relation, such as entailment, neutral, or contradiction, between two sentences, named the premise and the hypothesis, which can further benefit biomedical downstream tasks such as commonsense comprehension, question answering, and evidence inference. For this task, the common neural model is based on sentence-pair modeling: it encodes the premise and hypothesis sentences with a neural network and then classifies the relation between them with a softmax classifier layer. Similar to other tasks, pre-trained language models in the biomedical domain, including BioELMo [93] and BlueBERT [179], have shown their effectiveness on this task via task-guided fine-tuning. To facilitate the development of methods for text inference and entailment in the medical domain, the MEDIQA 2019 shared task [3] was organized, in which many participants investigated SciBERT, BioBERT, and ClinicalBERT for medical NLI. Moreover, some works make efforts to incorporate domain knowledge to improve PLMs on biomedical NLI. Sharma et al [212] incorporated embeddings of a biomedical knowledge graph (UMLS) into BioELMo to improve its performance. Yadav et al [274] proposed Sem-KGN, a novel framework for medical textual entailment that infuses medical entity information from medical knowledge bases into the BERT model. He et al [78] proposed to infuse domain knowledge about diseases into a series of pre-trained language models, including BERT, BioBERT, SciBERT, ClinicalBERT, BlueBERT, and ALBERT, to improve their performance on question answering, medical inference, and disease name recognition tasks. Zhu et al [302] utilized neural architecture search (NAS) to automatically find a better transformer structure based on the Chinese BERT-wwm-ext model [42] for medical query understanding. In this section, we list only some applications that have been well investigated or show potential, although there is far more room in the biomedical domain to make use of PLMs.
5.8.1 Protein structure predictions. Proteins are essential to life, and knowing their structure can facilitate our understanding of their function. However, the structure of only a small fraction of proteins is known [97]. Predicting the 3D structure of a protein solely from its amino acid sequence is known as the 'protein folding problem' [12]. To evaluate protein structure predictions, CASP (Critical Assessment of Structure Prediction) uses proteins whose recently solved structures have not yet been deposited in the PDB or publicly disclosed; it is therefore a blind test for participants and the gold-standard assessment for protein structure prediction [117, 164]. In CASP14, AlphaFold 2 [97], a model designed by DeepMind, achieved much better performance than the other participating methods (e.g., template-based methods). The authors claim that AlphaFold 2 provides precise estimates and can be used for protein structure prediction with high confidence. However, the predictions of existing methods, including AlphaFold 2, are more family-specific than protein-specific and rely on the evolutionary information captured in multiple sequence alignments (MSAs). To address these issues, Weißenow et al [260] proposed to use the attention heads of the pre-trained protein language model ProtT5 without MSAs.
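Protein language models of this kind expose per-residue representations that downstream structure predictors can consume without MSAs. The sketch below extracts such embeddings from a ProtTrans-family encoder; the checkpoint name and the preprocessing (whitespace-separated residues, rare amino acids mapped to X) follow that model family's publicly documented usage and are assumptions, not details taken from the works above.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

model_name = "Rostlab/prot_bert"  # example ProtTrans-family checkpoint
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"          # toy amino acid sequence
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))   # one token per residue, rare AAs -> X

inputs = tokenizer(sequence, return_tensors="pt")
model.eval()
with torch.no_grad():
    # Shape: (1, sequence length + 2 special tokens, hidden size); one vector per residue.
    residue_embeddings = model(**inputs).last_hidden_state

print(residue_embeddings.shape)  # features usable for downstream structure-related tasks
```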
Also for protein structure prediction, Sturmfels et al [224] recently presented a new biologically-informed pre-training task, predicting protein profiles derived from multiple sequence alignments, which improves the downstream protein structure prediction task. There is little work on DNA pre-training, among which DNABERT [86] is the representative one. Its authors report that DNABERT (1) effectively predicts proximal and core promoter regions; (2) accurately identifies transcription factor binding sites; (3) allows visualization of important regions, contexts, and sequence motifs; (4) identifies functional genetic variants; and (5) substantially enhances performance and generalizes to other organisms. Hong et al [85] proposed to pre-train DNA vectors to encode enhancers and promoters, and then incorporated an attention mechanism to predict long-range enhancer-promoter interactions (EPIs). Yamada et al [276] proposed a novel BERT-based method to predict the interactions between RNA sequences and RNA-binding proteins (RBPs), in which the BERT model is pre-trained on the human reference genome. Mock et al [161] presented BERTax, a BERT-based model for the taxonomic classification of DNA sequences. To facilitate technological developments in biomedical text mining, many shared tasks and competitions have been organized over the years, focusing on various important tasks in the biomedical domain.
• BioNLP workshop. The BioNLP workshop has been organized for 20 years and has continually promoted the development of the biomedical domain; the community has proposed a series of shared tasks and benchmark datasets through it. In BioNLP 2019, the BioNLP Open Shared Tasks (BioNLP-OST) 2019 [95] and the MEDIQA 2019 shared task [3] were organized. BioNLP-OST 2019 proposed six tasks, including information extraction on bacterial biotopes and phenotypes; event extraction of genetic and molecular mechanisms; named entity recognition of pharmacological substances, compounds, and proteins; an integrated structure, semantics, and coreference task; concept extraction for drug repurposing; and an information retrieval task for neuroscience. MEDIQA 2019 aims to explore method development for natural language inference (NLI), recognizing question entailment (RQE), and question answering (QA) in the medical domain. In BioNLP 2021, the MEDIQA 2021 [2] shared tasks were organized to address three tasks related to the summarization of medical documents: question summarization, multi-answer summarization, and radiology report summarization.
• SMM4H. The shared tasks in #SMM4H '21 involve information processing methods for Twitter data related to COVID-19, self-reports of breast cancer, adverse effect mentions, medication regimens, and adverse pregnancy outcomes.
In this subsection, we mainly discuss the limitations of biomedical pre-trained language models and raise some concerns about them. Misinformation. Training corpora consisting of EHRs and social media may include wrong information; thus, language models pre-trained on them may convey misinformation. Furthermore, the biomedical domain itself may have misclassified disease definitions during its development. Misinformation is much more serious in the biomedical domain than in the general domain, since it may lead to fatal consequences in biomedical decision-making.
However, researchers must be aware of the complexity of routinely collected electronic health records, including ways to manage variable completeness. We believe that the predictions of pre-trained language models should be calibrated by biomedical experts before they are used by end users such as patients or the general public. Interpretation issues. Along with the power of neural networks, there is growing concern about the interpretability of deep neural networks (DNNs). In the biomedical domain, the consequence of bad decisions or predictions may be deadly, so a well-interpreted model is even more crucial. Interpretation in the biomedical domain may concern two aspects: (1) biomedical models should be easily understood, and their predictions should be simulatable from the raw input; (2) a (textual) reason should be provided for each prediction. A basic example of the former (a.k.a. transparency [139]) is decision trees, which clearly illustrate the decision path. However, such transparency is hardly achievable in modern natural language processing, especially with pre-trained language models. More effort could be devoted to the latter: finding a textual explanation for each prediction or decision, based on which doctors and patients can make their own decisions. Identifying causalities from correlations. Similar to interpretability, causality may provide the underlying explanation of model decisions. Causality is crucial in many tasks involving biomedical knowledge, e.g., diagnosis, pathology, or systems biology. Causal associations between biological entities, events, and processes are central to most claims of interest; see an early review in [111]. Automatic causality recognition could suggest possible causal connections that benefit biomedical decisions and hence greatly reduce the human workload [155]. Trade-off between coverage and quality. There are no large-scale and high-quality training corpora in the biomedical domain. This means one has to either sacrifice coverage to obtain a high-quality vertical application or train a general model on large-scale yet lower-quality corpora. Pre-trained language models typically consist of many transformer layers with many parameters, which usually requires a massive amount of plain text. This may lead to a general model with great coverage but a smaller proportion of high-quality expert knowledge. Heterogeneous training data. Biomedical understanding involves heterogeneous information including tables, figures, graphs (e.g., fMRI), etc. For example, tables and numbers are crucial in scientific literature, but most PLMs are unable to interpret them well. To deeply capture the information in such heterogeneous data, both in-depth data preprocessing and model adaptation may be needed; in particular, multi-domain pre-trained language models for biomedicine deserve much more attention. Ethics and bias. With the rapid development of AI systems and their deployment in industrial products, one should be aware that they must not introduce bias against particular groups or populations [149]; some efforts in this direction have been made in the NLP field [20, 60, 230, 299]. This becomes even more crucial in sensitive biomedical settings that involve life-changing decisions, such as surgery [205]. It must be ensured that decisions do not reflect discriminatory or biased behavior toward specific groups or populations.
In the domain of pre-trained language models, ethics and bias issues have been quantified by a few works. [290] quantifies biases in clinical contextual word embeddings. The underlying reason may be that the training data itself is biased with respect to attributes such as gender, language, race, age, ethnicity, and marital status. For example, in the MIMIC-III dataset [96], one can find (1) gender bias: males have more heart disease than females, and (2) ethnicity bias: black patients have fewer clinical studies than other groups [98]. Considering the difficulty of directly reducing biases in training corpora, existing works explore identifying and mitigating bias via adversarial training [290] or data augmentation [157]. Privacy. Although most corpora used in biomedical pre-training, such as scientific publications and social media, are open-access, some EHRs are private because some organizations do not want to expose their data. Clinical records may contain patient visits and medical histories; such sensitive data should not be exposed, as it may harm patients physically or mentally [167]. Although sensitive information in EHR records (such as MIMIC-III) is de-identified before being shared for research purposes, it is possible to recover sensitive patient information from the de-identified medical records. Recent work has shown that there is data leakage from pre-trained models in the general domain, i.e., it is possible to recover personal information present in the pre-training corpora [122]. Due to such data leakage, models pre-trained on proprietary corpora cannot be released publicly. Recently, Nakamura et al [167] proposed the KART framework, which conducts various attacks to assess the leakage of sensitive information from pre-trained biomedical language models. Federated learning [128, 278] may also help: different organizations and end users can collaboratively learn a shared prediction model while keeping all training data private. We further suggest some future trends in this subsection. Standardized benchmark. In general NLP fields, evaluation criteria and standard benchmarks are a driving force for the community. For example, BERT [45] was widely evaluated on benchmarks [194, 248], which helped it spread to various NLP tasks; conversely, the lack of effective evaluation criteria is one of the bottlenecks of text generation [29]. In the biomedical domain, various pre-trained models and their fine-tuned applications have been proposed (as introduced in Sec. 4 and Sec. 5), but they are generally not well compared. A few efforts have been made to standardize benchmarks for biomedical pre-trained models, including but not limited to [67, 291]. This is much more difficult in a cross-discipline area like the biomedical domain, since papers come from different communities such as informatics, medicine, and computer science. An open, standardized, and well-categorized benchmark (as in [123]) should be proposed to exploit the advantages of each work and collaboratively push forward the development of biomedical NLP. This survey is a first step toward introducing biomedical pre-trained language models and their applications in downstream tasks; more effort is expected to design a fine-grained taxonomy and identify the state-of-the-art approach for each application, based on which incremental work can be better evaluated.
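To make the federated learning idea mentioned in the privacy discussion above more concrete, the toy sketch below shows a generic FedAvg-style loop in which several institutions fine-tune local copies of a model and only exchange parameters, never raw records. It is a simplified illustration under these assumptions and is not the FedED method [227], which additionally relies on ensemble distillation.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model, local_batches, epochs=1, lr=1e-3):
    """Each site fine-tunes its own copy of the global model on private data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in local_batches:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def federated_average(state_dicts):
    """The server averages parameters; raw data never leaves the sites."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        for sd in state_dicts[1:]:
            avg[key] = avg[key] + sd[key]
        avg[key] = avg[key] / len(state_dicts)
    return avg

# Tiny stand-in for a PLM-based relation classifier (features -> 2 relation labels).
global_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
site_data = [[(torch.randn(8, 16), torch.randint(0, 2, (8,)))] for _ in range(3)]

for _round in range(2):  # two communication rounds
    local_states = [local_update(global_model, batches) for batches in site_data]
    global_model.load_state_dict(federated_average(local_states))
```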
Open culture. In general NLP fields, much effort has been devoted to making resources widely available, including open-source resources (released training data and models) and fairly implemented approaches. Open culture allows researchers to contribute to the community easily. For example, the NLP community has developed greatly thanks to model collections [56, 264], and most papers accepted at top conferences tend to release code, models, and data. Biomedical NLP also benefits a lot from such an open culture and standardized, systematic evaluations; for instance, pre-trained models on Huggingface (https://huggingface.co/) have largely facilitated their application in the biomedical domain. Efficiency of pre-trained language models. Compared to previous state-of-the-art methods trained from scratch with pre-Transformer neural networks such as LSTMs or CNNs, pre-trained language models are much bigger in terms of model scale and much slower due to the increased number of parameters. This makes deployment more expensive, requiring more computing resources; one may refer to [233] for efficient transformers. For example, current work explores quantization [14, 295], weight pruning [86], and knowledge distillation [92, 207] for BERT. Therefore, in the biomedical domain, pre-training language models with lower computational complexity is a direction that deserves more attention. Generation-based PLMs are under-investigated. Most works focus on encoder-based models, and only a few are decoder-based or encoder-decoder-based, possibly because classification tasks dominate downstream biomedical applications. Very recently, [114] proposed GPT models using temporal electronic health records, and [185] trained a T5-based biomedical pre-trained model. We believe that generation-based PLMs have great potential in the biomedical domain but are currently under-investigated; we expect more work on generation-based pre-trained language models such as GPT, T5, and BART. Non-English or low-resource languages. Most work on biomedical pre-trained language models uses English corpora, with a few efforts on Chinese [292], German [25], Japanese [102, 245], Spanish [6, 7, 145, 159], Korean [109], Russian [239], Italian [28], Arabic [13, 23], French [41], Portuguese [208, 209], etc. For non-English biomedical tasks, there are two mainstream solutions: a single non-English language paradigm and a multilingual paradigm. The former uses a single language, while the latter uses multiple languages. The multilingual paradigm could be more beneficial for low-resource languages, since biomedical knowledge itself is language-independent and information in a second language can be complementary. Multi-modal pre-training. Multi-modal pre-training [191, 195] has attracted much attention in image classification and generation tasks, because it needs only cheap but large-scale publicly available online resources. This shows great potential in machine learning since less human annotation is needed, and various modalities are expected to provide complementary information. Making use of biomedical codes, medical images, waveforms, and genomics in pre-training models would be beneficial but challenging due to their multi-modal nature.
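As a concrete example of the efficiency techniques mentioned above, the sketch below applies post-training dynamic quantization to a Transformer classifier and compares checkpoint sizes. The base checkpoint and the size-measurement helper are illustrative assumptions; this generic recipe is not the method of any specific cited work.

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

# Example checkpoint; a biomedical encoder could be used in the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Replace Linear layers with dynamically quantized int8 versions for CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="tmp_model.pt"):
    """Rough on-disk size of a model's parameters, in megabytes."""
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"fp32: {size_mb(model):.0f} MB -> int8 dynamic: {size_mb(quantized):.0f} MB")
```

Accuracy after quantization should still be validated on the target biomedical task, since some loss is possible.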
Injecting biomedical knowledge into pre-trained language models. Before the pre-training era, some works [184] explored injecting medical knowledge into embeddings to provide potentially better machine learning features. Recently, it has been claimed that pre-trained language models can act as a soft knowledge base that captures knowledge; nevertheless, [40, 271] also tried to explicitly inject knowledge into pre-trained language models. In the knowledge-intensive biomedical domain, knowledge-injected models could have great potential in the future. For example, [153] integrates domain knowledge (i.e., the Unified Medical Language System (UMLS) Metathesaurus) into pre-training via a knowledge augmentation strategy. Interpretability in biomedical PLMs. Neural networks have been criticized for their limited interpretability. Pre-trained language models are typically huge neural networks, which makes interpretability even more challenging. One may wish to understand the working mechanisms related to medical characteristics in pre-trained language models. For example, probing has been widely used to understand pre-trained language models; see [113, 134, 244, 267]. For biomedical pre-trained language models, [8] evaluates pre-trained language models with respect to disease knowledge; [243] exhaustively analyzes attention in protein Transformer models, providing many interesting findings that help to understand their working mechanisms; and [93] conducts probing experiments to determine what additional information is carried intrinsically by BioELMo and BioBERT. We believe more effort is needed on the interpretability of biomedical PLMs. This paper systematically summarizes recent advances in pre-trained language models in the biomedical domain, including the background, why and how pre-trained language models are used in the biomedical domain, existing biomedical pre-trained language models, data sources in the biomedical domain, and the application of pre-trained language models to various biomedical downstream tasks. Furthermore, we discuss some limitations and future trends. Finally, we expect that pre-trained language models from the general NLP domain can also help the specific biomedical domain.
Recognizing question entailment for medical question answering Overview of the mediqa 2021 shared task on summarization in the medical domain Overview of the mediqa 2019 shared task on textual inference, question entailment and question answering Pharmaconer: pharmacological substances, compounds and proteins named entity recognition track Transfer learning for biomedical question answering Named entity recognition in spanish biomedical literature: short review and bert model Testing contextualized word embeddings to improve ner in spanish clinical case narratives Probing pre-trained language models for disease knowledge Unified rational protein engineering with sequence-only deep representation learning Publicly Available Clinical Bert embeddings Event extraction for systems biology by text mining the literature Principles that govern the folding of protein chains Arabert: transformer-based model for arabic language understanding Binarybert: pushing the limit of bert quantization Cometa: a corpus for medical entity linking in the social media SciBERT: A pretrained language model for scientific text A neural probabilistic language model Learning protein sequence embeddings using information from structure Learning protein sequence embeddings using information from structure Language (technology) is power: a critical survey of" bias The unified medical language system (umls): integrating biomedical terminology Abioner: a bert-based model for arabic biomedical named-entity recognition Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning Language models are few-shot learners Crosslingual named entity recognition for clinical de-identification applied to a covid-19 italian data set Evaluation of text generation: a survey Biomedbert: a pre-trained biomedical language model for qa and ir Evaluation of five sentence similarity models on electronic medical records Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records Biosentvec: creating sentence embeddings for biomedical texts A general approach for improving deep learning-based medical relation extraction using a pre-trained model and fine-tuning Electra: pre-training text encoders as discriminators rather than generators A discourse-aware attention model for abstractive summarization of long documents A survey of current work in biomedical text mining Bioinformatics-an introduction for computer scientists Biomedical natural language processing Combining pre-trained language models and structured knowledge Contextualized french language models for biomedical named entity recognition Pre-training with whole word masking for chinese bert Named entity recognition using bert bilstm crf for chinese electronic health records Caire-covid: a question answering and query-focused multi-document summarization system for covid-19 scholarly information management Bert: pre-training of deep bidirectional transformers for language understanding Ms2: multi-document summarization of medical studies Biocreative vi precision medicine track: creating a training corpus for mining protein-protein 
interactions affected by mutations Ncbi disease corpus: a resource for disease name recognition and concept normalization Extracting symptoms and their status from clinical conversations Learning to infer entities, properties and their relations from clinical conversations Biomedical-domain pre-trained language model for extractive summarization Deep scaled dot-product attention based domain adaptation model for biomedical question answering A survey and analysis of electronic healthcare record standards Prottrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing Co-search: covid-19 information retrieval with semantic search, question answering, and abstractive summarization Matchzoo: a toolkit for deep text matching Codebert: a pre-trained model for programming and natural languages Limitations of transformers on clinical text classification A pre-training and self-training approach for biomedical named entity recognition A survey on bias in deep nlp Linnaeus: a species name identification system for biomedical literature Attend to medical ontologies: content selection for clinical abstractive summarization End-to-end named entity recognition and relation extraction using pre-trained language models Capturing the patient's perspective: a review of advances in natural language processing of health-related text Cas: french corpus with clinical cases Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing Robustly pre-trained neural model for direct temporal relation extraction Benchmarking of transformer-based pre-trained models on social media text classification datasets Automated lay language summarization of biomedical scientific reviews 2020. Don't stop pretraining: adapt language models to domains and tasks Don't stop pretraining: adapt language models to domains and tasks Document classification for covid-19 literature Yannick Toussaint, and Adrien Coulet. 2020. Experiments on transfer learning architectures for biomedical relation extraction Biomedical named entity recognition with multilingual bert Pre-trained models: past, present and future Infusing disease knowledge into bert for health question answering, medical inference and disease name recognition Modeling aspects of the language of life through transfer-learning protein sequences Modeling the language of life-deep learning protein sequences Convert: efficient and accurate conversational representations from transformers The ddi corpus: an annotated corpus with pharmacological substances and drug-drug interactions Data resource profile: clinical practice research datalink (cprd) Long short-term memory Identifying enhancer-promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism Dynabert: dynamic bert with adaptive width and depth Clinicalbert: modeling clinical notes and predicting hospital readmission Clinical xlnet: modeling sequential clinical notes and predicting prolonged mechanical ventilation Transferability of natural language inference to biomedical question answering Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome A bert-bilstm-crf model for chinese electronic medical records named entity recognition Fang Wang, and Qun Liu. 2020. 
Tinybert: distilling bert for natural language understanding Probing biomedical embeddings from language models Pubmedqa: a dataset for biomedical research question answering Proceedings of the 5th workshop on bionlp open shared tasks Mimic-iii, a freely accessible critical care database Ammu-a survey of transformer-based biomedical pretrained language models Secnlp: a survey of embeddings in clinical natural language processing How to pre-train your model? comparison of different pre-training models for biomedical question answering Attention-based clinical note summarization Eiji Aramaki, and Kazuhiko Ohe. 2020. A clinical specific bert developed with huge size of japanese clinical narrative Morteza Ziyadi, and Mohamed AbdelHady. 2020. Mt-bioner: multi-task learning for biomedical named entity recognition using deep bidirectional transformers A survey of word embeddings for clinical text Automatic text summarization of covid-19 medical research articles using bert and gpt-2 Introduction to the bio-entity recognition task at jnlpba Overview of genia event task in bionlp shared task The genia event extraction shared task, 2013 edition-overview Korean clinical entity recognition from diagnosis text using bert Computational biology A review of causal inference for biomedical informatics Unsupervised pre-training for biomedical question answering Discourse probing of pretrained language models Medgpt: medical concept prediction from clinical narratives Chemdner: the drugs and chemical names extraction challenge Overview of the biocreative vi chemical-protein interaction track Critical assessment of methods of protein structure prediction (CASP)-Round XIII Albert: a lite bert for self-supervised learning of language representations Conversational agents in healthcare: a systematic review Answering questions on covid-19 in real-time Biobert: a pre-trained biomedical language representation model for biomedical text mining Does bert pretrained on clinical notes reveal sensitive data Pretrained language models for biomedical and clinical tasks: understanding and extending the state-of-the-art Semi-supervised variational reasoning for medical dialogue generation Fine-tuning bidirectional encoder representations from transformers (bert)-based models on large-scale electronic health record notes: an empirical study Biocreative v cdr task corpus: a resource for chemical disease relation extraction Task-specific objectives of pre-trained language models for dialogue adaptation Federated learning: challenges, methods, and future directions Chinese clinical named entity recognition with variant neural structures based on bert methods Towards chinese clinical named entity recognition by dynamic embedding using domain-specific knowledge Behrt: transformer for electronic health records Cross2self-attentive bidirectional recurrent neural network with bert for biomedical semantic text similarity Weijian Sun, and Xuanjing Huang. 2020. Task-oriented dialogue system for automatic disease diagnosis via hierarchical reinforcement learning Birds have four legs?! 
numersense: probing numerical commonsense knowledge of pre-trained language models A bert-based universal model for both within-and cross-sentence clinical temporal relation extraction A bert-based one-pass multi-task model for clinical temporal relation extraction Graph-evolving meta-learning for low-resource medical dialogue generation Enhancing dialogue symptom diagnosis with global attention and symptom graph The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing 2020. A survey on contextual embeddings Meddg: a large-scale medical consultation dataset for building medical dialogue system Document-level biomedical relation extraction leveraging pretrained self-attention structure and entity replacement: algorithm and pretreatment method validation study Roberta: a robustly optimized bert pretraining approach Pre-trained language models to extract information from radiological reports Progen: language modeling for protein generation Improving fine-tuned question answering models for electronic health records Comparative analysis of text classification approaches in electronic health records A survey on bias and fairness in machine learning Self-supervised contextual language representation of radiology reports to improve the identification of communication urgency Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression What happens to bert embeddings during fine-tuning Umlsbert: clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus On biomedical named entity recognition: experiments in interlingual transfer for clinical and social media texts Biocause: annotating and analysing causality in the biomedical domain Efficient estimation of word representations in vector space Interpretable bias mitigation for textual data: reducing gender bias in patient notes while maintaining classification performance Electramed: a new pre-trained language representation model for biomedical nlp Named entity recognition, concept normalization and clinical coding: overview of the cantemist track for cancer text mining in spanish, corpus, guidelines, methods and results Text summarization in the biomedical domain: a systematic review of recent research BERTax: taxonomic classification of DNA sequences with Deep Neural Networks Clustering of deep contextualized representations for summarization of biomedical texts On the stability of fine-tuning bert: misconceptions, explanations, and strong baselines A large-scale experiment to assess protein structure prediction methods COVID-Twitter-BERT: A Natural Language Processing Model to Analyse Covid-19 content on twitter Crowdbreaks: tracking health trends using public social media data and crowdsourcing Kart: privacy leakage framework of language models pre-trained with clinical records Bioalbert: a simple and effective pre-trained language model for biomedical named entity recognition Results of the seventh edition of the bioasq challenge Transformer-based models for question answering on covid19 A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature Overview of the epigenetics and post-translational modifications (epi) task of bionlp shared task On the effectiveness of small, 
discriminatively pre-trained language representation models for biomedical text mining Unsupervised extractive summarization using pointwise mutual information The species and organisms resources for fast and accurate identification of taxonomic names in text Emrqa: a large corpus for question answering on electronic medical records A cooperative memory network for personalized task-oriented dialogue systems with incomplete user profiles Sentnet: source-aware recurrent entity network for dialogue response selection Transfer learning in biomedical natural language processing: an evaluation of bert and elmo on ten benchmarking datasets Glove: global vectors for word representation Modern clinical text mining: a guide and review True few-shot learning with language models Deep contextualized word representations. In NAACL Language models as knowledge bases Erol Bahadroglu, Alec Peltekian, and Grégoire Altan-Bonnet. 2021. Scifive: a text-to-text transformer model for biomedical literature Inexpensive domain adaptation of pretrained language models: case studies on biomedical ner and covid-19 qa Event extraction across multiple levels of biological organization Overview of the cancer genetics and pathway curation tasks of bionlp shared task Overview of the infectious diseases (id) task of bionlp shared task Pre-trained models for natural language processing: a survey Learning transferable visual models from natural language supervision Improving language understanding by generative pre-training Exploring the limits of transfer learning with a unified text-to-text transformer Squad: 100,000+ questions for machine comprehension of text Zero-shot text-to-image generation Biomedical event extraction as sequence labeling Evaluating protein transfer learning with tape Msa transformer. bioRxiv Med-bert: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction Entity-enriched neural models for clinical question answering Radu Florian, and Salim Roukos. 2020. 
End-to-end qa on covid-19: domain adaptation with synthetic training Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences Lessons from natural language inference in the clinical domain Continual domain-tuning for pretrained language models Ethics of artificial intelligence in surgery Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter A gpt-2 language model for biomedical texts in portuguese BioBERTpt -A Portuguese Neural Language Model for Clinical Named Entity Recognition Pre-training of graph augmented transformers for medication recommendation Bioflair: pretrained pooled contextualized embeddings for biomedical sequence labeling tasks Incorporating domain knowledge into medical nli using knowledge graphs In-domain context-aware token embeddings improve biomedical named entity recognition Understanding medical conversations with scattered keyword attention and weak supervision from responses Biomegatron: larger biomedical domain language model Enhancing clinical concept extraction with contextual embeddings Overview of biocreative ii gene mention recognition Biosses: a semantic sentence similarity estimation system for the biomedical domain Deep learning for electronic health records: a comparative review of multiple deep neural architectures Summarizing medical conversations via identifying important utterances Classification of traditional chinese medicine cases based on character-level bert and deep learning Evaluation of dataset selection for pre-training and fine-tuning transformer language models for clinical question answering Natural language processing in medicine: an overview Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models Improving bert model using contrastive learning for biomedical relation extraction Investigation of bert model on biomedical relation extraction based on revised fine-tuning mechanism Feded: federated learning via ensemble distillation for medical relation extraction How to fine-tune bert for text classification Biomedical named entity recognition using bert in the machine reading comprehension framework Mitigating gender bias in natural language processing: literature review Clicr: a dataset of clinical case reports for machine reading comprehension Rapidly bootstrapping a question answering dataset for covid-19 Efficient transformers: a survey Biomedical relation extraction with pre-trained language representations and minimal task-specific architecture Introduction to the CoNll-2003 shared task: language-independent named entity recognition Deepeventmine: end-to-end neural nested event extraction from biomedical texts Bioasq: a challenge on large-scale biomedical semantic indexing and question answering Bayesian optimization is superior to random search for machine learning hyperparameter tuning: analysis of the black-box optimization challenge The russian drug reaction corpus and neural models for drug reactions and effectiveness detection in user reviews i2b2/VA challenge on concepts, assertions, and relations in clinical text The eu-adr corpus: annotated drugs, diseases, targets, and their relationships Attention is all you need Bertology meets biology: interpreting attention in protein language models Probing pretrained language models for lexical semantics Kamohara, and Yasushi Matsumura. 2020. 
Pre-training technique to localize medical bert and enhance biomedical bert Entity, relation, and event extraction with contextualized span representations Generating (factual?) narrative summaries of rcts: experiments with neural multi-document summarization Glue: a multi-task benchmark and analysis platform for natural language understanding On position embeddings in bert Encoding word order in complex embeddings Cord-19: the covid-19 open research dataset A bert-based named entity recognition in chinese electronic medical record Biomedical event extraction as multi-turn question answering Medsts: a resource for clinical semantic textual similarity The 2019 n2c2/ohnlp track on clinical semantic textual similarity: overview Clinical information extraction applications: a literature review Lbert: lexically aware transformer-based bidirectional encoder representation model for learning universal bio-entity relations Relation extraction from clinical narratives using pre-trained language models Task-oriented dialogue system for automatic diagnosis Protein language model embeddings for fast, accurate, alignment-free protein structure prediction Representation learning for electronic health records An effective domain adaptive post-training method for bert in response selection A broad-coverage challenge corpus for sentence understanding through inference Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface's transformers: state-of-the-art natural language processing Deep learning in clinical natural language processing: a methodical review Google's neural machine translation system: bridging the gap between human and machine translation Perturbed masking: parameter-free probing for analyzing and interpreting bert Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis Modeling Protein Using Large-scale Pretrain Language Model End-to-end knowledge-routed relational dialogue system for automatic diagnosis 2021. K-plug: knowledge-injected pre-trained language model for natural language understanding and generation in e-commerce Huanhuan Zhang, and Ping He. 2019. Fine-tuning bert for joint entity and relation extraction in chinese medical text Mining electronic health records (ehrs) a survey Medical knowledge-enriched textual entailment framework Prediction of rna-protein interactions using a nucleotide language model Prediction of RNA-protein interactions using a nucleotide language model Zhaochun Ren, and Huasheng Liang. 2021. Mˆ2-MedDialog: A Dataset and Benchmarks for Multi-domain Multi-service Medical Dialogues Federated machine learning: concept and applications On the generation of medical dialogues for covid-19 Clinical concept extraction using transformers Measurement of semantic textual similarity in clinical texts: comparison of transformer-based models Xlnet: generalized autoregressive pretraining for language understanding Sequence tagging for biomedical extractive question answering Pre-trained language model for biomedical question answering Biobert based named entity recognition in electronic medical record Songfang Huang, and Fei Huang. 2021. Improving biomedical pretrained language models with knowledge Coder: knowledge infused cross-lingual medical term embedding for term normalization Meddialog: a large-scale medical dialogue dataset Survey of natural language processing techniques in bioinformatics. 
Computational and mathematical methods in medicine 2015 Hurtful words: quantifying biases in clinical contextual word embeddings Cblue: a chinese biomedical language understanding evaluation benchmark Feng Gao, and Nengwei Hua. 2020. Conceptualized representation learning for chinese biomedical text mining Avirup Sil, and Todd Ward. 2020. Multi-stage pre-training for low-resource domain adaptation Multi-scale attentive interaction networks for chinese medical question answer selection Xin Jiang, and Qun Liu. 2020. Ternarybert: distillation-aware ultra-low bit bert Mie: a medical information extractor towards medical dialogues Dialogpt: large-scale generative pre-training for conversational response generation Pre-trained language model augmented adversarial training network for chinese clinical event detection Men also like shopping: reducing gender bias amplification using corpus-level constraints Dut-nlp at mediqa 2019: an adversarial multi-task network to jointly model recognizing question entailment and question answering Clinical concept extraction with contextual word embedding Discovering better model architectures for medical query understanding Aligning books and movies: towards story-like visual explanations by watching movies and reading books