key: cord-0557844-7km9uq1k
authors: Zhang, Taolin; Cai, Zerui; Wang, Chengyu; Qiu, Minghui; Yang, Bite; He, Xiaofeng
title: SMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining
date: 2021-08-20
journal: nan
DOI: nan
sha: e0a93234771ca530fce77e8d8822757f28316313
doc_id: 557844
cord_uid: 7km9uq1k

Recently, the performance of Pre-trained Language Models (PLMs) has been significantly improved by injecting knowledge facts that enhance their language understanding abilities. Background knowledge sources are especially useful in the medical domain, since the massive number of medical terms and their complicated relations are difficult to understand from text alone. In this work, we introduce SMedBERT, a medical PLM trained on large-scale medical corpora that incorporates deep structured semantic knowledge from the neighbors of linked entities. In SMedBERT, a mention-neighbor hybrid attention is proposed to learn heterogeneous-entity information, which infuses the semantic representations of entity types into the homogeneous neighboring-entity structure. Apart from integrating knowledge as external features, we propose to employ the neighbors of linked entities in the knowledge graph as additional global contexts of text mentions, allowing mentions to communicate via shared neighbors and thus enriching their semantic representations. Experiments demonstrate that SMedBERT significantly outperforms strong baselines on various knowledge-intensive Chinese medical tasks. It also improves the performance of other tasks such as question answering, question matching and natural language inference.

* T. Zhang and Z. Cai contributed equally to this work. † Corresponding author. The code and pre-trained models will be available at https://github.com/MatNLP/SMedBERT.

Pre-trained Language Models (PLMs) learn effective context representations with self-supervised tasks and have boosted the performance of a wide range of NLP tasks (Wang et al., 2019a; Nan et al., 2020; Liu et al., 2020a). In addition, Knowledge-Enhanced PLMs (KEPLMs) (Zhang et al., 2019; Liu et al., 2020b; Wang et al., 2019b) further benefit language understanding by grounding these PLMs with high-quality, human-curated knowledge facts, which are difficult to learn from raw texts. In the literature, a majority of KEPLMs (Zhang et al., 2020a; Hayashi et al., 2020) inject information about the entities corresponding to mention-spans from Knowledge Graphs (KGs) into contextual representations. However, these KEPLMs utilize the linked entities only as auxiliary information and pay little attention to the structured semantics of the neighborhood around the entity linked to each text mention. In the medical context, there is complicated domain knowledge, such as relations and medical facts among medical terms (Rotmensch et al., 2017; Li et al., 2020), that is difficult to model with previous approaches.

To address this issue, we leverage the structured semantic knowledge in medical KGs from two aspects. (1) Rich semantic information from the neighboring structures of linked entities, such as entity types and relations, is highly useful for medical text understanding. As in Figure 1, "新型冠状病毒" (novel coronavirus) can be the cause of many diseases, such as "肺炎" (pneumonia) and "呼吸综合征" (respiratory syndrome). (Although we focus on Chinese medical PLMs here, the proposed method can easily be adapted to other languages, which is beyond the scope of this work.) (2) Additionally, we leverage the neighbors of linked entities as global "contexts" to complement the plain-text contexts used in (Mikolov et al., 2013a; Pennington et al., 2014).
The structured knowledge contained in neighboring entities can act as a "knowledge bridge" between mention-spans, facilitating the interaction of different mention representations. Hence, PLMs can learn better representations for rare medical terms.

In this paper, we introduce SMedBERT, a KEPLM pre-trained over large-scale medical corpora and medical KGs. To the best of our knowledge, SMedBERT is the first PLM in the medical domain with structured semantic knowledge injected. Specifically, the contributions of SMedBERT mainly include two modules:

Mention-neighbor Hybrid Attention: We fuse the embeddings of the nodes and types of linked-entity neighbors into contextual target mention representations. The type-level and node-level attentions help to learn the importance of entity types and of individual neighboring entities, respectively, in order to reduce the knowledge noise injected into the model. The type-level attention turns the homogeneous node-level attention into a heterogeneous learning process over neighboring entities.

Mention-neighbor Context Modeling: We propose two novel self-supervised learning tasks for promoting interaction between mention-spans and their corresponding global contexts, namely masked neighbor modeling and masked mention modeling. The former enriches the representations of "context" neighboring entities based on well-trained "target word" mention-spans, while the latter gathers that information back from the neighboring entities to masked targets such as low-frequency mention-spans, which are otherwise poorly represented (Turian et al., 2010).

In the experiments, we compare SMedBERT against various strong baselines, including mainstream KEPLMs pre-trained over our medical resources. The underlying medical NLP tasks include named entity recognition, relation extraction, question answering, question matching and natural language inference. The results show that SMedBERT consistently outperforms all the baselines on these tasks.

PLMs in the Open Domain. PLMs have gained much attention recently, proving successful in boosting the performance of various NLP tasks. Early works on PLMs focus on feature-based approaches that transform words into distributed representations (Collobert and Weston, 2008; Mikolov et al., 2013b; Pennington et al., 2014; Peters et al., 2018). BERT (Devlin et al., 2019), as well as its robustly optimized version RoBERTa (Liu et al., 2019b), employs bidirectional transformer encoders (Vaswani et al., 2017) and self-supervised tasks to generate context-aware token representations. Further performance improvements are mostly based on three types of techniques: self-supervised tasks (Joshi et al., 2020), transformer encoder architectures, and multi-task learning (Liu et al., 2019a).

Knowledge-Enhanced PLMs. As existing BERT-like models only learn knowledge from plain corpora, various works have investigated how to incorporate knowledge facts to enhance the language understanding abilities of PLMs. KEPLMs are mainly divided into the following three types. (1) Knowledge-enhanced by Entity Embedding: ERNIE-THU (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) inject linked entities as heterogeneous features learned by KG embedding algorithms such as TransE (Bordes et al., 2013).
(2) Knowledge-enhanced by Entity Description: E-BERT (Zhang et al., 2020a) and KEPLER (Wang et al., 2019b) add extra descriptive text of entities to enhance semantic representations. (3) Knowledge-enhanced by Triplet Sentence: K-BERT (Liu et al., 2020b) and CoLAKE convert triplets into sentences and insert them into the training corpora without using pre-trained embeddings. Previous studies on KG embedding (Nguyen et al., 2016; Schlichtkrull et al., 2018) have shown that utilizing the surrounding facts of an entity yields more informative embeddings, which is the focus of our work.

PLMs in the Medical Domain. PLMs in the medical domain can be generally divided into three categories. For example, MC-BERT (Zhang et al., 2020b) masks Chinese medical entities and phrases to learn complex structures and concepts, and DiseaseBERT (He et al., 2020) leverages medical terms and their categories as labels to pre-train the model. In this paper, we utilize both domain corpora and the neighboring entity triplets of mentions to enhance the learning of medical language representations.

In the PLM, we denote the hidden features of the tokens $\{w_1, \ldots, w_N\}$ as $\{h_1, h_2, \ldots, h_N\}$, where $N$ is the maximum input sequence length, and the total number of pre-training samples is $M$. Let $\mathbb{E}$ be the set of mention-spans $e_m$ in the training corpora. Furthermore, the medical KG consists of the entity set $E$ and the relation set $R$. A triplet is denoted as $(h, r, t)$, where $h$ is the head entity connected by relation $r$ to the tail entity $t$. The embeddings of entities and relations trained on the KG by TransR (Lin et al., 2015) are denoted as $\Gamma_{ent}$ and $\Gamma_{rel}$, respectively. The neighboring entity set recalled from the KG for $e_m$ is denoted as $N_{e_m} = \{e^1_m, e^2_m, \ldots, e^K_m\}$, where $K$ is the threshold of our PEPR algorithm. We denote the number of entities in the KG as $Z$. The dimensions of the hidden representations in the PLM and of the KG embeddings are $d_1$ and $d_2$, respectively.

The main architecture of our model is shown in Figure 2. SMedBERT mainly includes three components: (1) Top-K entity sorting determines which $K$ neighboring entities to use for each mention. (2) Mention-neighbor hybrid attention aims to infuse the structured semantic knowledge into the encoder layers; it includes type attention, node attention and a gated position infusion module. (3) Mention-neighbor context modeling, which includes masked neighbor modeling and masked mention modeling, aims to encourage mentions to leverage and interact with neighboring entities.

Previous research shows that naive neighboring-entity expansion may induce knowledge noise during PLM training (Wang et al., 2019a). In order to recall the most important neighboring entities from the KG for each mention, we extend the Personalized PageRank (PPR) (Page et al., 1999) algorithm, named PEPR, to filter out trivial entities. Recall that the iterative process in PPR is $V_i = (1-\alpha) A \cdot V_{i-1} + \alpha P$, where $A$ is the normalized adjacency matrix, $\alpha$ is the damping factor, $P$ is the uniformly distributed jump probability vector, and $V_i$ is the iterative score vector over entities. PEPR specifically focuses on learning the weight of the target mention-span in each iteration. It assigns the span $e_m$ a higher jump probability of 1 in $P$, with the remaining entries set to $\frac{1}{Z}$. It also uses entity frequencies to initialize the score vector $V$: the entry for $e_m$ is set to $\frac{t_{e_m}}{T}$, where $T$ is the sum of the frequencies of all entities and $t_{e_m}$ is the frequency of $e_m$ in the corpora. After sorting, we select the top-K entity set $N_{e_m}$.
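As a concrete illustration of the sorting step, the sketch below shows one way the PEPR iteration and top-K selection could be implemented with NumPy. It is a minimal sketch under stated assumptions, not the released code: the function name, the damping factor, the number of iterations, and the restriction of the final ranking to direct neighbors of the linked entity are illustrative choices.

```python
import numpy as np


def pepr_top_k(adj, freq, entity_idx, k=10, alpha=0.15, n_iter=20):
    """Rank candidate neighbors of the entity linked to a mention-span and keep
    the top-K, following the PEPR iteration described above.  `adj` is a (Z, Z)
    normalized adjacency matrix of the KG and `freq` holds the corpus frequency
    of every entity.  Hyper-parameter defaults are illustrative assumptions."""
    z = adj.shape[0]
    # Jump probability vector P: the linked entity gets weight 1, the rest 1/Z
    # (renormalized here so that P sums to one).
    p = np.full(z, 1.0 / z)
    p[entity_idx] = 1.0
    p = p / p.sum()
    # Score vector V initialized from relative entity frequencies t_e / T.
    v = freq / freq.sum()
    for _ in range(n_iter):                 # V_i = (1 - alpha) * A @ V_{i-1} + alpha * P
        v = (1.0 - alpha) * adj @ v + alpha * p
    # Keep the direct neighbors of the linked entity, sorted by their PEPR score.
    neighbors = np.flatnonzero(adj[entity_idx])
    ranked = neighbors[np.argsort(-v[neighbors])]
    return ranked[:k]
```

In this sketch the top-K list is drawn from the direct neighbors of the linked entity; the text above only specifies that the K highest-scoring neighboring entities are kept, so the restriction is an implementation choice.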
Besides the embeddings of neighboring entities, SMedBERT integrates the type information of medical entities to further enhance the semantic representations of mention-spans. Different types of neighboring entities may have different impacts. Given a specific mention-span $e_m$, we compute the neighboring entity type attention. Concretely, we calculate a hidden representation $h_\tau \in \mathbb{R}^{d_2}$ for each entity type $\tau$. The mention-span representation $h_{e_m} \in \mathbb{R}^{d_1}$ is generated by self-attentive pooling $f_{sp}$ (Lin et al., 2017) over $(h_i, h_{i+1}, \ldots, h_j)$, the hidden representations of the tokens $(w_i, w_{i+1}, \ldots, w_j)$ in mention-span $e_m$ produced by the PLM, i.e., $h_{e_m} = f_{sp}(h_i, h_{i+1}, \ldots, h_j)$. The transformed representation $h'_{e_m} \in \mathbb{R}^{d_2}$ is obtained as $h'_{e_m} = \mathrm{LN}(\sigma(h_{e_m} W_{be}))$, where $\sigma(\cdot)$ is the GELU activation function (Hendrycks and Gimpel, 2016), $W_{be} \in \mathbb{R}^{d_1 \times d_2}$ is a learnable projection matrix, and LN is the LayerNorm function (Ba et al., 2016). Then, we calculate each type attention score $\alpha'_\tau$ from the type representation $h_\tau$ and the transformed mention-span representation $h'_{e_m}$. Finally, the neighboring entity type attention weights $\alpha_\tau$ are obtained by normalizing the scores $\alpha'_\tau$ over all entity types $\mathcal{T}$.

Apart from entity type information, different neighboring entities also have different influences. Specifically, we devise the neighboring entity node attention to capture the different semantic influences of neighboring entities on the target mention-span and to reduce the effect of noise. We calculate the entity node attention from the mention-span representation $h'_{e_m}$ and the representation $h_{e^i_m}$ of each neighboring entity with entity type $\tau$, where $W_q \in \mathbb{R}^{d_2 \times d_2}$ and $W_k \in \mathbb{R}^{d_2 \times d_2}$ are the attention weight matrices. The representations of all neighboring entities in $N_{e_m}$ are then aggregated into $\hat{h}_{e_m} \in \mathbb{R}^{d_2}$, the mention-neighbor representation produced by the hybrid attention module.

Knowledge-injected representations may divert the text from its original meaning. We therefore further reduce knowledge noise via gated position infusion: a span-level infusion representation is built by concatenation, yielding $h'_{e_{mf}} \in \mathbb{R}^{d_1}$, the final knowledge-injected representation for mention $e_m$. From it we generate the output token representations $h_{if}$, where $W_{ug}, W_{ex} \in \mathbb{R}^{2d_1 \times d_1}$ and $b_{ug}, b_{ex} \in \mathbb{R}^{d_1}$ are learnable parameters and "$*$" denotes element-wise multiplication.

To fully exploit the structured semantic knowledge in the KG, we further introduce two novel self-supervised pre-training tasks, namely Masked Neighbor Modeling (MNeM) and Masked Mention Modeling (MMeM). Formally, let $r$ be the relation between the mention-span $e_m$ and a neighboring entity $e^i_m$. Here, $h_{mf}$ denotes the mention-span hidden features computed from the token representations $h_{if}, h_{(i+1)f}, \ldots, h_{jf}$, $h_r = \Gamma_{rel}(r) \in \mathbb{R}^{d_2}$ is the representation of relation $r$, and $W_{sa} \in \mathbb{R}^{d_1 \times d_2}$ is a learnable projection matrix. The goal of MNeM is to leverage the structured semantics of surrounding entities while preserving the knowledge of relations between entities. Consider the objective function of skip-gram with negative sampling (SGNS) (Mikolov et al., 2013a) and the score function of TransR (Lin et al., 2015), where $w$ in $\mathcal{L}_S$ is the target word of context $c$ and $f_s$ is the compatibility function measuring how well the target word fits the context. Inspired by SGNS and following the general energy-based framework (LeCun et al., 2006), we treat mention-spans in the corpora as "target words", and neighbors of the corresponding entities in the KG as "contexts" that provide additional global contexts.
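Before turning to the MNeM objective, the following PyTorch sketch illustrates one way the hybrid attention module described above could be wired together. It is a minimal sketch under stated assumptions rather than the released implementation: mean pooling stands in for the self-attentive pooling $f_{sp}$, the exact parametrization of the type-level scores and of the gated infusion (the modules `w_type`, `back`, `w_ug`, `w_ex`) is assumed, and the dimensions follow $d_1 = 768$ and $d_2 = 200$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MentionNeighborHybridAttention(nn.Module):
    """Hedged sketch of type-level + node-level attention with gated infusion."""

    def __init__(self, d1=768, d2=200):
        super().__init__()
        self.w_be = nn.Linear(d1, d2)                # project the mention-span into KG space
        self.ln = nn.LayerNorm(d2)
        self.w_q = nn.Linear(d2, d2, bias=False)     # node-level attention projections
        self.w_k = nn.Linear(d2, d2, bias=False)
        self.w_type = nn.Linear(d2, d2, bias=False)  # type-level attention (assumed form)
        self.back = nn.Linear(d2, d1)                # map aggregated neighbors back to d1
        self.w_ug = nn.Linear(2 * d1, d1)            # gating parameters (assumed form)
        self.w_ex = nn.Linear(2 * d1, d1)

    def forward(self, span_tokens, neigh_emb, type_emb):
        # span_tokens: (L, d1) PLM hidden states of the mention-span tokens
        # neigh_emb:   (K, d2) TransR embeddings of the top-K neighbors
        # type_emb:    (K, d2) embedding of each neighbor's entity type
        h_em = span_tokens.mean(dim=0)                       # stand-in for self-attentive pooling
        h_em_p = self.ln(F.gelu(self.w_be(h_em)))            # h'_em in R^{d2}

        a_type = torch.softmax(self.w_type(type_emb) @ h_em_p, dim=0)          # type weights
        a_node = torch.softmax(self.w_k(neigh_emb) @ self.w_q(h_em_p), dim=0)  # node weights
        w = a_type * a_node                                  # hybrid attention weights
        w = w / w.sum()
        h_agg = (w.unsqueeze(-1) * neigh_emb).sum(dim=0)     # aggregated neighbor representation

        # Gated infusion: mix the aggregated knowledge back into every token state.
        k = self.back(h_agg).expand_as(span_tokens)
        gate = torch.sigmoid(self.w_ug(torch.cat([span_tokens, k], dim=-1)))
        extra = torch.tanh(self.w_ex(torch.cat([span_tokens, k], dim=-1)))
        return span_tokens + gate * extra                    # knowledge-injected token states
```

In pre-training, such a module would be applied to every recognized mention-span; the paper inserts it after the tenth transformer encoder layer (see Appendix B).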
For MNeM, we employ Sampled-Softmax (Jean et al., 2015) as the criterion $\mathcal{L}_{MNeM}$ for the mention-span $e_m$, where $\theta$ denotes the triplet $(e_m, r, e^i_m)$ with $e^i_m \in N_{e_m}$, $\theta'$ denotes the negative triplets $(e_m, r, e_n)$, and $e_n$ is a negative entity sampled with $Q(e^i_m)$ as detailed in Appendix B. To preserve the knowledge of relations between entities, we define the compatibility function $f_s$ with a scale factor $\mu$. Assuming that the norms of both $h_{mf} M_r + h_r$ and $h_{e^i_m} M_r$ are 1, we have

$$f_s(e_m, r, e^i_m) = \mu \iff f_{tr}(h_{mf}, h_r, h_{e^i_m}) = 0,$$

which indicates that the proposed $f_s$ is equivalent to $f_{tr}$. Because $\|h_{e_n} M_r\|$ needs to be calculated for each $e_n$, computing the score function $f_s$ directly is costly; hence, we transform part of the formula of $f_s$ accordingly.

In contrast to MNeM, MMeM transfers the semantic information of neighboring entities back to the masked mention $e_m$. Here, $Y_m$ is the ground-truth representation of $e_m$ with $h_{ip} = \Gamma_p(w_i) \in \mathbb{R}^{d_2}$, where $\Gamma_p$ is the embedding of BERT pre-trained on our medical corpora, and $h_{mf}$ is the mention-span representation obtained by our model. For a sample $s$, the loss $\mathcal{L}_{MMeM}$ is calculated via the mean squared error between $Y_m$ and $h_{mf}$ over $M_s$, the set of mentions in sample $s$.

In SMedBERT, the training objectives mainly consist of three parts, including the self-supervised losses proposed in previous works and the mention-neighbor context modeling losses proposed in our work. Our model can be applied directly to medical text pre-training in different languages, as long as high-quality medical KGs are available. The total loss is

$$\mathcal{L} = \mathcal{L}_{EX} + \lambda_1 \mathcal{L}_{MNeM} + \lambda_2 \mathcal{L}_{MMeM},$$

where $\mathcal{L}_{EX}$ is the sum of the sentence-order prediction (SOP) (Lan et al., 2020) and masked language modeling losses, and $\lambda_1$ and $\lambda_2$ are hyper-parameters.

Pre-training Data. The pre-training corpora after pre-processing contain 5,937,695 text segments with 3,028,224,412 tokens (4.9 GB). The KG embeddings are trained with TransR (Lin et al., 2015) on two trusted data sources, Symptom-In-Chinese from OpenKG and DXY-KG, containing 139,572 and 152,508 entities, respectively. The numbers of triplets in the two KGs are 1,007,818 and 3,764,711. The pre-training corpora and the KGs are further described in Appendix A.1.

Task Data. We use four large-scale datasets from ChineseBLUE (Zhang et al., 2020b), a benchmark of Chinese medical NLP tasks, to evaluate our model. Additionally, we test models on four datasets from real application scenarios provided by the DXY company and CHIP, i.e., Named Entity Recognition (DXY-NER), Relation Extraction (DXY-RE, CHIP-RE) and Question Answering (WebMedQA (He et al., 2019)). For further information on the downstream datasets, we refer readers to Appendix A.2.

In this work, we compare SMedBERT with general PLMs, domain-specific PLMs, and KEPLMs with knowledge embeddings injected that are pre-trained on our Chinese medical corpora:

General PLMs: We use three Chinese BERT-style models, namely BERT-base (Devlin et al., 2019), BERT-wwm (Cui et al., 2019) and RoBERTa (Liu et al., 2019b). All the weights are initialized from (Cui et al., 2020).

Domain-specific PLMs: As very few PLMs in the Chinese medical domain are available, we consider the following models. MC-BERT (Zhang et al., 2020b) is pre-trained over Chinese medical corpora by masking tokens of different granularities. We also pre-train BERT using our corpora, denoted as BioBERT-zh.
KEPLMs: We employ two SOTA KEPLMs continually pre-trained on our medical corpora as baseline models, namely ERNIE-THU (Zhang et al., 2019) and KnowBERT (Peters et al., 2019). For a fair comparison, KEPLMs that use additional resources other than KG embeddings are excluded (see Section 2), and all baseline KEPLMs are injected with the same KG embeddings. The detailed parameter settings and training procedure are given in Appendix B.

To evaluate the semantic representation ability of SMedBERT, we design an unsupervised semantic similarity task. Specifically, we extract all entity pairs with equivalence relations in the KGs as positive pairs. For each positive pair, we use one entity as the query and the other as the positive candidate, which is also used to sample other entities as negative candidates. We denote this dataset as D1. Besides, the entities in a positive pair often have many neighbors in common; we select positive pairs with large proportions of common neighbors as D2. Additionally, to verify the ability of SMedBERT to enhance low-frequency mention representations, we extract all positive pairs with at least one low-frequency mention as D3. In total, there are 359,358, 272,320 and 41,583 samples for D1, D2 and D3, respectively. We describe the details of data collection and word embedding in Appendix C.

In this experiment, we compare SMedBERT with three types of models: classical word embedding methods (SGNS (Mikolov et al., 2013a), GloVe (Pennington et al., 2014)), PLMs and KEPLMs. We compute the similarity between the representation of each query entity and those of all other entities and retrieve the most similar one. The evaluation metric is top-1 accuracy (Acc@1). Experimental results are shown in Table 1. From the results, we observe that: (1) SMedBERT greatly outperforms all baselines, especially on dataset D2 (+1.36%), where most positive pairs share many neighbors, demonstrating the ability of SMedBERT to utilize semantic information from the global context. (2) On dataset D3, SMedBERT improves the performance significantly (+1.01%), indicating that our model effectively enhances the representations of low-frequency mentions.

We first evaluate our model on the NER and RE tasks, which are closely related to entities in the input texts. Table 2 shows the performance on the medical NER and RE tasks. From the results, we observe that: (1) Compared with PLMs trained on open-domain corpora, KEPLMs using medical corpora and knowledge facts achieve better results. (2) SMedBERT improves greatly over the strongest baseline on the two NER datasets (+0.88%, +2.07%) and on the RE tasks (+0.68%, +0.92%). We also evaluate SMedBERT on QA, QM and NLI tasks; the performance is shown in Table 3. SMedBERT improves performance consistently on these datasets (+0.90% on QA, +0.89% on QM and +0.63% on NLI). In general, Table 2 and Table 3 show that injecting domain knowledge, especially structured semantic knowledge, greatly improves the results.

In this experiment, we explore the model performance on NER and RE tasks under different entity hit ratios, which control the proportion of knowledge-enhanced mention-spans in the samples. The average number of mention-spans per sample is about 40. Figure 3 illustrates the performance of SMedBERT and ERNIE-med (Zhang et al., 2019).
From the results, we can observe that: (1) The performance improves significantly at the beginning and then stays stable as the hit ratio increases, showing that the heterogeneous knowledge is beneficial for language understanding, while injecting too many knowledge facts does not further improve performance due to knowledge noise (Liu et al., 2020b). (2) Compared with previous approaches, SMedBERT improves performance more, and more stably.

We further evaluate the model performance under different K on the test sets of DXY-NER and DXY-RE. Figure 4 shows the results with K = {5, 10, 20, 30}. In our settings, SMedBERT achieves the best performance on the different tasks around K = 10. The results show that model performance first increases and then decreases as K grows. This phenomenon again points to the knowledge noise problem: injecting too much knowledge from neighboring entities may hurt performance.

Table 4: Ablation study of SMedBERT on four datasets (test set). Due to space limitations, we use the abbreviations "D5", "D6", "D7", and "D8" to represent the cMedQANER, DXY-NER, CHIP-RE, and DXY-RE datasets, respectively.

In Table 4, we choose three important model components for our ablation study and report the test-set performance on four NER and RE datasets that are closely related to entities. Specifically, the three model components are the neighboring entity type attention, the whole hybrid attention module, and the mention-neighbor context modeling, which consists of the two losses $\mathcal{L}_{MNeM}$ and $\mathcal{L}_{MMeM}$. From the results, we can observe that: (1) Even without any one of the three mechanisms, our model still performs competitively with the strong baseline ERNIE-med (Zhang et al., 2019). (2) Removing the hybrid attention module causes the largest performance decline, which indicates that injecting rich heterogeneous knowledge of neighboring entities is effective.

In this work, we address medical text mining tasks with SMedBERT, the proposed KEPLM with structured semantics. We inject the entity type information of neighboring entities into the node attention mechanism via a heterogeneous feature learning process. Moreover, we treat neighboring entity structures as additional global contexts, predicting masked candidate entities from mention-spans and vice versa. The experimental results show significant improvements of our model on various medical NLP tasks as well as in the intrinsic evaluation. Two research directions can be further explored: (1) injecting deeper knowledge by using "farther" neighboring entities as contexts; (2) further enhancing the semantic representations of Chinese medical long-tail entities.

The four large-scale ChineseBLUE datasets are subsampled to form corresponding smaller versions for our experiments. DXY-NER and DXY-RE are datasets from real medical application scenarios provided by a prestigious Chinese medical company. DXY-NER contains 22 unique entity types, and DXY-RE contains 56 relation types. These two datasets are collected from the medical forum of DXY and from books in the medical domain. Annotators are selected from junior and senior students with a clinical medical background. For quality control, the two datasets are annotated twice by different groups of annotators.
When the annotation results are inconsistent, an expert with a medical background manually performs an additional quality check; when they are consistent, a sampling-based quality check is performed. Table 5 shows the dataset sizes used in our experiments.

Hyper-parameters. $d_1$ = 768, $d_2$ = 200, $K$ = 10, $\mu$ = 10, $\lambda_1$ = 2, $\lambda_2$ = 4.

Model Details. We align all mention-spans to entities in the KG by exact match, for comparison purposes with ERNIE-THU (Zhang et al., 2019). The negative sampling function is defined as $Q(e^i_m) = \frac{t_{e^i_m}}{C_{e^i_m}}$, where $C_{e^i_m}$ is the sum of the frequencies of all mentions with the same type as $e^i_m$. The mention-neighbor hybrid attention module is inserted after the tenth transformer encoder layer to compare with KnowBERT (Peters et al., 2019), while the mention-neighbor context modeling is performed on the output of the BERT encoder. We use the base versions of all PLMs in the experiments. The size of SMedBERT is 474MB, of which 393MB are BERT components; the additional 81MB consists mostly of the KG embeddings. Results are averaged over 5 runs with different random seeds and the same hyper-parameters.

Training Procedure. We strictly follow the original pre-training process and parameter settings of the other KEPLMs; we only adapt their publicly available code from English to Chinese and use the knowledge embeddings trained on our medical KGs. For a fair comparison, the pre-training of SMedBERT is mostly set following ERNIE-THU (Zhang et al., 2019), without the layer-specific learning rates of KnowBERT (Peters et al., 2019). We pre-train SMedBERT on the collected medical data for only 1 epoch.

References

Layer normalization. CoRR.
SciBERT: A pretrained language model for scientific text.
Translating embeddings for modeling multi-relational data.
A unified architecture for natural language processing: deep neural networks with multitask learning.
Revisiting pre-trained models for Chinese natural language processing.
Pre-training with whole word masking for Chinese BERT. CoRR, abs/1906.08101.
Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation.
Knowledge enhanced contextual word representations.
Deep contextualized word representations.
Pre-trained models for natural language processing: A survey.
Learning a health knowledge graph from electronic medical records.
Modeling relational data with graph convolutional networks.
CoLAKE: Contextualized language and knowledge embedding.
Word representations: A simple and general method for semi-supervised learning. In ACL.
Attention is all you need.
Improving natural language inference using external knowledge in the science questions domain. In AAAI.
KEPLER: A unified model for knowledge embedding and pre-trained language representation. CoRR.
String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.
CLUECorpus2020: A large-scale Chinese corpus for pre-training language model. CoRR.
XLNet: Generalized autoregressive pretraining for language understanding. In NIPS.
Feng Gao, and Nengwei Hua. 2020b. Conceptualized representation learning for Chinese biomedical text mining. CoRR.
Chinese medical question answer matching using end-to-end character-level multi-scale CNNs.
ERNIE: Enhanced language representation with informative entities.

Acknowledgments. We would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000904, and by Alibaba Group through the Alibaba Research Intern Program.
The pre-training corpora are crawled from the DXY BBS (Bulletin Board System), a very popular Chinese social network for doctors, medical institutions, life scientists, and medical practitioners. The BBS has more than 30 channels, containing 18 forums and 130 fine-grained groups, and covers most medical domains. For our pre-training purpose, we crawl texts from channels about clinical medicine, pharmacology, public health and consulting. For text pre-processing, we mainly follow the methods of previous work. Additionally, (1) we remove all URLs, HTML tags, e-mail addresses, and all tokens except characters, digits, and punctuation; (2) all documents shorter than 256 tokens are discarded, while documents longer than 512 tokens are cut into shorter text segments.

The DXY knowledge graph is constructed by extracting structured text from the DXY website, which includes information on diseases, drugs and hospitals edited by certified medical experts, so the quality of the KG is guaranteed. The KG is mainly disease-centered, including in total 3,764,711 triplets, 152,508 unique entities, and 44 relation types. Details of Symptom-In-Chinese are available from OpenKG. After fusing the two KGs, we obtain 26 entity types, 274,163 unique entities, 56 relation types, and 4,390,726 triplets.

We choose the four large-scale datasets of the ChineseBLUE tasks (Zhang et al., 2020b) for evaluation.

Since the KGs used in this paper are directed graphs, we first transform the directed "等价关系" (equivalence relation) pairs into undirected pairs and discard the duplicated pairs. For each positive pair, we use the head and the tail as the query in turn and sample negative candidates based on the other entity. Specifically, we randomly select 19 negative entities that have the same type as the ground-truth entity and a Jaro-Winkler similarity (Winkler, 1990) greater than 0.6 with it. To construct Dataset-2, we select from Dataset-1 the positive pairs whose head and tail neighbor sets have a Jaccard index (Jaccard, 1912) of at least 0.75 and at least 3 common elements. For Dataset-3, we count the frequency of all entity mentions in the pre-training corpora and treat mentions with frequency no more than 200 as low-frequency mentions.

We train character-level and word-level embeddings using the SGNS (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) models, respectively, on our medical corpora with open-source toolkits. We average the character embeddings of all tokens in a mention to obtain its character-level representation. However, since some mentions are very rare in the corpora, we use the character-level representation as the word-level representation for such mentions. BERT-like Representation Embedding: We extract the token hidden features of the last layer and average the representations of the input tokens, excluding the [CLS] and [SEP] tags, to obtain a vector for each entity. Similarity Measure: We tried both the inverse of the L2 distance and cosine similarity as the measurement and found that cosine similarity always performs better. Hence, we report all experimental results under the cosine similarity metric.
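As a concrete illustration of this evaluation protocol, the sketch below computes Acc@1 from mention embeddings. The pooling follows the description above (averaging last-layer token vectors while dropping [CLS] and [SEP]); the function names and the convention that the ground-truth candidate occupies a fixed row in each candidate matrix are assumptions made for the example, not details from the released evaluation code.

```python
import numpy as np


def mention_embedding(last_hidden, seq_len):
    """Average the last-layer token vectors of an encoded mention, dropping the
    [CLS] and [SEP] positions.  `last_hidden` is the (max_len, hidden) output of
    a BERT-style encoder and `seq_len` is the true length of the tokenized
    mention including the two special tokens."""
    return last_hidden[1:seq_len - 1].mean(axis=0)


def acc_at_1(queries, candidates, gold_index=0):
    """Top-1 retrieval accuracy under cosine similarity.  `queries` is a list of
    query vectors; `candidates[i]` is a (20, hidden) matrix holding the
    ground-truth equivalent entity plus its 19 sampled negatives.  Placing the
    gold entity at row `gold_index` is an illustrative convention."""
    hits = 0
    for q, cand in zip(queries, candidates):
        q = q / np.linalg.norm(q)
        cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
        hits += int(np.argmax(cand @ q) == gold_index)
    return hits / len(queries)
```

The same routine can be reused for the character-level, word-level and BERT-style representations by swapping in the corresponding embedding function, which keeps the comparison across embedding types consistent.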