key: cord-1028411-0zkhz71p
authors: Pattisapu, Nikhil; Anand, Vivek; Patil, Sangameshwar; Palshikar, Girish; Varma, Vasudeva
title: Distant Supervision for Medical Concept Normalization
date: 2020-08-09
journal: J Biomed Inform
DOI: 10.1016/j.jbi.2020.103522
sha: 7b03934655dbb0671184b498ba4607ba7ea7c675
doc_id: 1028411
cord_uid: 0zkhz71p

We consider the task of Medical Concept Normalization (MCN), which aims to map informal medical phrases such as “loosing weight” to formal medical concepts, such as “Weight loss”. Deep learning models have shown high performance across various MCN datasets containing a small number of target concepts along with an adequate number of training examples per concept. However, scaling these models to millions of medical concepts entails the creation of much larger datasets, which is costly and effort-intensive. Recent works have shown that training MCN models using automatically labeled examples extracted from medical knowledge bases partially alleviates this problem. We extend this idea by computationally creating a distant dataset from patient discussion forums. We extract informal medical phrases and medical concepts from these forums using a synthetically trained classifier and an off-the-shelf medical entity linker respectively. We use pretrained sentence encoding models to find the k-nearest phrases corresponding to each medical concept. These mappings are used in combination with the examples obtained from medical knowledge bases to train an MCN model. Our approach outperforms the previous state-of-the-art by 15.9% and 17.1% classification accuracy across two datasets while avoiding manual labeling.

Medical social media is a subset of social media focusing on medical topics. It primarily includes medical forums, blogs, and tweets. It draws participation from a variety of cohorts such as patients, caretakers, consultants, doctors, pharmacists, researchers, and journalists [1]. Recent studies have shown that medical social media can be leveraged to find the side effects of a particular drug [2], detect the spread of infectious diseases, monitor public health [3] and understand a patient's experience in healthcare [4] [5]. However, automatically identifying such insights is challenging due to the lexical and grammatical variability of the language used in social media, which contains informal language, non-standard grammar, and typographic errors. Due to the varied backgrounds and expertise levels of its users, medical social media also contains non-standard medical terminology, jargon and abbreviations [6]. The task of Medical Concept Normalization (MCN) aims to map a variable length phrase to a medical concept in some external coding system. Table 1 provides a few examples of social media phrases mapped to medical concepts. (We use italics to denote medical phrases and typewriter font to denote medical concepts.) MCN has several applications in improving patient care, such as the understanding and answering of patients' questions, early detection of patients requiring immediate attention, and digital disease surveillance [7].

Table 1: Examples of mappings between social media phrases and medical concepts. SNOMED CT is a medical knowledge base.
Social media phrase: the same gassy feeling; Medical concept: Bloating (SNOMED ID: 60728008)

Medical entity linking, which is closely related to MCN, links medical entity mentions in the text with their corresponding entities in a medical Knowledge Base (KB) such as SNOMED CT [8]. While both tasks share the primary goal of disambiguation of a sequence of words, there exist a few substantial differences. Medical entity linking operates on medical entity mentions which have an entity type such as Disease, Drug, Symptom, Treatment, and Test, whereas MCN operates on phrases which may or may not have an entity type. For instance, the phrase “cant shut up for the whole day” is not recognized as a medical entity by any medical entity recognizer and yet is mapped to Hyperactive Behavior (SNOMED ID: 44548000) by MCN models [9].

Furthermore, MCN, unlike medical entity linking, allows the mapping between the source text and target concept to be loosely defined. For instance, “no way i 'm gettin any sleep 2nite” could be mapped to Insomnia (SNOMED ID: 193462001). Medical entity linking uses the context of an entity mention, along with the information from the medical KB, to decide which entity is being referred to in the text [10]. Lack of such information makes MCN a relatively more challenging task than medical entity linking.

Most of the current deep learning models (Section 2) formulate MCN as a supervised text classification problem. This formulation has several major shortcomings. 

Existing approaches for MCN can be divided into three major categories. The first category formulates MCN as a monolingual translation problem, where the task is to map informal phrases (L1) to formal medical concepts (L2). Limsopatham et al. [11] adapt phrase-based machine translation [12] to translate medical phrases from Twitter language to formal medical language. During inference, the output of the machine translation model is mapped to one of the concepts in medical lexicons based on the ranked similarity of their word vector representations. Similar to [11], statistical machine translation based techniques have also been used to normalize non-medical phrases [13] [14] [15].

The second category poses MCN as a supervised text categorization problem, in which an informal medical phrase is categorized into one of the predefined categories wherein each category represents a unique concept in a medical lexicon. Limsopatham et al. [9] use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) along with pretrained word embeddings for normalizing medical concepts. Their approach outperforms lexical matching and machine translation based approaches by up to 44%. Lee [19] propose the use of the multilayered bidirectional transformer encoder BERT [20] to extract the vector representation of input phrases, which are subsequently used to train a softmax classifier.

The third type of MCN methods project the informal phrases and formal medical concepts into a common embedding space. During inference, cosine similarity based ranking is used to retrieve the most similar concept to a given input phrase. Metke-Jimenez et al. [21] use TF-IDF representation to retrieve relevant medical concepts corresponding to paraphrased concept mentions. In our recent work [6] we propose a RoBERTa [22] based neural model which maps the input phrase and medical concepts into a common embedding space. We leverage existing medical knowledge bases (such as SNOMED CT) to first obtain embeddings for each medical concept using various text and graph based embeddings. Subsequently, we train a neural model to map informal phrases into the target embedding space. We have shown that this approach outperforms all the existing approaches and achieves an improvement of up to 6.3% compared to the previous state-of-the-art [19]. For the purposes of this work, we regard [6] as the state-of-the-art for this task.

Most of the recent works [6, 9, 11, 17, 18, 19] use the CADEC dataset for experimentation, while some of these [6, 19] also use the PsyTAR dataset. Both the CADEC and PsyTAR datasets were created from the patient discussion forum askapatient.com. A few works [9, 17, 18, 19] use the TwADR-L dataset, which consists of medical phrases extracted from Twitter and mapped to medical concepts in the SIDER 4 medical lexicon. Almost all recent works [6, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21] use classification accuracy as the primary metric for measuring the performance of MCN models.

The main drawback of previous approaches such as [7], [9] and [19] is that they rely on the availability of pairs of medical phrases and concepts for training MCN models. Creating such a training dataset entails the painstaking task of manually identifying social media phrases and mapping them to one of the concepts in a medical lexicon. In our prior work [6] we used two techniques to overcome this challenge. First, we encoded all target medical concepts into a common embedding space using a variety of text and graph embedding methods and thereafter transformed an input phrase into a vector in the target embedding space. This allowed our model to map phrases even to those medical concepts which were not present in the training set. Second, we generated labeled examples for MCN using SNOMED CT synonyms by treating each synonym as a medical phrase. We have shown that our model trained exclusively on SNOMED CT synonyms demonstrates reasonable performance when compared to other approaches.

architecture as proposed in our prior work (Section 3.1). Our main contribution in this work is the automatic creation of a distantly supervised MCN dataset, which is discussed in Section 3.2. For the purposes of this work, we use our prior work [6] as a baseline.

In this section, we describe our previous approach for MCN. We used a two-stage approach for normalizing medical concept mentions. In the first stage, all medical concepts from a target lexicon (such as SNOMED CT) are encoded into fixed-size embeddings using a variety of text and graph based embedding methods such that similar medical concepts are closer in the target embedding space. In the second stage, each input phrase is transformed into a vector m_i using a pretrained RoBERTa model [22], which is then transformed into a vector r_i in the target embedding space using the two-layer feed-forward neural network shown in Equations 1 and 2, where W_w, b_w, W_r, b_r and the weight matrices of the RoBERTa model are trainable parameters. All parameters are trained using the AdamW [23] stochastic optimizer, with the objective of maximizing the cosine similarity between the transformed vector r_i and the corresponding target embedding. During inference, a new input phrase m_j is mapped to a vector r_j in the target concept embedding space, which is then assigned to a concept using the 1-NN (nearest neighbour) method.
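The second stage and the inference step can be illustrated with a minimal sketch. It is not the implementation from [6]: the ReLU activation between the two layers, the identity weights, the 2-dimensional vectors and the helper names (`dense`, `project`, `cosine_loss`, `nearest_concept`) are all assumptions made for illustration, and the RoBERTa encoder that produces m_j is replaced by a hand-written toy vector.

```python
import math

def dense(x, W, b):
    """Affine layer: W @ x + b, with W given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Element-wise ReLU (the activation is an assumption, not taken from the paper)."""
    return [max(0.0, a) for a in v]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def project(m, W_w, b_w, W_r, b_r):
    """Two-layer feed-forward map from a phrase vector m to a vector r."""
    return dense(relu(dense(m, W_w, b_w)), W_r, b_r)

def cosine_loss(r, target):
    """Minimizing 1 - cos(r, target) is one way to maximize cosine similarity."""
    return 1.0 - cosine(r, target)

def nearest_concept(r, concept_embeddings):
    """1-NN classification under cosine similarity."""
    return max(concept_embeddings, key=lambda c: cosine(r, concept_embeddings[c]))

# Toy data: hypothetical 2-d target embeddings and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
concepts = {"Insomnia": [1.0, 0.0], "Bloating": [0.0, 1.0]}
m_j = [0.9, 0.1]  # stands in for the encoder output of an input phrase
r_j = project(m_j, identity, zeros, identity, zeros)
predicted = nearest_concept(r_j, concepts)
```

Because the concepts are represented only through their embeddings, a concept that never appeared in training can still be returned by `nearest_concept`, which is the property the two-stage design relies on.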

We now describe the process of obtaining target embeddings for medical concepts using text and graph based embedding methods. For text embedding methods, we extract the concept description of each target concept ID by doing a lookup in the SNOMED CT knowledge base. Figure 1. This graph is given as input to graph embedding algorithms such as Deepwalk [27], Node2Vec [28], HARP [29] and LINE [30], each of which returns an embedding for every vertex in the graph.
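As a rough illustration of how a graph embedding method consumes such a graph, Deepwalk-style algorithms begin by sampling truncated random walks from every vertex and then feed the walks to a skip-gram model as if they were sentences. The sketch below covers only the walk-sampling step. The vertex labels, edges, and the `random_walks` helper are hypothetical and are not taken from SNOMED CT or from the paper.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample uniform random walks of at most walk_length vertices from every vertex."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start vertices in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Hypothetical undirected toy graph over four concept vertices.
toy_graph = {
    "c1": ["c2", "c3"],
    "c2": ["c1"],
    "c3": ["c1", "c4"],
    "c4": ["c3"],
}
walks = random_walks(toy_graph, walk_length=5, walks_per_node=2)
```

Vertices that co-occur in many walks end up with similar vectors after the skip-gram step, which is how graph proximity turns into proximity in the target embedding space.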

Both text and graph label embedding methods aim to map medical concepts into vectors, such that similar concepts are closer in the target embedding space. These embeddings are made publicly accessible at https://zenodo.org/record/3842143. Figure 2

select pairs with cosine similarity greater than a pre-designated threshold and discard the remaining pairs. Although this approach seems reasonable, we do not use it as it might introduce label imbalance in our dataset. For instance, common concepts such as Pain might have a high representation in our dataset when compared to rare concepts such as Aarskog syndrome, which is undesirable.

extracting noun phrases from a news corpus [33] (a non-medical corpus) using the Stanford CoreNLP library [32]. We use a binary Support Vector Machine (SVM) with a radial basis function kernel to categorize the social media phrases, where each phrase is represented by the averaged word vector of the words present in it. The embeddings for each word were obtained from the pretrained Word2Vec model [26].
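The phrase representation and the kernel used by such a classifier can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-dimensional word vectors are toy values rather than Word2Vec embeddings, skipping out-of-vocabulary words is an assumed policy, and the `gamma` value and function names are hypothetical. Training the SVM itself is left to a library and is not shown.

```python
import math

def average_embedding(phrase, word_vectors, dim):
    """Average the vectors of the in-vocabulary words of a phrase.

    Out-of-vocabulary words are skipped (an assumption); a phrase with no
    known words maps to the zero vector.
    """
    vectors = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Toy vocabulary with hypothetical 2-d vectors.
toy_vectors = {"gassy": [1.0, 0.0], "feeling": [0.0, 1.0]}
phrase_vec = average_embedding("the same gassy feeling", toy_vectors, dim=2)
self_similarity = rbf_kernel(phrase_vec, phrase_vec, gamma=0.5)
```

An RBF-kernel SVM scores a new phrase by comparing its averaged vector to the support vectors through `rbf_kernel`, so phrases whose averaged vectors lie close to known medical phrases receive similar decisions.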

We use MetaMap [34] as an off-the-shelf medical entity linker. It identifies concepts from text and links them to entries of KBs such as SNOMED CT and MeSH [35]. MetaMap assigns one or more semantic types, such as sign or symptom and disease or syndrome, to each identified concept. An existing study shows that MetaMap exhibits high precision and poor recall when used to extract concepts from medical social media posts [36]. In this work, precision is of paramount importance for generating high quality distant data. Therefore, we further improve the precision using the following two heuristics. First, medical entities which correspond to one of the 18 major semantic types described by [36] are retained, while others are discarded. Second, the most frequent incorrect mappings are pruned using a manually created list. For instance, the generic word Hi, which is highly frequent, is wrongly identified as a disease

dataset from these mappings which were made publicly accessible.

PsyTAR: The Psychiatric Treatment Adverse Reactions (PsyTAR) corpus [38] is an MCN dataset created from askapatient.com. In this dataset, all the posts which mention the medications Cymbalta, Effexor, Lexapro and Zoloft were obtained. Similar to the CADEC dataset, human experts were asked to manually identify medical phrases and map them to SNOMED CT concepts, which resulted in 6,556 medical phrases mapped to 618 concepts. Miftahutdinov et al. [19] created a five-fold dataset from these mappings, which was made publicly accessible. In our prior work [6] we used the publicly accessible folds of the CADEC and PsyTAR datasets to evaluate the performance of our MCN model. For a fair comparison, we use the same folds to evaluate the performance of our current approach.

The SNOMED CT lexicon consists of a collection of medical concepts wherein each medical concept has a unique SNOMED ID. Every concept in this lexicon is associated with its fully specified name and its synonyms. Currently, SNOMED CT contains over 350,000 medical concepts. Table 4 shows a few examples of medical concepts obtained from SNOMED CT. In our prior work [6] we leveraged this resource for training our MCN model. We extracted the synonyms of medical concepts present in the PsyTAR and CADEC datasets and created an automatically labeled dataset by treating each synonym as a medical phrase.

that it also contains some noisy labels, for instance, the phrase heart burn is wrongly associated with the medical concept Abdominal discomfort with a cosine similarity of 0.5194. We observed that, compared to the SNOMED CT synonyms dataset, this dataset has better coverage of phrases containing slang words, non-standard terminology, acronyms and spelling errors.
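The pairing of forum phrases with concepts by cosine similarity can be sketched in a few lines. The example below is an illustration rather than the authors' implementation: it contrasts a global similarity threshold with selecting the k nearest phrases for each concept, using hypothetical 2-dimensional vectors, placeholder phrase names, and assumed values for `threshold` and `k`.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def threshold_pairs(phrase_vecs, concept_vecs, threshold):
    """Keep every (phrase, concept) pair whose similarity exceeds the threshold."""
    return [(p, c)
            for c, cv in concept_vecs.items()
            for p, pv in phrase_vecs.items()
            if cosine(pv, cv) > threshold]

def k_nearest_pairs(phrase_vecs, concept_vecs, k):
    """Keep, for each concept, its k most similar phrases."""
    pairs = []
    for c, cv in concept_vecs.items():
        ranked = sorted(phrase_vecs,
                        key=lambda p: cosine(phrase_vecs[p], cv),
                        reverse=True)
        pairs.extend((p, c) for p in ranked[:k])
    return pairs

# Hypothetical toy vectors: three phrases lie near "Pain", one near the rare concept.
concept_vecs = {"Pain": [1.0, 0.0], "Aarskog syndrome": [0.0, 1.0]}
phrase_vecs = {
    "phrase_a": [1.0, 0.1],
    "phrase_b": [0.9, 0.2],
    "phrase_c": [0.8, 0.1],
    "phrase_d": [0.5, 0.6],
}
by_threshold = Counter(c for _, c in threshold_pairs(phrase_vecs, concept_vecs, 0.9))
by_k_nearest = Counter(c for _, c in k_nearest_pairs(phrase_vecs, concept_vecs, 2))
```

On this toy data the threshold rule yields three examples for the common concept and none for the rare one, while the k-nearest rule yields two for each. The price of the balanced selection is that a concept with few good matches is still paired with its best available phrases, which is one plausible source of noisy labels like the one described above.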

Our main objective in this work is to build an MCN model which does not use manually labeled examples. We therefore use the SNOMED CT synonyms dataset and the distantly supervised dataset described in Section 4 to train our model. We do not use the training folds of the CADEC and PsyTAR datasets. Our baseline is the MCN model proposed in our prior work [6], which is trained using the SNOMED CT synonyms dataset.

To train our MCN model, we initialize its parameters (Section 3.

evaluating the performance of both the models we compute the average classification accuracy across the test folds. Table 6 shows the comparison of our approach with the baseline [6] across multiple target embedding methods. Our model outperforms the baseline by a significant margin. Our model achieves the best classification accuracy of 74.39% on PsyTAR and 76.72% on CADEC datasets, whereas the best accuracy using the baseline is 63.81% and 65.48% respectively. The best performance across both datasets and approaches was achieved using graph embedding methods (Deepwalk and HARP). Across all text embedding methods, we find that Universal Sentence Encoder (USE) consistently gives better performance.

7. Analysis and Discussion

We find that the performance improvement across target embedding methods is inconsistent. We attribute this to the suitability of the target embedding for this task.

We observe that the improvement is higher for low performing embeddings (such as ELMo) as compared to high performing embeddings (such as HARP). Additionally, we observe that the proposed approach improves the performance while reducing the standard deviation across multiple target embedding methods. In a way, distant supervision reduces the impact of the choice of target embedding method. We also find that the performance improvement is not consistent across datasets.

We also observe that the performance improvement is not consistent across all target embedding methods. We attribute this to the quality of target embeddings and the nature of the evaluation dataset.

an input phrase to a correct target concept which does not match with the corresponding human labeled concept. We find this to be the most common error type, Table 8 lists

Although our model shows relatively better performance across lengthy phrases, there is still a large scope for improvement. In general, we observed that distant supervision boosts the performance of our model in most cases. Figure 5 shows the t-SNE representation of medical concepts and phrase embeddings obtained using the baseline and the proposed approach. We observe that the MCN model trained using distant supervision maps the medical phrases closer to their corresponding concepts.

In this work, we address the problem of Medical Concept Normalization (MCN) in social media, which aims to map a variable length social media phrase to a medical concept in some external coding system. Most of the current approaches pose MCN as a supervised text classification task, which requires a large number of (medical phrase, concept) pairs for training. Creating such a training dataset entails the painstaking task of manually identifying social media phrases and mapping them to one of the concepts in a medical lexicon. In our prior work [6], we leveraged medical knowledge bases (such as SNOMED CT) to extract synonyms of medical concepts and used the (synonym, concept) pairs as training data. Our model trained on the SNOMED CT synonyms dataset alone shows reasonable accuracy across multiple datasets. In this work, we extend this idea further by augmenting the SNOMED CT dataset with an automatically constructed distantly supervised dataset created from patient discussion forum posts. Experimental results across multiple datasets show that the MCN model trained using the proposed approach outperforms the baseline by a significant margin.

In the future, we would like to increase the coverage of medical concepts in our distant data by obtaining multiple posts from a variety of medical social media forums. In our current work, we use generic pretrained models such as Universal Sentence Encoders to measure the similarity between medical phrases and concepts. In the future, we would like to use the model trained on the SNOMED CT synonym dataset to select (medical phrase, concept) pairs. Our current work explores the use of several text and graph based embeddings. It would be interesting to experiment with heterogeneous label embeddings. We would also want to experiment with graph convolutional networks [39] to obtain target embeddings for medical concepts. In our current work, we use an AvgEmb based approach to extract medical phrases (Section 3.2.1) from patient discussion forums. It would be interesting to see if we can further improve the performance of our medical phrase extractor (and thereby the performance of our MCN model) by using other sentence encoders discussed in Section 3.1.

Medical persona classification in social media

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

Social media mining for public health monitoring and surveillance

Twitter social media is an effective tool for breast cancer patient education and support: patient-reported outcomes by survey

Take two aspirin and tweet me in the morning: how twitter, facebook, and other social media are reshaping health care

Medical concept normalization by encoding target knowledge

Medical concept normalization for online user-generated texts

Snomed-ct: The advanced terminology and coding system for ehealth

Normalising medical concepts in social media texts by learning semantic representation

Entity linking with a knowledge base: Issues, techniques, and solutions

Adapting phrase-based machine translation to normalise medical terms in social media messages

Statistical phrase-based translation

A character-level machine translation approach for normalization of sms abbreviations

Text normalization based on statistical machine translation and internet user support

Statistical machine translation based text normalization with crowdsourcing

Using an ensemble of generalised linear and deep learning models in the smm4h 2017 medical concept normalisation task

Medical concept normalization in social media posts with recurrent neural networks

Multi-task character-level attentional networks for medical concept normalization

Deep neural models for medical concept normalization in user-generated texts

Pre-training of deep bidirectional transformers for language understanding

Concept extraction to identify adverse drug reactions in medical forums: A comparison of algorithms

A robustly optimized bert pretraining approach

Decoupled weight decay regularization

Deep contextualized word representations

Efficient estimation of word representations in vector space

Deepwalk: Online learning of social representations

Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining

Harp: Hierarchical representation learning for networks

Line: Large-scale information network embedding

Visualizing data using t-sne

The stanford corenlp natural language processing toolkit

The reuters corpus volume 1-from yesterday's news to tomorrow's language resources

Effective mapping of biomedical text to the umls metathesaurus: the metamap program

Medical subject headings (mesh)

Extracting medical concepts from medical social media with clinical nlp tools: a qualitative study

Cadec: A corpus of adverse drug event annotations

A systematic approach for developing a corpus of patient reported adverse drug events: a case study for ssri and snri medications

Semi-supervised classification with graph convolutional networks

Nikhil Pattisapu: Conceptualization, Writing - Original draft preparation, Methodology, Software. Vivek Anand: Data curation, Resources, Visualization. Sangameshwar Patil: Validation, Formal Analysis, Writing - Reviewing and Editing. Girish Palshikar: Investigation, Funding acquisition. Vasudeva Varma: Supervision.

• Medical Concept Normalization maps informal phrases to formal medical concepts.
• Training MCN models requires creating labeled datasets, which is effort intensive.
• We overcome this problem by creating an automatically labeled dataset from social media.
• We augment our distant dataset with examples extracted from the SNOMED CT lexicon.
• Our approach significantly outperforms the previous state-of-the-art on two datasets.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. We have no conflicts of interest to disclose.