key: cord-0144994-ejj016yc
authors: Haq, Hasham Ul; Kocaman, Veysel; Talby, David
title: Mining Adverse Drug Reactions from Unstructured Mediums at Scale
date: 2022-01-05
journal: nan
DOI: nan
sha: 056d735b9c370cb97282a2ec227f0300d7268f29
doc_id: 144994
cord_uid: ejj016yc

Adverse drug reactions / events (ADR/ADE) have a major impact on patient health and health care costs. Detecting ADRs as early as possible and sharing them with regulators, pharma companies, and healthcare providers can prevent morbidity and save many lives. While most ADRs are not reported via formal channels, they are often documented in a variety of unstructured conversations such as social media posts by patients, customer support call transcripts, or CRM notes of meetings between healthcare providers and pharma sales reps. In this paper, we propose a natural language processing (NLP) solution that detects ADRs in such unstructured free-text conversations, which improves on previous work in three ways. First, a new Named Entity Recognition (NER) model obtains new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores respectively). Second, two new Relation Extraction (RE) models are introduced - one based on BioBERT, the other utilizing crafted features over a Fully Connected Neural Network (FCNN) - which are shown to perform on par with existing state-of-the-art models, and to outperform them when trained with a supplementary clinician-annotated RE dataset. Third, a new text classification model, for deciding if a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). The complete solution is implemented as a unified NLP pipeline in a production-grade library built on top of Apache Spark, making it natively scalable and able to process millions of batch or streaming records on commodity clusters.

Adverse drug events are harmful side effects of drugs, comprising allergic reactions, overdose responses, and general unpleasant side effects. Approximately 2 million patients in the United States are affected each year by serious ADRs, resulting in roughly 100,000 fatalities (Leaman et al. 2010), making ADRs the fourth leading cause of death in the United States (Giacomini et al. 2007). Treatment related to ADRs has been estimated to cost $136 billion each year in the United States alone (van Der Hooft et al. 2006).

Finding all ADRs of a drug before it is marketed is not practical for several reasons. First, the number of human subjects going through clinical trials is often too small to detect rare ADRs. Second, many clinical trials are short-lasting while some ADRs take time to manifest. Third, some ADRs only show when a drug is taken together with other drugs, and not all drug-drug combinations can be tested during clinical trials. Fourth, drug repurposing or off-label usage can lead to unforeseen ADRs. As a result, detecting ADRs in drugs that are already being marketed is critical - a discipline known as post-marketing pharmacovigilance (Mammì et al. 2013). Schemes that allow hospitals, clinicians, and patients to report ADRs have existed for many years, but only a fraction of events get reported through them. A meta-analysis of 37 studies from 12 countries found that the median rate of under-reporting was 94% (Hazell and Shakir 2006).
This led to work on mining ADRs from alternative sources, such as social media posts by patients or healthcare providers (Bollegala et al. 2018). The outbreak of the COVID-19 pandemic has accelerated this trend of sharing such information (Cinelli et al. 2020); the size, variety, and instantaneous nature of social media provide opportunities for real-time monitoring of ADRs. Compared to traditional data sources like research publications, this data is more challenging to process, as it is unstructured and contains noise in the form of jargon, abbreviations, misspellings, and complex sentence structures.

Recent advancements in Natural Language Processing (NLP), in the form of Transformer-based architectures (Vaswani et al. 2017) like BERT (Devlin et al. 2018), have significantly pushed the boundaries of NLP capabilities. There is an increasing trend of training large models on domain-specific data, such as BioBERT (Lee et al. 2019), and these methods have proven to achieve state-of-the-art (SOTA) results for document understanding and named entity recognition (NER). However, since these methods require significant computational resources during both training and inference, it becomes impractical to apply them over large quantities of records in compute-restricted production environments.

Despite the growing interest and opportunities to process large quantities of data, models and software frameworks that can scale to leverage compute clusters are scarce. This restricts the ability to utilize available data from social media and other mediums - such as transcripts of customer service calls with patients, or CRM notes about sales and support discussions with clinicians - to its true potential. The availability of high volume, variety, and velocity of data presents the opportunity to develop NLP solutions that outperform the existing SOTA accuracy while also being easily scalable and computationally efficient.

The purpose of this study is to illustrate how an end-to-end system, based on the Apache Spark ecosystem and comprising novel NLP techniques, can be used to process large quantities of unstructured text to mine ADRs. This system has been implemented in a production-ready, widely deployed, and natively scalable library, and is thus capable of processing millions of records in either batch or streaming mode. The unified NLP pipeline includes new models for the three required sub-tasks: classifying text to decide if it is an indication of an ADR, recognizing named entities for reactions and drugs, and linking adverse events with drugs. The novel contributions of this paper are:

• The first scalable end-to-end system for mining ADRs from unstructured text, including Document Classification, Named Entity Recognition, and Relation Extraction models within a unified NLP pipeline.
• A new NER model for extracting reactions and drugs, whose accuracy outperforms previous SOTA models on public datasets for this task.
• New Relation Extraction models for linking reactions and drugs, which outperform previous SOTA models when trained with additional data that was annotated by clinicians as part of this effort.
• A new text classification model for deciding if a piece of text reports an ADR, whose accuracy outperforms previous SOTA models.
• A study of the utility of non-contextual lightweight embeddings (Mikolov et al. 2013) like GloVe (Pennington, Socher, and Manning 2014) instead of memory-intensive contextual embeddings like BioBERT for these tasks, comparing training times and accuracy improvements.
• A detailed analysis of all the solution components and datasets, explaining how the solution's modular structure can be customized to different data sources and runtimes.

The extraction of ADRs from unstructured text has received growing attention in the past few years due to the widespread adoption of Electronic Medical Records (EMR) and the ever-increasing number of users on social media who share their experiences. Existing work includes significant contributions both in information extraction methodology and in the availability of relevant pre-annotated datasets covering a variety of subtasks. The problem of ADR extraction gained visibility with the introduction of challenges like Social Media Mining for Health (SMM4H) (Weissenbacher and Gonzalez-Hernandez 2019) and the National Clinical NLP Challenges (n2c2) (Henry et al. 2020), which provide pre-annotated datasets for researchers to compete on. Other significant contributions to data collection include (Gurulingappa et al. 2012), which used the PubMed corpus to develop the ADE Corpus benchmark dataset, covering Classification, NER, and RE annotations for extracting and relating ADRs and drugs. Another work (Karimi et al. 2015) produced an NER dataset (CADEC) by collecting and annotating reviews and comments from forums.

Identification of text containing ADRs is formulated as a text classification problem, for which different techniques have been applied. (Huynh et al. 2016) used different variations of the Convolutional Neural Network (CNN) (e.g., CNN, CRNN, CNNA) to identify tweets containing ADRs on a Twitter dataset. More elaborate techniques, such as fine-tuning BERT models, have been applied to text classification as well (Kayastha, Gupta, and Bhattacharyya 2021).

A standard formulation for extracting drug and ADR mentions is NER, for which a number of architectures have been proposed. One classical approach is a BiLSTM (Graves and Schmidhuber 2005) architecture with Conditional Random Fields (CRF), as used by (Stanovsky, Gruhl, and Mendes 2017). This method is a shallow network that relies on word embeddings and part-of-speech tags to classify each token and extract ADR mentions. (Ge et al. 2020) added character-level embeddings to the same architecture to incorporate spelling features, and enriched the training dataset by annotating additional data from DBpedia to achieve SOTA results on the CADEC dataset, demonstrating the benefits of using additional data. Similar to our approach, they also built an extensive training framework over multiple nodes.

Relating ADR mentions to drugs is formulated as a relation extraction (RE) task, which comprises the creation and classification of relations between entities (Haq, Kocaman, and Talby 2021). Classical RE methods like (Fundel, Küffner, and Zimmer 2006) use lexical rules based on the dependency parse tree of the document. The introduction of Transformers allowed for more context-aware solutions, such as feeding entity spans and the document to a Transformer to predict relations (Soares et al. 2019). Recently, more elaborate approaches, like joint learning of both NER and RE, have proved to be even more beneficial.
For example, (Crone 2020) used a single base network to generate joint features, with separate BiRNN layers for NER and RE and skip connections between the NER and RE BiRNN layers, to achieve SOTA performance on RE. While existing work has focused on pushing the boundaries of accuracy, little work has been done on building a framework that can process large quantities of social media data accurately at scale. To achieve this, we develop separate architectures for all three tasks and place them in a single pipeline, allowing us to maintain a modular structure in which each component is developed and tested separately while common components (e.g., tokenization and embedding generation) are shared for scalability.

We divide the problem into three main tasks - Document Classification, Named Entity Recognition, and Relation Extraction - and design a distinct, scalable solution for each of them. Since NER plays the central role of identifying entity spans, we place all components in a single pipeline for an end-to-end solution. Figure 1 shows the complete pipeline built on the Apache Spark framework.

Figure 1: Overview of the complete architecture. All components are placed sequentially in a single pipeline; arrows represent the output of one stage being passed as input to the next stage.

As illustrated in the system diagram in Figure 1, Relation Extraction is heavily dependent on the NER model, as the latter provides the entity chunks that form the basic inputs of the RE model. Since NER requires token-level embeddings, we test different types of embeddings, namely GloVe (Pennington, Socher, and Manning 2014) and BERT (Devlin et al. 2018) based embeddings. This modular approach lets us keep the NER and RE architectures static while experimenting with different embedding types to analyze accuracy and performance differences. Given the nature of the data, we trained 200-dimensional GloVe embeddings on the PubMed and MIMIC datasets. For BERT embeddings we utilize the work of (Lee et al. 2019), namely BioBERT. In general, BERT embeddings provide more useful information due to being context-aware and handling out-of-vocabulary (OOV) tokens better.

To process large volumes of data, the text classification model needs to be both scalable and accurate, as it is used to filter out documents, reviews, and tweets that do not contain any indication of an adverse event. To achieve this, we use an FCNN model that does not require hand-crafted features and relies on a single embedding vector for classification. Given the conversational nature of social media text, we can embed the entire document (with minor text clipping in the case of BioBERT embeddings) and feed the resulting vector directly to the classifier model. Since the model's input is a single feature vector, we test multiple embedding techniques to analyze performance.

To extract ADR and other entities from text, we use our class-leading NER architecture, BiLSTM-CNN-Char. We build our NER model by taking the work of (Chiu and Nichols 2015) as the base model and making a few changes to the architecture based on our testing: removing lexical features such as POS tags and introducing new character-level features. We use a 1D convolution layer with 25 filters of kernel size 3 to generate token feature maps that encapsulate information such as spelling and casing. These additional features proved highly useful when dealing with spelling mistakes as well as out-of-vocabulary tokens.
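To make the token-level feature construction concrete, the following is a minimal PyTorch sketch of a BiLSTM-CNN-Char style encoder: character embeddings are passed through a 1D convolution (25 filters, kernel size 3, as described above), max-pooled per token, and concatenated with word embeddings before a BiLSTM layer. This is an illustrative reconstruction rather than the production implementation; the dimensions, vocabulary sizes, and the absence of a CRF layer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Illustrative BiLSTM-CNN-Char token encoder (not the production model)."""

    def __init__(self, word_vocab=50_000, char_vocab=100,
                 word_dim=200, char_dim=16, char_filters=25, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # 1D convolution over the characters of each token: 25 filters, kernel size 3
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_chars)
        b, t, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * t, c, -1).transpose(1, 2)
        # Max-pool the convolution output over the character dimension
        char_feats = self.char_cnn(chars).max(dim=2).values.view(b, t, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        encoded, _ = self.bilstm(tokens)  # (batch, seq_len, 2 * hidden)
        return encoded                    # fed to a token-level tag classifier

# Example: 2 sentences, 10 tokens each, up to 12 characters per token.
enc = CharCNNWordEncoder()
out = enc(torch.randint(1, 50_000, (2, 10)), torch.randint(1, 100, (2, 10, 12)))
print(out.shape)  # torch.Size([2, 10, 256])
```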
We also updated the architecture to use BlockFused-LSTM cells in our implementation for increased speed. Figure 2 shows the architecture of our NER model.

We treat Relation Extraction (RE) as a binary classification problem where each example is a pair of drug and ADR mentions in a given context, and develop two novel solutions: the first uses a simpler FCNN architecture for speed, and the second is based on the BioBERT architecture for accuracy. We experiment with both approaches and compare their results.

For our first RE solution, we rely on the entity spans and types identified by the NER model to develop distinct features that are fed to an FCNN for classification. We first generate pairs of adverse event and drug entities, and then generate custom features for each pair. These features include the semantic similarity of the entities, the syntactic distance between the two entities, the dependency structure of the entire document, embedding vectors of the entity spans, and embedding vectors for 100 tokens in the vicinity of each entity. Figure 3 explains our model architecture in detail. We then concatenate these features and feed them to fully connected layers with leaky ReLU activation. We also apply batch normalization after each affine transformation before the final softmax layer with a cross-entropy loss function. We use softmax cross-entropy instead of binary cross-entropy loss to keep the architecture flexible enough to scale to datasets with multiple relation types.

Our second solution targets higher accuracy, as well as the exploration of relations across long documents, and is based on (Soares et al. 2019). In our experiments we take checkpoints from the BioBERT model and train an end-to-end model for relation extraction. As in our first solution, we rely on entity spans and use the entire document as the context string while training the model. The original paper used a sequence length of 128 tokens for the context string, which we keep constant, and instead experiment with the context string, additional data, and fine-tuning techniques.

We test our models on three benchmark datasets: the SMM4H NER challenge (Weissenbacher and Gonzalez-Hernandez 2019), the ADE Corpus (Gurulingappa et al. 2012), and CADEC (Karimi et al. 2015). The SMM4H NER challenge is a yearly challenge based on annotated Twitter data. As this dataset is entirely based on tweets, it forms an ideal test bed for evaluating our model's performance on real-world data. The ADE Corpus is a benchmark dataset for classification, NER, and RE tasks, while the CADEC dataset is primarily used for classification and NER benchmarks only. For consistency with existing work, and in line with our primary goal of extracting ADRs and related drugs, we keep two entities in all datasets: ADE and Drug. Details of the NER datasets can be found in Table 1.

Since we treat the RE problem as binary classification, we need both positive and negative relations to train the model. A positive relation is defined when the drug and reaction entities are related in the context, while negative relations consist of drug-reaction pairs where the drug is not responsible for the particular reaction. For a document d, this can be formulated as:

Negatives(d) = {all (drug, reaction) pairs in d} \ {annotated (drug, reaction) pairs in d}

That is, from the ADE dataset we can sample negative relations by subtracting the annotated drug-reaction pairs from all drug-reaction pairs in the same document. The standard ADE Corpus does not have sufficient negative relations, raising the issue of class imbalance.
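As a sketch of this sampling scheme, the snippet below enumerates all drug-ADE pairs in a document and labels a pair positive only if it appears in the gold relation annotations; every remaining pair becomes a negative example. The document structure and field names are assumptions made for illustration, not the actual ADE Corpus schema.

```python
from itertools import product

def build_relation_examples(doc):
    """Generate labeled drug-ADE relation candidates for one document.

    `doc` is assumed (for this sketch) to look like:
      {"entities": [{"text": ..., "type": "Drug"|"ADE", "start": ..., "end": ...}, ...],
       "relations": {(drug_text, ade_text), ...}}   # gold positive pairs
    """
    drugs = [e for e in doc["entities"] if e["type"] == "Drug"]
    ades = [e for e in doc["entities"] if e["type"] == "ADE"]

    examples = []
    for drug, ade in product(drugs, ades):
        label = 1 if (drug["text"], ade["text"]) in doc["relations"] else 0
        examples.append({"drug": drug, "ade": ade, "label": label})
    return examples

doc = {
    "entities": [
        {"text": "naproxen", "type": "Drug", "start": 20, "end": 28},
        {"text": "ibuprofen", "type": "Drug", "start": 45, "end": 54},
        {"text": "stomach pain", "type": "ADE", "start": 70, "end": 82},
    ],
    "relations": {("naproxen", "stomach pain")},
}
for ex in build_relation_examples(doc):
    print(ex["drug"]["text"], "->", ex["ade"]["text"], "label:", ex["label"])
# naproxen -> stomach pain label: 1
# ibuprofen -> stomach pain label: 0
```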
To address this imbalance, we sampled and annotated 2000 notes from the 2018 n2c2 shared task on ADE and medication extraction in EHRs (Henry et al. 2020) to create a supplementary dataset for relations. We keep the same entities (i.e., Drug and ADE) while annotating, to align with our benchmark datasets. Also, to keep human bias to a minimum, we do not annotate entity spans; rather, we use the existing NER annotations to generate a dataset comprising Drug and ADE pairs, and only classify each relation based on its context.

Following previous work, we evaluate the models using 10-fold cross validation, and report macro- and micro-averaged precision, recall, and F1 scores. Exact experimental and evaluation settings for each stage are described below.

Table 5: Training and inference time (in seconds) taken by the NER model on each dataset using different token embeddings, with respect to overall performance on the test set. The epoch count was kept constant for all datasets during training. The experiment was performed on an 8-core machine with 64 GB of memory.

• For NER evaluation: under strict evaluation, a label is considered correct if the starting and ending tags exactly match the gold labels, while under relaxed evaluation only an overlap between annotations is required. Consequently, the 'O' tag is not included in the calculation. Hyperparameter values and training code are provided in Appendices A and B.

• For training the RE models, we use standard NER spans and binary labels. For our base RE model we use 200-dimensional token-level GloVe embeddings - the same embeddings we use for our base NER model. For our BERT-based RE model, we do not use any explicit embeddings, as the BERT model is trained in an end-to-end fashion. We do specify the details of the entity spans, such as start and end indices and entity types, and the context in between. The context is generally the entire document, but since the model architecture has a 128-token limit, we create the context text from the text between the two entities, and found this method to be more accurate.

We also test the hypothesis that fine-tuning the BioBERT model on similar relation extraction tasks increases overall performance on the benchmark datasets. To test this hypothesis, we trained an end-to-end RE model on disease and drug datasets such as the 2010 i2b2 challenge (Uzuner et al. 2011) and saved it. We then reuse the same weights while discarding the final layers, and retrain the model on the benchmark dataset. Since the base model is trained on a similar taxonomy, convergence is much faster and less prone to overfitting.

For hyperparameter tuning we use random search on the development set. Exact hyperparameter values and the search space for all models can be found in Appendix A.
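To illustrate the input construction for the BERT-based RE model, the sketch below builds a context string from the text between the two entity spans and clips it to the 128-token budget. The [E1]/[E2] marker tokens and the whitespace tokenization are simplifying assumptions for illustration; the actual model, which follows (Soares et al. 2019), may use a different marker and tokenization scheme.

```python
def build_re_input(text, ent1, ent2, max_tokens=128):
    """Build an (entity, context, entity) input string for a BERT-based RE model.

    `ent1` and `ent2` are dicts with `start`/`end` character offsets and `text`,
    as produced by the NER stage. The context is the text between the two
    entities, whitespace-tokenized and clipped to fit the 128-token limit.
    """
    first, second = sorted([ent1, ent2], key=lambda e: e["start"])
    between = text[first["end"]:second["start"]].split()

    # Reserve room for the two entity mentions and four marker tokens.
    budget = max_tokens - len(first["text"].split()) - len(second["text"].split()) - 4
    between = between[:max(budget, 0)]

    return " ".join(["[E1]", first["text"], "[/E1]", *between,
                     "[E2]", second["text"], "[/E2]"])

text = "The patient was started on naproxen and soon developed severe stomach pain."
drug = {"text": "naproxen", "start": 27, "end": 35}
ade = {"text": "stomach pain", "start": 62, "end": 74}
print(build_re_input(text, drug, ade))
# [E1] naproxen [/E1] and soon developed severe [E2] stomach pain [/E2]
```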
Despite using a shallow architecture for classification, we achieved metrics that are on par with SOTA by using the more accurate Sentence-BERT embeddings, as shown in Table 3. While the performance difference between BioBERT and GloVe embeddings is minor on the CADEC dataset, the difference is more prominent on the ADE dataset, primarily because of the complex intrinsic nature of biomedical text.

Our NER architecture achieves new SOTA metrics on the SMM4H, ADE, and CADEC NER datasets using contextual BioBERT embeddings, as shown in Table 4. Since the NER model was kept constant during the experiment, and we tuned the hyperparameters for each experiment, the performance difference between embedding types can be attributed to the word embeddings alone. Being able to incorporate contextual information through the attention mechanism, BioBERT embeddings performed better than non-contextual GloVe embeddings. However, it is worth noting that the performance difference between the two is within a margin of 1-2%, showing that domain-specific GloVe embeddings can provide comparable performance while requiring significantly less memory and computational resources. Table 5 provides a side-by-side comparison of the time and accuracy differences when using the different embedding types. On average, the GloVe embeddings are 30% faster than BioBERT embeddings during training, and more than 5x faster during inference, while being on par in terms of F1 score.

Our RE solutions perform on par with existing SOTA systems, while being scalable and requiring less memory to train and test. The introduction of the extra data greatly improved results, enabling us to achieve SOTA on the benchmark datasets, as shown in Table 6. While the heavier BioBERT model outperformed our proposed RE model on the limited and imbalanced ADE dataset, the performance difference becomes negligible when more data is added to the training set. Sample output and a visualization of the NER and RE results can be seen in Table 7 and Figure 4.

Despite the growing need and the explosion of useful data for pharmacovigilance, there is a severe deficiency of production-ready NLP systems that can process millions of records while being accurate and versatile. In this study we address this problem by introducing novel solutions for Classification, NER, and RE, leveraging the Spark ecosystem, with a focus on accuracy, scalability, and versatility. We explain how we build a modular structure comprising different embedding types, a classification model, an NER model, and two approaches for RE. We trained a custom GloVe embeddings model on a domain-specific dataset and compare its performance to SOTA BioBERT embeddings. We show through extensive testing that our text classification model, for deciding if a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). Our proposed NER architecture achieves SOTA results on multiple benchmark datasets; namely, our NER models obtain new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores respectively). We then describe two different architectures for RE, one based on BioBERT and the other utilizing crafted features over an FCNN, test them individually, and show that the simpler RE architecture with bespoke features performs on par with the more sophisticated BERT-based solution. To improve our RE model, we built a new dataset through manual annotation, and achieved higher metrics on the RE test datasets. Furthermore, we performed speed benchmarks comparing the efficiency of the two distinct embedding generation models to determine the ideal choice for deploying such solutions to process large quantities of data. In general, most pharmaceutical companies run on-premises servers that are geared towards general computation and do not utilize hardware acceleration such as GPUs for running heavy models. In such cases, where the infrastructure is not mature enough to handle heavy models, lightweight GloVe-based models are a compelling alternative to BERT-based models, as they offer comparable performance while being memory- and CPU-efficient. Finally, we implemented all these algorithms in the Apache Spark ecosystem for scalability, and shipped them in a production-grade NLP library: Spark NLP.
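As a rough illustration of how such a pipeline is assembled, the snippet below wires up open-source Spark NLP stages for document assembly, tokenization, word embeddings, NER, and chunk conversion in PySpark. The pretrained models loaded here are the library's generic public defaults, used only to show how stages are chained; the ADE-specific classification, NER, and relation extraction models described in this paper would be slotted into the same pipeline structure in place of these generic stages.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Generic public models for illustration only; the production pipeline uses
# domain-specific embeddings and an ADE/Drug NER model instead.
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained() \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
chunks = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, tokenizer, embeddings, ner, chunks])

data = spark.createDataFrame(
    [["After taking naproxen I developed severe stomach pain."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("ner_chunk.result").show(truncate=False)
```

Because the fitted pipeline operates on Spark DataFrames, the same code can be applied to batch or streaming inputs and scales from a single machine to a commodity cluster.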
The following parameters provided the best results on the classification development set (values within parentheses represent the parameter ranges tested):
• Dropout rate: 0.

Code for training an RE model is provided as a Google Colab notebook (JSL 2021).

References
• Entity-Level Classification of Adverse Drug Reaction: A Comparative Analysis of Neural Network Models
• Causality patterns for detecting adverse drug reactions from social media: text mining approach. JMIR Public Health and Surveillance
• The COVID-19 social media infodemic
• Deeper Task-Specificity Improves Joint Entity and Relation Extraction. CoRR
• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
• FedNER: Privacy-preserving Medical Named Entity Recognition with Federated Learning
• Framewise phoneme classification with bidirectional LSTM and other neural network architectures
• Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports
• Deeper Clinical Document Understanding Using Relation Extraction
• Under-reporting of adverse drug reactions
• n2c2 shared task on adverse drug events and medication extraction in electronic health records
• Adverse Drug Reaction Classification With Deep Neural Networks
• JSL. 2021. Training Code for RE
• CADEC: A corpus of adverse drug event annotations
• BERT based Adverse Drug Effect Tweet Classification
• Biomedical Named Entity Recognition at Scale. CoRR
• Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts in health-related social networks
• BioBERT: a pre-trained biomedical language representation model for biomedical text mining
• Pharmacovigilance in pharmaceutical companies: An overview
• Efficient Estimation of Word Representations in Vector Space
• GloVe: Global Vectors for Word Representation
• Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. CoRR
• MedNLI - a natural language inference dataset for the clinical domain
• Social media and pharmacovigilance: a review of the opportunities and challenges
• Matching the Blanks: Distributional Similarity for Relation Learning
• Recognizing Mentions of Adverse Drug Reaction in Social Media Using Knowledge-Infused Recurrent Models
• 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text
• Adverse drug reaction-related hospitalisations
• Attention Is All You Need
• Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
• A Partition Filter Network for Joint Entity and Relation Extraction