Deeper Clinical Document Understanding Using Relation Extraction

Hasham Ul Haq, Veysel Kocaman, David Talby

December 25, 2021

Abstract

The surging amount of biomedical literature and digital clinical records presents a growing need for text mining techniques that can not only identify but also semantically relate entities in unstructured data. In this paper we propose a text mining framework comprising Named Entity Recognition (NER) and Relation Extraction (RE) models, which expands on previous work in three main ways. First, we introduce two new RE model architectures -- an accuracy-optimized one based on BioBERT and a speed-optimized one utilizing crafted features over a Fully Connected Neural Network (FCNN). Second, we evaluate both models on public benchmark datasets and obtain new state-of-the-art F1 scores on the 2012 i2b2 Clinical Temporal Relations challenge (F1 of 73.6, +1.2% over the previous SOTA), the 2010 i2b2 Clinical Relations challenge (F1 of 69.1, +1.2%), the 2019 Phenotype-Gene Relations dataset (F1 of 87.9, +8.5%), the 2012 Adverse Drug Events Drug-Reaction dataset (F1 of 90.0, +6.3%), and the 2018 n2c2 Posology Relations dataset (F1 of 96.7, +0.6%). Third, we show two practical applications of this framework -- for building a biomedical knowledge graph and for improving the accuracy of mapping entities to clinical codes. The system is built using the Spark NLP library, which provides a production-grade, natively scalable, hardware-optimized, trainable and tunable NLP framework.

Biomedical literature has witnessed an exponential rise in the past decade. MEDLINE currently holds more than 26 million records from 5639 publications, and has indexed more than 5 million records in the past seven years alone (Yadav et al. 2020). Furthermore, public databases like https://clinicaltrials.gov have seen an explosion of trials data in the aftermath of the novel Covid-19 outbreak. In addition, the widespread adoption of Electronic Health Records (EHRs) has made copious amounts of free-text data available in digital format. This unstructured data is usually documented by healthcare professionals during the course of patient care, in the form of clinical notes, discharge summaries, lab reports, and pathology reports (Wei et al. 2019). While publications and literature are growing rapidly, structured knowledge that can be easily processed by computer programs remains scarce.

Relation Extraction becomes even more pertinent in biomedical research as it can provide the critical links required to generate knowledge graphs for better analysis and research, and even text summarization. Relating entities also helps improve medical coding by enriching vanilla entity chunks with surrounding information to obtain more accurate codes.

Relation extraction is generally regarded as a classification problem where entity pairs -- usually identified by NER models -- are classified to determine their relationship type in a given context. These models are trained to identify semantic relations between recognized entities, as illustrated in Figure 1. Since the classification implicitly relies on context, transformer-based models like BERT (Devlin et al. 2018) have been shown to outperform traditional methods based on dependency parsing. Recently, there has also been an increasing trend of jointly training large BERT models on NER and RE tasks with shared layers and features (Wang and Lu 2020).
However, even in joint learning, the RE classification is still contingent upon the entity spans identified by the NER model. While the trend of training large transformer models continues, applying them to large datasets remains a challenge as they require significant computational resources. Furthermore, long documents containing a high number of entity spans can quadratically increase the number of candidate entity pairs for RE classification -- requiring significantly more resources and processing time.

In this study we focus on three major aspects of RE: the model architectures and their scalability, the evaluation of the models on benchmark datasets, and the training and use of RE for general use-cases. We also study the application of RE for understanding different aspects of clinical documents, such as extracting and relating dates to place a patient's data on a timeline, or parsing and understanding trial results on large cohorts for analysis. The novel contributions of this paper are:

• Introducing two new RE architectures.
• Evaluating and comparing the performance of the proposed models on benchmark datasets.
• Training the models on custom datasets and demonstrating how RE can be used to get structured output for specific use-cases.
• Studying the use-case of placing the medical history of patients on a timeline.
• Analyzing the benefits of using RE to get more precise entity chunks for better performance when mapping them to medical codes.

We treat RE as a classification problem where each example is a pair of biomedical entities appearing in a given context -- the entities being NER chunks, and the context being the sentence or the entire document -- and develop two novel solutions: the first comprising a simpler FCNN architecture for speed, and the second based on the BioBERT (Lee et al. 2019) architecture for accuracy. We experiment with both approaches and compare their results.

For our first RE solution we rely on the entity spans and types identified by the NER model to develop distinct features to feed to an FCNN for classification. We first generate distinct pairs of entities (e.g. symptom-treatment), and then generate custom features for each pair. These features include the semantic similarity of the entities, the syntactic distance between the two entities, the dependency structure of the entire document, embedding vectors of the entity spans, as well as embedding vectors for 100 tokens within the vicinity of each entity. Figure 2 explains our model architecture in detail. We concatenate these features and feed them to fully connected layers with leaky ReLU activation. We also use batch normalisation after each affine transformation before feeding to the final softmax layer with a cross-entropy loss function. We use softmax cross-entropy instead of binary cross-entropy loss to keep the architecture flexible enough to scale to datasets having multiple relation types.
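For concreteness, the following is a minimal PyTorch sketch of such a classification head. The feature dimension and hidden layer sizes are illustrative assumptions, not the values used in our implementation, which is built on Spark NLP.

```python
# A minimal sketch of the speed-optimized FCNN head described above.
# Dimensions are assumptions; the production model is implemented in Spark NLP.
import torch
import torch.nn as nn

class RelationFCNN(nn.Module):
    def __init__(self, feature_dim: int, hidden_dims=(512, 256), num_relations: int = 2):
        super().__init__()
        layers, in_dim = [], feature_dim
        for h in hidden_dims:
            layers += [
                nn.Linear(in_dim, h),   # affine transformation
                nn.BatchNorm1d(h),      # batch normalisation after each affine layer
                nn.LeakyReLU(),         # leaky ReLU activation
            ]
            in_dim = h
        self.body = nn.Sequential(*layers)
        self.out = nn.Linear(in_dim, num_relations)  # logits for the softmax layer

    def forward(self, pair_features: torch.Tensor) -> torch.Tensor:
        # pair_features: concatenation of entity-span embeddings, neighboring-token
        # embeddings, and semantic-similarity / syntactic-distance features.
        return self.out(self.body(pair_features))

model = RelationFCNN(feature_dim=1024)
# Softmax cross-entropy keeps the head usable for any number of relation types.
loss_fn = nn.CrossEntropyLoss()
```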
Our second solution focuses on higher accuracy, as well as the exploration of relations across long documents, and is based on (Soares et al. 2019). In our implementation, we build the model in Apache Spark for scalability, take checkpoints from the BioBERT model, and train an end-to-end BERT model for RE. Similar to the first solution, this architecture also depends on the entity spans identified by the base NER model, and uses the entire document as the context string while training the model. The original paper used a sequence length of 128 tokens for the context string, which we keep constant; instead, we experiment with the content of the context string, training data augmentation, and fine-tuning techniques.

We use Spark NLP's (Kocaman and Talby 2021) NER models (Kocaman and Talby 2020) as the foundation for the RE models, as these NER models provide the entity spans required for performing RE. In a single inference pipeline, the RE models are placed sequentially after the NER model, and are fed the results of the NER model, the context, embeddings, and the dependency tree for feature generation. Apart from feature generation, the dependency tree also helps regularize candidate entity pairs for RE classification, as we can eliminate pairs with a large syntactic distance. This modular approach of arranging components reduces coupling and achieves a higher degree of memory and computational efficiency, as components like sentences, tokens, and embeddings are shared between the NER and RE models and do not need to be computed again. Since the NER model is essentially a token classifier and produces a prediction per token, we convert the tokens to chunks using BIO tags.
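As an illustration, a minimal sketch of this BIO-tag-to-chunk conversion is shown below; the function name and the example entity labels are ours.

```python
# A minimal sketch of collapsing per-token BIO predictions into entity chunks.
def bio_to_chunks(tokens, tags):
    """Convert parallel lists of tokens and BIO tags into (text, label) chunks."""
    chunks, span, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new entity begins
            if span:
                chunks.append((" ".join(span), label))
            span, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            span.append(token)                # the current entity continues
        else:                                 # "O" or an inconsistent tag
            if span:
                chunks.append((" ".join(span), label))
            span, label = [], None
    if span:
        chunks.append((" ".join(span), label))
    return chunks

# bio_to_chunks(["CT", "scan", "of", "the", "chest"],
#               ["B-Procedure", "I-Procedure", "O", "O", "B-BodyPart"])
# -> [("CT scan", "Procedure"), ("chest", "BodyPart")]
```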
We test the models on public datasets, report evaluation metrics, and analyse the results on examples. In addition to public datasets, we explain the process of annotating and training models on new datasets. We then study the utility of applying RE for use-cases like knowledge graph generation and improved entity resolution (the process of mapping entity chunks to medical codes).

We test both model architectures on seven public datasets, using the official training-test split of each for training and testing, and report macro-averaged F1 scores for each one in Table 1. These datasets include the 2012 i2b2 challenge for evaluating temporal relations in clinical text (Sun, Rumshisky, and Uzuner 2013), the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text (Uzuner et al. 2011), the Drug-Drug-Interaction (DDI) dataset for linking drugs with dispositions and reactions (Herrero-Zazo et al. 2013), the Chemical-Protein Interaction (CPI) dataset for linking genes/proteins with drug chemicals (Krallinger et al. 2017), the Phenotype-Gene Relations (PGR) dataset for relating human phenotypes and genes (Sousa, Lamurias, and Couto 2019), the Adverse Drug Events dataset for relating drugs with their reactions (Gurulingappa et al. 2012), and the posology relations task based on the 2018 n2c2 challenge (Henry et al. 2020). For the sake of brevity we do not delve into the details of each dataset; specifics can be found in the cited resources.

As shown in Table 1, the BERT model achieves new SOTA metrics on 5 public datasets, and outperforms the lighter FCNN model due to better contextual awareness. However, it is more than three times slower and has much higher memory requirements.

In addition to the public datasets, we sampled approximately 5000 clinical notes and manually annotated them with the help of domain experts, following these guidelines: we selected general entities (e.g., body part, date, test result) that can complement core entities (e.g., symptom, procedure, test) as the first entity, and disjoint entity types -- meaning entities that should not have relations among themselves -- from the core entities as the second entity, as explained in Table 3. Since the first entity can relate to multiple entities in the second column, we can define the relation between the two entity types as one-to-many, and can keep the relation types to a minimum, i.e., whether the two entities are related or not. This approach helps reduce annotation complexity, resulting in faster annotation times and higher inter-annotator agreement. For annotation purposes we utilized the publicly available Annotation Lab tool.

The ability to semantically relate entities paves the way for many opportunities and use-cases. For example, the RE model for Adverse Drug Events can be used to identify drugs that caused reactions in large trial datasets. Figure 3 shows the output of running the ADE RE model on sample text. Similarly, lab results, discharge notes, and prescriptions can be parsed to get a structured output, as illustrated in Figure 4.

Figure 3: Output of the ADE RE model on sample data. Arrows labeled 0 indicate that the two entities are not related, while 1 indicates that the reaction is caused by the drug.

In addition to using the public models, the following are some of the use-cases we explored with our general-purpose models.

The most notable benefit of RE is the ability to generate knowledge graphs from unstructured text. For this experiment, we used pretrained Spark NLP NER models and the general-purpose RE models explained in the previous section to process medical reports, with the primary goal of generating a concise structured output of a report. For instance, we relate procedures with dates and findings to recognize the dates of a procedure and its findings, along with any existing condition. We use the relations between body parts and procedures to get more specific details of the location of the procedure. Similarly, relating body parts with findings like test results and measurements can add more detail to the final output in specific use-cases. More granularity can be achieved by further subdividing body parts. For instance, in our experiment, we divide the body part into three parts: the primary body part (e.g., lung), a sub-part (e.g., lobe), and the direction/laterality (e.g., left) of the body part. In practice, these specific entities trickle down from the NER model to the RE models. A graph generated from a sample report can be seen in Figure 5.

Furthermore, the structured data can help create a patient timeline showing the progress of a certain condition over a certain duration. A sample timeline monitoring coronary calcium score and cyst can be seen in Figure 6. Such information can be used to analyse trends like the effectiveness of a drug for treating a certain condition on large datasets.

Figure 6: A sample timeline of a patient showing the calcium score trend, and the evolution of a cyst over multiple scans in a month.

Entity Resolver models map entity chunks to medical codes like CPT (AMA 2020), ICD (WHO 2019), SNOMED (NLM 2019), MeSH (NLM 2021a), and RxNorm (NLM 2021b) based on semantic similarity.
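Conceptually, such a resolver can be approximated by a nearest-neighbor search over embeddings of code descriptions. The sketch below is a simplification under that assumption, not the actual resolver implementation shipped with Spark NLP; the encoder producing the vectors is assumed to exist and is not shown.

```python
# A minimal sketch of semantic-similarity code resolution: return the code
# whose description embedding is closest to the entity chunk's embedding.
import numpy as np

def resolve(chunk_vec: np.ndarray, code_vecs: np.ndarray, codes: list):
    sims = code_vecs @ chunk_vec / (
        np.linalg.norm(code_vecs, axis=1) * np.linalg.norm(chunk_vec) + 1e-9)
    best = int(np.argmax(sims))               # nearest code by cosine similarity
    return codes[best], float(sims[best])
```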
This task is challenging for two major reasons. First, the inherent noise of the text, like abbreviations, acronyms, and synonyms, can produce false positive results. Second, medical codes are sensitive to variables like severity, location in the human body, administration type, and diagnosis method; for a given condition or treatment, there can be different codes (within the same ontology) depending on these factors. This challenge is more prominent in ontologies with wider vocabularies like SNOMED.

RE provides a solution to both problems. First, it intrinsically cleans the input to the resolver models of stop words and noise without additional effort. Second, it adds information to the core entity chunks from the surrounding context; with the help of relations, simple entities can be enriched with precise information to get accurate codes. For example, the chunk CT Scan -- identified as a procedure -- can be enriched with the imaging technique to achieve a more accurate CPT/SNOMED code. Enriching it further with the location of the procedure (e.g., chest) results in an even more precise chunk that can be resolved to a more specific CPT/SNOMED code. Table 4 compares base chunks with enriched chunks that include body parts, demonstrating the benefits of enriched entity chunks for improved coding.

In this paper we presented two new model architectures for RE while enabling scalability. We tested the models on public datasets and reported evaluation metrics. The metrics show that the BioBERT-based model outperforms the lighter FCNN model, and obtains new state-of-the-art accuracy on five benchmarks. However, for datasets with a small number of relation types, the simpler FCNN model may be a compelling option, not only due to faster run times but also much lower memory requirements compared to the BioBERT model, allowing larger datasets to be processed on commodity hardware. We also explain how to train RE models from scratch and describe the design behind the pretrained models available as part of the Spark NLP library.

We then study practical use-cases where RE plays the salient role of linking entities together to generate knowledge graphs, patient timelines, and structured summaries of medical notes. Relating dates to primary procedures and problems can help create a timeline for each patient. Finally, using granular NER models together with discrete RE models to clean and enrich entity chunks enables better entity resolution to clinical codes.

Given the complex nature of RE, and the pivotal role of contextual information, a common approach is to limit relations within a certain syntactic span, as even BERT models have a token sequence limit. One future research direction is to improve the contextual representation of large documents to allow relations over lengthy contextual spans. A second direction is to test whether auxiliary data -- either from medically annotated data or through transfer learning from healthcare-specific language models -- can deliver higher-accuracy Relation Extraction on the same neural network architectures.

Since optimal hyperparameter values vary for each dataset, a range of values that performed best across all the datasets can be seen in Table 5. Since RE is a classification task, the primary inputs are the context string (sentence) and a pair of entities. If there are multiple pairs in a single context string, we treat them as disjoint examples, as each example encapsulates the required inputs -- the entity chunk pair and the context -- which are then used to create input features. We can create a CSV-formatted file where each row is a training example for the model and contains the aforementioned inputs. The exact schema of the training file can be found in the training notebook (JSL 2021). Code for training an RE model is provided as a Google Colab notebook (JSL 2021). As the majority of the public datasets are protected and cannot be shared, they need to be obtained from their official websites and converted to the required format before training.
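To illustrate, one training row might look as follows. The column names here are hypothetical and chosen for illustration only; the exact schema is defined in the referenced notebook (JSL 2021).

```python
# A hypothetical training row for an RE model; column names are illustrative,
# not the actual schema from the training notebook (JSL 2021).
import pandas as pd

train_df = pd.DataFrame([{
    "sentence": "He was given aspirin and developed a rash.",
    "chunk1": "aspirin", "label1": "DRUG", "begin1": 13, "end1": 19,
    "chunk2": "rash",    "label2": "ADE",  "begin2": 37, "end2": 40,
    "rel": 1,  # 1 = reaction caused by the drug, 0 = unrelated
}])
train_df.to_csv("re_train.csv", index=False)
```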
References

Using drug descriptions and molecular structures for drug-drug interaction extraction from literature.
Deeper Task-Specificity Improves Joint Entity and Relation Extraction. CoRR.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Robustly Pre-trained Neural Model for Direct Temporal Relation Extraction.
Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports.
n2c2 shared task on adverse drug events and medication extraction in electronic health records.
The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions.
Improving Clinical Document Understanding on COVID-19 Research with Spark NLP.
Spark NLP: Natural Language Understanding at Scale. Software Impacts.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Relation extraction between the clinical entities based on the shortest dependency path based LSTM. CoRR.
SciFive: a text-to-text transformer model for biomedical literature.
Matching the Blanks: Distributional Similarity for Relation Learning.
BiOnt: Deep Learning using Multiple Biomedical Ontologies for Relation Extraction.
A Silver Standard Corpus of Human Phenotype-Gene Relations.
Evaluating temporal relations in clinical text: 2012 i2b2 Challenge.
2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text.
Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders.
Relation extraction from clinical narratives using pre-trained language models.
Relation Extraction from Biomedical and Clinical Text: Unified Multitask Learning Framework.
Clinical Relation Extraction Using Transformer-based Models.