authors: Lau, Wilson; Lybarger, Kevin; Gunn, Martin L.; Yetisgen, Meliha
title: Event-based clinical findings extraction from radiology reports with pre-trained language model
date: 2021-12-27

Radiology reports contain a diverse and rich set of clinical abnormalities documented by radiologists during their interpretation of the images. Comprehensive semantic representations of radiological findings would enable a wide range of secondary use applications to support diagnosis, triage, outcomes prediction, and clinical research. In this paper, we present a new corpus of radiology reports annotated with clinical findings. Our annotation schema captures detailed representations of pathologic findings that are observable on imaging ("lesions") and other types of clinical problems ("medical problems"). The schema used an event-based representation to capture fine-grained details, including assertion, anatomy, characteristics, size, count, etc. Our gold standard corpus contained a total of 500 annotated computed tomography (CT) reports. We extracted triggers and argument entities using two state-of-the-art deep learning architectures, including BERT. We then predicted the linkages between trigger and argument entities (referred to as argument roles) using a BERT-based relation extraction model. We achieved the best extraction performance using a BERT model pre-trained on 3 million radiology reports from our institution: 90.9%-93.4% F1 for finding triggers and 72.0%-85.6% F1 for argument roles. To assess model generalizability, we used an external validation set randomly sampled from the MIMIC Chest X-ray (MIMIC-CXR) database. The extraction performance on this validation set was 95.6% for finding triggers and 79.1%-89.7% for argument roles, demonstrating that the model generalized well to the cross-institutional data with a different imaging modality. We extracted the finding events from all the radiology reports in the MIMIC-CXR database and provided the extractions to the research community.

Structured representations of the findings in radiology reports would facilitate many secondary use applications, including clinical decision-support systems [3], diagnostic surveillance of medical problems [4], identification of patient cohorts with specific phenotypes [5], tracking follow-up recommendations [6], image retrieval and data-mining [7], and simplification of report language for patients [8]. Large-scale and real-time use of radiological finding information in these types of secondary use applications requires a detailed semantic representation of the findings that captures the most salient information. Since imaging tests are commonly used for cancer screening and diagnosis, semantic representations for findings associated with lesions and medical problems would be largely applicable to secondary use. In this paper, we explored the extraction of comprehensive representations of clinical findings from radiology reports, including the creation of a novel annotation schema, annotation of a new clinical data set, and the development of state-of-the-art clinical finding extraction models.

In our annotation schema, we categorized findings in radiology reports as Lesion findings and Medical Problem findings. A Lesion finding was defined as an abnormal space-occupying mass that was observable on the images. Lesions included primary tumors, metastases, benign tumors, abscesses, nodules, and other masses.
A Medical Problem finding was a pathological process that was not a lesion, for example cirrhosis, air-trapping, atherosclerosis, and effusion. Each finding category was represented through fine-grained event-based annotations. We presented a new annotated corpus of 500 computed tomography (CT) reports from the University of Washington (UW). To extract the finding events, we developed a deep learning extraction framework that fine-tuned a single BERT [9] model. We explored different contextualized embeddings through pretraining on different text sources. To assess the generalizability of the event extraction model, we annotated a subset of the MIMIC-CXR radiology reports [10]. The extraction model achieved comparable performance on the MIMIC-CXR and UW data sets, despite the differences between the data sets. We extracted the clinical findings from the entire MIMIC-CXR data set and made the extracted findings available to the research community. We also made the annotation guidelines and event extraction framework available. The extraction framework directly processes annotated event data from the BRAT annotation tool [11] and can be readily used for event extraction without any deep learning coding experience.

The development of NLP-based information extraction (IE) models that target important information in clinical text has increased in recent decades [12]. Radiology is a clinical domain where NLP approaches, including IE, have been extensively applied [2]. Radiological finding information can be extracted by using named entity recognition (NER) to identify fine-grained details, such as anatomy, size, characteristics, and assertion, and subsequently linking related phenomena using relation extraction (RE). Several studies employed custom rule-based linguistic patterns to identify clinical finding observations in radiology reports, including appendicitis indication, anatomy and assertion [13], adrenal observations and modifiers [14], and osteoporosis fracture categories and modifiers [15]. Due to the heterogeneity of writing styles, ambiguity of abbreviations, and presence of "hedging" statements [16], engineering linguistic and semantic rules to extract information from radiology reports requires substantial effort and clinical expertise. Furthermore, rule-based approaches produce brittle extraction models that do not generalize well. One example is the MedLEE system, developed at Columbia University, which incorporated comprehensive syntactic and semantic grammars to extract information from chest radiograph reports [17]. The conceptual model comprised 350 semantic grammar rules, 1,720 single-word lexicons, and 1,400 multi-word phrases. Development of the MedLEE semantic grammars required half a person-year [18], [19]. Sevenster et al. used MedLEE to identify finding observation and body location entities and establish relationships between entities through relations. However, the major drawback was that the recall of overall extraction (entities and relations) was less than 46% due to the lack of comprehensive lexicons and grammatical rules [20].

To overcome the limitations of rule-based systems, more contemporary radiology extraction work used statistical machine learning approaches to extract finding information. There is a body of radiology IE work that utilized discrete modeling approaches. For example, Hassanpour et al.
used conditional Markov and conditional random field (CRF) models to extract anatomy, observation, modifier, and uncertainty entities from a corpus of 150 reports [21]. Yim et al. employed maximum entropy models to extract relations between tumor references and attributes from radiology reports of hepatocellular carcinoma patients [22]. One challenge with statistical machine learning approaches is that manually engineered features are often tailored to solve a specific problem and are not easily adaptable to other domains.

Recent radiology extraction studies have utilized neural networks, which offer improved modeling capacity, abstraction, and transfer learning compared with discrete modeling approaches. A commonly applied neural approach is the sequence-based recurrent neural network (RNN) model, which encodes sequences using an internal memory mechanism. The Bidirectional Long Short-term Memory (BiLSTM) network is a popular RNN variant, which captures long-range sequential dependencies in the forward and backward directions. Cornegruta et al. extracted 4 different entities (body location, clinical finding, descriptor, and medical device) from an annotated corpus of 2,000 radiology reports using a BiLSTM model [23]. Steinkamp et al. extracted clinical finding observations and their relations to modifier entities, such as location, size, and change over time, using another RNN variant, the Gated Recurrent Unit [24].

Most state-of-the-art NLP classification work, including IE within the radiology domain, utilized pre-trained transformer models with hundreds of millions of parameters. The popular BERT [9] model offers several benefits over RNN variants, including the combination of self-supervised pre-training and sub-token representation. BERT learns word relationships through a masked language modeling task and learns sentence dependency by predicting whether two sentences are adjacent. This pre-training process allows the model to develop deep representations of words in context through layers of multi-head self-attention. BERT intrinsically attends to certain types of syntactic relations [25], and the dependency information can be leveraged to increase relation extraction performance [26], [27]. Provided that the model is sufficiently pre-trained on unlabeled data in the target domain, the expressive contextual representations of BERT can be transferred to specific prediction tasks, including IE, and achieve state-of-the-art performance. Sugimoto et al. extracted 7 different clinical entities from a corpus of 540 Japanese CT radiology reports by fine-tuning a pre-trained Japanese BERT model [28]. Other studies extracted breast imaging entities and relations from Chinese radiology reports [29], [30]. Datta et al. employed a similar BERT fine-tuning approach to extract relations for clinical findings with spatial indications, such as "within" or "near" [31].

We identified several gaps in prior work that limit the creation of comprehensive semantic representations of findings in radiology reports, including: (1) the limited scope of the annotation and extraction schemas, (2) the limited scope of diseases and anatomy explored, and (3) the lack of demonstrated generalizability. Findings in radiology reports can be relatively complex, and several attributes are often needed to fully capture all the finding information present (e.g., assertion, anatomy, size, and other characteristics) for meaningful secondary use.
Many prior studies focused only on entity extraction, without identifying the relations between entities needed to fully represent the findings [14], [15], [21], [23], [28]. To address this gap, we introduced an event-based annotation schema that captured the majority of the finding information. Several studies focused on specific diseases and/or anatomical regions [13], [22], [29]-[31]. While this focus may improve performance for the target diseases and/or anatomy, it reduces the generalizability of the annotated data sets and extraction models. To address this gap, we created the first general-purpose gold standard annotated with an event-based schema for Lesion and Medical Problem findings, without disease or anatomy constraints. The gold standard contained 500 randomly sampled CT reports. In comparison to reports from other imaging modalities, such as chest X-ray reports, CT reports covered a wide range of anatomy, medical problems, lesion types, lesion characteristics, and assertions. We trained and evaluated the event extraction framework on this gold standard of CT reports. No previous studies have evaluated the generalizability of extraction models across imaging modalities or institutions. To address this gap, we evaluated the extraction performance on an external validation set we created from chest X-ray reports in the publicly available MIMIC-CXR data set.

We used an existing clinical dataset of 706,908 computed tomography (CT) reports from the UW clinical repository, spanning 2008-2018. We randomly sampled 500 CT reports from this dataset and annotated them as our gold standard corpus. Retrospective review of this dataset was approved by the UW institutional review board, and the dataset was deidentified to preserve the privacy of the patients and ensure HIPAA compliance.

Our annotation schema is summarized in Table 1. We used an event-based representation to capture the details of two clinical finding types: Lesion and Medical Problem. Each event was characterized by a trigger and a set of connected arguments. The trigger was a required key phrase identifying the finding event, while the arguments provided fine-grained details about the event. The argument entities were linked to the corresponding triggers through argument roles, forming a detailed and nuanced semantic representation of the clinical findings. We defined two types of arguments: span-only and span-with-value. The annotation of span-only arguments included the selection of the relevant phrase, assignment of an argument type label, and connection to the trigger, similar to most event annotation work. The annotation of span-with-value arguments included the selection of the relevant phrase, assignment of an argument type label together with an additional categorical label capturing the clinical meaning of the selected phrase, and connection to the trigger. The categorical labels normalized the contents of the annotated phrase, allowing the extracted information to be more easily incorporated into secondary use applications. For example, in the sentence "No traumatic abnormality in the abdomen or pelvis", annotating the text span "no" as Medical-Assertion would also include the assignment of the categorical label absent. Because the presence of a lesion or medical problem could be implied rather than explicit, present was the default categorical label for Assertion, unless the report clearly indicated that the possible or absent labels were applicable.
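To make the slot-filling view concrete, the sketch below shows the example sentence above as a trigger plus typed argument slots. This is an illustrative Python layout and an assumption on our part, not the authors' internal data model; span-only slots hold the annotated text, while span-with-value slots carry the normalized categorical label.

```python
# Illustrative sketch (not the authors' data model): the Medical Problem event from
# "No traumatic abnormality in the abdomen or pelvis" represented as filled slots.
medical_problem_event = {
    "trigger": "traumatic abnormality",  # required key phrase anchoring the event
    "arguments": [
        # span-only argument: the slot value is the annotated text span itself
        {"role": "Medical-Anatomy", "span": "abdomen or pelvis"},
        # span-with-value argument: the span is normalized to a categorical label
        {"role": "Medical-Assertion", "span": "No", "value": "absent"},
    ],
}

# When no Assertion span is annotated, the schema's default categorical label applies.
DEFAULT_ASSERTION = "present"
```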
Table 1. Annotation schema for Lesion and Medical Problem finding events.

  Trigger / Argument     Type             Categorical labels                       Example phrases
Lesion
  Description (Trigger)  span-only        -                                        "mass", "lesion", "nodule"
  Anatomy                span-only        -                                        "left lower lobe"
  Assertion              span-with-value  present (default), absent, possible      "no", "possible"
  Characteristics        span-only        -                                        "hypodense", "septal"
  Count                  span-only        -                                        "2", "numerous", "multiple"
  Size                   span-only        -                                        "4.1 x 3.1 cm", "small"
  Size Trend             span-with-value  new, increasing, decreasing, no-change   "stable", "unchanged"
Medical Problem
  (Trigger)              span-only        -                                        "atherosclerotic calcifications"
  Anatomy                span-only        -                                        "abdominal aorta", "right kidney"
  Assertion              span-with-value  present (default), absent, possible      "no", "possible"

Extraction of these findings was treated as a slot filling task by identifying the text spans that corresponded to the arguments (argument entities with roles) of the clinical finding events. Figure 1 presents example annotations for a Lesion event and a Medical Problem event. For span-only arguments, the slot values would be the identified text spans. For span-with-value arguments, the slot values would be the identified categorical labels, which capture the meaning of the annotated phrases. A finding event might include multiple arguments of the same type. For example, a medical problem could be linked to multiple anatomical locations, or a lesion could be described by multiple characteristics.

Figure 1 example (Medical Problem event):
• Problem = traumatic abnormality
• Anatomy = abdomen or pelvis
• Assertion = absent

Inter-annotator agreement and model extraction performance were evaluated using the same scoring criteria. The annotated and extracted events include trigger and argument entities that are connected through argument roles. The pairing of triggers and arguments (entities with identified roles) assembles events from the individual entities. The scoring criteria for trigger and argument entities and argument roles are presented below.

Trigger and argument entity scoring considered span identification and labeling, without considering the roles linking trigger and argument entities. All trigger and argument entities were compared at the token level (rather than the span level) to allow partial matches, since partially matched text spans could still contain clinically relevant information, e.g., "mass lesions" vs. "lesions".

Argument role scoring considered three annotated/extracted phenomena: (1) the trigger entity, (2) the argument entity, and (3) the argument role (linking the trigger-argument entity pair). Argument role equivalence required the trigger entity, argument entity, and role label to be equivalent. In argument role scoring, the entity equivalence criteria for triggers, span-only arguments, and span-with-value arguments were based on the semantics of the event representation, by considering the most salient information being captured by the entities [32]. Triggers: Events were aligned based on trigger equivalence, and the arguments associated with aligned events (events with equivalent triggers) were compared based on the argument types. Triggers were considered equivalent if the spans overlapped by at least one token. Figure 2 shows an example of two Medical Problem annotations. Although the word "displaced" is not part of the trigger in Annotation #2, their overlapping text spans and connections to the Medical-Anatomy argument entities indicate that both argument entities belong to the same event and can be scored accordingly.

The annotation was performed by one medical student and one graduate student using the BRAT rapid annotation tool [11].
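Since the annotation (and the later data release) uses BRAT, each event is stored in standoff (.ann) files as tab-separated entity (T), event (E), and attribute (A) lines. The fragment below is a hypothetical Lesion example assembled from the phrases in Table 1; the sentence, character offsets, and the exact entity, role, and attribute names are illustrative assumptions rather than the project's actual guideline names.

```
Sentence: "2 hypodense nodules in the left lower lobe measuring 4.1 x 3.1 cm, unchanged."

T1  Lesion 12 19  nodules
T2  Lesion-Count 0 1  2
T3  Lesion-Characteristics 2 11  hypodense
T4  Lesion-Anatomy 27 42  left lower lobe
T5  Lesion-Size 53 65  4.1 x 3.1 cm
T6  Lesion-Size-Trend 67 76  unchanged
E1  Lesion:T1 Count:T2 Characteristics:T3 Anatomy:T4 Size:T5 Size-Trend:T6
A1  Size-Trend-Value T6 no-change
```

The E line links the trigger entity to its argument entities (the argument roles), and the A line records the normalized categorical label of a span-with-value argument.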
Annotation guidelines were provided that described the details of each clinical finding event. In the initial iterations, the annotators were given the same samples to annotate independently. After each iteration, the annotators met with the domain expert radiologist to discuss and resolve the disagreements. The annotation guidelines were updated accordingly. At each iteration, we calculated inter-annotator agreement using the pair-wise F1 score [33] (Table 2). As can be observed, the number of annotated Medical Problem events was more than three times higher than the number of Lesion events. In general, each argument type corresponded to a single argument role type (one-to-one mapping between argument types and roles). One exception is Lesion-Size, which could be connected to a trigger through a Lesion-Size (Past) or Lesion-Size (Present) argument role.

Table 3. Gold standard corpus statistics.

The finding events were extracted in two separate steps: (1) the trigger and argument entities were extracted, and (2) the argument roles were identified by connecting extracted trigger and argument entities through relations. The pairing of the trigger and argument entities through the argument roles assembles events from the individual entity extractions. Our event extraction pipeline operated on sentences, which were treated as independent samples.

The extraction of trigger and argument entities was defined as a NER task. For the span-with-value argument entities, the categorical labels were appended to the entity labels, for example, Medical-Assertion (absent). Predicting the labels of the argument entities would therefore predict both the argument type and the categorical label. We evaluated two state-of-the-art neural network architectures: (1) BiLSTM-CRF [34] and (2) BERT NER [9]. BiLSTM-CRF was considered a strong NER baseline by multiple studies [28], [30], [31]. We used the open source NeuroNER [35] for the BiLSTM-CRF implementation. Figure 5 presents NeuroNER's BiLSTM-CRF architecture. Each token in the input sentence was represented by the concatenation of a pretrained word embedding and a character-aware word embedding. The character-aware word embedding was generated by a BiLSTM operating on the individual characters associated with each token. The character-aware word embedding enabled the model to learn the morphological structure in each word and to encode out-of-vocabulary tokens. The sequence of word embeddings was then encoded using a second BiLSTM layer to create a contextualized representation of the sentence. The label of each word was predicted by a CRF output layer which took into account the conditional dependencies across the neighboring labels.

To create input labels for the NER model from our annotated corpus, we used the Begin, Inside, Outside (BIO) tagging schema, based on whether the token was at the beginning of, inside, or outside a labeled span. For instance, consider the sentence "Probable malignant pancreatic mass with no evidence of vascular encasement"; each token in the sentence is assigned a BIO label accordingly.

The BERT NER model was implemented by adding a single linear layer to the BERT output hidden states and fine-tuning a pre-trained BERT model, as described by Devlin et al. [9]. Because BERT utilized WordPiece tokenization [36], rare words would be segmented into multiple sub-tokens. These sub-tokens, prefixed by "##" if not the first sub-token, allowed the segments of the words to be represented in a deterministic fashion.
Rather than using a universal token like [UNK], the sub-token representation provided richer contextual embeddings for the model to generalize. During the BIO labeling, the sub-tokens starting with "##" were assigned a special label #. In addition, the BERT input included the special tokens [CLS] and [SEP] at the beginning and end of a sentence, respectively, to signify the sentence boundaries. Figure 6 illustrates how the labels of an input sentence were classified by BERT NER.

Once the trigger and argument entities were extracted, the argument roles were identified by predicting the links between trigger and argument entities. Identifying the roles of the argument entities filled the slots of the clinical finding events, similar to Figure 1. Each event included a trigger that anchored the event, with zero or more argument connections. Each argument role was represented by a unidirectional relation where the head was the trigger entity and the tail was an argument entity. We predicted the argument roles by decomposing each event into a set of relations, predicting the relations, and then assembling events from the predicted relations. Relations were extracted using BERT by adding a linear layer to the pooled output state (encoded in the [CLS] token) and fine-tuning the model. Figure 7 presents the BERT relation extraction (RE) model with an example input sentence. During training, batches from the NER and RE tasks were alternated randomly, minimizing the cross-entropy loss for the applicable target (NER or RE) and thereby effectively allowing the model to learn from the two different tasks.

We performed 5-fold cross validation (CV) for all experiments using the same data split ratio (80% for training, 10% for validation, 10% for testing). The validation set was used for applying early stopping in order to avoid overfitting the training data [38]. The training was stopped when the validation results no longer showed improvement.

For the entity extraction baseline (BiLSTM-CRFrad), we used the word2vec embeddings pre-trained on a radiology report dataset from our previous work [37]. This dataset contained over 3 million reports covering a wide range of imaging modalities and was collected from four institutions: the University of Washington Medical Center, Northwest Hospital and Medical Center, the Seattle Cancer Care Alliance, and Harborview Medical Center. In terms of the model hyperparameters, the embedding dimension and the hidden state dimension of the character and sequence LSTM layers were 25 and 100, respectively. We used the Adam optimizer with a learning rate of 0.005, as suggested by NeuroNER.

We experimented with three different pre-trained BERT models (BERTbase, BERTclinical, and BERTrad). BERTbase was pre-trained on Wikipedia and BookCorpus, and made available by Google [9]. BERTclinical was pre-trained on 2 million clinical notes, including over 500,000 radiology reports, from the MIMIC-III database [39], [40]. BERTrad was pre-trained on over 3 million UW radiology reports and was initialized from BERTclinical.

To better assess the general performance of the models with different subsamples, we repeated the cross validation 10 times. For each run, the cross validation data splits were created with a different random seed [38]. We reported the average precision, recall, and F1 scores across these 50 different runs and included the 95% confidence intervals. All of the trigger and argument entities were extracted first before their relations were identified.
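To illustrate the BIO labeling and WordPiece handling described above, the short sketch below uses the Hugging Face transformers tokenizer to align word-level BIO labels to sub-tokens, assigning the special "#" placeholder to continuation sub-tokens. It is a minimal sketch, not the authors' released framework, and the gold labels shown for the example words are our own illustrative assumptions.

```python
from transformers import BertTokenizerFast

# Word-level BIO labels for part of the example sentence; these particular labels are
# illustrative assumptions, not taken from the annotated corpus.
words  = ["Probable", "malignant", "pancreatic", "mass"]
labels = ["B-Lesion-Assertion (possible)", "B-Lesion-Characteristics",
          "B-Lesion-Anatomy", "B-Lesion"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer(words, is_split_into_words=True)

# word_ids() maps every WordPiece sub-token back to its source word ([CLS]/[SEP] map to None).
aligned = []
for sub_token, word_idx in zip(enc.tokens(), enc.word_ids()):
    if word_idx is None:
        aligned.append((sub_token, "O"))               # special tokens [CLS], [SEP]
    elif sub_token.startswith("##"):
        aligned.append((sub_token, "#"))               # continuation sub-token gets the "#" label
    else:
        aligned.append((sub_token, labels[word_idx]))  # first sub-token keeps the word's BIO label

for sub_token, label in aligned:
    print(f"{sub_token:12s} {label}")
```

A token classification head (for example, transformers' BertForTokenClassification, which places a single linear layer over the output hidden states) would then be fine-tuned on labels aligned this way.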
Trigger and argument entity extraction performance was evaluated at the token level, as described in Section 2.2.1. The results are shown in Table 4. All of the BERT implementations outperformed BiLSTM-CRFrad. The BERT model with radiology-specific pretraining, BERTrad, generally performed better than the other variants, BERTbase and BERTclinical, achieving the highest overall average F1 of 85.5%. In Lesion-Count prediction, BERTclinical was slightly higher than BERTrad. In Lesion-Size-Trend prediction, the decreasing label had relatively low extraction performance due to the small sample size. For the Assertion extraction, the absent label was easier to predict since most of the annotated text spans comprised the single word "no", which constituted 70% of the Medical-Assertion and 84% of the Lesion-Assertion entities.

We conducted statistical significance tests using the overall F1 to assess whether the differences in model results were due to randomness or sampling variability. In cross validation, the training sets overlap between different folds. As a result, the classification performance from each fold is not completely independent, which can lead to misleading statistical results when standard paired t-tests are applied [41]. Hence, we applied the corrected resampled t-test, as suggested by Nadeau and Bengio [42], to better estimate the sample variance. The test results showed that the overall performance of BERTrad was better than the other architectures with significance (p-value < 5e-6).

In this section, we present the end-to-end argument role extraction results. Specifically, we predicted the argument roles using the extracted triggers and argument entities rather than the gold standard entities. We conducted the same statistical tests on the event argument extraction results using the overall performance scores presented in Table 6. BERTrad achieved the best overall performance with significance (p-values < 1.6e-4).

We used the chest X-ray reports in the MIMIC-CXR database to explore the generalizability of the event extraction models. Each study in the database is associated with a single radiology report [10]. The dataset was made publicly accessible to support independent research, such as predicting pulmonary edema severity [43], predicting COVID-19 pneumonia severity [44], and evaluating FDA-approved AI devices [45]. To evaluate the generalizability of our extraction model, we manually annotated 50 randomly selected chest X-ray reports from the MIMIC-CXR database using the same finding event annotation schema. This validation set included 257 Medical Problem finding events (141 argument entities and 313 roles) and 7 Lesion finding events (9 argument entities, 15 roles). The overall F1 scores on this validation set were 95.6% for triggers, 79.1% for span-only arguments, and 89.7% for span-with-value arguments, evaluated using the same argument role scoring criteria in Section 2.2.2. The extraction performance was comparable to our repeated 5-fold cross validation performance, despite the fact that the MIMIC-CXR reports were from a different institution and based on a different imaging modality. The MIMIC-CXR radiology reports were generally shorter than the reports in our training corpus: the word count per report had a mean of 87 and a median of 79, compared with a mean of 327 and a median of 288 in our corpus. We found that the event extraction model was able to identify clinical concepts that were unseen in our training corpus.
For instance, the words "plasmacytoma" and "fibroadenomas" were correctly identified as lesions, and "acute respiratory distress syndrome" was correctly identified as a medical problem, even though these lesion and medical problem mentions did not appear in any radiology reports in the training corpus. This could be attributed to the pre-training of BERTrad with 3 million UW radiology reports covering a wide range of modalities.

We extracted lesion and medical problem findings from all 227,835 chest X-ray reports in the MIMIC-CXR dataset with our event extraction framework. A total of 1,420,604 medical problem findings and 31,706 lesion findings were extracted using the fine-tuned BERTrad model. To contribute to the core aim of the MIMIC-CXR project and facilitate future research studies in medical imaging, we are releasing the finding extraction results for all 227,835 radiology reports. The extracted data are in BRAT's standoff format and follow the same subject IDs, study IDs, and folder structure, such that they can be readily used to augment the existing images and reports.

We presented a new schema for representing lesion and medical problem findings in radiology reports. In trigger and argument entity extraction, the BERT-based NER models outperformed the BiLSTM-CRF baseline. In both the entity extraction and argument role prediction tasks, the BERT model with the most domain-specific pre-training, BERTrad, achieved the best performance. Pre-training BERTrad on 3 million UW radiology reports allowed the model to learn better contextual representations and transfer the knowledge of clinical concepts that are absent in the training corpus. BERTrad achieved an end-to-end performance of 92.9% F1 for triggers, 75.0% F1 for span-only arguments, and 84.8% F1 for span-with-value arguments.

Among the finding entities, Medical-Problem and Medical-Anatomy had relatively long text spans. Over 25% of Medical-Problem spans and 35% of Medical-Anatomy spans contained at least 5 words. We found that some entities with lengthy spans were extracted into multiple separate entities, particularly before and after a conjunctive word. About 4% of all Medical-Problem entities and 7% of all Medical-Anatomy entities were split into multiple entities by the entity extraction models. Figure 8 presents an example of each case.

In our annotation schema, the same entity could be assigned multiple labels. For example, the same anatomy span could possibly be annotated as both Lesion-Anatomy and Medical-Anatomy. Our NER models could only assign a single label to each token, so a text span could not be extracted as multiple argument entities. Approximately 1% of all entities in our annotated corpus had multiple labels, so this limitation does not fundamentally impact extraction performance. One way to circumvent this single-label limitation is to use a single anatomy entity for both finding types. Although a single anatomy entity no longer carries any clinical finding connotation, its association with the finding events can still be identified by the RE model.

Our extraction framework employed multi-task learning to optimize the parameters of a single BERT model. We did not explore other fine-tuning approaches, such as using graph structures to jointly model the span relations in the different tasks [46] or using entity-aware markers to encode input sentences in a relation extraction model, which was shown to outperform joint modeling architectures [47].
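The multi-task setup discussed above can be pictured as one shared BERT encoder with a lightweight head per task, trained by alternating NER and RE mini-batches. The PyTorch sketch below is a minimal illustration under that reading of the paper, not the authors' released implementation.

```python
import torch.nn as nn
from transformers import BertModel

class JointFindingExtractor(nn.Module):
    """Shared BERT encoder with one head per task (illustrative sketch)."""

    def __init__(self, model_name: str, num_ner_labels: int, num_role_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_ner_labels)    # per-token trigger/argument entity logits
        self.role_head = nn.Linear(hidden, num_role_labels)  # argument-role logits from the pooled [CLS]

    def forward(self, input_ids, attention_mask, task: str):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        if task == "ner":
            return self.ner_head(out.last_hidden_state)   # one logit vector per sub-token
        return self.role_head(out.pooler_output)          # one logit vector per candidate relation input

# During training, NER and RE mini-batches would be drawn in alternating (random) order,
# each minimizing its own cross-entropy loss while updating the shared encoder.
```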
Our BERTrad model was pre-trained using the common transfer learning paradigm of initializing its weights from another BERT model in a relevant domain. This approach is particularly advantageous when the target data is scarce. However, a recent study showed that pre-training the language model from scratch in a domain with abundant unlabeled text could yield a better in-domain vocabulary and result in substantial performance improvements [48]. Since our UW data set contained more than 3 million radiology reports, this pre-training approach could potentially improve the contextual representation of the BERTrad model and possibly lead to better event extraction performance.

In this paper, we presented a new schema for extracting lesion and medical problem findings from radiology reports. The event representation of each clinical finding comprised a trigger and different arguments, capturing the fine-grained semantic information of the finding. A total of 2,344 lesion findings and 8,065 medical problem findings were annotated in 500 CT radiology reports. For argument entity extraction, we evaluated two state-of-the-art neural architectures, BiLSTM-CRF and BERT. The BERTrad model pre-trained on 3 million radiology reports achieved an overall average F1 score of 85.5%, based on token-level evaluation. We then extracted the clinical finding events by predicting the argument roles for the extracted entities. The overall average F1 scores for end-to-end event extraction were 92.9% for triggers, 75.0% for span-only arguments, and 84.8% for span-with-value arguments. To demonstrate the generalizability of the BERTrad model, we extracted the clinical findings (1,420,604 medical problem findings and 31,706 lesion findings) from all the radiology reports in the MIMIC-CXR database. Based on the evaluation with a manually labeled validation set of 50 chest X-ray reports, the overall average F1 scores for the extraction were 95.6% for triggers, 79.1% for span-only arguments, and 89.7% for span-with-value arguments. The extraction performance was comparable to the repeated 5-fold cross validation performance with the UW corpus. We are releasing both our deep learning event extraction framework and the MIMIC-CXR extracted clinical findings to the research community.
References

[1] Common Data Elements in Radiology
[2] Natural Language Processing in Radiology: A Systematic Review
[3] What can natural language processing do for clinical decision support?
[4] Use of computerized surveillance to detect nosocomial pneumonia in neonatal intensive care unit patients
[5] Automated Identification of Patients With Pulmonary Nodules in an Integrated Health System Using Administrative Health Plan Data, Radiology Reports, and Natural Language Processing
[6] Automated Tracking of Follow-Up Imaging Recommendations
[7] Intelligent image retrieval based on radiology reports
[8] Text Simplification Using Consumer Health Vocabulary to Generate Patient-Centered Radiology Reporting: Translation and Evaluation
[9] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[10] MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports
[11] brat: a Web-based Tool for NLP-Assisted Text Annotation
[12] Clinical information extraction applications: A literature review
[13] Extracting Actionable Findings of Appendicitis from Radiology Reports Using Natural Language Processing
[14] Development of Automated Detection of Radiology Reports Citing Adrenal Findings
[15] Natural language processing of radiology reports for identification of skeletal site-specific fractures
[16] Language of the radiology report: primer for residents and wayward radiologists
[17] Architectural requirements for a multipurpose natural language processor in the clinical environment
[18] A conceptual model for clinical radiology reports
[19] A general natural-language text processor for clinical radiology
[20] Automatically Correlating Clinical Findings and Body Locations in Radiology Reports Using MedLEE
[21] Information extraction from multi-institutional radiology reports
[22] Tumor information extraction in radiology reports for hepatocellular carcinoma patients
[23] Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks
[24] Toward Complete Structured Information Extraction from Radiology Reports Using Machine Learning
[25] What Does BERT Look at? An Analysis of BERT's Attention
[26] A Dependency-Based Neural Network for Relation Classification
[27] Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths
[28] Extracting clinical terms from radiology reports with deep learning
[29] Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches
[30] Extracting comprehensive clinical information for breast cancer using deep learning methods
[31] Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning
[32] Annotating social determinants of health using active learning, and characterizing determinants using neural event extraction
[33] Agreement, the F-Measure, and Reliability in Information Retrieval
[34] Neural Architectures for Named Entity Recognition
[35] NeuroNER: an easy-to-use program for named-entity recognition based on neural networks
[36] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
[37] Extraction and Analysis of Clinically Important Follow-up Recommendations in a Large Radiology Dataset
[38] Automatic early stopping using cross validation: quantifying the criteria
[39] Publicly Available Clinical BERT Embeddings
[40] MIMIC-III, a freely accessible critical care database
[41] Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms
[42] Inference for the Generalization Error
[43] Joint Modeling of Chest Radiographs and Radiology Reports for Pulmonary Edema Assessment
[44] Predicting COVID-19 Pneumonia Severity on Chest X-ray With Deep Learning
[45] How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals
[46] Entity, Relation, and Event Extraction with Contextualized Span Representations
[47] A Frustratingly Easy Approach for Entity and Relation Extraction
[48] Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing