title: Extracting COVID-19 Diagnoses and Symptoms From Clinical Text: A New Annotated Corpus and Neural Event Extraction Framework
authors: Lybarger, Kevin; Ostendorf, Mari; Thompson, Matthew; Yetisgen, Meliha
date: 2021-03-26; journal: J Biomed Inform; DOI: 10.1016/j.jbi.2021.103761

Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, there are many open questions related to tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83-0.97 F1 for events and 0.73-0.79 F1 for assertions). Our span-based event extraction model outperforms an extractor built on MetaMapLite for the identification of symptoms with assertion values. In a secondary use application, we predicted COVID-19 test results using structured patient data (e.g. vital signs and laboratory results) and automatically extracted symptom information to explore the clinical presentation of COVID-19. Automatically extracted symptoms improve COVID-19 prediction performance beyond structured data alone.

As of December 20, 2020, there were over 75 million confirmed COVID-19 cases globally, resulting in 1.6 million related deaths [1]. Tracking the spread of COVID-19 and estimating the true number of infections remain challenges for policy makers, healthcare workers, and researchers, even as testing availability increases. Symptom information provides useful indicators for tracking potential infections and disease clusters [2]. Certain symptoms and underlying comorbidities have directed COVID-19 testing. However, the clinical presentation of COVID-19 varies significantly in severity and symptom profiles [3]. The most prevalent COVID-19 symptoms reported to date are fever, cough, fatigue, and dyspnea [4], but emerging reports identify additional symptoms, including diarrhea and neurological symptoms, such as changes in taste or smell [5-7]. Certain initial symptoms may be associated with higher risk of complications; in one study, dyspnea was associated with a two-fold increased risk of acute respiratory distress syndrome [8]. However, correlations between symptoms, positive tests, and rapid clinical deterioration are not well understood in ambulatory care and emergency department settings. Routinely collected information in the Electronic Health Record (EHR) can provide crucial COVID-19 testing, diagnosis, and symptom data needed to address these knowledge gaps.
Laboratory results, vital signs, and other structured data can easily be queried and analyzed at scale; however, more detailed and nuanced descriptions of COVID-19 diagnoses, exposure history, symptoms, and clinical decision-making are typically only documented in the clinical narrative. To leverage this textual information in large-scale studies, the salient COVID-19 and symptom information must be automatically extracted.

This work presents a new corpus of clinical text annotated for COVID-19, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus. CACT consists of 1,472 notes from the University of Washington (UW) clinical repository with detailed event-based annotations for COVID-19 diagnosis, testing, and symptoms. The event-based annotations characterize these phenomena across multiple dimensions, including assertion, severity, change, and other attributes needed to comprehensively represent these clinical phenomena in secondary use applications. This work is part of a larger effort to use routinely collected data describing the clinical presentation of acute and chronic diseases, with two major aims: 1) to describe the presence, character, and changes in symptoms associated with clinical conditions where delays or misdiagnoses occur in clinical practice and impact patient outcomes (e.g. infectious diseases, cancer) [9], and 2) to provide a more efficient and cost-effective mechanism to validate clinical prediction rules previously derived from large prospective cohort studies [10]. To the best of our knowledge, CACT is the first clinical data set with COVID-19 annotations, and it includes 29.9K distinct events. We present the first information extraction results on CACT using an end-to-end neural event extraction model, establishing a strong baseline for identifying COVID-19 and symptom events. We explore the prediction of COVID-19 test results (positive or negative) using structured EHR data and automatically extracted symptoms and find that the automatically extracted symptoms improve prediction performance.

Given the recent onset of COVID-19, there are limited COVID-19 corpora for natural language processing (NLP) experimentation. Corpora of scientific papers related to COVID-19 are available [11, 12], and automatic labels for biomedical entity types are available for some of these research papers [13]. However, we are unaware of corpora of clinical text with supervised COVID-19 annotations.

Multiple clinical corpora are annotated for symptoms. As examples, South et al. [14] annotated symptoms and other medical concepts with negation (present/not present), temporality, and other attributes. Koeling et al. [15] annotated a pre-defined set of symptoms related to ovarian cancer. For the i2b2/VA challenge, Uzuner et al. [16] annotated medical concepts, including symptoms, with assertion values and relations. While some of these corpora may include symptom annotations relevant to COVID-19 (e.g. "cough" or "fever"), the distribution and characterization of symptoms in these corpora may not be consistent with COVID-19 presentation. To fill the gap in clinical COVID-19 annotations and detailed symptom annotation, we introduce CACT to provide a relatively large corpus with COVID-19 diagnosis, testing, and symptom annotations.

The most commonly used Unified Medical Language System (UMLS) concept extraction systems are the clinical Text Analysis and Knowledge Extraction System (cTAKES) [17] and MetaMap [18].
The National Library of Medicine (NLM) created a lightweight Java implementation of MetaMap, MetaMapLite, which demonstrated real-time speed and extraction performance comparable to or exceeding that of MetaMap, cTAKES, and DNorm [19]. In previous work, we built on MetaMapLite, incorporating assertion value predictions (e.g. present versus absent) using classifiers trained on the 2010 i2b2 challenge dataset to create the extraction pipeline referred to here as MetaMapLite++ [20]. MetaMapLite++ assigns each extracted UMLS Metathesaurus concept an assertion value with a Support Vector Machine (SVM)-based assertion classifier that utilizes syntactic and semantic knowledge. The SVM assertion classifier achieved state-of-the-art assertion performance (micro-F1 94.23) on the i2b2 2010 assertion dataset [21]. Here, we use MetaMapLite++ as a baseline for evaluating extraction performance for a subset of our annotated phenomena, specifically symptoms with assertion values, using the UMLS "Sign or Symptom" semantic type.

The Mayo Clinic updated its rule-based medical tagging system, MedTagger [22], to include a COVID-19 specific module that extracts 18 phenomena related to COVID-19, including 11 common COVID-19 symptoms with assertion values [23]. We do not use the COVID-19 MedTagger variant as a baseline, because our symptom annotation and extraction is not limited to known COVID-19 symptoms.

There is a significant body of information extraction (IE) work related to coreference resolution, relation extraction, and event extraction tasks. In these tasks, spans of interest are identified, and linkages between spans are predicted. Many contemporary IE systems use end-to-end multi-layer neural models that encode an input word sequence using recurrent or transformer layers, classify spans (entities, arguments, etc.), and predict the relationships between spans (coreference, relation, role, etc.) [24-29]. Of most relevance to our work is a series of developments starting with Lee et al. [30], which introduces a span-based coreference resolution model that enumerates all spans in a word sequence, predicts entities using a feed-forward neural network (FFNN) operating on span representations, and resolves coreferences using a FFNN operating on entity span-pairs. Luan et al. [31] adapts this framework to entity and relation extraction, with a specific focus on scientific literature. Luan et al. [32] extends the method to take advantage of both co-reference and relation links in a graph-based approach that jointly predicts entity spans, co-reference, and relations. By updating span representations in multi-sentence co-reference chains, the graph-based approach achieved state-of-the-art performance on several IE tasks representing a range of different genres. Wadden et al. [33] expands on Luan et al. [32]'s approach, adapting it to event extraction tasks. We build on the work of Luan et al. [31] and Wadden et al. [33], augmenting the modeling framework to fit the CACT annotation scheme. In CACT, event arguments are generally close to the associated trigger, and inter-sentence events linked by co-reference are infrequent, so the graph-based extension, which adds complexity, is unlikely to benefit our extraction task.

Many recent NLP systems use pre-trained language models (LMs), such as ELMo, BERT, and XLNet, that leverage unannotated text [34-36].
A variety of strategies for incorporating the LM output are used in IE systems, including using the contextualized word embedding sequence as the input to a Conditional Random Field entity extraction layer [37], using it as the basis for building span representations [32, 33], or adding an entity-aware attention mechanism and pooled output states to a fully transformer-based model [38]. There are many domain-specific LM variants. Here, we use Alsentzer et al. [39]'s Bio+Clinical BERT, which is trained on PubMed papers and MIMIC-III [40] clinical notes.

There are many pre-print and published works exploring the prediction of COVID-19 outcomes, including COVID-19 infection, hospitalization, acute respiratory distress syndrome, need for intensive care unit (ICU) care, need for a ventilator, and mortality [41-53]. These COVID-19 outcomes are typically predicted using existing structured data within the EHR, including demographics, diagnosis codes, vitals, and lab results, although Izquierdo et al. [46] incorporates information automatically extracted with the existing EHRead tool. Our literature review identified 24 laboratory, vital sign, and demographic fields that are predictive of COVID-19 (see Table 7 in the Appendix for details). While there are some frequently cited fields, there does not appear to be a consensus across the literature regarding the most prominent predictors of COVID-19 infection. These 24 predictive fields informed the development of our COVID-19 prediction work in Section 5. Prediction architectures include logistic regression, SVM, decision trees, random forest, K-nearest neighbors, Naïve Bayes, and multilayer perceptron [46, 47, 51-53].

This work used inpatient and outpatient clinical notes from the UW clinical repository. COVID-19-related notes were identified by searching for variations of the terms coronavirus, covid, sars-cov, and sars-2 in notes authored between February 20 and March 31, 2020, resulting in a pool of 92K notes. Samples were randomly selected for annotation from a subset of 53K notes that include at least five sentences and correspond to the note types: telephone encounters, outpatient progress, emergency department, inpatient nursing, intensive care unit, and general inpatient medicine. Multiple note types were used to improve extraction model generalizability.

Early in the outbreak, the UW EHR did not include COVID-19 specific structured data; however, structured fields indicating COVID-19 test types and results were added as testing expanded. We used these structured fields to assign a COVID-19 Test label describing COVID-19 polymerase chain reaction (PCR) testing to each note based on patient test status within the UW system (no data external to UW was used):
• none: patient testing information is not available
• negative: all of the patient's COVID-19 tests within the UW system are negative
• positive: the patient has at least one positive COVID-19 test within the UW system

Given the sparsity of positive and negative notes, CACT is intentionally biased to increase the prevalence of these labels. To ensure adequate positive training samples, the CACT training partition includes 46% none, 5% negative, and 49% positive notes. Ideally, the test set would be representative of the true distribution; however, the expected number of positive labels with random selection is insufficient to evaluate extraction performance. Consequently, the CACT test partition was biased to include 50% none, 46% negative, and 4% positive notes. Notes were randomly selected in equal proportions from the six note types.
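A minimal sketch of this labeling logic, assuming the structured test results are available as a simple list of result strings (the exact field structure is not specified in the paper):

```python
def covid_test_label(pcr_results: list[str]) -> str:
    """Assign the note-level COVID-19 Test label from a patient's PCR
    results in the UW structured fields (hypothetical representation:
    a list of "positive"/"negative" strings; no data external to UW)."""
    if not pcr_results:
        return "none"      # patient testing information is not available
    if "positive" in pcr_results:
        return "positive"  # at least one positive COVID-19 test
    return "negative"      # tested, with only negative results

# Example: covid_test_label([]) -> "none"
#          covid_test_label(["negative", "positive"]) -> "positive"
```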
CACT includes 1,472 annotated notes: 1,028 train and 444 test notes. We created detailed annotation guidelines for COVID-19 and symptoms, using the event-based annotation scheme in Table 1. Each event includes a trigger that identifies and anchors the event and arguments that characterize the event. The annotation scheme includes two types of arguments: labeled arguments and span-only arguments. Labeled arguments (e.g. Assertion) include an argument span, type, and subtype (e.g. present). The subtype label normalizes the span information to a fixed set of classes and allows the extracted information to be directly used in secondary use applications. Span-only arguments (e.g. Characteristics) include an argument span and type but do not include a subtype label, because the argument information is not easily mapped to a fixed set of classes.

Table 1: Event-based annotation scheme (argument, subtype labels, example spans).
COVID: Assertion {present, absent, possible, hypothetical, not patient} ("positive," "low suspicion")
Symptom: Trigger ("cough," "shortness of breath"); Assertion {present, absent, possible, conditional, hypothetical, not patient} ("admits," "denies"); Change {no change, worsened, improved, resolved} ("improved," "continues"); Severity {mild, moderate, severe} ("mild," "required ventilation"); Anatomy ("chest wall," "lower back"); Characteristics ("wet productive," "diffuse"); Duration ("for two days," "1 week"); Frequency ("occasional," "chronic")

For COVID events, the trigger is generally an explicit COVID-19 reference, like "COVID-19" or "coronavirus." For Symptom events, the trigger identifies the specific symptom, i.e. a reported indication of a disorder or disease (e.g. "cough," "wheezing," or "fever"), which is characterized through Assertion, Change, Severity, Anatomy, Characteristics, Duration, and Frequency arguments. Symptoms were annotated for all conditions/diseases, not just COVID-19. Notes were annotated using the BRAT annotation tool [54]. Figure 1 presents BRAT annotation examples.

Most prior medical problem extraction work, including symptom extraction, focuses on identifying the specific problem, normalizing the extracted phenomenon, and predicting an assertion value (e.g. present versus absent). This approach omits many of the symptom details that clinicians are taught to document and that form the core of many clinical notes. This symptom detail describes change (e.g. improvement, worsening, lack of change), severity (e.g. intensity and impact on daily activities), particular characteristics (e.g. productive, dry, or barking for cough), and location. We hypothesize that this symptom granularity is needed for many clinical conditions to improve timely diagnosis and validate diagnosis prediction rules.

Annotation and extraction are scored as a slot filling task, focusing on the information most relevant to secondary use applications. Figure 2 presents the same sentence annotated by two annotators, along with the populated slots for the Symptom event. Both annotations include the same trigger and Frequency spans ("cough" and "intermittent," respectively). The Assertion spans differ ("presenting with" vs. "presenting"), but the assigned subtypes (present) are the same, so the annotations are equivalent for purposes of populating a database: both yield the slot-filled event SSx(trigger="cough", Assertion=present, Frequency="intermittent"). Annotator agreement and extraction performance are assessed using scoring criteria that reflect this slot filling interpretation of the labeling task.
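To make the slot filling view concrete, the Figure 2 event could be represented as a simple record (an illustrative data structure of our own, not the authors' code):

```python
# Slot-filled Symptom event from Figure 2:
# SSx(trigger="cough", Assertion=present, Frequency="intermittent")
event = {
    "type": "Symptom",
    "trigger": "cough",
    "arguments": [
        # Labeled argument: the subtype ("present") populates the slot;
        # the exact Assertion span text is not compared during scoring.
        {"type": "Assertion", "span": "presenting with", "subtype": "present"},
        # Span-only argument: the span text itself is the salient information.
        {"type": "Frequency", "span": "intermittent", "subtype": None},
    ],
}
```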
The Symptom trigger span identifies the specific symptom, whereas for COVID, the trigger anchors the event and the span text is not salient to downstream applications. For labeled arguments, the subtype label captures the most salient argument information, and the identified span is less informative. For span-only arguments, the spans are not easily mapped to a fixed label set, so the selected span contains the salient information. Performance is evaluated using precision (P), recall (R), and F1.

Trigger: Triggers, $T_i$, are represented by a pair (event type, $e_i$; token indices, $x_i$). Trigger equivalence is defined as
$$T_i \equiv T_j \Leftrightarrow (e_i = e_j) \wedge (x_i = x_j).$$

Arguments: Events are aligned based on trigger equivalence. The arguments of events with equivalent triggers are compared using different criteria for labeled arguments and span-only arguments. Labeled arguments, $L_i$, are represented as a triple (argument type, $a_i$; token indices, $x_i$; subtype, $l_i$). For labeled arguments, the argument type, $a$, and subtype, $l$, capture the salient information, and equivalence is defined as
$$L_i \equiv L_j \Leftrightarrow (T_i \equiv T_j) \wedge (a_i = a_j) \wedge (l_i = l_j).$$
Span-only arguments, $S_i$, are represented as a pair (argument type, $a_i$; token indices, $x_i$). Span-only arguments with equivalent triggers and argument types ($a_i = a_j$) are compared at the token level (rather than the span level) to allow partial matches. Partial match scoring is used because partial matches can still contain useful information.

CACT includes 1,472 notes with a 70%/30% train/test split and 29.9K annotated events (5.4K COVID and 24.4K Symptom). Figure 3 contains a summary of the COVID annotation statistics for the train/test subsets. By design, the training and test sets include high rates of COVID-19 infection (present subtype for Assertion and positive subtype for Test Status), with higher rates in the training set. CACT includes high rates of the Assertion hypothetical and possible subtypes. The hypothetical subtype applies to sentences like "She is mildly concerned about the coronavirus" and "She cancelled nexplanon replacement due to COVID-19." The possible subtype applies to sentences like "risk of Covid exposure" and "Concern for respiratory illness (including COVID-19 and influenza)." Test Status pending is also frequent. There is some variability in the endpoints of the annotated COVID trigger spans (e.g. "COVID" vs. "COVID test"); however, 98% of the COVID trigger spans in the training set start with the tokens "COVID," "COVID19," or "coronavirus." Since the COVID trigger span is only used to anchor and disambiguate events, the COVID trigger spans were truncated to the first token of the annotated span in all experimentation and results.

The training set includes 1,756 distinct uncased Symptom trigger spans, 1,425 of which occur fewer than five times. Figure 4 presents the frequency of the 20 most common Symptom trigger spans in the training set by Assertion subtypes present, absent, and other (possible, conditional, hypothetical, or not patient). The extracted symptoms in Figure 4 were manually normalized to aggregate different extracted spans with similar meanings (e.g. "sob" and "short of breath" → "shortness of breath"; "febrile" and "fevers" → "fever"). Table 6 in the Appendix presents the symptom normalization mapping, provided by a medical doctor. These 20 symptoms account for 62% of the training set Symptom events. There is ambiguity in delineating between some symptoms and other clinical phenomena (e.g. exam findings and medical problems), which introduces some annotation noise. All annotation was performed by four UW medical students in their fourth year.
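Annotator agreement (reported below) and extraction performance are both computed with these equivalence criteria; a minimal sketch, using hypothetical dict-based triggers and arguments, might look like:

```python
def triggers_equal(t1: dict, t2: dict) -> bool:
    # T_i = T_j iff the event types and token indices match exactly.
    return t1["type"] == t2["type"] and t1["indices"] == t2["indices"]

def labeled_args_equal(a1: dict, a2: dict) -> bool:
    # Labeled arguments match on argument type and subtype; the span
    # indices are not compared, per the slot filling interpretation.
    return a1["type"] == a2["type"] and a1["subtype"] == a2["subtype"]

def token_level_f1(gold_indices: set[int], pred_indices: set[int]) -> float:
    # Span-only arguments are scored at the token level, so partially
    # overlapping spans receive partial credit.
    tp = len(gold_indices & pred_indices)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred_indices), tp / len(gold_indices)
    return 2 * p * r / (p + r)

# Example: gold tokens {3, 4, 5} vs. predicted {4, 5} -> F1 = 0.8
```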
After the first round of annotation, annotator disagreements were carefully reviewed, the annotation guidelines were updated, and annotators received additional training. Additionally, potential COVID triggers were pre-annotated using pattern matching ("COVID," "COVID-19," "coronavirus," etc.) to improve the recall of COVID annotations. Pre-annotated COVID triggers were modified as needed by the annotators, including removing, shifting, and adding trigger spans. Figure 5 presents the annotator agreement for the second round of annotation, which included 96 doubly annotated notes. For labeled arguments, F1 scores are micro-averaged across subtypes.

Event extraction tasks, like ACE05 [55], typically require prediction of the trigger span, the event type, argument spans, and argument types. The CACT annotation scheme differs from this configuration in that labeled arguments require both the argument type (e.g. Assertion) and the subtype (e.g. present, absent, etc.) to be predicted. Resolving the argument subtypes requires a classifier with additional predictive capacity. We implement a span-based, end-to-end, multi-layer event extraction model that jointly predicts all event phenomena, including the trigger span, event type, and argument spans, types, and subtypes. Figure 6 presents our Span-based Event Extractor framework, which differs from prior related work in that multiple span classifiers are used to accommodate the argument subtypes.

For each span classifier, $c$, an attention score is computed at each token position $t$ from the bi-LSTM hidden state, $h_t$, as
$$\alpha_{c,t} = w_{\alpha,c} \, h_t,$$
where $w_{\alpha,c}$ is a learned $1 \times 2v_h$ vector. For span classifier $c$, span $i$, and token position $t$, the attention weights are calculated by normalizing the attention scores as
$$a_{c,i,t} = \frac{\exp(\alpha_{c,t})}{\sum_{t'=\mathrm{start}(s_i)}^{\mathrm{end}(s_i)} \exp(\alpha_{c,t'})},$$
where $\mathrm{start}(s_i)$ and $\mathrm{end}(s_i)$ denote the start and end token indices of span $s_i$. The span representation for classifier $c$ and span $i$ is calculated as the attention-weighted sum of the bi-LSTM hidden states,
$$g_{c,i} = \sum_{t=\mathrm{start}(s_i)}^{\mathrm{end}(s_i)} a_{c,i,t} \, h_t,$$
and label scores are predicted as
$$\phi_c(s_i) = w_{s,c} \, \mathrm{FFNN}_{s,c}(g_{c,i}),$$
where $\phi_c(s_i)$ yields a vector of label scores of size $|L_c|$, $\mathrm{FFNN}_{s,c}$ is a non-linear projection from size $2v_h$ to $v_s$, and $w_{s,c}$ has size $|L_c| \times v_s$. The trigger prediction label set is $L_{trigger} = \{null, COVID, Symptom\}$. Separate classifiers are used for each labeled argument (Assertion, Change, Severity, and Test Status), each with a label set comprising a null label and the argument's subtypes. Argument roles are scored for pairs of trigger and argument spans as
$$\psi_d(s_j, s_k) = w_{r,d} \, \mathrm{FFNN}_{r,d}([g_j; g_k]),$$
where $\psi_d(s_j, s_k)$ is a vector of size 2, $\mathrm{FFNN}_{r,d}$ is a non-linear projection from size $4v_h$ to $v_r$, and $w_{r,d}$ has size $2 \times v_r$.

Span pruning: To limit the time and space complexity of the pairwise argument role predictions, only the top-K spans for each span classifier, $c$, are considered during argument role prediction. The span score is calculated as the maximum label score in $\phi_c$, excluding the null label score.

The model configuration was selected using 3-fold cross validation (CV) on the training set. During initial experimentation, Symptom Assertion extraction performance was high for the absent subtype and lower for present. The higher absent performance is primarily associated with the consistent presence of negation cues, like "denies" or "no." While there are affirming cues, like "reports" or "has," the present subtype is often implied by a lack of negation cues; for example, an entire sentence could be "Short of breath." To provide the Symptom Assertion span classifier with a more consistent span representation, we replaced each Symptom Assertion span (token indices) with the Symptom trigger span in each event and found that performance improved. We extended this trigger span substitution approach to all labeled arguments (Assertion, Change, Severity, and Test Status) and again found that performance improved. By substituting the trigger spans for the labeled argument spans, trigger and labeled argument prediction is roughly treated as a multi-label classification problem, although the model does not constrain trigger and labeled argument predictions to be associated with the same spans. As previously discussed, the scoring routine does not consider the span indices of labeled arguments.
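A minimal PyTorch sketch of one such span classifier, mirroring the attention-pooling equations above (our own illustration with arbitrary dimensions, not the authors' implementation; h is assumed to be the bi-LSTM hidden-state sequence with 2v_h units per token):

```python
import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    """One span classifier c: attention-pooled span representation g_{c,i}
    followed by label scores phi_c(s_i)."""
    def __init__(self, hidden_dim: int, proj_dim: int, num_labels: int):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1, bias=False)         # w_alpha,c
        self.ffnn = nn.Sequential(                               # FFNN_{s,c}
            nn.Linear(hidden_dim, proj_dim), nn.ReLU())
        self.out = nn.Linear(proj_dim, num_labels, bias=False)   # w_{s,c}

    def forward(self, h: torch.Tensor, spans: list[tuple[int, int]]):
        scores = []
        for start, end in spans:                   # inclusive token indices
            h_span = h[start:end + 1]              # (span_len, hidden_dim)
            alpha = self.attn(h_span).squeeze(-1)  # attention scores alpha
            a = torch.softmax(alpha, dim=0)        # normalized weights a
            g = (a.unsqueeze(-1) * h_span).sum(0)  # span representation g
            scores.append(self.out(self.ffnn(g)))  # label scores phi_c
        return torch.stack(scores)                 # (num_spans, num_labels)

# Example: for the trigger classifier, num_labels = 3 (null, COVID, Symptom).
# h = torch.randn(20, 600); SpanClassifier(600, 150, 3)(h, [(0, 2), (5, 5)])
```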
Extraction performance for some Symptom arguments approaches annotator agreement, while Frequency extraction performance is lower than annotation agreement. Change, Severity, and Characteristics extraction performance is low, again likely related to the low annotator agreement for these arguments.

Existing symptom extraction systems do not extract all of the phenomena in the CACT annotation scheme; however, we compared the performance of our Span-based Event Extractor to MetaMapLite++ for symptom identification and assertion prediction. MetaMapLite++ is an analysis pipeline that includes a UMLS concept extractor with assertion prediction [20]. Table 3 presents the performance of MetaMapLite++ on the CACT test set. The spans associated with medical concepts in MetaMapLite++ differ slightly from our annotation scheme. For example, "dry cough" was extracted by MetaMapLite++ as a symptom, whereas our annotation scheme labels "cough" as the symptom and "dry" as a characteristic. To account for this difference, Table 3 presents the performance of MetaMapLite++ for two trigger equivalence criteria: 1) exact trigger match, as defined in Section 3, and 2) partial trigger match, in which any overlap between the predicted and gold trigger spans counts as a match. MetaMapLite++ also misses symptom acronyms and abbreviations that frequently occur in our data, for example "N/V/D" for "nausea, vomiting, and diarrhea." Table 3 only reports the performance for the UMLS "Sign or Symptom" semantic type. When all UMLS semantic types are used, the recall improves; however, the precision is extremely low (P = 0.02-0.03).

Table 4 presents the assertion prediction performance for both systems, only considering the subset of predictions with exact trigger matches (i.e. assertion prediction performance is assessed without incurring a penalty for trigger identification errors). The number of gold assertion labels ("# Gold") differs between the two systems, because only events with exactly matched triggers are scored.

The COVID-19 prediction task in Section 5 uses emergency department (ED), outpatient progress, and telephone notes from the UW clinical repository, including 27.5K telephone notes. This data set has some overlap with the data set used in Section 4.1 but is treated as a separate data set in this COVID-19 prediction task. The notes in the CACT training set are less than 1% of the notes used in this secondary use application.

Features: Symptom information was automatically extracted from the notes using the Span-based Event Extractor trained on CACT. The extracted symptoms were normalized using the mapping in Table 6 in the Appendix. Each extracted symptom with an Assertion value of "present" was assigned a feature value of 1. The 24 identified predictors of COVID-19 from existing literature (see Section 2) were mapped to 32 distinct fields within the UW EHR and used in experimentation. The identified fields are listed in Table 9 of the Appendix. For the coded data (e.g. structured fields like "basophils"), experimentation was limited to this subset of literature-supported COVID-19 predictors, given the limited number of positive COVID-19 tests in this data set. Within the 7-day history, features may occur multiple times (e.g. multiple temperature measurements). For each feature, the series of values was represented as the minimum or maximum of the values, depending on the specific feature.
For example, temperature was represented as the maximum of the measurements to detect any fever, and oxygen saturation was represented as the minimum of the values to capture any low-oxygenation events. Table 9 in the Appendix includes the aggregating function, f, used for each field. Where symptom features were missing, the feature value was set to 0. For features from the structured EHR data, which are predominantly numerical, missing features were assigned the mean feature value in the set used to train the COVID-19 prediction model.

Model: COVID-19 was predicted using the Random Forest framework, because it facilitates nonlinear modeling with interdependent features and interpretability analyses (the Scikit-learn Python implementation was used [57]). Alternative prediction algorithms include Logistic Regression, SVM, and FFNN. Logistic Regression assumes feature independence and linearity, which is not valid for this task; for example, the feature set includes both the symptom "fever" and temperature measurements (e.g. a temperature of 38 °C or higher), which are clearly interdependent.

Experimental paradigm: The available data was split into train/test sets using an 80%/20% split by patient, although training and evaluation were performed at the test level (i.e. each COVID-19 test result is a sample). Performance was evaluated using the receiver operating characteristic (ROC) curve and the associated area under the curve (AUC). Given the relatively small number of positive samples, the train/test splits were randomly created 1,000 times through repeated hold-out testing [59]. Kim [59] demonstrated that repeated hold-out testing can improve the robustness of the results in low-resource settings. For each train/test split, the AUC was calculated, and an average AUC was calculated across all hold-out iterations. The random hold-out iterations yield a distribution of AUC values, which facilitates significance testing. The significance of the AUC performance was assessed using a two-sided t-test. The Random Forest models were tuned using 3-fold cross validation on the training set and evaluated on the withheld test set.

COVID-19 prediction experimentation included three feature sets: structured (32 structured EHR fields), notes (automatically extracted symptoms), and all (combination of structured fields and automatically extracted symptoms). Separate models were trained and evaluated for each note type (ED, progress, and telephone) and feature set (structured, notes, and all). The selected Random Forest hyperparameters are summarized in Table 10 in the Appendix.

Figure 7 presents the ROC for the COVID-19 predictors with the average AUC across repeated hold-out partitions. The AUC evaluates model performance across all operating points, including operating points that are not clinically meaningful, for example an extremely low true positive rate (TPR). To address this AUC limitation and provide an alternative method for comparing feature sets, we selected a fixed operating point on the ROC, comparing the false positive rate (FPR) at a specific TPR. We selected a TPR (sensitivity) of 80%, as a value that has clinical value for identifying individuals with COVID-19, and we examined the FPR (1 − specificity) at this fixed TPR. In this use case, we are attempting to see how well structured EHR fields and symptoms perform compared to the reference standard of a laboratory PCR test. Given the relatively small sample size and low proportion of positive COVID-19 tests, the SHAP [58] impact values presented in Figure 8 were aggregated across repeated hold-out runs.
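A minimal sketch of this repeated hold-out paradigm with scikit-learn (hyperparameters, tuning, and the exact split mechanics are simplified here; the paper tunes each Random Forest with 3-fold CV on the training split):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GroupShuffleSplit

def repeated_holdout(X, y, patient_ids, n_iter=1000):
    """Repeat 80%/20% train/test splits grouped by patient; each COVID-19
    test result is a sample. Returns the per-split AUC and the FPR at a
    fixed 80% TPR (the sensitivity of clinical interest)."""
    aucs, fprs_at_80tpr = [], []
    splitter = GroupShuffleSplit(n_splits=n_iter, test_size=0.2)
    for train_idx, test_idx in splitter.split(X, y, groups=patient_ids):
        model = RandomForestClassifier().fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], probs))
        fpr, tpr, _ = roc_curve(y[test_idx], probs)
        fprs_at_80tpr.append(np.interp(0.80, tpr, fpr))  # FPR at TPR = 0.8
    return np.array(aucs), np.array(fprs_at_80tpr)

# The AUC distributions from two feature sets (e.g. "structured" vs. "all")
# can then be compared with a two-sided t-test, e.g.
# scipy.stats.ttest_ind(aucs_all, aucs_structured).
```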
Figure 8 presents the highest-impact symptom features for each note type; for progress notes these include fever, myalgia, respiratory symptoms, cough, and ill, and for telephone notes, fever, cough, myalgia, fatigue, and sore throat. The differences in symptom importance by note type reflect differences in documentation across the clinical settings (e.g. emergency department, outpatient, and tele-visit).

We present CACT, a novel corpus with detailed annotations for COVID-19 diagnoses, testing, and symptoms. CACT includes 1,472 unique notes across six note types, with more than 500 notes from patients with future positive COVID-19 tests. We implement the Span-based Event Extractor, which jointly extracts all annotated phenomena, including argument types and subtypes. The Span-based Event Extractor achieves near-human performance in the extraction of COVID triggers (0.97 F1), Symptom triggers (0.83 F1), and Assertions (0.79 F1). The performance for several attributes (e.g. Change, Severity, Characteristics, Duration, and Frequency) is lower than that of Assertion. This lower performance may partly be due to the focus on COVID-19, where clinicians' notes: 1) are highly structured around the presence/absence of a certain set of symptoms, 2) usually describe a single consultation per patient within 7 days of COVID-19 testing, and 3) focus on assessing the need for COVID-19 testing and in-person ambulatory or ED care.

In a COVID-19 prediction task, automatically extracted symptom information significantly improved the prediction of COVID-19 test results beyond just using structured data, and the top predictive symptoms include fever, cough, and myalgia. This application is limited by the size and scope of the available data. CACT only includes notes from early in the COVID-19 pandemic (February-March 2020), and our understanding of the presentation of COVID-19 has evolved since that time. CACT was annotated for all symptoms described in the clinical narrative, not just known symptoms of COVID-19, so the annotated symptoms cover most of the symptoms currently known to be associated with COVID-19. However, CACT includes only infrequent references to losses of taste or smell. Additional annotation of notes from later in the pandemic is needed to address this gap.

In future work, the extractor will be applied to a much larger set of clinical ambulatory care and ED notes from UW. The extracted symptom information will be combined with routinely coded data (e.g. diagnosis and procedure codes, demographics) and automatically extracted data (e.g. social determinants of health [61]). Using these data, we will develop models for predicting risk of COVID-19 infection. These models could better inform clinical indications for prioritizing testing, and the presence or absence of certain symptoms could be used to inform clinical care decisions with greater precision. This future work may also identify combinations of symptoms (including their presence, absence, severity, sequence of appearance, duration, etc.) associated with clinical outcomes and health service utilization, such as a deteriorating clinical course and the need for repeat consultation or hospital admission. The use of detailed symptom information will be highly valuable in informing these models, but potentially only with the level of nuance that our extraction models provide. For the COVID-19 pandemic, we anticipate that the extraction model presented here will be of increasing value to clinical researchers as the need to distinguish COVID-19 from other viral and bacterial respiratory infections becomes more pressing.
As the pandemic subsides with widespread vaccination, we will return to more typical "winter respiratory infection/influenza" seasons, where routine medical care involves differentiating COVID-19 from many other types of viral infections and identifying individuals who require COVID-19 testing. Symptom extraction models, like the model presented here, may provide the data needed to determine the risk of certain infections and triage the need for testing. We intend to explore the value of this symptom annotation scheme and extraction approach for clinical conditions where multiple consultations precede a diagnosis and symptom attributes, like change and severity, are even more important. We are especially interested in medical conditions for which delayed or missed diagnoses are known to lead to patient harm [9]. We intend to examine data sets associated with other acute and chronic conditions to investigate symptom patterns that could be used to more efficiently and accurately identify patients with these conditions. Specifically, we intend to further develop the symptom extractor to reduce diagnostic delay for lung cancer, which is known to present at a later stage. Lung cancer diagnosis often occurs after many consultations in ambulatory settings, and there may be opportunities to more quickly identify high-risk individuals based on symptoms.

Declaration of competing interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
World Health Organization, Coronavirus disease (COVID-19) weekly epidemiological update
A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys
Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: Summary of a report of 72,314 cases from the Chinese Center for Disease Control and Prevention
Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis
Clinical features of COVID-19
COVID-19 transmission within a family cluster by presymptomatic carriers in China
Presymptomatic transmission of SARS-CoV-2, Singapore
Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease
Serious misdiagnosis-related harms in malpractice claims: the "Big Three" of vascular events, infections, and cancers
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement
ACL Workshop on NLP for COVID-19
World Health Organization, Global literature on coronavirus disease
Comprehensive named entity recognition on CORD-19 with distant or weak supervision
Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
Annotating a corpus of clinical text records for learning to recognize symptoms automatically
2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text
Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications
Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program
MetaMap Lite: an evaluation of a new Java implementation of MetaMap
A new way of representing clinical reports for rapid phenotyping
Assertion modeling and its role in clinical phenotype identification
Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation
Joint entity and relation extraction based on a hybrid neural network
Event detection with neural networks: A rigorous empirical evaluation
Extracting entities with attributes in clinical text via joint deep learning
A deep neural network model for joint entity and relation extraction
Extracting medications and associated adverse drug events using a natural language processing system combining knowledge base and deep learning
Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods
End-to-end neural coreference resolution
Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction
A general framework for information extraction using dynamic span graphs
Entity, relation, and event extraction with contextualized span representations
Deep contextualized word representations
BERT: Pre-training of deep bidirectional transformers for language understanding
Generalized autoregressive pretraining for language understanding
BERT-based multi-head selection for joint entity-relation extraction
Extracting multiple-relations in one-pass with pre-trained transformers
Clinical Natural Language Processing Workshop
MIMIC-III, a freely accessible critical care database
Predictors of mortality in hospitalized COVID-19 patients: A systematic review and meta-analysis
Predictors of adverse prognosis in COVID-19: A systematic review and meta-analysis
Predictive symptoms and comorbidities for severe COVID-19 and intensive care unit admission: a systematic review and meta-analysis
A novel simple scoring model for predicting severity of patients with SARS-CoV-2 infection
Risk factors for adverse clinical outcomes with COVID-19 in China: a multicenter, retrospective, observational study
Clinical characteristics and prognostic factors for intensive care unit admission of patients with COVID-19: Retrospective study using machine learning and natural language processing
Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal
Epidemiology and clinical features of COVID-19: A review of current literature
Risk factors of severe disease and efficacy of treatment in patients infected with COVID-19: A systematic review, meta-analysis and meta-regression analysis
Detection of COVID-19 infection from routine blood exams with machine learning: A feasibility study
Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
Personalized predictive models for symptomatic COVID-19 patients using basic preconditions: Hospitalizations, mortality, and the need for an ICU or ventilator
BRAT: a web-based tool for NLP-assisted text annotation
ACE 2005 multilingual training corpus LDC2006T06
PyTorch: An imperative style, high-performance deep learning library
Scikit-learn: machine learning in Python
From local explanations to global understanding with explainable AI for trees
Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap
A symptom-based rule for diagnosis of COVID-19
Annotating social determinants of health using active learning, and characterizing determinants using neural event extraction

Table 9: Structured fields from UW EHR used to predict COVID-19 infection.
f indicates the function used to aggregate multiple measurements/values. Fields that measure the same phenomena were treated as a single feature.

feature: fields in UW EHR (f)
age: "AgeIn2020" (max)
ALT: "ALT (GPT)" (max)
albumin: "Albumin" (min)
ALP: "Alkaline Phosphatase (Total)" (max)
AST: "AST (GOT)" (max)
basophils: "Basophils" and "% Basophils" (min)
calcium: "Calcium" (min)
CRP: "CRP, high sensitivity" (max)
D-dimer: "D Dimer Quant" (max)
eosinophils: "Eosinophils" and "% Eosinophils" (min)
GGT: "Gamma Glutamyl Transferase" (max)
gender: "Gender" (last)
heart rate: "Heart Rate" and "HR" (max)
LDH: "Lactate Dehydrogenase" (max)
lymphocytes: "Lymphocytes" and "% Lymphocytes" (min)
monocytes: "Monocytes" (max)
neutrophils: "Neutrophils" and "% Neutrophils" (max)
oxygen saturation: "Oxygen Saturation" and "O2 Saturation (%)" (min)
platelets: "Platelet Count" (min)
PT: "Prothrombin Time Patient" and "Prothrombin INR" (max)
respiratory rate: "Respiratory Rate" (max)
temperature: "Temperature -C" and "Temperature (C)" (max)
troponin: "Troponin I" and "Troponin I Interpretation" (max)
WBC count: "WBC" (min)
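A minimal sketch of applying these aggregation functions to a patient's 7-day measurement history, assuming a hypothetical long-format table with feature and value columns (only a subset of the Table 9 mapping is shown):

```python
import pandas as pd

# Aggregation function per feature, following Table 9 (subset shown).
FIELD_AGG = {
    "temperature": "max",        # detect any fever
    "oxygen saturation": "min",  # capture any low-oxygenation event
    "heart rate": "max",
    "lymphocytes": "min",
    "gender": "last",
}

def aggregate_history(measurements: pd.DataFrame) -> pd.Series:
    """Collapse a 7-day measurement history into one value per feature."""
    out = {}
    for feature, group in measurements.groupby("feature"):
        values, agg = group["value"], FIELD_AGG.get(feature, "max")
        if agg == "last":
            out[feature] = values.iloc[-1]
        else:
            out[feature] = values.max() if agg == "max" else values.min()
    return pd.Series(out)

# Example: two temperature readings and one oxygen saturation in the window.
df = pd.DataFrame({"feature": ["temperature", "temperature", "oxygen saturation"],
                   "value": [37.2, 38.6, 91.0]})
print(aggregate_history(df))  # temperature -> 38.6, oxygen saturation -> 91.0
```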