key: cord-0252663-5whp5r9v
authors: Wang, Jingqi; Pham, Huy Anh; Manion, Frank; Rouhizadeh, Masoud; Zhang, Yaoyun
title: COVID-19 SignSym: A fast adaptation of general clinical NLP tools to identify and normalize COVID-19 signs and symptoms to OMOP common data model
date: 2020-07-13
journal: ArXiv
DOI: nan
sha: 3ff96010f7d1a9a5e759d02f36884ebdab316722
doc_id: 252663
cord_uid: 5whp5r9v

The COVID-19 pandemic swept across the world rapidly infecting millions of people. An efficient tool that can accurately recognize important clinical concepts of COVID-19 from free text in electronic health records will be significantly valuable to accelerate various applications of COVID-19 research. To this end, the existing clinical NLP tool CLAMP was quickly adapted to COVID-19 information and generated an automated tool called COVID-19 SignSym, which can extract and signs/symptoms and their eight attributes such as temporal information and negations from clinical text. The extracted information is also mapped to standard clinical concepts in the common data model of OHDSI OMOP. Evaluation on clinical notes and medical dialogues demonstrated promising results. It is freely accessible to the community as a downloadable package of APIs (https://clamp.uth.edu/covid/nlp.php). We believe COVID-19 SignSym will provide fundamental supports to the secondary use of EHRs, thus accelerating the global research of COVID-19.

The COVID-19 pandemic 1 swept across the world rapidly infecting more than two million people in the US and resulting in the death of almost 120,000 at a mortality rate of about 6%. 2 Scientists and researchers from multiple organizations world widely have been working collaboratively targeting effective prevention and treatment strategies. 3 Research findings, data resources, informatics tools and technologies are being openly shared, aiming to speed up the fight against such emerging pandemic. 45 Facilitated by PubMed, large datasets of literature articles relevant to COVID-19 are being accumulated and shared at a rapid pace in the medical community. 6 For example, the COVID-19 Open Research Dataset (CORD-19) has already accumulated more than 75, 000 full text articles. 6 Based on such resources, many tools have been developed using natural language processing (NLP) techniques to unlock COVID-19 information from literature, including tools of search engines, information extraction and knowledge graph building. 78 As another important data source for COVID-19 research, Electronic Health Records (EHRs) store the clinical data of COVID-19 patients, which are critical for various applications, such as clinical decision support, predictive modeling and phenotyping-based cohort stratification. 910 An efficient tool that can accurately recognize important clinical concepts of COVID-19 from clinical notes will be significantly valuable to save time and accelerate the chart-review workflow.

Despite that several large consortia have been formed to construct large clinical data networks for COVID-19 research, such as The National COVID Cohort Collaborative (N3C) 11 and the international EHR-derived COVID-19 Clinical Course Profiles (4CE) 12 , very few informatics tools have been developed for clinical notes of COVID-19. As far as the authors know, the only tool available in public is MedTagger. 13 Based on a pre-collected list of symptoms and synonyms, MedTagger recently provides a rule-based tool, which mainly extracts COVID-19 sign/symptoms and their three attributes of certainty, status (i.e., HistoryOf or Present) and experiencer (i.e., Patient or others). 14 However, limited sign/symptom lists may not be sufficient to recognize or normalize the varying expressions in clinical text; other important attributes critical to unfold the clinical course and prognosis of patients are also not recognized in this tool, including the onset time, severity, course and body location. On the other hand, some existing clinical NLP tools can already recognize signs/symptoms and a more comprehensive set of attributes with good performances. Given that COVID-19 signs/symptoms are a subset of a more general scope of clinical problems, existing tools can be leveraged and quickly adapted for COVID-19 information extraction.

Therefore, in this study, we built an automatic NLP tool, named as COVID-19 SignSym, to extract COVID-19 signs/symptoms and their eight attributes (body location, severity, temporal expression, subject, condition, uncertainty, negation, and course), by adapting existing pipelines in the CLAMP software (Clinical Language Annotation, Modeling, and Processing Toolkit). 1516 The extracted entities will also be normalized to standard terms in OHDSI OMOP CDM (Observational Medical Outcomes Partnership, Common Data model) 17 automatically. The set of signs and symptoms of COVID-19 is collected from four resources, including the case record form of WHO (World Health Organization), National COVID Cohort Collaborative, dictionaries in MedTagger and in-house dictionaries from John Hopkins. In total, 55 signs and symptoms are collected from these sources. UMLS CUIs of these signs and symptoms are assigned manually, their synonyms in UMLS are also extracted to extend the dictionary. A hybrid method combining deep learning models, lexicons and pattern-based rules is used to build COVID-19 SignSym. We believe this tool will provide fundamental supports to the secondary use of EHRs, thus accelerating the global research of COVID-19. Figure 1 illustrates an overview of the workflow for building COVID-19 SignSym. This workflow mainly consists of five steps: (1) Information model design to define the information scope of COVID-19, i.e. semantic types of clinical concepts and relations, to be extracted; (2) Sign/symptom collection to gather COVID-19 signs/symptoms from multiple sources; (3) Information extraction and normalization to extract and normalize COVID-19 signs/symptoms and eight types of attributes, including body location, severity, temporal expression, subject, condition, uncertainty, negation, and course. Hybrid approaches are used to adapt existing NLP pipelines in CLAMP for COVID-19 information extraction. Lexicons of COVID-19 signs/symptoms and pattern-based rules are integrated with CLAMP to optimize the performance. All signs/symptoms in free text are extracted first, which are filtered by lexicons and UMLS CUIs to maintain COVID-19 signs/symptoms. (4) OMOP Mapping to convert the output information into the format of OMOP CMD. 

Four clinical datasets were used for building and evaluating the pipeline: (1) MIMICIII: this data set contains clinical notes from intensive patient care. 200 discharge summaries are randomly selected, with 50% for model tuning and 50% for open test. In total, the selected dataset contains 17, 208 sentences and 329, 044 tokens.

(2) Medical dialogues: this data set contains medical dialogues related to COVID-19 collected from an online website between patients and doctors. 18 Fifty dialogues were randomly selected for external evaluation of COVID-19 SignSym. In total, the selected dataset contains 1,162 sentences and 22,324 tokens.

(3) Clinical notes from Johns Hopkins: This dataset contains 334 clinical notes of 40 patients and includes relevant note types such as H&P, Critical Care Notes, Progress Notes, and ED Notes, focusing specifically on the notes created within 48 hours before and after hospital admission. Notes are pre-processed by Hopkins in-house section identification tools, and only the relevant narrative parts, particularly the chief complaint and history of the present illness sections are extracted. For each of the 40 patients, each symptom was labeled as present or not-present, resulting in over 467 manually annotated symptoms. These gold standard labels are then used to validate COVID-19 SignSym. In total, the clinical notes in this dataset contain 13,397 sentences and 121,802 tokens.

The information model followed by the COVID-19 SignSym is illustrated in Figure 2 . In addition to mentions of signs and symptoms, eight important attributes and their relations are also recognized: (1) Severity: indicates the severity degree of a sign/symptom; (2) 

Leveraging the community efforts, COVID-19 signs and symptoms are collected from five sources: (1) WHO case record form: 26 signs and symptoms were collected from the SIGNs AND SYMPTOMS ON ADMISSION section of the case record form provided by WHO. 19 (2) National COVID Cohort Collaborative: 15 signs and symptoms were collected from the diagnosis table shared by the national COVID cohort collaborative as phenotyping information. 11 (3) MedTagger Lexicon: 17 signs and symptoms together with their 136 synonyms were collected from the lexicon in MedTagger. 14 (4) Lexicon from Johns Hopkins: 14 signs and symptoms, and 337 synonyms were collected from an in-house lexicon of Johns Hopkins. UMLS CUIs of these signs and symptoms (in total 124) were also assigned manually, with their UMLS synonyms collected. After removing redundancy, 55 signs and symptoms and 2, 022 synonyms were collected from these four sources. Text processing

The disease-attribute pipeline in CLAMP was adapted for COVID-19 information extraction in this study. This pipeline is built to automatically extract mentions of problems and their eight attributes from clinical text. The definition of problems follows that used in the i2b2 2010 shared task, which consists of eleven semantic types in UMLS (e.g., sign or symptom, pathologic functions, disease or syndrome, etc.). 20 Besides, the definitions of attributes follows that used in the SemEval 2015 Shared Task 14. 21

Adapting CLAMP for COVID-19 Sign/Sym: Besides, the lexicons collected previously are also used in an additional step of dictionary-lookup, to improve the coverage of recognized COVID-19 signs and symptoms. Furthermore, regular expressions and rules are also applied in a postprocessing step to boost the performance of attribute recognition.

Mentions of problems and attributes are normalized to standard concepts using the UMLS encoder module of CLAMP, which are also built on a hybrid method of semantic similarity based ranking and rules of concept-prevalence in UMLS. 22 Both CUIs and preferred terms in UMLS will be output for each recognized entity.

Filtering of COVID-19 Signs/Symptoms: Once medical problems are automatically recognized and normalized to UMLS CUIs, they will be filtered by the pre-collected lexicons and CUIs of COVID-19 signs and symptoms.

The remaining signs/symptoms and their attributes will also be mapped to standard concepts in OMOP CDM. The OMOP encoder module of CLAMP is used for this purpose, which applies a similar approach as in the UMLS encoder module, with a different scope of standard concepts and identifiers.

Evaluation criteria: (1) The performances of NER and relation extraction are evaluated using precision, recall and F-measure (F1); (2) The performance of concept normalization is evaluated using accuracy; (3) The performance of patient-level COVID-10 diagnosis is evaluated using precision, recall and F-measure. 

Information extraction: Performances of COVID-19 SignSym for information extraction are illustrated in Table 2 ; both performances on clinical text and medical dialogues are reported. The 95% confidence intervals (CIs) are also reported in Table 2 , by considering each clinical note or each post of medical dialogues as a sample. Promising results were achieved on sign/symptom extraction, with a F-measure of 0.992 ± 0.008 on clinical text and 0.99 ± 0.01 on medical dialogue. As for recognizing attributes of signs/symptoms, the tool yielded better performances on clinical text than on medical dialogue (e.g., F-measure: Concept Normalization: Based on manual check of 1, 000 gold standard entities and their automatically assigned CUIs, COVID SignSym obtained an accuracy of 95% for concept normalization.

In comparison with manually assigned signs/symptoms of each patients, the tool yielded a precision of 0.928, a recall of 0.957, and a Fmeasure of 0.942.

It is freely accessible to the community via a downloadable package of APIs 19 . A visualization of the output format is illustrated in Figure 3 . 

There is an urgent demand of automatic tools that can extract COVID-19 information from clinical text. This paper presents our preliminary work to quickly adapt an existing high-performance clinical NLP tool for COVID-19 sign/symptom extraction and normalization, leveraging COVID-19 lexicons provided from multiple resources. The tool of COVID-19 SignSym is evaluated on clinical notes and medical dialogues. Experimental results demonstrate that it achieves promising results in practical settings. The workflow of this study can be generalized to other use cases, where existing clinical NLP tools need to be customized for specific information needs within a short time.

Errors present in the outputs of COVID-19 SignSym are analyzed carefully for further improvement. One type of common errors is partial recognition of named entities. For example, "cough" is recognized, instead of "cough productive of yellow sputum". More patterns of signs and symptoms are needed to improve the coverage. Another type of common errors is related to cross sentence relations. For example, many temporal expressions are not in the same sentence with the relevant signs/symptoms. Besides, some attributes such as conditions are modifying signs/symptoms in a list of multiple items. Document structure and intra sentence relations need to be handled in the next step.

Limitations and future work: this work has several limitations and future works are needed. (1) First, performances on two COVID-19 datasets are evaluated and reported, additional evaluations are needed to further refine the tool and increase its generalizability; (2) Currently, only sign/symptoms and their attributes are extracted, additional works will be conducted for more related information such as comorbidities and medications; (3) The output information will also be mapped to other clinical data standards such as FHIR in the near future, to facilitate clinical operations and other applications.

This paper presents an automatic tool, named as COVID-19 SignSym, to extract and normalize signs/symptoms and their eight attributes from clinical text using NLP techniques. COVID-19 SignSym achieved promising results on clinical notes and medical dialogues. It is freely accessible to the community via a downloadable package of APIs. So far, COVID-19 SignSym has been applied by eight different institutions in academia and industry, which we believe will provide fundamental supports to the secondary use of EHRs for COVID-19 research upon a wide dissemination.

WHO Coronavirus Disease (COVID-19) Dashboard

000 Missing Deaths: Tracking the True Toll of the Coronavirus Outbreak. The New York Times

& Pm, 3:28. WHO launches global megatrial of the four most promising coronavirus treatments

Open science initiatives related to the COVID-1

covid19 Datasets and Machine Learning Projects

COVID-19 Open Research Dataset Challenge (CORD-19

CORD-19: The Covid-19 Open Research Dataset

Informative Ranking of Stand Out Collections of Symptoms: A New Data-Driven Approach to Identify the Strong Warning Signs of

Augmented Curation of Clinical Notes from a Massive EHR System Reveals Symptoms of Impending COVID-19 Diagnosis

National COVID Cohort Collaborative (N3C). National Center for Advancing Translational Sciences

International Electronic Health Record-Derived COVID-19 Clinical Course Profiles: The 4CE Consortium | medRxiv

Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation

CLAMP -a toolkit for efficiently building customized clinical natural language processing pipelines

CLAMP | Natural Language Processing (NLP) Software

Covid-19' queries answered by top doctors | iCliniq

i2b2/VA challenge on concepts, assertions, and relations in clinical text

SemEval-2015 task 14: Analysis of clinical text

UTH_CCB: A report for SemEval 2014 -Task 7 Analysis of Clinical Text

This study is partially supported by the National Center for Advancing Translational Sciences (R44TR003254).