key: cord-0481662-ez30sj0b
title: An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)
authors: Liu, Sijia; Wen, Andrew; Wang, Liwei; He, Huan; Fu, Sunyang; Miller, Robert; Williams, Andrew; Harris, Daniel; Kavuluru, Ramakanth; Liu, Mei; Abu-el-rub, Noor; Schutte, Dalton; Zhang, Rui; Rouhizadeh, Masoud; Osborne, John D.; He, Yongqun; Topaloglu, Umit; Hong, Stephanie S; Saltz, Joel H; Schaffter, Thomas; Pfaff, Emily; Chute, Christopher G.; Duong, Tim; Haendel, Melissa A.; Fuentes, Rafael; Szolovits, Peter; Xu, Hua; Liu, Hongfang; National COVID Cohort Collaborative Natural Language Processing Subgroup; National COVID Cohort Collaborative
date: 2021-10-20
sha: 7a4432825e9527a2930932cee3ee0134e0628ee7
doc_id: 481662
cord_uid: ez30sj0b

Despite the latest advances in clinical natural language processing (NLP), the clinical and translational research community remains reluctant to adopt NLP models because of their limited transparency, interpretability, and usability. In this study, we propose an open NLP development framework and evaluate it by implementing NLP algorithms for the National COVID Cohort Collaborative (N3C). Motivated by the need for information extraction from COVID-19-related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to produce texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested against a single institution's (Mayo) ruleset, yielding F-scores of 0.876, 0.706, and 0.694 on the Mayo, Minnesota, and Kentucky test datasets, respectively. This study, a consortium effort of the N3C NLP subgroup, demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP studies and adoption. Although we use COVID-19 as the use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.

Large consortia such as the eMERGE Network, PCORI, the "All of Us" research program, and the Observational Health Data Sciences and Informatics (OHDSI) collaborative have demonstrated the successes of multi-site initiatives 6, 7, 8, 9. One common challenge faced by these initiatives is, however, the prevalence of clinical information embedded in unstructured text 10. Compared to structured data entry, text is the more conventional way in the healthcare environment to document impressions, clinical findings, assessments, and care plans. Even with the advent of sophisticated EHR systems, studies have shown that capturing health information fully in structured format through data entry is unlikely to happen; instead, a blended model prevails in which physicians use templates when and where possible and dictate the details of a patient visit in text 11. Natural language processing (NLP) has been promoted as having great potential to extract information from text 12. NLP algorithms can generally be categorized as using either symbolic or statistical methods 13. Since the turn of the century, machine learning algorithms (i.e., statistical NLP) have gained increased prominence in clinical NLP research 14.
Nevertheless, a substantial portion of clinical NLP use cases leverage symbolic techniques, given that dictionary- or rule-based methodologies suffice to meet the information needs of many clinical applications under specific use cases. In the context of EHR-based clinical research, NLP has been leveraged to assist information extraction and knowledge conversion at different stages of research, including feasibility assessment, eligibility criteria screening, data element extraction, and text data analytics. As a result, an increasing number of clinical research studies benefit from state-of-the-art NLP solutions, with reported applications ranging from disease-specific studies 15, 16, 17, 18 to drug-related studies 19, 20.

A majority of existing clinical NLP studies are, however, conducted within a mono-institutional environment 13, which may suffer from limited external validity and research inclusiveness. Compared with single-site research, multi-site research potentially offers larger sample sizes, more adequate representation of participant demographics (e.g., age, gender, race, ethnicity, and socioeconomic status), and more diverse investigator expertise, which may ultimately yield a higher level of research evidence 21, 22, 23, 24.

Despite a plethora of recent advances in adopting NLP for clinical research, there have been barriers to the adoption of NLP solutions in clinical and translational research, especially in multi-site settings. The root causes of these barriers fall into two major categories: 1) heterogeneity of ETL (extract, transform, load) processes between differing sites, each with its own disparate EHR environment, and 2) human factor variation in gold standard corpus development processes.

ETL Process Heterogeneity. The challenges faced by NLP development and evaluation in facilitating the secondary use of EHR data originate from the complex, voluminous, and dynamic nature of the data being documented and stored within a heterogeneous set of disparate, institution-specific EHR implementations. Variations in EHR system vendors, data infrastructure (e.g., unified, ontology-driven, and de-centralized), and institutions' modes of operation can lead to idiosyncratic ways in which clinical information is documented, transformed, and represented 25. Collecting these data requires a significant expenditure of effort to locate, retrieve, and link EHR data into a specific format 26. The variability in the ETL processes required to support this high level of data heterogeneity brings additional challenges to the adoption of NLP for clinical and translational research, substantially limiting both the cross-institutional interoperability of developed NLP solutions and the reproducibility of the associated evaluations.

Human factor variation in gold standard corpus development. The process of developing, evaluating, and deploying NLP solutions in both mono- and multi-site environments can be task-specific, iterative, and complex, often involving a multitude of stakeholders with diverse backgrounds 13, 26. A key step prior to model development is corpus annotation, the process of developing a gold standard by marking occurrences of task-defined sets of clinical information, as well as their associated interpretative linguistic features (e.g., certainty, status), within text documents.
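To make the annotation product concrete, the following minimal Python sketch shows how a single gold-standard mention with its span and certainty attribute might be represented. The field names and certainty values are illustrative stand-ins inspired by standoff-style annotation (as produced by tools such as brat), not the N3C annotation schema itself.

from dataclasses import dataclass

# Illustrative representation of one gold-standard annotation: a concept
# mention located by character offsets, plus an interpretative attribute.
# Field names and values are hypothetical, not the N3C guideline's schema.
@dataclass(frozen=True)
class Mention:
    doc_id: str     # identifier of the note containing the mention
    start: int      # character offset where the mention begins
    end: int        # character offset one past the mention's last character
    concept: str    # task-defined concept, e.g. "fever" or "cough"
    certainty: str  # e.g. "positive", "negated", "hypothetical", "possible"

# Example: in "denies fever or chills", the token "fever" (offsets 7-12)
# would be annotated with the concept "fever" and certainty "negated".
gold = Mention(doc_id="note-001", start=7, end=12,
               concept="fever", certainty="negated")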
Due to the complexity of clinical language, creating such gold standard corpora requires a significant expenditure of domain expertise and time, as clinical experts regularly make decisions that directly affect the study cohort, annotation guidelines, and task definitions. Studies have discovered potential biases in clinical decision making and interpretation of clinical guidelines 27, in coding of clinical terminologies 28, and in interpretation of imaging findings 29. This issue can be further exacerbated in multi-site collaborations due to inter-site variations in care practice 30, 31, ultimately affecting the validity and reliability of the resulting gold standard corpus.

A coordinated, transparent, and collaborative platform is therefore needed to promote open team science in NLP algorithm development and evaluation through consensus building, process coordination, and best practice sharing. Building upon our previous work 32, 33, we propose here an open NLP development framework that addresses the aforementioned issues through the following components: 1) an interoperable NLP infrastructure for incorporating different NLP engines, utilizing a clinical common data model for data source interfacing and representation, with the aim of reducing the impact of ETL process heterogeneity; 2) a transparent multi-site participation workflow for corpus development and evaluation, with the aim of addressing the variation in data abstraction and annotation processes between sites; and 3) a user-centric crowdsourcing interface for collaborative ruleset development that enables effective and efficient gathering, synthesis, and fusion of site-specific knowledge and findings. To demonstrate the viability of the framework, we conducted a case study in which we developed, evaluated, and implemented an NLP algorithm for extracting COVID-19 signs and symptoms 34, 35, 36 to support the National COVID Cohort Collaborative (N3C).

The framework itself consists of a data ingestion layer, a processing layer, and a data persistence layer; its architecture is illustrated in Figure 3. The data ingestion layer works as the data collector, with the ability to read text from a configurable variety of data sources, such as relational databases or file systems, and load it into the NOTE table of the OMOP CDM. The processing layer serves as the NLP engine, in which information is extracted from raw text according to a set of heuristic rules. By default, the MedTagger 37 NLP engine is provided as an example implementation, although alternative NLP engines can be substituted by wrapping their respective NLP pipelines to conform to a provided API specification. After term modifiers are attached to the extracted condition mentions by the contextual rules of the ConText algorithm 38, these conditions compose clinical events with temporal information. We opt for a symbolic solution because of its simplicity, transparency, and interpretability: the outcomes are fully deterministic given the definitions of the rules. Because the baseline rulesets and dictionaries are made publicly available, they can be easily refined by users at different sites. The data persistence layer stores the resulting extracted NLP artifacts in the OMOP CDM NOTE_NLP table. Ruleset development is supported through a web interface comprising a rule editor (https://ohnlp4covid-dev.n3c.ncats.io/ie_editor) and the "Dictionary Builder" (https://ohnlp4covid-dev.n3c.ncats.io/dict_builder) page (Figure 2(b)).
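Concretely, the rule-plus-context behavior that the engine executes (and that the editors described next configure) can be sketched as follows. This is a simplified Python illustration, not MedTagger's actual rule syntax or the full ConText implementation; the patterns, trigger terms, and 30-character context window are assumptions made for demonstration only.

import re

# Illustrative concept rules; real rulesets are far richer.
RULES = {
    "fever": re.compile(r"\b(fever|febrile)\b", re.I),
    "cough": re.compile(r"\bcough(ing)?\b", re.I),
}

# Crude ConText-style triggers: if one appears in a short window before
# the mention, the certainty modifier changes accordingly.
NEGATION = re.compile(r"\b(no|denies|without)\b", re.I)
HYPOTHETICAL = re.compile(r"\b(if|should|return for)\b", re.I)

def extract(note_text):
    """Return NOTE_NLP-like records: matched text, offset, concept,
    and a term_modifiers string carrying the certainty attribute."""
    records = []
    for concept, pattern in RULES.items():
        for m in pattern.finditer(note_text):
            window = note_text[max(0, m.start() - 30):m.start()]
            certainty = "positive"
            if NEGATION.search(window):
                certainty = "negated"
            elif HYPOTHETICAL.search(window):
                certainty = "hypothetical"
            records.append({
                "lexical_variant": m.group(0),
                "offset": m.start(),
                "concept": concept,
                "term_modifiers": "certainty=" + certainty,
            })
    return records

# "fever" is negated by the preceding "denies"; "cough" stays positive.
print(extract("Patient denies fever but reports a dry cough."))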
Figure 2(c) provides an example of the rule editing interface with the baseline COVID-19 ruleset. Rulesets can be tested in real time by clicking the "Upload and test" button: the rulesets are uploaded, and an NLP engine is generated for testing and debugging purposes. We also provide an example NLP project for extracting signs/symptoms related to COVID-19, developed as a use case for this framework. Elements containing original text, such as text snippets and concept mentions, are truncated before submission.

Table 1 shows the annotation corpora statistics. A COVID-19 sign/symptom ruleset was produced consisting of 17 concepts. The inter-annotator agreement (IAA) of the annotated corpus, measured as F1-score, was 0.686 for Mayo, 0.516 for UMN, and 0.211 for UKen.

Two NLP algorithms were evaluated in this study. One was developed solely on narratives sourced from a single site (Mayo Clinic). The other took the resulting single-site NLP algorithm and fine-tuned it on the annotated training data from two additional sites (UMN and UKen). Table 2 shows the performance of the single-site NLP algorithm, and Table 3 the performance after multi-site fine-tuning. Our experimental results showed that a centralized approach to multi-site NLP algorithm development is suboptimal for advancing the adoption of NLP techniques in the clinical and translational research community, which further supports our proposed federated method. The experiments also demonstrate that the deployment of NLP algorithms for multi-site studies needs to be done at each local site. To ensure the scientific rigor of the data generated, each site needs to perform annotation and evaluation on its own while collectively contributing to NLP algorithm development and refinement.

Since the NLP models are shared as rule-based systems, they can be distributed without the Protected Health Information (PHI) concerns typically associated with sharing language resources. In the proposed workflow, each site evaluates the NLP algorithms for concept extraction by creating a gold standard corpus based on the common annotation guidelines. The federated evaluation can be deployed via cloud computing through a centralized controller from which NLP algorithms are distributed to each institution. NLP Sandbox 1 is an example of such an evaluation framework, which uses Docker 39 containers to encapsulate algorithm implementations. Under this process, evaluation happens only behind each institution's firewall, and only summary statistics on NLP algorithm performance (i.e., no raw data containing PHI) are transferred out. Performance statistics, such as precision, recall, and F1-score, defined according to the experimental setting, can be obtained in near real time and can thus be used as part of continuous development workflows.
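The privacy boundary of this federated evaluation can be sketched as below: each site computes its confusion counts against its local gold standard, and only those three integers cross the firewall, from which the controller derives the reported statistics. This is a conceptual Python illustration, not the NLP Sandbox API; mentions are compared here as exact (concept, start, end) tuples for brevity, whereas the study's overlap-based matching is sketched in the evaluation section below.

# Site side: runs entirely behind the institutional firewall.
def evaluate_locally(nlp_algorithm, local_gold_standard):
    """local_gold_standard: iterable of (note_text, gold_mentions) pairs,
    where mentions are (concept, start, end) tuples. Only aggregate
    counts are returned; no note text or PHI leaves the site."""
    tp = fp = fn = 0
    for note_text, gold_mentions in local_gold_standard:
        predicted = set(nlp_algorithm(note_text))
        gold = set(gold_mentions)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    return {"tp": tp, "fp": fp, "fn": fn}

# Controller side: receives only the summary counts from each site and
# reports per-site precision/recall/F1 in near real time.
def summarize(site_counts):
    report = {}
    for site, c in site_counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        report[site] = {"precision": p, "recall": r, "f1": f1}
    return report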
This federated process offers several benefits. For instance, when conducting error analysis, we discovered that context played an important role in this case study. Error analyses showed that extracting COVID-19 signs/symptoms is not a trivial task: their occurrence is not necessarily attributable to COVID-19, as they may also appear as adverse events or indications of treatment, in instructions or patient education, or as clinical goals or precautions. This posed a challenge not only for annotation but also for NLP algorithm development. One benefit of the federated annotation and development process is that these contexts can be systematically incorporated through local expertise during annotation.

Deployment of a federated development framework requires the participation of multiple sites. Adoption can, however, be hindered by the fact that translating NLP algorithms into implementations is complex, much like the "bench to bedside" process that translates laboratory discoveries into patient care. To facilitate participation in our federated method, we have developed a further suite of tools, such as MedTator 40, and best practice guidelines 41. MedTator, a serverless annotation tool, aims to provide an intuitive and interactive user interface for high-quality annotation corpus generation. The best practice guideline contains detailed instructions for facilitating multi-site annotation practice across the following key activities: task formulation, cohort screening, annotation guideline development, annotation training, annotation production, and adjudication.

Simply having the toolsets available is, however, insufficient. Pragmatically, we have seen a hyper-focus on novel methods in academia, with competing rather than collaborative priorities in NLP algorithm development. Our experience suggests that a collaborative development process is needed to produce truly implementable and useful multi-site NLP solutions. This is one of the key goals we seek to achieve with the Open Health Natural Language Processing (OHNLP) Collaboratory, and we have positioned our framework's workflow to facilitate this task. Additionally, we recognize that this is not simply a software problem; a local workforce is also needed at each institution. By conducting coordinated development of NLP algorithms deployed using our framework for consortium-specific tasks such as those of the N3C, we simultaneously build, locally at each institution, the human workforce necessary for the federated development, evaluation, and implementation of NLP algorithms.

Incorporating standards and interoperability. A common barrier to the widespread adoption of NLP in clinical research is the need to transform inputs and outputs to conform to their place in an overall pipeline. While seemingly straightforward, such a task is difficult without significant prior investment in associated infrastructure and dedicated software development. It is therefore desirable to leverage existing infrastructure where possible and incorporate such effort into the distributed NLP pipeline to reduce the technical burden on the end user. There is, however, significant variation in available infrastructure and data availability among institutions, and creating a solution immediately suitable for all of these environments out of the box would be immensely challenging. For that reason, we sought to leverage existing data modeling efforts that are likely to have already been adopted by academic medical institutions to standardize the data ingestion and output process; in our implementation, we chose the OMOP CDM. It is important to note that standardization as a default only serves to simplify adoption for those who already have a solution complying with the standard and cannot be a comprehensive solution. A purely OMOP CDM-reliant solution is not ideal, as not all institutions have their own OMOP CDM instance, and standing up such an instance just to use a pipeline may impose undue burden.
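For concreteness, the default OMOP CDM I/O path, reading documents from the NOTE table and writing extracted artifacts to NOTE_NLP, might look like the following sketch. It uses sqlite3 purely as a stand-in for an institutional database, shows only an assumed subset of NOTE_NLP columns, and reuses the illustrative extract() function from the processing-layer sketch above; the modular alternatives described next replace exactly this layer.

import sqlite3

# Stand-in connection; in practice the connection string and queries are
# site-specific configuration values.
conn = sqlite3.connect("omop_cdm_example.db")

# Ingestion: pull documents from the OMOP CDM NOTE table.
notes = conn.execute("SELECT note_id, note_text FROM note").fetchall()

# Persistence: write mentions to NOTE_NLP (illustrative column subset).
for note_id, note_text in notes:
    for rec in extract(note_text):  # rule-based sketch from earlier
        conn.execute(
            'INSERT INTO note_nlp '
            '(note_id, lexical_variant, "offset", nlp_system, term_modifiers) '
            'VALUES (?, ?, ?, ?, ?)',
            (note_id, rec["lexical_variant"], str(rec["offset"]),
             "MedTagger", rec["term_modifiers"]),
        )
conn.commit()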
Input/output in our infrastructure is therefore modularized and can be substituted at will: the default OMOP CDM I/O utilizes a variant of SQL-based data extractors/writers, and the specific query and connection strings used can be substituted via plaintext configuration changes. SQL-based I/O is, moreover, not the only supported setting; a variety of other data sources, including Elasticsearch, Google Cloud Storage, Amazon S3, and plain text, are included as configuration-swappable options.

Crowdsourcing algorithm development. To promote collaboration and the sharing of effort among participants in the algorithm development process, we built a crowdsourcing platform through which domain experts can upload, customize, and examine their NLP algorithms in an interactive web application. Users can create keyword-based and rule-based algorithms and instantly test their performance in the online environment. The crowdsourcing platform consists of three modules built on our NLP system to support expert collaboration: a dictionary builder, a regular expression ruleset editor, and a detection result visualization. The dictionary builder extends the keyword collection used by an algorithm; users can customize particular terms from ontology databases such as CIDO 42.

Evaluation metrics. We evaluated the performance of the single-site and multi-site algorithms using precision, recall, and F1-score over the annotated concept mentions, both without and with certainty. A span is represented by the start and end positions of a concept mention. Certainty is an attribute of the concept mention taking the values positive, negated, hypothetical, and possible. For the mention-level evaluation without certainty, when the gold standard mention span and the NLP-detected mention span overlap and the concept type (i.e., the specific sign/symptom, such as fever or cough) is the same, the detection is considered a true positive (TP). If a concept mention exists in the gold standard annotation but is not detected by the NLP algorithm, or the spans overlap but the concept types do not match, it is counted as a false negative (FN). If a concept mention is detected by the algorithm but does not exist in the gold standard annotation, it is counted as a false positive (FP). For the mention-level span-and-certainty evaluation, a certainty match is additionally required when computing TP, FN, and FP. Precision, recall, and F1-score are then calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = (2 × Precision × Recall) / (Precision + Recall)

We further manually analyzed errors from the multi-site algorithm's mention-level evaluation without certainty.
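A minimal Python sketch of this mention-level matching logic follows, assuming mentions are dictionaries carrying concept, span, and certainty fields (field names illustrative); it implements the TP/FP/FN rules above, not the study's actual evaluation harness.

def overlaps(a_start, a_end, b_start, b_end):
    # Two spans overlap if neither ends before the other starts.
    return a_start < b_end and b_start < a_end

def confusion_counts(gold, predicted, use_certainty=False):
    """Mention-level TP/FP/FN. A prediction is a true positive when its
    span overlaps an unmatched gold mention of the same concept type
    (and, if use_certainty is set, the same certainty value); remaining
    predictions are false positives, and unmatched gold mentions are
    false negatives."""
    matched = set()
    tp = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i in matched:
                continue
            if (g["concept"] == p["concept"]
                    and overlaps(g["start"], g["end"], p["start"], p["end"])
                    and (not use_certainty
                         or g["certainty"] == p["certainty"])):
                matched.add(i)
                tp += 1
                break
    fp = len(predicted) - tp
    fn = len(gold) - tp
    return tp, fp, fn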
References

The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future.
The Patient-Centered Outcomes Research Institute (PCORI) national priorities for research and initial research agenda.
The "All of Us" research program.
Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers.
Characterizing treatment pathways at scale using the OHDSI network.
Drawing reproducible conclusions from observational clinical data with OHDSI. Yearbook of Medical Informatics.
Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance.
Data from clinical notes: a perspective on the tension between structure and flexible documentation.
Transcription and EHRs: benefits of a blended approach.
Artificial Intelligence and the Future of Primary Care: Exploratory Qualitative Study of UK General Practitioners' Views.
Clinical concept extraction: a methodology review.
Don't parse, generate! A sequence to sequence architecture for task-oriented semantic parsing.
Epidemiology of functional seizures among adults treated at a university hospital.
Childhood respiratory illness presentation and service utilisation in primary care: a six-year cohort study in Wellington, New Zealand, using natural language processing (NLP) software.
Cardiovascular Outcome Risks in Patients With Erectile Dysfunction Co-Prescribed a Phosphodiesterase Type 5 Inhibitor (PDE5i) and a Nitrate: A Retrospective Observational Study Using Electronic Health Record Data in the United States.
Natural language processing of electronic health records is superior to billing codes to identify symptom burden in hemodialysis patients.
Automated prediction of risk for problem opioid use in a primary care setting.
Adverse and hypersensitivity reactions to prescription nonsteroidal anti-inflammatory agents in a large health care system.
From patient to patient - sharing the data from clinical trials.
Association of silent cerebrovascular disease identified using natural language processing and future ischemic stroke.
Site engagement for multi-site clinical trials.
Pathways to success for multi-site clinical data research.
Heterogeneity introduced by EHR system implementation in a de-identified data resource from 100 non-affiliated organizations.
Assessment of the impact of EHR heterogeneity for clinical research through a case study of silent brain infarction.
A replicable method for blood glucose control in critically ill patients.
Reliability of SNOMED-CT coding by three physicians using two terminology browsers.
Agreement between neuroimages and reports for natural language processing-based detection of silent brain infarcts and white matter disease.
Clinical practice variation.
Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions.
An information extraction framework for cohort identification using electronic health records.
Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation.
Challenges in defining Long COVID: striking differences across literature, Electronic Health Records, and patient-reported information. medRxiv.
Outcomes of COVID-19 in Patients With Cancer: Report From the National COVID Cohort Collaborative (N3C).
The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment.
ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports.
Docker: lightweight Linux containers for consistent development and deployment.
MedTator: a serverless annotation tool for corpus development.
Best practices of annotating clinical texts for information extraction tasks.
CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis.
The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species.
brat: a Web-based Tool for NLP-Assisted Text Annotation.
The Human Phenotype Ontology in 2021.
The Unified Medical Language System (UMLS): integrating biomedical terminology.
The medical dictionary for regulatory activities (MedDRA).

Acknowledgments. This research was possible because of the patients whose information is included within the data and the organizations and scientists who have contributed to the ongoing development of this