key: cord-0795269-jvc1e8cq
authors: Sass, J.; Bartschke, A.; Lehne, M.; Essenwanger, A.; Rinaldi, E.; Rudolph, S.; Heitmann, K. U.; Vehreschild, J. J.; von Kalle, C.; Thun, S.
title: The German Corona Consensus Dataset (GECCO): A standardized dataset for COVID-19 research
date: 2020-07-29
journal: nan
DOI: 10.1101/2020.07.27.20162636
sha: 6d816b09e6874d9f94f201818713eac728eb09e9
doc_id: 795269
cord_uid: jvc1e8cq

Background: The current COVID-19 pandemic has led to a surge of research activity. While this research provides important insights, the multitude of studies results in an increasing segmentation of information. To ensure comparability across projects and institutions, standard datasets are needed. Here, we introduce the "German Corona Consensus Dataset" (GECCO), a uniform dataset that uses international terminologies and health IT standards to improve interoperability of COVID-19 data.
Methods: Based on previous work (e.g., the ISARIC-WHO COVID-19 case report form) and in coordination with experts from university hospitals, professional associations and research initiatives, data elements relevant for COVID-19 research were collected, prioritized and consolidated into a compact core dataset. The dataset was mapped to international terminologies, and the Fast Healthcare Interoperability Resources (FHIR) standard was used to define interoperable, machine-readable data formats.
Results: A core dataset consisting of 81 data elements with 281 response options was defined, including information about, for example, demography, anamnesis, symptoms, therapy, medications or laboratory values of COVID-19 patients. Data elements and response options were mapped to SNOMED CT, LOINC, UCUM, ICD-10-GM and ATC, and FHIR profiles for interoperable data exchange were defined.
Conclusion: GECCO provides a compact, interoperable dataset that can help to make COVID-19 research data more comparable across studies and institutions.
The dataset will be further refined in the future by adding domain-specific extension modules for more specialized use cases.

In December 2019, the first reports of a cluster of 41 patients infected by a novel coronavirus emerged from Wuhan, China. [1] Within a few months, the new virus, subsequently named "severe acute respiratory syndrome coronavirus 2" (SARS-CoV-2), spread around the world, causing the global COVID-19 pandemic. As of July 1, 2020, SARS-CoV-2 had infected more than 10 million people and killed more than half a million worldwide. [2] The pandemic has spurred intensive scientific research, including numerous regional, national and international epidemiological surveys and studies. [3-7] While this research provides important new insights, the multitude of studies threatens to generate a dangerous segmentation of information. This could delay or even prevent the acquisition of urgently needed scientific knowledge about SARS-CoV-2 and COVID-19.

To avoid this segmentation of information and make COVID-19 data more comparable and exchangeable across studies and institutions, interoperable datasets are needed. Various initiatives have started to define uniform datasets and Common Data Elements (CDEs) for the collection of information about COVID-19. For example, questionnaires and case report forms (CRFs) have been developed to collect data about COVID-19 patients in a standardized way. [5, 8, 9] While the CDEs defined in these projects are an important step, they are not enough to ensure interoperability. To make data syntactically and semantically interoperable, data elements also have to be embedded in standard data structures that can be exchanged across IT systems, and they have to use common terminologies that unambiguously define the meaning of clinical concepts.
To improve interoperability of COVID-19 data, we developed the German Corona Consensus Dataset (GECCO), which uses international health IT standards and terminologies for interoperable data exchange. GECCO defines a compact set of data elements to be collected in COVID-19 studies and was developed within the National Research Network of University Medicine on COVID-19 ("Nationales Forschungsnetzwerk [NFN] der Universitätsmedizin zu COVID-19") funded by the German Federal Ministry of Education and Research (BMBF). The following paper provides an overview of the GECCO dataset and its development.

This version of the preprint was posted on medRxiv on July 29, 2020 (https://doi.org/10.1101/2020.07.27.20162636) and was not certified by peer review. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

An initial dataset was compiled as a working basis by merging data elements and response options of the following projects: the ISARIC-WHO CRF [8]; the Pa-COVID-19 study [10], which investigates the pathophysiology of COVID-19 in a prospective patient cohort; and the LEOSS case registry [3], a clinical patient registry for patients infected with SARS-CoV-2 initiated by the ESCMID Emerging Infections Task Force (EITaF), the German Center for Infection Research (DZIF) and the German Society for Infectiology (DGI). This draft dataset was saved in a spreadsheet and sent to members of an expert board for comment and proposal of additional data elements. The expert board was composed of health professionals from German university hospitals, professional associations and other relevant organizations. New data elements proposed by the expert board were added to the dataset for subsequent prioritization. For the prioritization, the experts were asked to assign a priority value to each data element of the dataset.
Priorities were indicated on a 5-level scale that was loosely based on the NIH model for CDEs [11] (Table 1). From the data elements with the highest prioritizations, a preliminary core dataset with roughly 100 data elements was compiled (this size was chosen to include as many relevant data elements as possible, while keeping the dataset manageable and practical). This core dataset was then reviewed by an editorial team of seven experts from different disciplines. In consensual decisions, data elements not considered necessary for the core dataset were discarded; conversely, data elements that were considered highly important but had not yet been included in the core dataset were added. The final data elements of the core dataset were grouped into meaningful categories (e.g., demographics, symptoms or medication). Figure 1 shows the workflow of consensus building and dataset definition.

[...] element descriptions, use case scenarios, value sets and Health Level 7 (HL7) templates and profiles.
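
The numeric preselection step of the prioritization described above can be illustrated with a minimal sketch. The paper does not state how the individual expert ratings were aggregated, so the mean used here is an assumption, as are the direction of the scale (higher value means higher priority), the element names, the ratings and the function name `select_core`. The subsequent consensus review by the editorial team is not modeled.

```python
from statistics import mean

# Hypothetical ratings: data element -> priority values given by three experts.
# ASSUMPTION: higher value = higher priority (the real scale is defined in Table 1).
ratings = {
    "age": [5, 5, 4],
    "body temperature": [5, 4, 4],
    "travel history": [3, 2, 3],
    "pet ownership": [1, 1, 2],
}


def select_core(ratings, max_elements):
    """Rank elements by mean expert priority and keep the top `max_elements`.

    ASSUMPTION: aggregation by mean; the paper does not specify the rule.
    """
    ranked = sorted(ratings, key=lambda name: mean(ratings[name]), reverse=True)
    return ranked[:max_elements]


# The paper used a target size of roughly 100 elements; 2 is used here
# only because the toy input has four elements.
core = select_core(ratings, max_elements=2)
print(core)
```

In such a scheme, the size limit plays the role of the "roughly 100 data elements" target: it caps the preliminary list before experts add or discard elements by consensus.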
To define interoperable formats for data exchange, the HL7 standard "Fast Healthcare Interoperability Resources" (FHIR) [18] was used. FHIR builds on a set of "resources", which provide generic data structures for common healthcare concepts, such as Patient, Practitioner, Observation, Medication or Condition. From these resources, more specific data structure definitions, so-called "profiles", can be derived, which allow for interoperable data exchange across health IT systems. To ensure interoperability, care was taken to build on previous work [...]

Combining the initial draft dataset and the additional proposals from the expert board, 702 potentially relevant data elements were collected. From these data elements and based on the prioritization of the expert board, the editorial team compiled a core dataset consisting of 81 elements with 281 response options. These data elements were grouped into the following categories: anamnesis / risk factors (n = 16); imaging (n = 2); demographics (n = 7); epidemiological factors (n = 1); complications (n = 1); onset of illness / admission (n = 1); laboratory values (n = 25); medication (n = 4); outcome at discharge (n = 3); study enrollment / inclusion criteria (n = 2); symptoms (n = 2); therapy (n = 6); vital signs (n = 11) (Figure 2).

During the consolidation process, it became clear that some data elements are important for certain disciplines but irrelevant for others. These elements were not included in the core dataset as they would have inflated the size of the dataset. The editorial team decided to include these data elements in domain-specific extension modules, which will be specified in more detail at later stages of the project.

In this report, we presented the GECCO dataset, a core collection of data elements for acquiring and exchanging information about COVID-19 patients. By using standardized data structures (HL7 FHIR profiles) and international terminologies, the GECCO dataset is an important step towards interoperability of COVID-19 research data. It can facilitate harmonized data collection and analysis across institutions and IT systems, for example in clinical studies, registries or digital health applications.

A key factor in the successful application of standard datasets like GECCO is a close collaboration with the scientific community. To ensure a high acceptance of the dataset, the [...]

Although the GECCO dataset was designed to be as compact and manageable as possible, acquiring and recording the information for all data elements still requires time (for example, when entering the information in an electronic case report form). Moreover, manual documentation is prone to transcription errors. Conversely, manually abstracted and structured information from unstructured health records may provide relevant insights for care-providers and improve their understanding of risk and outcome.
For some of the data items, it is therefore desirable to automatically exchange data between a GECCO-based study database and existing IT systems, such as hospital information systems or clinical trial software. This requires standard interfaces between these systems. The FHIR profiles of the GECCO dataset provide an interoperable, machine-readable data structure that can facilitate this data exchange across IT systems.

Scientific knowledge about COVID-19 and SARS-CoV-2 is changing fast, which may necessitate modifications to the GECCO dataset in the future. To incorporate new knowledge into the dataset, the NFN will put a governance framework in place that will coordinate revisions and extensions to the dataset. Domain-specific extension modules are already in preparation. The extension modules currently planned are: laboratory, diagnostics, immunology, gynecology and pregnancy, epidemiology, pediatrics, intensive care, oncology, radiology, virology, psychiatry and neurology (these extension modules are also accessible on the ART-DECOR platform).

The GECCO dataset provides researchers and healthcare professionals with a compact, interoperable dataset for collecting, exchanging and analyzing COVID-19 data across institutions and software systems. Developed by a multidisciplinary group of experts, GECCO builds heavily on international terminologies and IT standards. GECCO can thus help to improve the harmonization and coordination of research efforts to successfully fight the COVID-19 pandemic. Future inclusion of domain-specific extension modules will further expand the use of the GECCO dataset.
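
To illustrate the kind of machine-readable structure that FHIR-based exchange relies on, the following minimal sketch builds a generic FHIR Observation for a single body temperature measurement, coded with LOINC and UCUM. It is not one of the published GECCO profiles and is not validated against them; the choice of body temperature as the example, the function name, the patient identifier and the measured value are assumptions made for illustration only.

```python
import json


def body_temperature_observation(patient_id, celsius, effective):
    """Build a plain FHIR Observation (as a dict) for one temperature reading.

    ASSUMPTION: generic FHIR Observation layout, not the GECCO profile itself.
    """
    return {
        "resourceType": "Observation",
        "status": "final",
        # LOINC identifies what was measured (8310-5 = body temperature).
        "code": {
            "coding": [
                {
                    "system": "http://loinc.org",
                    "code": "8310-5",
                    "display": "Body temperature",
                }
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": effective,
        # UCUM identifies the unit of the measured value (Cel = degree Celsius).
        "valueQuantity": {
            "value": celsius,
            "unit": "Cel",
            "system": "http://unitsofmeasure.org",
            "code": "Cel",
        },
    }


# Hypothetical patient identifier and measurement.
obs = body_temperature_observation("example-1", 38.4, "2020-07-01T08:30:00+02:00")
print(json.dumps(obs, indent=2))
```

Because the concept and the unit are expressed as codes from shared terminologies rather than as free text, a receiving system can interpret the value without knowing the local conventions of the sending system.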
We thank the members of the expert board and editorial team for their help with the development of the GECCO dataset.

Data elements, response options and value sets of the GECCO dataset can be accessed at https://art-decor.org/art-decor/decor-datasets--covid19f-. FHIR profiles are available at https://simplifier.net/ForschungsnetzCovid-19.

Disease outbreak news: Update 12
Johns Hopkins Coronavirus Resource Center | COVID-19 Map
Lean European Open Survey on SARS-CoV-2 Infected Patients
GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany. GESIS Datenarchiv, Köln
WHO tool for behavioural insights on COVID-19
UK Covid-19 Questionnaire
Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study
ISARIC | Clinical Data Collection - The COVID-19 Case Report Forms (CRFs)
Centers for Disease Control and Prevention (CDC) | Human Infection with 2019 Novel Coronavirus Person Under Investigation (PUI) and Case Report Form
Studying the pathophysiology of coronavirus disease 2019 - a protocol for the Berlin prospective COVID-19 patient cohort (Pa-COVID-19). medRxiv, 11.
National Institutes of Health (NIH) | Classifications of Data Elements for a Particular Disease
Bundesinstitut für Arzneimittel und Medizinprodukte (BfArM) | ICD-10-GM
LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-Year Update
WHO Collaborating Centre for Drug Statistics Methodology | International language for drug utilization research
Logica Implementation Guide: Covid-19