key: cord-0846070-lbxj9k7p authors: Bauer, Denis C.; Metke‐Jimenez, Alejandro; Maurer‐Stroh, Sebastian; Tiruvayipati, Suma; Wilson, Laurence O. W.; Jain, Yatish; Perrin, Amandine; Ebrill, Kate; Hansen, David P.; Vasan, Seshadri S. title: Interoperable medical data: The missing link for understanding COVID‐19 date: 2021-01-29 journal: Transbound Emerg Dis DOI: 10.1111/tbed.13892 sha: a3a0ffb20351862cfae54a9e7da0f12ad37f77d5 doc_id: 846070 cord_uid: lbxj9k7p Being able to link clinical outcomes to SARS‐CoV‐2 virus strains is a critical component of understanding COVID‐19. Here, we discuss how current processes hamper the sustainable data collection needed for meaningful analysis and insights. Following the 'Fast Healthcare Interoperability Resources' (FHIR) implementation guide, we introduce an ontology‐based standard questionnaire to overcome these shortcomings and describe patient 'journeys' in coordination with the World Health Organization's recommendations. We identify the steps in the clinical health data acquisition cycle and workflows that are likely to have the biggest impact on the data‐driven understanding of this virus. Specifically, we recommend capturing detailed symptoms and medical history using the FHIR standards. We have taken the first steps towards this by making patient status mandatory in GISAID ('Global Initiative on Sharing All Influenza Data'), immediately resulting in a measurable increase in the fraction of cases with useful patient information. The main remaining limitation is the lack of controlled vocabulary or a medical ontology.
Being able to link clinical outcomes to virus strains is a critical component of understanding COVID-19; however, current data collection practices hamper such analyses and need updating so that the data collected can support robust insights. GISAID, established originally as the Global Initiative on Sharing All Influenza Data (Elbe & Buckland-Merrett, 2017), has widened its remit with the EpiCoV™ database to become the principal platform for sharing genomic sequences of SARS-CoV-2 (hCoV-19) from around the world. Such convergence by the global scientific community around a single database is critical to permit a near-real-time analysis of how the virus is evolving. While currently only about 1 in 258 confirmed cases (Worldometers Coronavirus, n.d.) has its virus sequence submitted (36,080,088 COVID-19 cases and 139,967 published SARS-CoV-2 sequences as of 1 October 2020), this represents the most thorough surveillance of an emerging virus outbreak in history (Massive coronavirus sequencing efforts urgently need patient data - Nature India, 2020). It is therefore critical to supplement the collected information on the virus genomes with the other component that informs patient outcome: medical information. Such de-identified patient data would provide the missing information that enables the evolution of the virus to be linked to its host's clinical factors.
For example, several studies have suggested the emergence of virus isolates associated with greater in vitro titres and cytopathic effects (Yao et al., 2020); greater infectivity (Korber et al., 2020); greater transmissibility (McAuley et al., 2020); and similar (Zhang et al., 2020) or attenuated (Su et al., 2020) phenotypes with consequent outcomes. Such observed variations, especially in disease severity and phenotype, may be attributable to genomic evolution and adaptation to the new human host. However, current analyses are confounded by factors such as co-morbidities, the capacity of the healthcare system in terms of diagnostic testing, treatment choices, and the reporting of severity and fatality, making it impossible to robustly link patient outcome to genomic changes in the virus. This limits studies to being merely observational, either reporting genomic differences of the virus (Bauer et al., 2020) or inferring pathogenicity from cell culture measurements such as replication rate (Yao et al., 2020) and cell toxicity (Chu et al., 2020). While such in silico and in vitro studies are insightful, they are not a reliable predictor of disease severity in vivo. Recognizing the need for clinical data, GISAID enables 'patient status' to be recorded for each submitted isolate and made this field mandatory as of 27 April 2020. Two snapshots were taken to assess the uptake of this feature. One month after the change (15 May 2020), only 3% of submitted isolates carried relevant information in this field: about 10% (506/5122) had the field filled in, and of these only a third (164) provided clinical information (Figure 1a). At the 6-month mark (1 October 2020), this had increased to 13% of entries with data other than 'unknown' (15,907/125,654); however, the usefulness of these data remains variable (Figure 1b).
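The proportions quoted above can be recomputed directly from the counts given in the text. The short Python sketch below does only that arithmetic; the variable and function names are ours and are not part of any GISAID interface.

```python
# Counts quoted in the text for the GISAID 'patient status' field.
MAY_TOTAL = 5122          # isolates in the 15 May 2020 snapshot
MAY_FILLED = 506          # isolates with the field filled in
MAY_CLINICAL = 164        # of those, entries with clinical information
OCT_TOTAL = 125_654       # isolates in the 1 October 2020 snapshot
OCT_INFORMATIVE = 15_907  # entries with data other than 'unknown'


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


may_filled_pct = percent(MAY_FILLED, MAY_TOTAL)
may_clinical_pct = percent(MAY_CLINICAL, MAY_TOTAL)
clinical_of_filled_pct = percent(MAY_CLINICAL, MAY_FILLED)
oct_informative_pct = percent(OCT_INFORMATIVE, OCT_TOTAL)

print(f"May: field filled in            {may_filled_pct:.1f}%")
print(f"May: clinical info (of all)     {may_clinical_pct:.1f}%")
print(f"May: clinical info (of filled)  {clinical_of_filled_pct:.1f}%")
print(f"Oct: other than 'unknown'       {oct_informative_pct:.1f}%")
```

Keeping the raw counts next to the derived percentages makes it easy to repeat the comparison when a later snapshot becomes available.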
The word clouds highlight that 'unknown' remains the largest fraction and that the free-text field gives rise to a wide range of different descriptions identifying the same status. There are hence two areas where current processes hamper sustainable and meaningful data collection. Firstly, information is currently not captured in a standardized form that is tailored to COVID-19 infections; secondly, patient information is frequently not available when genomic information is submitted, and workflows are not set up to amend entries retrospectively. Data that are collected and submitted to a central repository such as GISAID likely come from multiple sources, and consequently with a wide range of digital-readiness levels. For example, they might be extracted from Electronic Medical Records (EMRs), where the data are already in a structured form. However, the relevant information may first need to be extracted from digital or paper-based clinical notes. In the latter case, the same clinical symptom might be described differently, complicating downstream reporting or grouping of records. Hence, it is important to convert clinical observations into standardized terms, so-called clinical terminologies, that are applicable across the world. Figure 2 illustrates this problem using the concept 'loss of sense of smell', which has several synonyms, such as 'anosmia' and 'absent smell', but is represented as a single concept in the 'SNOMED CT' (Systematized Nomenclature of Medicine Clinical Terms) terminology. While the progression towards EMRs is a much larger, multilayer problem that cannot be addressed quickly amid a pandemic, the mode of primary data collection into the central repository can be controlled by introducing standardized fields that implement standardized terminologies.
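To illustrate why standardized terms simplify the grouping of records, the sketch below maps the three synonyms named above onto one concept before counting. It is a minimal illustration, not a terminology service: the concept identifier is a placeholder rather than a verified SNOMED CT code, and the lookup table and function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Placeholder identifier: NOT a verified SNOMED CT code.
LOSS_OF_SMELL = "placeholder:loss-of-sense-of-smell"

# Hypothetical lookup table built from the synonyms named in the text.
SYNONYM_TO_CONCEPT = {
    "loss of sense of smell": LOSS_OF_SMELL,
    "anosmia": LOSS_OF_SMELL,
    "absent smell": LOSS_OF_SMELL,
}


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def to_concept(free_text: str) -> Optional[str]:
    """Return the concept for a free-text entry, or None if it is unmapped."""
    return SYNONYM_TO_CONCEPT.get(normalise(free_text))


def count_concepts(entries: Iterable[str]) -> Counter:
    """Count entries per concept; unmapped entries are pooled together."""
    counts: Counter = Counter()
    for entry in entries:
        concept = to_concept(entry)
        counts[concept if concept is not None else "unmapped"] += 1
    return counts


entries = ["Anosmia", "loss of  sense of smell", "absent smell", "cough"]
counts = count_concepts(entries)
print(counts)
```

A hand-written table like this does not scale; in practice the mapping would come from a terminology such as SNOMED CT, where the synonyms are already attached to the concept.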
This would ensure that researchers have a computable set of data on which to build robust statistical methodologies and artificial intelligence-based analyses, gaining insights from genomic and clinical data. However, there are several clinical terminologies, such as the Systematized Nomenclature of Medicine (SNOMED) and the International Classification of Diseases (ICD). SNOMED CT is the most comprehensive multilingual health terminology in the world, while ICD is a classification specializing in disease description. The main difference between them is that SNOMED CT is much more detailed and can be used to capture fine-grained clinical information, while ICD is primarily a classification designed for reporting. In addition to clinical terminologies, a standard that defines which clinical data should be collected is also needed. For example, in this case it is useful to capture symptoms, risk factors and complications, among others. This is usually referred to as the information model. The new 'Health Level Seven' (HL7) standard called 'Fast Healthcare Interoperability Resources' (FHIR) stands out as the best choice, given its substantial uptake and excellent support for clinical terminologies. There are multiple efforts that currently aim to define the minimal COVID-19-relevant clinical data. The World Health Organization (WHO) has developed a case-based reporting form and data dictionary, as well as interim guidance to clinicians regarding case definitions and clinical syndromes associated with COVID-19 (Table 1). Although the WHO's forms are more likely to be accepted by clinical teams around the world, they do not capture clinical symptoms and outcomes in detail; for example, there is only a field indicating whether the patient showed symptoms, but not which symptoms. Similarly, clinical course and outcomes are captured in little detail.
Keywords: COVID-19, genome sequence, GISAID, ontology, patient information, SARS-CoV-2
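The difference between a single 'had symptoms' field and an information model that records which symptoms were present can be made concrete with a small data structure. The sketch below is an illustration under our own assumptions and is not the FHIR information model: the class and field names are hypothetical and the code value is a placeholder, although `http://snomed.info/sct` is the URI that FHIR uses to identify SNOMED CT.

```python
from dataclasses import dataclass, field
from typing import List

SNOMED_SYSTEM = "http://snomed.info/sct"  # URI FHIR uses for SNOMED CT


@dataclass(frozen=True)
class CodedConcept:
    """A term drawn from a clinical terminology."""
    system: str
    code: str      # placeholder values below, not verified codes
    display: str


@dataclass
class CaseRecord:
    """Hypothetical minimal record linking an isolate to clinical data."""
    isolate_id: str
    symptoms: List[CodedConcept] = field(default_factory=list)
    risk_factors: List[CodedConcept] = field(default_factory=list)
    complications: List[CodedConcept] = field(default_factory=list)

    @property
    def symptomatic(self) -> bool:
        """The yes/no flag is derivable from the detailed list."""
        return len(self.symptoms) > 0


record = CaseRecord(
    isolate_id="example-isolate-1",
    symptoms=[
        CodedConcept(SNOMED_SYSTEM, "PLACEHOLDER-1", "Loss of sense of smell"),
    ],
)
print(record.symptomatic, [s.display for s in record.symptoms])
```

The design point is the asymmetry: the boolean flag can always be computed from the coded list, but the list cannot be recovered from the flag.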
FIGURE 1 Word cloud of GISAID 'patient status' entries, where word size represents the number of entries with this term (log10-transformed, with pseudocounts so that low-frequency terms are also visible). (a) Snapshot from 15 May 2020; (b) snapshot from 1 October 2020, after 'unknown' was made the default status when no status is provided. Actual counts are in Table S1.
[...] However, achieving international agreement on the exact thresholds for the grouping is likely difficult, especially as new evidence about the severity of individual symptoms becomes available (Menni et al., 2020). It might hence be more prudent to capture symptoms directly, the approach taken by the COVID-19 host genetics initiative (The COVID-19 Host Genetics Initiative, 2020).
FIGURE 3 Minimal common outcome measure as compiled by WHO. Figure reproduced [...]
Using existing technology and incorporating the guidelines for COVID-19 symptoms and severity discussed above, we built an example FHIR Implementation Guide (FHIR IG) and implemented it as a FHIR questionnaire (see Table 1). This allows the flexible collection of relevant terms for a specific use case and allows them to be expressed as an input form for data collection, for example into GISAID. Unlike the FHIR IG from Logica, which focuses on patient care, patient screening, public health reporting and general research, we designed the questionnaire (fields and values) for the specific use case of linking genomic data with clinical outcomes. The FHIR IG captures the following types of information:
• Demographic information, such as the age and gender of the [...]
The FHIR IG provides a set of standard terms from the SNOMED CT clinical terminology in the form of value sets. These are available in the documentation as well as programmatically from a clinical terminology service. Advice around the design of a user interface is also provided, with an example implementation of the form used to collect the information shown in Figure 4D.
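As a rough illustration of what a questionnaire bound to value sets looks like, the sketch below builds a small FHIR R4 `Questionnaire` resource as a Python dictionary and runs a few structural checks on it. It is not the questionnaire from the implementation guide described above: the items, link identifiers and `example.org` value set URLs are hypothetical, and only the element names (`resourceType`, `status`, `item`, `linkId`, `type`, `repeats`, `answerValueSet`) come from the FHIR R4 specification.

```python
import json
from typing import Any, Dict, List

# Hypothetical value set URLs; a real form would point at published value sets.
SYMPTOM_VALUE_SET = "http://example.org/fhir/ValueSet/covid19-symptoms"
STATUS_VALUE_SET = "http://example.org/fhir/ValueSet/covid19-patient-status"

questionnaire: Dict[str, Any] = {
    "resourceType": "Questionnaire",
    "status": "draft",
    "title": "Example clinical data form for a submitted isolate",
    "item": [
        {"linkId": "age", "text": "Age in years", "type": "integer"},
        {
            "linkId": "symptoms",
            "text": "Symptoms observed",
            "type": "choice",
            "repeats": True,  # more than one symptom may be selected
            "answerValueSet": SYMPTOM_VALUE_SET,
        },
        {
            "linkId": "patient-status",
            "text": "Patient status",
            "type": "choice",
            "answerValueSet": STATUS_VALUE_SET,
        },
    ],
}


def find_problems(resource: Dict[str, Any]) -> List[str]:
    """Run a few structural checks; this is not full FHIR validation."""
    problems: List[str] = []
    seen = set()
    for item in resource.get("item", []):
        link_id = item.get("linkId")
        if not link_id:
            problems.append("item without linkId")
        elif link_id in seen:
            problems.append(f"duplicate linkId: {link_id}")
        seen.add(link_id)
        if "type" not in item:
            problems.append(f"item {link_id!r} has no type")
        # In R4, a value set binding is only allowed on choice-like items.
        if "answerValueSet" in item and item.get("type") not in (
            "choice",
            "open-choice",
        ):
            problems.append(f"item {link_id!r} binds a value set to a non-choice type")
    return problems


problems = find_problems(questionnaire)
print(json.dumps(questionnaire, indent=2))
print("problems:", problems)
```

Binding the symptom and status items to value sets, rather than leaving them as free text, is what removes the spread of spellings seen in the word clouds.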
The FHIR IG provides the guidance needed to build different approaches to data collection. For example, one approach might be to use data extracted from an Electronic Medical Record (EMR) system or a research Electronic Data Capture (EDC) system like REDCap (Harris et al., 2019) for sharing with an organisation such as GISAID. There are existing tools that can be used to facilitate this transformation (Metke-Jimenez & Hansen, 2019). Alternatively, a dedicated cloud-based web form can be built to capture data and store them in a cloud-based FHIR repository for later analyses. The value sets developed for the different fields in the clinical entry form can be browsed using a terminology browser. Figure 5 shows the symptom value set in the CSIRO Shrimp browser, a front end for CSIRO's terminology server Ontoserver (Metke-Jimenez et al., 2018). While GISAID enables updates to submitted entries as more patient data become available, updating a submitted entry with clinical information is currently not a widespread practice. This is in part because privacy restrictions have prevented the sharing of patient information (Dyer, 2020). While the current content of GISAID was carefully designed to preserve privacy, adding linkages to clinical databases may require a restructure even with de-identification protocols in place (Bauer et al., 2020; Massive coronavirus sequencing efforts urgently need patient data - Nature India, 2020). For example, in regions with low prevalence, the exact location in combination with height and weight can make a patient identifiable. For such a future addition, a clinical record guardian may be needed to provide access to clinical data via a tier system. [...] (Korber et al., 2020), it was discovered that the VIC31 and VIC50 isolates originate from the same patient, and it is likely that more such duplicates exist and complicate data analysis. Data consistency issues will be an even greater challenge for low-resource and developing countries.
As outlined by Banu et al., efficient contact tracing is crucial, as a single cluster can spread rapidly in densely populated countries such as India (Banu et al., 2020). This is currently hampered by a lack of detailed reporting in India, for example when the patient's home state differs from that of the submitting laboratory, which can confuse epidemiological analyses, as was recently shown to be the case (Mehrotra et al., 2020). In order to assess and detect a shift in the clinical presentation of COVID-19, de-identified patient data need to be collected in a more systematic way. We hence recommend three elements for the medical and scientific community to consider for capturing COVID-19 better: [...] Anticipating the opportunity for retrospective data intake in a more controlled fashion, GISAID has a mechanism to reach out to data submitters to update entries. As a more immediate improvement, GISAID now provides a filter that serves cleaned data, correcting and consolidating 26,838 entries (see consolidated entries as of 15 May 2020 in Table S2), which is aided by a data curation tool. These measures are valuable because the pandemic could well continue or re-emerge for some time, creating the potential for new virus strains to be linked to decreased or increased case severity and/or fatality, and potentially to affect the efficacy of vaccines and countermeasures. GISAID does offer clade/lineage and variant information to facilitate genotype-phenotype analyses. Gaining experience in controlled data collection increases our preparedness for future 'Disease X' outbreaks and pandemics, and enables better support of research work on other infectious diseases such as influenza and respiratory syncytial virus. ST is supported by a grant awarded to Timothy Barkham and Swaine Chen by the Temasek Foundation and by the Genome Institute of Singapore; ST and SMS are supported by the Agency for Science, Technology and Research (A*STAR).
AP's work on the automated meta-data curation tool is supported by Institut Pasteur with feedback from its EpiCoV™ data curation team aiding GISAID. [...] Preparedness Innovations (CEPI). The authors declare that there are no competing interests. DCB, SSV and DPH conceived the paper. ST and AP structured the data. AM-J, LOWW and YJ conducted the analysis. DCB, SM-S, KE, DPH and SSV wrote the paper. All authors reviewed and finalized the document. Not applicable. Not applicable.
Denis C. Bauer https://orcid.org/0000-0001-8033-9810
Seshadri S. Vasan https://orcid.org/0000-0002-7326-3210
A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates
Supporting pandemic response using genomics and bioinformatics: a case study on the emergent SARS-CoV-2 outbreak
Comparative tropism, replication kinetics, and cell damage profiling of SARS-CoV-2 and SARS-CoV with implications for clinical manifestations, transmissibility, and laboratory studies of COVID-19: An observational study
Covid-19: Rules on sharing confidential patient information are relaxed in England
Data, disease and diplomacy: GISAID's innovative contribution to global health
The REDCap consortium: Building an international community of software platform partners
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2
Massive coronavirus sequencing efforts urgently need patient data - Nature India
Experimental and in silico evidence suggests vaccines are unlikely to be affected by D614G mutation in SARS-CoV-2 spike protein
'Unassigned' coronavirus cases near 3,000, rise as curbs on movement lifted. The Indian Express
Real-time tracking of self-reported symptoms to predict potential COVID-19
FHIRCap: Transforming REDCap forms into FHIR resources
Ontoserver: A syndicated terminology server
Discovery of a 382-nt deletion during the early evolution of SARS-CoV-2. mBio
The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
A minimal common outcome measure set for COVID-19 clinical research. The Lancet Infectious Diseases
Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo
Viral and host factors related to the clinical outcome of COVID-19
Additional supporting information may be found online in the Supporting Information section.