key: cord-0969049-9sn55lam
authors: Pedrera Jiménez, Miguel; García Barrio, Noelia; Cruz Rojo, Jaime; Isabel Terriza Torres, Ana; Ana López Jiménez, Elena; Calvo Boyero, Fernando; Jesús Jiménez Cerezo, María; Javier Blanco Martínez, Alvar; Roig Domínguez, Gustavo; Luis Cruz Bermúdez, Juan; Luis Bernal Sobrino, José; Serrano Balazote, Pablo; Muñoz Carrero, Adolfo
title: Obtaining EHR-derived datasets for COVID-19 research within a short time: a flexible methodology based on Detailed Clinical Models
date: 2021-02-03
journal: J Biomed Inform
DOI: 10.1016/j.jbi.2021.103697
sha: 35e707b1a7eb5ea0bb21cac0f730bd1cb33a2074
doc_id: 969049
cord_uid: 9sn55lam

Background: COVID-19 ranks as the single largest health incident worldwide in decades. In such a scenario, electronic health records (EHRs) should provide a timely response to healthcare needs and to data uses that go beyond direct medical care, known as secondary uses, which include biomedical research. However, each data analysis initiative usually defines its own information model in line with its requirements. These specifications share clinical concepts but differ in format and recording criteria, which creates data entry redundancy across multiple electronic data capture systems (EDCs), with the consequent investment of effort and time by the organization.

Objective: This study sought to design and implement a flexible methodology based on detailed clinical models (DCM) that would enable EHRs generated in a tertiary hospital to be effectively reused without loss of meaning and within a short time.

Material and methods: The proposed methodology comprises four stages: (1) specification of an initial set of relevant variables for COVID-19; (2) modeling and formalization of clinical concepts using the ISO 13606 standard and the SNOMED CT and LOINC terminologies; (3) definition of transformation rules to generate secondary use models from standardized EHRs, and their development in the R language; and (4) implementation and validation of the methodology through the generation of the International Severe Acute Respiratory and emerging Infection Consortium (ISARIC-WHO) COVID-19 case report form. This process was implemented in a 1,300-bed tertiary hospital for a cohort of 4,489 patients hospitalized from 25 February 2020 to 10 September 2020.

Results: An initial and expandable set of relevant concepts for COVID-19 was identified, modeled and formalized using the ISO 13606 standard and the SNOMED CT and LOINC terminologies. Similarly, an algorithm was designed and implemented in R and then applied to process EHRs in accordance with the standardized concepts and transform them into secondary use models. Lastly, these resources were applied to obtain a data extract conforming to the ISARIC-WHO COVID-19 case report form, without requiring manual data collection. The methodology made it possible to obtain the observation domain of this model with a coverage of over 85% of patients for the majority of concepts.

Conclusion: This study has furnished a solution to the difficulty of rapidly and efficiently obtaining EHR-derived data for secondary use in COVID-19, capable of adapting to changes in data specifications and applicable to other organizations and other health conditions. The conclusion to be drawn from this initial validation is that this DCM-based methodology allows the effective reuse of EHRs generated in a tertiary hospital during the COVID-19 pandemic, with no additional effort or time for the organization and with a greater data scope than that yielded by the conventional manual data collection process in ad-hoc EDCs.

COVID-19 ranks as the single largest health incident worldwide in decades [1, 2], registering over 27,486,960 confirmed cases and 894,983 related deaths around the globe up to 9 September 2020 [3]. This study was undertaken at the Hospital Universitario 12 de Octubre [4], a 1,300-bed tertiary hospital situated in the Madrid Region (Spain), where 156,026 confirmed cases and 8,817 deaths had been recorded as of 10 September 2020 [5]. During the pandemic, average length of stay at this hospital increased by around 15%.

Likewise, the burden of managing COVID-19 patients rose to the point of saturating healthcare resources. In such a scenario, electronic health records (EHRs) should provide a timely response to healthcare needs (decision-making, whether for clinical or for resource-planning purposes) [6] [7] [8], without generating errors [9]. These needs also extend to data uses that go beyond direct medical care, known as secondary uses, which include biomedical research [10]. Each data analysis initiative usually defines its own information model in line with its data requirements [11]. Although they share clinical concepts, these models differ in format and recording criteria, which creates data entry redundancy across multiple electronic data capture systems (EDCs). Moreover, with a new disease such as COVID-19, data are needed within a short time and data specifications change constantly as research advances. To overcome these issues, an innovative methodology that incorporates semantics into the process of reusing routine healthcare data must be defined and implemented [12]. In this way, EHRs can be reused for multiple purposes in a brief period and adapted to changes in data specifications, while maintaining their original meaning and an acceptable quality.

Nevertheless, current health information systems incorporate data semantics very poorly, which hinders their combination and reuse. This is because they are "single-level" systems, in which the concept model is implicit in the data model. Advanced healthcare information systems and clinical data warehouses, such as i2b2 and OMOP [13, 14], implement a dual paradigm, which separates the data model from the concept model. This is based on the Detailed Clinical Model (DCM) paradigm [15], in which the reference model defines the set of generic components for constructing interoperable EHRs, and the archetype model formalizes concepts of the clinical domain by combining and constraining the components of the reference model [16]. Standards applying the dual model include ISO 13606 and the OpenEHR specification [17, 18], the latter having published specific resources for COVID-19 [19]. Archetypes make it possible to define terminology bindings that associate each component with standard terminologies, such as Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) and Logical Observation Identifiers Names and Codes (LOINC) [20, 21]. SNOMED CT has published several new concepts and related descriptions pertaining to COVID-19 [22], while LOINC has published a set of codes for laboratory tests for the diagnosis of this new disease [23]. In this study, a flexible methodology based on this paradigm is proposed to help resolve the difficulties that arise in the rapid and efficient collection of data for COVID-19 research; it is considered extensible to other organizations and applicable to other conditions.

The aim of this study was to design and implement a flexible methodology based on the DCM paradigm that would enable EHRs to be effectively reused for COVID-19 secondary uses, without loss of meaning and within a short time. This implies a series of particular objectives, such as:

- specifying an initial and expandable set of relevant variables for COVID-19 on which to apply the methodology;
- selecting and applying the appropriate modeling and terminological standards to the clinical concepts identified;
- defining the necessary transformation rules to generate EHR-derived models from the standard information model; and,
- implementing and validating the methodology through the generation of a data extract in accordance with a validated COVID-19 information model.

The proposed methodology should allow the representation and reuse of EHRs on any health condition, with no changes in their original meaning. It is supported by previous studies [24] [25] [26] [27] , and it comprises four stages:

1. health condition analysis and specification of relevant variables, i.e., analysis and identification of an initial and expandable set of relevant variables for healthcare and secondary use purposes;
2. modeling and formalization of the concepts of the clinical domain, i.e., making use of resources based on the DCM paradigm to model and formalize the identified concepts;
3. definition of rules to generate EHR-derived models, i.e., analysis of secondary use models and design of the rules for transforming standardized EHRs into them; and,
4. implementation and validation of the methodology, i.e., implementation of EHR registration, extraction and transformation mechanisms. For validation purposes, a secondary use model is generated, and the data coverage achieved is analyzed.

This study applies the methodology to COVID-19 in order to enhance the efficiency of data collection for the many data initiatives that have arisen around this condition. The methodology is valuable in this pandemic scenario, when data are needed urgently and reference specifications change constantly, because it provides the timing and flexibility required. This process is innovative compared with manual data entry, in which effort and time are proportional to the number of patients to be included, and changes in the secondary use model involve data re-entry.

To identify the gaps in standardization to which the methodology would be applicable, the different EHR domains in healthcare information systems were analyzed. It could be concluded that evaluations, instructions and actions had adequate modeling and standardization. Observable entities (OE), however, constituted very wide-ranging, heterogeneous sets that render reuse difficult. This is a domain where the DCM paradigm can make a major contribution, since it is essential not only to have codified value-sets, but also to ensure that each clinical domain concept, such as "Oxygen saturation" or "D-dimer", is represented formally without loss of meaning. Although in this case it was not necessary, other EHR domains would proceed the same way, with definitions of more general archetypes such as "Prescription" or "Health problem".

The requirements established for defining the initial set of variables were that it had to cover the necessary span for both patient care and secondary uses, and be parsimonious, since the data were to be recorded in healthcare practice, and it was important not to increase the health professionals' workload [28] . Thus, a work team was created in March 2020, consisting of health professionals attached to the main hospital departments tasked with the care of COVID-19 patients. A total of 58 health OE, 22 clinical and 36 laboratory-related, were identified by this group based on their clinical knowledge and scientific evidence.

During this task, the proposed methodology allowed the concept model to be expanded as the medical team identified new relevant variables for COVID-19. In the same way, because COVID-19 is a new disease, these concepts are just an initial set, expandable as understanding of the disease increases. DCM provides a real solution for extending this initially defined concept model without altering the information systems that implement it.

The modeling and formalization of concepts were performed in accordance with the ISO 13606 standard, adapted to the technical capacities of the hospital information systems. This standard was used for several reasons: (1) it defines a rigorous and stable information architecture for defining clinical domain concepts and communicating EHRs, (2) it allows clinical concepts to be added without altering the database structure, (3) it has current applications in health organizations through tools based on it [24, 29], (4) it is used by the Spanish Ministry of Health and the different Regions as the standard for the definition of exchangeable EHR extracts in the country [30], and (5) it was adopted by the Hospital for the management and governance of the clinical concepts and modeling resources [27].

Among the components of the ISO 13606 reference model, the Element is defined as "the leaf node of the EHR hierarchy, containing a single data value". Each OE defined in this study was modeled using an Entry component such as "Blood pressure", which in turn contains the Element components relating to the specific concepts associated with it: "Systolic blood pressure", "Diastolic blood pressure" and "Mean blood pressure". Lastly, the Entry component contains an Element for representing the date on which the observation was made. The ISO 13606 reference model also establishes the data types permitted, according to ISO 21090 [31]. It was necessary to use the following four to cover the requirements of this use case:

- Physical Quantity (PQ): for OE whose result is a numeric value with a unit of measurement, e.g., systolic blood pressure measured in mmHg;
- Coded Value (CV): for OE whose result is one of a set of possible coded values, e.g., the result of the SARS-CoV-2 virus detection test, which may be positive, negative or inconclusive;
- Integer: for OE whose result is an integer value, e.g., Glasgow Coma Scale score; and,
- Date Time: for OE whose value is a time point, e.g., date of initiation of smoking habit or date on which an observation was made.

Figure 2 shows the mind map relating to the archetype "Oxygen saturation" ("Saturación de oxígeno" in Spanish), composed of an Entry and two Elements: "Oxygen saturation" ("Saturación de oxígeno" in Spanish), of Physical Quantity data type, and "Observation date" ("Fecha de observación" in Spanish), of Date Time data type.

The archetype model makes use of the above-defined components to formalize the concepts of the clinical domain. On the one hand, the "definition" section specifies the components of the archetype, along with their cardinality, data type, minimum and maximum values, unit of measurement, codified value-set and other metadata. The full definition of the information model and its constraints ensures the completeness and consistency of EHR extracts [32]. On the other hand, the "ontology" section defines the terminology binding used, incorporating the semantics into the information model. The Archetype Definition Language (ADL) was employed for archetype development using LinkEHR Studio [29]. Figure 3 shows an ADL code fragment of the "Oxygen saturation" archetype.

For terminology binding, LOINC was used for laboratory OE, e.g., "[...] Unspecified specimen by NAA with probe detection", and the SNOMED CT 'observable entity' axis was employed to represent concepts of clinical OE, e.g., "103228002 |Hemoglobin saturation with oxygen (observable entity)|". Here, it was necessary to resort to the terminology extension mechanism for five concepts. This mechanism allows each SNOMED CT National Reference Center (Centro Nacional de Referencia, CNR) to publish its own concepts [34], which are then proposed for inclusion in the international edition of this terminology. Lastly, the SNOMED CT 'finding' and 'qualifier' axes were used for OE responses reporting a set of possible values, e.g., "77176002 |Smoker (finding)|" and "10828004 |Positive (qualifier value)|".
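To make the role of the data types and archetype constraints more concrete, the following is a minimal sketch in Python (the study itself used ADL archetypes, not Python). The class `ElementConstraint`, the function `is_compliant`, the 0 to 100 range for oxygen saturation and the placeholder codes for negative and inconclusive results are all assumptions introduced for illustration; only the SNOMED CT codes 103228002 and 10828004 come from the text. The 3 to 15 range is the standard range of the Glasgow Coma Scale score.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class ElementConstraint:
    """Illustrative constraint on one Element (not ISO 13606 or ADL syntax)."""
    name: str
    data_type: str                                      # "PQ", "CV", "INT" or "TS"
    code: Optional[str] = None                          # terminology binding
    unit: Optional[str] = None                          # used by PQ only
    value_range: Optional[Tuple[float, float]] = None   # used by PQ and INT
    allowed_codes: Tuple[str, ...] = ()                 # used by CV only


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_compliant(constraint: ElementConstraint, value, unit: Optional[str] = None) -> bool:
    """Return True when a recorded value satisfies the constraint."""
    low, high = constraint.value_range or (float("-inf"), float("inf"))
    if constraint.data_type == "PQ":
        return _is_number(value) and unit == constraint.unit and low <= value <= high
    if constraint.data_type == "CV":
        return value in constraint.allowed_codes
    if constraint.data_type == "INT":
        return isinstance(value, int) and not isinstance(value, bool) and low <= value <= high
    if constraint.data_type == "TS":
        return isinstance(value, datetime)
    return False


# Assumed range 0-100 %; the code is the SNOMED CT concept quoted in the text.
OXYGEN_SATURATION = ElementConstraint(
    "Oxygen saturation", "PQ", code="103228002", unit="%", value_range=(0.0, 100.0)
)
GLASGOW_COMA_SCALE = ElementConstraint(
    "Glasgow Coma Scale score", "INT", value_range=(3, 15)
)
# Only "10828004" (Positive) comes from the text; the other two are placeholders.
SARS_COV_2_RESULT = ElementConstraint(
    "SARS-CoV-2 detection test result", "CV",
    allowed_codes=("10828004", "NEGATIVE-PLACEHOLDER", "INCONCLUSIVE-PLACEHOLDER"),
)
OBSERVATION_DATE = ElementConstraint("Observation date", "TS")
```

For example, `is_compliant(OXYGEN_SATURATION, 97.0, "%")` is accepted, whereas a value of 120 or a missing unit is rejected, which mirrors the idea that a datum not compliant with its archetype is not reused.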

Firstly, secondary use models of COVID-19 were studied to quantify the coverage that could be achieved on the basis of the standard concepts defined. If a concept was not covered by the initial specification, the clinical team analyzed the utility of including it in the standard information model. The ability to expand the concept model is one of the advantages of a DCM-based methodology.

Following this, the rules to generate EHR-derived models were designed based on the format of these specifications. A total of five data operations were identified, considered applicable to any health condition:

1. Inference of specific variables from general concepts, e.g., inferring a yes/no response for an "active smoker" variable from a "smoking habit" concept that assumes "nonsmoker", "ex-smoker" and "active smoker" as possible values.
2. Transformations between coding systems, e.g., transforming a concept "10828004 |Positive (qualifier value)|" into a local code 'P'.
3. Transformations between units of measurement, e.g., transforming a variable "C-Reactive Protein" measured in "mg/dL" into a variable relating to the same concept measured in "mg/L".
4. Selection according to specific values, e.g., selecting "Oxygen saturation" with value under 92%.
5. Selection of data at a given time point, e.g., selecting "Body temperature" value on admission to hospital.
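The five operations can be sketched as small functions. This is an illustrative Python sketch, not the authors' R implementation. All function names and the record layout (dictionaries with `value` and `time` keys) are assumptions, and taking "on admission" to mean the first observation at or after the admission time is likewise an assumption.

```python
from datetime import datetime

# 1. Inference of a specific variable from a general concept.
def infer_active_smoker(smoking_habit: str) -> str:
    """Map 'nonsmoker' / 'ex-smoker' / 'active smoker' to a yes/no answer."""
    return "yes" if smoking_habit == "active smoker" else "no"

# 2. Transformation between coding systems.
# Only the Positive -> 'P' pair is taken from the text; other pairs would
# have to be defined by the target specification.
SNOMED_TO_LOCAL = {"10828004": "P"}

def to_local_code(snomed_code: str):
    """Return the local code, or None when no mapping is defined."""
    return SNOMED_TO_LOCAL.get(snomed_code)

# 3. Transformation between units of measurement (1 dL = 0.1 L).
def crp_mg_dl_to_mg_l(value_mg_dl: float) -> float:
    return value_mg_dl * 10.0

# 4. Selection according to specific values.
def select_below(observations, threshold: float):
    """Keep observations whose value is strictly under the threshold."""
    return [o for o in observations if o["value"] < threshold]

# 5. Selection of data at a given time point.
def first_at_or_after(observations, time_point: datetime):
    """Return the earliest observation at or after time_point, or None."""
    candidates = [o for o in observations if o["time"] >= time_point]
    return min(candidates, key=lambda o: o["time"]) if candidates else None
```

Because each operation is a separate function with its own parameters (threshold, time point, mapping table), adapting to another secondary use model would mostly mean changing arguments rather than writing new code, which is the design property the text describes.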

The transformation rules were documented and shared with the clinical team for review. Once they had been validated, an algorithm was developed in the R language, version 3.6.1 [35], which performed the combination of transformations needed to obtain the secondary use model. Figure 4 shows the flow chart of the algorithm developed. It works iteratively, selecting the data relating to the concepts of interest (index 'i') from each visit (index 'j') for each patient (index 'z'), and then applying the abovementioned operations to them.
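The iteration described above can be sketched as three nested loops. This is a hedged illustration in Python rather than the R code used in the study. The nested-dictionary input layout, the function name `generate_model` and the representation of a rule as a callable are assumptions made for the example.

```python
def generate_model(patients, rules):
    """Apply one rule per concept to every visit of every patient.

    patients: {patient_id: {visit_id: [ {"code": str, "value": ...}, ... ]}}
    rules:    {concept_code: callable taking the selected observations}
    Returns one output row per (patient, visit, concept) that has data.
    """
    rows = []
    for patient_id, visits in patients.items():            # index 'z'
        for visit_id, observations in visits.items():      # index 'j'
            for concept_code, rule in rules.items():       # index 'i'
                selected = [o for o in observations if o["code"] == concept_code]
                if not selected:
                    continue  # this visit has no data for the concept
                rows.append({
                    "patient": patient_id,
                    "visit": visit_id,
                    "concept": concept_code,
                    "value": rule(selected),
                })
    return rows


# Hypothetical usage: lowest oxygen saturation recorded in each visit.
example_patients = {
    "P001": {"V1": [{"code": "103228002", "value": 93.0},
                    {"code": "103228002", "value": 89.0}]},
    "P002": {"V1": []},
}
example_rules = {"103228002": lambda obs: min(o["value"] for o in obs)}
example_rows = generate_model(example_patients, example_rules)
```

In this sketch a patient without observations simply produces no row, so the cost of adding patients is computation only, not manual data entry.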

The starting point of the implementation of the methodology was the definition of the clinical archetypes in the multiple hospital information systems affected. For this purpose, the clinical concepts were identified or created in each information system and then mapped (in the system itself) from the local identifier to the standard code defined by the semantics of the archetype. Hence, data are stored following a key-value structure (observable entity-finding), in which each observation is identified in a standard and homogeneous way. This mechanism enables data to be extracted from the different systems for reuse, while maintaining their meaning unaltered and ensuring acceptable data quality: clinical archetypes are used to guarantee completeness and consistency by fully defining the information model and its constraints. Thus, if a datum is not compliant with the archetype, it is not used in the generation of the secondary use model. Figure 5 shows an EHR extract related to the "Oxygen saturation" ("Saturación de oxígeno" in Spanish) archetype, implemented in Extensible Markup Language (XML).

The transformation rules were applied to these EHR extracts to generate data files in accordance with secondary use models. To this end, different modules were designed and developed for each type of operation identified. The effort is not multiplied for each secondary use model: instead, these operations are adjusted in line with its specific requirements. This allows the generation processes to be reusable and scalable to any secondary use model. Figure 6 shows an example that selects the "Oxygen saturation" values (identified via SNOMED CT code "103228002") recorded between the start and end dates of the admission episode, and keeps only the maximum and minimum values.

Figure 6. Code in R for generating data related to the "Oxygen saturation" concept.
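As an illustration of the kind of selection that Figure 6 describes, the sketch below parses an XML extract and returns the minimum and maximum "Oxygen saturation" values within an admission episode. It is written in Python rather than R and is not the code of Figure 6. The XML layout (`extract`, `entry`, `value`, `date`), the sample data and the function name `min_max_in_episode` are assumptions, since the actual structure of the extract in Figure 5 is not reproduced in the text.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical extract layout; the real extract (Figure 5) may differ.
SAMPLE_EXTRACT = """
<extract patient="P001">
  <entry code="103228002">
    <value unit="%">91</value>
    <date>2020-03-20T08:00:00</date>
  </entry>
  <entry code="103228002">
    <value unit="%">98</value>
    <date>2020-03-22T08:00:00</date>
  </entry>
  <entry code="103228002">
    <value unit="%">85</value>
    <date>2020-04-15T08:00:00</date>
  </entry>
  <entry code="103228002">
    <value unit="mmHg">70</value>
    <date>2020-03-21T08:00:00</date>
  </entry>
</extract>
"""


def min_max_in_episode(xml_text, code, start, end, expected_unit="%"):
    """Return (min, max) of the values for `code` observed in [start, end].

    Entries with a missing value or date, an unexpected unit or a
    non-numeric value are skipped, mimicking the exclusion of data
    that do not comply with the archetype. Returns None if nothing is left.
    """
    values = []
    for entry in ET.fromstring(xml_text.strip()).iter("entry"):
        if entry.get("code") != code:
            continue
        value_node = entry.find("value")
        date_text = entry.findtext("date")
        if value_node is None or date_text is None:
            continue
        if value_node.get("unit") != expected_unit:
            continue
        try:
            value = float((value_node.text or "").strip())
        except ValueError:
            continue
        observed = datetime.fromisoformat(date_text.strip())
        if start <= observed <= end:
            values.append(value)
    return (min(values), max(values)) if values else None


episode_result = min_max_in_episode(
    SAMPLE_EXTRACT, "103228002",
    start=datetime(2020, 3, 19), end=datetime(2020, 3, 30),
)
```

With the sample data, the entry dated 15 April falls outside the episode and the entry in mmHg fails the unit check, so only the values 91 and 98 contribute to the result.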

In view of the support shown by the clinical and scientific community [36], the rapid case report form (CRF) proposed by the International Severe Acute Respiratory and emerging Infection Consortium (ISARIC-WHO) was chosen as the target secondary use model for the technical validation of the methodology [37].

Although Spain had not yet issued a COVID-19 data specification at a national level at the date of writing, such a specification could be generated in the same way with the proposed methodology. The information model designed by ISARIC-WHO for the rapid CRF defines around 200 data elements, 68 of which are OE concerning 36 concepts. It is structured in three modules: the first for hospital admission data; the second for the first day of admission to the intensive care unit (ICU) and as many times as possible across hospitalization; and the third for the date of patient discharge or death. By virtue of this model's volume of OE concepts and the data-registration criteria it establishes, it is well suited to validating the methodology. Thus, this model was generated from the EHRs of 4,489 patients hospitalized due to COVID-19 from 25 February 2020 to 10 September 2020. Figure 7 shows an overview of the methodology implementation process, based on the components described above.

The results of this study are the deliverables defined in the different stages of the methodology. Its implementation into the Hospital began on March 15, 2020 and the first EHR-derived extract was generated and validated on April 20, 2020.

The first result obtained in this study was the specification and standardization of a set of 22 clinical OE and 36 laboratory-related OE of interest in COVID-19 (included in Appendix A). These concepts, in consonance with the ISO 13606 standard and semantically linked to standard terminologies, are implemented in the multiple Hospital healthcare information systems, allowing homogeneous data entry via clinical record forms or, transparently, through integration with laboratory equipment. Data are stored in each system's database, following a dual key-value structure: standard concept of the OE and finding reported. This allows the reuse of data, while maintaining their original meaning unaltered.
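The key-value structure and the mapping from local identifiers to standard codes can be pictured with a small Python sketch. The system names, local identifiers and the function `standardize` are hypothetical; only the SNOMED CT code 103228002 is taken from the text, and the actual mapping in the study was configured inside each hospital system rather than in external code.

```python
# Hypothetical mapping: (source system, local identifier) -> standard concept code.
LOCAL_TO_STANDARD = {
    ("nursing_forms", "SPO2"): "103228002",
    ("icu_monitoring", "sat_o2"): "103228002",
}


def standardize(system: str, local_id: str, finding):
    """Return a key-value record (standard concept, finding), or None if unmapped."""
    concept = LOCAL_TO_STANDARD.get((system, local_id))
    if concept is None:
        return None  # concept not yet defined for this system
    return {"concept": concept, "finding": finding}


# Two systems with different local identifiers yield the same standard key.
record_a = standardize("nursing_forms", "SPO2", 96.0)
record_b = standardize("icu_monitoring", "sat_o2", 92.0)
```

Once both records share the same concept key, downstream rules can treat them uniformly regardless of the system in which they were recorded.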

The second result achieved was the design and development of transformation rules to be applied to EHRs, based on standard archetypes, for obtaining secondary use models. To address the generation of the proposed ISARIC-WHO information model, the data transformation rules were adapted to its specific criteria, without the need to create any operations in addition to those identified in Stage 3 of the methodology.

An algorithm in R was thus implemented: this selects data for each patient in line with the standard OE concepts, to which it then applies the rules defined for generating the ISARIC-WHO COVID-19 information model.

Lastly, the set of OE proposed by ISARIC-WHO was obtained for the cohort of 4,489 patients hospitalized due to COVID-19 (4,286 confirmed by laboratory test and 203 with clinical diagnosis) from 25 February 2020 to 10 September 2020. Of a total of 36 OE that define this model, 34 could be generated. As the concepts "Capillary refill time" and "Mid-upper arm circumference" were not identified in Stage 1 of the methodology, it was proposed that they should be included in the information model and, by extension, in the hospital information systems. The proposed methodology allowed the concept model to be expanded without altering the data model, through the definition of new clinical archetypes. Table 1 shows the volume of data that could be automatically generated from EHRs. Firstly, it shows the total records directly extracted from health information systems, prior to being processed. Secondly, it shows the data after application of the generation algorithm for modules 1 and 2 of the ISARIC-WHO information model (module 3 does not include OE concepts), with the following breakdown: total number of records generated; number of patients to whom these refer; and the percentage of the total cohort covered. As can be seen, the majority of OE had a patient coverage of over 85%. Some basic vital signs, e.g., blood pressure and oxygen saturation, as well as the SARS-CoV-2 detection test and common laboratory tests, e.g., hemogram, sodium or potassium, had a high coverage since these measurements are performed daily on most hospitalized COVID-19 patients. Even so, there were concepts, such as the Glasgow Coma Scale score or specific laboratory tests, e.g., IL-6 and lactate, for which the percentage of patients covered was in the region of 10%. This is because not all patients underwent the complete set of OE included in the model.

The fact that these are real world data means that each patient exclusively generated data relating to the observations which professionals found necessary in healthcare activity.
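The breakdown reported in Table 1 (records generated, patients covered and percentage of the cohort) can be computed along the following lines. This is a minimal Python sketch under assumed inputs: the function name `coverage_by_concept`, the `(patient_id, concept)` record layout and the toy data are illustrative and do not reproduce the study's figures.

```python
from collections import defaultdict


def coverage_by_concept(records, cohort_size):
    """Summarize generated records per concept.

    records: iterable of (patient_id, concept) pairs, one per generated record.
    cohort_size: total number of patients in the cohort.
    Returns {concept: {"records", "patients", "coverage_pct"}}.
    """
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    n_records = defaultdict(int)
    patients = defaultdict(set)
    for patient_id, concept in records:
        n_records[concept] += 1
        patients[concept].add(patient_id)  # a patient is counted once per concept
    return {
        concept: {
            "records": n_records[concept],
            "patients": len(patients[concept]),
            "coverage_pct": round(100.0 * len(patients[concept]) / cohort_size, 1),
        }
        for concept in n_records
    }


# Toy data: a cohort of 4 patients, not the study cohort.
toy_summary = coverage_by_concept(
    [("p1", "Oxygen saturation"), ("p1", "Oxygen saturation"),
     ("p2", "Oxygen saturation"), ("p1", "IL-6")],
    cohort_size=4,
)
```

Coverage here is patient-level: repeated measurements of the same concept in one patient raise the record count but not the coverage percentage.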

The proposed methodology takes the DCM paradigm as its basis and was initially applied successfully to the creation of an i2b2 data warehouse in the Hospital [27, 38]. However, this study broadens its scope given that, for an effective reuse of health data, it is necessary to create a mechanism that offers data to consumers in the format they demand. In comparison with previous studies focused on the DCM approach for data extraction from heterogeneous sources [39], the proposed methodology serves not only to extract and standardize the data currently generated, but also to improve the Hospital information systems. Consequently, it is possible to record data with the modeling and standardization requirements needed for transforming them into the information models demanded by the different initiatives dedicated to the collection, integration, and harmonization of COVID-19 data. In this sense, ISARIC has implemented an EDC, based on the ISARIC-WHO CRF, for reporting COVID-19 cases to generate monthly clinical data [40]; the 4CE Consortium has designed a common model of aggregated COVID-19 data to perform combined studies [41]; TriNetX has defined an essential set of data elements to build COVID-19 research cohorts from EHRs [42]; the European Health Data Evidence Network (EHDEN) has launched a rapid call to homogenize COVID-19 data in a European network of OMOP repositories [43]; and the National COVID Cohort Collaborative (N3C) has created an open scientific community focused on the analysis of patient-level data from multiple centers [44]. The aim of the methodology proposed in this study is not to replace these initiatives, but to obtain data conforming to the information model designed in each one of them rapidly and efficiently.

In parallel to archetype-based initiatives, such as the ISO 13606 standard or the OpenEHR specification [19], the Fast Healthcare Interoperability Resources (FHIR) standard of Health Level Seven (HL7) has been applied to model COVID-19 information by different standardization initiatives [45]. This standard offers a rapid mechanism for information exchange between different systems without loss of meaning. To achieve this, FHIR provides a series of common health information resources, which incorporate semantics as an element of the information model itself, defining a generic "Observation" resource for representing and exchanging any observable entity. Nonetheless, for formalizing the concept model of multiple information systems, it is necessary to implement an archetype for each clinical concept, defining its specific components and constraints. This means that both standards, ISO 13606 and FHIR, can be used in conjunction, each being applied for its design purpose. In relation to this, the group of experts from the Technical Committee ISO/TC 215 Health informatics is working on the "Guidelines for implementation of HL7 / FHIR based on ISO 13940 and ISO 13606" [46].
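For readers unfamiliar with the generic "Observation" resource, the sketch below builds a FHIR-style Observation for an oxygen saturation measurement as a plain Python dictionary. It is an illustration only: the study did not exchange data in FHIR, the function name and the patient reference are hypothetical, and the resource has not been validated against any FHIR profile. The field names follow the FHIR Observation resource, with SNOMED CT and UCUM identified by their usual system URIs.

```python
import json


def oxygen_saturation_observation(patient_ref: str, value: float, observed_at: str) -> dict:
    """Build a minimal FHIR-style Observation for oxygen saturation."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": "103228002",  # SNOMED CT concept quoted in the text
                "display": "Hemoglobin saturation with oxygen",
            }]
        },
        "subject": {"reference": patient_ref},
        "effectiveDateTime": observed_at,
        "valueQuantity": {
            "value": value,
            "unit": "%",
            "system": "http://unitsofmeasure.org",  # UCUM
            "code": "%",
        },
    }


# Hypothetical patient reference and timestamp.
example_observation = oxygen_saturation_observation(
    "Patient/example", 96.0, "2020-03-20T08:00:00+01:00"
)
example_json = json.dumps(example_observation, indent=2)
```

The same generic structure could carry any observable entity by changing the code and value, whereas concept-specific constraints (units, ranges, value sets) are what an archetype adds on top.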

Therefore, use was made of the ISO 13606 standard, parts 1 (reference model) and 2 (archetype model), because of its stability and adaptability, as well as its adoption as a reference standard by Spain and our Hospital. On the one hand, the reference model was used to model concepts pertaining to the OE to be implemented in healthcare information systems. On the other hand, the archetype model made it possible to formalize the information models and link their components to standard terminologies, which represent their clinical meaning. Adopting the ISO 13606 standard enabled the methodology to be a systematic process, homogenizing the data extracts to be transformed and ensuring the completeness and consistency of data through the full definition of the information model and its constraints. In addition, implementation of clinical archetypes allows these to be published and shared for subsequent use. Thus, reuse of the clinical archetypes and of the designed ETL process allows the methodology to be extended to other health organizations and applied to other conditions with minimum effort. If an organization decides to apply it, the only manual work required is to implement the clinical archetypes in the information systems of the organization (creation and mapping of standard concepts) at Stage 4. If the methodology is applied to a different condition, it may be necessary to include new clinical concepts at Stages 1 and 2, as well as to adapt the transformation rules to the specified EHR-derived model at Stage 3. This reproducibility is essential in a country like Spain, which has 17 Regions with transferred health authority, so the methodology could be applied in each of them to standardize the clinical concept models of their multiple information systems [47].

The fact that this methodology was developed in the scenario of a new disease means that the specification of relevant variables should be expanded in the future: it is preferable to collect useful data at this time rather than wait for a perfect model. DCM allows the initially defined concept model to be extended without altering the information systems that implement it. Thus, ISO 13606-compliant archetypes were used as a basis for implementing the clinical domain concepts that render the multiple Hospital healthcare information systems conceptually homogeneous. Some applications of the ISO 13606 standard in the methodology will be expanded in future studies. Because the Hospital information systems are not prepared to incorporate archetypes automatically, the clinical concepts were defined manually in each of these systems on the basis of the defined archetypes (terminology binding and metadata). Similarly, a structure implemented in XML and Delimiter-Separated Values (DSV) was chosen for the EHR extracts on which to apply the transformation operations, since it allows data to be processed without loss of meaning. To make these extracts completely interoperable, different organizations must converge on a common structure. Accordingly, a limitation to be resolved in future studies is to employ the ISO 13606 archetype model for the automatic definition of concepts in any healthcare information system and for the generation of EHR extracts in line with these, as proposed by previous papers on the topic [48].

Terminology binding of the OE was effected using only two terminologies: SNOMED CT and LOINC. SNOMED CT was used for clinical OE; of a total of 22 concepts, only five could not be found in the International Edition and were created through the concept-extension mechanism defined by this terminology. LOINC was used for laboratory OE, and all 36 concepts were found in the terminology.

The use of only two terminological standards to cover the complete spectrum of OE registered in healthcare information systems differs completely from conventional methodologies based on implementation of specific data collection forms with their own coding, where the same data is recorded in multiple systems in multiple ways [11] . This amounts to a real and initial implementation of something that international studies propose as a line to be pursued in health research based on real world data from multiple sources [49] .

In accordance with the archetypes implemented, transformation rules for generating the ISARIC-WHO information model were defined and then validated by the clinical team. These rules were designed with a multipurpose approach, so they can be adapted to generate any EHR-derived model that might require these OE. In this case, it was only necessary to adjust parameters regarding temporality and values of interest in accordance with the specific requirements of the model. These rules process EHR extracts implemented in XML and DSV and then generate EHR-derived data extracts conforming to ISARIC-WHO, which can be directly used by consumers or stored in shared repositories [40] . By way of complementing the above, this study is to be followed by systematic application of these transformation rules to EHR extracts in accordance with ISO 13606 as in previous studies on transformation between information models [50, 51] .

Lastly, the automatic generation of the ISARIC-WHO COVID-19 CRF had a patient coverage of over 85%.
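Patient coverage can be read as the proportion of cohort patients for whom a given OE has at least one usable value in the derived dataset. The sketch below, in Python, shows one way to compute it under that assumed definition. The function name `coverage`, the field names and the toy records are hypothetical and are not the study's data or results.

```python
def coverage(cohort_ids, derived_records, field):
    """Fraction of cohort patients with a non-missing value for `field`."""
    if not cohort_ids:
        return 0.0
    covered = sum(
        1
        for pid in cohort_ids
        if derived_records.get(pid, {}).get(field) is not None
    )
    return covered / len(cohort_ids)


# Toy cohort of four patients; p4 has no derived record at all.
cohort = ["p1", "p2", "p3", "p4"]
records = {
    "p1": {"temp_admission": 38.4, "smoking": "never"},
    "p2": {"temp_admission": 36.8, "smoking": None},
    "p3": {"temp_admission": 37.5},
}

cov = {f: coverage(cohort, records, f) for f in ("temp_admission", "smoking")}
```

Because the denominator is the whole cohort, a patient with no record for an OE counts as not covered. This matches the point made in the next paragraph, that EHRs only contain the observations professionals found necessary during care.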

Reusing data from EHRs means that each patient generates data only for the observations that professionals found necessary to obtain during healthcare activity. Along these lines, the EHR2EDC project has developed a seamless and acceptable method for reusing hospital EHR data within clinical trials. Its first objective was to transfer at least 15% of the specified data, and it proved possible to achieve up to 37% [52]. Compared with manual data collection methodologies, reusing health data has made a greater data scope achievable without any additional effort on the part of the organization. At the same time as this project, a relevant COVID-19 study based on manual data entry using the ISARIC-WHO CRF was conducted in 208 acute care hospitals in England, Wales and Scotland [36, 37]. It adequately collected information on 20,133 hospitalized patients in the domains identified by the proposed methodology as less problematic, such as demographic data, visits, comorbidities, symptoms and medication.

Nevertheless, the results of that study included only one clinical OE, smoking habit, and no laboratory-related OE. This underscores the need to standardize this highly extensive and heterogeneous data domain.

Moreover, the cohort of that study comprised patients admitted with COVID-19 between 6 February 2020 and 19 April 2020. In manual data collection processes, the number of patients included determines the effort and time required of the organization. Our methodology was applied to a cohort of 4,489 patients hospitalized from 25 February 2020 to 10 September 2020. It has no such limitation: once the process of generating the secondary use model from EHRs has been implemented, including further cases requires no additional effort or time. That said, EHR data have certain characteristics that differ from those of data collected manually for a specific purpose [53]. Although the archetypes allow basic data quality control to be established, this study will be followed by another on the quality, validity and utility of EHR-derived data in research and other secondary uses.

This study has furnished a real and novel solution to the difficulty of rapidly and efficiently obtaining EHR-derived data for secondary use in COVID-19, one capable of adapting to changes in data specifications and ensuring acceptable data quality. To this end, a flexible methodology based on the DCM paradigm was designed and implemented in a tertiary hospital in the Madrid Region, Spain. Spain has 17 regional Health Services with devolved health authority, so the methodology could be applied in each Region, and even in other countries, to homogenize the data-reuse process for COVID-19 and other health conditions. The methodology presented was divided into four stages. First, a total of 58 OE were identified as an initial set of relevant concepts for COVID-19. These were then modeled and formalized via parts 1 and 2 of the ISO 13606 standard, and semantically linked to standards such as SNOMED CT and LOINC. Selection and transformation rules for generating EHR-derived models were then designed and implemented.

Lastly, the transformation process was validated by generating the information model proposed by ISARIC-WHO for the 4,489 COVID-19 cases identified at the hospital up to 10 September 2020. Of the 36 OE included in the ISARIC-WHO model, it was possible to obtain 34, with a coverage, in most instances, of over 85% of patients in the cohort. The conclusion to be drawn from this initial validation is that this methodology allows the effective reuse of EHRs in a real and complex scenario, with a greater scope than that yielded by the classic process of manual recording in ad hoc EDCs, and without requiring additional effort or time on the part of healthcare professionals.

A novel coronavirus from patients with pneumonia in China

Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study

Situation Report of Health Ministry of Spain (COVID-19)

Definition, structure, content, use and impacts of electronic health records: A review of the research literature

Rapid response to COVID-19: health informatics support for outbreak management in an academic health system

Opportunities and challenges in utilizing electronic health records for infection surveillance, prevention, and control

Problems with health information technology and their effects on care delivery and patient outcomes: a systematic review

Toward a National Framework for the Secondary Use of Health

Data standards in clinical research: gaps, overlaps, challenges and future directions

Semantic processing of EHR data for clinical research

Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2)

Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers

Constraint-based domain models for future-proof information systems

Detailed Clinical Models: A Review

Building a logical EHR architecture based on ISO 13606 standard and semantic web technologies

Evaluating Model-Driven Development for large-scale EHRs through the openEHR approach

Development of an openEHR Template for COVID-19 Based on Clinical Guidelines

The advanced terminology and coding system for eHealth

LOINC, a universal standard for identifying laboratory observations: A 5-year update

LOINC resources for COVID-19

A CEN/ISO-13606 clinical repository based on ontologies

Proof-of-concept design and development of an EN13606-based electronic health care record service

Examining database persistence of ISO/EN 13606 standardized electronic health record extracts: relational vs. NoSQL approaches

Defining a Standardized Information Model for Multi-Source Representation of Breast Cancer Data

Physician stress and burnout: the impact of health information technology

LinkEHR-Ed: a multi-reference model archetype editor based on formal semantics

Clinical modeling resources, reference ISO 13606 archetypes

A data types profile suitable for use with ISO EN 13606

Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research

Health Ministry of Spain: SNOMED CT resources for COVID-19

R Foundation for Statistical Computing

Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study

Building an I2B2-Based Population Repository for Clinical Research

Detailed clinical modelling approach to data extraction from heterogeneous data sources for clinical research

ISARIC-WHO COVID-19 Data Management & Hosting

International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium

TriNetX: COVID-19 Clinical Data

EHDEN: COVID19 Rapid Collaboration Call

The National COVID Cohort Collaborative

Logica COVID-19 FHIR Profile Library

Guidelines for implementation of HL7 / FHIR based on ISO 13940 and ISO 13606

Health Ministry of Spain: Electronic Health Records of the National Health System (HCDSNS)

Extraction of standardized archetyped data from Electronic Health Record systems based on the Entity-Attribute-Value Model

Federated electronic health records research technology to support clinical trial protocol optimization: Evidence from EHR4CR and the InSite platform

Ontology-based data integration between clinical and research systems

CLIN-IK-LINKS: A platform for the design and execution of clinical data transformation and reasoning workflows

EHR2EDC project

Defining and measuring completeness of electronic health records for secondary use

Credit Author statement

Miguel Pedrera Jiménez: Conceptualization, Methodology, Project administration, Writing-Original draft preparation

Writing-Original draft preparation

Data Curation, Validation, Writing-Reviewing and Editing

Ana Isabel Terriza Torres: Data Curation, Validation, Writing-Reviewing and Editing

Data Curation, Validation, Writing-Reviewing and Editing

Data Curation, Validation, Writing-Reviewing and Editing

Data Curation, Validation, Writing-Reviewing and Editing

Resources, Writing-Reviewing and Editing

Gustavo Roig Domínguez: Resources, Writing-Reviewing and Editing

Juan Luis Cruz Bermúdez: Supervision, Writing-Reviewing and Editing

José Luis Bernal Sobrino: Supervision, Writing-Reviewing and Editing

Conceptualization, Supervision, Writing-Reviewing and Editing

Adolfo Muñoz Carrero: Supervision, Writing-Reviewing and Editing

Hospital 12 de Octubre is supported by "Arquitectura normalizada de datos clínicos para la generación de infobancos y su uso secundario en investigación: caso de uso cáncer de mama, cérvix y útero, y evaluación" PI18/00981 and "Infobanco para uso secundario de datos de salud basado en estándares de tecnología y conocimiento: evaluación de la calidad, validez y utilidad de la HCE como origen de datos para el estudio de la infección por VIH" PI18/01047. The Digital Health Research Department, Instituto de Salud Carlos III (ISCIII), is supported by PI18CIII/00019 "Arquitectura normalizada de datos clínicos para la generación de infobancos y su uso secundario en investigación: solución tecnológica". These projects are funded by the Carlos III Health Institute from the Spanish National Plan for Scientific and Technical Research and Innovation 2017-2020 and the European Regional Development Funds (FEDER).

We would like to thank Mercedes Alfaro, Arturo Romero, Jorge Rangil, María Jesús López de Cuellar, Luis Lapuente, Ana Delgado, Rosalía Fernández and the SNOMED CT National Reference Center for Spain for their support in the standardization and creation of new concepts.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.