key: cord-0012995-j0zybkho
authors: Gagalova, Kristina K; Leon Elizalde, M Angelica; Portales-Casamar, Elodie; Görges, Matthias
title: What You Need to Know Before Implementing a Clinical Research Data Warehouse: Comparative Review of Integrated Data Repositories in Health Care Institutions
date: 2020-08-27
journal: JMIR Form Res
DOI: 10.2196/17687
sha: 5f291ac4aa5dace8a53e843a72ae199eef28ccb7
doc_id: 12995
cord_uid: j0zybkho

BACKGROUND: Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.

OBJECTIVE: The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation.

METHODS: We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices.

RESULTS: Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration.
These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making).

CONCLUSIONS: IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.

An electronic health record (EHR) is a system for the input, processing, storage, and retrieval of digital health data. EHR systems have been increasingly adopted in the United States over the past 10 years [1], and their use is spreading worldwide in both hospital and outpatient care settings [2, 3]. An EHR is typically organized in a patient-centric manner and has become a powerful tool to store data in a time-dependent and longitudinal structure. EHR data can also be integrated into an enterprise data warehouse or integrated data repository (IDR). IDRs collect heterogeneous data from multiple sources and present them to the user through a comprehensive view [4]. Unlike EHRs, IDRs offer specialized analytical tools for researchers or analysts to perform data analyses. An IDR is a significant institutional investment in terms of both initial costs and maintenance, but it offers the advantage of clinical data reuse beyond direct clinical care, such as for research and quality improvement studies. Secondary use of clinical data is a rapidly growing field [5, 6]; an increasing number of institutions have implemented in-house IDRs and several others are developing IDRs for future research endeavors.
Unlike clinical practice, which focuses on enhancing the well-being of current patients, the purpose of an IDR is to produce generalized knowledge that can be extended to future patients. Typical applications of IDRs include retrospective analysis and hypothesis generation [7]. Some IDRs also support clinical applications, such as clinical decision support systems (CDSSs), that work alongside clinical practice to estimate risk factors or predictive scores associated with clinical treatments. CDSSs help to avoid medical errors and deliver efficient and safer care by assisting the provider with diagnosis, therapy planning, and treatment evaluation decisions [8]. All these applications are valuable resources that have the potential to improve the quality of health care [9] and reduce health costs if implemented appropriately [10].

Our study is motivated by the need to develop a pediatric IDR at our institution and by the lack of literature providing practical recommendations to apply during the initial development stages. Reviews by Shin et al [11] and Huser et al [12] highlighted the recommended characteristics when designing an IDR; however, they included only a small number of example IDRs. Since 2014, the IDR landscape has evolved rapidly, and thus, we felt that more recent developments needed to be better addressed as well. A 2018 review by Hamoud et al [13] provided a comprehensive description of the most recent data warehouses, including information about their data content, processing, and main purpose; it also provided general recommendations for the implementation of an IDR, but no practical considerations to guide the planning stages. This study compares the features of contemporary IDRs and presents some guiding principles for the design and implementation of a clinical research data warehouse.
Our research objective was to identify the major features of contemporary IDRs and obtain a list of established architectures used in the field of health informatics. We expect that this review will be useful for other small- to medium-sized institutions that plan to implement an institutional IDR and have no extensive experience in the field.

We conducted a literature review and a targeted web-based search to identify the major existing IDRs and synthesized the retrieved information around key themes.

We performed a narrative review following the procedure described below. First, a literature search was conducted using Ovid MEDLINE (Medical Literature Analysis and Retrieval System Online) and IEEE Xplore (Institute of Electrical and Electronics Engineers Xplore), queried in March 2020 (Figure 1). Articles were identified in 2 iterative phases. The first phase used an initial list of keywords querying for infrastructure purposes (data integration, such as linkage and harmonization) as well as infrastructure type and hospital setting (Multimedia Appendix 1: A1). The second phase search used additional keywords identified from the titles and abstracts of articles retrieved in the first phase (Multimedia Appendix 1: A1). Second, Google Scholar was queried for major article keywords (Integrated Data Repository) OR (Clinical Data Warehouse), and the first 150 retrieved hits were screened. The query was executed in a single search stage because the traditional search methods using Ovid MEDLINE and IEEE Xplore already produced exhaustive results. We selected peer-reviewed articles, published in the English language between January 2008 and March 2020, to include the most current data warehouse features. Non-English articles were excluded because of a lack of resources for translation. We retained articles for which the full text was available and removed duplicates.
KG read the abstracts, and the articles describing specific data integration strategies, describing architecture structures, or providing more information about the data models were included. When it was unclear whether an article should be included, the authors EPC and MG were consulted. Duplicated articles were removed using EndNote reference management software (Clarivate Analytics). Additional articles providing the most up-to-date information about selected IDRs or cited by the selected articles were included in the selection process because they were considered relevant for the IDR definition.

Targeted Web-Based Search of Known Institutional IDRs

We manually queried nonpublished resources with the goal of adding contemporary data warehousing practices implemented in large North American hospitals. A convenience sample of hospitals known to be leaders in these types of data warehousing was suggested by EPC and MG. Additionally, we browsed publicly available information on each of the targeted institutional websites (Multimedia Appendix 1: A2). This was complemented with relevant peer-reviewed articles cited in these websites related to the design, implementation, and applications of such repositories.

For the comparative review analysis, we performed a manual selection to shortlist articles specifically describing IDR architectures. The shortlisting considered the major focus of the article and the presence of significant details describing data integration, data processing, or database services. The selected articles were searched for related IDR projects and further web-based resources (Table 1 and Multimedia Appendix 1: A3).

Figure 2. Architecture models identified from selected integrated data repositories (IDRs). Arrows indicate data output because of a query (blue) and data input (orange) because of data integration or update.
Continuous lines show data query and integration applied by research users, whereas dashed lines are data queries performed by operational or clinical users.

Information from the literature was aggregated through thematic analysis and collapsed into 4 classes of IDR architectures. We evaluated the main features of the identified IDRs, such as data processing components, data characteristics, common terminologies, and data models. Features were summarized, compared, and contrasted. We extracted information about host institutions and divided them into small (≤500 beds), medium (500-1000 beds), and large (>1000 beds) institutions based on the number of beds listed on the institution's websites. Selected articles were uploaded into NVivo 12 (QSR International LLC) for qualitative analysis, specifically to count the word frequency in the selected papers. The words with a minimum length of 5 in the full text were counted, excluding stop words, and grouped by synonyms. The word frequency is represented as a word cloud, generated with R (R Foundation for Statistical Computing) and wordcloud package 2.6. The references of the articles describing IDRs were downloaded in a semiautomated manner using Content Extractor and Miner software [50] to parse the full-text PDF files. References to web resources, video-cast meetings, and software were removed, and partial references were manually corrected. The references were grouped by first author and year of publication, loaded in R (R Foundation for Statistical Computing), and plotted with UpSetR [51].

A total of 241 articles were identified in the literature search [11,13-19,21-29,31,33-35,37,43,44,47-49]; the largest number of articles were identified in IEEE Xplore (n=112), followed by MEDLINE (n=95), and Google Scholar (n=34).
After removing duplicates (n=24), we added 3 articles that were frequently cited in the selected articles but were missing from our search results [30, 36, 42]. Three articles [38, 40, 45] were further added that provided additional details relevant to the review topic. Finally, 1 article was replaced by a more recent publication [265]. These 247 articles were combined with the targeted web-based search [32,39-41,266-269]; hence, we identified a total of 255 articles (Figure 1). The most frequent words in the articles were system, information, study, project, and design (Multimedia Appendix 1: A4). [...], which was developed to automatically load data from a clinical data repository into a standard data model that researchers can query; it is a successful example of fast data upload and query using data structures designed from standard data models available for clinical research.

We identified 2 types of IDRs: those developed for use in a single institution (n=19) and those implemented for a collaborative project (n=10). The latter typically integrate patient data and provide project-specific tools. The median number of different institutional partners in a collaborative IDR is 6, with one of the partners acting as an organizational hub. The partners range from research institutes, laboratories, and private institutions to university medical centers. The IDRs were further divided by their scope (Table 1), which was classified as general or specialized medical care (cancer, pediatrics, perinatal, cerebrovascular, or cardiovascular). Seven of the 10 IDRs containing specialized data were collaborative projects, likely indicating the need to pool data from several institutions when dealing with smaller but more focused patient populations.

We identified 4 overarching conceptual architectures that summarize the data layers in the selected IDRs (Figure 2).
Different institutions can implement multiple architectures for different purposes; we assigned each IDR to a category considering the major features of the IDR, as described in their respective articles. The general architecture model is the most common model, with 19 identified IDRs structured around medical data mining ( Figure 2 , General architecture with optional CDSS). In outline, different data marts are transferred to a staging layer that harmonizes the input to a common data view; data are loaded into a common data warehouse and queried through an application layer that communicates with the user; a CDSS tool can provide added functionality. Hence, in this architecture, each data source is originally stored in an independent data mart, collecting data from a separate research or clinical source within the same institution. Data are processed in the staging layer, which reshapes the input to an integrated view through several steps of data linkage, transformation, and harmonization. The next stage of processing is loading the data into a single database connected to an application layer that provides the tools for end users, typically researchers, to access and analyze the data securely with different services. The biobank-driven architecture model is built around a particular application, in this case, biobanking (Figure 2 , Biobank-driven architecture). This model is similar to the general architecture model but, in this case, the IDR is built around the biosamples database. The biosample data integration occurs at the staging layer. The main feature is that the model allows the biosample operational user to access the raw and identified biobank data source for quality control and biosample management. An example of a biobank-driven structure is the biorepository portal (BRP) [41, 266] , which allows for the automatic integration of biosamples with clinical data, while maintaining unrestricted access to the biorepository for the operational team. 
The Mayo Clinic and Vanderbilt University adopt the general and biobank-driven architecture models in parallel. The user-controlled application layer architecture model does not have a specific staging layer (Figure 2 , User-controlled application layer). This architecture does not include a central data warehouse; the data are preprocessed and integrated from the original data sources only when the users query the data. Hence, data are processed in 2 stages: the first stage preprocesses the original data to a common format. The user query then carries out the final data integration function for the output delivery. In this architecture, a common data warehouse is not implemented, but rather the data are dynamically queried. An example is the text mining technology at the Léon Bérard Cancer Center (CLB) [44] , which indexes text documents during the preprocessing stage and in which the users' queries return the exact documents matched. The federated architecture is implemented for heterogeneous data retrieval and integration across multiple institutions ( Figure 2 , Federated architecture, adapted from OpenFurther). In this case, institutions selectively share their data through an adaptor system that applies common preprocessing, with data integrated on-the-fly in a virtual data warehouse. The FURTHeR federated query platform [45] builds a virtual IDR that responds to the needs of the user and calls several services for data resolution on-the-fly and upon query. The architecture model is flexible and operates using several services for data integration. An application of FURTHeR is the Pediatric Health Information System+ project [47] , which combines data from 6 institutions. The IDR uses a federation component, which aggregates and stores translated query results in a temporary, in-memory database for presentation and analysis by the researcher for the duration of the user's session. 
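The federation pattern described above can be illustrated with a minimal Python sketch. It is not the FURTHeR implementation: the site schemas, the `SiteAdapter` class, and the `federated_query` function are hypothetical names introduced only for illustration, and SQLite stands in for each institution's database. Each adapter translates one shared query into its site's local table and column names, and the federation step collects the translated results in a temporary, in-memory database that exists only for the user's session.

```python
import sqlite3


def make_site(table, id_col, age_col, rows):
    """Create a stand-in site database with its own local schema (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({id_col} TEXT, {age_col} INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn


class SiteAdapter:
    """Translates the shared query into one site's local table and column names."""

    def __init__(self, name, conn, table, id_col, age_col):
        self.name, self.conn = name, conn
        self.table, self.id_col, self.age_col = table, id_col, age_col

    def patients_younger_than(self, age):
        sql = (f"SELECT {self.id_col}, {self.age_col} "
               f"FROM {self.table} WHERE {self.age_col} < ?")
        # Return rows in the common format: (site, patient_id, age_years).
        return [(self.name, pid, yrs) for pid, yrs in self.conn.execute(sql, (age,))]


def federated_query(adapters, age):
    """Aggregate per-site results in a temporary in-memory session database."""
    session_db = sqlite3.connect(":memory:")
    session_db.execute(
        "CREATE TABLE result (site TEXT, patient_id TEXT, age_years INTEGER)")
    for adapter in adapters:
        session_db.executemany(
            "INSERT INTO result VALUES (?, ?, ?)",
            adapter.patients_younger_than(age))
    return session_db


# Two sites that store the same information under different local names.
site_a = SiteAdapter("site_a", make_site("pt", "pt_id", "age", [("a1", 3), ("a2", 9)]),
                     "pt", "pt_id", "age")
site_b = SiteAdapter("site_b",
                     make_site("patients", "mrn_hash", "age_yrs",
                               [("b1", 4), ("b2", 2), ("b3", 12)]),
                     "patients", "mrn_hash", "age_yrs")

session = federated_query([site_a, site_b], age=5)
counts = session.execute(
    "SELECT site, COUNT(*) FROM result GROUP BY site ORDER BY site").fetchall()
session.close()  # the virtual warehouse disappears with the session
```

In this sketch no site data are copied into a permanent store; only the rows matching the query reach the session database, which mirrors the selective sharing that the federated model is meant to provide.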
Federated data integration was also proposed using a research data management system (RDMS) [49] , which integrates clinical and biosample data from several institutions in Germany. The @neurIST [48] is a large IDR dedicated to translational research that includes data, computing resources, and tools for researchers and clinicians. Data are located across different sites and are securely shared with a grid infrastructure that allows federated data access. The 4 types of architecture present different analytics tools, data presentation logic, and query interface based on the type of user they serve, which can be classified into 2 major groups: the first group, such as researchers and operational or business analysts, uses the IDR to identify important clinical features that occur at the level of patient cohorts. The second type of user, such as physicians and other health care professionals, uses the IDR to make decisions at an individual patient level, for example, to plan specific therapeutic interventions or predict risk. The first type of user is served by all the architecture models (Research user in Figure 2) . The general architecture model that incorporates a CDSS presents a clear separation of both user types who have different applications for IDR data, with CDSS queries being made by clinical users (Figure 2 , General architecture with optional CDSS). Similarly, the biobank-driven architecture model includes operational users who can directly query the information regarding patient biosamples for clinical applications (Figure 2 , Biobank-driven architecture). Both data update and integration schedules in an IDR are important features that define the timeliness of data. Here, we describe some of the key limiting steps and their occurrence in the different IDR architecture models. 
The data processing involved in extraction, transformation, and loading (ETL) is described in detail in the articles of biomedical translational research information system (BTRIS) [14], HaMSTR [24], Mayo Translational Research Center (TRC) [38], CARPEM [28], onco-i2b2, Vanderbilt's Synthetic Derivative [39] and BioVU [40], and BRP [41]. These IDRs represent the general and biobank-driven architecture models, which implement a staging layer for the ETL process. A temporal sequence of the ETL steps is as follows:

1. Data extraction from source(s): The source data are extracted by an automatic (or manual) process.
2. Deidentification: Identifiable patient features, such as demographics or localization, are removed before loading into the IDR. The biobank-driven IDRs implement an automated process of this step without the need for extensive institutional reviews. In addition to the deidentified data, BTRIS [14] and Vanderbilt's Synthetic Derivative [39] maintain a parallel database with original identifiable patient entries for research purposes where appropriate.
3. Assignment of unique identifiers: Deidentified data are assigned unique patient identifiers that are used as a reference for linking.
4. Data transformation and standardization: Data are first checked for possible errors or missing values and are then transformed into a common format that is standard for all cohorts. Data may be subjected to transformation, such as the derivation of new values from the existing ones (pseudonymization) for maintaining privacy.
5. Standard terminology and ontology mapping: Data types are labeled with standard terminologies.
6. Data linkage: If the data are derived from multiple sources, they are linked and combined in the IDR.
7. Loading into the data warehouse: This is performed by either an update of existing data or a complete data re-import into the data warehouse.
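The ETL sequence above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline of any reviewed IDR: the source records, the field names, the `SECRET_KEY`, and the local-to-ICD-10 lookup table are all hypothetical, a keyed hash is one possible way to assign stable study identifiers, and SQLite stands in for the data warehouse. Loading is done as a complete re-import, one of the two options named in step 7.

```python
import hashlib
import hmac
import sqlite3
from datetime import datetime

SECRET_KEY = b"replace-with-a-managed-secret"     # hypothetical key
LOCAL_TO_ICD10 = {"DM2": "E11.9", "HTN": "I10"}   # illustrative mapping only
IDENTIFYING_FIELDS = {"name", "address"}          # assumed direct identifiers


def extract():
    """Step 1: stand-in for pulling rows from two source systems."""
    ehr = [
        {"mrn": "A001", "name": "Jane Doe", "address": "1 Main St",
         "dx": "DM2", "visit": "03/15/2019"},
        {"mrn": "A002", "name": "John Roe", "address": "2 Oak Ave",
         "dx": "HTN", "visit": "11/02/2019"},
        {"mrn": "A003", "name": "Ann Poe", "address": "3 Elm Rd",
         "dx": None, "visit": "01/20/2020"},
    ]
    labs = [
        {"mrn": "A001", "test": "HbA1c", "value": 7.2},
        {"mrn": "A002", "test": "HbA1c", "value": 5.4},
    ]
    return ehr, labs


def study_id(mrn):
    """Step 3: keyed hash, so one patient gets the same ID in every source."""
    return hmac.new(SECRET_KEY, mrn.encode(), hashlib.sha256).hexdigest()[:12]


def deidentify(row):
    """Steps 2-3: drop direct identifiers and replace the MRN with a study ID."""
    clean = {k: v for k, v in row.items()
             if k not in IDENTIFYING_FIELDS and k != "mrn"}
    clean["study_id"] = study_id(row["mrn"])
    return clean


def standardize(row):
    """Steps 4-5: skip rows with missing or unmapped codes, use ISO dates,
    and map the local diagnosis label to a standard terminology code."""
    if row["dx"] not in LOCAL_TO_ICD10:
        return None
    visit = datetime.strptime(row["visit"], "%m/%d/%Y").date().isoformat()
    return {"study_id": row["study_id"], "icd10": LOCAL_TO_ICD10[row["dx"]],
            "visit": visit}


def run_etl(conn):
    ehr, labs = extract()
    diagnoses = [r for r in (standardize(deidentify(e)) for e in ehr) if r]
    labs_by_id = {}
    for lab in (deidentify(l) for l in labs):
        labs_by_id.setdefault(lab["study_id"], []).append(lab)
    # Step 6: link the two sources on the shared study ID.
    linked = [{**d, "test": l["test"], "value": l["value"]}
              for d in diagnoses for l in labs_by_id.get(d["study_id"], [])]
    # Step 7: complete re-import into the warehouse table.
    conn.executescript("""
        DROP TABLE IF EXISTS observation;
        CREATE TABLE observation
            (study_id TEXT, icd10 TEXT, visit TEXT, test TEXT, value REAL);
    """)
    conn.executemany(
        "INSERT INTO observation VALUES (:study_id, :icd10, :visit, :test, :value)",
        linked)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
run_etl(warehouse)
rows = warehouse.execute(
    "SELECT icd10, visit, test, value FROM observation ORDER BY icd10").fetchall()
```

The record with a missing diagnosis is dropped at the standardization step rather than loaded, which is one simple policy for handling missing values; a production ETL process would more likely log such rows for review.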
The CLB [44] IDR (user-controlled application layer architecture model) uses specialized software to manipulate the content from unstructured data without using an ETL process. IDRs representing architecture model 4 do not provide additional information on the ETL process in their respective articles. Five of the selected articles provide additional information about the frequency of data updates in their IDRs. BTRIS [14] and Vanderbilt's Synthetic Derivative [39] argue for daily IDR updates as new source data accumulate daily. Onco-i2b2 [43] performs more frequent data synchronization, as frequent as every 15 min. A real-time data update is presented by METEOR [35] and MOSAIC [33] , which also integrate a CDSS in their architecture model and thus need this frequency to make actionable decisions. MOSAIC presents an example with asynchronous data update; although the CDSS is updated in real time, the demographics are synchronized only every 6 months. The general architecture model combined with a CDSS may require real-time data updates, whereas the general or the biobank-driven architecture models, without a CDSS, may have periodic updates that vary widely in frequency. We have listed the data types in 19 of the selected IDRs based on information in the articles (Figure 3 ). The most common types of data are those extracted from EHR that include patient demographics, diagnoses, procedures, laboratory tests, and medications. Several IDRs incorporate data from biosamples and their omics characterization, especially those based on the biobank-driven architecture model such as TRC [38] , BRP [41, 266] , and BioVU [40] . Health information technology uses controlled terminologies to condense the information to a set of codes that can be manipulated more easily and automatically in data processing. 
We observed the adoption of both common [272, 273] and specialized terminologies (eg, Anatomical Therapeutic Chemical Classification [274] , human phenotype ontology [275] , Gene Ontology [276] ). The most broadly used were International Classification of Diseases (ICD)-9 and 10 for the classification of diseases, systematized nomenclature of medicine-clinical terms (SNOMED-CT) for a variety of medical domains, Logical Observation Identifiers, Names, and Codes for laboratory observations, and current procedural terminology for common procedures (Table 1 ). These terminologies were utilized within the EHR and further integrated into the IDRs. A common data model (CDM) is a standard data schema that enables data interoperability and sharing. Contemporary data warehouses propose an analytical platform built around the CDM that provides all the software components to construct and manage the data in a CDM. A few different CDMs have been developed and adopted by the wider clinical research community, although some institutions still favor using a custom data schema tailored to their specific needs. In our study, a standard CDM was adopted by 18 of the 29 IDRs. The most frequently applied CDM, found in 16 instances, is Informatics for Integrating Biology and the Bedside (i2b2) [277] . METEOR [35] applies i2b2 with an expanded schema, and CARPEM [28] applies tranSMART [278] , which is a framework layered on top of i2b2, dedicated to integrating omics data with EHR data. Another popular CDM that has been used more frequently in recent years is the Observational Medical Outcomes Partnership (OMOP) [279] , adopted by 3 IDRs, namely MIDH [27], OpenFurther [46] , and STARR [20] . OpenFurther uses OpenMRS [280] , which is an open-source software and CDM that delivers health care in low-and middle-income countries. The BRP [41] is the only example using Harvest as their CDM. Our review identified several institutions of various sizes and scopes that utilize an IDR. 
These IDRs contain data used for both research and clinical decision-making purposes. Structured data derived from natural language processing of clinical notes, clinical imaging, and omics data are the most recent big data types to be integrated with standard clinical observations. Owing to the large heterogeneity, however, integration is complex and tailored to the specific needs during the IDR implementation and maintenance, as ETL necessitates a significant effort in both the initial modeling and the ongoing updates. As a novel contribution, we proposed and classified IDR architectures into 4 major models that highlight the processing and integration steps. The most common architecture model employs a staging layer implemented before the data are loaded into the data warehouse. A set of common features is applied across most IDRs: IDRs commonly use standard terminologies such as ICD-9/10 and SNOMED-CT, which are often already part of the EHR data. Several IDRs use an open-source translational research framework to model their data, as described by Huser et al [12]. We observed extensive use of the i2b2 CDM and the emergent adoption of the OMOP CDM, which has the possibility to map additional domain-specific terminologies. Interestingly, PCORnet, the most recently implemented CDM, which borrows from several other CDMs and is organized around patient outcomes [261], was not discussed in the sample of IDRs reviewed. To safeguard the data in the IDR, data security and privacy need to be ensured from the initial steps of development. Data security is an important factor in all architecture types, with a particular need in collaborative projects that share data across jurisdictions.
For example, in the general architecture of HSSC [29] , data need to be stored in physically and logically secure facilities, where data management is extended to all the parties involved, and data need to be transmitted between the participating institutions through private high-speed networks. In the case of federated data warehouses, such as @neurIST [48] , there is a tight control of data flow between different institutions and clinical and research domains, following policies aligned with recommendations from the Legal and Ethics Advisory Board. Privacy, referring to the protection of patient's personal information, emerged as an important feature, especially in the biobank-driven architecture; here, identifiable patient information is deleted from both the biosamples and the patient clinical data. Developers at the Children's Hospital of Philadelphia and the Children's Brain Tumor Tissue Consortium created an electronic Honest Broker (eHB) and Biorepository Portal (BRP) eHB [41] , which provides a method for patient privacy protection by removing all the exposure of the research staff to patient identifiers and automating the deidentification process. Following a different privacy-preservation approach, Vanderbilt's Synthetic Derivative database [39] alters the patient data by obfuscating the true entries while preserving their time dependence. The implementation of an IDR is subject to several factors that must be considered before development. We identified 2 major factors: (1) the data stored in the IDR and (2) the scope of the IDR, either being exclusively used for research purposes or in combination with clinical or operational purposes, as shown in the general and biobank-driven architecture models. Data types, heterogeneity, and volume greatly influence system load, update, and query of the database. 
The scope of the IDR influences its primary end users (researchers, clinical users, or operational users), who have different needs and, thus, need access to different sets of tools to extract, analyze, and visualize the data. All these features influence both the data latency and the data synchronization, which are major elements in the model architecture. Moreover, available funding plays an important role in architecture decisions, as do considerations for future expansions. Among the set of selected IDRs, we observed a number of collaborative projects that work within specialized medical domains, such as cancer or pediatrics. Collaborative IDRs are likely to integrate their data to increase the number of patients, thus increasing the statistical power of their respective cohorts. On the basis of our analysis, we highlight the following guiding principles for small- to medium-sized institutions planning to implement an IDR:

1. The general architecture model, with or without CDSS, is the most straightforward to implement; the data staging layer facilitates ETL and data processing before loading into the data warehouse.
2. Select a standard CDM already in use by other institutions; both i2b2 and OMOP provide server and client services in a single platform that serves the user with all the necessary tools to set up a structured IDR.
3. Wherever possible, adopt standard terminologies; we listed the most common terminologies derived from the integration with EHR data (Table 1). One promising approach is to apply common terminologies in the first phases of IDR development and to add other, more specialized terminologies later as the project scope expands.
4. Finally, carefully consider the data update requirements and the ETL process design, including the level of automation, as these are the limiting stages in data integration and update.
Commercial electronic medical record platforms such as Epic, Cerner, Meditech, and Allscripts are dominant in large institutions. However, although some information is available on how to query the underlying databases and on the application programming interfaces used to communicate with these systems, little information on transforming such data into an IDR is available in the literature, most likely because of the proprietary nature of these platforms. Most vendors also sell tools for analysts to query and make use of data from these clinical production systems; however, these tools are not IDRs themselves and are not targeted toward secondary use for research. As for lessons learned in the field, Epstein et al [281] demonstrated the feasibility of transferring the development of a perioperative data warehouse (schemas and processes) built on top of Epic's database from one institution to another. In their review, Hamoud et al [13] provided general requirements for building a successful clinical data warehouse, recommending a top-down approach to the initial stages of development. They recommended considering all the individual components of the final system to decrease integration obstacles when dealing with heterogeneous data sources. Three major factors contributing to the success of IDRs were identified by Baghal [231] when developing their in-house IDR: (1) organizational, enhancing the collaboration between different departments and researchers; (2) behavioral, building new professional relationships through frequent meetings and communication between departments; and (3) technical, deploying new self-service tools that empower researchers. Collectively, these factors increase the utility and adoption of IDRs in clinical research.
In addition, Rizi and Roudsari [282] reported lessons and barriers from their development of a public health data warehouse that IDR developers might want to consider; specifically, they cautioned against underestimating technical challenges such as those related to extracting data from other systems, difficulties in modeling and mapping of data, as well as data security and privacy. Other considerations include leveraging the IDR to improve data quality at the source, implementing a data governance framework from the beginning, and ensuring that key organizational stakeholders endorse the project early and strongly [282].

Our search was not intended to be a systematic search; therefore, we may have missed some articles. An example of missing articles is those describing raw and unstructured data repositories, also referred to as data lakes, as these did not appear in our search results although we know they exist. One of the data lakes was presented by Foran et al [207] as a file reservoir, integrated in the data warehouse schema. For researchers to access those data, it was necessary to use a feeder database before their upload to the final data warehouse. Furthermore, we were able to report only on the IDRs and IDR features described in the literature, possibly omitting smaller institutions that are not actively publishing in peer-reviewed journals. In an attempt to mitigate this issue, we searched the representative institutional websites to retrieve additional details about the IDR architectures. As shown in Multimedia Appendix 1: A2 [283-288], several organizations provide further details about their architecture in GitHub repositories or institutional Wiki pages, which can be explored for additional information besides the published literature. This review includes articles and web resources shortlisted according to aspects of the IDR architectures that were considered relevant.
Providing exhaustive coverage of all aspects of IDR implementation, such as tools designed to interact with the IDR, is better left for a dedicated review. An example of such tools is the Green Button project, which uses aggregate patient data to support clinical decision making at the point of care [289-292]. Examples of CDM-based tools built around an application are the @neurIST platform [48] and the @neurLink and @neurFuse application suites, which consist of research-oriented modules dedicated to knowledge discovery and image processing. CDSS tools such as Green Button, the @neurIST applications, and many other existing frameworks are essential in providing sophisticated analyses to support clinicians but are beyond the scope of our review.
There is significant potential in the implementation of IDRs in health institutions, and their importance is evident from the growing number of projects developed in the past 10 years. Despite the common trends in IDR implementation observed in this study, there are also many variations. There are 2 major design factors, namely data heterogeneity and IDR scope, that need to be carefully considered before embarking on the IDR design and planning process. Finally, we aim to apply the knowledge presented in this study to the implementation of a pediatric IDR at our institution. By sharing our experience of planning and designing our IDR with those joining the field or planning to implement an IDR for research purposes, we hope to contribute to future IDR endeavors.
Electronic health record adoption in US hospitals: the emergence of a digital 'advanced use' divide Impact of electronic medical record on physician practice in office settings: a systematic review A survey of primary care physicians in eleven countries, 2009: perspectives on care, costs, and experiences Practices and perspectives on building integrated data repositories: results from a 2010 CTSA survey Implementation of a deidentified federated data network for population-based cohort discovery Clinical data reuse or secondary use: current status and potential future progress Current state of information technologies for the clinical research enterprise across academic medical centers Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success The benefits of health information technology: a review of the recent literature shows predominantly positive results. Health Aff (Millwood) Precision diagnosis: a view of the clinical decision support systems (CDSS) landscape through the lens of critical care Characteristics desired in clinical data warehouse for biomedical research Desiderata for healthcare integrated data repositories based on architectural comparison of three public repositories Clinical data warehouse: a review The national institutes of health's biomedical translational research information system (BTRIS): design, contents, functionality and experience to date Building an i2b2-Based Integrated Data Repository for Cancer Research: A Case Study of Ovarian Cancer Registry Empowering mayo clinic individualized medicine with genomic data warehousing Secondary use of clinical data: the Vanderbilt approach Development of a large-scale de-identified DNA biobank to enable personalized medicine The biorepository portal toolkit: an honest brokered, modular service oriented software tool set for biospecimen-driven translational research BioBankWarden: a web-based system to support 
translational cancer research by managing clinical and biomaterial data An ICT infrastructure to integrate clinical and molecular data in oncology research An information retrieval system for computerized patient records in the context of a daily hospital practice: the example of the Léon Bérard cancer center (France) Federated querying architecture with clinical & translational health IT application Federating clinical data from six pediatric hospitals: process and initial results from the PHIS+ consortium NeurIST: infrastructure for advanced disease management through integration of heterogeneous data, computing, and complex processing services Metadata repository for improved data sharing and reuse based on HL7 FHIR CERMINE: automatic extraction of structured metadata from scientific literature UpSetR: an R package for the visualization of intersecting sets and their properties Experiences With Mirth: an Open Source Health Care Integration Engine Two-Phase chief complaint mapping to the UMLS metathesaurus in Korean electronic medical records Security Recommendations for Implementation in Distributed Healthcare Systems A hierarchical, ontology-driven Bayesian concept for ubiquitous medical environments--a case study for pulmonary diseases Implementing A Knowledge-Driven Hierarchical Context Model in a Medical Laboratory Information System Pseudonymization for Improving the Privacy in E-Health Applications Implementation of an integrated drug information system for inpatients to reduce medication errors in administering stage Data Integration in Cardiac Surgery Health Care Institution: Experience at G Pasquinucci Heart Hospital Flexible Data Integration and Ontology-Based Data Access to Medical Records Semantic integration of cervical cancer data repositories to facilitate multicenter association studies: the ASSIST approach The Multi-Knowledge Service-Oriented Architecture: Enabling Collaborative Research for E-Health Improving EMR System Adoption in Canadian 
Medical Practice: A Research Model Architecture of a federated query engine for heterogeneous resources A Software System Development for Probabilistic Relational Database Applications for Biomedical Informatics Biomedical Data Acquisition and Processing in the Decision Support Services of HEARTFAID Platform Analysis and Design on Standard System of Electronic Health Records Medical Signal Grid Repository, an Integration to Italica Project Design of ICU Medical Decision Support Applications by Integrating Service Oriented Applications With a Rule-Based System Achieving E-Health Care in a Distributed EHR System Distributed e-Health system with Smart Self-Care Units Towards an Integrated Platform for Improving Hospital Risk Management Development of a data warehouse for lymphoma cancer diagnosis and treatment decision support A Modular Clinical Decision Support System Clinical Prototype Extensible Into Multiple Clinical Settings PathMiner: a web-based tool for computer-assisted diagnostics in pathology Distributed Data Processing Framework for Oral Health Care Information Management Based on CSCWD Technology A unified framework for biomedical terminologies and ontologies Implementing RAID-3 on Cloud Storage for EMR System A unique digital electrocardiographic repository for the development of quantitative electrocardiography and cardiac safety: the Telemetric and Holter ECG Warehouse (THEW) The IT-infrastructure of a biobank for an academic medical center Strategies for Development and Adoption of EHR in German Ambulatory Care The REUSE project: EHR as single datasource for biomedical research Evaluation of Different Database Designs for Integration of Heterogeneous Distributed Electronic Health Records Suggested Criteria for Successful Deployment of a Clinical Decision Support System (CDSS) A Domain Ontology Approach in the ETL Process of Data Warehousing Sharing E-health Information Through Ontological Layering A Framework for Comprehensive Electronic QA in 
Radiation Therapy SMARTDIAB: a communication and information technology approach for the intelligent monitoring, management and follow-up of type 1 diabetes patients The TRITON project: design and implementation of an integrative translational research information management platform Health Information System Grid Based on WSRF Managing medical vocabulary updates in a clinical data warehouse: an RxNorm case study Design of and technical challenges involved in a framework for multicentric radiotherapy treatment planning studies Web Services Security Issues in Healthcare Applications The pediatrix babysteps data warehouse and the pediatrix qualitysteps improvement project system--tools for 'meaningful use' in continuous quality improvement The Change Strategy Towards an Integrated Health Information Infrastructure: Lessons From Sierra Leone Cerebrovascular Diseases Research Database Cerebrovascular Diseases Research Databaseresearch on Healthcare Integrating Model of Medical Information System Based on Agent A General Framework for Medical Data Mining Development of traditional Chinese medicine clinical data warehouse for medical knowledge discovery and decision support Using electronically available inpatient hospital data for research The ADE scorecards: a tool for adverse drug event detection in electronic health records Anesthesia Information Management System in Cardiac Surgery An Informatics Architecture for the Virtual Pediatric Intensive Care Unit Roogle: an information retrieval engine for clinical data warehouse An Education Support System with Anonymized Medical Data Based on Thin Client System Proposal of an Open-source Cloud Computing System for Exchanging Medical Images of a Hospital Information System Health Ontology System Strategies for maintaining patient privacy in i2b2 Development of a research dedicated archival system (TARAS) in a university hospital Web-based Medical Image Archiving and Communication System for Teleimaging an Electronic Nursing 
Management System to Facilitate Interdisciplinary Communication and Improve Patient Outcomes in Psychiatric Hospitals The biomedical resource ontology (BRO) to enable resource discovery in clinical and translational research Interoperability driven integration of biomedical data sources A Clinical Omics Database Integrating Epidemiology, Clinical, and Omics Data for Colorectal Cancer Translational Research Linked2Safety: A Secure Linked Data Medical Information Space for Semantically-interconnecting EHRs Advancing Patients' Safety in Medical Research Novel approach to utilizing electronic health records for dermatologic research: developing a multi-institutional federated data network for clinical and translational research in psoriasis and psoriatic arthritis Architectural approach for semantic EHR systems development based on detailed clinical models A Proposed Star Schema and Extraction Process to Enhance the Collection of Contextual & Semantic Information for Clinical Research Data Warehouses Development of a clinical data warehouse from an intensive care clinical information system A semantic web framework to integrate cancer omics data with biological knowledge Context-based electronic health record: toward patient specific healthcare A semantic proteomics dashboard (SemPoD) for data management in translational research University of Queensland vital signs dataset: development of an accessible repository of anesthesia patient monitoring data for research Automated realtime data import for the i2b2 clinical data warehouse: introducing the HL7 ETL cell Efficient data management in a large-scale epidemiology research project Enhanced Data Extraction, Transforming and Loading Processing for Traditional Chinese Medicine Clinical Data Warehouse An architecture for integrating cancer model repositories Anonymization of longitudinal electronic medical records Metamed-Medical Meta Data Extraction and Manipulation Tool Used in the Semantically Interoperable Research 
Information System Glocal clinical registries: pacemaker registry design and implementation for global and local integration--methodology and case study The biointelligence framework: a new computational platform for biomedical knowledge computing Multicentre clinical trials' data management: a hybrid solution to exploit the strengths of electronic data capture and electronic health records systems Survey on Privacy Preserving Updates on Unidentified Database Building data warehouse for diseases registry: first step for clinical data warehouse A Privacy Framework for Secondary Use of Medical Data Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives Computational framework to support integration of biomolecular and clinical data within a translational approach The analytic information warehouse (AIW): a platform for analytics using electronic health record data Improving the Implementation of Clinical Decision Support Systems An Application of a Healthcare Data Warehouse System Adapting the design of anesthesia information management systems to innovations depicted in industrial property documents A Cloud-based System for Supporting Multi-centre Studies Design and experimental approach to the construction of a human signal-molecule-profiling database Healthcare Services in the Cloud--Obstacles to Adoption, and a Way Forward The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research PRECISE:privacy-preserving cloud-assisted quality improvement service in healthcare WELCOME -innovative integrated care platform using wearable sensing and smart cloud computing for COPD patients with comorbidities A Data Gathering Framework to Collect Type 2 Diabetes Patients Data Opportunities and challenges provided by cloud repositories for bioinformatics-enabled drug discovery A Comparison of Search Engine Technologies for a Clinical Data Warehouse Building a Medical Research 
Cloud in the EASI-CLOUDS Project Use of administrative medical databases in population-based research Healthcare information exchange system based on a hybrid central/federated model The subarachnoid hemorrhage international trialists (SAHIT) repository: advancing clinical research in subarachnoid hemorrhage Cloud-Based Data Exchange Framework for Healthcare Services Clinical data warehouse issues and challenges Cloud Based Intelligent Healthcare Monitoring System A Clinical Data Warehouse Architecture based on the Electronic Healthcare Record Infrastructure Harvest: an open platform for developing web-based biomedical data discovery and reporting applications The mid-south clinical data research network Burden of diabetes mellitus estimated with a longitudinal population-based study using administrative databases Health information research platform (HIReP)--an architecture pattern A Cloud-based Model for Hospital Information Systems Integration An implementation framework for the feedback of individual research results and incidental findings in research Integration of Consumer Healthcare Data and Electronic Medical Records for Chronic Disease Management Improving the Effectiveness of Interprofessional Work Teams Using EHR-Based Data in the Treatment of Chronic Diseases: An Action Research Study Electronic health record adoption in US hospitals: progress continues, but challenges persist Towards a Model for Enhancing ICT4 Development and Information Security in Healthcare System Radiology reporting system data exchange with the electronic health record system: a case study in Iran ABC: A Knowledge Based Collaborative Framework for E-health Development, Deployment of a Telemedicine System in a Developing Country: Dealing With Organizational and Social Issues Semantic integration of medication data into the EHOP clinical data warehouse An ontology-based clinical data warehouse for scientific research TCGA4U: a web-based genomic analysis platform to explore and mine 
TCGA genomic data for translational research Towards a Semantic Clinical Data Warehouse: A Case Study of Discovering Similar Genes Knowledge Modelling Framework Protecting privacy in a clinical data warehouse Ehealth Recommendation Service System Using Ontology and Case-Based Reasoning Emerging Technologies for Health Data Analytics Research: A Conceptual Architecture Towards data integration automation for the French rare disease registry Personalised medicine possible with real-time integration of genomic and clinical data to inform clinical decision-making Bridging the gap from bench to bedside--an informatics infrastructure for integrating clinical, genomics and environmental data (ICGED) An Automatic Ehealth Platform for Cardiovascular and Cerebrovascular Disease Detection METEOR: an enterprise health informatics environment to support evidence-based medicine A Cloud-Based Radiological Portal for the Patients: IT Contributing to Position the Patient as the Central Axis of the 21st Century Healthcare Cycles Building a National Perinatal Data Base Without the Use of Unique Personal Identifiers Federated Service-Based Authentication Provisioning for Distributed Diagnostic Imaging Systems Toward distributed conduction of large-scale studies in radiation therapy and oncology: open-source system integration approach AEGLE: A Big Bio-Data Analytics Framework for Integrated Health-Care Services Federated queries of clinical data repositories: scaling to a national network Design and Implementation of a Privacy Aware Framework for Sharing Electronic Health Records Design and development of a medical big data processing system based on Hadoop Addressing the challenges of cross-jurisdictional data linkage between a national clinical quality registry and government-held health data Integrated data repository toolkit (IDRT). 
A suite of programs to facilitate health analytics on heterogeneous medical data Designing a clinical data warehouse architecture to support quality improvement initiatives Validating the extract, transform, load process used to populate a large clinical research database Construction of quality-assured infant feeding process of care data repositories: construction of the perinatal repository (part 2) Big Data Lakes Can Support Better Population Health for Rural India -Swastha Bharat Application of Big Data Analytics in Healthcare System to Predict COPD An 'integrated health neighbourhood' framework to optimise the use of EHR data Big Data in Healthcare: Prospects, Challenges and Resolutions An approach to acquiring, normalizing, and managing EHR data from a clinical data repository for studying pressure ulcer outcomes Clinical data models at university hospitals of Geneva MDi: acquisition, analysis and data visualization system in healthcare Holistic Perspective of Big Data in Healthcare A Systems Approach to Improving Patient Flow at UVA Cancer Center Using Real-time Locating System Roadmap to a comprehensive clinical data warehouse for precision medicine applications in oncology Development of National Health Data Warehouse Bangladesh: Privacy Issues and a Practical Solution Constructing the Cloud Computing System for Advanced Data Analysis of Biomedical Research Advanced Solutions for Medical Information Storing: Clinical Data Warehouse A standardized and data quality assessed maternal-child care integrated data repository for research and monitoring of best practices: a pilot project in Spain Cloud Based Big Data Platform for Image Analytics Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach Federated learning of predictive models from federated electronic health records Big Data Integration Case Study for Radiology Data Sources Exporting data from a clinical data warehouse Next generation 
phenotyping using narrative reports in a rare disease clinical data warehouse Terminology services: standard terminologies to control health vocabulary Visual progression analysis of event sequence data A Privacy Framework in Cloud Computing for Healthcare Data Greater Noida (UP) Data Warehouse Based Analysis with Integrated Blood Donation Management System Leveraging Distributed Data Over Big Data Analytics Platform for Healthcare Services Clinical data warehouse query and learning tool using a human-centered participatory design process The C6H6 NMR repository: an integral solution to control the flow of your data from the magnet to the public PCORnet's collaborative research groups MedCo: enabling secure and privacy-preserving exploration of distributed clinical and genomic data A clinical data warehouse based on OMOP and i2b2 for Austrian health claims data Automated population of an i2b2 clinical data warehouse using FHIR Combining information from a clinical data warehouse and a pharmaceutical database to generate a framework to detect comorbidities in electronic health records A Big Data Repository and Architecture for Managing Hearing Loss Related Data Factors associated with increased adoption of a research data warehouse Meaningful Integration of Data, Analytics and Services of Computer-Based Medical Systems: The MIDAS Touch On Distributed Collaboration for Biomedical Analyses Health Information Management System for a Rural Medical Clinic in Nicaragua Section Editors for the IMIA Yearbook Section on Clinical Research Informatics DiiS: a biomedical data access framework for aiding data driven research supporting fair principles. Data Big data sharing and analysis to advance research in post-traumatic epilepsy A tale of two databases: the DoD and VA infrastructure for clinical intelligence (DaVINCI). Stud Health Technol Inform Query translation between openEHR and i2b2 Cardiac Networks United Executive Committee and Advisory Board. 
Cardiac networks united: an integrated paediatric and congenital cardiovascular research and improvement network Incorporating a location-based socioeconomic index into a de-identified i2b2 clinical data warehouse Privacy in the Future of Integrated Health Care Services -Are Privacy Languages the Key? A generic method and implementation to evaluate and improve data quality in distributed research networks A framework for public health monitoring, analytics and research An Information Integration System to Continuing of Care Case study Nongsung Hospital Towards Heterogeneous Big Data Analysis for Personalized Medicine From discovery to practice and survivorship: building a national real-world data learning healthcare framework for military and veteran cancer patients Constructing a Comprehensive Clinical Database Integrating Patients' Data from Intensive Care Units and General Wards Approaching clinical data transformation from disparate healthcare IT systems through a modular framework Big data and biomedical informatics: preparing for the modernization of clinical neuropsychology Factors Affecting the Adoption of Cloud Computing in a South African Hospital A theoritical exploration of data management and integration in organisation sectors Medical Data Exploration Based on the Heterogeneous Data Sources Aggregation System A method for EHR phenotype management in an i2b2 data warehouse Identifying preanalytic and postanalytic laboratory quality gaps using a data warehouse and structured multidisciplinary process Remote surveillance technologies: realizing the aim of right patient, right data, right time The national MDS natural history study: design of an integrated data and sample biorepository to promote research studies in myelodysplastic syndromes Merging heterogeneous clinical data to enable knowledge discovery Developing an Integration Architecture to Manage Heterogeneous Data by Private Healthcare Practitioners: A Case of Namibia A Review on Big Data 
Practices in Healthcare Learning to share health care data: a brief timeline of influential common data models and distributed health data networks in US health care research. EGEMS (Wash DC) Research on hierarchical data fusion of intelligent medical monitoring Secure Pattern-Based Data Sensitivity Framework for Big Data in Healthcare Optimizing the electronic health records through big data analytics: a knowledge-based view MIMIC-III, a freely accessible critical care database. Sci Data Scalable Biobanking: a Modular Electronic Honest Broker and Biorepository for Integrated Clinical, Specimen and Genomic Research Too much of a good thing is wonderful: observational data for perioperative research Cavatica-a pediatric genomic cloud empowering data discovery through the pediatric brain tumor atlas PedcBioPortal: a Cancer Data Visualization Tool for Integrative Pediatric Cancer Analyses Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2) The ONCO-I2b2 project: integrating biobank information and clinical data to support translational research in oncology Recent developments in clinical terminologies -SNOMED CT, LOINC, and RxNorm High-quality, standard, controlled healthcare terminologies come of age Structure and Principles. WHO Collaborating Centre for Drug Statistics and Methodology. 2020 The human phenotype ontology: a tool for annotating and analyzing human hereditary disease Expansion of the gene ontology knowledgebase and resources Architecture of the open-source clinical research chart from informatics for integrating biology and the bedside TranSMART: an open source knowledge management and high content data analytics platform Observational health data sciences and informatics (OHDSI): opportunities for observational researchers Open MRS Collaborative Investigators. 
OpenMRS, a global medical records system collaborative: factors influencing successful implementation Successful implementation of a perioperative data warehouse using another hospital's published specification from epic's electronic health record system Development of a public health reporting data warehouse: lessons learned Perioperative Outcomes Group Children's Hospital of Philadelphia® Center for Data-Driven Discovery in Biomedicine UpSet: visualization of intersecting sets Bringing cohort studies to the bedside: framework for a 'green button' to support clinical decision-making A 'green button' for using aggregate patient data at the point of care It is time to learn from patients like mine Performing an informatics consult: methods and challenges
The project was supported, in part, by an Evidence to Innovation (E2i) Research Theme seed grant through the BC Children's Hospital Research Institute. The authors wish to thank Colleen Pawliuk for her help with developing and executing the literature search strategy, and Nicholas West and Zoltan Bozoky for editorial assistance.
Conflicts of interest: none declared.
0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on http://formative.jmir.org, as well as this copyright and license information must be included.