key: cord-0810893-4pq16tm5
authors: Jhaveri, Ravi; John, Jordan; Rosenman, Marc
title: Electronic Health Record Network Research in Infectious Diseases
date: 2021-10-08
journal: Clin Ther
DOI: 10.1016/j.clinthera.2021.09.002
sha: beb49d9b0abcb6fcb8846d030a47c514577561cc
doc_id: 810893
cord_uid: 4pq16tm5

With the marked increases in electronic health record (EHR) use for providing clinical care, there have been parallel efforts to leverage EHR data for research. EHR repositories offer the promise of vast amounts of clinical data not easily captured with traditional research methods and facilitate clinical epidemiology and comparative effectiveness research, including analyses to identify patients at higher risk for complications or who are better candidates for treatment. These types of studies have been relatively slow to penetrate the field of infectious diseases, but the need for rapid turnaround during the COVID-19 global pandemic has accelerated the uptake. This review discusses the rationale for EHR network projects, opportunities and challenges that such networks present, and some prior studies within the field of infectious diseases.

In the past decade, the field of medicine has been transformed in both positive and negative ways by widespread use of the electronic health record (EHR). With EHRs offering the promise of increased data access for clinicians and incentivized by passage of the Patient Protection and Affordable Care Act and the Centers for Medicare & Medicaid Services meaningful use program, approximately 90% of physicians and hospitals use EHR systems. 1 Although EHR systems were initially designed to improve clinical decision support, information sharing between practitioners, scheduling, and accounting, billing, and revenue (the major driver of the ongoing development and proliferation of EHRs), researchers have steadily realized that these systems can facilitate many aspects of clinical research that are not possible with traditional clinical trials. 
2 Given the historical emphasis within the field of infectious diseases on laboratory-based and traditional clinical research, its embrace of EHR-based research has been slow. However, the SARS-CoV-2 pandemic has created an environment that has accelerated the uptake of EHR-based studies. This review discusses the limitations of traditional clinical studies, highlights the capabilities, advantages, and challenges of EHR-based research, and briefly summarizes several studies in the field of infectious diseases. This review does not cover the use of EHR to implement best practices or quality improvement initiatives and limits the discussion to research uses of the EHR.

The current goal of most medical clinicians is to practice evidence-based medicine. However, not all evidence is equal. Several articles have discussed how the evidence pyramid should be adjusted, but randomized clinical trials (RCTs) are consistently considered the highest-quality evidence followed by large cohort studies, case-control studies, and then case series and reports at the lowest level. 3 To be successful, RCTs require relatively large patient sample sizes, frequently require multiple centers, and usually require significant financial resources (Figure 1A). 4 In infectious diseases, many conditions are relatively uncommon and are sporadic in nature, neither of which is conducive to conducting an RCT. As a result, the field is perpetually caught in the cycle of trying to conduct better case-control and case series studies, optimizing the lower part of the pyramid. Systematic reviews and meta-analyses are often used to try to cobble together many studies to solve the problem of small sample sizes, but these methods can only partially compensate for and cannot completely erase the fundamental methodologic flaws of these studies.

Researchers have tried to compensate for some of these limitations by creating large cohorts using regional or national administrative claims databases. 
Usually established by large public or private insurers, these databases allow researchers to identify patients with a common diagnostic condition (International Classification of Diseases, Tenth Revision [ICD-10] code) or procedure (Current Procedural Terminology or other code), establish frequencies across a large population, and collect all the health care events associated with the specified population. These types of studies can be helpful to define patterns of practice and highlight large variations in practice across hospital systems and regions. However, claims data usually have only dispensing records (and not orders) for medications, limited or no detailed laboratory results or radiographic data, and no real-time clinical information, such as vital signs or pain ratings. 5 , 6

Given the limitations of the evidence base, clinicians in the real world are left to practice the art of medicine. They encounter patients with relatively infrequent conditions, they factor in their own personal experience treating patients with similar conditions along with evidence from similar conditions, and they come up with a custom plan. The clinicians who have more extensive experience and who are better at synthesizing this kind of customized treatment approach are often acknowledged as master clinicians, but the best they can do is summarize their experience in a case series that then gets rated as low-quality evidence. As a result, the field of infectious diseases continues with the status quo, with practitioners claiming, "There are no good data on this topic!" What if the collective experience of these practitioners could be combined into a large real-world cohort database with patients divided into subgroups based on the different treatment they receive? Consider the example of the invasive fungal infection mucormycosis. 7 Because this infection is a rare event, RCTs are virtually impossible. 
However, the collective number of patients seen across most of the United States may be significant, with different institutions having slightly different practices. There is ongoing debate about whether different antifungal prophylaxis strategies have a significant impact on rates of invasive mucormycosis. One could develop a real-world study on the number of cases of mucormycosis to compare the effect of using voriconazole versus posaconazole as prophylaxis in high-risk oncology patients and patients undergoing hematopoietic stem cell transplantation. The clinical information for all patients in such a database would be standardized so data could be pooled easily and analyzed across institutions (Figure 1B). In an established, multicenter, multipurpose (inclusive of patients with any health condition or health care encounter) EHR-based database, the costs to collect such data for mucormycosis would be far less than what is required for running a multicenter clinical trial and would offer the advantage of results obtained in actual practice. These are the potential benefits that can be realized with EHR-based research networks.

Another advantage to EHR-based clinical networks is that one can take an unbiased approach to the data for discovery purposes. In the mucormycosis example, one could identify that although the number of fungal infections did not differ between the 2 antifungal agents being compared, 1 treatment led to increased laboratory abnormalities that required additional interventions. Using an administrative claims database, one needs to predefine which variables to evaluate and thus would not find an association unless specifically searching for it (a traditional biased approach).

Alongside the potential benefits, EHR-based observational studies themselves have noteworthy limitations and potential for bias. 
Without randomization, statistical techniques, such as propensity scores, must be applied in an attempt to handle baseline differences between groups. Such techniques have limitations, [8] [9] [10] and EHR-derived data are at risk for bias for other reasons as well. The challenges involve those inherent in any observational study plus complexities specific to EHRs, clinical workflows, and patterns of care. [11] [12] [13]

Figure 1. … network research. (A) Traditional multicenter research involves investigative teams independently coordinating with individual institutions to collect patient data that can ultimately be curated into patient cohorts for comparison of a condition or intervention. Data collected in this approach are limited to specific items the team is most interested in evaluating (biased analysis). (B) EHR network research involves investigative teams designing and validating a computable phenotype that will identify the patient population desired. This phenotype is then shared with the EHR network partner, either a distributed network or a centralized one, and is then used to collect data from all member institutions. Validation of the data collected is performed before release. Major advantages of this approach are the sheer volume of data collected and the ability to conduct unbiased analyses, which means that associations can be identified without any prior assumptions.

Two oft-cited (> 6000 citations taken together) articles published 21 years ago put forth the promise of observational studies, finding that they had results and effect sizes similar to those of RCTs of the same research questions, 14 , 15 but, as Banerjee and Prasad 16 and others 17 , 18 have summarized, considerable evidence has emerged in recent years to the contrary. In many cases, the answer to the question, "Can real-world data really replace RCTs?" is no. 
[19] [20] [21] Nevertheless, observational data have an important role to play in studies that develop the evidence base and in fostering learning health systems. 16 , 22 , 23 EHR-based research networks can help build the necessarily nuanced methods for applying and for understanding when and how to apply observational EHR data on their own or as complementary data in the conceptualization, implementation, and analyses of pragmatic RCTs. [24] [25] [26] [27] [28] A key prerequisite for the success of EHR research networks is to set up the system so that a common data language and format is used by all member institutions within the network; the institutions must also extract, transform, and load their local EHR repository data into the network database with good quality and completeness. Data relevant to the particular studies that use data from the network should also be easily shareable. Another key limitation of EHR-based research is the challenge of applying criteria or clinical severity scores. The Sequential Organ Failure Assessment (SOFA) score is an example of a measure that was developed primarily in adults that could be evaluated across many institutions. However, the SOFA score may not be applicable to many other patient subpopulations (infants, children, and pregnant women), so this limits EHR-based research in these groups to the clinical variables directly assessed during care. 29 Administrative claims databases have long been used as an effective tool for outcomes research in medicine. Many investigators have been reluctant to let go of using these databases and ask, "What do EHR data networks offer that claims databases don't?" DeShazo and Hoffman 30 Diseases. These resources note that hospitals in the Nationwide Inpatient Sample skewed smaller ( < 99 beds) and were less likely to have fully converted their systems to EHRs (as of 2010, when this comparison study was conducted). 
This observation highlights one of the limitations of EHR network research, which is that larger systems and institutions are generally the ones with the resources to adopt and maintain EHR systems and to participate in the new EHR data-sharing networks discussed below. This limitation creates a bias in representation that needs to be accounted for when conducting population-based studies. The National Inpatient Sample is weighted to help the studies that analyze its data be nationally representative. Butame et al 31 analyzed the barriers and facilitators to EHR data collection in HIV trials in adolescents. They performed qualitative interviews with national experts and key stakeholders in the field of adolescent HIV. They found that respondents used 17 different EHR systems (among 29 participants), and several used > 1 at their site to achieve different tasks (billing vs patient management vs reporting requirements from federal funders, such as the US Health Resources and Services Administration). Other barriers were identified: specific variable collection, general data collection, and HIV-specific variable data collection. The authors summarize their findings as describing a "fractured EHR landscape," a phrase that summarizes the situation for most practitioners and researchers well beyond the field of infectious diseases. Many of the barriers highlighted in this study are addressed when using the PCORnet Common Data Model (CDM), but some of the barriers highlighted still exist. To avoid some of the barriers mentioned, an EHR network would need to have developed a system that facilitates data sharing, formatting, and harmonization across institutions, with a robust process for improving and maintaining the quality and completeness of the data. Fortunately for clinical investigators based in the United States, such an EHR-based data collection system already exists: PCORnet, the National Patient-Centered Clinical Research Network. 
Guided by the Patient-Centered Outcomes Research Institute, Dr Francis Collins, and various stakeholders, PCORnet was created in 2014 and has been expanding and evolving in the ensuing years. The impetus for PCORnet was, in part, what is discussed above: to address some of the key challenges in building large, multi-institution study cohorts. Local EHR data repositories are idiosyncratic and not easy to combine across institutions. The overarching rationale for PCORnet was to be able to strengthen the clinical evidence base by making large pragmatic trials much more feasible. 32

PCORnet is a distributed network in which each of the institutions extracts, transforms, and loads (ETLs) data from its local EHR repository into a common data model data mart, which the institution also maintains locally. The PCORnet CDM data mart specifications are issued by the PCORnet coordinating center, which is run (2014 to the present) jointly by Duke University and Harvard Pilgrim. The specifications address not only the data mart format (tables and fields) but also the content by requiring data element values and units of measure to conform to standards (RxNorm, LOINC [Logical Observation Identifiers Names and Codes], and others), by providing guidance on mapping local values to the common terminologies, and through quarterly refreshes that are accompanied by data quality and completeness checks that have increased in stringency. 33 The common data model format, along with query-and-data-transmission software that also has been adopted by every participating institution, allows the coordinating center to issue a query that can be run at all participating institutions. The query outputs (aggregate results or individual-level data sets) then are collated by the coordinating center and/or by a project's investigator team. The harmonization of formats, content, and processes creates substantial efficiencies. 
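The distributed query pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PCORnet's actual query software or table layout: the row fields (`patient_id`, `dx_code`), the site names, and both function names are hypothetical, and the ICD-10 codes are used only as sample values. The point it shows is that the same query runs against each site's locally held data mart and only aggregate counts travel back to be collated.

```python
from typing import Callable, Dict, List

# Hypothetical, simplified stand-in for a site's local data mart:
# one row per diagnosis record. Field names are illustrative only.
Row = Dict[str, str]


def count_patients_with_code(rows: List[Row], code_prefix: str) -> int:
    """Count distinct patients at one site whose diagnosis code starts with code_prefix."""
    return len({r["patient_id"] for r in rows if r["dx_code"].startswith(code_prefix)})


def run_distributed_query(
    sites: Dict[str, List[Row]],
    query: Callable[[List[Row]], int],
) -> Dict[str, int]:
    """Run the same query at every site; only the aggregate count leaves each site."""
    return {name: query(rows) for name, rows in sites.items()}


# Toy data for two hypothetical sites (B46.x is the ICD-10 zygomycosis family).
sites = {
    "site_a": [
        {"patient_id": "a1", "dx_code": "B46.0"},
        {"patient_id": "a1", "dx_code": "B46.0"},  # repeat encounter, same patient
        {"patient_id": "a2", "dx_code": "J18.9"},
    ],
    "site_b": [
        {"patient_id": "b1", "dx_code": "B46.5"},
        {"patient_id": "b2", "dx_code": "B46.1"},
    ],
}

# The "coordinating center" issues one query and collates the per-site aggregates.
per_site = run_distributed_query(sites, lambda rows: count_patients_with_code(rows, "B46"))
total = sum(per_site.values())
```

Because every site stores its rows in the same format, the query function needs no site-specific logic; that is the efficiency the common data model is meant to provide.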
In addition, each data mart includes all of an institution's patients who had a health care record from 2011 forward; therefore, studies of clinical research questions in any specialty or subspecialty can be pursued. An early example from PCORnet in pediatrics is a study of whether there is an association between early antibiotic exposure and weight gain. 34 A recent example is the Centers for Disease Control and Prevention (CDC)-sponsored COVID-19 project that began in May 2020; this project provides the CDC with regular reports on data related to the pandemic. 35 Forty-three PCORnet institutions are participating. Every specific project affords an opportunity to improve the PCORnet data marts with regard to that project's particular data elements or required collaborative processes; the CDC COVID-19 project prompted each site to improve its ETL procedures for SARS-CoV-2 testing and vaccination data and to establish more frequent (approximately biweekly) data refreshes.

As of the middle of 2021, PCORnet contained 9 clinical research networks (with EHR-derived data for a total of approximately 80 million patients) and 2 health care payer data marts (with claims-derived data for a total of approximately 60 million patients). 36 The 9 clinical research networks include a total of 70 EHR data-contributing institutions, within which there are 337 hospitals, approximately 170,000 physicians, and > 3500 primary care practices. 36 Each of the 9 clinical research networks has a hub site and may perform some studies on its own (without the PCORnet coordinating center). Many of the clinical research networks are in a particular geographic area (eg, the Greater Plains Collaborative includes institutions from the Midwest to Texas); by contrast, the PEDSnet clinical research network is composed of an increasing number of children's hospitals from across the country. 
In PCORnet (and in any other EHR data network), a common data model and harmonization are helpful, up to a point. With inclusion criteria that include all patients since 2011 and with the large number of data tables and fields, not everything can be audited (although the quarterly data checks have continued to expand), not all scenarios can have been envisioned, and not all of the underlying idiosyncrasies can have been resolved. 33 , 37 Even if application of the PCORnet data mart specifications were perfect at all institutions, there would still be some variation because of the data provenance upstream of the PCORnet data marts (eg, the electronic systems and data flows and repositories that a hospital has in place for its surgery data or for its physician billing data may affect how much data and what types make it into the PCORnet data mart). In addition, many PCORnet data-contributing institutions are complex institutions themselves. Many are composed of > 1 hospital. Most have multiple affiliated outpatient clinic locations. Understandably, there tends to be intrainstitutional variation, upstream of one's PCORnet data mart, that affects how various data types do or do not flow in from the various reaches of the institution. These situations are related to the fundamental challenges that PCORnet and other similarly motivated multi-institutional networks, such as Sentinel, arose to address. 38 Institutions also have seen the need to craft (1) research data repositories that are more expansive than a PCORnet data mart and (2) research collaborations that cut across institutions and data models to foster learning health systems science in various specialty and subspecialty domains. Duke University built the Duke Clinical Research Datamart by incorporating the PCORnet data model plus other data types and data tables. 
39 PEDSnet is a PCORnet clinical research network and maintains a PCORnet data mart to participate in PCORnet projects but also has designed and implemented a PEDSnet Common Data Model. 23 The PEDSnet Common Data Model is based largely on the robust Observational Medical Outcomes Partnership (OMOP) model. 40 , 41 Using an OMOP-like model allows PEDSnet to expand its pediatric research capabilities and its data quality monitoring infrastructure. [42] [43] [44] PEDSnet is an exemplar for networks that seek to be (and to build) learning health systems 22 , 23 ; it has facilitated collaboration among subspecialists, generalists, and informaticians in the fields of gastroenterology, nephrology, and several others. 42 , 45

Other networks also have been building unprecedented opportunities for observational research. The OMOP data model was initiated in 2007 by a public-private partnership among the US Food and Drug Administration, the Foundation for the National Institutes of Health, and Pharmaceutical Research and Manufacturers of America. 46 Seven years later, a new organization, Observational Health Data Sciences and Informatics (OHDSI), was formed to continue to develop this model and to encourage its use in research. OHDSI is a federated network with a central office at Columbia University; each OHDSI institution uses ETL procedures to create its own OMOP data mart. OHDSI and OMOP also offer analytical tools to help sites understand the quality and completeness of their data and to facilitate research. OHDSI now includes sites in approximately 30 countries and 600 million patients. 47 Among many computable phenotypes that OHDSI has developed, those related to infectious diseases include HIV infection, human papillomavirus infection, influenza, methicillin-resistant Staphylococcus aureus, otitis media, pneumonia, respiratory syncytial virus infection, and tuberculosis. 
47 OHDSI facilitates the efficient analysis of large cohorts, as seen in an international study of 34,128 patients hospitalized with COVID-19. 48

Big EHR vendors also have been developing capabilities for multi-institution observational research. The Cerner Corporation (North Kansas City, Missouri) has Cerner Real World Data (formerly Cerner Health Facts), a centralized database that uses data from a subset of the institutions that use a Cerner EHR system 49 Clinical and Translational Science Award program formed the National COVID Cohort Collaborative (N3C); N3C has addressed many infrastructure challenges in data governance, extraction, quality, harmonization, and analytics to create a centralized database for COVID-19 studies. 69 , 70 The N3C now includes > 7 billion rows of EHR data and > 2.1 million patients with COVID-19 (and > 6.3 million patients overall) and has more than 200 projects under way (Supplemental Table 1). 70 Supplemental Table 1 offers a summary of the EHR networks discussed.

A custom, cross-cutting network in infectious disease would build on this track record. It would facilitate collaboration by experts across the large number of institutions from which data would be needed to address rare infections and related conditions. It would help PEDSnet, PCORnet, and other networks enhance the ways that microbiology and virology laboratory results (including antibiotic susceptibility results) and antimicrobial medication data records are stored and made accessible for research projects. It also would support the development of learning health systems.

There are several examples of EHR-network research outside infectious diseases that illustrate the potential advantages of using EHR-based systems. Hornik et al 71 within the Pediatric Trials Network developed their own EHR-based data repository with 9 of the highest-enrolling centers within their existing research network. 
They created a data model with 147 mandatory elements and 99 optional ones for all children discharged between January 2013 and June 2017. The authors used the PCORnet CDM as a starting point for their project because it already established many of the elements their team needed to evaluate drug exposure information in children. Their final data repository consisted of institutions using both Epic- and Cerner-based systems. They designed an implementation guide to help each site contribute its data accurately and ultimately collected data from > 380,000 encounters from > 260,000 children. This project is a good example of the steps required in creating a custom EHR network using a predefined group of investigators and/or institutions.

PEDSnet is one of the PCORnet member networks that has helped advance EHR-based research for a number of pediatric conditions outside infectious diseases. Khare et al 42 used PEDSnet to create a computable phenotype for children with Crohn disease. They created several versions of this phenotype, ranging from less stringent (1 diagnosis code or medication associated with Crohn disease) to more stringent (≥3 encounters and/or medications associated with Crohn disease). Using less stringent criteria, they were able to identify approximately 12,000 children with Crohn disease, and using more stringent criteria, they identified approximately 8000. They validated their data using a national registry as well as manual record review at participating institutions. They also found few false-negative results, which means that they were effectively capturing patients with the desired diagnoses. This study is a great example of the importance of the computable phenotype and how the rigor with which one defines the phenotype dictates the nature of the cohort one is left to study.

Within the field of infectious diseases, studies performed in the pre-COVID-19 era using EHR networks offer a glimpse of the potential benefits of EHR-based research. 
One study from the United Kingdom by Esan et al 72 evaluated the total burden of Campylobacter and nontyphoidal Salmonella infections within the National Health Service (NHS) by evaluating data from the Clinical Practice Research Datalink (CPRD). 72 The CPRD is an EHR data warehouse that consists of all outpatient primary care encounters within the NHS. CPRD data contain demographic information and specific encounter-level data associated with International Classification of Diseases, Ninth Revision (ICD-9) and ICD-10 diagnosis codes and are linked to inpatient hospital encounter data, mortality records, and socioeconomic status. They identified > 20,000 patients (approximately 18,000 with Campylobacter and > 2000 with nontyphoidal Salmonella) within the database. They described the rate of secondary complications, such as reactive arthritis, and found the peak onset of this complication within the first month after infection. Using the EHR data, they were also able to evaluate risk factors. They identified proton pump inhibitors as being associated with an elevated risk of Campylobacter infections (adjusted odds ratio = 2.1; 95% CI, 1.5-2.9). They were also able to define the annual costs to NHS of > £1.5 million and found increased health care use after infections and their associated complications. This study is a good example of using an EHR network to evaluate the impact of infections across a large population and bring attention to the public health burden that may not otherwise be appreciated at any single center or practice.

Vihta et al 73 evaluated trends in Escherichia coli infections during an 18-year period in the region of Oxfordshire, United Kingdom. 73 They used the Infections in Oxfordshire Research Database, which has records of all admissions to the Oxford University Hospital NHS trust (a collection of 4 hospitals in the region) since 1997. 
This database includes inpatient data, including microbiology, biochemistry, and hematology results, along with demographic information and out-of-hospital mortality data. The authors evaluated all episodes of bloodstream infections and urinary tract infections (UTIs), evaluated episodes of relapse or recurrent infections, catalogued antibiotic prescriptions for these infections, and defined the outcomes of these infections. Their results indicated > 5000 patients with E coli bloodstream infection and a relapse or recurrence rate of 9%. The authors found that almost half of these cases were nosocomial or in patients with chronic medical conditions who were recently admitted for treatment. They found year-over-year increases in both nosocomial and community cases of E coli bacteremia. Their analysis identified > 137,000 patients with E coli UTI and a recurrence rate of 40% and found that 70% of all UTIs were de novo from the community. They found significant increases in community UTIs during the study period but a significant decrease in nosocomial UTIs during the study period. They identified that E coli with resistance to amoxicillin-clavulanate increased in bloodstream and UTI isolates during the study, reaching > 40% in 2016. They were able to define increased antibiotic use as a major risk factor for subsequent infections with antibiotic-resistant isolates. This study is an example of evaluating a more selected population in detail for outcomes of infection as well as resistance trends over time.

With the onset of the COVID-19 pandemic, researchers and clinicians needed to collect data from large numbers of patients across domestic and international networks to rapidly predict which patients would develop severe disease and to determine which treatments were most effective at reducing mortality. Klann et al 74 organized many institutions across the United States and Europe to attempt to answer many of these questions. 
They used data from the early reports of COVID-19 cases to build a computable phenotype for COVID-19 cases and for severe cases. Across 12 participating sites, they were able to evaluate > 10,000 patients with COVID-19 infection as well as > 3000 with severe disease. Importantly, they also performed validation via medical record review at each participating site, as well as cross-referencing EHR case data with locally acquired data on COVID-19 cases. They documented variation in rates of severe COVID-19 across sites and evaluated the different measures of disease severity (eg, acute respiratory disease, need for intubation, and vasopressor use). They constructed a Venn diagram that evaluated those patients who had markers of severe infection, those who had laboratory values indicating severe disease, and those who had the presence of a diagnosis or procedure code consistent with severe disease. There was not complete overlap among these 3 categories. Lastly, they simultaneously attempted to build a machine learning algorithm to help identify those patients with severe COVID-19 infection, but the first iteration did not perform as well as the expert-derived computable phenotype at identifying patients with severe COVID-19. Klann also played a key role in the N3C initiative detailed above. 69

During the early phase of the epidemic, most of the focus was on adults given the severity and mortality observed in many settings. Data on children with COVID-19 were much harder to come by. PEDSnet was able to generate a national picture of what the COVID-19 pandemic looked like from the pediatric perspective. Bailey et al 44 reported the outcomes for > 135,000 children tested for SARS-CoV-2 in the United States. 
They found that > 4000 children had confirmed COVID-19 infection (4%) and confirmed many of the racial and ethnic disparities that were first reported in adults: a higher proportion of Black and Latine persons with positive SARS-CoV-2 test results and a higher proportion of children with underlying disorders who tested positive. They also confirmed a very low overall hospitalization rate (359 hospitalizations [7% of total cases identified]), a low rate of severe infection and mechanical ventilatory support (99 and 33 cases, respectively), and an extremely low death rate (8 total [0.2%]). Interestingly, in their unbiased data analysis, they identified a 40% decrease in cases of Kawasaki disease across all sites. Kawasaki disease has long been suspected to have an infectious origin; a similar observation had been reported in a single-center study and was hypothesized to be attributable to masking and social distancing having reduced transmission of an infectious agent.

Given all the purported benefits reviewed above, one obvious question that could be asked is, "Why has there not been more progress and/or more studies in infectious disease using EHR-based methods?" There are likely several reasons for this, some of which were mentioned earlier. Infectious disease has traditionally been a bench research specialty, so there is not a long tradition of prior EHR research to draw on. Infectious disease is also an older specialty, so there may be significant discomfort with embracing EHR-based research when everyday EHR use can be a challenge. It is also likely that many infectious disease-related projects are performed by people doing more informatics-related research within hospital-based medicine or generalist, health services research, or epidemiology divisions. Informatics expertise is still a limited commodity, so there are likely many institutions that lack a core group of personnel or sufficient bandwidth among their personnel to support the desired studies. 
These are all speculations, and the truth likely encompasses elements of all the above. In addition, because the clinical epidemiology questions that arise in infectious disease are diverse and often complex and nuanced, limitations remain in the EHR research networks (as of 2021) that are particularly relevant to infectious disease. Some types of data (eg, ICD-10 codes) can simply be retrieved from these networks. By contrast, microbiology culture and antibiotic susceptibility results are among the most difficult EHR elements for institutions to store in a way that makes the data amenable to use in subsequent multi-institutional cohort studies and in networks. 76,77 EHR results data for sexually transmitted infection tests, especially syphilis (and any test involving titers), are also complex to examine in an EHR network. Challenges in neonatal and pediatric intensive care unit research arise when one wants to analyze the details of mechanical ventilatory support or of medication dose changes and their timing. There may be ample (and unwieldy) data for these elements in an institution's own EHR, but the ETL processes that build the data marts in the EHR networks do not bring in all these details. The ETL processes create simplifications and structure but cannot at the same time be all-encompassing.

For those interested in beginning to conduct EHR-based research, there are several key steps that are critical in determining the success of any project. The previous example of mucormycosis serves as a good framework for considering what is required in each step.

The first step is to define the question. In general, a precise focus a priori will improve the preliminary feasibility assessments (step 2), the clarity of the data collected, and the statistical analyses.
For mucormycosis, the question would be something such as, "For institutions that use voriconazole versus posaconazole for antifungal prophylaxis for high-risk oncology and stem cell transplants, how do the ultimate rates of invasive mucormycosis compare between the 2 agents?"

The second step is to define the size of the network that would be needed and explore whether the data elements available are suitable. The size of the network will be dictated in part by the prevalence of the condition to be studied and in part by what networks and sites may be available to use. For a common condition such as E coli UTI, the Oxfordshire regional EHR network mentioned above identified > 100,000 patients who could be included in a cohort study. For a rare condition such as mucormycosis, one would require a large national or international network capable of contributing several years of data. A PCORnet clinical research network may be a place to start, with the thought of expanding to other networks after generating preliminary data and additional resources. Preparatory-to-research queries in PCORnet or other networks may find robust voriconazole and posaconazole data but also may find that fungal culture data require additional work on ETL procedures before an observational study would be feasible.

The third step is to create the computable phenotype. A precise set of criteria should be created for the conditions of interest (denominator and numerator populations, exposure[s], covariate[s], and outcome[s]) that maximizes the total number of patients identified and minimizes the number of patients who have related (or unrelated) conditions who should not be included in the final analysis. For mucormycosis, creation of an overlapping (Venn diagram) set of criteria would help to enrich the patient population.
Patients could be identified by searching for the appropriate ICD-9/ICD-10 codes for various invasive mucormycosis diagnoses and independently identifying all fungal culture results that yield a species within the Mucor group of fungi. At the same time, one would limit the patient population to those with a primary diagnosis of an oncologic condition and/or a procedure code that corresponds to hematopoietic stem cell or cord blood transplant, while excluding other nononcologic populations who also are susceptible to mucormycosis infections (patients with diabetes, burns, or SARS-CoV-2 and corticosteroid therapy).

The fourth step is to perform initial local validation. The easiest way to evaluate the computable phenotype initially is to evaluate how it performs within one's own EHR. Do the criteria designed above capture the known patients recently treated for mucormycosis? If yes, one could proceed to broader data collection. If not, the phenotype needs to be refined and tested again with local validation. However, as noted in step 2, even if it works well locally, a computable phenotype may not work as well across the other data-contributing sites in an EHR network; such networks are standardized and harmonized only up to a point. Most projects conduct data quality and completeness assessments at the data-contributing sites before or during the execution of the main queries.

The fifth step is to determine the scope of data to be collected. EHR networks are good at gathering structured data. These data may include vital signs, laboratory test results, medication dispensing and dosing information, primary and secondary diagnoses, procedures performed, intensive care admissions, and deaths. EHR networks, for the most part, are not yet collecting unstructured data: practitioner progress notes, radiology reports, pathology reports, and any other test results with extensive free text or narrative discussion.
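Steps 3 and 4 lend themselves to a short illustration. The following is a minimal sketch, not a validated phenotype: the `Patient` record, the `B46` ICD-10 prefix, the list of genera, and the five example patients are all assumptions made for illustration, and a real study would rely on curated, clinically reviewed value sets and the network's own data model. It combines the diagnosis-code and culture criteria within the oncology/transplant population and then checks the result against a locally known case list.

```python
from dataclasses import dataclass, field

# Illustrative value sets only (assumptions, not a curated phenotype).
MUCOR_DX_PREFIX = "B46"                                  # assumed ICD-10 category
MUCORALES_GENERA = ("mucor", "rhizopus", "lichtheimia")  # assumed genus list


@dataclass
class Patient:
    """Hypothetical, simplified patient record."""
    patient_id: str
    dx_codes: set = field(default_factory=set)
    fungal_culture_organisms: list = field(default_factory=list)
    has_oncology_dx: bool = False
    has_transplant_procedure: bool = False


def in_study_population(p: Patient) -> bool:
    # Restrict to oncology and/or stem cell transplant patients.
    return p.has_oncology_dx or p.has_transplant_procedure


def has_mucor_dx(p: Patient) -> bool:
    return any(code.startswith(MUCOR_DX_PREFIX) for code in p.dx_codes)


def has_mucor_culture(p: Patient) -> bool:
    return any(
        genus in organism.lower()
        for organism in p.fungal_culture_organisms
        for genus in MUCORALES_GENERA
    )


def phenotype_case(p: Patient) -> bool:
    # Overlapping criteria: a diagnosis code OR a culture result,
    # counted only inside the study population.
    return in_study_population(p) and (has_mucor_dx(p) or has_mucor_culture(p))


def local_validation(patients, known_case_ids):
    """Compare phenotype output with cases already known locally."""
    flagged = {p.patient_id for p in patients if phenotype_case(p)}
    true_pos = len(flagged & known_case_ids)
    sensitivity = true_pos / len(known_case_ids) if known_case_ids else None
    ppv = true_pos / len(flagged) if flagged else None
    return {"flagged": flagged, "sensitivity": sensitivity, "ppv": ppv}


# Made-up example data.
patients = [
    Patient("P1", dx_codes={"B46.1"}, has_oncology_dx=True),
    Patient("P2", fungal_culture_organisms=["Rhizopus arrhizus"],
            has_transplant_procedure=True),
    Patient("P3", has_oncology_dx=True),
    Patient("P4", dx_codes={"B46.5"}),  # non-oncology patient, excluded
    Patient("P5", fungal_culture_organisms=["Aspergillus fumigatus"],
            has_oncology_dx=True),
    Patient("P6", has_oncology_dx=True),  # known case with no code or culture
]
known_case_ids = {"P1", "P2", "P6"}  # e.g., from a local clinical case list

result = local_validation(patients, known_case_ids)
print(result)
```

In this made-up example the phenotype misses one known case that has neither a code nor a culture result (for instance, a diagnosis documented only in a pathology report), which is the kind of finding that would send the criteria back for refinement before broader data collection.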
Efforts are constantly under way to improve the ability to capture and analyze unstructured data, but at the moment it is not feasible on a large scale. For the mucormycosis study, the structured data mentioned should allow for analysis of the rates of mucormycoses identified, the number of patients given voriconazole and posaconazole as daily prophylaxis before the diagnosis was made, and laboratory and outcomes data to determine if there are adverse events associated with the 2 different prophylaxis strategies that could negate any benefit in preventing a few cases of mucormycosis.

The sixth step is to share the search strategy and data requirements with partners. By sharing the refined computable phenotype and data requirements with network partners or a centralized data coordinating center, one can gather the desired raw data at this step. For the mucormycosis project, this step would identify the numerator of all mucormycosis cases in oncology patients and patients undergoing stem cell transplantation within the national network and the denominator of all oncology patients and patients undergoing stem cell transplantation who received voriconazole and posaconazole prophylaxis during their treatment. One would also obtain the structured data elements mentioned previously (to the extent available) across all patients. Microbiology culture results, because of their complex hierarchical structure, are the most challenging type of laboratory result to store in a data mart in an analyzable way; fungal culture data in this study may require more work.

The seventh step is to perform additional validation via medical record review at each site. Assuring data integrity at each partner site is a critical step in maintaining a robust network. The best studies ensure that each site performs an audit of a sample of all cases identified to make sure that the patients have the study condition, received the particular drug, and had the electronically ascribed outcome.
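The numerator and denominator of step 6 and the site audit of step 7 each reduce to a small calculation. The sketch below is illustrative only: the function names, counts, case identifiers, and seed are hypothetical, and the risk ratio is a crude, unadjusted comparison, whereas an actual observational study would need to address confounding.

```python
import random


def incidence_per_1000(cases: int, exposed: int) -> float:
    """Cases per 1000 patients who received a given prophylaxis agent."""
    if exposed <= 0:
        raise ValueError("denominator must be positive")
    return 1000 * cases / exposed


def risk_ratio(cases_a: int, exposed_a: int, cases_b: int, exposed_b: int) -> float:
    """Crude (unadjusted) risk ratio of agent A relative to agent B."""
    return (cases_a / exposed_a) / (cases_b / exposed_b)


def draw_audit_sample(case_ids, n: int, seed: int):
    """Reproducible random sample of flagged cases for chart review."""
    ordered = sorted(case_ids)        # fixed order so the seed is meaningful
    rng = random.Random(seed)
    return rng.sample(ordered, min(n, len(ordered)))


def positive_predictive_value(review: dict) -> float:
    """Share of audited cases that chart review confirmed."""
    if not review:
        raise ValueError("no audited cases")
    return sum(review.values()) / len(review)


# Hypothetical network-level counts (not real data).
vori_rate = incidence_per_1000(cases=12, exposed=4000)
posa_rate = incidence_per_1000(cases=6, exposed=3000)
crude_rr = risk_ratio(12, 4000, 6, 3000)

# Hypothetical audit at one site: 40 flagged cases, 10 reviewed.
flagged_ids = [f"S1-{i:03d}" for i in range(40)]
audit_ids = draw_audit_sample(flagged_ids, n=10, seed=2021)
review = {case_id: True for case_id in audit_ids}
review[audit_ids[0]] = False          # suppose one case was not confirmed

print(vori_rate, posa_rate, crude_rr, positive_predictive_value(review))
```

Fixing the seed and sorting the identifiers before sampling lets a coordinating center reproduce exactly which records a site was asked to review.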
For the mucormycosis study, each site would confirm the diagnosis of fungal infection, confirm the underlying diagnosis of oncology or stem cell transplantation, and confirm the final outcomes (survived and discharged or died in the hospital or subsequently).

The eighth step is data and statistical analysis. Acquisition of the data described would be useless unless an investigator has the ability to analyze the data properly. Having collaborators with experience analyzing large EHR datasets is crucial to the success of any EHR network study. The experience entails not only an analysis of the expected effects but also an unbiased search for differences that were not expected but that could generate new hypotheses for future research. In the mucormycosis study, an unbiased analysis might reveal that widespread use of posaconazole prophylaxis led to small but significant increases in Aspergillus ustus infections, a multidrug-resistant species that can be seen with extensive antifungal exposure in highly susceptible patients.

Although the steps outlined above do not capture every aspect of EHR-based research, they highlight key aspects of a successful project.

With the widespread adoption of EHRs across the United States and many parts of the world, there is a tremendous opportunity to interrogate many aspects of clinical care. It should no longer be acceptable to claim that data are not available on any topic. EHR networks may allow researchers to evaluate the clinical experience of a large pool of treating physicians, to describe trends across a regional or national population, and to evaluate seemingly rare conditions that are otherwise difficult to study. As the field of infectious diseases moves to more widely embrace EHR-based studies, data on outcomes and comparative effectiveness will be more readily available.

None.
Does Hospital EHR Adoption Actually Improve Data Sharing
Electronic health records: then, now, and in the future
New evidence pyramid
Using electronic health records for clinical trials: where do we stand and where can we go?
Prenatal syphilis screening rates measured using medicaid claims and electronic medical records
Prediction accuracy with electronic medical records versus administrative claims
Global guideline for the diagnosis and management of mucormycosis: an initiative of the European Confederation of Medical Mycology in cooperation with the Mycoses Study Group Education and Research Consortium
The role of the contextual cohort to resolve some challenges and limitations of comparisons in Pharmacoepidemiology
Propensity scores: uses and limitations
The pros and cons of propensity scores
Caveats for the use of operational electronic health record data in comparative effectiveness research
Biases in electronic health record data due to processes within the healthcare system: retrospective observational study
Research and reporting considerations for observational studies using electronic health record data
A comparison of observational studies and randomized, controlled trials
Randomized, controlled trials, observational studies, and the hierarchy of research designs
Are observational, real-world studies suitable to make cancer treatment recommendations?
Evaluation of the use of cancer registry data for comparative effectiveness research
Comparison of population-based observational studies with randomized trials in Oncology
Can real-world data really replace randomised clinical trials
Randomization versus real-world evidence. reply
The magic of randomization versus the myth of real-world evidence
Learning health systems
PEDSnet: a national pediatric learning health system
Use of real-world evidence to evaluate the effectiveness of Herpes Zoster vaccine
When can we trust real-world data to evaluate new medical treatments?
When are treatment blinding and treatment standardization necessary in real-world clinical trials?
Real-world evidence: the devil is in the detail
Comparative effectiveness of Aspirin dosing in Cardiovascular Disease
Adaptation and validation of a pediatric sequential organ failure assessment score and evaluation of the Sepsis-3 definitions in critically ill children
A comparison of a multistate inpatient EHR database to the HCUP Nationwide Inpatient Sample
Barriers and facilitators to the collection and aggregation of electronic health record HIV data: an analysis of study recruitment venues within the adolescent medicine trials network for HIV/AIDS interventions (ATN)
PCORnet: turning a dream into reality
Evaluating foundational data quality in the National Patient-Centered Clinical Research Network
Early antibiotic exposure and weight outcomes in young children
Public Health Informatics Institute. PCORnet CDC COVID-19 Healthcare Data Initiative
PCORnet(R) 2020: current state, accomplishments, and future directions
Comparing prescribing and dispensing data of the PCORnet common data model within PCORnet antibiotics and childhood growth study
Four health data networks illustrate the potential for a shared national multipurpose big-data network
Development of an electronic health records datamart to support clinical and population health research
Validation of a common data model for active safety surveillance research
Drawing reproducible conclusions from observational clinical data with OHDSI
Development and evaluation of an EHR-based computable phenotype for identification of pediatric Crohn's disease patients in a National Pediatric Learning Health System
A longitudinal analysis of data quality in a large pediatric data research network
Assessment of 135794 pediatric patients tested for severe acute Respiratory Syndrome Coronavirus 2 across the United States
Using electronic health record data to rapidly identify children with Glomerular Disease for clinical research
Learning to share health care data: a brief timeline of influential common data models and distributed health data networks in U.S. health care research
Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study
HealtheDataLab - a cloud computing solution for data science and advanced analytics in healthcare with application to predicting multicenter pediatric readmissions
Medication use among patients with COVID-19 in a large, national dataset: cerner real-world data
Inappropriate empirical antibiotic therapy for bloodstream infections based on discordant in-vitro susceptibilities: a retrospective cohort analysis of prevalence, predictors, and mortality risk in US hospitals
Incidence and trends of Sepsis in US hospitals using clinical vs claims data
Electronic health record data for antimicrobial prescribing
Epic Systems Corporation. Epic Health Research Network
With a nod to disco era, Epic Systems Corp. looks to Cosmos, voice-activated software
Extracting and utilizing electronic health data from Epic for research
COVID-19 Racial Disparities in Testing, Infection, Hospitalization, and Death: Analysis of Epic Patient Data
The Indiana network for patient care: a working local health information infrastructure. An example of a working infrastructure collaboration that links data from five health systems and hundreds of millions of entries
Enabling international adoption of LOINC through translation
Epidemiology of Sexually Transmitted Infections Among Offenders Following Arrest or Incarceration
A solutions-based approach to building data-sharing partnerships
Comparative effectiveness research using the electronic medical record: an emerging area of investigation in pediatric primary care
Shone L. 30th anniversary of pediatric research in office settings (PROS): an invitation to become engaged
Automated identification of implausible values in growth data from pediatric electronic health records
Electronic health record (EHR) based postmarketing surveillance of adverse events associated with pediatric off-label medication use: a case study of short-acting beta-2 agonists and arrhythmias
Variation in antibiotic prescribing across a pediatric primary care network
The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment
US Department of Health and Human Services. National COVID Cohort Collaborative (N3C)
Creation of a multicenter pediatric inpatient data repository derived from electronic health records
Incidence, risk factors, and health service burden of sequelae of campylobacter and non-typhoidal salmonella infections in England, 2000-2015: A retrospective cohort study using linked electronic health records
Trends over time in Escherichia coli bloodstream infections, urinary tract infections, and antibiotic susceptibilities in
Validation of an internationally derived patient severity phenotype to support COVID-19 Analytics from electronic health record data
The impact of social distancing for COVID-19 upon diagnosis of Kawasaki Disease
Nascent regional system for alerting infection preventionists about patients with multidrug-resistant gram-negative bacteria: implementation and initial results
Transformation of microbiology data into a standardised data representation using