key: cord-1032151-tni610fo authors: Hogan, William R; Shenkman, Elizabeth A; Robinson, Temple; Carasquillo, Olveen; Robinson, Patricia S; Essner, Rebecca Z; Bian, Jiang; Lipori, Gigi; Harle, Christopher; Magoc, Tanja; Manini, Lizabeth; Mendoza, Tona; White, Sonya; Loiacono, Alex; Hall, Jackie; Nelson, Dave title: The OneFlorida Data Trust: a centralized, translational research data infrastructure of statewide scope date: 2021-10-19 journal: J Am Med Inform Assoc DOI: 10.1093/jamia/ocab221 sha: b4480efceac4de0f11c7801163468ad76a83e6c4 doc_id: 1032151 cord_uid: tni610fo The OneFlorida Data Trust is a centralized research patient data repository created and managed by the OneFlorida Clinical Research Consortium (“OneFlorida”). It comprises structured electronic health record (EHR), administrative claims, tumor registry, death, and other data on 17.2 million individuals who received healthcare in Florida between January 2012 and the present. Ten healthcare systems in Miami, Orlando, Tampa, Jacksonville, Tallahassee, Gainesville, and rural areas of Florida contribute EHR data, covering the major metropolitan regions in Florida. Deduplication of patients is accomplished via privacy-preserving entity resolution (precision 0.97–0.99, recall 0.75), thereby linking patients’ EHR, claims, and death data. Another unique feature is the establishment of mother-baby relationships via Florida vital statistics data. Research usage has been significant, including major studies launched in the National Patient-Centered Clinical Research Network (“PCORnet”), where OneFlorida is 1 of 9 clinical research networks. The Data Trust’s robust, centralized, statewide data are a valuable and relatively unique research resource. Over the past 15 years, research reuse of electronic health record (EHR) data has become commonplace and occurs increasingly at scale. 
Key drivers of this trend are the Clinical and Translational Science Award (CTSA) program of the National Institutes of Health (NIH) 1,2 and the National Patient-Centered Clinical Research Network (PCORnet) of the Patient-Centered Outcomes Research Institute (PCORI). 3 More recently, the growth of the Observational Health Data Sciences and Informatics (OHDSI) program 4 and the U.S. Food and Drug Administration's (FDA's) efforts to promote the generation of real-world evidence from real-world data 5, 6 have accelerated the trend further. These initiatives echo the much longer history of repurposing of large, administrative claims datasets for research. Within the CTSA program and PCORnet, research usage of EHR data extends beyond retrospective, observational analysis. Besides cohort detection, researchers collect study data elements from EHRs on outcomes, interventions, adverse events, and factors associated with outcomes. The goal is to enable prospective translational studies that include clinical and pragmatic trials and implementation science. The OneFlorida Clinical Research Consortium is currently one of the 9 clinical research networks (CRNs) that, along with a coordinating center, constitute PCORnet 2.0. As a CRN, OneFlorida is unique in its statewide scope, covering all the major metropolitan regions in the nation's third most populous state. In addition, the Florida of today resembles the United States of tomorrow: older and more diverse. The OneFlorida Data Trust is the research patient data repository established and operated by OneFlorida. At its core, it includes EHR data from 10 health systems in Miami, Orlando, Tampa, Jacksonville, Tallahassee, and Gainesville linked to statewide claims data from Medicaid and Medicare. We report here on the design and composition of the Data Trust, our methods for building it, toolsets we employ for building and using it, examples of research usage, and key lessons learned. 
We have previously described OneFlorida and its history. 7 Briefly, OneFlorida began as a partnership between the University of Florida (UF) and Florida State University funded through an administrative supplement to UF's CTSA. The addition of the University of Miami and state funding through Florida's James and Esther King foundation built additional infrastructure focused on cancer research. OneFlorida became a part of PCORnet in 2015 in phase 2 as one of 2 new networks. Since then, OneFlorida has grown to include 2 additional health system partners (Figure 1). Importantly, the addition of the University of South Florida (USF) and Tampa General Hospital (TGH) in 2019 added coverage of Florida's third-largest city, Tampa. In addition, some OneFlorida partners that had not yet begun contributing data at that time are now routinely submitting data that meet PCORnet standards. They include Bond Community Health Center, Inc. (CHC) and a practice in central Florida facilitated by CommunityHealth IT. Bond CHC is a federally qualified health center (FQHC) in Tallahassee, and CommunityHealth IT is a 501(c)(6) that helps health systems and safety net medical facilities adopt health information technology. CommunityHealth IT will add additional practices from its consortium on an ongoing basis. OneFlorida is committed to the inclusion of safety net providers, such as FQHCs. These entities serve unique populations that typically experience gross health inequities. By ensuring their inclusion, we ensure that data used for research and, for example, for the construction of artificial intelligence models are less biased and do not help to perpetuate health inequities. By engaging these providers in prospective trials, we more directly reduce disparities in research results generated in OneFlorida. 
Further, as Florida is geographically diverse with mixed rural versus urban areas and is rapidly becoming a majority-minority state, the real-world data of unselected populations from the OneFlorida network provide a golden opportunity for health disparity research, especially research that focuses on racial and ethnic minorities and geographic disparities.

A key differentiator of the OneFlorida Data Trust is that it is centralized: all partners contribute data to a single, secure data warehouse managed by the OneFlorida Coordinating Center (OFCC) at the University of Florida. Our key motivation for this approach was cost-effectiveness and efficiency of the network: many OneFlorida partners did not have existing research infrastructure on which to build an independent PCORnet data mart. Nor were they as experienced with the core data standards used, such as LOINC, RxNorm, and the PCORnet Common Data Model (CDM). Thus, a centralized model at the OFCC enabled significant cost efficiencies: we estimate $183K/year/partner for a centralized approach versus $2.0M/year/partner for a federated approach, using as estimates our own costs for personnel and technology infrastructure (see Supplemental Material for our methods and more detailed results). Academic centers typically have distributed these costs among several sources (especially a CTSA) that are unavailable at nonacademic partners. Because a common approach across all OneFlorida partners was also a cost-saving imperative, academic partners adopted the centralized model too.

In total, the Data Trust houses data on 17.2 million unique individuals dating from January 2012 through March 2021 (Table 1). It includes EHR data, claims data, and additional data that we discuss next. We estimate that our partners in Florida serve 40-50% of Florida's population. In 2019 and 2020, 5.5 million and 5.4 million unique patients had an encounter at one of our partners' facilities, respectively. 
Across both 2019 and 2020, the total was 7.4 million. Even in a 2-year window, it is unlikely that every person in a population visits a healthcare provider, so the true 1- and 2-year denominators are unknown but certainly less than the total Florida population.

The chief data requirement of PCORnet CRNs is to house EHR data for unselected patient populations from each healthcare system in the network. Ten OneFlorida providers contribute EHR data (Figure 1), populating all the EHR-based tables in the PCORnet CDM, including diagnoses, procedures, observations (including vital signs), prescriptions, laboratory tests, and immunizations. In total, 9.9 million patients have EHR data. Partners refresh EHR data no less frequently than quarterly. Given typical billing cycles, the data lag 30-90 days behind the date of submission.

Since its inception, the Data Trust has housed complete, statewide Medicaid data through a partnership with the Florida Agency for Healthcare Administration (AHCA). We subsequently added Medicare data for the dual-eligible (Medicare/Medicaid) population. Capital Health Plan, a health maintenance organization with 135 000 members and 3 health centers, contributes claims data and laboratory test results for patients cared for at Tallahassee Memorial Healthcare (TMH). For individual studies, we can link the entire Medicare population at each participating partner. In partnership with commercial payers, we also link commercial claims data to the entire Data Trust for individual studies. In total, 9 million patients have claims data. As we discuss below, our entity resolution processes have linked patients across sources and, as a result, 1.7 million patients have both EHR and claims data.

Each healthcare partner includes information about the deaths of patients occurring within their institutions. 
In 2020, OneFlorida purchased and linked a commercial death dataset from Datavant, fulfilling one of OneFlorida researchers' most common requests for additional data. We refresh death data monthly with updates from Datavant.

Tumor registry data

UF Health, Orlando Health, and TMH submit to the Data Trust a Health Insurance Portability and Accountability Act (HIPAA) limited data set (LDS) copy of the data they submit to the state tumor registry, the Florida Cancer Data System, following the North American Association of Central Cancer Registries (NAACCR) data standards. Currently, the Data Trust holds tumor information for 87 179 patients with cancer, which includes detailed tumor characteristics (eg, tumor size and staging information), treatment information (eg, surgery, systemic therapy, and immunotherapy), and genetic markers (eg, HER2 status), among much other information that is important to cancer research.

Other data sources including publicly available third-party data

We have integrated American Community Survey data, and deprivation indexes derived from them, by linking patients' home location to census tracts, ZIP codes, and counties in the American Community Survey. More recently, we established an external exposome database with over 5500 variables from the natural (eg, air pollution), built (eg, walkability), and social (eg, housing) environment derived from well-validated, public data sources. 8 We can provision patient-level OneFlorida data linked to environmental exposures through residential history. For 70% of patients in the Data Trust, we have 9-digit home ZIP, which enables linking to census tract level. For certain studies, we have also integrated OneFlorida data with Florida birth certificates and fetal death certificates data, through which a mother-baby relationship can be established. This capability enables linking a mother's data (EHR, claims, etc.) to her baby's data (EHR, claims, etc.) 
to study the effects of maternal health on child health outcomes. The Data Trust houses mother-baby birth, death, and prenatal screening data for all births.

With the adoption of a centralized model, OneFlorida chose to make the Data Trust a HIPAA LDS, primarily motivated by partner concerns for patient privacy and by simplifying regulatory compliance. Being an LDS means only that the Data Trust excludes 16 of 18 categories of identifiers under HIPAA. What remains are dates of service, birth, and death and geographic indicators at resolutions smaller than a state. From a regulatory perspective, an LDS requires only a data-use agreement (DUA) between UF and each data provider, avoiding the need for Business Associate Agreements. Each partner additionally signs a memorandum of understanding with UF that expresses the intent of both parties to jointly conduct research. The DUA requires that the healthcare partner submit EHR data per PCORnet requirements, including the addition of new data as the PCORnet CDM expands or as new OneFlorida priorities arise. We also have a DUA with AHCA that enables the inclusion of the Medicaid and Medicare data and a DUA with the Florida Department of Health (FLDOH) for birth certificate and fetal death data.

The Data Trust is approved as a UF "data bank" type Institutional Review Board (IRB) protocol. Because a HIPAA LDS is still protected health information, the IRB requires periodic, extensive information technology security reviews to ensure that we provide strong, state-of-the-art controls and safeguards of the data. The OneFlorida Executive Committee serves as the oversight board and monitoring committee under the IRB, reviewing and reporting all Data Trust activity quarterly to the IRB. The Data Trust IRB permits patient re-identification through honest brokers (Figure 2). Each partner has staff who are not members of any research team and can re-identify patients at their sites for approved research protocols. 
Such re-identification enables participant recruitment for prospective studies and trials. Because each partner controls re-identification at its sites, no researcher can contact a partner's patients without that partner's involvement, which was also a key concern of OneFlorida partners generally.

The fact that the Data Trust is a HIPAA LDS presented an added challenge to linking patients' data across multiple data sources. We implemented a hash-based privacy-preserving entity resolution (PPER, or record linkage) solution. Initially, we deployed our in-house solution, OneFl Deduper, across all OneFlorida partners and evaluated its performance (precision 0.97-0.99 and recall 0.75). 9 For several studies, we used Deduper to link other data sources such as mother-baby relationships established from birth certificates. In late 2019, PCORnet began the process of identifying a common PPER solution and selected a commercial, hash-based PPER solution from Datavant. We are in the process of implementing Datavant at all partners. However, we continue to operate both solutions in parallel, and we still carry out entity resolution via Deduper, so its performance is applicable to our current entity resolution results.

As of April 2021, through our existing PPER process with OneFl Deduper, we deduplicated 2.4 million patient records for 2.2 million patients and linked 1.7 million (10%) patients' EHR data to their claims data. In total, 9 million patients have claims data and 9.9 million have EHR data; these counts are consistent with the 17.2 million total when linked patients are counted once (9.9 + 9.0 - 1.7 = 17.2 million). Given that our PPER has an estimated recall of 75-85%, the true number of unique patients is likely to be slightly lower than 17.2 million. Our geographical coverage of unique individuals across the state of Florida mirrors the population density of the state (Figure 3). The Data Trust nevertheless covers all 67 counties in Florida, and most rural areas have substantial representation. We also implemented additional entity resolution processes to improve data quality. 
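To make the idea of hash-based linkage over a limited data set concrete, the following is a minimal sketch, not the OneFl Deduper or Datavant algorithm, neither of which is specified here. It assumes each site holds identifiers locally, normalizes them, and submits only a keyed hash; the field choice, the `SHARED_KEY` value, and the function names `normalize`, `linkage_token`, and `link_records` are all hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical secret shared among sites; an assumption for illustration only.
SHARED_KEY = b"example-key-not-a-real-secret"


def normalize(value: str) -> str:
    """Lowercase and keep only letters and digits so formatting differences vanish."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def linkage_token(first: str, last: str, dob: str, sex: str) -> str:
    """Keyed SHA-256 hash of normalized identifiers; only this token leaves the site."""
    message = "|".join(normalize(v) for v in (first, last, dob, sex))
    return hmac.new(SHARED_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(records):
    """Group (source, local_id, token) triples by token; equal tokens are treated as one person."""
    groups = defaultdict(list)
    for source, local_id, token in records:
        groups[token].append((source, local_id))
    return dict(groups)


# Each source hashes locally, then submits the token with its own local record id.
ehr_token = linkage_token("Ana", "De La Cruz", "1980-02-29", "F")
claims_token = linkage_token("ANA", "de la cruz", "1980/02/29", "f")
other_token = linkage_token("Ann", "Cruz", "1980-02-29", "F")

linked = link_records([
    ("ehr_site_a", "r1", ehr_token),
    ("claims", "c9", claims_token),
    ("ehr_site_b", "r7", other_token),
])
```

In this sketch the first two records collapse to one person because their normalized identifiers agree, while the third stays separate. Exact matching on a single token generally trades recall for precision, since any typo in a source field produces a different hash; production tools typically use several tokens built from different field combinations.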
As one example of these additional processes, when a particular patient has records from multiple partners, we source their demographic data from the partner with the most recent encounter. However, we overwrite fields that have "other" and "unknown" values with known values when they are present in other partners' older data. For instance, a recent race value of "unknown" from partner A would be overwritten with an older value of "black" from partner B. Excluding unknown values, among the 2.2 million linked patient record pairs, 0.17% of pairs have conflicting sex values, 2.6% have multiple race values (which might or might not conflict depending on whether the individuals in question are multiracial), and 3.7% have conflicting ethnicity values (ie, Hispanic vs not Hispanic).

Table 1 notes: OneFlorida numbers are cumulative since 2012, whereas Florida and US numbers are current state. The high percentages of other and unknown in OneFlorida are due to data limitations that reflect a suboptimal state of the art in collecting these data, driven by basic compliance with federal data standards as the primary consideration.

We note that the state of the art among our partners is to collect limited gender, race, and ethnicity data, the latter 2 as defined by Office of Management and Budget (OMB) standards, which are basic and simplistic. In particular, claims data use the single-question format to capture both race and ethnicity, leaving one or the other unknown. For example, a response of Black does not capture ethnicity. A response of Hispanic does not capture race. This situation leads to the high percentages of "unknown" and "other" values in Table 1. Sadly, we believe this situation results from the lack of incentives in the healthcare industry at large to capture richer gender, race, and ethnicity data. We have confronted this problem through, for example, the development of transgender and gender nonconforming computable phenotypes that leverage data besides the demographics table in the PCORnet CDM. 
10 We have taken multiple approaches to ensuring data quality in the Data Trust. The most intensive quality assurance process in terms of the number of checks, running time, comprehensiveness across PCORnet CDM tables, and the addition of new checks over time is the PCORnet empirical data characterization (EDC) process. PCORnet EDC occurs quarterly, with a mandate to refresh all EHR data each time. EDC includes mandatory checks for certification for use in PCORnet studies and also flags serious issues. Data marts are judged based on the numbers of serious issues outstanding. Despite the Data Trust being centralized, PCORnet treats each OneFlorida source (including Medicaid) as a separate PCORnet data mart, each of which must pass EDC on its own. OFCC staff perform EDC for each data mart and work with partners to address issues identified.

To further improve our data quality practices, we synthesized data quality dimensions and assessment methods for real-world data, especially EHRs, through a systematic scoping review and then assessed gaps in existing PCORnet EDC processes. 11 Guided by this review, we further advanced data quality assessment within OneFlorida, especially by providing more comprehensive, study-specific data quality checks. For example, our study-specific data quality checks often focus on various aspects of plausibility, concordance, consistency, and relevance, whereas the PCORnet EDC focuses only on conformance and completeness. Further, as part of the OneFlorida project intake process, a dedicated data team with informatics faculty and data analysts iteratively works with the investigators to address various data issues, including data quality issues.

A key function of the OFCC is to provide query and data extraction services. Researchers use the i2b2 self-service query tool for many preparation-to-research queries, including cohort identification and counts. 
The OneFlorida i2b2 instance includes one large project with all linked data comprising over 12 billion facts, as well as 11 partner-specific projects (10 health systems plus Medicaid) that when linked comprise the larger dataset. Each partner may query the combined data project and its own project, but not any other partner's project. The individual partner-based projects do not include additional data not also included in the large project. Data analysts write and curate more complex queries and conduct study data extraction. Often, an informatics faculty liaison from the OFCC is involved in the query fulfillment process, providing expertise and consultation on best practices for using real-world data in research. The centralized model has enabled the OFCC to manage all query requests and responses for PCORnet studies. We have not only fulfilled the needs of PCORnet and OneFlorida studies but also conducted novel informatics research advancing the field (eg, developing and applying novel distributed learning algorithms and trial simulations using real-world data from the OneFlorida network). 12-15

Approach to findability, accessibility, interoperability, and reusability

We deposit metadata about each PCORnet-certified, quarterly refresh of the Data Trust into DataCite Commons, including title, creators, various dates, and a description. DataCite Commons assigns a digital object identifier (DOI) to each metadata submission, 16-20 enabling us to meet a key findability, accessibility, interoperability, and reusability (FAIR) data metric of assigning the metadata a unique, resolvable identifier. We provide these DOIs to investigators so they can cite the Data Trust in their work. Although these efforts address the findability of Data Trust data and begin to guide reuse by clearly identifying successive versions of the Data Trust as utilized in various studies, they represent only our initial efforts toward FAIR. 
Because FAIR principles also apply to metadata, we note that our metadata adhere to FAIR principles to the extent that DataCite Commons itself supports them.

Several retrospective studies have utilized the Data Trust. Examples include studies on hypertension, 21 obesity, 22,23 sickle cell disease, 24 stillbirth, 25 hepatitis C, 26 opioid use disorder, 15 and adults with multiple chronic conditions. 27 Researchers have also used the Data Trust to develop computable phenotypes, including resistant hypertension, type 1 diabetes mellitus, and transgender and gender nonconforming individuals. 10, 28, 29 In addition, the Data Trust has played key roles in several large studies carried out in PCORnet. 30-34 Also, to confront the COVID-19 pandemic, PCORnet partnered with the Centers for Disease Control and Prevention (CDC) on a COVID-19 healthcare data initiative, 35 with 5 OneFlorida partners participating.

OneFlorida and its Data Trust were essential to the formation of the South East Enrollment Center (SEEC) in the All of Us Research Program (AoURP), 36 which includes the University of Miami, University of Florida, Morehouse School of Medicine, and Emory University. SEEC has to date enrolled over 20 000 AoURP participants, and the OFCC has delivered EHR data for a significant majority.

The expansion of OneFlorida to OneFlorida+ occurred in 2020 with the additions of Emory University in Georgia and the University of Alabama at Birmingham. The result is an expansion of the network and research infrastructure from Florida to the southeastern United States. The vision is to address the health needs and health disparities of the entire expanded region. The new partners already participate on the OneFlorida+ Executive Committee, and we currently are adding their data to the Data Trust. OneFlorida+ has applied for funding for these partners, as well as for USF/TGH, to become official PCORnet 3.0 partners. 
As we add healthcare organizations managed by CommunityHealth IT, we will prioritize safety net providers such as FQHCs. Discussions are ongoing with key stakeholders about linking the Data Trust to additional statewide datasets, such as the Florida immunization registry, emergency medical services data, and the Merlin case reporting system of the FLDOH (dashed lines in Figure 2). 37

The OneFlorida Data Trust is a key component of OneFlorida's statewide research infrastructure. It has grown considerably over time to include new partners, new sources and types of data, and third-party data sets. It serves as a robust research resource, including but not limited to translational, comparative effectiveness, and informatics research. In keeping with OneFlorida's mission to translate research results into practice across the spectrum, it includes nonacademic community health systems and safety net providers as partners. The inclusion of these entities, plus patient privacy and regulatory concerns, drove the development of a centralized model of the Data Trust. The centralized model has provided substantial efficiencies in building and operating it, including querying and analyzing the data. Although the Data Trust is not the first or largest significant linkage of EHR and claims data, it remains among the largest, and one of the most comprehensive, at a statewide scale.

Author WRH drafted the initial manuscript with contributions of sections from authors JB, LM, SW, and AL. Authors TR, OC, PSR, RZE, and DN provided significant input and review in describing the overall goals of OneFlorida, the issues of health equity, issues of real-world data and real-world evidence, and the agreements, structure, history, and partnership of OneFlorida. Authors GL, CH, TM, SW, and AL provided significant input and review on the technical and infrastructure aspects of the Data Trust, including data content, refreshes, and query and analytics. 
Author JH generated Figure 3 and provided significant input into the section describing Census data and its use in calculating deprivation indexes. All authors reviewed, edited, and commented on the manuscript and approved its final form.

Supplementary material is available at Journal of the American Medical Informatics Association online.

1. Sustainability considerations for clinical and translational research informatics infrastructure
2. Understanding enterprise data warehouses to support clinical and translational research
3. Launching PCORnet, a national patient-centered clinical research network
4. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers
5. Real-world evidence - what is it and what can it tell us?
6. Use of Electronic Health Record Data in Clinical Investigations Guidance for Industry. US Food and Drug Administration
7. OneFlorida Clinical Research Consortium: linking a Clinical and Translational Science Institute with a community-based distributive medical education model
8. Semantic standards of external exposome data
9. Implementing a hash-based privacy-preserving record linkage tool in the OneFlorida clinical research network
10. Developing and validating a computable phenotype for the identification of transgender and gender nonconforming individuals and subgroups
11. Assessing the practice of data quality evaluation in a national Clinical Data Research Network through a systematic scoping review in the era of real-world data
12. Learning from local to global: an efficient distributed algorithm for modeling time-to-event data
13. Leverage real-world longitudinal data in large clinical research networks for Alzheimer's Disease and Related Dementia (ADRD)
14. Exploring the feasibility of using real-world data from a large Clinical Data Research Network to simulate clinical trials of Alzheimer's disease
15. Identifying clinical risk factors for opioid use disorder using a distributed algorithm to combine real-world data from a large Clinical Data Research Network
21. Hypertension in Florida: data from the OneFlorida Clinical Data Research Network
22. Characterization of adult obesity in Florida using the OneFlorida Clinical Research Consortium
23. Objectively measured pediatric obesity prevalence using the OneFlorida Clinical Research Consortium
24. Shared care for adults with sickle cell disease: an analysis of care from eight health systems
25. Ambient heat and stillbirth in Northern and Central Florida
26. HCV testing: order and completion rates among baby boomers obtaining care from seven health systems in Florida
27. Prevalence of multiple chronic conditions among older adults in Florida and the United States: comparative analysis of the OneFlorida Data Trust and national inpatient sample
28. Optimizing identification of resistant hypertension: computable phenotype development and validation
29. An iterative process for identifying pediatric patients with type 1 diabetes: retrospective observational study
30. The ADAPTABLE trial and aspirin dosing in secondary prevention for patients with coronary artery disease
31. PCORnet Bariatric Surgery Collaborative. The National Patient-Centered Clinical Research Network (PCORnet) bariatric study cohort: rationale, methods, and baseline characteristics
32. Early life antibiotic prescriptions and weight outcomes in children 10 years of age
33. Diabetes medication regimens and patient clinical characteristics in the national patient-centered clinical research network
34. Wake Forest School of Medicine. PREVENTABLE - PRagmatic EValuation of evENTs And Benefits of Lipid-lowering in oldEr adults
35. Public Health Informatics Institute. PCORnet CDC COVID-19 Healthcare Data Initiative
36. All of Us Research Program Investigators. The "All of Us" research program
37. Potential effects of electronic laboratory reporting on improving timeliness of infectious disease notification - Florida

The authors would like to acknowledge all the individuals working at all our partner organizations who make OneFlorida and the Data Trust possible. They would also like to thank Mahmoud Enani and Erik Schmidt for their contributions to the analysis provided in the Supplementary Material. They would also like to thank Joe Selby for his vision and leadership in creating the National Patient-Centered Clinical Research Network (PCORnet), as well as the Patient-Centered Outcomes Research Institute Board of Governors for their bedrock support and confidence in PCORnet and its Clinical Research Networks and Coordination Center.

None declared.

The OneFlorida Data Trust is available to investigators for research purposes through the formal policies and procedures established by the OneFlorida Executive Committee. The identifiers of sequential versions of the Data Trust