key: cord-0931129-q1byvwtc
authors: Walonoski, Jason; Klaus, Sybil; Granger, Eldesia; Hall, Dylan; Gregorowicz, Andy; Neyarapally, George; Watson, Abigail; Eastman, Jeff
title: Synthea™ Novel Coronavirus (COVID-19) Model and Synthetic Data Set
date: 2020-10-02
journal: Intell Based Med
DOI: 10.1016/j.ibmed.2020.100007
sha: 60b2a0070c90e1e74d5bd5d47e5aa2a7bcf861db
doc_id: 931129
cord_uid: q1byvwtc

From March through May 2020, a model of novel coronavirus (COVID-19) disease progression and treatment was constructed for the open-source Synthea patient simulation. The model was constructed using three peer-reviewed publications published in the early stages of the global pandemic, when less was known, along with emerging resources, data, publications, and clinical knowledge. The simulation outputs synthetic Electronic Health Records (EHR), including the daily consumption of Personal Protective Equipment (PPE) and other medical devices and supplies. For this simulation, we generated 124,150 synthetic patients, with 88,166 infections and 18,177 hospitalized patients. Patient symptoms, disease severity, and morbidity outcomes were calibrated using clinical data from the peer-reviewed publications. Of all simulated infected patients, 4.1% died and 20.6% were hospitalized. At peak observation, 548 dialysis machines and 209 mechanical ventilators were needed. This simulation and the resulting data have been used for the development of algorithms and prototypes designed to address the current or future pandemics, and the model can continue to be refined to incorporate emerging COVID-19 knowledge, variations in patterns of care, and improvement in clinical outcomes. The resulting model, data, and analysis are available as open-source code on GitHub and an open-access data set is available for download.

Introduction

from, but do not preserve, individual data records). Deidentified data are often modified from real data points using methodologies such as masking or deleting fields and introducing noise.

The assumption that deidentification guarantees privacy or eliminates risk is false. [6] Synthetic data has been widely used as a safe alternative to deidentification. Synthetic data is considered ethically superior to deidentified data, because there is no individual sensitive record underneath any synthetic record that can ever be reidentified. [6] Synthetic clinical data sets can be openly shared to enable innovation, such as from the open-source Synthea ("Synthetic Health") project, which publicly provides millions of longitudinal synthetic health records. [7] Synthetic data is being used for software testing and validation (including privacy and security testing), education, academic research, feasibility assessments, and algorithm validation, but not yet for clinical discovery and scientific inference. [8] Criticisms of Synthea are that it does not fully account for variations in health care delivery by providers and has limited heterogeneous health outcomes after major interventions, although it can be improved and validated, [9] and that it does not yet contain sufficient clinical notes. [10] Other synthetic generation techniques, such as Generative Adversarial Networks (GANs) used in medical imaging, are experiencing rapid growth (150 articles in the last three years) and the results are filling an important niche in data science. [11] Synthetic data aligns with the Open Science movement, which includes open access, open source, and open data among its principles to address the scientific reproducibility problem. The scientific reproducibility problem is especially severe in health research (especially health machine learning), where data sets and code are more likely to be unavailable. Synthetic data has been identified as a way for researchers to meaningfully release data, code, and results. [12]

When properly constructed and validated, synthetic data used in data analytics and machine learning tasks has been shown to have the same results as real data in several domains without compromising privacy. [13] However, these domains are generally not as complex or as high-stakes as health care responses to a pandemic such as COVID-19, so synthetic health data should always be validated for a researcher's specific use case prior to utilization. Lacking access to the unpublished raw data and having only peer-reviewed research results with summary statistics, we attempted to calibrate the synthetic data to those reference statistics (as presented in Methods and Materials and Results). We outline limitations and suggested uses in the Discussion section.

The COVID-19 models within Synthea were based primarily on three peer-reviewed clinical papers, drawing on findings from Wuhan, China [14] [15] and mortality data from New York City, USA. [16] The characteristics of these studies and the final synthetic data are summarized in Table 1.

The COVID-19 model is divided into modules, each with one of the following functions:

• Determines exposure and infection rates.
• Contains the daily loop during hospitalization and ICU treatment.
• Determines risk based on comorbidities, severity of disease, and whether or not the patient will survive.
• Determines whether or not patients will be tested, the testing results, and whether or not they are admitted to the hospital.
• Records daily lab values.
• Records frequent lab values.
• Records vital signs.
• Potentially enrolls a critical or severe patient in one of eighteen clinical trials.
• Sets lab values for patients who will not survive.
• Sets lab values for patients who will survive.
• Determines outcomes and complications based on risk and disease severity.
• Ends complications after recovery (if applicable).
• Contains the daily supplies used within the hospital for 1 patient, 1 physician, and 1 nurse.
• Contains the daily supplies used within the ICU for 1 patient, 1 physician, and 1 nurse.
• Contains the supplies used for intubation.
• Determines the symptoms presenting in each patient.
• Ends symptoms after recovery (if applicable).
• Patients will likely develop blood clots during inpatient and ICU stays.
• Blood clots need to be treated once they develop.

We generated 124,150 patients and performed basic analyses to produce figures and outcome summaries that correspond to the figures and tables in our primary data sources, which we present for comparison. It is important to note that the model was developed from the primary source tables, and not from the raw data behind these sources, which was unpublished. Of the 88,166 infections in the generated population (not all the simulated patients became infected), 18,177 patients were hospitalized. Based on current knowledge this hospitalization rate is high, but at the early stages of the pandemic, we estimated hospitalization rates based on projected outcomes related to patient comorbidities and risk factors without consideration of disease prevention measures. The mortality graph above shows the mortality disparity among age and gender groups. Outcomes are enumerated in Table 3. The "Outcome" column describes an outcome (e.g. ventilated, recovered, or death), and the other columns cross-tabulate that outcome against patient groups (e.g. all patients, patients who were hospitalized, patients who were admitted to the ICU, and patients who required ventilation). A cell marked "1.00" indicates that 100% of the patients in that group had that outcome. For example, the cell corresponding to the "ICU Admitted" column and the "Hospital Admission" row is "1.00", which indicates that all patients who were admitted to the ICU were also admitted to the hospital (in this case, a prerequisite event in our model).
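As a minimal sketch of how the rates and the outcome cross-tabulation described above can be computed, the following Python uses the counts reported in the text and a toy cohort. The `outcome_fraction` helper, the flag names, and the toy records are illustrative assumptions and do not reflect the Synthea export schema or the authors' analysis code.

```python
# Counts reported in the text for this simulation run.
TOTAL_PATIENTS = 124_150
INFECTED = 88_166
HOSPITALIZED = 18_177


def outcome_fraction(patients, group, outcome):
    """Fraction of patients in `group` who also have `outcome`.

    `patients` is a list of dicts of boolean flags (a hypothetical layout
    chosen for this sketch, not the Synthea export format).
    """
    members = [p for p in patients if p[group]]
    if not members:
        return float("nan")
    return sum(1 for p in members if p[outcome]) / len(members)


# Share of the generated population that became infected.
infection_rate = INFECTED / TOTAL_PATIENTS
# Share of infected patients who were hospitalized (reported as 20.6%).
hospitalization_rate = HOSPITALIZED / INFECTED

# Toy cohort in which ICU admission implies hospital admission,
# mirroring the prerequisite relationship described for the model.
toy_cohort = [
    {"hospitalized": True, "icu": True},
    {"hospitalized": True, "icu": False},
    {"hospitalized": False, "icu": False},
]

print(f"infection rate: {infection_rate:.3f}")
print(f"hospitalization rate: {hospitalization_rate:.3f}")
print("ICU patients hospitalized:",
      outcome_fraction(toy_cohort, "icu", "hospitalized"))
```

In the toy cohort, the fraction of ICU patients who were hospitalized is 1.0, which is the situation a "1.00" cell describes.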

The quantities of supplies consumed in this simulation run are enumerated in the table below. The simulation included approximately 88,000 infected patients, of whom 18,177 were admitted to the hospital. The assumptions made about supply consumption and device usage are documented in Appendix A. Supply models are documented in Appendix B: Supply and Device Lists.

Supply                                                      Quantity
Basic endotracheal tube single-use                          2914
Disposable air-purifying respirator                         446438
Endotracheal tube stylet single-use                         2914
Human plasma blood product (product)
Isolation gown single-use                                   2583654
Lubricant                                                   2914
Nitrile examination/treatment glove non-powdered sterile
Operating room gown single-use                              2914
Protective glasses device                                   8742
Syringe device                                              5828
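Several of the quantities above are whole-number multiples of 2914. The following sketch makes that arithmetic explicit. It rests on two assumptions that the text does not state: that these six items belong to the intubation supply module, and that one endotracheal tube is consumed per intubation event. The `units_per_event` helper is hypothetical.

```python
# Quantities copied from the supply table above.
SUPPLY_TOTALS = {
    "Basic endotracheal tube single-use": 2914,
    "Endotracheal tube stylet single-use": 2914,
    "Lubricant": 2914,
    "Operating room gown single-use": 2914,
    "Protective glasses device": 8742,
    "Syringe device": 5828,
}


def units_per_event(totals, reference_item):
    """Divide each total by the total of `reference_item`.

    Assumes (for illustration only) that the reference item is used
    exactly once per event, so its total equals the event count.
    """
    events = totals[reference_item]
    per_event = {}
    for item, count in totals.items():
        quotient, remainder = divmod(count, events)
        # Keep exact integers where the division is clean.
        per_event[item] = quotient if remainder == 0 else count / events
    return per_event


per_event = units_per_event(SUPPLY_TOTALS, "Basic endotracheal tube single-use")
for item, units in per_event.items():
    print(f"{item}: {units} per event")
```

Under those assumptions, the totals are consistent with 2914 events that each consume three pairs of protective glasses and two syringes.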

Complications among survivors and non-survivors are listed in Tables 5 and 6 for synthetic patients and the reference data, respectively. There are discrepancies between these outcomes (compare the "percent" columns in each table) because the outcomes in the model are not fixed by percentages from Table 6, but are based on risk factors that determine severity and mortality, including gender, age, and comorbidities, that differ from the reference population.

For comparison, Table 6 reproduces reference data from the "Outcomes" portion of 

Major lab values were modeled according to the temporal changes in laboratory markers documented in Figure 2 from [14].

Distribution unlimited 20-01468.

Patient timelines were modified from the illustration of common patient timelines and statistics from Figure 1 ("Clinical courses of major symptoms and outcomes and duration of viral shedding from illness onset in patients hospitalized with COVID-19") from [14] and are paralleled here in Figures 5 and 6. Figure 5 shows patients who were hospitalized, some of whom were later admitted to the ICU. Figure 6 shows only the ICU patients. The average length of stay for the synthetic patients is detailed in Table 7, with reference ranges from [14] that are inclusive of both survivors and non-survivors.

[15], except for Loss of Taste, which was based upon the findings from [17]. The symptoms in [15] are based on disease severity (severe and non-severe) while our findings are broken down by survivor and non-survivor, which are overlapping but not identical populations.

Loss of Taste 0.64 130 (n=374) n/a n/a n/a n/a

Synthea is an open-source modeling and simulation platform for disease progression and treatment. If we take George Box's aphorism that all models are wrong to be true, then Synthea is wrong. And if we take Field Marshal Moltke's notion of "no plan survives contact with the enemy" as true and expand the scope to modeling and simulation, then we might say that "no model survives contact with reality." All of which is to say that our model of novel coronavirus is flawed, as all models are, and as a model it cannot survive contact with reality. Nevertheless, we hope it is useful.

To our knowledge, the Synthea COVID-19 data has been useful in several online challenges, hackathons, and conferences [18-22]. In these venues, the data has spurred innovation and exchange of ideas about software solutions, enabled software development and testing, and has been used as the basis for some prediction modeling.

Those prediction models are likely not suitable for application in the delivery of clinical care; however, they do enable a machine-learning team to begin exploring realistic data, develop their ideas and solutions, and build a processing pipeline, all before they are able to gain secure access to restricted data sets of real COVID-19 patients and outcomes. The data also provides learning opportunities to teams that would otherwise be unable to gain access to such data sets and lowers the barrier to entry for participating in AI and ML activities in healthcare.

In the future, when more COVID-19 real-world data sets become available, including EHR data and associated outcomes, it will be possible to tune the model weights and probabilities to match real cohorts and diverse variations in care. For example, using data from one region during a particular month of the pandemic, the model could be calibrated to generate data more representative of that cohort (including infection rates, disease severity, treatments, and outcomes).

• Critical severity: individuals who have acute respiratory failure requiring mechanical ventilation, septic shock, and/or multiple organ dysfunction (sepsis). [5, 6]

Of note, model illness severity rates specified in the covid19/determine_risk module do not account for asymptomatic or pre-symptomatic infection, as the COVID-19 model requires symptom onset (mild illness) for viral diagnostic testing.

Sepsis: defined as life-threatening organ dysfunction caused by a dysregulated host response to infection in accordance with the 2016 Third International Consensus Definition for Sepsis and Septic Shock, wherein organ dysfunction is determined by an acute change in the total Sequential [Sepsis-Related] Organ Failure Assessment (SOFA) score of ≥ 2 points as a result of infection. [1, 7]

Septic shock: defined as a subset of sepsis in accordance with the 2016 Third International Consensus Definition for Sepsis and Septic Shock, wherein individuals with septic shock are identified clinically as those with persisting hypotension requiring vasopressor support to maintain a mean arterial pressure (MAP) ≥ 65 mm Hg and a serum lactate level > 2 mmol/L. [1, 7]
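The two definitions can be expressed as a small decision rule. This is a sketch only, using the Sepsis-3 consensus thresholds (an acute SOFA change of 2 or more points; a serum lactate above 2 mmol/L together with vasopressor-dependent hypotension). The `SepsisAssessment` record and its field names are hypothetical and are not taken from the Synthea modules.

```python
from dataclasses import dataclass


@dataclass
class SepsisAssessment:
    """Hypothetical inputs for this sketch; not a Synthea data structure."""
    infection: bool            # suspected or confirmed infection
    sofa_change: int           # acute change in total SOFA score
    needs_vasopressors: bool   # persisting hypotension requiring vasopressors
    lactate_mmol_per_l: float  # serum lactate


def classify_sepsis(a: SepsisAssessment) -> str:
    """Return 'no sepsis', 'sepsis', or 'septic shock'."""
    # Sepsis requires infection plus organ dysfunction (SOFA change >= 2).
    if not (a.infection and a.sofa_change >= 2):
        return "no sepsis"
    # Septic shock is the subset of sepsis with vasopressor-dependent
    # hypotension and elevated lactate.
    if a.needs_vasopressors and a.lactate_mmol_per_l > 2.0:
        return "septic shock"
    return "sepsis"


print(classify_sepsis(SepsisAssessment(True, 3, True, 2.5)))
print(classify_sepsis(SepsisAssessment(True, 2, False, 1.0)))
print(classify_sepsis(SepsisAssessment(True, 1, True, 3.0)))
```

Checking the sepsis condition first keeps septic shock a strict subset of sepsis, as the definition requires.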

The COVID-19 model represents the following risk factors for severe illness:

• Age
• Cancer
• Cardiovascular disease or other serious heart condition
• Chronic lung disease, including moderate to severe asthma
• Diabetes mellitus
• Homelessness
• Immunocompromised condition
• Obesity with a BMI > 40

The COVID-19 model does not account for increased risk of severe illness associated with residence in a nursing home, long-term care facility, or other congregate environment. Although experts have observed lower rates of severe illness among those with asthma, we retained this risk factor as it is currently unclear whether lower rates of severe illness are attributable to physiologic factors or increased vigilance among those with asthma. We also include homelessness as a relevant social risk factor, as public health experts acknowledge that individuals experiencing homelessness may have unique risks to their health and safety that may contribute to more severe illness and worse health outcomes. Of note, conditions represented in the covid19/determine_risk module reflect other non-COVID-19 disease models within Synthea™; consequently, treatment of those conditions (i.e. medications) is reflected in COVID-19 model outputs.

The inpatient clinical management we represent in the COVID-19 model is informed by clinical guidelines available at the time of model development, as well as discussions with frontline clinicians. [5, 6] While individuals <18 are included in the COVID-19 model, it currently reflects the clinical management of adult inpatient populations only, as pediatric data on disease physiology and management is limited. While we made every effort to update the model as new clinical guidance arose, newer aspects of clinical management may not be fully represented in the model given the dynamic and rapidly evolving knowledge of COVID-19 physiology and subsequent management. The main aspects of clinical management addressed or excluded from the model are discussed in Table 1 below, with relevant commentary on model representation.

Considering the timing and low rates of bacterial co-infection observed by Zhou et al. [1] and others [8, 9], we reserve antibiotic therapy (specifically broad-spectrum use) for individuals admitted to the ICU with culture-confirmed bacterial infection.

While guidelines generally recommend prone positioning for mechanically ventilated adults, our model also includes awake prone positioning for non-mechanically ventilated, hospitalized individuals based on benefits observed among frontline clinicians.

We reserve bronchodilator therapy for individuals with asthma, COPD, or severe wheezing, in light of the observed, limited benefit of bronchodilators for management of COVID-19.

Corticosteroid use is not represented in the model, per the NIH COVID-19 management guidelines, which recommend against use or cite insufficient data for use beyond refractory shock. [6]

COVID-19 diagnostic viral testing: While the model reflects initial diagnostic testing for SARS-CoV-2 infection with a reverse transcriptase polymerase chain reaction (RT-PCR) assay, this step is inclusive of all diagnostic test assays.

Influenza testing during influenza season is represented in the model; however, we reserve full respiratory viral panel testing for individuals who are immunocompromised.

ECMO is not represented in the COVID-19 model given low utilization rates and insufficient data to recommend either for or against the routine use.

Model references to "ventilation" are inclusive of all approaches to invasive mechanical ventilation management based on an individual's underlying lung physiology.

Pulmonary imaging: Model references to "Chest X-ray" are inclusive of pulmonary imaging with either chest x-ray or chest computed tomography (chest CT) in accordance with Fleischner Society recommendations for chest imaging in COVID-19 [10], or for other clinical appropriateness criteria (e.g. to confirm endotracheal tube placement). While we acknowledge the use of point-of-care ultrasound as a potential diagnostic tool, it is not currently represented in the model.

We represent the use of renal replacement therapy (RRT) in the COVID-19 model due to its impact on therapeutic options and observed rates. [1, 11, 12]

Test-based strategy: While meeting criteria for discontinuation of transmission-based precautions is not a prerequisite for hospital discharge, for the purposes of the COVID-19 model we use a test-based strategy to represent patients meeting clinical criteria for hospital discharge (e.g. resolution of fever, improvement in respiratory symptoms, and at least two negative test results from respiratory specimens). [13]

All inpatients are placed on pharmacologic thromboprophylaxis with either low-molecular-weight heparin (LMWH) or unfractionated heparin (UFH) based on renal function, with dosing flexibility for institutional variation. [14] For simplicity, enoxaparin is representative of all LMWHs, although other preparations are available.

For consistency, all inpatients who develop VTE are treated with full-dose anticoagulation with either LMWH or UFH infusion based on renal function. However, LMWH or UFH infusion for VTE treatment is inclusive of other therapeutic agents, such as direct oral anticoagulants.

To illustrate the use of pharmacologic therapy for COVID-19 treatment, we randomly enroll ~10% of the inpatient population in a clinical trial (see covid19/medications module). Individuals are eligible for participation in a clinical trial unless they meet one of the exclusion criteria specified in Table 2.

Pregnancy: Therapeutic options may be limited in pregnancy due to risk of teratogenicity; pregnancy was also an exclusion for numerous clinical trials.

End-stage renal disease on dialysis (CKD): A significant number of trials had an exclusion for eGFR <30.

Included as a catch-all for solid/hematologic malignancy and a low absolute neutrophil count, which were common exclusions noted for trials.
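The enrollment step can be sketched as follows, assuming that exclusion criteria are checked first, that roughly 10% of the remaining inpatients are enrolled at random, and that enrollees are spread uniformly over the eighteen trial agents mentioned in the module summary. The flag names, the patient dictionary layout, and the order of the checks are hypothetical and are not the actual logic of the covid19/medications module.

```python
import random

# Hypothetical flag names; the third stands in for the catch-all criterion.
EXCLUSION_FLAGS = ("pregnant", "esrd_on_dialysis", "other_exclusion")
ENROLLMENT_PROBABILITY = 0.10  # "~10% of the inpatient population"
N_TRIAL_AGENTS = 18            # "one of eighteen clinical trials"


def is_eligible(patient: dict) -> bool:
    """A patient is eligible only if no exclusion flag is set."""
    return not any(patient.get(flag, False) for flag in EXCLUSION_FLAGS)


def assign_trial(patient: dict, rng: random.Random):
    """Return a trial index in [0, N_TRIAL_AGENTS), or None if not enrolled."""
    if not is_eligible(patient):
        return None
    if rng.random() >= ENROLLMENT_PROBABILITY:
        return None
    # A uniform draw gives each agent an equal expected share of enrollees.
    return rng.randrange(N_TRIAL_AGENTS)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
inpatients = [{"pregnant": False} for _ in range(10_000)]
assignments = [assign_trial(p, rng) for p in inpatients]
enrolled = [a for a in assignments if a is not None]
print(f"enrolled fraction: {len(enrolled) / len(inpatients):.3f}")
```

Applying the 10% draw after the eligibility check means the enrolled share of all inpatients falls somewhat below 10% when exclusions are common; applying it in the other order would be an equally plausible reading of the text.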

In the covid19/medications module, individuals enrolled in a clinical trial are distributed equally among the therapeutic agents (clinical trials) in Table 3 below. To select therapeutic agents for the module, the model team reviewed U.S., European, and Canadian trials posted on ClinicalTrials.gov before April 17, 2020. In light of the significant number of therapeutic agents being tested via clinical trials, we opted to include one agent representative of a potential drug target (e.g., IL-6), therapeutic class (e.g., antimalarial), or pharmacologic class (e.g., JAK inhibitor) based on available evidence of SARS-CoV-2 virology and potential drug targets. [15] The representative therapeutic agent selected for inclusion in the table is the agent with the earliest online posting date. As protocols differ across clinical trials, treatment dose in the covid19/medications module does not denote a specific trial protocol but rather common dosing listed in RxNorm, where available. The treatment frequency, duration, and target model population reflect protocol information that was available at the time of analysis. We anticipate that the therapeutic agents available and the respective dosing, frequency, and target model population will change given the highly dynamic clinical trial environment. We note that individual outcomes (survival vs death) are not based on real-world outcomes, as clinical trial data was limited at the time of analysis. The covid19/medications module will need to be updated as clinical trials proceed and outcome data becomes available.

Personal protective equipment (PPE) Supplies

PPE supplies in the covid19/supplies_hospitalization and covid19/supplies_ICU modules reflect current recommended hospital infection prevention and control measures for COVID-19. This includes the use of masks, gowns, gloves, and eye protection, in conjunction with administrative and engineering controls. [16] While mask recommendations (e.g. surgical mask or N95 respirator or PAPR) vary across institutions, for the purposes of the module we use the U.S. Centers for Disease Control and Prevention (CDC) Preferred PPE standard of N95 or higher respirator, rather than the acceptable alternative using a surgical facemask. [17] Administrative assumptions and references to determine supply quantity counts are detailed in the module remarks. PPE supply lists are available in Appendix B.

Ethical Issues in Secondary Use of Personal Health Information

An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record

Clinical Data: Sources and Types, Regulatory Constraints, Applications

The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures

Influence of simulation on electronic health record use patterns among pediatric residents

Generative adversarial network in medical imaging: A review

Reproducibility in Machine Learning for Health

The Synthetic Data Vault

Bin Cao. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study

Clinical characteristics of coronavirus disease 2019 in China

Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area

Alterations in Smell or Taste in Mildly Symptomatic Outpatients With SARS-CoV-2 Infection

Pandemic Response Hackathon. Hack COVID-19: Project Roundup. Convened by Datavant

Geomapping COVID19 Data from Hospitals using FHIR

Microsoft Hack-on-FHIR and The Synthea COVID-19 Dataset

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet

Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet

Acute Kidney Injury Work Group. KDIGO Clinical Practice Guideline for Acute Kidney Injury

Acute Respiratory Distress Syndrome: The Berlin Definition

Clinical management of severe acute respiratory infection (SARI) when COVID-19 disease is suspected: interim guidance

Coronavirus Disease 2019 (COVID-19) Treatment Guidelines. National Institutes of Health

The third international consensus definitions for sepsis and septic shock (Sepsis-3)

COVID-19, superinfections and antimicrobial development: What can we expect?

Bacterial and fungal co-infection in individuals with coronavirus: A rapid review to support COVID-19 antimicrobial prescribing

The role of chest imaging in patient management during the COVID-19 pandemic: a multinational consensus statement from the Fleischner Society

Characteristics and Clinical Outcomes of Adult Patients Hospitalized with COVID-19 -Georgia

Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area

COVID-19 and VTE/anticoagulation: frequently asked questions

Infectious Diseases Society of America Guidelines on Infection Prevention in Patients with Suspected or Known COVID-19

Pharmacologic treatments for coronavirus disease 2019 (COVID-19): a review

COVID-19 Personal Protective Equipment (PPE) for Healthcare Personnel Available at

Acute cardiac injury: defined as serum levels of cardiac biomarkers (e.g., high sensitivity cardiac troponin I) above the 99th percentile of the upper reference limit, or new ECG or echocardiogram abnormalities. [1, 2]

Acute kidney injury: defined in accordance with the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines as any of the following:
• increase in serum creatinine by ≥ 0.3 mg/dL (26.5 micromol/L)
• increase in serum creatinine to ≥ 1.5 times baseline, which is known or presumed to have occurred within the last seven days
• urine volume <0.5 mL/kg/hour for six hours. [3]

Acute respiratory failure (outcome): defined as acute failure of respiratory oxygenation or carbon dioxide elimination requiring supplemental oxygenation, non-invasive ventilation, and/or intubation and invasive mechanical ventilation.

Acute respiratory distress syndrome (ARDS): defined as an acute, inflammatory lung condition diagnosed in accordance with the Berlin Definition, in which clinical diagnosis is determined by syndrome timing, chest imaging, edema origin, and oxygenation. [4]

Bacterial infection: defined as showing clinical symptoms or signs concerning for bacterial respiratory infection or bacteremia and a positive culture obtained from lower respiratory tract specimens (qualified sputum, endotracheal aspirate, or bronchoalveolar lavage fluid) or blood. [1, 2]

Hypoxemia (reason): defined as a below-normal oxygen saturation measured by arterial blood gas or peripheral pulse oximetry, often requiring supplemental oxygen therapy.

Illness Severity: individual illness severity is defined as follows:
• General severity: individuals with mild illness defined by non-specific signs and symptoms (e.g., fever, cough, sore throat, malaise, headache, muscle pain) with no shortness of breath or signs of lower respiratory disease by clinical assessment or abnormal imaging, or individuals with moderate illness defined by lower respiratory disease or pneumonia by clinical assessment or imaging and a peripheral oxygen saturation (SpO2) >93% on room air. [5, 6]
• Severe severity: individuals with a respiratory rate >30 breaths per minute, SpO2 ≤93% on room air at sea level, ratio of arterial partial pressure of oxygen to fraction of inspired oxygen (PaO2/FiO2) <300, or lung infiltrates >50%. [5, 6]
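The illness severity tiers can be sketched as a simple classifier. This is an illustration under stated assumptions: the function name, parameter names, and the choice to treat missing PaO2/FiO2 or infiltrate measurements as not meeting the severe threshold are hypothetical, and the critical tier uses the criteria given elsewhere in this document (mechanical ventilation, septic shock, or multiple organ dysfunction). It is not the logic of the covid19/determine_risk module.

```python
from typing import Optional


def classify_severity(
    resp_rate: float,                      # breaths per minute
    spo2: float,                           # percent, on room air
    pf_ratio: Optional[float] = None,      # PaO2/FiO2, if measured
    infiltrates_pct: Optional[float] = None,  # lung infiltrates, percent
    needs_ventilation: bool = False,
    septic_shock: bool = False,
    organ_dysfunction: bool = False,
) -> str:
    """Return 'critical', 'severe', or 'general' for one patient."""
    # Critical: respiratory failure needing ventilation, septic shock,
    # and/or multiple organ dysfunction.
    if needs_ventilation or septic_shock or organ_dysfunction:
        return "critical"
    # Severe: any one of the four listed thresholds is met.
    if (
        resp_rate > 30
        or spo2 <= 93
        or (pf_ratio is not None and pf_ratio < 300)
        or (infiltrates_pct is not None and infiltrates_pct > 50)
    ):
        return "severe"
    # General: mild or moderate illness with SpO2 above 93%.
    return "general"


print(classify_severity(resp_rate=18, spo2=97))
print(classify_severity(resp_rate=32, spo2=97))
print(classify_severity(resp_rate=18, spo2=97, needs_ventilation=True))
```

The tiers are tested from most to least severe so that a patient meeting several criteria is assigned the highest applicable tier.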

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.