key: cord-0951939-9igv2din authors: Rinderknecht, M. D.; Klopfenstein, Y. title: Predicting Critical State after COVID-19 Diagnosis Using Real-World Data from 20152 Confirmed US Cases date: 2020-07-27 journal: nan DOI: 10.1101/2020.07.24.20155192 sha: 50743502ca824f9e0df083321b6944aa7bb2cd76 doc_id: 951939 cord_uid: 9igv2din The global COVID-19 pandemic caused by the virus SARS-CoV-2 has led to over 10 million confirmed cases and half a million deaths, and is challenging healthcare systems worldwide. With limited medical resources, early identification of patients with a high risk of progression to severe disease or a critical state is crucial. We present a prognostic model predicting critical state within 28 days following COVID-19 diagnosis trained on data from US electronic health records (EHR) within IBM Explorys, including demographics, comorbidities, symptoms, laboratory test results, insurance types, and hospitalization. Our entire cohort included 20152 COVID-19 cases, of which 3160 patients went into critical state or died. Random, stratified train-test splits were repeated 100 times to obtain a distribution of performance. The median and interquartile range of the areas under the receiver operating characteristic curve (ROC AUC) and the precision recall curve (PR AUC) were 0.863 [0.857, 0.866] and 0.539 [0.526, 0.550], respectively. Optimizing the decision threshold led to a sensitivity of 0.796 [0.775, 0.821] and a specificity of 0.784 [0.769, 0.805]. Good model calibration was achieved, showing only a minor tendency to over-forecast probabilities above 0.6. The validity of the model was demonstrated by the interpretability analysis confirming existing evidence on major risk factors (e.g., higher age and weight, male gender, diabetes, cardiovascular disease, and chronic kidney disease). The analysis also revealed higher risk for African Americans and "self-pay patients".
To the best of our knowledge, this is the largest dataset based on EHR used to create a prognosis model for COVID-19. In contrast to large-scale statistics computing odds ratios for individual risk factors, the present model, which combines a rich set of covariates, can provide accurate personalized predictions, enabling early treatment to prevent patients from progressing to a severe or critical state.
and MERS in 2012 (Peeri et al., 2020), the overall number of 12 123 257 confirmed cases and 551 384 deaths from COVID-19 (Johns Hopkins University (JHU), 2020) (status July 9, 2020) far outweighs the other two epidemics. These high numbers have forced governments to respond with severe containment strategies to delay the spread of COVID-19 in order to avoid a global health crisis and collapse of healthcare systems (Armocida et al., 2020). Several countries have been facing shortages of intensive care beds or medical equipment such as ventilators (Ranney et al., 2020). Given these circumstances, appropriate diagnostic and prognostic tools for identifying high-risk populations and supporting triage are essential for informed protection policies by policymakers and for optimal allocation of resources to ensure the best possible care (e.g., early treatments) for patients. Today's availability of data enables the development of different solutions using machine learning to address these needs, as described in the recent reviews by Bullock et al. (2020) and Wynants et al. (2020).
One type of proposed solution is prognostic prediction modeling, which consists of predicting patient outcomes such as hospitalization or exacerbation to a critical state, using longitudinal data from medical healthcare records of COVID-19 patients (Feng et al., 2020; Ferrari et al., 2020; Gong et al., 2020; Haimovich et al., 2020; Jiang et al., 2020; Liu et al., 2020a; Petrilli et al., 2020; Vaid et al., 2020; Xie et al., 2020; Yan et al., 2020a) or proxy datasets based on other upper respiratory infections (DeCaprio et al., 2020). To date, most studies include data exclusively from one or a few hospitals and therefore have relatively small sample sizes of confirmed COVID-19 patients (i.e., below 1000 patients), with the exception of the retrospective studies in New York City by Petrilli et al. (2020) with 4103 patients and by Vaid et al. (2020) with a total of 3055 patients. The aim of this work was to create a prognostic prediction model for critical state after COVID-19 diagnosis based on a retrospective analysis of a large set of de-identified electronic health records (EHRs) of patients across the US using the IBM® Explorys® database (IBM, Armonk, NY). Such a predictive model allows identifying patients at risk based on predictive factors to support risk stratification and enable early triage. This work was achieved by using the RWE Insights Platform, a data science platform for analyses of medical real-world data to generate real-world evidence (RWE), recently developed by IBM. The RWE Insights Platform is a data science pipeline facilitating the setup, execution, and reporting of analyses of medical real-world data to discover RWE insights in an accelerated way. The platform architecture is fully modular so that it can scale to include different types of analyses (e.g., treatment pathway analysis, treatment response predictor analysis, comorbidity development analysis) and interface with different data sources (e.g., the Explorys database).
For the present use case of COVID-19 prognosis prediction, we used the comorbidity development analysis, which allows defining a cohort, an outcome to be predicted, a set of predictors, and relative time windows for the extraction of the samples from the data source. New data-extraction modules for the specific disease, outcome, treatments, and variables of the current use case were developed. The RWE Insights Platform has been developed using open-source tools and includes a front end based on HTML and CSS interfacing via a Flask RESTful API with a Python back end (Python 3.6.7) using the following main libraries: imbalanced-learn 0.6.2, numpy 1.15.4, pandas 0.23.4, scikit-learn 0.20.1, scipy 1.1.0, shap 0.35.0, statsmodels 0.90.0, and xgboost 0.90. The platform is proprietary software owned by IBM. A detailed description of the RWE Insights Platform is beyond the scope of this publication. Our work was based on de-identified data from the Explorys database. The Explorys database is one of the largest clinical datasets in the world, containing EHRs of around 64 million patients and spanning over 360 hospitals across the US as well as over 920 000 providers (Watson Health, IBM Corporation, 2016). Data were standardized and normalized using common ontologies, searchable through a Health Insurance Portability and Accountability Act (HIPAA)-enabled, de-identified dataset from IBM Explorys. Individuals were seen in multiple primary and secondary healthcare systems from 1999 to 2020 with a combination of data from clinical electronic medical records, healthcare system outgoing bills, and adjudicated payer claims. The de-identified EHR data include patient demographics, diagnoses, procedures, prescribed drugs, vitals, and laboratory test results. Hundreds of billions of clinical, operational, and financial data elements are processed, mapped, and classified into common standards (e.g., ICD, SNOMED, LOINC, and RxNorm) within the data lake.
As Explorys is updated continuously, a view of the database was created and frozen on July 16, 2020 for reproducibility of this work. The cohort included all patients in the Explorys database with a diagnosis of COVID-19 since December 1, 2019. As the new ICD-10 (International Classification of Diseases) code U07.1 for confirmed COVID-19 cases was created and pre-released a couple of months after pandemic onset, hospitals may have used other, already existing ICD codes related to coronavirus for early cases. The December 2019 cutoff was instituted to be consistent with the spread of COVID-19 in the US and to limit inclusion of patients who may have been diagnosed with other forms of coronavirus besides SARS-CoV-2. The ICD codes used to create the cohort are listed in Table 1. In case of multiple entries per patient after December 1, 2019, the first entry date was used as the COVID-19 diagnosis date. In order to have enough data to extract the patient's outcome, the diagnosis date had to be at least 7 weeks before the freeze date of the database (July 16, 2020), as it may take up to 7 weeks from symptom onset to death. LOINC codes for SARS-CoV-2 tests (e.g., 94500-6, 94309-2, 94502-2) (LOINC, 2020) with positive results available in the database were not used, as patients may have gone to a provider within the Explorys network to perform the test, but may have been treated in another hospital not covered by Explorys. This would add a large number of subjects without known outcome and generate unreliable data for training the model. In contrast, patients with a diagnosis based on an ICD code may have a higher chance of being treated or followed up in the same hospital. Critical state was used as a binary prediction target and included sepsis, septic shock, and respiratory failure (e.g., acute respiratory distress syndrome (ARDS)) (WHO, 2020). Severe sepsis is associated with multiple organ dysfunction syndrome.
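The cohort date rules described above (first entry since December 1, 2019 and at least 7 weeks of follow-up before the July 16, 2020 freeze) can be sketched as follows. This is a minimal illustration, not the platform's extraction code: the function name `covid_diagnosis_date` is hypothetical, and treating December 1, 2019 itself as inside the cohort period is an assumption.

```python
from datetime import date, timedelta
from typing import Iterable, Optional

COHORT_START = date(2019, 12, 1)    # diagnoses since December 1, 2019
FREEZE_DATE = date(2020, 7, 16)     # date the database view was frozen
MIN_FOLLOW_UP = timedelta(weeks=7)  # required follow-up before the freeze


def covid_diagnosis_date(entry_dates: Iterable[date]) -> Optional[date]:
    """Return the COVID-19 diagnosis date, or None if the patient is not eligible.

    Assumption: December 1, 2019 itself counts as inside the cohort period.
    """
    in_period = [d for d in entry_dates if d >= COHORT_START]
    if not in_period:
        return None
    first_entry = min(in_period)  # the first entry is used as the diagnosis date
    latest_allowed = FREEZE_DATE - MIN_FOLLOW_UP
    return first_entry if first_entry <= latest_allowed else None
```

Under these rules, the latest admissible diagnosis date is May 28, 2020, i.e., 49 days before the freeze date.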
The precise definition based on ICD codes used for critical state is listed in Table 2. In case of multiple entries for a patient, the first entry was retained. In addition, the date of the entry for critical state had to be in a window of [0, +28] days (boundaries included) after the diagnosis date to be eligible (as illustrated in Figure 1). Four weeks were chosen to ensure coverage of the majority of critical outcomes, as the interquartile ranges of time from illness onset to sepsis and ARDS were reported to be [7, 13] and [8, 15] days, respectively. Patients with an eligible entry for critical state were labeled as entering critical state, whereas patients eligible based on cohort definitions without any entry for critical state were labeled as not entering critical state. One exception to these rules was patients who were flagged as deceased in the Explorys database. Death dates and records with diagnoses and procedures relating to the patient's death are not available in Explorys, in order to avoid re-identification of patients and ensure data privacy. To include death cases potentially related to COVID-19 in the critical state group, patients with one of the following conditions were therefore also labeled as entering critical state: deceased with an entry for critical state within the window, deceased with an entry for critical state within and after the window, or deceased without any entry for critical state (thus excluding deceased patients with an entry for critical state before the window). In the latter case, the date was set to the end of the window for critical state entries. To validate these assumptions, the proportion of patients assumed to be deceased due to COVID-19 in our cohort was compared to epidemiological numbers. Features were mainly grouped into "acute" features and "chronic" features.
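The labeling rules for the prediction target described above can be sketched as a small function. This is an illustrative reading of the text, not the authors' code: the name `label_critical_state` is hypothetical, and cases the text does not spell out (a first critical-state entry outside the window) are returned as `None` rather than guessed.

```python
from datetime import date, timedelta
from typing import Optional, Sequence, Tuple

WINDOW = timedelta(days=28)  # [0, +28] days after diagnosis, boundaries included


def label_critical_state(
    diagnosis: date, critical_entries: Sequence[date], deceased: bool
) -> Tuple[Optional[bool], Optional[date]]:
    """Return (label, event_date) for one patient.

    label is True (entering critical state), False (not entering), or
    None when the first entry falls outside the window, a case this
    sketch deliberately leaves undecided.
    """
    window_end = diagnosis + WINDOW
    if not critical_entries:
        if deceased:
            # Deceased without any entry: positive, dated at the end of the window
            return True, window_end
        return False, None
    first_entry = min(critical_entries)  # only the first entry is retained
    if diagnosis <= first_entry <= window_end:
        return True, first_entry
    return None, None
```

A deceased patient whose first critical-state entry predates the diagnosis falls into the undecided branch here, which mirrors the exclusion of such patients from the critical state group in the text.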
Acute features are a set of features which should be temporally close to the COVID-19 diagnosis (e.g., recent laboratory tests, symptoms potentially related to COVID-19, or hospitalization prior to the diagnosis), whereas chronic features are a set of features which have no direct temporal relation to the COVID-19 diagnosis (e.g., chronic comorbidities, measurable demographics, or long-term habits). Features were selected based on their appearance in literature on potential risk factors and predictors related to COVID-19. Figure 1 illustrates their difference in terms of time windows for extraction. Negative values for time window boundaries denote dates prior to the reference date (e.g., prior to the diagnosis date). Ideally, for higher consistency, acute features would have been recorded at the diagnosis date. However, this may not always be the case in EHR data, in contrast to data from clinical studies. To account for symptoms recorded prior to the diagnosis (e.g., through telemedicine before performing a SARS-CoV-2 test, or because false negatives required multiple tests and delayed the diagnosis), a time window of [−14, 0] days relative to the diagnosis date was used to extract acute features. Patients were considered hospitalized (inpatient) if the reported admission-discharge period of the hospitalization overlapped with the acute feature extraction time window. Entries for chronic features were considered if prior to the diagnosis date, without additional restriction. Demographic features which were not restricted to any time window (e.g., gender or race) or required a special way of extraction/computation (e.g., age) are grouped as "special" features and are not represented in Figure 1. As part of the de-identification process, for patients over 90 years of age, the age is truncated to 90 years. Similarly, the age of all patients born within the last 365 days is set to 0 years. The full list of features including their definitions (e.g., based on ICD or LOINC codes) is provided in Table 3, grouped by extraction time window type. As feature entries (especially relevant for chronic features) may have been entered several years ago, ICD-9 codes were used as well for the extraction. In general, the last entry within the specific extraction time window was used to construct the feature, except if described otherwise in Table 3. The full dataset was constructed based on COVID-19 diagnosis, including binary prediction target labels for critical state, and enriched by the various features. Patients with missing age or gender information were removed from the dataset, and all missing binary features (i.e., obtained from ICD code entries) of Table 3 were imputed with zero. Descriptive distribution statistics were created for all features, and features with more than 90% missing values were removed from the feature set. For the remaining feature set, the concurvity (non-linear collinearity) among features was assessed using Kendall's τ, a non-parametric measure of correlation. In case of |τ| > 0.7 (Dormann et al., 2013), the feature with more missing values was removed from the feature set. In case of an equal number of missing values, the feature with the higher mean was removed in order to keep the minorities and make the larger group part of the predicted probability baseline.
Table 2. ICD-10 codes for the prediction target. Patients with first diagnosis of any of the listed ICD-10 codes within the specified time window were labeled as entering critical state.
medRxiv preprint doi: https://doi.org/10.1101/2020.07.24.20155192; this version posted July 27, 2020. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
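The feature-reduction rules (drop features with more than 90% missing values, then resolve pairs with |τ| > 0.7) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the tie-corrected τ-b variant, pairwise-complete handling of missing values, the helper names, and the toy data are all illustrative choices, not details confirmed by the text.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def missing_fraction(col):
    return sum(v is None for v in col) / len(col)


def kendall_tau_b(x, y):
    """Kendall's tau-b on pairwise-complete observations (O(n^2) sketch)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    conc = disc = ties_x = ties_y = 0
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        dx, dy = a1 - a2, b1 - b2
        if dx == 0 and dy == 0:
            continue            # tied in both variables: ignored by tau-b
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = sqrt((conc + disc + ties_x) * (conc + disc + ties_y))
    return (conc - disc) / denom if denom else 0.0


def reduce_features(table, max_missing=0.9, max_abs_tau=0.7):
    """Return the names of the features kept after both reduction steps."""
    kept = {n: c for n, c in table.items() if missing_fraction(c) <= max_missing}
    dropped = set()
    for a, b in combinations(list(kept), 2):
        if a in dropped or b in dropped:
            continue
        if abs(kendall_tau_b(kept[a], kept[b])) > max_abs_tau:
            miss_a, miss_b = missing_fraction(kept[a]), missing_fraction(kept[b])
            if miss_a != miss_b:
                dropped.add(a if miss_a > miss_b else b)  # more missing values
            else:
                mean_a = mean(v for v in kept[a] if v is not None)
                mean_b = mean(v for v in kept[b] if v is not None)
                dropped.add(a if mean_a > mean_b else b)  # higher mean
    return [n for n in kept if n not in dropped]


# Toy data (made up for illustration only)
toy = {
    "age": [30, 45, 50, 62, 71, 28, 39, 55, 80, 66],
    "race_african_american": [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    "race_caucasian": [0, 1, 0, 1, 0, 1, 1, 1, 1, 1],
    "c_reactive_protein": [None] * 10,
}
kept_features = reduce_features(toy)
```

In the toy data the two race indicators are perfectly anti-correlated with no missing values, so the one with the higher mean (the majority group) is removed, which corresponds to making the larger group part of the baseline.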
To train and evaluate the model, the dataset was split into a train set (80%) and test set (20%) using stratification of the prediction target. This procedure was repeated 100 times based on different random seeds to get a distribution and confidence intervals of the model performance and feature importance, as performance may change depending on the choice of splits. For each random split the following steps were executed: The non-binary features of the train set and the test set were imputed based on the feature medians of the train set to avoid data leakage. An XGBoost model was trained on the train set using default parameters of the XGBoost Python package without additional hyperparameter tuning. XGBoost is a decision-tree-based ensemble machine learning algorithm using a gradient boosting framework. Gradient tree boosting models have been shown to outperform other types of models on a large set of benchmarking datasets (Olson et al., 2018). The trained XGBoost model was subsequently used to create predictions for the test set. The performance of the model was evaluated on the test set for each random train-test split seed and reported with median and interquartile range across seeds. This provides a distribution of expected performance if a new model were trained on similar data. The following metrics were computed: receiver operating characteristic (ROC) curve and precision recall (PR) curve as well as their respective areas under the curve (ROC AUC and PR AUC). The confusion matrix, sensitivity, and specificity were reported for the optimal probability classification threshold. This threshold was obtained by maximizing Youden's J statistic, which, like the geometric mean of sensitivity and specificity used as a metric for imbalanced classification, seeks a balance between sensitivity and specificity.
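The threshold selection step can be illustrated with a short sketch. Youden's J is defined as J = sensitivity + specificity − 1; the function below scans every observed predicted probability as a candidate threshold and keeps the one with the largest J. The function name and the toy labels and probabilities are hypothetical, and the sketch assumes both classes are present.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    A prediction is positive when its probability is >= threshold.
    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy example (made-up labels and predicted probabilities)
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]
threshold, sensitivity, specificity = youden_threshold(labels, probs)
```

In this toy example the selected threshold is 0.35, which captures all three positives while misclassifying one of the five negatives. In practice the same scan would be applied to the test-set predictions of each train-test split.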
Furthermore, the calibration of the model was reported, comparing binned mean predicted values (i.e., probabilities) to the actual fraction of positives (labeled as critical state) (Van Calster et al., 2019), in order to evaluate whether the predicted probability is realistic and can provide some confidence in the prediction. Model interpretability was assessed using Tree SHAP (Lundberg et al., 2020), a version of SHAP (SHapley Additive exPlanations) optimized for tree-based models. SHAP is a framework to explain the contribution of feature values to the output of individual predictions by any type of model and to compute the global importance of features. This individual contribution is expressed as a SHAP value, corresponding to log-odds (the output of the trees in XGBoost) before they are converted into probabilities with a logistic function. The global feature importance as well as a summary plot of individual contributions including feature values were created. In our case, a positive SHAP value indicates a contribution towards increased probability of critical state, whereas a negative SHAP value indicates a reduction of the probability of critical state. The total number of identified patients diagnosed with COVID-19, the number of patients with age and gender information (referred to as the cohort), the number of patients labeled as not entering critical state and labeled as entering critical state, as well as the sizes of the partitions for training and testing are reported in the schematic in Figure 2. Among patients labeled as critical state, a total of 1009 patients were flagged as deceased in the Explorys database. This corresponds to 5.0% of the entire cohort. Figure 3 shows the distribution of included Explorys patients with COVID-19 diagnosis across the US. The majority of the patients are in the states LA (43.9%), OH (25.8%), DC (7.5%), FL (7.3%), and MD (7.3%).
In comparison, the percentages of all recorded patients (i.e., including non-COVID cases) in Explorys for these states are: LA (4.5%), OH (24.3%), DC (1.0%), FL (5.2%), and MD (4.3%). Descriptive statistics after zero-imputation of the binary features and before feature reduction are reported in Table 4. Based on these results, the following features were removed due to a too high proportion of missing data: C-reactive protein, C-reactive protein (high sensitivity method), Lactate dehydrogenase (L>P), Lactate dehydrogenase (P>L), Lymphocytes (/100 leukocytes in blood), Lymphocytes (/blood volume), Neutrophils (/100 leukocytes in blood), Neutrophils (/blood volume), and Oxygen saturation. Rank correlations across features after removing features based on the threshold for missing data are shown in the heatmap in Figure 4. The following feature combinations showed a strong rank correlation: {Race (African American), Race (Caucasian)} and {BMI, Weight}, from which the following features were removed due to a higher proportion of missing data or a higher mean: Race (Caucasian) and BMI. The performance and calibration of the model were evaluated on the 4031 patients of the test set for each train-test split seed.
The ROC AUC and PR AUC across different seeds were 0.863 [0.857, 0.866] and 0.539 [0.526, 0.550], respectively. Figure 5 shows their distributions, together with the ROC curve and the precision recall curve. The confusion matrix for the identified optimal classification threshold is shown in Figure 6. Figure 8 shows the results of the model interpretability analysis based on Tree SHAP. Pneumonia and older age are by far the principal predictors for critical state. The main features contributing to a higher probability of critical state in case of high feature values or presence are (in decreasing order of global feature importance): pneumonia, older age, hospitalization (inpatient), weight, shortness of breath, diabetes, race (African American), and cardiovascular disease. The main features leading to a lower probability of critical state are female gender and cough. Note that for binary features "max" feature values correspond to 1 (e.g., presence of the feature). In the case of gender, 1 corresponds to female (see Table 3).
Table 4. Descriptive statistics of the features. The descriptive statistics are based on the full dataset after zero-imputation of the binary features but before feature reduction. The percentages 25%, 50%, and 75% refer to the first (Q1), second (median), and third quartiles (Q3). Note that for binary features the Mean column represents the proportion of positive entries. Note that as part of Explorys' de-identification process the feature Age has a ceiling effect at 90 years, and the age of all patients born in the last 365 days is reported as zero. For gender, 1 corresponds to female.
In this work, a prognostic model was created based on real-world data from 16121 patients to predict, at COVID-19 diagnosis, whether patients will enter a critical state within the next 28 days. In addition to demographic, clinical, and laboratory data, hospitalization and insurance types were used as predictors. Our results based on 4031 new patients unseen during training showed high predictive performance (sensitivity of 0.796 and specificity of 0.784) and well-calibrated output probabilities. Furthermore, the interpretability analysis identified pneumonia, older age, hospitalization (inpatient), weight, shortness of breath, diabetes, race (African American), and cardiovascular disease as the main risk factors, and female gender and cough as risk-reducing factors. More than 20 000 US patients diagnosed with COVID-19 met the inclusion criteria. To the best of our knowledge, this is the largest cohort used to date in a retrospective analysis for predictive modeling based on real-world data. As highlighted in Figure 3, close to half of the cases were reported in Louisiana and one fourth in Ohio. This comes from the fact that the Explorys database has major contributors on the East Coast of the United States. Therefore, our cohort may not be fully representative of the entire US population.
The definitions used for severe state or critical state vary across different sources (e.g., intubation prior to ICU admission, discharge to hospice, or death (Vaid et al., 2020), moderate to severe respiratory failure (Ferrari et al., 2020), oxygen requirement greater than 10 L/min or death (Haimovich et al., 2020)), or are not described in detail. Based on the definition by the WHO (2020) including sepsis, septic shock, and respiratory failure (e.g., ARDS), the proportion of patients entering critical state (15.7%) in our study is within the range of prevalence (12.6% to 23.5%) reported in a review covering 21 studies. Similarly, case fatality rates vary across US states and countries, as they directly depend on factors such as the number of tested people, demographics, socioeconomics, or healthcare system capacities. The death rate for the entire US is estimated to be 4.3% (Johns Hopkins University (JHU), 2020) (status July 9, 2020), while for Louisiana, and in particular for New Orleans, it is higher, at around 4.6% (Louisiana Department of Health, 2020) and 6.5% (Johns Hopkins University (JHU), 2020), respectively. In the present work, the reported proportion of people assumed to be deceased because of COVID-19 is 5.0%. These minor differences may be explained in part by the fact that in these sources the outcome (i.e., potential death) of recently confirmed cases is not yet known when computing the case fatality rate, hence leading to underestimation. As our analysis enforces at least 7 weeks of data after the diagnosis date, increasing the chances of knowing the patients' outcomes, we are able to reduce this underestimation.
Nevertheless, death rates based on Explorys should be cautiously interpreted, as death is not reliably reported. Regarding demographics of our cohort, there are only minor dissimilarities to numbers reported by the Centers for Disease Control and Prevention (CDC) or US states. The interquartile range of the age distribution of our cohort (37-66 years) is close to the 33-63 years reported for COVID-19 cases across the entire US (Stokes et al., 2020). The racial breakdown varies strongly across different US states. The fact that Louisiana (and in particular New Orleans) is a main contributor to the Explorys network also explains the high proportion of African Americans. Given that Caucasians and African Americans together represent 87.4% of the dataset, there is a strong negative correlation between the two features, for which reason the majority group (race (Caucasian)) was considered as baseline and removed from the feature set. The proportion of female cases (57.6%) is more pronounced than in the US-wide incidences of 406 (female) and 401 (male) cases per 100 000 persons, which also show a marginally higher rate for females than for males (Stokes et al., 2020). The higher count of COVID-19 cases among females (especially among African Americans) in the US state of Georgia (Georgia Department of Public Health, 2020), which has a similar racial distribution to Louisiana, might also support our larger proportion of female cases, as African Americans cover almost half of our dataset. Since the medical system captured by Explorys is separate from the billing system, it can be expected that information on insurance types is not widely available. As a matter of fact, less than 10% of patients have a reported insurance type. Nevertheless, it has been shown to be of value (in particular knowing if the insurance is self-pay or not) for predicting critical state. The most common underlying comorbidities identified through ICD codes in our cohort are hypertension, obesity, cardiovascular disease, diabetes, and chronic lung disease (includes asthma and chronic obstructive pulmonary disease). As this is in line with statistics from the CDC (Stokes et al., 2020; Garg et al., 2020) as well as other studies conducted in China (e.g., Zhou et al. (2020)) and the prevalence of such features is not affected by any time window restrictions (i.e., the entire patient history was considered), it confirms the validity of the Explorys data.
Figure 8. Model interpretability. Left: Box plots (across different seeds) of the average absolute impact of features on the model output magnitude (in log-odds) ordered by decreasing feature importance. Right: Illustration of the relation between feature values and impact (in terms of magnitude and direction) on prediction output (all seeds pooled). Each dot represents an individual patient in the test set. The color of each point corresponds to the normalized feature value (min-max normalization on test set). As an example for continuous features, older patients tend to have a higher SHAP value. For binary features, the maximum feature value 1 corresponds to presence of the feature, and 0 to absence of the feature. For gender, 1 corresponds to female.
In contrast, all quantitative measurements identified through LOINC codes (e.g., body temperature or C-reactive protein) are missing in more than 75% of the COVID-19 cases (except for BMI and weight), and for most features the proportion even exceeds 95%. This observation could partially be explained by the fact that the acute time window only captures entries before the COVID-19 diagnosis, that lab test results are often known only after the diagnosis date, and that tests are mainly performed on hospitalized patients or patients in a severe state. Moreover, some tests might not be commonly performed in the hospitals within the Explorys network or they might simply not be reported. Due to the high number of missing values, the majority of the quantitative measurements were excluded from modeling. Even though the lack of these features may compromise the performance of the model, it simplifies the model and increases its practical usability for early predictions before lab tests are performed extensively. Since the aim of the present work is to develop a model for predictions at the time point of COVID-19 diagnosis, symptoms identified through ICD codes (e.g., fever or cough) are only extracted from the 14 days prior to the COVID-19 diagnosis. As the COVID-19 diagnosis may be early or late in the disease progression, either early or late symptoms may be captured depending on the case. However, due to the time window restriction, the prevalence of reported symptoms tends to be lower compared to statistics including reported symptoms during the entire course of the disease (Stokes et al., 2020). Despite these lower numbers, the most common symptoms in our cohort, namely cough, fever, and shortness of breath, are confirmed by other reports and studies (Stokes et al., 2020; Yang et al., 2020).
Overall, the size and quality of the EHR dataset based on the Explorys database demonstrate high value with regard to demographics, chronic features, and acute symptoms in most cases, but the dataset is less suitable for laboratory test results extracted by LOINC codes. To avoid the unreliable use of quantitative features with a significant proportion of missing values, such features were removed for the prognostic modeling. Although our dataset is based on fragmented real-world data with a high proportion of missing data, our prognostic model shows excellent performance in terms of ROC AUC (0.863 [0.857, 0.866]) (Mandrekar, 2010) and a substantial improvement of the PR AUC (0.539 [0.526, 0.550]) compared to chance level (0.157). Optimizing the decision threshold by maximizing Youden's J statistic led to a sensitivity of 0.796 [0.775, 0.821] and a specificity of 0.784 [0.769, 0.805]. Depending on the medical requirements for the prognostic model in terms of sensitivity and specificity, the threshold could easily be adjusted for a real application. As different types of datasets, inclusion/exclusion criteria, features, and prediction target definitions were used in other papers presenting the development of models predicting COVID-19 critical state (e.g., Haimovich et al. (2020); Vaid et al. (2020), or see the review by Wynants et al. (2020)), a direct performance comparison is difficult (reported metrics were in the following ranges: ROC AUC 0.81-0.99, PR AUC 0.56-0.71, sensitivity 0.70-0.94, specificity 0.75-0.85). Furthermore, some publications do not mention metrics (e.g., PR AUC, or sensitivity and specificity) required to properly evaluate performance on an imbalanced dataset, which is the case for this type of COVID-19 prognosis.
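The chance level of 0.157 quoted above for the PR AUC corresponds to the prevalence of the positive class, since a classifier that ranks patients at random has an expected precision equal to the fraction of positives. The short check below recomputes this and a few related proportions from counts reported elsewhere in this paper; the variable names are illustrative only.

```python
# Counts reported in the paper
cohort = 20152        # patients in the cohort
critical = 3160       # patients labeled as entering critical state
deceased = 1009       # patients flagged as deceased
train, test = 16121, 4031

prevalence = critical / cohort        # chance level of the PR AUC
deceased_share = deceased / cohort    # proportion assumed deceased due to COVID-19
test_share = test / cohort            # size of the test partition
pr_auc_lift = 0.539 / prevalence      # median PR AUC relative to chance level

print(f"prevalence     = {prevalence:.3f}")
print(f"deceased share = {deceased_share:.3f}")
print(f"test share     = {test_share:.3f}")
print(f"PR AUC lift    = {pr_auc_lift:.1f}x")
```

The train and test partitions sum to the full cohort, and the median PR AUC is roughly 3.4 times the chance level.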
Unlike other papers (Ferrari et al., 2020; Haimovich et al., 2020; Vaid et al., 2020), which usually perform a cross-validation or use a limited number of independent test sets, the present approach used random, stratified train-test splits repeated 100 times to obtain a distribution of performance. Such an approach has the advantage of providing a better understanding of the generalizability of the model and the robustness of the performance estimate, as a single test set is likely to underestimate or overestimate the real performance when the test set is small. Even though our model was trained on data coming from many hospitals, whereas other work is based on a single or a limited number of contributors, an external validation should be performed to better assess its generalizability. Most publications on prognosis prediction models do not report model calibration (Wynants et al., 2020), with the exception of a few (Haimovich et al., 2020; Xie et al., 2020). The present model based on the Explorys dataset is well calibrated, showing only a minor tendency to over-forecast probabilities above 0.6. We hypothesize that this over-forecast arises because treatment features were not included in the model. Assuming that treatments reduce the probability of entering critical state, the model will overestimate the probability for treated patients, as this information is not available to it. In any case, an over-forecast accentuating cases with relatively high probability is preferable to an under-forecast, where patients with a high probability of critical state may not be identified. Overall, our prognostic model shows excellent performance and has the advantage of providing a calibrated risk score instead of a binary classification, which could potentially help healthcare professionals make better decisions to improve patients' outcomes when diagnosed with COVID-19.
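The repeated, stratified train-test evaluation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the test fraction of 0.2 is assumed, and `fit_mean_difference` is a toy stand-in for the actual prognostic model. Only the idea of 100 random stratified splits summarized by median and interquartile range comes from the paper.

```python
import random
import statistics


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) preserving the class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference(X, y):
    """Toy stand-in model (assumption): weights = difference of class means."""
    def class_mean(cls, d):
        return statistics.fmean(x[d] for x, t in zip(X, y) if t == cls)
    return [class_mean(1, d) - class_mean(0, d) for d in range(len(X[0]))]


rng = random.Random(0)
# Synthetic, imbalanced data (16% positives), purely illustrative.
y = [1] * 80 + [0] * 420
X = [[rng.gauss(1.0 if t else 0.0, 1.0), rng.gauss(0.5 if t else 0.0, 1.0)] for t in y]

aucs = []
for _ in range(100):  # 100 random, stratified train-test splits
    train_idx, test_idx = stratified_split(y, 0.2, rng)
    w = fit_mean_difference([X[i] for i in train_idx], [y[i] for i in train_idx])
    scores = [sum(wi * xi for wi, xi in zip(w, X[i])) for i in test_idx]
    aucs.append(roc_auc([y[i] for i in test_idx], scores))

# Summarize the distribution as median [first quartile, third quartile].
q1, median, q3 = statistics.quantiles(aucs, n=4)
print(f"ROC AUC: {median:.3f} [{q1:.3f}, {q3:.3f}]")
```

Reporting the spread across splits, rather than a single number, shows how much the estimate depends on which cases happen to land in the test set.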
The copyright holder for this preprint (this version posted July 27, 2020; https://doi.org/10.1101) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

It is not surprising to see pneumonia among the top features, as pneumonia is a diagnosis defining moderate and severe cases (WHO, 2020), which are precursor stages for critical state due to COVID-19 disease. The results from a study with 1099 patients showed that patients with severe disease had a higher incidence of physician-diagnosed pneumonia than those with nonsevere disease. Increased age (e.g., above 65 years) has been confirmed by many studies to be an important risk factor for progression to grade IV and V on the pneumonia severity index and for mortality of COVID-19 patients (Liu et al., 2020b; Mehra et al., 2020). The developed model also reproduced existing results showing that men are, despite similar prevalence to women, more at risk for worse disease severity, independent of age (Jin et al., 2020). Similarly, obesity has been identified as a factor increasing the probability of higher disease severity and lethality (Petrakis et al., 2020; Lighter et al., 2020; Guan et al., 2020). While according to our interpretability analysis the feature obesity shows marginal importance in the output of the model, the feature weight is among the top features leading to high risk (in case of high weight). It can be assumed that the feature obesity, with a prevalence of 29% in our dataset compared to an age-adjusted prevalence of obesity in the US of around 35% (Flegal et al., 2012), is underreported in the EHR data of our cohort. At an average US male height of 175 cm (Fryar et al., 2018), the median weight and median BMI in our dataset are very close to the threshold from overweight to obesity (BMI > 30 kg/m²). Hence, it can be inferred that approximately 50% of our patients are obese.
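The weight corresponding to the obesity threshold at the cited average height follows directly from the BMI definition (weight in kg divided by squared height in m). The short sketch below only reproduces this arithmetic; the helper name `weight_at_bmi` is a hypothetical label, and the cohort's median weight itself is not restated here.

```python
def weight_at_bmi(bmi_kg_m2, height_cm):
    """Weight in kg that yields the given BMI at the given height."""
    height_m = height_cm / 100
    return bmi_kg_m2 * height_m ** 2


# BMI threshold of 30 kg/m^2 at the cited average US male height of 175 cm.
obesity_weight_kg = weight_at_bmi(30, 175)
print(obesity_weight_kg)  # 30 * 1.75**2 = 91.875
```

A median weight close to this value would place roughly half of the patients of that height above the obesity threshold, which is the reasoning behind the estimate in the text.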
In addition, the weight feature is a continuous variable with only 21.3% missing entries; it thus carries more information and, as a result, shows higher predictive importance. A feature more closely related to obesity would be BMI. However, BMI was removed due to its high correlation with weight and a larger proportion of missing values. In line with the literature, the following comorbidities were also shown to drive high probabilities for critical state: diabetes (Guo et al., 2020b; Wang et al., 2020a; Yan et al., 2020b), chronic kidney disease (Cheng et al., 2020; Emami et al., 2020; Henry and Lippi, 2020), and cardiovascular diseases (Bansal, 2020; Guo et al., 2020a; Mehra et al., 2020). Many elderly patients with these comorbidities use angiotensin-converting enzyme (ACE) inhibitors and angiotensin-receptor blockers (ARBs), which upregulate the ACE-2 receptor. Given that the ACE-2 receptor has been proposed as a functional receptor for the cell entry mechanism of coronaviruses, it has been hypothesized that this may lead to a higher prevalence and an elevated risk of severe disease progression after SARS-CoV-2 infection (Shahid et al., 2020). Our model also revealed disparities between races in terms of the probability of critical state. The race feature Caucasian showed a lower risk. As there is a strong negative correlation between the race features Caucasian and African American (both together represent almost the entire cohort), the data indicate that African Americans have a higher risk. This disparity has been reported in several states, among others Louisiana, where around 70% of deaths have occurred among African Americans, although they represent only one third of the state's population (Yancy, 2020).
While a higher prevalence of comorbidities such as hypertension, diabetes, obesity, and cardiovascular disease among African Americans may be one reason for this disproportion, late lockdowns in southern states or social determinants (e.g., living in poor areas with high housing density, high crime rates, and poor access to healthy foods) may also be strong contributors (Dyer, 2020; Yancy, 2020). The importance of socioeconomic factors for severe disease progression is also underlined by examining the consequences of insurance types. The SHAP analysis clearly showed that patients with self-pay healthcare tend to have a higher probability of entering critical state, possibly because they are reluctant to seek early medical care. The two primary symptoms influencing the progression of the disease based on the present analysis are shortness of breath (dyspnoea) and cough, both prevalent symptoms of COVID-19. Interestingly, they have opposite effects on the prediction probability of the model, with shortness of breath increasing and cough decreasing the probability of critical state. This can be explained by the fact that cough is an early symptom during mild or moderate disease, whereas shortness of breath develops late in the course of illness. This concurs with statistical reports from China showing a higher prevalence of shortness of breath in severe cases and a higher prevalence of cough in non-severe cases and survivors (Li et al., 2020; Zhou et al., 2020). Hence, if cough is reported, this may indicate that the disease is still in an early stage and may not lead to a critical state, whereas if shortness of breath is reported, the chances of further disease progression may be much higher. Fever may be an early symptom, but it has also been shown to develop later during hospitalization (Yang et al., 2020). In addition, as reported by Zhou et al. (2020), fever has the same prevalence in survivors and non-survivors.
This may also explain why fever is more difficult to use as a predictive feature, unlike for example cough, despite also being among the most prevalent symptoms. Nonspecific neurological symptoms like headache and confusion are less commonly reported. Nevertheless, confusion was shown to contribute to an increase in the model's output probability. While headaches may have many potential origins not necessarily related to COVID-19, confusion may be a clearer precursor of neuroinvasion of SARS-CoV-2, which has been suggested to potentially lead to respiratory failure (Asadi-Pooya and Simani, 2020).

Finally, the feature hospitalization (inpatient) also appears among the top features in terms of feature importance. Fewer than 10% of patients were already hospitalized (inpatient) in the 14 days before or at the diagnosis date. The SHAP analysis clearly showed that hospitalization before COVID-19 diagnosis predicts progression towards a severe or critical state. Overall, the findings of this work are in line with results from the vast number of studies reported in the literature, and the interpretability analysis provides evidence for the validity of the prognostic prediction modeling.

EHRs can be a powerful data source to create evidence based on real-world data, especially when combined with a platform facilitating the structured extraction of data. However, there are trade-offs to be made when analyzing EHR data in contrast to clinical study data (Kim et al., 2018).
One major limitation is that patients may get diagnoses, treatments, or laboratory measurement results outside of the hospital network covered by Explorys, resulting in incomplete patient histories with a potentially high proportion of missing data. For this reason, it was for example preferred to rely on COVID-19 diagnoses based on ICD codes, instead of relying on LOINC codes for SARS-CoV-2 tests, to increase the probability of including patients being treated within the Explorys network. This highly fragmented data also requires imputation, as there is rarely a patient with a complete data record, especially when the feature set is large. The method of imputation may also introduce additional biases which are difficult to control. Therefore, features with high proportions of missing data (in particular laboratory measurement results) were removed to reduce the bias. Wherever imputation was still necessary, it was ensured that the imputation was based purely on the train set to avoid information leakage. While the removal of potentially important laboratory measurement results may compromise the performance of the model, it also increases the practical usability of the model, as fewer laboratory tests are required to create a prediction. Furthermore, to ensure data privacy and prevent re-identification, patients' age is truncated, and death dates and related diagnoses and procedures are not available in Explorys. As the latter are highly relevant for the present modeling, several assumptions had to be made. Nevertheless, the resulting death rates correspond well to official COVID-19-related death rates in the US or relevant states. An additional limitation and potential bias is linked to the data extraction using time windows.
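The principle of basing imputation purely on the train set can be sketched as follows. The paper does not state which imputation method was used here, so median imputation is an assumption, and the function names and toy rows are hypothetical. The point of the sketch is only that statistics are learned from the training rows and then reused unchanged for the test rows.

```python
import statistics


def fit_median_imputer(train_rows):
    """Learn one median per column from the training rows only (None = missing)."""
    n_cols = len(train_rows[0])
    return [
        statistics.median(r[c] for r in train_rows if r[c] is not None)
        for c in range(n_cols)
    ]


def apply_imputer(rows, medians):
    """Replace missing entries with the medians learned on the train set."""
    return [[medians[c] if v is None else v for c, v in enumerate(r)] for r in rows]


# Toy rows: [weight in kg, binary comorbidity flag]; values are assumed.
train = [[70.0, 1], [None, 0], [90.0, None], [110.0, 1]]
test = [[None, 0], [60.0, None]]

medians = fit_median_imputer(train)          # fitted on train only
train_filled = apply_imputer(train, medians)
test_filled = apply_imputer(test, medians)   # test statistics are never used
print(medians, test_filled)
```

Fitting the imputer on the combined data instead would let information from the test cases influence the training features, which is the leakage the text refers to.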
Even though the window lengths were motivated by medical reasoning, they are subject to a trade-off that does not arise in clinical studies with precise protocols: extending the windows to capture enough information spread over multiple visits and to account for delays in EHR entries, versus remaining recent enough and related to COVID-19. Furthermore, the features used in this model do not capture the time information for the individual samples (e.g., how many days before COVID-19 diagnosis the ICD code for fever was entered into the system). The model was based on US data from hospitals of the Explorys network, and the cohort analysis showed that the highest data contribution came from only a few states and counties. This resulted, for example, in a higher ratio of African Americans compared to the US average. It is therefore highly likely that there are demographic and socioeconomic biases, in addition to the fact that economically disadvantaged patients may seek medical help too late. In terms of testing, diagnosing, and treating, the data also reflect the American healthcare system. Despite these limitations, RWE can retrospectively generate insights on a scale which would not be feasible with an observational clinical study. Thus, it may be a starting point for subsequent, more focused clinical studies. Furthermore, approaches based on RWE might even have higher clinical applicability due to their incorporation of statistical noise during model training (Bachtiger et al., 2020).

The results of this work demonstrate that it is possible to develop an explainable machine learning model based on patient-level EHR data to predict, at the time point of COVID-19 diagnosis, whether individual patients will progress into critical state in the following four weeks.
Without the necessity of relying on multiple laboratory test results or imaging such as CTs, this model holds promise of clinical utility due to the simplicity of the relevant features and its adequate sensitivity and specificity. Even though this prognostic model for critical state has been trained and evaluated on the largest cohort to date with over 20000 patients, it includes only cases from certain regions within the US and may therefore be biased towards sub-populations of the US and the American healthcare system. To prove its generalizability before being considered for clinical implementation, it should be validated with other datasets. This model could be augmented with treatment features (e.g., drugs or other interventions) after diagnosis in order to predict whether the respective treatments would lead to an improvement (i.e., a reduction of the probability of entering critical state). Such models will never replace clinical trials to evaluate treatment effectiveness, but they will help to identify responder groups or inform the design of clinical trials to eventually reduce the burden on the healthcare system and optimize personalized treatment.

Disclosure/conflict-of-interest statement
MR and YK are employees of IBM Switzerland AG.

Author contributions
MR and YK led the development of the RWE Insights Platform, contributed to the conception of this work, developed the methodology, implemented the use case and the modeling approach, performed the analysis, interpreted the results, and drafted the manuscript. Both authors revised the manuscript and approved the final version.
How will country-based mitigation measures influence the course of the COVID-19 epidemic?
The italian health system and the COVID-19 challenge. The Lancet Public Health 5, e253
Central nervous system manifestations of COVID-19: A systematic review
Machine learning for covid-19 - asking the right questions. The Lancet Digital Health
Predicting COVID-19 malignant progression with AI techniques
Cardiovascular disease and covid-19. Diabetes & Metabolic Syndrome
Mapping the landscape of artificial intelligence applications against
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study
Kidney disease is associated with in-hospital death of patients with COVID-19
Building a COVID-19 vulnerability index
Collinearity: a review of methods to deal with it and a simulation study evaluating their performance
Predictors of mortality for patients with COVID-19 pneumonia caused by SARS-CoV-2: a prospective cohort study
Covid-19: Black people and other minorities are hardest hit in US
Prevalence of underlying diseases in hospitalized patients with COVID-19: a systematic review and meta-analysis. Archives of academic emergency medicine
Early prediction of disease progression in 2019 novel coronavirus pneumonia patients outside wuhan with CT and clinical characteristics
Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia - challenges, strengths, and opportunities in a global health emergency
Prevalence of Obesity and Trends in the Distribution of Body Mass Index Among US Adults
Mean body weight, height, waist circumference, and body mass index among adults: United States
Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019 - COVID-NET, 14 states
Georgia COVID-19 status report
A tool to early predict severe 2019-novel coronavirus pneumonia (COVID-19): A multicenter study using the risk nomogram in Wuhan and Guangdong
The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
Clinical characteristics of coronavirus disease 2019 in China
Cardiovascular Implications of Fatal Outcomes of Patients With Coronavirus Disease 2019 (COVID-19)
Diabetes is a risk factor for the progression and prognosis of COVID-19
Development and validation of the COVID-19 severity index (CSI): a prognostic tool for early respiratory decompensation
Chronic kidney disease is associated with severe coronavirus disease 2019 (covid-19) infection
Prevalence and severity of corona virus disease 2019 (COVID-19): A systematic review and meta-analysis
Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity
Gender differences in patients with COVID-19: Focus on severity and mortality
COVID-19 dashboard by the Center for Systems Science and Engineering
Real-world evidence versus randomized controlled trial: Clinical research based on electronic medical records
The clinical and chest CT features associated with severe and critical COVID-19 pneumonia
Obesity in patients younger than 60 years is a risk factor for covid-19 hospital admission
Neutrophil-to-lymphocyte ratio predicts severe illness patients with 2019 novel coronavirus in the early stage
Clinical features of COVID-19 in elderly patients: A comparison with young and middle-aged patients
Louisiana Coronavirus COVID-19
From local explanations to global understanding with explainable ai for trees
Receiver operating characteristic curve in diagnostic test assessment
Cardiovascular disease, drug therapy, and mortality in covid-19
Data-driven advice for applying machine learning to bioinformatics problems
The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned? International
Obesity - a risk factor for increased COVID-19 prevalence, severity and lethality (review)
Factors associated with hospitalization and critical illness among 4,103 patients with COVID-19 disease
Critical supply shortages - the need for ventilators and personal protective equipment during the Covid-19 pandemic
COVID-19 and older adults: What we know
Coronavirus disease 2019 case surveillance - United States
Calibration: the achilles heel of predictive analytics
Does comorbidity increase the risk of patients with COVID-19: evidence from meta-analysis
Updated understanding of the outbreak of 2019 novel coronavirus (2019-ncov) in wuhan
IBM Explorys Network - Unlock the power of big data beyond the walls of your organization
Severe Acute Respiratory Infections Treatment Centre
Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal
Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19
A machine learning-based model for survival prediction in patients with severe COVID-19 infection
Clinical characteristics and outcomes of patients with severe covid-19 with diabetes
COVID-19 and African Americans
Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study
Incidence, clinical characteristics and prognostic factor of patients with COVID-19: a systematic review and meta-analysis
COVID-19 and the cardiovascular system
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study

The authors would like to thank T. Egli, O. Müller, A. Peak, and S. Schumacher for contributing to the development of the RWE Insights Platform, and in particular T. Egli and O. Müller for their feedback on the manuscript. Further thanks go to the IBM Watson Health® team for providing access to the Explorys dataset enabling this project, B. Kolt for support and advice related to EHR data, and B. Brady for critical review of the manuscript. The RWE Insights Platform project is supported and sponsored by P. Bassignana and L. Böhm (IBM Switzerland AG).