key: cord-0155634-ouvja6c7
authors: Buch, Varun; Zhong, Aoxiao; Li, Xiang; Rockenbach, Marcio Aloisio Bezerra Cavalcanti; Wu, Dufan; Ren, Hui; Guan, Jiahui; Liteplo, Andrew; Dutta, Sayon; Dayan, Ittai; Li, Quanzheng
title: Development and Validation of a Deep Learning Model for Prediction of Severe Outcomes in Suspected COVID-19 Infection
date: 2021-03-21
journal: nan
DOI: nan
sha: 9e2f80015ef65acd6213a385562daf2cb2a34669
doc_id: 155634
cord_uid: ouvja6c7

Triaging COVID-19 patients by predicting their outcomes at first presentation to the emergency department (ED) is crucial for improving patient prognosis, as well as for better hospital resource management and cross-infection control. We trained a deep feature fusion model to predict patient outcomes, where the model inputs were electronic health record (EHR) data, including demographic information, co-morbidities, vital signs and laboratory measurements, plus the patient's chest X-ray (CXR) images. The model output was the patient outcome, defined as the most intensive oxygen therapy required. For patients without CXR images, we employed the Random Forest method for the prediction. Predictive risk scores for COVID-19 severe outcomes (the "CO-RISK" score) were derived from the model output and evaluated on the testing dataset, as well as compared to human performance. The study's dataset (the "MGB COVID Cohort") was constructed from all patients presenting to the Mass General Brigham (MGB) healthcare system from March 1st to June 1st, 2020. ED visits with incomplete or erroneous data were excluded. Patients with no test order for COVID or with confirmed negative test results were excluded. Patients under the age of 15 were also excluded. Finally, EHR data from a total of 11060 COVID-19 confirmed or suspected patients were used in this study. CXR images were also collected from each patient if available. Results show that the CO-RISK score achieved an area under the curve (AUC) of 0.95 for predicting mechanical ventilation (MV)/death (i.e. severe outcomes) within 24 hours, and 0.92 within 72 hours, on the testing dataset. The model shows superior performance to the risk scores commonly used in the ED (CURB-65 and MEWS). Compared with physicians' decisions, the CO-RISK score demonstrated superior performance to humans in making ICU/floor decisions.

First identified in the Hubei province of China in December 2019, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) 1 has spread globally in a matter of months to trigger a pandemic of unprecedented scale and severity. By November 2020, a combination of high infectivity and pathogenicity had resulted in more than 1.3 million deaths from at least 53.7 million known infections 2. In human hosts, the virus causes Coronavirus Disease 2019 (COVID-19), which is characterized by a flu-like illness in mild cases and by multi-lobar pneumonia, acute respiratory distress syndrome (ARDS) and multi-organ failure in the most severe cases 3.

As the pandemic is likely to persist for months, it will continue to challenge health systems' allocation of resources (e.g. mechanical ventilators, oxygen, ICU beds and experts in respiratory intensive care units). Currently, most hospitals can triage patients with COVID-19-like illness to the clinic or the emergency department according to their initial presentation. However, this triage is less effective at identifying those with a worse prognosis (needing advanced oxygen therapy or mechanical ventilation, or having a higher risk of death) at the patient's first presentation to the ED. Determining the patient's risk of severe outcome is crucial for two main reasons. Firstly, when case numbers are surging, as during outbreaks, resources such as mechanical ventilators, personal protective equipment (PPE) and intensive care unit (ICU) beds are likely to be in short supply 4. Secondly, admitting a patient with COVID-19 into a care facility increases the chance of the infection spreading to vulnerable patients who are already admitted to the hospital.

A clinical decision support system that helps stratify the severity of COVID-19 in the Emergency Department (ED) would therefore be of significant value to disease management as well as to hospital operations, and there has been an increasing number of applications leveraging artificial intelligence, especially deep learning systems, for patient screening, triaging and management 5, 6. Although there are established scores for risk stratification that can be used in the ED, such as CURB-65 7 for acute onset of pneumonia, or the Modified Early Warning System (MEWS) score 8 for acutely unwell patients more generally, COVID-19 represents an entirely new human disease with its own characteristics and prognostic signature. Consequently, there have been many early studies attempting to predict COVID-19 progression based on the data captured during the pandemic 9,10. However, these models have generally used small sample sizes 11, have been limited to a single site for model training and testing 12, and have used patient and data selection methodology that could lead to bias 13. Furthermore, existing work has not compared model performance against the performance of physicians making similar decisions; thus, it remains unclear whether such models would help or hinder physicians in making COVID-19 management decisions.

In this study, we conduct a large-scale analysis of consecutive patients with suspected COVID-19 infection presenting to one of five Emergency Departments in the Greater Boston Area of Massachusetts, with the objective of developing and validating a predictive model for severe COVID-19 outcomes. Our approach utilizes the full spectrum of data elements available at initial patient presentation to a physician in the ED, including patient demographics, vital signs, lab results and imaging studies. This allows us to make a fair comparison between our model and human performance at the task of COVID-19 patient management. Due to the heterogeneity and scale of our input data, our study harnesses artificial intelligence, specifically deep learning, as the principal modelling approach.

The study's dataset (the "MGB COVID Cohort") was constructed from all patients presenting to one of the five Emergency Departments (ED) in the Greater Boston Area of Massachusetts within the Mass General Brigham (MGB) healthcare system from March 1st to June 1st, 2020. All patients were followed up until June 30th, 2020; therefore, the minimum observation period was 30 days. ED visits with incomplete or erroneous data, such as missing visit outcome information, contradictory EHR timestamps (e.g. visit ending before visit starting) and unusually long or short visits (typically records created for administrative reasons), were excluded. Patients under the age of 15 were excluded. Patients who were not suspected of having COVID-19, by virtue of not having a SARS-CoV-2 Antigen Polymerase Chain Reaction (PCR) test ordered at the time of the visit or in the 14 days prior, were excluded. Furthermore, patients with a confirmed negative test at any time in the past 14 days were also excluded. The EHR dataset was retrospectively collected from the MGB Enterprise Data Warehouse (EDW). We also retrospectively collected chest X-ray (CXR) images from the clinical Picture Archiving and Communication System (PACS) if the patient had an X-ray scan performed within 24 hours of the visit. The study was approved by the institutional review board under data use agreement 2020P000819 and was compliant with the Health Insurance Portability and Accountability Act. A waiver of the need to obtain informed consent was granted. TRIPOD guidelines for reporting of multivariable prediction models were followed 14.
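The exclusion criteria above can be read as a single eligibility predicate applied to each ED visit. The following is a minimal sketch under stated assumptions: the `Visit` record and all of its field names are hypothetical and do not reflect the MGB data warehouse schema, and the rule for "unusually long or short visits" is omitted because no cut-offs are given in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Visit:
    # Hypothetical record layout; field names are illustrative only.
    age: int
    start: datetime
    end: datetime
    outcome: Optional[str]                  # None if visit outcome is missing
    pcr_ordered_at: Optional[datetime]      # most recent SARS-CoV-2 PCR order
    negative_result_at: Optional[datetime]  # most recent confirmed negative test


LOOKBACK = timedelta(days=14)


def is_eligible(v: Visit) -> bool:
    """Apply the cohort exclusion criteria described in the text."""
    # Incomplete or contradictory records (e.g. visit ending before it starts).
    if v.outcome is None or v.end <= v.start:
        return False
    # Patients under the age of 15 are excluded.
    if v.age < 15:
        return False
    # Not suspected: no PCR test ordered during the visit or in the 14 days prior.
    if v.pcr_ordered_at is None:
        return False
    if not (v.start - LOOKBACK <= v.pcr_ordered_at <= v.end):
        return False
    # Confirmed negative test at any time in the 14 days before the visit.
    if v.negative_result_at is not None:
        if v.start - LOOKBACK <= v.negative_result_at <= v.start:
            return False
    return True
```

Expressing the criteria as one function keeps the patient flow auditable: each early return corresponds to one exclusion step.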

For each patient included in the cohort, predictor data and outcome data were collected 

As the CO-RISK model performs patient risk prediction by learning the non-linear relationship between predictors and outcomes, as a first step we split the whole cohort into training, validation and testing sets. Among the five ED sites involved in this study, data collected from two sites were used for training and validation, and data from the remaining three sites were used for testing the model. EHR data were preprocessed by standard de-identification and missing value imputation via the MissForest algorithm 15. As both the deep learning and Random Forest methods performed the same task of risk prediction and had the same type of model output (a continuous value from 0 to 1, for 24/72 hours), their results were combined for a single model evaluation: for patients with CXR available, the prediction from the deep learning method was used; and for patients with only EHR data, the prediction from Random Forest was used. In order to establish the final CO-RISK scores, a cube root transform was applied to the combined results, followed by multiplication by a factor of 100. The cube root transform reduced the skewness of the score distribution 18, while the multiplication made the score more readable.
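The combination and rescaling steps can be sketched as follows. This is an illustration of the arithmetic described above, not the authors' code; the function names and the use of `None` to signal a missing CXR-based prediction are assumptions made for the example.

```python
from typing import Optional


def combined_probability(p_deep: Optional[float], p_forest: float) -> float:
    """Pick the model output to use for one patient.

    p_deep is the deep learning output (None when no CXR was available);
    p_forest is the Random Forest output from EHR data only.
    """
    return p_deep if p_deep is not None else p_forest


def co_risk_score(p: float) -> float:
    """Cube root transform followed by multiplication by 100."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("model output must lie in [0, 1]")
    return 100.0 * p ** (1.0 / 3.0)
```

For instance, a model output of 0.125 maps to a score of about 50, and outputs of 0 and 1 map to 0 and 100. Because the cube root is monotonic, the transform changes the score distribution but not the ranking of patients, so rank-based metrics such as AUC are unaffected.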

Feature importance of the CO-RISK model was evaluated by permutation importance 19, where we randomly permuted the values of each feature (i.e. item in EHR data) and recorded the corresponding changes in prediction error. A higher increase in prediction error due to the permutation indicates higher importance of that feature.
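A minimal sketch of the permutation procedure is given below. It assumes a generic `predict` callable and uses mean squared error as the prediction error; the error metric and the number of repeats are assumptions for illustration, since the text does not state which were used.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[List[Row]], List[float]],
    X: List[Row],
    y: Sequence[float],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean increase in prediction error after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving all other features intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mean_squared_error(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model ignores yields an importance of exactly zero, because shuffling it leaves every prediction unchanged.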

Two widely used clinical scores were selected in this work for comparison with the newly proposed method: CURB-65 7 and MEWS 8. CURB-65 is a six-point score based on confusion, urea, respiratory rate, blood pressure, and age that can be used to stratify community-acquired pneumonia patients into different management groups and to estimate 30-day mortality risk 7. MEWS is also a point-based system that evaluates vital signs (systolic blood pressure, heart rate, respiratory rate and temperature) and mental state to identify patients at risk of deterioration (ICU admission, cardiorespiratory emergency and death) 8.
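For reference, CURB-65 can be computed as below. The cut-offs are those of the standard published definition of the score (urea above 7 mmol/L, respiratory rate of 30/min or more, systolic pressure below 90 mmHg or diastolic pressure of 60 mmHg or less, age 65 or over); they are not restated in this paper, and the function signature is an assumption made for illustration.

```python
def curb65(
    confusion: bool,
    urea_mmol_per_l: float,
    respiratory_rate: float,
    systolic_bp: float,
    diastolic_bp: float,
    age: int,
) -> int:
    """Return the CURB-65 score (0 to 5), one point per criterion met."""
    criteria = [
        bool(confusion),                         # C: new-onset confusion
        urea_mmol_per_l > 7.0,                   # U: urea > 7 mmol/L
        respiratory_rate >= 30,                  # R: respiratory rate >= 30/min
        systolic_bp < 90 or diastolic_bp <= 60,  # B: low blood pressure
        age >= 65,                               # 65: age >= 65 years
    ]
    return sum(criteria)
```

The score takes one of six values (0 through 5), which is why it is described as a six-point score.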

In order to evaluate physicians' performance in making clinical decisions, we investigated the scenario where physicians assigned patients to the ICU (a high dependency care unit) or the floor (a normal dependency care unit) at the patient's initial presentation, which is a major decision to be made in the ED. We established the ground truth for ICU/floor assignment based on whether the patient received MV within 72 hours after the ED visit, for the same reason of ED necessity. Then we calculated the physicians' performance (sensitivity/specificity) based on the correspondence between each patient's ICU/floor decision and 72-hour MV. We also obtained the ROC curve of the CO-RISK score performing the same task against the same ground truth. Finally, we used the physicians' performance to calculate the score threshold, by identifying the point on the CO-RISK ROC curve that is closest (measured by Euclidean distance) to the physicians' sensitivity and specificity.
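The threshold selection step can be sketched as follows. The code enumerates the empirical ROC operating points of a score and picks the one nearest to a reference (sensitivity, specificity) pair. The function names, and the convention that a patient is flagged when the score is at or above the threshold, are assumptions for the example.

```python
import math
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A patient is flagged positive when score >= threshold.
    """
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and l == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def closest_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    ref_sensitivity: float,
    ref_specificity: float,
) -> float:
    """Threshold whose ROC point is nearest (Euclidean) to the reference point."""
    best = min(
        roc_points(scores, labels),
        key=lambda p: math.hypot(p[1] - ref_sensitivity, p[2] - ref_specificity),
    )
    return best[0]
```

Here `ref_sensitivity` and `ref_specificity` would be the physicians' observed values; `labels` would be 1 for patients who received MV within 72 hours and 0 otherwise.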

A total of 11060 COVID-19 confirmed or suspected patients with EHR data available were included in the cohort (mean age, 57 years [standard deviation 20 years]; 49.4% male; 62.6% White and 13.2% African American). The flow of patients through this study can be found in Figure 1. Outputs from both methods were then combined and transformed into 24/72-hour risk scores for every patient in the cohort.

ROC curves of the CO-RISK model for predicting whether patients in the testing dataset will need MV or die within 24/72 hours, which is a major goal of the CO-RISK model, are shown in Figure 2. We also obtained the ROC curves for predicting the need for other oxygen therapies, which are provided in the supplemental materials. The area under the curve (AUC) for predicting MV/death in 24h was 0.95 (95% CI, 0.92-0.96). For MV in 72h, the AUC was 0.92 (95% CI, 0.90-0.94). Based on the feature permutation importance analysis, the following patient characteristics were of the highest importance for the prediction: two vital signs (SpO2 and respiratory rate), the oxygen device that the patient was using at the ED visit, and the patient's age. Of secondary importance were the following laboratory measurements: lactate, lactate dehydrogenase, C-reactive protein, and neutrophils. The last group considered important for the predictions includes systolic/diastolic blood pressure, as well as other laboratory measurements (aspartate aminotransferase, glomerular filtration rate, platelets, troponin T, glucose, D-dimer, creatinine and white blood cell count).
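As a reminder of what the reported AUC measures, the sketch below computes it as the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it, with ties counted as one half. This is the standard rank-based definition, shown for illustration; how the authors computed the AUC and its confidence intervals is not stated in the text.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: P(score of positive > score of negative), ties count 0.5."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a sketch; a sort-based implementation would be preferable for a cohort of this size.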

Based on the scheme for evaluating physicians' performance introduced in section 2.5, we obtained the sensitivity and specificity of ICU/floor assignment by physicians and by the CO-RISK model in the testing dataset. The ROC curve of the CO-RISK score for this task and the identified threshold (red asterisk and the corresponding error bar), along with the physicians' performance (green dot and error bar), are shown in Figure 4. As we also collected 30-day mortality data in the MGB COVID cohort, we analyzed patients' survival stratification based on the physician's decision (ICU, floor, or discharged) using Kaplan-Meier survival curves, as shown in Figure 5 (a). We also obtained the K-M curves using the CO-RISK score, where patients were stratified as 'Low', 'Medium' and 'High' risk of death based on thresholds derived from the training set, as in Figure 5 (b). The thresholds were established to best match physicians' decisions in the training set: patient discharged, admitted to the floor, or admitted to the ICU. Based on the K-M curves of the CO-RISK score, the mortality of patients in the high-risk group was significantly higher than in the medium- and low-risk groups (38.24% vs 6.52% and 0.45%; p<0.001).
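The stratification and survival analysis can be illustrated with the sketch below. It assigns a risk group from two score cut-offs and computes a Kaplan-Meier product-limit estimate for one group. The cut-off arguments are placeholders, since the actual thresholds are not reported in the text, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def risk_group(score: float, low_cut: float, high_cut: float) -> str:
    """Map a CO-RISK score to 'Low', 'Medium' or 'High' (cut-offs are placeholders)."""
    if score < low_cut:
        return "Low"
    if score < high_cut:
        return "Medium"
    return "High"


def kaplan_meier(
    durations: Sequence[float], died: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Product-limit survival estimate at each distinct death time.

    durations: follow-up time per patient (e.g. days since ED visit);
    died: True if the patient died at that time, False if censored.
    """
    survival = 1.0
    curve = []
    death_times = sorted({d for d, e in zip(durations, died) if e})
    for t in death_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, died) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Running `kaplan_meier` separately on the patients of each risk group gives one curve per group, analogous to the panels of Figure 5.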

The CO-RISK score was developed to predict short-term (24/72h) oxygen requirements, rather than just a binary outcome (e.g. death). Therefore, the score is potentially more clinically useful, since it is more actionable and directly indicates to the physician what kind of treatment and resources the patient might require. In the comparison with physicians' decisions, the CO-RISK score demonstrated superior performance to humans in making ICU/floor decisions. As shown in Figure 5, patients determined by the model to be "high risk" have much worse survival than patients actually sent to the ICU. Thus, in a scenario where resources are limited, the score can help physicians stratify patients and better plan the use of available equipment and hospital beds. The model could also be deployed for management of patients in areas unfamiliar with the clinical condition. The threshold suggested by the model provides a reference decision point which could be adjusted to accommodate available resources.

In the current experimental setting, we split the MGB COVID cohort into training and testing datasets based on site and found that CO-RISK can adapt well to changes in hospital. To further investigate the adaptiveness of CO-RISK to different time periods, which also reflect different prevalence of COVID, we split the data based on time periods within the study timeframe. The CO-RISK model was trained on data from all hospitals during March to April 2020. It was then tested on three test sets: test set 1 included ED visits between May 1st and May 10th, test set 2 between May 11th and May 20th, and test set 3 between May 21st and May 31st. As the number of new cases fluctuated during the pandemic, the rate of positive cases varied accordingly. In the MGB COVID cohort, the proportion of COVID-19 positive patients presenting to the ED was 41.1% during 3/3-4/30, but dropped to 33.1%, 18.5% and 13.1% during 5/1-5/10, 5/11-5/20, and 5/21-5/31, respectively. The performance of the CO-RISK score for the prediction of oxygen therapies in 24/72 hours using the new data split shows AUCs similar to the results reported above, indicating that CO-RISK can also adapt well to different COVID prevalence. In addition, this train/test data split is equivalent to a prospective data collection, indicating the feasibility of applying CO-RISK prospectively. Detailed analysis of the prediction performance can be found in the supplemental materials.
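The time-based split can be written as a simple assignment rule on the visit date. This is a sketch with hypothetical names; the training window start of March 1st is taken from the cohort definition, and any visit outside the listed windows (such as June 1st) is left unassigned.

```python
from datetime import date
from typing import Optional

TRAIN_WINDOW = (date(2020, 3, 1), date(2020, 4, 30))
TEST_WINDOWS = {
    "test_1": (date(2020, 5, 1), date(2020, 5, 10)),
    "test_2": (date(2020, 5, 11), date(2020, 5, 20)),
    "test_3": (date(2020, 5, 21), date(2020, 5, 31)),
}


def assign_split(visit_date: date) -> Optional[str]:
    """Return 'train', 'test_1', 'test_2', 'test_3', or None if out of range."""
    if TRAIN_WINDOW[0] <= visit_date <= TRAIN_WINDOW[1]:
        return "train"
    for name, (start, end) in TEST_WINDOWS.items():
        if start <= visit_date <= end:
            return name
    return None
```

Because every test window lies strictly after the training window, the model never sees data from the period it is evaluated on, which is what makes the split resemble prospective use.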

The CO-RISK model was developed without relying on COVID-19 diagnosis status (e.g. from a PCR test), which makes it suitable for application in the ED, where patients cannot get their COVID-19 diagnosis immediately upon visit. One of the most critical challenges in COVID-19 research is the development and testing of therapeutics to treat infected patients who are experiencing severe illness. The CO-RISK score could be used at initial presentation in the ED to calculate the baseline risk of future deterioration. Thus, the CO-RISK score could be used to assess the therapeutic benefit of candidate therapies in clinical trials. Furthermore, the score would allow more targeted enrollment of patients likely to experience a more severe illness, affording a better risk/reward trade-off for patients receiving experimental therapy.

We compared the performance of the CO-RISK score with that of physicians in dispositioning patients to the ICU or the floor, where the result showed similar performance. Currently, we are unable to account for unmeasured confounders in the decision-making process. However, it would be impractical and unethical to conduct randomized controlled clinical trials, especially in the stressful environment of the ED at a strained moment in the pandemic.

During preliminary attempts at deploying CO-RISK in the clinical workflow, we found that certain patients were missing key clinical variables and CXRs, which prevented the model from running. Such a missing-data pattern is informative and could be related to the patient's clinical condition when presenting to the ED. For instance, we found that relatively severe patients were more likely to have CXR scans, and physicians would order a blood panel or ferritin based on their clinical judgement.

The MGB COVID cohort was established based on patients from five teaching hospitals within the same healthcare system in Boston, which share similar technical infrastructure for electronic health records (EHR), data storage and ED protocols. Generalization and scale-up of the CO-RISK score should proceed with caution and be tailored to the hospital-specific context, depending on the existing information infrastructure, data availability and healthcare providers' adoption.

In this study using EHR and CXR data to predict severe outcomes of COVID-19, we developed a deep learning-based risk score which demonstrated high sensitivity and specificity for predicting patient outcomes and making clinical decisions. Further research is necessary for prospective deployment in the ED and integration into the clinical workflow. We are also investigating the feasibility of using a federated learning scheme to incorporate additional sites into CO-RISK score development, in order to establish an international framework for COVID-19 patient risk stratification.

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Coronavirus disease (COVID-19): weekly epidemiological update

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study

Fair Allocation of Scarce Medical Resources in the Time of Covid-19

Deep learning-enabled system for rapid pneumothorax screening on chest CT

Artificial Intelligence and Machine Learning in Radiology: Opportunities, Challenges, Pitfalls, and Criteria for Success

Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study

Validation of a modified Early Warning Score in medical admissions

Early triage of critically ill COVID-19 patients using deep learning

Development and validation of the ISARIC 4C Deterioration model for adults hospitalised with COVID-19: a prospective cohort study. The Lancet Respiratory Medicine

Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal

The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health

Machine learning for COVID-19-asking the right questions. The Lancet Digital Health

Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement

MissForest-non-parametric missing value imputation for mixed-type data

Deep & Cross Network for Ad Click Predictions

Random forests. Machine learning

A continental system for forecasting bird migration

Permutation importance: a corrected feature importance measure

[Supplemental figure caption, partially recovered] [...] than LFO, HFO/NIV, or MV. The error bars show the 95% [...] testing dataset (green bars [...]

High flow face mask, Bag-valve Mask, Non-rebreather mask, T-Piece, Venturi mask, Partial rebreather mask, Bi-PAP, CPAP, Transtracheal catheter

Mechanical ventilation (MV): Ventilator

Additional patient outcome prediction by CO-RISK and comparison with other clinical scores

Supplemental Figure 2: Boxplots for CO-RISK (top panel), CURB-65 (middle panel), and MEWS (bottom panel) scores in differentiating four different types of patient outcomes

), as well as patients on mechanical ventilation or high flow oxygen versus low flow oxygen or room air

Supplemental Figure 3: ROC curves of ICU/Death prediction using CO-RISK and MEWS scores (left), and 30-day death prediction using CO-RISK and CURB-65 scores

95% CI), where MEWS achieved an AUC of 0.72 (0.69-0.75 95% CI). Similarly, as the CURB-65 score was designed and validated for predicting mortality in pneumonia and lung infection 3, we specifically compared the performance of the CO-RISK and CURB-65 scores in predicting 30-day mortality

CO-RISK achieved an AUC of 0.86 (0.84-0.89 95% CI), where CURB-65 achieved (admission/discharge, ICU/floor, or 30-day mortality). Even so

Initial MEWS score to predict ICU admission or transfer of hospitalized patients with COVID-19: A retrospective study

The value of Modified Early Warning Score (MEWS) in surgical in-patients: a prospective observational study

Performance of pneumonia severity index and CURB-65 in predicting 30-day mortality in patients with COVID-19