key: cord-0977494-qovbwixf
authors: Heldt, F. S.; Vizcaychipi, M. P.; Peacock, S.; Cinelli, M.; McLachlan, L.; Andreotti, F.; Jovanovic, S.; Durichen, R.; Lipunova, N.; Fletcher, R. A.; Hancock, A.; McCarthy, A.; Pointon, R. A.; Brown, A.; Eaton, J.; Liddi, R.; Mackillop, L.; Tarassenko, L.; Khan, R. T.
title: Early risk assessment for COVID-19 patients from emergency department data using machine learning
date: 2020-05-22
journal: nan
DOI: 10.1101/2020.05.19.20086488
sha: 632ab07b560b4720ec65dc4552bc210c3b9c01db
doc_id: 977494
cord_uid: qovbwixf

Background Since its emergence in late 2019, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a pandemic, with more than 4.8 million reported cases and 310 000 deaths worldwide. While the epidemiological and clinical characteristics of COVID-19 have been reported, the risk factors underlying the transition from mild to severe disease remain poorly understood. Methods In this retrospective study, we analysed data from 820 patients with confirmed COVID-19 admitted to a two-site NHS Trust hospital in London, England, between January 1st and April 23rd, 2020, with the majority of cases occurring in March and April. We extracted anonymised demographic data, physiological clinical variables and laboratory results from electronic healthcare records (EHR) and applied multivariate logistic regression, random forest and extreme gradient boosted trees. To evaluate the potential for early risk assessment, we used data available during patients' initial presentation at the emergency department (ED) to predict deterioration to one of three clinical endpoints during the remainder of the hospital stay: A) admission to intensive care, B) need for mechanical ventilation and C) mortality. From the trained models, we extracted the clinical features most informative in determining these patient trajectories. Results After applying our inclusion criteria, we identified 126 of 820 (15%) patients who required intensive care, 62 of 808 (8%) patients who needed mechanical ventilation, and 170 of 630 (27%) cases of in-hospital mortality. Our models learned successfully from early clinical data and predicted clinical endpoints with good discrimination, the best model achieving AUC-ROC scores of 0.75 to 0.83 (F1 scores of 0.41 to 0.56). Younger patient age was associated with an increased risk of receiving intensive care and ventilation, but a lower risk of mortality.
Clinical indicators of a patient's oxygen supply and selected laboratory results were most predictive of COVID-19 patient trajectories. Conclusion Among COVID-19 patients, machine learning can aid in the early identification of those with a poor prognosis, using EHR data collected during a patient's first presentation at the ED. Patient age and measures of oxygenation status during the ED stay are the primary indicators of poor patient outcomes.

Introduction

COVID-19, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a novel infectious disease that can lead to severe acute respiratory distress in humans. In March 2020, the World Health Organisation declared the outbreak a pandemic. By May 19th, it had caused more than 4 800 000 confirmed cases and 310 000 deaths worldwide [1]. Disease severity for COVID-19 appears to vary dramatically between patients. It ranges from asymptomatic infection and mild upper respiratory tract illness to severe viral pneumonia with acute respiratory distress, respiratory failure and thromboembolic events that can lead to death [2-4]. Initial reports suggest that 6%-10% of infected patients are likely to become critically ill, most of whom will require mechanical ventilation and intensive care [3, 5].

Currently, few prognostic markers exist to forecast whether a COVID-19 patient may deteriorate to a critical condition and require intensive care. In general, patients can be grouped into three phenotypes: those at risk of thromboembolic disease, respiratory deterioration or cytokine storm [6]. Early clinical reports find that age, sex and underlying comorbidities, such as hypertension, cardiovascular disease and diabetes, can adversely affect patient outcomes [7, 8]. However, few studies have leveraged machine learning to systematically explore risk factors for poor prognosis.

Increasingly, hospitals collate large amounts of patient data as electronic healthcare records (EHRs). Combined with state-of-the-art machine learning algorithms, these data can help to predict patient outcomes with greater accuracy than traditional methods [9,10].

However, EHR data for COVID-19 remain scarce in the public domain, prompting many authors to focus on statistical analyses instead [11-14]. Where machine learning has been applied to COVID-19, results have been promising, but most studies lack statistical power owing to small sample sizes [15-18]. Jiang et al. applied predictive analytics to data from two hospitals in Wenzhou, China, including 53 hospitalised COVID-19 patients, to identify risk factors for acute respiratory distress syndrome (ARDS) [15]. Exploring the risk factors for in-hospital deaths, Zhou [...]

A key factor that determines the success of risk prediction models is the quality and richness of the available data. Studies to date have used a combination of demographics, comorbidities, symptoms and laboratory tests [15-17, 19]. These data typically comprise the patients' entire historical record, as well as observations collected during the current hospital stay [16, 18-20]. Including a patient's full EHR history improves predictive performance. However, such approaches may be of limited clinical use for early risk assessment, because a patient's entire EHR is rarely available at the point of presentation in hospital.

In this work, we retrospectively apply machine learning to data on 820 confirmed COVID-19 patients from two tertiary referral urban hospitals in London. We predict patients' risk of deterioration to one of three clinical endpoints: A) admission to an adult intensive care unit (AICU), B) need for mechanical ventilation, and C) in-hospital mortality. We restrict our analysis to EHR data available during a patient's first presentation in the emergency department (ED), as this more closely resembles the hospital reality of early-risk [...]

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

[...] considered. Patients who did not have information relating to an admission to any hospital department in 2020 were excluded. Furthermore, the following exclusion criteria were applied to each of the considered endpoints:
- for cohort A), patients without a documented ward location were excluded;
- for cohort B), patients without information on oxygen supply were excluded;
- for cohort C), patients without hospital discharge information were excluded.

Finally, since our models were trained on data available during a patient's stay in the ED, we removed patients who did not have a documented ED visit.

Each cohort was divided into target and control groups (see Table 2). For AICU admission, target patients comprise those who were admitted to an AICU at any time during their hospital stay. Control patients are those who remained in any other ward for their entire admission. Target patients in the ventilation cohort were defined as requiring invasive [...]

Three machine-learning algorithms were benchmarked to predict patient outcomes from EHR data: logistic regression, random forest and Extreme Gradient Boosted Trees (XGBoost).

Logistic regression was used as a baseline algorithm. It models the log-odds of a clinical endpoint as a linear function of the features. The model was regularised with an elastic net, giving equal weight to the L1 and L2 penalties, to account for the high dimensionality of the data set relative to the number of observations.

A random forest [21] was trained using 100 trees, with splits evaluated using Gini impurity. A random forest is an ensemble of decision trees in which each tree is trained on a different random (bootstrap) sample of the data. Classes were weighted inversely to their frequency to account for the class imbalance present in the data set.

An XGBoost algorithm [22] was trained to minimise log-loss with the following hyperparameters: 100 trees, a maximum tree depth of 6, a step shrinkage (learning rate) of 0.3, no subsampling and L2 regularisation. This tree-based algorithm trains decision trees sequentially, with each new tree fitted to the residual errors (loss gradients) of the preceding trees.

Performance evaluation

All models were evaluated using a stratified 3-fold cross-validation strategy. Results are reported as the mean and standard deviation across these folds. Predictive performance was measured with two metrics:
- the area under the receiver operating characteristic curve (AUC-ROC);
- the F1 score at each model's optimal classification threshold, as derived from the ROC curve.

Given the class imbalance, precision-recall curves were also computed to assess expected real-world performance relative to random classifiers.
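The model configurations and evaluation protocol described above can be sketched with scikit-learn and the `xgboost` Python package. This is a minimal illustration under stated assumptions, not the authors' code. The following choices are all ours:
- features are standardised before the elastic net;
- the "optimal" ROC threshold is taken to be Youden's J statistic (the text does not name the criterion), chosen on each validation fold, which is optimistic;
- random seeds are fixed;
- features are complete numeric arrays with no missing values.

XGBoost is treated as optional so that the sketch still runs without it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_models(seed=0):
    """Approximate the three configurations described in the text (assumed mapping)."""
    models = {
        # Elastic net with equal L1/L2 weighting; standardisation is our assumption.
        "logistic_regression": make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, max_iter=5000),
        ),
        # 100 trees, Gini impurity, class weights inversely proportional to frequency.
        "random_forest": RandomForestClassifier(
            n_estimators=100, criterion="gini", class_weight="balanced",
            random_state=seed),
    }
    try:  # xgboost is treated as an optional dependency in this sketch
        from xgboost import XGBClassifier
        models["xgboost"] = XGBClassifier(
            n_estimators=100, max_depth=6, learning_rate=0.3,  # eta = step shrinkage
            subsample=1.0, reg_lambda=1.0,                     # no subsampling, L2 penalty
            objective="binary:logistic", random_state=seed)    # minimises log-loss
    except ImportError:
        pass
    return models


def youden_threshold(y_true, scores):
    """Assumed 'optimal' threshold: the ROC point maximising TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]


def cross_validate(model, X, y, n_splits=3, seed=0):
    """Stratified k-fold CV returning (mean, sd) of AUC-ROC, F1 and average precision."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_fold = {"auc_roc": [], "f1": [], "avg_precision": []}
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        y_val = y[val_idx]
        prob = model.predict_proba(X[val_idx])[:, 1]
        y_pred = (prob >= youden_threshold(y_val, prob)).astype(int)
        per_fold["auc_roc"].append(roc_auc_score(y_val, prob))
        per_fold["f1"].append(f1_score(y_val, y_pred))
        per_fold["avg_precision"].append(average_precision_score(y_val, prob))
    summary = {k: (float(np.mean(v)), float(np.std(v))) for k, v in per_fold.items()}
    # A random classifier's expected precision equals the outcome prevalence.
    summary["random_baseline_precision"] = float(np.mean(y))
    return summary


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 5))  # synthetic stand-in for ED features
    y = (X[:, 0] + 0.5 * rng.normal(size=600) > 1.0).astype(int)
    for name, model in build_models().items():
        print(name, cross_validate(model, X, y))
```

The prevalence is returned alongside average precision because it is the precision a random classifier would achieve. For the AICU cohort this is about 0.15 (126 of 820).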

To extract the clinical features most relevant to predictions, permutation feature importance (PFI) was calculated post hoc for each model [21,23]. Each feature was individually randomised by permuting its values across patients. The model's AUC-ROC on the validation sets was then compared to the AUC-ROC before randomisation. PFI thus estimates the extent to which a model relies on a feature for its predictive performance and generalisability. The changes in performance were normalised by the sum of absolute changes over all features. Averages and standard deviations over the validation sets are reported.
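The normalised permutation importance described above can be sketched with NumPy and scikit-learn's `roc_auc_score`. The sketch makes two assumptions: a fitted classifier exposing `predict_proba`, and a single permutation per feature (the number of repeats is not stated in the text). The function name is ours. Calling it once per validation fold and averaging the returned vectors would give the reported means and standard deviations. scikit-learn's `sklearn.inspection.permutation_importance` (with `scoring="roc_auc"`) computes the un-normalised drops, if a library routine is preferred.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def normalised_permutation_importance(model, X_val, y_val, seed=0):
    """Drop in validation AUC-ROC when each feature is permuted in turn,
    divided by the sum of absolute drops over all features."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    drops = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        # Shuffling one column breaks its link with the outcome while
        # keeping its marginal distribution and all other columns intact.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops[j] = baseline - roc_auc_score(y_val, model.predict_proba(X_perm)[:, 1])
    total = np.abs(drops).sum()
    return drops / total if total > 0 else drops
```

A feature the model ignores gets an importance of exactly zero, because permuting it leaves every prediction unchanged.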

Accumulated local effects (ALE) were computed to determine the direction of a feature's effect on model predictions [24]. The range of each feature was divided into ten percentile bins. Within each bin, the feature's effect was calculated as the difference in predictions between the upper and lower bounds of the bin, leaving all other features unchanged. These differences were averaged over the patients in the bin and accumulated across bins. Binning features in this way evaluates the model only close to observed feature values. This reduces the influence of correlated features, a problem often encountered when trying to isolate the effect of a single feature.
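The binned ALE computation can be sketched in NumPy, following the first-order definition of Apley and Zhu [24]. The sketch assumes three things:
- `predict` is any function mapping a feature matrix to predicted probabilities;
- the "ten percentile bins" are deciles of the feature's empirical distribution;
- the curve is centred so that its count-weighted mean is zero.

The function name `ale_1d` is hypothetical.

```python
import numpy as np


def ale_1d(predict, X, j, n_bins=10):
    """First-order ALE of feature j; returns bin edges and centred ALE values."""
    edges = np.unique(np.quantile(X[:, j], np.linspace(0.0, 1.0, n_bins + 1)))
    n_intervals = len(edges) - 1
    # Bin k covers (edges[k], edges[k + 1]]; the minimum value goes into bin 0.
    bin_idx = np.clip(np.searchsorted(edges, X[:, j], side="left") - 1, 0, n_intervals - 1)
    local_effects = np.zeros(n_intervals)
    counts = np.zeros(n_intervals)
    for k in range(n_intervals):
        in_bin = bin_idx == k
        if not in_bin.any():
            continue
        lower, upper = X[in_bin].copy(), X[in_bin].copy()
        lower[:, j], upper[:, j] = edges[k], edges[k + 1]  # other features unchanged
        local_effects[k] = np.mean(predict(upper) - predict(lower))
        counts[k] = in_bin.sum()
    ale = np.concatenate([[0.0], np.cumsum(local_effects)])  # accumulate across bins
    # Centre using each bin's mid-value (mean of its end points), weighted by bin counts.
    centre = np.sum(0.5 * (ale[:-1] + ale[1:]) * counts) / counts.sum()
    return edges, ale - centre
```

An increasing curve means that higher values of the feature raise the model's predicted risk. This is how the direction of effects such as age or FiO2 is read from the ALE panels.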


First, we studied patients transitioning to critical care and requiring admission to an AICU. All three models reach good prediction performance on this endpoint, as measured by the area under the receiver operating characteristic (ROC) and precision-recall curves. They significantly outperform random classifiers (Fig. 2). The best-performing model, XGBoost, reaches an AUC-ROC of 0.83 and an F1 score of 0.51. Both tree-based methods perform better than logistic regression (Table 4). This is to be expected, since logistic regression cannot model interactions between features unless such interactions are explicitly encoded into the training data set through feature engineering.

All models show a moderate amount of variability across cross-validation folds (note the standard deviations in Fig. 2 and Table 4), which can compromise subsequent analyses. This instability originates from the limited number of patients and the high class imbalance between target and control patients (see Table 2). Specifically, in each of the three cross-validation folds the models are trained on only two thirds of the data set and validated on one third, leaving few target patients for these tasks.

[...] Figure 3A presents the 15 most important features for the logistic regression with elastic-net regularisation. Note that clinical variables that can be recorded multiple times [...]

Moreover, the fraction of inspired oxygen (FiO2) contributes to predictions, albeit not significantly. The random forest (Fig. 3B) and XGBoost (Fig. 3C) models assign a higher importance to patient age, followed by respiratory rate. Intriguingly, ALE analyses reveal that lower patient age increases the predicted likelihood of AICU admission in all three models (Figs. 3D-F). This agrees well with a bias towards younger patients when comparing AICU-admitted patients with control patients (Fig. S3A). However, clinical indicators of disease severity, such as C-reactive protein and ferritin levels, show no clear trend across age groups (Fig. S4). We also find that higher values of the fraction of inspired oxygen (Fig. 3D) and respiratory rate (Figs. 3E and F) increase the predicted probability of AICU admission.

In summary, machine learning algorithms can identify the COVID-19 patients most likely to require AICU admission from EHR data available during the initial ED stay, with good discriminative performance. Patient age and indicators of oxygenation status are strong predictors of patient outcome, with advanced age decreasing the probability of AICU admission.
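The earlier observation that logistic regression cannot capture feature interactions unless they are encoded explicitly can be illustrated with a toy example. The sketch below uses synthetic XOR-like data (not study data), in which the outcome depends only on the product of two features. It compares a logistic regression on the raw features with one given a hand-engineered product term, and scores both in-sample for brevity. The function name `interaction_demo` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def interaction_demo(n=1000, seed=0):
    """In-sample AUC-ROC of logistic regression without and with a product feature."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # outcome driven purely by the interaction
    X_int = np.column_stack([X, X[:, 0] * X[:, 1]])  # hand-engineered interaction term
    auc_raw = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])
    auc_int = roc_auc_score(y, LogisticRegression().fit(X_int, y).predict_proba(X_int)[:, 1])
    return auc_raw, auc_int


if __name__ == "__main__":
    print(interaction_demo())
```

Without the product term, the linear score cannot separate the classes and its AUC stays near chance. With the product term, the same model discriminates almost perfectly. Tree ensembles can learn such splits directly from the raw features.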


For mechanical ventilation prediction, we categorised patients into two groups. Target patients needed a ventilator (e.g., patients receiving SIMV, BIPAP or APRV ventilation). Control patients either were able to breathe normally or required minimal assistance (e.g., patients receiving oxygen via nasal cannulae or face masks). Prediction performance on this endpoint is comparable to that for AICU admission (Fig. 4). Specifically, XGBoost performs best, reaching an AUC of 0.83, while logistic regression and random forest reach 0.79 and 0.81, respectively (Table 4). This result is expected since most patients receive mechanical ventilation in an AICU, so the ventilation cohort largely overlaps with the critical care cohort (56 of the 62 target patients in Cohort B are also target patients in Cohort A). Notably, all models are less stable when predicting this clinical endpoint. This is most likely due to the greater class imbalance and the smaller number of patients receiving ventilation (62 of 808).

[...] Feature importance analysis for the logistic regression shows a large effect of the fraction of inspired oxygen and patient age (Fig. 5A), mirroring the results for AICU admission. We also observe a significant influence of haemoglobin levels on model predictions. Both tree-based methods rank age highly (Figs. 5B and C). In addition, the random forest uses blood lactate levels and oxygen saturation (Fig. 5B), while XGBoost relies on the fraction of inspired oxygen and levels of thyroid stimulating hormone (Fig. 5C), although few of these importance values are significant. In general, all models rely on a broader set of features for the ventilation endpoint. ALE analysis shows that younger patients had an increased probability of receiving ventilation (Figs. 5D-F). This agrees with an inherent bias towards younger age when comparing ventilated with non-ventilated patients (Fig. S4B). By contrast, a higher fraction of inspired oxygen and higher blood lactate levels were associated with a poor prognosis.

Taken together, models show good performance when predicting ventilation, albeit with decreased model stability (higher standard deviation). Patient age and oxygenation status are most predictive of poor outcome, with additional contributions from blood test values such as lactate and haemoglobin levels.


The copyright holder for this preprint this version posted May 22, 2020.

The performance of all three models decreases markedly when predicting mortality (Fig. 6). Logistic regression and XGBoost reach AUCs of 0.66 and 0.74, respectively, while random forest performs best with an AUC of 0.75. However, model stability improves, with standard deviations across cross-validation folds reaching their lowest levels over all three clinical endpoints (Table 4). This is consistent with the larger proportion of target patients in this cohort (170 of 630, 27%). [...] protein levels adding a small but significant contribution (Fig. 7A). Similarly, tree-based methods rely heavily on age for their predictions, with smaller contributions from respiratory rate and troponin T levels (Figs. 7B and C). More generally, prediction of mortality relies more strongly on blood tests than on the indicators of oxygen supply that dominated the other endpoints. ALE analysis shows that advanced age is predictive of higher mortality (Figs. 7D-F). This agrees with a bias towards older age in patients who die in hospital (Fig. S4C). Higher C-reactive protein, respiratory rate and troponin T levels increase the risk of mortality in our models (Figs. 7D-F).

In summary, models show increased stability but lower overall performance when predicting mortality. Feature importance scores reveal a high and significant contribution of patient age, with advanced age contributing to poor patient outcomes.

[...] Once such data are available, more complex models, such as deep neural networks, may achieve higher prediction performance. A key aspect that such work should consider is the prediction horizon, which affects how useful a model could be in practice.

In conclusion, our models represent a first step towards predicting COVID-19 patient pathways in hospital at the point of admission to the emergency department. [...] Sensyne Health plc on 25th July 2018.


European Centre for Disease Prevention and Control

Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72 314 Cases From the Chinese Center for Disease Control and Prevention

Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study

Incidence of thrombotic complications in critically ill ICU patients with COVID-19

How will country-based mitigation measures influence the course of the COVID-19 epidemic? The Lancet

Early detection of severe COVID-19 disease patterns define near real-time personalised care, bioseverity in males, and decelerating mortality rates

Novel Coronavirus Pneumonia Emergency Response Epidemiology Team. The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China. Chin Cent Dis Control Prev

Clinical characteristics of 113 deceased patients with coronavirus disease 2019: retrospective study

Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review

Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal

Clinical Characteristics of 138 Hospitalized Patients With

Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study

Characteristics and outcomes of 21 critically ill patients with COVID-19 in Washington State

Risk Factors Associated with Clinical Outcomes in 323 COVID-19 Patients in Wuhan, China


Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity

A Tool to Early Predict Severe 2019-Novel Coronavirus Pneumonia (COVID-19): A Multicenter Study using the Risk Nomogram in Wuhan and Guangdong, China

Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19

Predicting Mortality Risk in Patients with COVID-19 Using Artificial Intelligence to Help Medical Decision-Making

Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan

Random Forests

XGBoost: A Scalable Tree Boosting System

All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously

Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models

Features of 16,749 hospitalised UK patients with COVID-19 using the ISARIC WHO Clinical Characterisation Protocol