key: cord-0899587-0agldesf
authors: Sarkar, Jit; Chakrabarti, Partha
title: A Machine Learning Model Reveals Older Age and Delayed Hospitalization as Predictors of Mortality in Patients with COVID-19
date: 2020-03-30
DOI: 10.1101/2020.03.25.20043331
sha: f85b87d6ac0cb37ce1c1ffe67c055d1b55a44a8b
doc_id: 899587
cord_uid: 0agldesf

Objective: The recent pandemic of novel coronavirus disease 2019 (COVID-19) is increasingly causing severe acute respiratory syndrome (SARS) and significant mortality. We aim to identify the risk factors associated with mortality in coronavirus-infected persons using a supervised machine learning approach.

Research Design and Methods: Clinical data on 1085 cases of COVID-19 from 13th January to 28th February, 2020 were obtained from Kaggle, an online community of data scientists. 430 cases were selected for the final analysis. A Random Forest classification algorithm was applied to the dataset to identify the important predictors and their effects on mortality.

Results: The area under the ROC curve obtained during model validation on the test dataset was 0.97. Age was the most important variable in predicting mortality, followed by the time gap between symptom onset and hospitalization.

Conclusions: Patients aged beyond 62 years are at higher risk of fatality, whereas hospitalization within 2 days of the onset of symptoms could reduce mortality in COVID-19 patients.

The recent pandemic of coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has caused unprecedented morbidity and mortality in almost all the continents (1). Despite the implementation of extensive control measures, the spread of the disease and the resulting fatalities have not been effectively halted to date. The major cause of death in COVID-19 is virus-induced pneumonia leading to respiratory failure (2).
Epidemiological evidence suggests that older age and associated co-morbidities such as cardiovascular disease and diabetes put patients at higher risk of mortality (3). Identification of novel risk factors predictive of patient outcomes, including mortality, is therefore needed. Here, using publicly available clinical data from Kaggle, we employed a machine learning tool to identify the risk factors that could potentially contribute to the mortality of COVID-19 patients from 22 countries in 4 continents. We show that older age and delayed hospitalisation of symptomatic patients are the two major risk factors for mortality in COVID-19 patients.

The dataset was downloaded from Kaggle (https://www.kaggle.com/sudalairajkumar/novelcorona-virus-2019-dataset#COVID19_line_list_data.csv) on 23rd March, 2020. It contained a total of 1085 reported cases of COVID-19 from 13th January to 28th February, 2020. Cases with missing values in any variable were removed, leaving a dataset of 433 individuals. 3 cases were then excluded because the recorded date of hospital visit preceded the date of symptom onset. Among the 430 cases finally selected from 22 countries in Asia, Australia, Europe and North America, there were 37 deaths and 78 recoveries. The descriptive statistics of the deaths and confirmed recoveries are shown in Table 1.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

A Random Forest classification algorithm (4) was applied to the dataset of 37 deaths and 78 recoveries using the randomForest package in R. The dataset was randomly split into training and test datasets containing 70% and 30% of the total samples, respectively. To evaluate model performance, the area under the ROC curve was calculated on the test dataset.
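The split and evaluation steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than the R used in the study, and it is not the authors' code: the function names `train_test_split` and `auroc`, the fixed seed, and the example scores are assumptions introduced here for illustration. The AUROC is computed through its pairwise (Mann-Whitney) interpretation, that is, the fraction of death/recovery pairs in which the death case receives the higher predicted score.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle rows reproducibly and return (train, test).

    The seed is an assumption for reproducibility; the paper does not state one.
    """
    rng = random.Random(seed)
    shuffled = list(rows)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (here, death), 0 otherwise.
    scores: predicted score for the positive class.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 115 placeholder case identifiers, matching the 37 + 78 labelled cases.
train, test = train_test_split(range(115))

# Invented scores, only to exercise the function.
example_auc = auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

With 115 labelled cases, a 30% test fraction truncates to 34 test cases and 81 training cases in this sketch; the R workflow may round differently.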
A variable importance plot was generated from the importance of each predictor for the outcome. Variable importance is reported as both the mean decrease in Gini and the mean decrease in accuracy. Partial dependency plots were then generated using the pdp package in R to determine the marginal effect of Age and Time to Hospitalization on the fate of COVID-19 infection.

The descriptive summary of the data is presented as mean and standard deviation (SD). Numerical variables were compared between groups using the independent-samples two-sided Student's t-test. Categorical variables were tested using the Chi-square test. All the statistical analyses were performed in RStudio (Version 1. [...]

[...] (AUROC) curve during validation on the test dataset. The AUROC on the test dataset was found to be 0.97. Age was the most important variable in the model for predicting the fate, followed, interestingly, by the time gap between the onset of symptoms and hospitalization. The importance of the variables in terms of Mean Decrease in Accuracy and Mean Decrease in Gini is shown graphically in Figure 2 (A, B). To inspect the marginal effect of the predictors on the mortality of patients with COVID-19, we generated partial dependency plots for the odds of death against Age and against Days from the onset of symptoms to hospitalisation. As shown in Figure 2 (C, D), the odds of death rose with age beyond 62 years and with a time gap of more than 2 days between the onset of symptoms and hospitalisation.
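The idea behind a partial dependency curve can be sketched as follows: for each candidate value of one predictor, that value is imposed on every case while the other predictors keep their observed values, and the model's predictions are averaged. This is a plain-Python illustration, not the pdp package and not the fitted Random Forest. The scoring function `toy_predict`, its risk increments, and the four example cases are hypothetical; the cut points of 62 years and 2 days are borrowed from the text only to give the toy model a step shape. The sketch averages predicted probabilities, whereas the paper reports the effect in terms of odds.

```python
def partial_dependence(predict, rows, feature, grid):
    """Return [(value, mean prediction)] with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original case is unchanged
            modified[feature] = value  # impose the grid value on this case
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_predict(row):
    """Hypothetical stand-in for a fitted classifier's predicted probability of death."""
    risk = 0.1                         # invented baseline
    if row["age"] > 62:
        risk += 0.5                    # invented increment
    if row["days_to_hospital"] > 2:
        risk += 0.2                    # invented increment
    return risk


# Invented cases, for illustration only.
toy_rows = [
    {"age": 45, "days_to_hospital": 1},
    {"age": 70, "days_to_hospital": 4},
    {"age": 58, "days_to_hospital": 3},
    {"age": 66, "days_to_hospital": 0},
]

age_curve = partial_dependence(toy_predict, toy_rows, "age", [40, 60, 65, 80])
```

In this toy setting the averaged prediction is flat below the age cut point and steps up above it, which is the kind of shape a partial dependency plot is read for.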
Taken together, our analysis identifies older age (beyond 62 years) and delayed hospitalisation as the two most important predictors of mortality among patients with COVID-19.

Mortality among critically ill COVID-19 patients is high, and co-morbidities including hypertension, diabetes and coronary artery disease are often present in hospitalised patients. Though 48% of the non-survivors had a co-morbid disease, multivariate analyses found in-hospital death to be independently associated with older age, a high Sequential Organ Failure Assessment (SOFA) score and elevated d-dimer levels (6). Another study has also identified older patients as a high-risk group for mortality (7). In agreement with these previously published studies, our analysis identified Age as the most important risk factor for mortality among COVID-19 patients. However, the role of delayed hospitalisation following the development of symptoms as the next most important risk factor for mortality (after Age) is reported here for the first time. Inadequate healthcare resources have already been reported to be associated with increased mortality among COVID-19 patients (8).

Data are presented as mean ± SD. A p-value < 0.05 was considered statistically significant.

1. World Health Organization. Coronavirus disease (COVID-19) pandemic.
2. A pneumonia outbreak associated with a new coronavirus of probable bat origin.
3. Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis.
4. Random Forests.
5. RStudio: Integrated Development for
6. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
7. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study.
8. Potential association between COVID-19 mortality and health-care resource availability.