key: cord-0925186-tkv7bcyz
authors: Ehwerhemuepha, Louis; Danioko, Sidy; Verma, Shiva; Marano, Rachel; Feaster, William; Taraman, Sharief; Moreno, Tatiana; Zheng, Jianwei; Yaghmaei, Ehsan; Chang, Anthony
title: A super learner ensemble of 14 statistical learning models for predicting COVID-19 severity among patients with cardiovascular conditions
date: 2021-03-17
journal: Intell Based Med
DOI: 10.1016/j.ibmed.2021.100030
sha: c4828ac5a106d377757ed4e7f207b072a07bca79
doc_id: 925186
cord_uid: tkv7bcyz

BACKGROUND: Cardiovascular and other circulatory system diseases have been implicated in the severity of COVID-19 in adults. This study provides a super learner ensemble of models for predicting COVID-19 severity among these patients.
METHOD: The Cerner Real-World Database was used for this study. Data on adult patients (18 years or older) with cardiovascular and related circulatory diseases between 2017 and 2019 were retrieved, and a total of 13 of these conditions were identified. Among these patients, 33,042 admitted with positive diagnoses for COVID-19 between March 2020 and June 2020 (from 59 hospitals) were identified and selected for this study. A total of 14 statistical and machine learning models were developed and combined into a single, more powerful super learner model for predicting COVID-19 severity on admission to the hospital.
RESULT: LASSO regression, a full extreme gradient boosting model with tree depth of 2, and a full logistic regression model were the most predictive, with cross-validated AUROCs of 0.7964, 0.7961, and 0.7958 respectively. The resulting super learner ensemble model had a cross-validated AUROC of 0.8006 (range: 0.7814, 0.8163). The unbiased AUROC of the super learner model on an independent test set was 0.8057 (95% CI: 0.7954, 0.8159).
CONCLUSION: Highly predictive models can be built to predict COVID-19 severity of patients with cardiovascular and other circulatory conditions.
Super learning ensembles will significantly improve on individual and classical ensemble models.

The novel coronavirus disease, COVID-19, which was first reported in December 2019 in Wuhan, China, is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus has spread to 191 out of 195 countries, with more than 63 million global cases and 1.47 million global deaths as of November 30, 2020. [1, 2] The World Health Organization declared COVID-19 a global pandemic on March 11th, 2020 as the number of countries affected rose sharply from 59 on February 28th, 2020 to 122 on March 13th, 2020. [1, 2]

Underlying cardiovascular and circulatory diseases have been implicated in the severity of COVID-19 in adults [3-11] since March 2020. The association between cardiovascular diseases (CVD) and COVID-19 severity is bidirectional. On the one hand, pre-existing CVD such as coronary heart disease and hypertension are known to be linked with higher COVID-19 morbidity and mortality. On the other hand, COVID-19 can induce CVD such as myocardial injury, arrhythmia, acute coronary syndrome, and venous thromboembolism, among others. [7-11] In other words, while pre-existing CVD can lead to worse COVID-19 outcomes, COVID-19 can induce new CVD and potentially worsen existing disease. [7-11] Recent studies of cardiovascular risk factors have linked cardiovascular complications to greater COVID-19 disease burden. [12, 13] This underscores the importance of studying the relationship between CVD and related circulatory conditions and COVID-19 severity.

A specific focus on CVD patients is therefore required, given the elevated mortality rate among these patients with COVID-19. A corresponding severity prediction model for CVD patients on admission to the hospital will support proactive care and help reduce morbidity and mortality.
The application of statistical learning and artificial intelligence algorithms may give frontline clinicians the ability to provide early and targeted therapies that may help reduce morbidity. [14-19]

Journal Pre-proof

Furthermore, the ability to recognize, on admission, patients who will progress to severe COVID-19 would help with logistics and planning in the face of scarce clinical resources and has the potential to be life-saving. Consequently, the application of predictive models may help mitigate some of the uncertainty associated with COVID-19 disease progression. In this study, we developed 14 statistical learning models and combined them into a super learner model that is an ensemble of ensembles and other statistical/machine learning models. The goal is to assess the extent to which these models may help predict severe COVID-19 in CVD patients who are already known to be at high risk.

Table 1. Demographics and health insurance payer data were retrieved for qualifying patients. The vital signs (body temperature, heart rate, respiratory rate, systolic blood pressure, and diastolic blood pressure) of patients on admission to the hospital with COVID-19 were captured and categorized as normal, high, or low for the age of the patient. The oxygen saturation level was also captured and categorized as 95-100%, 90-94%, or <90%. A nuisance categorical level was created for patients whose vital signs were not measured on admission or were missing from the database. Alternative approaches would include the use of statistical or machine learning imputation methods.

Two of the most severe forms of decompensation are the need for mechanical ventilation and in-hospital death. In this study, patients who were on mechanical ventilators or who died in hospital were classified as patients who progressed to severe COVID-19. All other patients were classified as having mild COVID-19.
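The oxygen saturation categories and the severe/mild rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code (the study was carried out in R): the function names, the label `"not measured"` for the nuisance level, and the treatment of non-integer readings (anything at or above a boundary goes to the higher category) are assumptions made here.

```python
def spo2_category(spo2_percent):
    """Map an admission oxygen saturation reading to one of the study's categories.

    None stands for a reading that was not taken or is missing; it gets its own
    nuisance level instead of being imputed. The label text is hypothetical.
    """
    if spo2_percent is None:
        return "not measured"
    if spo2_percent >= 95:  # assumed boundary handling for non-integer values
        return "95-100%"
    if spo2_percent >= 90:
        return "90-94%"
    return "<90%"


def severity_label(on_ventilator, died_in_hospital):
    """Severe if the patient was ventilated or died in hospital, otherwise mild."""
    return "severe" if (on_ventilator or died_in_hospital) else "mild"


# Example: a patient admitted with SpO2 of 92% who was later ventilated.
print(spo2_category(92), severity_label(True, False))
```

Keeping missing readings as a separate level means no patient is dropped and no value is invented, at the cost of one extra category per vital sign.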
As a result, the outcome variable of this study is binary: severe COVID-19 (need for mechanical ventilation or in-hospital death) and mild COVID-19 (any other outcome with live discharge from the hospital). This binary outcome was chosen to simplify this multicenter study and to ensure that we target the most severe COVID-19 outcomes.

A total of 14 statistical learning models (referred to as base learners hereafter) were selected for this study: LASSO regression, generalized logistic regression (with and without forward variable selection), linear discriminant analysis (with and without LASSO variable selection), multivariate adaptive regression splines, random forest (with and without LASSO variable selection), and three extreme gradient boosting models (each with and without LASSO variable selection). [21-28] We include a mathematical derivation of super learning in the appendix of this paper and provide a simplified graphical representation in Figure 1.

The data for this study consist of variables capturing demographics, health insurance information, first vital signs on admission, 13 pre-existing CVD and related circulatory conditions, and pre-existing comorbid conditions. These data were split into training (70%) and test (30%) sets. The training set was used to train all base learners and the super learner model using 10-fold cross-validation. The test set was used to provide unbiased estimates of the super learner ensemble model's performance metrics. All analyses for this study were carried out in the R statistical computing language with the SuperLearner package. [29, 30]

The data used for this study consist of COVID-19 hospitalizations from 59 hospitals/health systems. There were 33,042 qualifying hospitalizations, of which 5,685 involved mechanical ventilation or resulted in an in-hospital death.
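The super learning procedure described above (base learners trained under 10-fold cross-validation, then combined into one model) can be illustrated with a minimal, self-contained sketch. It is not the authors' R code and does not use the SuperLearner package. Everything in it is an assumption made for illustration: three toy base learners stand in for the 14 in the study, the data are synthetic, the loss is squared error, and the combining weights are found by a coarse grid search over non-negative weights that sum to one.

```python
import random

def fit_prevalence(rows):
    """Base learner: predict the overall severe rate for every patient."""
    rate = sum(y for _, y in rows) / len(rows)
    return lambda x: rate

def make_group_learner(index):
    """Base learner factory: predict the severe rate within category x[index]."""
    def fit(rows):
        overall = sum(y for _, y in rows) / len(rows)
        counts = {}
        for x, y in rows:
            n, s = counts.get(x[index], (0, 0))
            counts[x[index]] = (n + 1, s + y)
        rates = {level: s / n for level, (n, s) in counts.items()}
        return lambda x: rates.get(x[index], overall)  # unseen level: fall back
    return fit

def cross_val_predictions(rows, learners, k=10, seed=0):
    """Out-of-fold predictions: z[i][j] is learner j's prediction for row i."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    z = [None] * len(rows)
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        train = [rows[i] for i in order if i not in held_out]
        models = [fit(train) for fit in learners]
        for i in fold:
            z[i] = [model(rows[i][0]) for model in models]
    return z

def brier(weights, z, ys):
    """Mean squared error of the weighted blend of base-learner predictions."""
    blended = (sum(w * p for w, p in zip(weights, preds)) for preds in z)
    return sum((b - y) ** 2 for b, y in zip(blended, ys)) / len(ys)

def simplex_grid(n, steps):
    """Yield non-negative integer tuples of length n that sum to steps."""
    if n == 1:
        yield (steps,)
        return
    for first in range(steps + 1):
        for rest in simplex_grid(n - 1, steps - first):
            yield (first,) + rest

def fit_super_learner(rows, learners, k=10, steps=20):
    """Choose convex weights on out-of-fold predictions, then refit all learners."""
    ys = [y for _, y in rows]
    z = cross_val_predictions(rows, learners, k)
    grid = (tuple(c / steps for c in counts)
            for counts in simplex_grid(len(learners), steps))
    weights = min(grid, key=lambda w: brier(w, z, ys))
    models = [fit(rows) for fit in learners]  # final fit on all training rows
    return weights, lambda x: sum(w * m(x) for w, m in zip(weights, models))

# Synthetic toy data (made-up risks, NOT study data): x = (SpO2 category, age group)
rng = random.Random(42)
base_risk = {"95-100%": 0.05, "90-94%": 0.15, "<90%": 0.40}
rows = []
for _ in range(600):
    x = (rng.choice(list(base_risk)), rng.choice(["18-64", "65+"]))
    risk = base_risk[x[0]] + (0.15 if x[1] == "65+" else 0.0)
    rows.append((x, 1 if rng.random() < risk else 0))

learners = [fit_prevalence, make_group_learner(0), make_group_learner(1)]
weights, predict = fit_super_learner(rows, learners)
print("weights:", weights)
print("P(severe | SpO2 <90%, age 65+):", round(predict(("<90%", "65+")), 3))
```

Because each single base learner corresponds to one corner of the weight grid, the selected blend can never have a worse cross-validated loss than the best individual learner on the same folds. The sketch omits the 70/30 train/test split; in practice the weights would be chosen on the training set only and the final model scored once on the held-out test set.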
This corresponds to a severe COVID-19 rate of 17.2% (5,685 of 33,042 hospitalizations). The cross-validated model performance on the training data is shown in Table 2 in order of decreasing performance. The super learner model had a cross-validated average AUROC of 0.8006 which, as expected, is higher than those of the constituent base learners.

REFERENCES:
An interactive web-based dashboard to track COVID-19 in real time
tracker
COVID-19 and the cardiovascular system
COVID-19 and the cardiovascular system: implications for risk assessment, diagnosis, and treatment options
COVID-19 and the Cardiovascular System
Covid-19 and the cardiovascular system: a comprehensive review
COVID-19 and cardiovascular disease
COVID-19 and cardiovascular disease: from basic mechanisms to clinical perspectives
Cardiovascular disease and COVID-19
The novel coronavirus disease (COVID-19) threat for patients with cardiovascular disease and cancer
Coronavirus disease 2019 (covid-19) and cardiovascular disease: a viewpoint on the potential influence of angiotensin-converting enzyme inhibitors/angiotensin receptor blockers on onset and severity of severe acute respiratory syndrome coronavirus 2 infec
Common cardiovascular risk factors and in-hospital mortality in 3,894 patients with COVID-19: survival analysis and machine learning-based findings from the multicentre Italian CORIST Study
Impact of cardiovascular risk profile on COVID-19 outcome. A meta-analysis
Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review
Artificial intelligence and machine learning to fight COVID-19
New machine learning method for image-based diagnosis of COVID-19
Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
Predictors of pediatric readmissions among patients with neurological conditions
A Statistical Learning Model for Unplanned 7-day Readmission in Pediatrics
HealtheDataLab - a cloud computing solution for data science and advanced analytics in healthcare with application to predicting multi-center pediatric readmissions
The Elements of Statistical Learning
Logistic Regression
Multivariate adaptive regression splines
An introduction to multivariate adaptive regression splines
Extreme Gradient Boosting
Oracle inequalities for multi-fold cross validation
Super Learner

1. Patients with cardiovascular diseases are at high risk of severe COVID-19
2. Machine learning and artificial intelligence may help in identification of the highest risk patients
3. Individual base learners (statistical learning models) have good performance in predicting the severity of COVID-19 in patients with pre-existing cardiovascular diseases
4. Super learning, a mathematically proven approach to ensemble learning, will further improve the performance of base learners
5. A super learner model for patients with pre-existing cardiovascular conditions resulted in improved model performance that may have meaningful clinical impact

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.