key: cord-0753124-i7tnwd4n
authors: Vagliano, I.; Brinkman, S.; Abu-Hanna, A.; Arbous, M.S.; Dongelmans, D.A.; Elbers, P.W.G.; de Lange, D.W.; van der Schaar, M.; de Keizer, N.F.; Schut, M.C.
title: Can we reliably automate clinical prognostic modelling? A retrospective cohort study for ICU triage prediction of in-hospital mortality of COVID-19 patients in the Netherlands
date: 2022-01-22
journal: Int J Med Inform
DOI: 10.1016/j.ijmedinf.2022.104688
sha: 32d17d06540281827e8a013e51f98e953e7f114f
doc_id: 753124
cord_uid: i7tnwd4n

BACKGROUND: Building Machine Learning (ML) models in healthcare may suffer from time-consuming and potentially biased pre-selection of predictors by hand, which can result in a limited or trivial selection of suitable models. We aimed to assess the predictive performance of automating the process of building ML models (AutoML) for in-hospital mortality prediction modelling of COVID-19 patients at ICU triage, versus expert-based predictor pre-selection followed by logistic regression.

METHODS: We conducted an observational study of all COVID-19 patients admitted to Dutch ICUs between February and July 2020. We included 2,690 COVID-19 patients from 70 ICUs participating in the Dutch National Intensive Care Evaluation (NICE) registry. The main outcome measure was in-hospital mortality. We assessed model performance (at admission and after 24 hours, respectively) of AutoML compared to the more traditional approach of predictor pre-selection and logistic regression.

FINDINGS: Predictive performance of the AutoML models with variables available at admission shows fair discrimination (average AUROC = 0·75-0·76 (sdev = 0·03); PPV = 0·70-0·76 (sdev = 0·1) at cut-off = 0·3, the observed mortality rate) and good calibration. This performance is on par with a logistic regression model with selection of patient variables by three experts (average AUROC = 0·78 (sdev = 0·03) and PPV = 0·79 (sdev = 0·2)). Extending the models with variables that are available at 24 hours after admission resulted in models with higher predictive performance (average AUROC = 0·77-0·79 (sdev = 0·03) and PPV = 0·79-0·80 (sdev = 0·10-0·17)).

CONCLUSIONS: AutoML delivers prediction models with fair discriminatory performance and good calibration and accuracy, which is as good as that of regression models with expert-based predictor pre-selection. In the context of the restricted availability of data in an ICU quality registry, extending the models with variables that are available at 24 hours after admission showed a small (but significant) performance increase.

What was already known on the topic:
 Classical prediction models (i.e., regression models with manual predictor selection) yield good performance for clinical diagnosis and prognosis, but the modelling process is potentially biased and limited.
 Automated prognostic modelling (AutoML) facilitates automatic model and variable selection and hyperparameter tuning, and can lessen the burden of carrying out manual design tasks for prediction modelling.
 The largest proportion of prediction models for diagnosis and prognosis of COVID-19 was developed in the classical way (regression with manual predictor selection).

What this study added to our knowledge:
 Automated modelling can deliver clinical prediction models that perform on par with more classical models (regression models with manual predictor selection).
 Automated modelling can assist decision-making on ICU admittance and treatment, and can support efficient use of ICU capacity.
 Admitting COVID-19 patients to the ICU only to see how they develop in the first 24 hours may not be effective.

The prevalent approach to clinical prediction modelling often involves the manual selection of potentially relevant variables by experts, followed by regression analysis. Recent advancements in Machine Learning (ML) render this classical approach restrictive (it uses only one model type), inefficient (manual selection is labor-intensive) and potentially biased (predictor pre-selection). Automated Machine Learning (AutoML) is the automation of the ML design process, which includes, among others, automatic model and variable selection and hyperparameter tuning. 1 The promise of AutoML is to remove or lessen the burden of manual ML design tasks.

In this study, we assess the predictive performance of AutoML for clinical prognostic modelling by comparing the classical approach (manual variable selection followed by regression) with AutoML approaches. In particular, we assess the performance of AutoPrognosis 2 for the prediction of in-hospital mortality of COVID-19 patients admitted to the ICU. AutoPrognosis is an AutoML tool developed for clinical prognostic modelling that learns 20 ML models (e.g., regression, neural networks, and linear discriminant analysis) simultaneously. This case study is particularly relevant for challenging the classical modelling approach, because (1) the largest proportion of prediction models for diagnosis and prognosis of COVID-19 was developed in the classical way (as of July 2021: 89 out of 238 models used regression); 3 and (2) efficient automated approaches might be part of a rapid response strategy in a crisis situation.

The classical approach of developing prediction models based on expert-based predictor pre-selection followed by logistic regression can be time- and labor-intensive and may be biased. For new and yet unknown diseases, such predictor selection is not even possible. New and highly infectious diseases with a high chance of causing pandemic outbreaks, like COVID-19, require a rapid response in order to obtain and disseminate new information about the disease. It is unclear whether automated clinical prognostic modelling approaches based on different machine learning algorithms, which are more rapid and less labor-intensive, are able to reliably predict in-hospital mortality for COVID-19 patients. 2

The aim of this study is twofold: first, to assess the performance of prognostic models to predict in-hospital mortality of COVID-19 patients admitted to Dutch ICUs using automated clinical prognostic modelling versus the more traditional approach of expert-based predictor pre-selection followed by logistic regression; second, to assess the performance of these models based on data available at ICU admission versus data available after 24 hours of ICU admission.

This study used prospectively collected data on all patients with confirmed COVID-19 admitted to a Dutch ICU between February 15th and July 1st, 2020, extracted from the Dutch National Intensive Care Evaluation (NICE) registry. The NICE dataset contains, amongst other items, demographic data, minimum and maximum values of physiological data in the first 24 hours of ICU admission, diagnoses (reason for admission as well as comorbidities), ICU and in-hospital mortality, and length of stay. 4
This data collection takes place in a standardized manner, according to strict definitions and with stringent data quality checks, to ensure high data quality. 5 Patients were considered to have COVID-19 when the RT-PCR of their respiratory secretions was positive for SARS-CoV-2 or when their CT scan was consistent with COVID-19 (i.e., a CO-RADS score of ≥4 in combination with the absence of an alternative diagnosis). 6

All analyses were performed on two variants of the NICE dataset: (1) including only variables available at ICU admission (0h), and (2) including all variables available after the first 24 hours of ICU admission (24h). The primary outcome of this study was in-hospital mortality. During the peak of COVID-19 there was a shortage of ICU beds in some hospitals and many patients were transferred to other ICUs. Because all Dutch ICUs participate in the registry used, we could follow transferred patients through the Netherlands, and we used the survival status at the last hospital to which the patient was admitted during one and the same COVID-19 episode.

We applied AutoPrognosis to build prognostic models for the prediction of in-hospital mortality using an automated machine learning (AutoML) process. 2 Supplementary Section 1 provides a brief technical overview of how AutoPrognosis works.

Comparative design - In our study, we compared three different approaches (see Table 1) to develop a prognostic model to predict the in-hospital mortality of confirmed COVID-19 patients. Additionally, as a reference, we applied a recalibrated version of the Acute Physiology and Chronic Health Evaluation IV (APACHE IV) regression model, 7 one of the most commonly used prognostic models in intensive care, to our COVID-19 patient population. This reference enabled us to verify whether developing an ad-hoc model makes sense at all (independently of the approach used).

Statistical Analysis - All analyses were performed using Python v3.6 and R version 3.5.1 x64 with publicly available software packages 1. For the reporting of this study, we followed the TRIPOD statement (https://www.tripod-statement.org) and the IJMEDI checklist for the assessment of medical AI (https://zenodo.org/record/4835800). 8 The completed IJMEDI checklist is included in the supplementary materials. Table 2 includes an overview of the processing operations that were performed.

For the expert-selection approach, three intensivists (DD, DdL, SA) independently pre-selected predictors from a list of available variables in the NICE registry. Discrepancies were resolved by discussion and based on consensus. The APACHE III acute physiology score 9 and the overall Glasgow Coma Scale (GCS) 10 score were included, and the raw predictors that these scores take into account were excluded (we tried adding the raw predictors, but this did not improve results). A further selection of the predictors was done with a backward stepwise AIC selection model.

We measured (1) discrimination: Area Under the Receiver Operating Curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), sensitivity, positive predictive value (PPV), negative predictive value (NPV), and Brier score (i.e., the mean squared error of the prediction); (2) calibration: calibration curves; and (3) interpretation: model coefficients. AUROC and AUPRC were provided by AutoPrognosis; we computed the other measures separately. For PPV, NPV and sensitivity, the decision threshold was set to 0·3, which is the average mortality rate in this patient population, corresponding to the outcome prevalence. 11
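As a minimal sketch (not the authors' code), the measures above can be computed with scikit-learn as follows; y_true and y_prob are hypothetical arrays holding the observed in-hospital mortality labels and the predicted probabilities for one cross-validation fold.

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             brier_score_loss, confusion_matrix)
from sklearn.calibration import calibration_curve

def evaluate_fold(y_true, y_prob, threshold=0.3):
    """Discrimination, threshold-based measures and calibration data for one fold."""
    # Cut-off = observed mortality rate (outcome prevalence) in this population.
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "brier": brier_score_loss(y_true, y_prob),  # mean squared error of the prediction
        # Fractions of positives vs. mean predicted values per bin -> calibration curve.
        "calibration": calibration_curve(y_true, y_prob, n_bins=10),
    }
```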
For some models built by AutoPrognosis, e.g., neural networks, interpretation was not readily available (it requires more elaborate techniques such as SHAP 12 or LIME 13); for these models, interpretation was not measured. Model performance was evaluated as the average performance over a five-fold cross-validation (the default validation in AutoPrognosis). For all three approaches, the folds were kept identical to enable a fair comparison. Following Moreno and Apolone, 14 the original APACHE IV model used as a baseline was first-level recalibrated with the same five folds to achieve a better fit with our specific population, and was then also evaluated with the same five folds. Performance measures for discrimination and calibration were assessed by averaging the mean predicted values and the fractions of positives of the best models per fold. To determine the best model per fold, we performed a model comparison within AutoPrognosis; the best model is the one that achieved the highest average AUROC over the five folds. We used the combined 5x2 cross-validation (CV) F-test for the statistical comparison of models. 15 16
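As an illustration of this comparison procedure (not the authors' code), the sketch below applies the combined 5x2-fold CV F-test 16 as implemented in mlxtend; X, y and the two candidate estimators are hypothetical stand-ins.

```python
from mlxtend.evaluate import combined_ftest_5x2cv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Compare two candidate models on AUROC, as in the study; X is the feature
# matrix and y the in-hospital mortality labels (both assumed given).
f_stat, p_value = combined_ftest_5x2cv(
    estimator1=LinearDiscriminantAnalysis(),        # e.g., best fully-automated model
    estimator2=LogisticRegression(max_iter=1000),   # e.g., expert-selection model
    X=X, y=y,
    scoring="roc_auc",
    random_seed=1,
)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> significant difference
```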
For interpretation, we provided feature-importance results for the best-performing model within each approach. The interpretation results were judged on clinical relevance by intensivists (DdL, SA, DD).

In total, 2,706 confirmed COVID-19 patients from 70 ICUs were included, of whom 2,690 (99·4%) could be followed up until hospital discharge; 796 patients (29·6%) died during their hospital stay. Survivors were significantly younger (60·8 vs 68·6 years), more often women (30·5 vs 22·1%), less often admitted from the emergency room (23·2 vs 30·9%), and less often on mechanical ventilation at ICU admission (45·4 vs 55·5%).

Discrimination - Tables 4a (models with data at admission; referred to as 0h models hereafter) and 4b (models with data after 24 hours; referred to as 24h models hereafter) show the AUROC, AUPRC, PPV, NPV, and Brier scores of the three approaches. The obtained 0h and 24h models have fair discriminatory performance (AUROC = 0·75-0·78). For both the 0h and 24h models, there is a significant difference in discriminatory performance in terms of AUROC, AUPRC and Brier score between the fully- and semi-automated approaches (AUROC 0h: p < 0·05; AUROC 24h: p < 0·01; AUPRC and Brier score, both 0h and 24h: p < 0·01; 5x2 CV F-test). Additionally, for the 24h models the results of the APACHE IV model differ significantly from those of all other models on all measures but NPV (p < 0·01, 5x2 CV F-test). The best 0h and 24h models obtained by the fully-automated approach were linear discriminant analysis (LDA) models. The best 0h model of the semi-automated approach was an LDA model; the best 24h model was a logistic regression (logR) model. In the context of triage, the PPV is most important, as one does not want to falsely identify patients as non-survivors and withhold ICU care from them. The 0h-model PPVs range between 0·70 (fully-automated) and 0·79 (expert-selection); there is no significant difference in PPV between the three approaches (p > 0·05, 5x2 CV F-test). Supplementary Table 2 includes the 0h-model description for the fully-automated approach; Supplementary Table 3 includes the 0h-model description for the semi-automated approach.

For both LDA models, the major harmful risk factor for mortality was the patient's age, and the major protective factor was the date at which the patient was admitted to the ICU (a later admission date was associated with a lower mortality risk). For the 0h models, the variable selections of the semi-automated and expert-selection approaches overlapped in 13 variables. For the 24h models, the semi-automated approach selected more variables (34) than the experts did (30), but the overlap of variables (13) is the same as for the 0h models.

In this study, we assessed the predictive performance of automated clinical prognostic modelling (AutoML) for in-hospital mortality of ICU-admitted confirmed COVID-19 patients by comparing two automated modelling approaches using AutoML (fully-automated and semi-automated) and one expert-selection approach, in which intensivists selected potentially relevant variables and a logistic regression analysis was performed. In addition, we compared the predictive performance of models that had access only to variables available at admission (0h) with models that had access to variables available at 24 hours after ICU admission (24h).

Overall, predictive performance in terms of discrimination (AUROC) was fair (0·7-0·8). For the 0h models, there was no significant difference in discrimination (AUROC) between the automated and manual approaches. The LDA model constructed by the semi-automated approach (its best model) did significantly outperform the LDA model constructed by the fully-automated approach (its best model), but the difference was too small to be clinically relevant. There was no significant difference in PPV between the three approaches. The 24h models performed similarly in terms of discrimination (AUROC), PPV, and calibration. The selected best model of the semi-automated approach differed between 0h and 24h (0h: LDA; 24h: logR), while for the fully-automated approach the best 0h and 24h models were the same (both LDA). The 24h models performed significantly better than the 0h models (an improvement in AUROC of 0·02), but since this is only a small improvement, it may not be clinically relevant.

The studies most closely related to our work focus on the development and assessment of prognostic models of mortality among COVID-19-infected patients 17 18 and the identification of prognostic factors for severity and mortality in patients infected with COVID-19. [19] [20] [21] [22] [23] Regarding the development of prognostic models, the reported predictive performance varies from fair (AUROC 0·7-0·8) to very good (AUROC > 0·9); performance measures other than AUROC (e.g., calibration) are rarely assessed; and the studies show a high risk of bias and concern sample sizes up to a maximum of 577 (Table 1 in Wynants et al. 3). Regarding strong prognostic factors, similar to other studies we found age, sex and patient history (comorbidities) to be predictors of mortality among COVID-19 patients. Other studies found additional indicative predictors, such as body temperature, disease signs and symptoms (such as shortness of breath and headache), blood pressure, features derived from CT images, oxygen saturation on room air, hypoxia, diverse laboratory-test abnormalities, and biomarkers of end-organ dysfunction. 17, 18, 20, 21, 23 Most of these other predictors were not included in our dataset (mainly because the registry data we used do not include such detailed individual patient information).
For some of the included comorbidities, we have no explanation why they were not selected as predictors in our models, other than that this results from dependencies and correlations that are specific to our set of predictors. Our best-performing models included CPR, gastrointestinal bleeding and neoplasm, which have not been reported in other studies. This may be because these data items are not systematically recorded in other datasets, or because the combination of COVID-19 with another important reason for ICU admission cannot be identified in other studies. A poor prognosis of ICU patients with cancer or after CPR, even independent of COVID-19, is known and expected. 24 25

Strengths - The sample size of our study is large (i.e., it contains many confirmed COVID-19 patients), and the dataset is comprehensive (i.e., it contains many features per patient). As for the analysis, our evaluation is rigorous in that we use multiple performance measures. In general, our approach enables the rapid development of prediction models in a crisis such as the COVID-19 epidemic, since the registry data that we use are readily available and we use an automated machine learning approach.

Limitations - Regarding model development, we enabled the logistic regression model to perform better to some degree (e.g., with/without variable selection, inclusion of either the aggregate scores (APACHE, GCS) or the raw predictors), but this was not done exhaustively. Boosting logR performance further is still possible, for example by allowing it to use the best form of the predictors (i.e., transformations such as restricted cubic splines 26; a minimal sketch follows at the end of this section). We considered further model tweaking out of scope, because we primarily compare approaches (automated versus traditional), not models. As for the data, the NICE registry does not include all laboratory or other individual patient variables, but a specific selection, and sometimes an aggregation, of routinely collected data. Other studies do include more and different individual patient information, such as time series of laboratory values and features derived from CT images, which may explain their higher predictive performance.

Our study shows the value of automated modelling. After further development and extensive validation, such models can be of great importance in assisting medical staff in making decisions on ICU admittance and treatment, thereby supporting the use of ICU capacity as efficiently as possible. Since we did not find clinically relevant differences between models using data at admission time and models using data from the first 24 hours, this may affect the triage process itself as well: when considering predicted mortality under high pressure on ICU capacity, it may not be effective to admit patients only to see how they develop in the first 24 hours. However, when limited ICU capacity is not the main pressure for triage, one might argue that 24 hours is not long enough to accurately estimate an individual's survival chances.

The models achieve fair (AUROC 0·7-0·8) but not good (AUROC > 0·8) predictive performance. The addition of more individual patient information, such as more detailed laboratory values (instead of the min/max values that we included) and findings from CT images obtained from the electronic patient record, may increase performance, since other COVID-19 models that include those predictors perform better than ours; this is thus worthwhile to investigate.
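The sketch below is a hypothetical illustration of the spline transformation mentioned under Limitations (restricted cubic splines 26), using patsy's natural cubic regression splines via a statsmodels formula; df, mortality, age and sex are assumed names, not the study's actual data.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Enter a continuous predictor as a restricted (natural) cubic spline with
# 4 degrees of freedom instead of a single linear term; cr() is patsy's
# natural cubic regression spline transform.
model = smf.glm("mortality ~ cr(age, df=4) + sex",
                data=df, family=sm.families.Binomial()).fit()
print(model.summary())  # spline basis coefficients replace the linear age term
```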
This study shows that automated clinical prognostic modelling (AutoML) delivers prediction models with fair predictive performance in terms of discrimination, calibration, and accuracy. The model performance is as good as that of models developed using the more time-consuming regression analysis with expert-based predictor pre-selection. Models including data from the first 24 hours of ICU admission did significantly outperform models based on admission data, but the clinical relevance of the difference is small. These results can serve as a baseline for rapid automated model development in times of pandemics or other enduring crises that affect ICU capacity and hence increase the need for patient triage.

Tables

Table 1. Model approaches.
Fully-automated - We performed an AutoPrognosis analysis on all available patient variables; these variables were not processed, i.e., not selected or transformed.
Semi-automated - We performed an AutoPrognosis analysis on patient variables that were selected by means of stepwise regression and subsequently transformed (capped and normalized); see Table 2 for details.
Expert-selection - We performed a more traditional logistic regression analysis on patient variables that were selected based on expert (i.e., intensivist) opinion and by means of stepwise regression.

Missing values of numerical variables were imputed using fast k-nearest-neighbour (kNN) imputation, 27 and missing values of categorical variables using mode imputation. Multiple imputation by chained equations (MICE) 28 yielded similar results. In addition to the original patient variables as collected and described above, we included a derived variable for the body mass index (BMI), i.e., weight divided by squared height. Variables were selected with a backward stepwise AIC (Akaike information criterion) selection model 29 before the application of AutoPrognosis. Extreme values were removed by capping numerical variables (below the 1st percentile and above the 99th percentile).

Rescaling - All variables were rescaled to the range [0,1] by min-max normalization, x' = (x − min(x)) / (max(x) − min(x)), where x is the original value and x' is the normalized value.
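The following is a minimal sketch (not the authors' code) of these preprocessing steps, assuming a pandas DataFrame df with hypothetical column-name lists num (numerical) and cat (categorical), and columns weight (kg) and height (m).

```python
from sklearn.impute import KNNImputer, SimpleImputer

# kNN imputation for numerical variables, mode imputation for categorical ones.
df[num] = KNNImputer(n_neighbors=5).fit_transform(df[num])
df[cat] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat])

# Derived variable: BMI = weight divided by squared height.
df["bmi"] = df["weight"] / df["height"] ** 2

# Cap extreme values at the 1st and 99th percentiles, per column.
low, high = df[num].quantile(0.01), df[num].quantile(0.99)
df[num] = df[num].clip(lower=low, upper=high, axis=1)

# Min-max rescaling to [0, 1]: x' = (x - min(x)) / (max(x) - min(x)).
df[num] = (df[num] - df[num].min()) / (df[num].max() - df[num].min())
```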
Supplementary material

Table 6a. PPV, NPV and sensitivity of the automated, semi-automated, and expert-selection approaches using data available at admission (0h), at different cut-off values. We report the average results over the five-fold cross-validation, with the standard deviation in brackets, considering the best model per fold.

Table 6b. PPV, NPV and sensitivity of the automated, semi-automated, and expert-selection approaches and the APACHE IV baseline using data available from the first 24 hours after admission (24h), at different cut-off values. We report the average results over the five-fold cross-validation, with the standard deviation in brackets, considering the best model per fold.

Highlights
 Machine learning with automated predictor and model selection (AutoML) delivers predictive performance comparable to that of logistic regression models built with predictors that were pre-selected by hand.
 AutoML trains and tests a wide range of ML models (20 in our case), providing a more exhaustive analysis of the predictive signal in the data than the use of a single model.
 Automated modelling can assist decision-making on ICU admittance and treatment of COVID-19 patients, as well as provide for efficient use of ICU capacity.

References
1. Automated machine learning: State-of-the-art and open challenges.
2. AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning.
3. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal.
4. Data Resource Profile: the Dutch National Intensive Care Evaluation (NICE) Registry of Admissions to Adult Intensive Care Units.
5. Defining and improving data quality in medical registries: a literature review, case study, and generic framework.
6. CO-RADS: A Categorical CT Assessment Scheme for Patients Suspected of Having COVID-19 - Definition and Evaluation.
7. Acute Physiology and Chronic Health Evaluation (APACHE) IV: Hospital mortality assessment for today's critically ill patients.
8. The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies.
9. Evaluation of acute physiology and chronic health evaluation III predictions of hospital mortality in an independent database.
10. Assessment of coma and impaired consciousness. A practical scale.
11. A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa.
12. A unified approach to interpreting model predictions.
13. Explaining the predictions of any classifier.
14. Impact of different customization strategies in the performance of a general severity score.
15. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.
16. Combined 5 x 2 cv F test for comparing supervised classification learning algorithms.
17. Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: An observational cohort study.
18. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal.
19. Epidemiological, comorbidity factors with severity and prognosis of COVID-19: a systematic review and meta-analysis.
20. Predictors of COVID-19 severity: A literature review.
21. Prognostic factors for severity and mortality in patients infected with COVID-19: A systematic review.
22. Clinical, laboratory and imaging predictors for critical illness and mortality in patients with COVID-19: protocol for a systematic review and meta-analysis.
23. Epidemiology and clinical features of COVID-19: A review of current literature.
24. Outcomes of cancer patients after unplanned admission to general intensive care units.
25. A nationwide overview of 1-year mortality in cardiac arrest patients admitted to intensive care units in the Netherlands between
26. Flexible regression models with cubic splines.
27. Missing value estimation methods for DNA microarrays.
28. Multivariate Imputation by Chained Equations in R.

Acknowledgements - We thank all participating ICUs for making this study possible.

Supplementary material Table 5. Selected predictors used by the three approaches (columns: fully-automated, semi-automated, and expert-selection, for the 0h and 24h models). For the semi-automated and expert-selection approaches, the variables mentioned here are those that were included after backward selection. The variables that were available to the backward selection for the semi-automated approach were the same as those available to the fully-automated approach.

Supplementary Section 1 - The core component of AutoPrognosis is an algorithm for configuring so-called prognostic model pipelines using Bayesian optimization. [7] A pipeline refers to the phased machine learning process, including, among others, activities such as data extraction, model training and selection, and algorithm tuning. The Bayesian optimization algorithm models the pipelines as a function, whose input is a set of selected algorithms and their hyperparameter settings, and whose output is the achieved predictive performance. AutoPrognosis offers 20 different machine learning algorithms for building prognostic model pipelines, e.g., Random Forests, [8] boosted trees, [9] and Linear Discriminant Analysis. [10] AutoPrognosis returns the pipeline with the best predictive performance (Area Under the Receiver Operating Curve, AUROC) on the training data.
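As a simplified illustration of this idea only (AutoPrognosis's actual optimizer with structured kernel learning is more elaborate), the sketch below treats "algorithm choice plus hyperparameters -> cross-validated AUROC" as a black-box function and optimizes it with Gaussian-process Bayesian optimization via scikit-optimize; X and y are assumed given.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Real
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

def objective(params):
    """Black-box function: pipeline configuration -> negative mean CV AUROC."""
    algo, c = params
    model = (LogisticRegression(C=c, max_iter=1000) if algo == "logR"
             else LinearDiscriminantAnalysis())  # c is ignored for LDA
    # Negated because gp_minimize minimizes; X, y are the features and labels.
    return -cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

result = gp_minimize(
    objective,
    [Categorical(["logR", "lda"]),            # algorithm choice
     Real(1e-3, 1e3, prior="log-uniform")],   # regularization strength
    n_calls=30, random_state=0,
)
print(result.x, -result.fun)  # best configuration and its CV AUROC
```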
The IJMEDI checklist for the assessment of medical AI is included in the file "CHECKLIST_IJMEDI.pdf". The file is available in an Open Science Foundation (osf.io) repository (link to be included).

Conflicts of interest -  The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.