title: On the explainability of hospitalization prediction on a large COVID-19 patient dataset
authors: Girardi, Ivan; Vagenas, Panagiotis; Arcos-Díaz, Dario; Bessai, Lydia; Busser, Alexander; Furlan, Ludovico; Furlan, Raffaello; Gatti, Mauro; Giovannini, Andrea; Hoeven, Ellen; Marchiori, Chiara
date: 2021-10-28

We develop various AI models to predict hospitalization on a large (over 110k) cohort of COVID-19 positive-tested US patients, sourced from March 2020 to February 2021. Models range from Random Forest to Neural Network (NN) and Time Convolutional NN, with the data modalities (tabular and time-dependent) combined at different stages (early vs. model fusion). Despite the high class imbalance, the models reach average precision 0.96-0.98 (0.75-0.85), recall 0.96-0.98 (0.74-0.85), and $F_1$-score 0.97-0.98 (0.79-0.83) on the non-hospitalized (hospitalized) class. Performance does not drop significantly even when selected lists of features are removed to study model adaptability to different scenarios. However, a systematic study of the SHAP feature importance values for the developed models in the different scenarios shows large variability across models and use cases. This calls for more comprehensive studies of explainability methods before their adoption in high-stakes scenarios.

As recently published 1, during the COVID-19 pandemic several AI-based models have been developed, targeting a large variety of application areas, such as identification of disease clusters, monitoring of cases, prediction of future outbreaks, mortality risk, diagnosis of COVID-19, disease management, or vaccine development 2. A thorough review found that almost all prediction models were trained on small or low-quality datasets, were poorly reported, and carried a very high risk of bias or overfitting 3. Due to the sense of urgency and the lax regulatory landscape, some of these models were even used without clinical validation, raising legitimate concerns about patient safety and model clinical validity 4. A way to build more robust AI prediction models is to adhere to the PROBAST framework, which provides a structured approach to assess potential risk of bias and applicability to the intended population and setting 5. Transparent reporting can be achieved by following the TRIPOD checklist 6. While validation of the model "as-is" on external datasets is perceived by some as proof of its generalization power, this may be very difficult due to intrinsic differences in available populations, collected data, protocols, and guidelines. Instead, a decisive step forward would be to validate the models in real clinical settings with subject matter experts (SMEs), while in parallel improving their interpretability and correcting their biases with the help of explainability methods. In this paper, we develop hospitalization prediction models on a large cohort of COVID-19 case data collected in the US from March 2020 to February 2021, comprising more than 110k positive-tested patients (113941 upon data extraction, 110996 after further preprocessing). Knowing which patients are at risk of hospitalization for COVID-19-related symptoms or complications can help physicians decide not only how to best manage a patient's care from the time of testing (e.g., remote monitoring, more frequent encounters), but also how to allocate resources.
All the models reach relatively high performance despite the high class imbalance and the noise typical of large real-world datasets. We find, however, that despite the similar performance, different models lead to very different explanation results. While some hospitalization prediction models have been published 7-10, they are generally built on smaller cohorts and do not systematically explore explainability methods across different AI models and feature subsets.

In this section we describe in detail the various components of our approach: (i) cohort definition; (ii) data extraction: sourcing and preparation of data, with partition of the patient history into time intervals to enable increased flexibility during feature extraction; (iii) feature extraction: accurate preparation and selection of features, with special attention to data leakage prevention, missing data imputation, and proper feature encoding for capturing the temporal dimension of the data; (iv) model development, where we experiment with time-dependent and time-independent models (early and model fusion of the temporal and Boolean modalities); (v) internal testing and adaptability of the model for use in several settings, by reducing the number of features used to train the model or varying the time interval between predictor assessment and outcome determination; and (vi) model explainability.

Our analysis was performed on the IBM Explorys Electronic Health Record (EHR) Database 11, which contains real-world clinical, operational, and financial data spanning various healthcare aspects, from ambulatory to inpatient to specialty care. Explorys data are sourced from integrated delivery networks (IDN), clinically integrated network providers (CIN), and care collaborative groups (CC), and are constantly updated with incoming records. Data are curated, standardized, and normalized using international medical coding standards, classification systems, ontologies, and lab unit measures such as ICD-10, SNOMED, LOINC, and RxNorm. Furthermore, data are searchable through a de-identified database. Explorys data have been used by the medical community for several studies, such as recently for building a predictive model for chronic kidney disease 12.

We used data collected from March 2020 to February 2021. To draw evidence of COVID-19 infection we used diagnoses and observations. We defined two evidence classes for COVID-19 diagnoses, confirmed and suspected, and examined the diagnosis fields ICD-10 code, SNOMED concept, and a short diagnosis description snippet available in our source EHR data. As confirmed we defined the diagnoses explicitly registered with the ICD-10 code "U07.1", which was introduced by the WHO in April 2020 for use "when COVID-19 has been confirmed by laboratory testing irrespective of severity of clinical signs or symptoms". Diagnoses with the ICD-10 code "U07.2" (for use "when COVID-19 is diagnosed clinically or epidemiologically but laboratory testing is inconclusive or not available") were classified as suspected. Where the above COVID-19-specific codes were absent, we searched for a set of specific COVID-19-associated text patterns in the diagnosis description as well as in the SNOMED concept name (SNOMED search was performed in a hierarchy-aware manner, as described later). Pattern matches were classified as suspected, unless detected by a set of exclusion patterns we defined, in which case the diagnoses were altogether disregarded from further analysis.
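As an illustration of the hierarchy-aware matching described above, the following is a minimal sketch, not the authors' actual implementation: the pattern lists, the `names` and `parents` lookup tables, and the function names are hypothetical placeholders (the real patterns were compiled with SME input).

```python
import re

# Hypothetical include/exclusion patterns; the real lists were curated with SMEs.
INCLUDE_PATTERNS = [re.compile(p, re.I) for p in (r"covid[- ]?19", r"sars[- ]?cov[- ]?2")]
EXCLUDE_PATTERNS = [re.compile(p, re.I) for p in (r"exposure to", r"ruled out")]

def text_match(text: str) -> bool:
    """True if any include pattern fires and no exclusion pattern does."""
    if any(p.search(text) for p in EXCLUDE_PATTERNS):
        return False
    return any(p.search(text) for p in INCLUDE_PATTERNS)

def snomed_hierarchy_match(concept_id: str, names: dict, parents: dict) -> bool:
    """Hierarchy-aware match: test the concept name and the names of all its
    ancestors, so descendants of a pattern-matched concept are also captured.

    `names` maps SNOMED concept id -> concept name; `parents` maps
    concept id -> iterable of parent ids (both assumed preloaded).
    """
    stack, seen = [concept_id], set()
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if text_match(names.get(cid, "")):
            return True
        stack.extend(parents.get(cid, ()))
    return False
```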
To extract observation-based COVID-19 evidence, we used LOINC "94500-6", which is the recommended code when (i) the gene or region being tested is not specified and a qualitative result is being reported (e.g., Detected/Not detected), or (ii) a single qualitative overall result based on a combination of individual test results is being reported; it is also by far the most frequently occurring COVID-19 diagnostic test LOINC code in the data. Observations registered with LOINC code "94500-6" and a positive result were defined as positive-test. As cohort of interest we defined the patients who had both a COVID-19 diagnosis, whether suspected or confirmed, and a positive-test observation within a time frame from 4 weeks prior to their earliest COVID-19 diagnosis up to 4 weeks following it. This cohort finally comprised 113941 patients. A statistical summary is given in Table 1. Restricting the cohort to patients with a confirmed COVID-19 diagnosis and a positive-test observation in the same time frame did not lead to significantly different results. We excluded patients who had a COVID-19 diagnosis but no positive-test observation, in order to reduce noise and focus on cases documented more completely in our EHR data. The absence of an explicit positive test is indeed ambiguous and can be explained either by the patient having been diagnosed by other means, e.g., chest CT, or by missing information, e.g., a case not properly documented, or a test done in other institutions or unavailable (selection bias was not studied in detail, but the demographic distribution when including such "diagnosis-only" patients was similar to that of our defined cohort). A schema of the cohort selection is given in Figure 1.

For each patient belonging to the cohort, we extracted demographics (age, gender), medical conditions (chief complaints, symptoms, comorbidities), vital signs, and laboratory measurements, based on literature findings and SME input. With the exception of the demographics, all pieces of information were extracted with their corresponding timestamp. To capture the temporal dimension of the data, but also to reduce complexity, each patient journey was partitioned into various time intervals using the patient's earliest COVID-19 diagnosis as reference point. Each variable was then extracted in an aggregated manner for each interval. The partitioning is depicted in Figure 1 and was designed to allow a greater focus from the diagnosis time onward. To maximize recall, the same approach based on SNOMED and text pattern matching used to extract the COVID-19 diagnosis was used to extract symptoms, comorbidities, and other medical conditions. In the case of SNOMED search, we performed hierarchical pattern matching, including matched ancestors and using exclusion patterns to reduce false positives. We assigned a Boolean value to each medical condition in each time interval. Measurements and test results were extracted from the observation data using the corresponding LOINC codes and were normalized into a single unit. For each observation we extracted one, in most cases continuous, numeric variable per time interval, representing the patient's most recently observed value in that interval, as well as the minimum, maximum, and average. The patterns and codes used for data extraction were compiled in cooperation with SMEs.
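The per-interval aggregation can be sketched as follows; this is a minimal illustration assuming pandas, where the column names (`loinc`, `value`, `date`), the interval edges, and the toy values are hypothetical, not the paper's actual schema.

```python
import pandas as pd

def aggregate_observations(obs: pd.DataFrame, diagnosis_date: pd.Timestamp,
                           interval_edges: list) -> pd.DataFrame:
    """Aggregate timestamped observations (labs, vitals) into time intervals
    defined relative to the patient's earliest COVID-19 diagnosis.

    `obs` is assumed to hold one patient's rows with columns
    [loinc, value, date]; `interval_edges` is a list of day offsets,
    e.g. finer-grained from the diagnosis onward.
    """
    obs = obs.copy()
    obs["day"] = (obs["date"] - diagnosis_date).dt.days
    obs["interval"] = pd.cut(obs["day"], bins=interval_edges, right=False)
    obs = obs.sort_values("date")
    # Most recent, minimum, maximum, and average value per code and interval.
    return obs.groupby(["loinc", "interval"], observed=True)["value"].agg(
        last="last", min="min", max="max", mean="mean")

# Toy example: three hemoglobin values around a diagnosis on 2020-11-04.
obs = pd.DataFrame({"loinc": ["718-7"] * 3,
                    "value": [13.2, 12.8, 11.9],
                    "date": pd.to_datetime(["2020-11-02", "2020-11-05", "2020-11-20"])})
agg = aggregate_observations(obs, pd.Timestamp("2020-11-04"),
                             [-28, -14, 0, 1, 2, 4, 7, 14, 28])
```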
In order to determine whether a patient had been hospitalized in a given time interval, we examined the patient's interactions with healthcare providers, as captured in the patient's encounter data. A patient was considered hospitalized in a time interval T if they had an encounter with (i) encounter type inpatient, hospital emergency room visit, or hospital encounter, (ii) admission date within T, AND (iii) duration above 24 hours. To exclude cases potentially hospitalized for reasons other than COVID-19, patients with a hospitalization within the 4 weeks prior to the COVID-19 diagnosis were removed. After this cleanup step, patients with a hospitalization in the 4 weeks following the COVID-19 diagnosis were considered hospitalized due to COVID-19; we will refer to these patients simply as "hospitalized". Interestingly, 79% of the hospitalized patients were admitted within 4 days following the COVID-19 diagnosis. The prediction target of our models was based on these COVID-19 hospitalization labels. It must be noted that some patients may have been hospitalized in other institutions that are not part of the Explorys network, producing noise in the form of false negatives; a possible reason could be bed unavailability due to sudden spikes in demand. The temporal distribution of hospitalized and non-hospitalized cases is illustrated in Figure 2.

Possible bias in the data was removed by discarding all features related to ethnicity, race, medical insurance type, location, and any other available information not related to the medical conditions of the patient. The feature types used were demographics (age, gender), clinical presentation (symptoms, vitals, other medical conditions, comorbidities), and lab measurements. These categorical and continuous variables were associated with specific time intervals as previously described. We removed patients whose age information was unavailable or whose other features were all empty. Missing values were replaced using data imputation methods, and features were encoded (e.g., one-hot encoding for the age groups), after randomly splitting the data into a 70% train set and a 30% test set. More precisely, missing values for Boolean features were imputed with false, while continuous features were imputed with median substitutions learned on the train set. After feature imputation, we performed data normalization using the average and standard deviation computed on the train set. For comorbidities and chronic conditions, mentions over all the available time intervals were aggregated using disjunction ("OR") into a single Boolean feature (condition is present). For the other feature types, in order to avoid data leakage 13, we selected only the time intervals from 14 days prior to the diagnosis up to the hospitalization (period of interest), with an upper limit of 28 days after the COVID-19 diagnosis. Labs and vitals were kept with the corresponding time information, while all other Boolean medical conditions, including symptoms, were aggregated over the period of interest. We are aware that, from a medical perspective, knowing whether a symptom was present for one or more consecutive days would be important for building the model; unfortunately, our data did not consistently contain this type of information. Finally, for some conditions (e.g., stroke) we used the time intervals to distinguish between conditions belonging to the medical history of the patient ("past") and those belonging to the period of interest.
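The leakage-safe preprocessing described above (split first, learn statistics on the train set only) can be sketched as follows; this is a minimal illustration with synthetic stand-in arrays, and all names and shapes are hypothetical rather than the paper's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_bool = rng.integers(0, 2, size=(1000, 5)).astype(float)  # Boolean conditions
X_cont = rng.normal(size=(1000, 8))                        # labs and vitals
X_cont[rng.random(X_cont.shape) < 0.3] = np.nan            # simulate missingness

# 70/30 split first, so all statistics are learned on the train set only.
idx_tr, idx_te = train_test_split(np.arange(1000), test_size=0.3, random_state=42)

medians = np.nanmedian(X_cont[idx_tr], axis=0)             # train-set medians
imputed_tr = np.where(np.isnan(X_cont[idx_tr]), medians, X_cont[idx_tr])
mean, std = imputed_tr.mean(axis=0), imputed_tr.std(axis=0) + 1e-8

def transform(idx):
    """Boolean: missing -> false; continuous: train-median impute + z-score."""
    cont = np.where(np.isnan(X_cont[idx]), medians, X_cont[idx])
    cont = (cont - mean) / std
    return np.hstack([X_bool[idx], cont])

X_train, X_test = transform(idx_tr), transform(idx_te)
```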
After the preprocessing steps described above, the cohort comprised a total of 110996 patients: 97082 (87%) non-hospitalized and 13914 (13%) hospitalized. Since most patients were hospitalized within 4 days from the COVID-19 diagnosis date, and given the sparsity of the features over time, we explored two approaches that combine the two data modalities (time series and tabular) at different stages in the model. In the first approach, the features from the two modalities were concatenated (early fusion 14) into an ordered feature vector $X = [X_1, X_2]$, where $X \in \mathbb{R}^{n \times k}$, with $n$ and $k$ the total numbers of patients and features, respectively. Data modality $X_1 \in \mathbb{R}^{n \times h}$ contains tabular data such as gender, age groups, comorbidities, and other Boolean medical conditions, for a total of $h$ time-independent features. Data modality $X_2 \in \mathbb{R}^{n \times m \times t}$ contains $m$ features collected over $t$ time intervals, such as lab values and vitals. Concatenation of the modalities was performed after reshaping the second modality into a two-dimensional matrix. In the second approach, the modalities $X_1$ and $X_2$ are fed into two different architectures, and their hidden representations are subsequently merged in the model (model fusion 14). Upon the described preprocessing, the total number of features is $k = m \times t + h = 1573$, with $h = 77$, $m = 88$, and $t = 17$.

Statistical analysis was performed to compare the hospitalized ($H_1$) and non-hospitalized ($H_0$) populations and compute the significance of the rejection of the null hypothesis that there is no relationship between predictor and outcome. Demographic information, symptoms, comorbidities, and other medical conditions are presented (as %) in Table 2, while median and IQR values for lab measurements and vitals are presented in Table 3. Statistical comparison of the $H_1$ and $H_0$ groups was performed with a $\chi^2$ contingency analysis for the Boolean features and the Mann-Whitney U test for the numerical ones. In our analysis we used post-hoc correction with the Benjamini-Hochberg procedure. Two asterisks were assigned when p-values are < 0.001. When comparing the percentages of symptoms and comorbidities reported in Table 2 to existing statistics, it should be noted that these numbers were calculated on the preprocessed dataset upon removal of all features collected after hospitalization. Large variability is generally observed across publications, likely due to differences in the cohorts, the way data were collected, and the time the patient met the caregiver in their disease progression journey (e.g., at triage vs. hospitalization). In the laboratory-confirmed US cohort described in 15, cough is reported in about 50% and fever in 43% of the 28.3% of the patients for which symptoms are actually known. Several of the conditions more commonly found in $H_1$ in Table 2 are in agreement with literature findings, either as conditions correlated with severe disease progression (e.g., pneumonia, hypoxia, hypoxemia) or as comorbidities and risk factors (e.g., hypertension, diabetes, nicotine dependence, chronic kidney disease). Our analysis shows increasing hospitalization admissions associated with older age: the proportion of patients in their eighties is 4 times higher in $H_1$ than in $H_0$, and 2 times higher for patients in their seventies and sixties. Regarding the lab values reported in Table 3, it should be noted that these values were recorded before hospital admission.

Prediction of hospitalization up to 28 days after COVID-19 diagnosis was modelled as a binary classification ($H_1$ hospitalized, $H_0$ not hospitalized).
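The two fusion strategies can be sketched in PyTorch as follows; only the modality shapes ($h = 77$, $m = 88$, $t = 17$) are taken from the text, while the layer sizes and architecture details are illustrative assumptions, not the authors' exact models.

```python
import torch
import torch.nn as nn

H, M, T = 77, 88, 17  # time-independent features, time series, intervals (from the text)

class EarlyFusionNN(nn.Module):
    """Early fusion: the temporal modality is flattened and concatenated."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(H + M * T, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, 1))

    def forward(self, x1, x2):
        x = torch.cat([x1, x2.flatten(1)], dim=1)  # (n, k) with k = 1573
        return self.net(x)

class ModelFusionTCNN(nn.Module):
    """Model fusion: NN branch for tabular data, 1-D temporal convolutions
    for labs/vitals; hidden representations merged by additional layers."""
    def __init__(self, hidden=64):
        super().__init__()
        self.tab = nn.Sequential(nn.Linear(H, hidden), nn.ReLU())
        self.tcn = nn.Sequential(nn.Conv1d(M, 32, kernel_size=3), nn.ReLU(),
                                 nn.AdaptiveMaxPool1d(1))
        self.head = nn.Sequential(nn.Linear(hidden + 32, hidden), nn.ReLU(),
                                  nn.Dropout(0.3), nn.Linear(hidden, 1))

    def forward(self, x1, x2):
        h1 = self.tab(x1)                               # (n, hidden)
        h2 = self.tcn(x2.transpose(1, 2)).squeeze(-1)   # x2: (n, T, M) -> (n, 32)
        return self.head(torch.cat([h1, h2], dim=1))

# Shape check on a toy batch of 4 patients.
x1, x2 = torch.randn(4, H), torch.randn(4, T, M)
assert EarlyFusionNN()(x1, x2).shape == ModelFusionTCNN()(x1, x2).shape == (4, 1)
```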
For the early fusion approach we trained the following models: Random Forest classifier (RF) 17, Extra Trees classifier (ET) 17, fully connected neural network (NN) 18, and Auto AI architectures (Auto Keras 19). For the model fusion approach we trained an architecture combining a NN for the time-independent modality and a temporal convolutional neural network (T-CNN) 18 for the time-dependent one; the two models were combined with additional NN layers. The T-CNN is a convolutional neural network consisting of 1-D temporal convolutions. During training we performed 3-fold cross validation and selected the best performing model (best $F_1$-score) on the validation set. Splitting of the dataset into train, validation, and test sets was performed several times, varying the seed for the splitting. Precision, recall, and $F_1$-score were collected by computing average and standard deviation over all iterations. RF and ET models were trained with randomized parameter optimization to search for the optimal hyperparameters, such as the number of trees, the number of features to consider for the best split, the maximum tree depth, the minimum number of samples required to split an internal node, and the minimum number of samples required at a leaf node, either using the whole sample or bootstrapping. For the Auto Keras models we varied the number of iterations and the number of epochs and let the library find the best performing model. NN and T-CNN were optimized using mini-batch Stochastic Gradient Descent with momentum; the learning rate was varied with a step scheduler and dropout was used to reduce overfitting. Compared to the architecture found by the Auto AI model (three dense layers followed by ReLU activation functions), our implementation of the NN (T-CNN) contained more (fewer) parameters. In all models we defined a custom loss and used weighted random sampling of the training data to take into account the high class imbalance.

First, we built the models using all available features to obtain baselines (all features use case). Then, to test model adaptability to different use cases, we removed selected lists of features. To emulate general practitioner settings (GP), where diagnoses of acute conditions are not available and the results of some very specific tests are not accessible, we removed the following: pneumonia, hypoxia, hypoxemia, myocarditis, all sepsis-related conditions, acute respiratory distress syndrome, heart disease, pneumothorax, acute renal failure syndrome, thromboembolic disorder, acute hepatic failure, pulmonary embolism, renal failure syndrome, embolism, acute disease of cardiovascular system, and lab tests such as D-dimer and hsTnT. This GP use case was designed under SME guidance. Finally, to test the model's predictive power over time, we removed all available features in the day prior to the hospitalization for hospitalized patients (one day before use case). This drastically reduces the availability of lab values and vitals. Average precision, recall, and $F_1$-score are given in Table 4 for the all features use case, the GP use case, and the one day before use case, as first, second, and third values in the series, respectively. The NN and T-CNN architectures were frozen across all use cases. In all described use cases, all analysed models achieved relatively high classification performance despite the unbalanced dataset. There is minimal difference between the performance of the time-dependent T-CNN model and that of the NN and Auto Keras models for the all features and GP use cases.
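The imbalance handling and the optimization setup described above can be sketched as follows; this is a minimal PyTorch illustration in which the hyperparameters, the placeholder model, and the exact weighting scheme are assumptions, not the authors' settings.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Synthetic stand-in for the preprocessed feature matrix and labels.
X = torch.randn(1000, 1573)
y = torch.randint(0, 2, (1000,)).float()

def make_loader(X, y, batch_size=256):
    """Oversample the minority (hospitalized) class with a weighted sampler."""
    class_counts = torch.bincount(y.long())           # e.g., [97082, 13914]
    weights = 1.0 / class_counts[y.long()].float()    # per-sample weight
    sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
    return DataLoader(TensorDataset(X, y), batch_size=batch_size, sampler=sampler)

loader = make_loader(X, y)

# Class-weighted loss as a complementary imbalance correction;
# pos_weight is the ratio of negatives to positives from the text.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([97082 / 13914]))

# Mini-batch SGD with momentum and a step learning-rate scheduler, as in the text;
# the linear model is a placeholder for the NN / T-CNN architectures.
model = torch.nn.Linear(1573, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```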
The drop in $F_1$-score when using a smaller number of features was limited to a few percentage points for all models, with the exception of the T-CNN in the one day before use case. The observed similar performance of the models may be due to the presence of several predictive features in the dataset, most likely more than the ones dropped. To determine which features have the largest predictive power and to assess model interpretability, we explored feature importance methods. Although several studies 8 used MDI importance 20 on datasets containing numerical features, we disregarded it, given its sensitivity to high-cardinality features. In addition, commonly used libraries compute MDI-based importance values on the training set, and therefore the extracted important features are not useful for making predictions that generalize to the test set. Instead, we focused on the SHAP method 21 and report feature importance values computed on the test set. Regarding the NN and T-CNN models, SHAP values computed with the gradient-based approximation are very similar to the ones computed using different, non-gradient-based approximations; therefore we did not regularize the gradients during training as suggested by 22. For the Boolean conditions, a high level of agreement is observed between the information of Table 2 and the SHAP feature importance values.

We developed several AI models (RF, ET, NN, T-CNN) to predict hospitalization on a large cohort of COVID-19 patients (more than 110k) up to 28 days after the diagnosis, with competitive classification results on the test set. Despite the large class imbalance, using the features available prior to the hospitalization, the models reach average 0.96-0.98 (0.75-0.85) precision, 0.96-0.98 (0.74-0.85) recall, and 0.97-0.98 (0.79-0.83) $F_1$-score across all split iterations for $H_0$ ($H_1$). Similar results were obtained in more restrictive scenarios with a lower number of available features, even when features ranked at the top by importance methods were removed from the baseline model (GP use case). We then systematically explored the SHAP feature importance method for all analysed models in all the use cases and observed an overall agreement with the results of the statistical analysis on the Boolean medical conditions. However, the observed large variability of the SHAP importance scores across models calls for careful usage of the importance results and for their evaluation across different models and methods. When deriving feature importance, methods that can handle feature correlation, like SHAP, should be used, or an analysis of feature correlation should be carefully conducted. Undesired dependencies such as the ones shown in this study need to be carefully evaluated before adoption in high-stakes decisions 23-25. In this retrospective study we tried to adhere as much as possible to the PROBAST framework, within the constrained setting of this use case: hospitalization prediction requires a large dataset of hospitalized and non-hospitalized patients, and therefore data sources of different provenance are needed (e.g., GP office visits, emergency, hospital encounters). This means that if a patient left or suddenly came into the Explorys network for whatever reason (e.g., hospitalized in other hospitals or regions), outcomes and data may be missing. This may also lead to large variability in laboratory values and measurements. Our models were not tested on external datasets or in clinical settings due to time constraints.
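As an illustration of computing SHAP importance values on the test set, the following is a minimal sketch with synthetic stand-in data; the random forest and its configuration are placeholders for the paper's models, not their actual setup.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed feature matrices and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
X_test = rng.normal(size=(200, 20))

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# TreeExplainer for tree ensembles; for the NN and T-CNN models a
# gradient-based approximation (e.g., shap.GradientExplainer) applies instead.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)  # computed on the test set, as in the text

# Depending on the shap version, the output is a per-class list or a 3-D array;
# take the positive (hospitalized) class slice either way.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
importance = np.abs(sv).mean(axis=0)          # global importance per feature
top_features = np.argsort(importance)[::-1][:20]
```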
References

1. Artificial intelligence for COVID-19: saviour or saboteur? The Lancet Digital Health.
2. The role of artificial intelligence in tackling COVID-19.
3. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal.
4. The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health.
5. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.
6. Transparent Reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement.
7. Early prediction of level-of-care requirements in patients with COVID-19.
8. Developing and validating COVID-19 adverse outcome risk prediction models from a bi-national European cohort of 5594 patients.
9. Development and validation of a model for individualized prediction of hospitalization risk in 4,536 patients with COVID-19.
10. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area.
11. IBM Explorys Electronic Health Record (EHR) Database.
12. Predicting the early risk of chronic kidney disease in patients with diabetes using real-world data.
13. Leakage in data mining: formulation, detection, and avoidance.
14. Multimodal deep learning.
15. Coronavirus disease 2019 case surveillance - United States.
16. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
17. Random forests. Machine Learning.
18. Deep learning.
19. Auto-Keras: an efficient neural architecture search system.
20. Understanding variable importances in forests of randomized trees.
21. A unified approach to interpreting model predictions.
22. Right for the right reasons: training differentiable models by constraining their explanations.
23. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
24. Interpretation of neural networks is fragile.
25. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission.