key: cord-0908276-dys2df86
authors: Smith, Matthew; Alvarez, Francisco
title: Identifying mortality factors from Machine Learning using Shapley values - a case of COVID19
date: 2021-03-11
journal: Expert Syst Appl
DOI: 10.1016/j.eswa.2021.114832
sha: cf4726d6a2108da729508184667ad4a51fab4cf6
doc_id: 908276
cord_uid: dys2df86

In this paper we apply a series of Machine Learning models to a recently published, unique dataset on the mortality of COVID19 patients. We use a dataset consisting of blood samples of 375 patients admitted to a hospital in the region of Wuhan, China: 201 patients survived hospitalisation and 174 patients died whilst in hospital. The focus of the paper is not only on seeing which Machine Learning model is able to obtain the highest accuracy but more on the interpretation of what the Machine Learning models provide. We find that age, days in hospital, Lymphocyte and Neutrophils are important and robust predictors of a patient's mortality. Furthermore, the algorithms we use allow us to observe the marginal impact of each variable on a case-by-case patient level, which might help practitioners to easily detect anomalous patterns. This paper analyses the global and local interpretation of the Machine Learning models on patients with COVID19.

The interest in COVID-19 in the academic and data science community has been growing at an unprecedented rate since its outbreak, with new datasets being released on a continuous basis. In this paper we use a unique dataset recently published in the supplementary material of Yan, H.-T. Zhang, Goncalves, et al. (2020). They applied a Machine Learning algorithm, Extreme Gradient Boosting (XGBoost), to blood samples from 485 infected COVID19 patients. From their sample, we downloaded patient blood sample features for 375 patients: 201 patients who survived and 174 who perished from COVID19 between January and February 2020. As far as we are aware, this is the only publicly available dataset which contains characteristics of patients who survived and who died from COVID19; due to the sensitivity of such patient-level information, such datasets are hard to come by. In contrast to Yan, H.-T. Zhang, Goncalves, et al. (2020), we take a more data science oriented approach. We compare other Machine Learning models to XGBoost. We also present a way to analyse individual patient-by-patient predictions quickly, which may be useful in high-stress environments should another pandemic outbreak occur in the future. Additionally, this patient-by-patient analysis is potentially very relevant, as the marginal effect of a given feature might change from one patient to another depending on the other feature values. We also aggregate the patient-by-patient analysis to deliver feature importance scores for the whole sample. For that, we use Shapley values, a concept recently taken from cooperative game theory and applied to Machine Learning, which measures the contribution of each feature value while abstracting away from the model specification. Finally, we apply what-if analysis to the Machine Learning model, which answers the question: how does the predicted probability of mortality change with a marginal increase (decrease) in a patient's characteristics, such as age or the number of days spent in hospital, when all other variables are held constant?

There is an ever-increasing literature in relation to COVID19, not just from the medical sciences but from all angles of the scientific community.
We keep this literature review specific to Machine Learning applications to the COVID19 pandemic; however, other sciences have also analysed the COVID19 situation. Fernandes (2020), Atkeson (2020) and Makridis and Hartley (2020) analysed the economic impact of COVID19, whereas other work analysed the psychological impact on children during the COVID19 lock-down.

To date, clinical studies have found that the majority of COVID19 patients have suffered from lung infection, and therefore many academics have turned to X-ray imagery for early automatic detection systems. Apostolopoulos and Mpesiana (2020), Narin et al. (2020) and Zhang et al. (2020) apply different Neural Networks to lung X-ray images in order to classify patients with and without COVID19. Wang and Wong (2020) apply deep convolutional networks to chest X-ray images to detect patients with COVID19; they released their dataset as an open-source benchmark dataset which contains 13,975 chest X-ray images. Majeed et al. (2020) apply 12 convolutional neural networks to X-ray images, using two COVID19 X-ray image datasets along with a large image dataset of non-COVID19 viral infections, bacterial infections and normal X-rays. Shi et al. (2020) offer a comprehensive literature review of Artificial Intelligence methods applied to imagery data in relation to COVID19. Randhawa et al. (2020) applied a decision tree approach to analyse over 5000 unique viral genomic sequences, including 29 COVID19 virus sequences.

Arentz et al. (2020) discuss a number of patient characteristics of 21 critically ill patients with COVID19 in Washington State. The patients they analysed had a mean age of 70 years (min 43, max 92), with 52% being male. The characteristics of these critically ill patients relevant to this study were a mean absolute lymphocyte count of 889/µL, a mean platelet count of 215 × 10³/µL and a mean white blood cell count of 515/µL. Wynants et al. (2020) present a review and critical appraisal of 27 studies and 31 prediction models from the academic community. They found that the most important reported predictors for patients with COVID19 were age, sex, tomography scan features, C-reactive protein, lactic dehydrogenase and lymphocyte count. They state that all studies were at high risk of bias due to non-representative selection of control patients and a high risk of model over-fitting. Salman et al. (2020) achieved 100% sensitivity, 100% specificity, 100% accuracy, 100% positive prediction and 100% negative prediction when applying deep learning models to the detection of COVID19 from 260 X-ray images. Yan, H.-T. Zhang, Yang Xiao, et al. (2020) analysed patients with COVID19 and found that fever was the most common initial symptom, followed by a cough, fatigue and shortness of breath. They used over 300 variables and found that lactic dehydrogenase, lymphocyte and high-sensitivity C-reactive protein were key clinical features. Chen et al. (2020) analysed the clinical characteristics of COVID19 in pregnancy; they found that, out of 9 patients, 7 presented with a fever, 4 with a cough, 3 with muscle pain and 2 with a sore throat.

There is a fast-growing literature proposing Machine Learning models to predict COVID19 mortality. An illustrative, though ever-expanding, list of works is the following: Chansik et al. (2020), Assaf et al. (2020), Bertsimas et al. (2020), Chowdhury et al. (2020), Di et al. (2020), Ikemura et al. (2020), Laguna-Goya et al. (2020), Lalmuanawma et al. (2020), Malki et al. (2020), Metsky et al. (2020), Osi et al. (2020), Peng et al. (2020), Randhawa et al. (2020) and Singh et al. (2020).
In our analysis, and like many of the papers listed previously, we compare different Machine Learning models in terms of their predictive capacity. In contrast to most of these papers, we go a step further in trying to understand the models' predictions by examining figures for patient-level case studies. The use of Shapley values, which is absent in all of the previous papers, is essential for that. Our motivation is purely practical: a practitioner who is not an expert in Machine Learning should be able to understand the prediction that the application (the Machine Learning model) generates for a given incoming patient in the triage room of a hospital.

The data used in this study can be found in the supplementary material of Yan, H.-T. Zhang, Goncalves, et al. (2020). The original dataset was collected between the 10th of January and the 18th of February 2020; pregnant and breast-feeding women, patients under 18 years of age, and patients with more than 80% incomplete data were omitted from their dataset. In total there were 375 patients in the dataset: 201 patients who survived and 174 patients who died from COVID19. Figure 1 reports the number of confirmed cases for the region of Hubei, China. The shaded region indicates the time period for which we have the data, which contains the most confirmed cases. The summary statistics, reported in the accompanying table, show that there are distinct differences between patients who survived and those who passed away from COVID19. On average, older patients were more likely to pass away as a result of COVID19; additionally, the longer a patient stayed in hospital, the higher their chance of survival. The blood sample data also show significant differences between the two classes, and there is a heavy skew towards males among the patients who passed away from COVID19 in the dataset.

The original dataset contained a significant number of missing values. Panel (A) in Figure 10 in the Appendix reports the percentage of missing values for each patient, by patient outcome, whereas Panel (B) in Figure 10 reports the number of missing values for each variable, by patient outcome. For a number of patients the share of missing values is high (above 60%), while for many variables the share of missing cases is close to 100%. We therefore filter out these variables and use a cut-down version of the data. We set a cut-off threshold of 50%; that is, all variables with more than 50% NA values were removed, as indicated by the vertical line in Figure 10.
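For concreteness, this filtering step can be written as a short pandas operation. The snippet below is a minimal sketch, not the authors' original code; the file name and the exact column layout are assumptions.

import pandas as pd

# Load the patient-level blood sample data (hypothetical file name; the data come
# from the supplementary material of Yan, H.-T. Zhang, Goncalves, et al. (2020)).
df = pd.read_csv("covid19_blood_samples.csv")

# Share of missing values per variable (column).
na_share = df.isna().mean()

# Keep only variables with at most 50% missing values, mirroring the cut-off above.
df_reduced = df.loc[:, na_share <= 0.5]

print(f"Kept {df_reduced.shape[1]} of {df.shape[1]} variables")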
Figure 11 plots an alluvial diagram showing the distribution of patients by gender, mapped into the number of weeks the patient spent in hospital, then into an age category and, finally, into the patient's outcome. It is clear that a larger proportion of the gender 0 category who spent less than a week in hospital and were over 60 years of age died of COVID19-related illnesses. The gender 1 category fared significantly better when following a similar path. Figure 12 in the Appendix plots the characteristics of age and age bins against the outcome variable. Panel (B) shows the outcome by age bins. The triangles on the left side show the outcome of mortality, whereas the right side shows the outcome of survival. The size of the triangle indicates the number of patients with that outcome. For instance, we can see that for the age bin (30, 40] there is a larger triangle on the right side than its corresponding colour on the left side (which is 180 degrees opposite). Therefore, patients in the age bin (30, 40] had a high survival rate. Contrast that with the (80, 90] age bin and we see the opposite trend: a larger triangle on the left side of the plot than on the right, indicating that more people in this age bin perished. Panel (A) shows the violin plots for the age variable by gender and outcome. We can see that there is a distinct bump in the kernel density plot for males around the age of 30 for the patients who died, which is not seen in the sample of patients who survived.

We next report the comparison between different Machine Learning models and show the interpretability of the classification tree model. Moreover, we show four patient-level case studies along with variable importance plots demonstrating which variables the models found most important. Additionally, we report model interpretation based on a concept from cooperative game theory, SHapley Additive exPlanations (SHAP) scores, for one of the models. Finally, we report ceteris paribus and what-if analysis of a patient's survival probability. We discuss each of the above in more detail in the corresponding subsections.

We split the sample of 375 observations into a training and a testing dataset, with 75% corresponding to the training data and 25% to the testing data. The table above reports the confusion matrix statistics for a number of Machine Learning models: Naive Bayes, Logistic Regression, Random Forest, AdaBoost, Classification Tree, LightGBM and XGBoost. Each of the models shows very similar performance metrics, with the ensemble learning models performing slightly better than the simpler models.

Figure 3 plots an example of a decision tree from the Classification Tree model. Roughly, a decision tree, or simply a tree, represents a piece-wise mapping from a set of features, such as Neutrophils or age, into a response variable, which in our application is the probability of mortality. Machine Learning algorithms, such as XGBoost, select the tree (or collection of trees) that minimizes some loss function. Naturally, selecting a tree involves selecting both the order of the features as we move down the tree and the threshold values at each split. In Figure 3, as we move downward, the first split at the root node is made on Neutrophils; each node shows the predicted probabilities of being in each class along with the percentage of observations falling into that node. We can see that patients with Neutrophils < 79 and age < 63 fall into node4, which contains 44% of the total observations and has predicted probabilities of 0.93 for survival and 0.07 for mortality. Therefore, patients who fall into this terminal node are predicted to survive. Contrast that with a more complex, non-linear path: patients with Neutrophils < 79, age ≥ 63, Eosinophils < 0.1 and days in hospital < 7 fall into node21, which has a predicted probability of 0.17 for survival and 0.83 for mortality; 9% of the sample fell into this node. Finally, patients who followed a similar path down the decision tree but stayed in hospital for more than 7 days fell into node20, with a predicted probability of survival of 0.67 and of mortality of 0.33; 10% of the sample fell into this terminal node. The model thus found that the length of time spent in hospital has a significant impact on the probability of survival.
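As a rough illustration of this modelling step, the sketch below fits a single classification tree on a 75%/25% split, continuing from the filtered DataFrame df_reduced in the earlier sketch. It is a minimal sketch under stated assumptions, not the authors' pipeline: the binary column name "outcome" (1 = deceased, 0 = survived), the median imputation and the tree depth are all illustrative choices.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed: df_reduced holds the filtered numeric features plus a binary "outcome" column.
X = df_reduced.drop(columns=["outcome"])
X = X.fillna(X.median(numeric_only=True))   # simple imputation; one possible choice
y = df_reduced["outcome"]

# 75% training / 25% testing split, stratified on the outcome.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# A single, shallow classification tree, analogous to the tree plotted in Figure 3.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Text rendering of the fitted tree: split thresholds and class shares per node.
print(export_text(tree, feature_names=list(X.columns)))

# Predicted probability of mortality for the held-out patients (class 1 = deceased).
p_mortality = tree.predict_proba(X_test)[:, 1]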
A single decision tree, as depicted in Figure 3, is highly interpretable but not very good at prediction, as evidenced by the Classification Tree being the worst-performing model in the comparison. In order to overcome this issue of performance, an ensemble of decision trees can be used to make a prediction. The combination of decision trees greatly improves the prediction, though interpretability becomes, a priori, more complex. In this section, we show how more advanced decision tree models can be interpreted through case studies. What sets the XGBoost model (along with other tree models) apart from traditional black-box Machine Learning models is that it is possible to see how each variable contributes to the overall prediction for each observation, or patient, in the model. There are four possible cases, each representing a different position in the confusion matrix, that is, each representing one of a True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). We briefly discuss the results for two of the cases, leaving the other two to the corresponding figure in the Appendix.

True Positive (TP). Panel (A) in Figure 4 shows the breakdown of how a positive case (deceased) was correctly predicted. Given a particular variable, shown on the x-axis, a log-odds score is calculated (displayed inside each box); the log-odds scores are summed cumulatively to give a final log-odds score (displayed in the final black box), and then a logistic function is applied to the final log-odds result in order to obtain a predicted probability (shown on the y-axis). The horizontal line indicates a y* = 0.5 probability cut-off threshold: patients above this line are classified as deceased and patients below this line are classified as survived. Notice that the final log-odds prediction score is 1.19, which corresponds to a predicted probability of mortality of (1 + exp(−1.19))^{−1} = 0.77.

False Negative (FN). Panel (B) in Figure 4 shows a patient who was incorrectly predicted to have survived. The model incorrectly predicted that the patient would survive, with a final log-odds score of −1.26 and a corresponding predicted probability of mortality of (1 + exp(1.26))^{−1} = 0.22, sitting below the cut-off threshold y* = 0.5.

From the case studies presented in Figure 4 we can see that certain patient characteristics are often given the largest (in absolute value) log-odds scores regardless of whether the patient survived or died. Figure 5 reports the variable importance scores from both the XGBoost and LightGBM models. We can see that the most important variables are consistent across both models, with age, daysInHospital, Lymphocyte and Neutrophils being ranked in the top four in both. From Figure 4 we can also see that different individual patient characteristics are associated with different (positive and negative) prediction scores, and from Figure 5 that certain variables contribute more to the model than others. However, Figure 5 does not tell us whether, for example, different ages contribute more or less to the probability of mortality, just that age is important at a global level. In order to overcome this issue we turn to a concept from coalitional game theory and analyse Shapley values. Shapley values, a classical concept in cooperative game theory (Shapley, 1953), have recently been applied to understanding a Machine Learning model's predictions; see Lundberg and Lee (2017) and Lundberg et al. (2018).
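For reference, the classical Shapley value assigns to a feature i, within the full feature set N and for a value function v (in the SHAP setting, loosely, the model's expected prediction when only the features in the coalition S are known), the quantity

\[
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\bigl[ v(S \cup \{i\}) - v(S) \bigr],
\]

that is, the average marginal contribution of feature i over all orders in which the features can be added to the coalition. Lundberg et al. (2018) show that this quantity can be computed efficiently for tree ensembles, which is what the SHAP scores discussed below rely on.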
Shapley values offer a global interpretation where we can measure how patient characteristics contribute, positively or negatively, to the prediction of mortality. A similar measure was shown previously in Figure 5; however, unlike the feature importance plot shown there, we are now able to see the positive or negative relationship between each variable and the mortality prediction. That is, in the left panel of Figure 6 we can see that age has the greatest variability in Shapley values. Low values of age correspond to younger patients and, more importantly, are assigned negative Shapley values, thus tending to reduce the predicted probability of mortality. Contrast that with high values of age, which correspond to older patients and are assigned positive Shapley values, thus having a positive marginal impact on the predicted probability of mortality. Conversely, the variable daysInHospital shows the opposite pattern: a higher number of days spent in hospital is associated with a negative marginal impact on the predicted probability of mortality, whereas a lower number of days spent in hospital is associated with a positive marginal impact. Other variables follow similarly distinct patterns. Figure 15 in the Appendix plots the mean Shapley value for each of the variables with the highest average SHAP scores, which is somewhat similar to Figure 5. We note that the top four variables are consistent across models and across evaluation criteria.

Shapley values also give a local interpretation, and each patient obtains a total Shapley value (the summation of the Shapley values of each of the variables). This allows us to explain why a patient receives a given prediction and the corresponding contribution of each feature. Figure 13 in the Appendix shows the breakdown of the four most important variables for all patients in the dataset, ranked by each patient's total Shapley value (lowest to highest within each outcome). Figure 7 shows four randomly sampled case studies, two from the deceased side and two from the survived side of Figure 13 (where the background is coloured red = deceased and blue = survived), along with each patient's feature values for the four most important variables in the model: age, daysInHospital, Lymphocyte and Neutrophils. That is, we get to see the patient's characteristics along with the corresponding Shapley value assigned to each feature. Note that these plots differ significantly from those presented in Figure 4, since the Shapley value plots are derived from the training data whereas the XGBoost case studies are obtained from the test data. Moreover, the Shapley value case studies can be thought of as explaining why the model learned a given mapping from features to a prediction, whereas the XGBoost case studies can be thought of as explaining why the model made a given prediction from the features. Figure 13 is essentially the patient observations presented in Figure 7 but stacked more compactly side-by-side (and without the patients' feature values).

Finally, Figure 9 plots the model's what-if analysis for a single patient. We can see, holding all other variables fixed, how the model's predicted probability changes as we move along the x-axis, that is, as the patient's feature value changes. For instance, given that this patient had an age of 66, holding all other variables fixed, an increase in the patient's age increases the predicted probability of mortality. Moreover, the patient spent 7 days in hospital; the what-if analysis suggests that, had the patient spent more than 10 days in hospital, they would have had a marginally lower predicted probability of mortality, holding all other variables constant. Similar analysis can be carried out for all patients and for all variables.
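A minimal sketch of such a ceteris-paribus (what-if) curve is given below. It assumes a fitted classifier model exposing predict_proba (for instance the scikit-learn interface of XGBoost, xgboost.XGBClassifier, or the decision tree fitted in the earlier sketch), the held-out features X_test from that split, and all-numeric features; the feature name "age" and the grid of values are illustrative.

import numpy as np
import pandas as pd

def what_if(model, patient_row, feature, grid):
    # Build identical copies of the patient's profile, varying only `feature`,
    # and return the predicted probability of mortality over the grid.
    profiles = pd.concat([patient_row.to_frame().T] * len(grid), ignore_index=True)
    profiles[feature] = grid
    profiles = profiles.astype(float)        # ensure numeric dtypes after the transpose
    return model.predict_proba(profiles)[:, 1]

patient = X_test.iloc[0]                     # one incoming patient (illustrative choice)
age_grid = np.arange(20, 91)                 # hypothetical range of ages to explore
p_by_age = what_if(model, patient, "age", age_grid)

for age, p in zip(age_grid[::10], p_by_age[::10]):
    print(f"age={age:>3}  P(mortality)={p:.2f}")

Packaged implementations of the same idea exist (for example, ceteris-paribus profiles in the DALEX ecosystem), but the underlying logic is simply re-scoring the patient's profile over a grid of counterfactual feature values.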
This paper analyses a number of patient characteristics by applying a series of Machine Learning models in order to predict the mortality of patients admitted to hospital with COVID19. There were 375 patients in the dataset, of which 201 survived and 174 died from COVID19. Ensemble tree-based models obtained higher prediction scores than the simpler, yet easier to understand, classical models. We focus our analysis on the interpretability of the Machine Learning models. Firstly, we introduce patient case studies for each quadrant of the confusion matrix, which helps in understanding why a model did or did not make a correct prediction. We also show that there is consistency, both across models and across evaluation criteria, in what the four most important variables are. Moreover, we find that the variables age, daysInHospital, Lymphocyte and Neutrophils are the most important when making a prediction. We discuss how variations in patient characteristics have a positive or negative effect on the model's prediction through the use of SHapley Additive exPlanations (Shapley values) from cooperative game theory. Moreover, we use patient-level Shapley values to understand, for four case studies, how the model assigns Shapley scores to each patient based on that patient's characteristics, and we study the interaction between patient characteristics and their corresponding Shapley values. Finally, we briefly discuss ceteris paribus analysis in order to understand how the model's predictions change under what-if scenarios. Tree-based models could be useful for analysing patients during peak epidemic outbreaks, when hospitals may be overloaded and quick analysis is needed, especially given the non-linear nature of patient characteristics on admission to hospital.

The robustness of our findings is bounded by the diversity of our dataset. We take data from Yan, H.-T. Zhang, Goncalves, et al. (2020), which leverages a database of blood samples. It would be interesting to apply the Machine Learning algorithms used in this paper to a wider population of patients. Another relevant dimension worth exploring is enlarging the range of potentially relevant features: this study primarily focused on blood cell data, but including other features, such as aspartate aminotransferase (AST) and alanine aminotransferase (ALT), could potentially yield a richer analysis of patient characteristics and morbidity from COVID19. To summarize, our paper shows a promising direction in how relatively standard classification trees in Machine Learning, combined with Shapley values, help to identify mortality factors for COVID19; however, more robust conclusions require richer datasets. From a more operational angle, a growing branch of the literature proposes the use of a number of Machine Learning models, say, at the triage phase in hospitals. In this regard, our differential factor, as mentioned, is to propose patient case studies and patient-level Shapley values that can be easily interpreted, and learnt, by practitioners in the field, even those who are not familiar with the terminology used in Machine Learning, which facilitates real implementation.
False Positive (FP). Panel (A) of the corresponding figure in the Appendix shows a patient who was incorrectly predicted to be deceased. The model incorrectly predicted that the patient would be deceased, with a final log-odds score of 0.53 and a corresponding predicted probability of mortality of (1 + exp(−0.53))^{−1} = 0.63, sitting just above the cut-off threshold y* = 0.5.

True Negative (TN). Panel (B) shows a patient who was correctly predicted to have survived, with a final log-odds score of −1.37 and a corresponding predicted probability of mortality of (1 + exp(1.37))^{−1} = 0.20.

References

Machine learning prediction for mortality of patients diagnosed with COVID-19: a nationwide Korean cohort study
COVID-19: Automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
Characteristics and outcomes of 21 critically ill patients with COVID-19 in Washington State
Utilization of machine-learning models to accurately predict the risk for critical COVID-19. Internal and Emergency Medicine 15
What will be the economic impact of COVID-19 in the US? Rough estimates of disease scenarios
COVID-19 Mortality Risk Assessment: An International Multi-Center Study
Clinical characteristics and intrauterine vertical transmission potential of COVID-19 infection in nine pregnant women: A retrospective review of medical records
XGBoost: A scalable tree boosting system
An early warning tool for predicting mortality risk of COVID-19 patients using machine learning
Common cardiovascular risk factors and in-hospital mortality in 3,894 patients with COVID-19: survival analysis and machine learning-based findings from the multicentre Italian CORIST Study. Nutrition, Metabolism and Cardiovascular Diseases
Economic effects of coronavirus outbreak (COVID-19) on the world economy. Available at SSRN 3557504
Using Automated Machine Learning to Predict COVID-19 Patient Survival: Identify Influential Biomarkers. medRxiv
IL-6-based mortality risk model for hospitalized patients with COVID-19
Applications of machine learning and artificial intelligence for COVID-19 (SARS-CoV-2) pandemic: A review
Consistent individualized feature attribution for tree ensembles
A unified approach to interpreting model predictions
COVID-19 detection using CNN transfer learning from X-ray images
The cost of COVID-19: A rough estimate of the 2020 US GDP impact
Association between weather data and COVID-19 pandemic predicting mortality rate: Machine learning approaches
CRISPR-based COVID-19 surveillance using a genomically-comprehensive machine learning approach
Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
A Classification Approach for Predicting COVID-19 Patient Survival Outcome with Machine Learning Techniques
An empirical overview of nonlinearity and overfitting in machine learning using COVID-19 data
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
COVID-19 detection using artificial intelligence
A value for n-person games
Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19
Study of ARIMA and least square support vector machine (LS-SVM) models for the prediction of SARS-CoV-2 confirmed cases in the most affected countries
Mitigate the effects of home confinement on children during the COVID-19 outbreak
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
Prediction models for diagnosis and prognosis of COVID-19 infection: Systematic review and critical appraisal
An interpretable mortality prediction model for COVID-19 patients
Prediction of criticality in patients with severe COVID-19 infection using three clinical features: A machine learning-based prognostic model with clinical data in Wuhan
COVID-19 screening on chest X-ray images using deep learning based anomaly detection