title: COVID-Net Biochem: An Explainability-driven Framework to Building Machine Learning Models for Predicting Survival and Kidney Injury of COVID-19 Patients from Clinical and Biochemistry Data
authors: Aboutalebi, Hossein; Pavlova, Maya; Shafiee, Mohammad Javad; Florea, Adrian; Hryniowski, Andrew; Wong, Alexander
date: 2022-04-24

Ever since the declaration of COVID-19 as a pandemic by the World Health Organization in 2020, the world has continued to struggle in controlling and containing the spread of the COVID-19 pandemic caused by the SARS-CoV-2 virus. This has been especially challenging with the rise of the Omicron variant and its subvariants and recombinants, which has led to a significant increase in patients seeking treatment and has put a tremendous burden on hospitals and healthcare systems. A major challenge faced during the pandemic has been the prediction of survival and the risk for additional injuries in individual patients, which requires significant clinical expertise and additional resources to avoid further complications. In this study we propose COVID-Net Biochem, an explainability-driven framework for building machine learning models to predict patient survival and the chance of developing kidney injury during hospitalization from clinical and biochemistry data in a transparent and systematic manner. In the first "clinician-guided initial design" phase, we prepared a benchmark dataset of carefully selected clinical and biochemistry data based on clinician assessment, which were curated from a patient cohort of 1366 patients at Stony Brook University.
A collection of different machine learning models with a diversity of gradient-based boosting tree architectures and deep transformer architectures was designed and trained specifically for survival and kidney injury prediction based on the carefully selected clinical and biochemical markers.

The world continues to struggle to contain and control the COVID-19 pandemic, with nations applying measures such as lockdowns and remote work to limit the spread of the virus at the cost of straining their economies. Although the situation has improved significantly since the declaration of the global pandemic by the World Health Organization (Pak et al., 2020), the number of infected people and the death toll continue to rise whenever a new COVID-19 variant emerges (Thakur and Kanta Ratho, 2021). The number of infected patients has a direct impact on national healthcare systems and causes a swift drain of hospitals' human and material resources. For example, a shortage of hospital staff can lead to misdiagnosis and improper treatment of COVID-19 patients. As recent studies report (Dadson et al., 2020; Sullivan et al., 2022), COVID-19 infection can cause serious health complications such as acute kidney injury (AKI), which can be fatal in some patients. Understanding these complications and acting preemptively during treatment can significantly increase the survival chance of a patient suffering from COVID-19 (See et al., 2021). Nevertheless, lack of resources makes taking swift action even more difficult. In this paper, a new machine learning model framework is proposed to predict a patient's survival chance and the chance of developing kidney injury during hospitalization from clinical and biochemistry data in a transparent and systematic manner. The proposed COVID-Net Biochem method is an explainability-driven framework for building machine learning models for the aforementioned tasks that can be extended to other healthcare domains.
The explainability insight derived from the model decision-making process provides a framework for auditing model decisions. This capability can be used in tandem to gain powerful new insights into potential clinical and biochemical markers that are relevant to the prediction outcome. As such, the proposed method can assist physicians in making the diagnosis process more effective and efficient by providing supplementary outcome predictions based on a large collection of clinical and biochemical markers, as well as by highlighting key markers relevant to the task. The resulting output from the framework is a diverse collection of machine learning models, including different gradient-based boosting tree architectures and deep transformer architectures designed specifically to predict survival chance and kidney injury, together with the dominant clinical and biochemical markers leveraged throughout their decision-making processes. In this work, we propose COVID-Net Biochem, an explainability-driven framework for building machine learning models for patient survival prediction and AKI (acute kidney injury during hospitalization) prediction in a transparent and systematic manner. The proposed two-phase framework leverages both clinician assessment and deep insights extracted via a quantitative explainability strategy to not only gain a deeper understanding of the decision-making process of the machine learning models and the impact of different clinical and biochemical markers on that process, but also enable the creation of high-performance, trustworthy, clinically sound machine learning models by guiding architecture design and training policies based on these extracted clinical and explainability-driven insights in an iterative manner.
A key generalizable insight we wish to surface in this work is the largely 'black box' nature of model design in the current state of machine learning in the context of healthcare; strategies for transparent design are not only critical but highly beneficial for building reliable, clinically relevant models in a trustworthy manner for widespread adoption in healthcare. More specifically, while significant advances have been made in machine learning, particularly with the introduction of deep learning, much of the design methodology leveraged in the field relies solely on a small set of performance metrics (e.g., accuracy, sensitivity, specificity, etc.) to evaluate and guide the design process of machine learning models. Such 'black box' design methodologies provide little insight into the decision-making process of the resulting machine learning models, and as such even the designers themselves have few means to guide their design decisions in a clear and transparent manner. This is particularly problematic given the mission-critical nature of clinical decision support in healthcare, and can lead to a significant lack of trust and understanding by clinicians in machine learning-driven clinical decision support solutions. Furthermore, the lack of interpretability of the decision-making process, both during design and during clinical use, creates significant accountability and governance issues, particularly if decisions and recommendations made by machine learning models result in negative patient impact. Motivated to tackle the challenges associated with 'black box' model design for clinical decision support, in this work we propose an explainability-driven development framework for machine learning models that can be extended to multiple healthcare domains, such as COVID-19 survival and acute kidney injury prediction.
The framework provides a two-phase approach in which a diverse set of machine learning models is designed and trained on a curated dataset, then validated using both an automatic explainability technique to identify key features and manual clinician validation of the highlighted features. The second phase consists of leveraging the explainability-driven insights to revise the data and the design of the models to ensure high detection performance driven by relevant clinical features. The development process yields high-performing, transparent detection models that not only provide supplementary outcome predictions for clinicians but also quantitatively highlight important factors that could provide new insights beyond standard clinical practice.

From the onset of this crisis, the push for improving effective screening methods has gained tremendous attention worldwide. Access to accurate and effective patient screening processes is important for providing immediate treatment and isolation precautions to contain the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus causing the COVID-19 pandemic. Several research efforts have utilized deep learning models for screening COVID-19 patients. Studies have shown that, by exploiting deep learning, COVID-19 cases can be diagnosed from CXR images with acceptable accuracy (Wong et al., 2020; Warren et al., 2018; Toussie et al., 2020; Huang et al., 2020a; Guan W.J., 2020; Zhang et al., 2021), followed by works that utilized CT images to diagnose COVID-19 cases (Silva et al., 2020; Saood and Hatem, 2021). In addition, several techniques have been proposed to grade the severity of COVID-19 patients based on medical imaging modalities (Shoaib et al., 2021; Tang et al., 2021; Qiblawey et al., 2021).
While using computer-aided diagnostics for screening and medical imaging of COVID-19 patients has been very popular, little work has been done on using machine learning models to predict survival and the development of acute kidney injury (AKI) among COVID-19 patients. Furthermore, most of the algorithms proposed so far lack interpretability, which makes their real-world application and integration into clinical workflows questionable, as their decision-making process cannot be audited. Interpretable assessment tools are particularly important because they can assist physicians in the diagnosis process and alert the medical team when a certain complication carries high risk, with a quantitative comprehension of the underlying decision-making process, increasing the preemptive measures that can be taken to reduce the associated risks. As a result, not only can the cost of treatment be reduced substantially, but patients also have a higher chance of survival (Hirsch et al., 2020). One of the studies most relevant to the proposed method is the approach introduced by Gladding et al. (2021), which utilizes a machine learning model to identify COVID-19 and other diseases from hematology data. Furthermore, Çallı et al. (2021) proposed a novel deep learning architecture for the detection of COVID-19 based on laboratory results. While these works focus on identifying COVID-19 positive cases, our work focuses on predicting the survival chance and the chance of developing AKI during hospitalization based on biochemical data, along with the proposal of an end-to-end transparent model development framework that can be extended to other healthcare domains.

In this section, we describe in detail the proposed systematic framework for building high-performance, clinically sound machine learning models from relevant clinical and biochemical markers in a transparent and trustworthy manner.
The proposed COVID-Net Biochem framework comprises two main phases.

Clinician-guided initial design phase: The first phase starts with the preparation of a benchmark dataset from carefully selected clinical and biochemical markers, based on clinical assessment, curated for a patient cohort. While a plethora of clinical and biochemical markers may be collected for a patient cohort, only a subset of markers is relevant for a given predictive task, while others may be not only irrelevant but misleading to the machine learning model when leveraged. Furthermore, certain clinical markers such as age and gender can lead to considerable bias in the resulting model. Therefore, in this phase, we remove clinically irrelevant markers through consultations with clinicians who have domain knowledge of the task. Next, a collection of different machine learning models with a diversity of gradient-based boosting tree architectures and deep transformer architectures is designed and trained on the constructed benchmark dataset.

Explainability-driven design refinement phase: The second phase starts with the quantitative explainability validation of model performance and behaviour to gain a deeper understanding of the decision-making process, as well as quantitative insights into the impact of clinical and biochemical markers on the decision-making process and the identification of the key markers influencing it. In this paper, we leverage a quantitative explainability technique called GSInquire to conduct this evaluation. Next, we analyze and interpret the decision-making process of each model through the identified relevant predictive markers and leverage these insights in an iterative design refinement manner to build progressively better and more clinically relevant machine learning models.
More specifically, if all of the clinical and biochemical markers identified by quantitative explainability as driving the decision-making process of a given model are verified to be clinically sound based on clinical assessment of the explainability results, the model is accepted as the final model; otherwise, we return to the first phase, where the irrelevant markers are discarded for that given model, a new model is produced via hyperparameter optimization and training, and the result is again tested in the second phase. This iterative approach not only removes the influence of quantitatively and clinically irrelevant clinical and biochemical markers, but also removes markers that may dominate the decision-making process despite being insufficient for clinically sound decisions (e.g., the heart rate marker may be clinically relevant but should not be solely leveraged for survival prediction from COVID-19, given its general severity implication). This iterative process continues until the model heavily leverages clinically sound clinical and biochemical markers in its decision-making process. Figure 3 provides an overview of the complete iterative design process in the proposed framework. In the following sections, we show how we have applied this framework in our study to develop reliable models that leverage only relevant markers for decision making. In this particular study, the clinician-guided initial design phase consists of constructing a new benchmark dataset of clinical and biochemistry data, based on clinical feedback, curated from a patient cohort of 1366 patients at Stony Brook University (Saltz et al., 2021).
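The accept-or-refine loop described above can be sketched in a few lines. This is an illustrative sketch only: `train_model`, `explain_markers`, and `clinician_approves` are hypothetical stand-ins for the actual training routine, the GSInquire explainability step, and the manual clinical review, respectively.

```python
def refine_model(markers, train_model, explain_markers, clinician_approves,
                 max_rounds=10):
    """Iteratively drop markers flagged as clinically irrelevant and retrain.

    Phase 1: train a model on the current marker set.
    Phase 2: identify the markers driving its decisions; if the clinician
    rejects any of them, remove those markers and repeat.
    """
    markers = list(markers)
    for _ in range(max_rounds):
        model = train_model(markers)                   # phase 1: design & train
        key_markers = explain_markers(model, markers)  # phase 2: explainability
        rejected = [m for m in key_markers if not clinician_approves(m)]
        if not rejected:          # all key markers are clinically sound: accept
            return model, markers
        markers = [m for m in markers if m not in rejected]
    return model, markers
```

For example, with stand-ins where the explainer always surfaces the first marker and the clinician rejects "Length of Stay", the loop drops that marker in the first round and accepts the model in the second.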
The collection of models we designed and trained on the constructed benchmark dataset is based on the following architecture design patterns: i) TabNet (Arik and Pfister, 2019), ii) TabTransformer (Huang et al., 2020b), iii) FTTransformer (Gorishniy et al., 2021), iv) XGBoost (Chen and Guestrin, 2016), v) LightGBM (Ke et al., 2017), and vi) CatBoost (Prokhorenkova et al., 2018). TabNet employs sequential attention to score features for decision making, making the model more interpretable than previously proposed deep learning models for tabular datasets (Arik and Pfister, 2019). TabTransformer and FTTransformer utilize the more recent transformer architecture, adapted to process tabular datasets. In practice, transformer models have shown higher performance on most well-known datasets (Huang et al., 2020b; Gorishniy et al., 2021; Vaswani et al., 2017). The gradient boosting algorithms rely on creating and learning an ensemble of weak prediction models (decision trees) by minimizing an arbitrary differentiable loss function. In the explainability-driven design refinement phase for this particular study, we conduct quantitative explainability validation of model performance and behaviour by leveraging GSInquire (Lin et al., 2019b), a state-of-the-art explainability technique that has been shown to produce explanations significantly more reflective of the decision-making process than other well-known explainability techniques in the literature. GSInquire enables the assignment of a quantitative importance value to each clinical and biochemical marker, representing its impact on the model prediction. Finally, clinical assessment of the explainability-driven insights was conducted by a clinician with over eight years of experience.
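To make the gradient boosting idea concrete, the following is a minimal from-scratch sketch of the core technique shared by XGBoost, LightGBM, and CatBoost: an additive ensemble of weak learners (here depth-1 trees, i.e. stumps) fit round by round to the negative gradient of the logistic loss. It illustrates the algorithm only; it is not the implementation used by any of those libraries.

```python
import math

class Stump:
    """Depth-1 decision tree: the weak learner in the boosting ensemble."""
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

    def predict(self, x):
        return self.left if x[self.feature] <= self.threshold else self.right

def fit_stump(X, residuals):
    """Least-squares stump fit to the current residuals."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if err < best_err:
                best, best_err = Stump(f, t, lm, rm), err
    return best

def gradient_boost(X, y, rounds=20, lr=0.3):
    """Fit an additive model F(x) = sum(lr * stump(x)) under the logistic loss."""
    ensemble = []
    for _ in range(rounds):
        scores = [sum(lr * s.predict(row) for s in ensemble) for row in X]
        probs = [1 / (1 + math.exp(-z)) for z in scores]
        # Negative gradient of the logistic loss w.r.t. the current scores.
        residuals = [yi - p for yi, p in zip(y, probs)]
        ensemble.append(fit_stump(X, residuals))
    return lambda row: 1 / (1 + math.exp(
        -sum(lr * s.predict(row) for s in ensemble)))
```

On a toy linearly separable dataset, the fitted predictor pushes the probability below 0.5 on one side of the learned split and above 0.5 on the other.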
In this section, we provide a comprehensive overview of the data preparation process in constructing a benchmark dataset for COVID-19 patient survival and AKI prediction in the clinician-guided initial design phase of the proposed framework, as well as the clinical and biochemical marker selection process based on explainability-driven insights in the explainability-driven design refinement phase. The proposed dataset is built by carefully selecting clinical and biochemical markers, based on clinical assessment, from a patient cohort curated by Stony Brook University (Saltz et al., 2021). More specifically, the clinical and biochemical markers were collected from a patient cohort of 1366 COVID-19 positive patients, and consist of both categorical and numerical markers. The clinical and biochemical markers include patient diagnosis information, laboratory test results, intubation status, oral temperature, symptoms at admission, as well as a set of biochemical markers derived from blood work. Table 1 lists the numeric clinical and biochemical markers from the patient cohort and their associated dynamic ranges. The categorical clinical markers consist of "gender", "last status" (discharged or deceased), "age", "is icu" (received ICU care or not), "was ventilated" (received a ventilator or not), "AKI during hospitalization" (True or False), "Type of Theraputic received", "diarrha", "vomiting symptom", "nausea symptom", "cough symptom", "antibiotic received" (True or False), "other lung disease", "Urine protein symptom", "smoking status", and "abdominal pain symptom".

Target value: In this study, "last status" is used as the target value for the task of predicting survival given the patient's symptoms and status. In addition, "AKI during hospitalization" is the target for the task of predicting kidney injury development during hospitalization.
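The class balance of either binary target can be inspected with a few lines; the sketch below uses hypothetical toy records and column names, not the actual dataset.

```python
from collections import Counter

def target_distribution(records, target):
    """Return the fraction of each class for a given target column."""
    counts = Counter(record[target] for record in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy records mimicking the dataset's "last status" column.
records = [{"last status": "discharged"}] * 9 + [{"last status": "deceased"}]
distribution = target_distribution(records, "last status")  # highly unbalanced
```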
Figures 2a and 2b show that the distribution of these two target values in the patient cohort is highly unbalanced.

Missing values and input transformation: We found that using different types of input transformation does not substantially change the final results of our models. In this regard, we examined the MinMax scaler and the uniform and normal quantile transformers (all available in the scikit-learn preprocessing module (Pedregosa et al., 2011)). None of them provided better results. On the other hand, the dataset had many missing values; to address this, any marker with more than 75% of its values missing was removed from our study. For the remaining missing values, we found that both the transformer models and the gradient boosting tree models are resilient to missing data, and that replacing a missing value with a constant gives competitive results. In particular, we followed the same strategy introduced in TabTransformer (Huang et al., 2020b), where a missing value is treated as an additional category.

To create our benchmark dataset in the clinician-guided initial design phase, we consulted with a clinician with over eight years of experience and identified clinical markers that are clinically irrelevant and may result in biases being learnt by the machine learning models. In particular, we excluded demographic markers that would induce bias in our training process. Given the highly imbalanced patient cohort, demographic markers such as "age" or "gender" can cause significant bias in the decision-making process of the trained machine learning models. As seen in Figure 2c, the gender distribution is highly skewed and could lead to spurious correlations in the resulting machine learning model if used for training.
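The two missing-value rules above (drop markers exceeding the 75% missing threshold, then treat any remaining missing entry as an extra category) can be sketched as follows, assuming rows are represented as dicts with `None` marking a missing entry; the sentinel token name is an illustrative choice.

```python
def preprocess(rows, max_missing_frac=0.75, missing_token="__missing__"):
    """Drop columns that exceed the missing-value threshold, then replace
    remaining missing entries with a constant sentinel category."""
    n = len(rows)
    keep = [col for col in rows[0]
            if sum(row[col] is None for row in rows) / n <= max_missing_frac]
    return [{col: (row[col] if row[col] is not None else missing_token)
             for col in keep} for row in rows]
```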
Finally, other confounding factors such as "heart rate" and "invasive ventilation days" were also removed after consulting with the clinician, as their impact on survival and AKI prediction was not directly clinically relevant.

Figure 2. (a) "last status" distribution; (b) "Acute Kidney Injury (AKI)" distribution; (c) "gender" distribution.

In the explainability-driven design refinement phase, we leverage quantitative explainability to analyse the decision-making processes of the individual trained models within the collection of initial model designs, and identify the most quantitatively important clinical and biochemical markers for each model using the GSInquire quantitative explainability technique. After identifying the most quantitatively important markers in the decision-making processes of the individual models, we presented these explainability results to the clinician, not only to gain valuable clinical insights into the clinical soundness of the machine learning models, but also to identify the non-relevant markers among them that the models rely on, so that these can be excluded in the next round of model design refinement and training. As an example, after conducting explainability-driven assessment of the machine learning models with LightGBM and CatBoost architectures, we observed that the clinical marker "Length of Stay" had the highest quantitative impact on the decision-making process of those models for AKI prediction (see Figure 7). After clinical consultation on this explainability-driven insight, we found that this marker has little clinical value in determining the likelihood of AKI. As a result, in the next round of model design and training, the "Length of Stay" marker was excluded. This process continued until only the markers relevant to our prediction tasks were utilized by the individual models.
It is important to note that the explainability-driven assessment was conducted on each model independently; as such, each model is uniquely tailored around the clinical and biochemical markers that benefit it most from the set of possible markers. The final result, shown in Figure 5, confirms that the models do not depend on irrelevant markers. These figures are discussed further in the Explainability section. Finally, to better show the correlation between clinical and biochemical markers, Figure 1 shows the correlation of the top ten markers for AKI (acute kidney injury during hospitalization) and for the "last status" target marker. As seen, for the target marker "last status", AKI has the highest correlation. For the target marker AKI, "Urine Protein", "Therapeutic Heparin", "Fibrin D Dimer", "Creatinine", and "Glomerular Filtration" have the highest correlation values. It is worth noting that the trained models in the experiments section actually utilize these markers for decision making, as discussed in the Explainability section.

In this section, we describe the experimental results and training procedure for the different machine learning models created using the proposed framework for the purpose of predicting COVID-19 patient survival and predicting the development of AKI (acute kidney injury) in COVID-19 patients during hospitalization. As mentioned earlier, we designed six different machine learning models for the aforementioned prediction tasks using the following architecture design patterns: TabTransformer, TabNet, FTTransformer, XGBoost, LightGBM, and CatBoost. Our training procedure is guided not only by accuracy, precision, and recall but also by the identified explainability results. In the next section, we provide explainability analyses of the models' decision processes. We set "last status", which has a binary value of deceased or discharged, as our target for this task.
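The marker-to-target correlation ranking described above can be sketched as a Pearson correlation against a binary target; the marker names and values below are hypothetical toy data, not the study's dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def top_correlated(data, target, k=10):
    """Rank markers by |Pearson correlation| with the target column."""
    scores = {name: abs(pearson(values, data[target]))
              for name, values in data.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy markers: "AKI" tracks the target, "noise" does not.
data = {"AKI": [0, 0, 1, 1], "noise": [1, 0, 1, 0], "target": [0, 0, 1, 1]}
ranking = top_correlated(data, "target", k=2)
```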
For the training, as briefly discussed in the previous section, we constantly monitored the decision-making process of each model using GSInquire to make sure the model was choosing a relevant set of features for its predictions. We used 20% of the dataset as the test set and another 5% as the validation set. For TabTransformer, TabNet, and FTTransformer, we performed a grid search to find the best hyperparameters. In this regard, we set the batch size to 256 and the learning rate to 0.00015, and we ran the models for 150 epochs with early stopping on the validation set. We used the Adam optimizer for all tasks. The training procedure was done in parallel with obtaining explainability results for the models. In this regard, we discarded the features "heart rate", "length of stay", and "invasive ventilation days", as models tended to rely heavily on these less relevant factors for decision making. For the gradient boosting models XGBoost, CatBoost, and LightGBM, we used the default settings except for the learning rate. A learning rate of 0.35 gave the highest accuracy for CatBoost; for XGBoost and LightGBM, we set the learning rate to 0.3 and 0.1, respectively. The results for the models are depicted in Table 2, and Table 3 shows the confusion matrices for CatBoost and TabTransformer. As can be seen, XGBoost had the best performance, achieving an accuracy of 98.1% on the test set. Among the deep learning models, TabTransformer had the best performance, with an accuracy of 95.9%. Both TabTransformer and XGBoost achieved above 96% recall and precision.

We set "AKI during hospitalization", which has a binary value of True or False, as our target for this task. The training procedure for this task was very similar to that of the survival task, with almost the same hyperparameters, except that here we also removed the "last status" marker from the model inputs, as it is not a clinically relevant input for this task. The results for the models are depicted in Table 4.
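The reported accuracy, precision, and recall follow directly from the confusion matrix entries; the sketch below shows the standard computation, not the authors' evaluation code.

```python
def confusion_and_metrics(y_true, y_pred, positive=1):
    """Confusion matrix counts plus accuracy, precision, and recall
    for a binary classification task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```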
Table 5 shows the confusion matrices for LightGBM and TabTransformer. As can be seen, LightGBM had the best performance, achieving an accuracy of 96.7% on the test set. Among the deep learning models, TabTransformer had the best performance, with an accuracy of 91.9%. The benchmark dataset created in this study and the link to the code are available here

As explained earlier, the trained models from phase one of the development framework were then audited via explainability-driven performance validation to gain insights into their decision-making processes, informing the design modifications in phase two of the process. We leveraged GSInquire (Lin et al., 2019a), which assigns a score to each marker based on its influence on the outcome prediction through an inquisitor I within a generator-inquisitor pair {G, I}. These actionable insights were then further validated by a clinician to ensure clinical relevance, and later employed by the framework to make design revisions to the models accordingly. Figures 4 and 5 show the 10 most impactful clinical and biochemical markers for COVID-19 survival and AKI prediction, respectively, for the highest performing models: TabTransformer, LightGBM, and CatBoost. Figure 6 provides a summary of the high-impact markers across all models by averaging their impact scores and reporting the top 10 highest positive predictive parameters. For COVID-19 patient survival prediction, the marker indicating whether a patient experienced acute kidney injury during hospitalization provides the highest impact on model predictions, which is aligned with our clinician's assessment. In this regard, we observed in Figure 1 that there is a direct correlation between survival and acute kidney injury. It is also interesting to see in Figure 4 that while the two gradient boosting trees share the same highest-impact marker (AKI), the TabTransformer looks at two other markers, B and Fibrin D Dimer, which are also relevant for survival prediction.
What we can see here is that as we change the type of learning model from gradient boosting trees to deep neural networks, the model considers a different set of relevant clinical and biochemical markers. This also happens in Figure 5 for acute kidney injury prediction. While both gradient boosting trees clearly consider Therapeutic Heparin and Creatinine to determine the chance of developing AKI, the TabTransformer considers a different set of relevant markers, such as Ferritin, for decision making. Finally, it is worth mentioning that our clinician found Figure 6, which represents the main markers used by all models on average, very interesting. Most of the biochemical and clinical markers in this figure, including Creatinine and Therapeutic Heparin, are considered among the most relevant markers for determining a patient's survival chance and chance of AKI.

In this work, we presented an explainability-driven framework for building machine learning models that yields transparent models leveraging only clinically relevant markers for prediction. As a proof of concept, we applied this framework to predicting survival and kidney injury during hospitalization of COVID-19 patients, such that only clinically relevant clinical and biochemical markers are leveraged in the decision-making process of the models, ensuring that the decisions made are clinically sound. Experimental results show that the constructed machine learning models not only achieved high predictive performance, but did so in a way that considered clinically sound clinical and biochemical markers in their decision-making processes. In this regard, we provided a comprehensive examination of the constructed machine learning models' accuracy, recall, precision, F1 scores, and confusion matrices on the benchmark dataset. Furthermore, we interpreted the decision-making process of the models using quantitative explainability via GSInquire.
Finally, we showed that the models use acute kidney injury as the main factor in determining the survival chance of a COVID-19 patient, and leverage the Creatinine biochemical marker as the main factor in determining the chance of developing kidney injury, which is consistent with clinical interpretation. While our findings show a path toward building better explainable models for healthcare problems, more experiments need to be carried out to validate the results obtained in this work.

References

- TabNet: Attentive interpretable tabular learning. arXiv.
- Deep learning with robustness to missing data: A novel approach to the detection of COVID-19.
- XGBoost: A scalable tree boosting system.
- Underlying kidney diseases and complications for COVID-19: a review. Frontiers in Medicine.
- A machine learning program to identify COVID-19 and other diseases from hematology data.
- Revisiting deep learning models for tabular data.
- Clinical characteristics of coronavirus disease 2019 in China.
- Acute kidney injury in patients hospitalized with COVID-19.
- Clinical features of patients infected with 2019 novel coronavirus in Wuhan.
- Tabular data modeling using contextual embeddings.
- LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems.
- Do explanations reflect decisions? A machine-centric strategy to quantify the performance of explainability algorithms.
- Do explanations reflect decisions? A machine-centric strategy to quantify the performance of explainability algorithms.
- Economic consequences of the COVID-19 outbreak: the need for epidemic preparedness.
- Scikit-learn: Machine learning in Python.
- CatBoost: unbiased boosting with categorical features.
- Detection and severity classification of COVID-19 in CT images using deep learning.
- Stony Brook University COVID-19 positive cases. The Cancer Imaging Archive.
- COVID-19 lung CT image segmentation using deep learning methods: U-Net versus SegNet.
- Risk factors for development of acute kidney injury in COVID-19 patients: A retrospective observational cohort study.
- COVID-19 severity: Studying the clinical and demographic risk factors for adverse outcomes.
- COVID-19 detection in CT images with deep learning: A voting-based scheme and cross-datasets analysis.
- Acute kidney injury in patients hospitalized with COVID-19 from the ISARIC WHO CCP-UK study: a prospective, multicentre cohort study.
- Severity assessment of COVID-19 using CT image features and laboratory indices.
- Omicron (B.1.1.529): A new SARS-CoV-2 variant of concern mounting worldwide fear.
- Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS.
- Attention is all you need. Advances in Neural Information Processing Systems.
- Frequency and distribution of chest radiographic findings in COVID-19 positive patients.
- Diagnosis of coronavirus disease 2019 pneumonia by using chest radiography: Value of artificial intelligence.
- Deep learning for COVID-19 detection based on CT images.