key: cord-1019993-da88dc1a authors: Alves, Marcos Antonio; Zanon de Castro, Giulia; Soares Oliveira, Bruno Alberto; Ferreira, Leonardo Augusto; Ramírez, Jaime Arturo; Silva, Rodrigo; Guimarães, Frederico Gadelha title: Explaining Machine Learning based Diagnosis of COVID-19 from Routine Blood Tests with Decision Trees and Criteria Graphs date: 2021-03-16 journal: Comput Biol Med DOI: 10.1016/j.compbiomed.2021.104335 sha: e9b0ab70f33b32cadd117121c97672fda91f87ff doc_id: 1019993 cord_uid: da88dc1a The sudden outbreak of coronavirus disease 2019 (COVID-19) revealed the need for fast and reliable automatic tools to help health teams. This paper aims to present understandable solutions based on Machine Learning (ML) techniques to deal with COVID-19 screening in routine blood tests. We tested different ML classifiers in a public dataset from the Hospital Albert Einstein, São Paulo, Brazil. After cleaning and pre-processing the data has 608 patients, of which 84 are positive for COVID-19 confirmed by RT-PCR. To understand the model decisions, we introduce (i) a local Decision Tree Explainer (DTX) for local explanation and (ii) a Criteria Graph to aggregate these explanations and portrait a global picture of the results. Random Forest (RF) classifier achieved the best results (accuracy 0.88, F1–score 0.76, sensitivity 0.66, specificity 0.91, and AUROC 0.86). By using DTX and Criteria Graph for cases confirmed by the RF, it was possible to find some patterns among the individuals able to aid the clinicians to understand the interconnection among the blood parameters either globally or on a case-by-case basis. The results are in accordance with the literature and the proposed methodology may be embedded in an electronic health record system. COVID-19, the disease associated with the SARS-CoV-2 virus, was declared a pandemic by the World Health Organization (WHO) on March 11th 2020 [1] . 
This pandemic has impacted all aspects of life, including politics, education, the economy, society, the environment and the climate, and set off a warning about how governments, civil society and health systems can deal with an unknown disease. Although many scientific advances have been made and an intense vaccination program is being carried out in several countries, the severe situation is not effectively controlled yet. An accurate and reliable diagnosis is crucial in providing timely medical aid to suspected or infected individuals and helps government agencies to prevent the spread of the disease and save people's lives. The standard test for COVID-19 is the Reverse Transcriptase Polymerase Chain Reaction, known as RT-PCR, reviewed in [2]. However, it has limitations in terms of resources and specimen collection [3], it is time-consuming [3, 4, 5, 6], it has high specificity and low sensitivity [3, 7, 8], it shows high misclassification in the early symptomatic phase [6] and it is unavailable in many countries and societies, making the real extent of the spread still unknown [8, 9]. In addition to the RT-PCR, AI-based approaches may be used to assist in the screening of patients suspected of being infected by SARS-CoV-2, supporting the medical decision. In the field of Machine Learning (ML), a branch of Artificial Intelligence (AI) that studies methods that allow computers to learn tasks from examples, many researchers have studied the diagnosis of COVID-19 through the analysis of either medical images or routine blood tests, as in [6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. Routine blood tests play an important role in the diagnosis of COVID-19 and other respiratory diseases. 
Parameters such as white blood cells (WBC), C-reactive protein (CRP), neutrophils (NEU), lymphocytes (LYM), monocytes (MONO), eosinophils (EOS), basophils (BAY), aspartate and alanine aminotransferase (AST and ALT, respectively), lactate dehydrogenase (LDH) and others have shown high correlations in patients diagnosed with COVID-19 [6, 8, 9, 10, 14, 18, 19, 20, 21, 22, 23] . These hematological features have been used for identifying patterns through ML approaches to verify whether the patient is infected or not. Meng et al. [4] used different indicators of whole blood count, coagulation test, and biochemical examination to build a Multivariate Logistic Regression (MLR) that was embedded in a COVID-19 diagnosis aid system. Kukar et al. [11] provided a model called "Smart Blood Analytics (SBA)" based on routine blood tests for patients with various bacterial and viral infections and COVID-19 patients. Wu et al. [13] extracted 11 blood indices through Random Forest (RF) algorithm to build an online assistant discrimination tool. Batista et al. [9] used Artificial Neural Networks (ANN), RF, Gradient Boosting Tree (GBT), Logistic Regression (LR) and Support Vector Machines (SVM) to predict the risk of positive COVID-19 using as predictors only results from emergency care admission exams. Brinati et al. [8] developed two classification models using hematological values from Italian patients. RF and Three-Way RF (TWRF) models showed the best results. A Decision Tree was used for explanation. Barbosa et al. [14, 15] proposed the Heg.IA as a support system for the diagnosis of COVID-19. RF is used as the classifier. Although these models bring promising results in COVID-19 diagnosis, their transparency and trust can be questionable. A model can be defined as explainable if a human can understand its decisions [24] . 
Any fully automated method without the possibility for human verification would be potentially dangerous in a practical setting, in particular, in the medical field. Explainable ML, or Explainable AI (xAI), typically refers to post hoc analysis and techniques used to understand a pre-trained model or its predictions. The ability of a system to explain its decisions is a central paradigm in symbolic or logic-based machine learning [25] . A model-agnostic explainer [25] can interpret a black-box model prediction without assumptions on the underlying black-box model. They are usually employed after the training step (post-hoc explainability), see for instance LIME [26] and SHAP [27] , providing an understandable output by showing graphically the results and highlighting the features that most contributed to the black-box model decision. In this work, we search for an accurate ML model for COVID-19 screening based on hematological data and propose the use of a decision tree explainer to improve the interpretability of the best model. We argue that a decision tree more closely resembles the decision-making process of a human healthcare worker and because of that it may be more useful in a real-world environment. We also introduce a criteria graph to aggregate explanations allowing for a generalization of the decision process and a deeper understanding of the interaction of factors leading to a diagnosis. 
The main contributions and findings are listed below:
• A literature review of ML methods applied to COVID-19 screening in routine blood tests;
• Reasonable results from different ML techniques (including an ensemble) to support the diagnosis of COVID-19 using usual blood exams;
• A decision tree-based methodology for the explanation of the model, which can be given to the health teams;
• A methodology for aggregating the individual explanations in a graph that shows the relative importance of each attribute and their interactions;
• Further evidence that simple blood tests might help in identifying false positive/negative RT-PCR tests.
The remainder of the paper is organized as follows: Section 2 reviews the application of AI for diagnosing COVID-19. Section 3 discusses the Decision-Tree based Explainer (DTX) used for local interpretation. Section 4 presents the proposed Criteria Graph that can be used for global model interpretation. Section 5 explains the ML process, such as models and dataset used, evaluation process and explainability. Section 6 presents the results and discussion. Section 7 provides future directions and conclusions.
Since the announcement of the pandemic, the scientific community has been working hard to investigate SARS-CoV-2 dynamics. As a result, the volume of papers about COVID-19 has increased exponentially [5]. Reviews were carried out to organize, summarize, and merge the amount of information available in such a short time. For instance, Mohamadou, Halidou and Kapen [28] reviewed 61 studies dealing with mathematical modelling, AI and datasets related to COVID-19. They reported that most models are based on either the Susceptible-Exposed-Infected-Removed (SEIR) model, as in [29], or the SIR model. Toledo et al. [17] provided a historic review of the virus, its epidemiology and pathophysiology, emphasizing the laboratory diagnosis, particularly the hematological changes found during the disease. Wynants et al. 
[30] provided a systematic review and critical appraisal of current models for COVID-19 for the prognosis of patients and for identifying people at increased risk of becoming infected or being admitted to hospital with the disease. Kermali et al. [10] reviewed 34 papers discussing biomarkers and their clinical implications. Zheng et al. [31] provided a meta-analysis of the risk factors of critical/mortal cases and non-critical COVID-19 patients, with 13 studies including 3027 patients, in which critical patient conditions and parameters were highlighted. Regarding AI and ML-based works, Yan et al. [21] applied an Extreme Gradient Boosting Machine (XGBoost) algorithm to predict mortality risk, in which a single tree was used to build an explanation for the model. Tian et al. [32] investigated the predictors of mortality in hospitalized patients in a total of 14 studies documenting the outcome of 4659 patients. Comorbidities such as hypertension, coronary heart disease, and diabetes were associated with a significantly higher risk of death amongst infected patients. Clinical manifestations and laboratory examinations that could imply the progression of COVID-19 were presented. Shi et al. [33] analyzed AI techniques in imaging data acquisition, segmentation, and diagnosis. These images, either X-ray or CT images, can improve the work efficiency of the specialists through an accurate delineation of infections. Also in the AI context, Bullock et al. [5] reviewed datasets, tools, and resources to confront many aspects of the COVID-19 crisis at different scales, including molecular, clinical, and societal applications. In the clinical aspect, medical images, outcome prediction and noninvasive measurements were discussed. Although these works have made valuable contributions to dealing with the pandemic, the decision made by the automatic learning model on the samples is still unclear. 
In the revised literature, important hematological features were highlighted such as CRP [21] , LDH [8, 21, 23, 34] , AST, ALT, NEU [8] , LYM and WBC [8, 9, 11, 34] , EOS [8, 9] and others, see also [4, 6, 13, 14, 16, 20] . These features are detailed in Table 1 with a short description of each hematological parameter, the reference value for males and female and the percentage of missing rates presented in the dataset used. In the literature, they were commonly estimated either through statistics as in [6, 16, 20] or a ML model or metric, such as RF in [8] , Least Absolute Shrinkage and Selection Operator (LASSO) in [4] , Multi-tree XGBoost in [21] or an evolutionary strategy as in [14] . The state-of-the-art algorithms have been the most used, such as the Support Vector Machine (SVM) in [9, 16] , XGBoost in [11, 21] and RF in [8, 13] . For the sake of simplicity, in Table 2 we summarize the works that used ML techniques to classify patients suspected of being infected with SARS-CoV-2 using hematological parameters. There is a short description of the papers, methods used (the best one is in bold), features analyzed, and the results for each performance metric. A series of recently published papers have reported the epidemiological and clinical characteristics of patients with COVID-19 disease, however there is no standard for data collection. Many public datasets available have different features and a large number of missing values, making it difficult to aggregate this data into a single ML model. Although many papers have presented ML-based support approaches to deal with COVID-19 screening in routine blood tests, only Brinati et al. [8] and Yan et al. [21] have raised the necessity of some sort of transparency in the model's decisions. 
The former presents a Decision Tree as an interpretable model, but in doing so accuracy is sacrificed. In the latter, the authors used the XGBoost algorithm to obtain the relative importance of the features and built a Single-Tree XGBoost on the three most important ones (LDH, LYM and high-sensitivity CRP). Again, this is an approach that trades accuracy for interpretability. In this paper, we evaluate different ML methods, including ensembles, for COVID-19 diagnosis from routine blood tests. Besides, our methods include cleaning and pre-processing steps, class imbalance treatment, the creation of ensemble models, and an interpretability module. The proposed methodology can be generalized to other contexts as a pipeline for the ML workflow. Local interpretability is provided by using a Decision Tree-based explainer (DTX) (Section 3) and global interpretability is obtained with the Criteria Graph (Section 4) proposed herein. The DTX presents an explanation for the high-accuracy black-box model. Therefore, the quality of the predictions does not have to be sacrificed. On the other hand, this means that the explanations are individual. Thus, to get an insight into the model's global behaviour, the Criteria Graph compresses the information of all the explanations and presents it in a single image.
Table 1 (fragment): ALT, alanine transaminase. An enzyme that is normally present in liver and heart cells and is released into the blood when the liver or heart is damaged [6, 20].
The post hoc explanation approaches aim to explain the predictions of a particular pre-trained ML model. These explanations can be of two types:
• Instance explanation: aims to explain predictions of the black-box model for individual instances. It provides local scope for interpretability.
• Model explanation: it is usually the result of aggregating instance explanations over many training or testing instances. 
This approach provides global level interpretability, generalizing local explanations. The aggregation of many instances enables the identification of the impact of features in the classification and knowledge extraction from the ML model.
The interpreter applied in this work is known as Decision Tree-based Explainer (DTX). DTX can be defined as a model-agnostic, post hoc, perturbation-based, feature selector explainer. This approach generates a readable tree structure that provides classification rules, which reflect the local behaviour of the complex ML model around the instance to be explained. The explainer can understand the black-box model according to Equation (1), whose terms are: the black-box model prediction; η, a noise set created around the instance to be explained; |η|, the number of samples around the instance to be explained; and g(.), which measures the distance between the black-box prediction and the DTX prediction (for instance, in classification problems we can use accuracy). The set η is created with artificial samples generated around the instance that we want to explain. This set is used to fit the explainer and to measure the accuracy of the explainer with respect to the black-box model. Equation (1) implies local fidelity of the explainer to the predictions provided by the black-box model. The correctness of the prediction is orthogonal to the correctness of the explanation, but enforcing local fidelity to better models (in terms of higher accuracy) might enable better explanations.
Figure 1 illustrates how the DTX presents an understandable visual output. The left side shows the noise set η around the sample (x) that is going to be explained. It also shows the decision boundaries defined by the explainer. The right side shows the tree structure generated by DTX for a local explanation. Also, DTX works as a feature selector, since the features presented in the tree are the most important for the method around the neighbourhood of x. 
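The idea of a local decision-tree surrogate can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the black box is a hypothetical two-feature rule standing in for the trained classifier, the noise set is drawn with Gaussian perturbations, the tree is a small greedy Gini tree, and fidelity is taken as the fraction of noise samples on which the tree agrees with the black box (one possible reading of g as accuracy).

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Return the (feature, threshold) pair that most reduces Gini, or None."""
    n, best, best_score = len(y), None, gini(y)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best

def fit_tree(X, y, depth):
    """Greedy binary tree; leaves store the majority label."""
    split = best_split(X, y) if depth > 0 and len(set(y)) == 2 else None
    if split is None:
        return {"leaf": int(2 * sum(y) >= len(y))}
    j, t = split
    left = [i for i in range(len(y)) if X[i][j] <= t]
    right = [i for i in range(len(y)) if X[i][j] > t]
    return {"feature": j, "threshold": t,
            "left": fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
            "right": fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1)}

def decision_path(tree, x, names):
    """Follow x down the tree; return the criteria on its path and the leaf label."""
    rule, node = [], tree
    while "leaf" not in node:
        j, t = node["feature"], node["threshold"]
        go_left = x[j] <= t
        rule.append(f"{names[j]} {'<=' if go_left else '>'} {t:.3f}")
        node = node["left"] if go_left else node["right"]
    return rule, node["leaf"]

def dtx(black_box, x, names, n_samples=300, sigma=0.5, depth=3, seed=0):
    """Fit a local tree on a noise set labelled by the black box."""
    rng = random.Random(seed)
    # Noise set: Gaussian perturbations of x (an assumed sampling scheme).
    noise = [[v + rng.gauss(0.0, sigma) for v in x] for _ in range(n_samples)]
    noise.append(list(x))
    labels = [black_box(row) for row in noise]
    tree = fit_tree(noise, labels, depth)
    rule, pred = decision_path(tree, x, names)
    agree = sum(decision_path(tree, r, names)[1] == lab for r, lab in zip(noise, labels))
    return rule, pred, agree / len(noise)

def black_box(row):
    """Hypothetical stand-in for the trained classifier (not the paper's RF)."""
    return int(row[0] < -0.2 and row[1] > 0.1)

names = ["WBC", "CRP"]   # illustrative feature names on z-scored values
x = [-0.5, 0.4]          # instance to be explained
rule, pred, fidelity = dtx(black_box, x, names)
print("if", " and ".join(rule), "then class =", pred, "| fidelity:", fidelity)
```

The returned rule is the path taken by x in the surrogate tree, and the fidelity value indicates how far that rule can be trusted as a description of the black box in the neighbourhood of x.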
In the example in Figure 1, the explanation provided for why sample x is classified as class 1 (positive class) is given by the path in the tree that leads to this outcome: x_2 ≥ 0.074 and x_1 ≤ −0.04.
From the previous section, one can see that the decision tree explainer returns a rule of the type: if criterion 1 and ... and criterion n then class = X, where a criterion is defined as attribute op value and op is one of the operators ≤, ≥, < or >. This kind of rule is easy to understand and provides valuable information to the health worker. Nevertheless, each patient will have their own local explanation and it might be useful to understand relationships between criteria over the whole population. To provide this information, in this work, we also propose a global interpretability method named Criteria Graph, which works as follows. We are given a set of rules, R = {R_1, R_2, ..., R_m}, where each rule, R_i, is the explanation for the i-th patient's diagnosis, and m is the number of patients. First, for each attribute, we discretize the values of each criterion. Let µ be the mean value of that attribute and σ its standard deviation. If a value is in the interval [µ − 0.5σ, µ + 0.5σ] it gets the label medium. If value < µ − 0.5σ it gets the label low and if value > µ + 0.5σ it gets the label high. After discretization, each criterion becomes a node in the graph. The size of the node is proportional to the number of patients for which that criterion was used in the diagnosis. If two criteria appear in the same rule, a link is created between them and the width of the link is proportional to the number of patients for which the two criteria are used in the diagnosis. Figure 2 shows the result of this procedure applied to the set below. Notice that the color of each node provides an extra visual cue related to the value of the criterion: red for low, blue for high and yellow for medium. 
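The construction just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the three rules are invented for the example (they are not taken from the paper's tables), and the mean and standard deviation of every attribute are assumed to be 0 and 1 because the dataset is z-normalized.

```python
from collections import Counter
from itertools import combinations

def discretize(value, mu, sigma):
    """Map a threshold value to low / medium / high using mu +/- 0.5 * sigma."""
    if value < mu - 0.5 * sigma:
        return "low"
    if value > mu + 0.5 * sigma:
        return "high"
    return "medium"

def criteria_graph(rules, stats):
    """Count node sizes and link widths over a set of per-patient rules.

    rules: one list of (attribute, operator, value) criteria per patient.
    stats: attribute -> (mean, standard deviation).
    """
    nodes, edges = Counter(), Counter()
    for rule in rules:
        # A set ensures each patient contributes at most once per criterion.
        criteria = sorted({(a, op, discretize(v, *stats[a])) for a, op, v in rule})
        nodes.update(criteria)                   # node size: patients per criterion
        edges.update(combinations(criteria, 2))  # link width: patients per pair
    return nodes, edges

# Hypothetical rules for three patients (illustrative values only).
rules = [
    [("PLT", "<=", -0.1), ("EOS", "<=", -0.7)],
    [("PLT", "<=", 0.3), ("EOS", "<=", -0.6), ("MPV", ">", -0.8)],
    [("WBC", "<=", -0.9)],
]
stats = {name: (0.0, 1.0) for name in ("PLT", "EOS", "MPV", "WBC")}

nodes, edges = criteria_graph(rules, stats)
for node, size in nodes.items():
    print(node, size)
for edge, width in edges.items():
    print(edge, width)
```

The two counters hold everything needed to draw the graph: node sizes, link widths, and the low/medium/high label that determines the node color.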
In this paper, we focus on COVID-19 binary classification using a public dataset detailed in subsection 5.1. The ML procedures for generating classifiers with evolving explanations consist, basically, of two main steps: (i) evaluation of different artificial learning models, and (ii) comparison among SHAP, LIME and DTX for local interpretation of the output and the Criteria Graph for global interpretation. Figure 3 provides an overview of the entire process.
The dataset comes from patients of the Hospital Israelita Albert Einstein, São Paulo, Brazil, who were tested for SARS-CoV-2 by RT-PCR and underwent additional laboratory tests during the visit. The dataset is publicly available in [43] for collaborative research and it is often updated. The raw version we used contained 5644 samples and 111 features, standard normalized (z-score), related to medical tests such as blood, urine and others. To select the most representative parameters in the dataset, we first define a threshold of 95% and remove the features whose proportion of missing values exceeds it. Non-blood features were also discarded, such as urine tests and tests for other contagious infectious diseases. These diseases include respiratory infections, such as influenza A and B; parainfluenza 1, 2, 3 and 4; enterovirus infections and others. We remove these features since making the COVID-19 prediction depend on the diagnosis of a variety of other infectious diseases is not practical in the emergency context. Furthermore, a false negative result for one of these diseases would propagate the error. The diagnostic results for the other infectious diseases could be used to train a multiple output classifier, which may assist the health professional in the process of diagnosing simultaneous diseases, but this is not the focus of this work. The set of final features is detailed in Table 1. After the cleaning process, we obtained a total of 608 observations, of which 84 are positive and 524 are negative COVID-19 cases confirmed through RT-PCR; this is, thus, an imbalanced data problem. 
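A short calculation shows why this imbalance matters for model evaluation. The sketch below uses only the class counts reported above; the "always negative" classifier is a hypothetical baseline introduced for illustration, not a model from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

n_pos, n_neg = 84, 524  # class counts after cleaning

# Hypothetical baseline that labels every patient as negative.
always_negative = binary_metrics(tp=0, fp=0, tn=n_neg, fn=n_pos)

print(f"negatives per positive: {n_neg / n_pos:.2f}")
print({name: round(value, 3) for name, value in always_negative.items()})
```

Such a baseline reaches an accuracy of 524/608 while detecting no positive case at all, so its sensitivity and F1-score are both zero. This is the reason a metric other than accuracy is needed to compare models on this dataset.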
The ratio between the classes is approximately 1:6. Since many null values remained, an imputation technique was necessary to deal with them. The "Iterative Imputer" technique from the Scikit-learn package [44] showed the best performance in experimental tests compared with mean or median imputation.
In this paper, we use as a baseline the state-of-the-art Logistic Regression [45], XGBoost [46] and Random Forest [47], since these algorithms have shown good results in problems with imbalanced data, as in [8, 11, 13, 21]. We also tested the SVM and MLP methods. We train and evaluate these models through a nested cross-validation procedure [48]. As illustrated in Figure 4, first, in each iteration, the dataset is stratified into two subsets: training + validation and test set. In the inner loop, the training + validation subset is divided into k folds and the model is trained on k − 1 partitions. The other fold, which does not participate in the training, is used for model validation and for selecting the best set of hyperparameters through the Grid Search algorithm. At the end of an iteration, the model is evaluated on the test set. In the outer loop, this process is repeated with other, mutually exclusive, training + validation and test set folds. In this way, the nested cross-validation method allows a more reliable evaluation of the model generalization.
For the evaluation of the models, we chose the well-known F1-score [49] to select the best set of hyperparameters. Since 524 patient observations had no detection of SARS-CoV-2 (86% of the dataset), accuracy does not provide a representative measure. The F1-score, in turn, provides a measure of the discrimination capacity of the models. We train each algorithm using the SVM-Synthetic Minority Over-sampling Technique (SVM-SMOTE) [50]. 
Through this technique, minority class data are synthetically over-sampled, so that the training subset has the same proportion of instances for the positive and the negative class. Resampling by this technique is performed by creating a synthetic sample between the instance and its k closest neighbors, as shown in Figure 5. For this task, we select k = 5 neighbors.
Figure 5: Example of synthetic sample generated by SMOTE (the illustration creates a synthetic instance between k = 3 neighbors).
Through the nested cross-validation method, we generate five final models for each algorithm, which correspond to the number of external partitions. Thus, we choose the best of the five models generated for each method and retrain it in 10 iterations using the selected hyperparameters to measure its ability to generalize. For each iteration, we split the data into 80% for training and the rest for the test set. Considering the imbalanced data, we applied the SMOTE again, but only to the training data, in each of the iterations, synthetically over-sampling the minority class data.
To compose the ensemble, we combine the best nested cross-validation models of RF, LR, XGBoost, SVM, and MLP. The label was predicted based on the majority voting decision. For weighting the votes, the model that obtained the best performance received a weight equal to 2 and the worst one a weight equal to 0. After generating the ensemble, we evaluated the combined models on each test subset of the 10 iterations, using the following evaluation metrics: accuracy, F1-score, sensitivity and specificity. In the end, the average and standard deviation values are calculated for each of the metrics, obtaining the result that represents the model's generalization.
We propose a methodology to provide a local explanation of the black-box model using a single decision tree. In this step, we performed the following experiments:
1. Select a test instance for local explainability;
2. Generate new samples around the instance (noise set η);
3. Using the RF, classify the noise set and also the test instance;
4. The classification results are assigned as labels for these new samples;
5. With these labels and data, a DT is trained;
6. Then, the DT is used to provide a local explanation of the black-box model by taking the path in the tree that leads to the classification.
For global explanation, the local explanations obtained with DTX are aggregated over many instances to build the Criteria Graph (see Section 4).
Table 3 shows the results for the classification of COVID-19 using the metrics accuracy, F1-score, sensitivity, specificity and area under the ROC curve (AUROC). We also summarize the classification results in the normalized confusion matrix per class (positive or negative) for each algorithm in Table 4. Fig. 6 shows the average ROC curve obtained for each one of the algorithms evaluated. This curve is computed by varying the decision threshold, obtaining true positive and false positive rates for each threshold value. The closer the area is to 1, the greater the discrimination capacity of the model in the diagnostic test. Using the F1-score for comparison, the best models obtained were the RF, with maximum tree depth equal to 8 and 45 estimators, and the heterogeneous ensemble. In both models we obtain an F1-score of 76%. Thus, prioritizing simplicity, we chose the RF model to apply our proposed Criteria Graph for the global explanations and the DTX for the local ones.
Fig. 7a shows the importance of the blood features for the model decision using the global SHAP values, which reflect the positive or negative contributions of each feature to the model output. A positive SHAP value represents a positive contribution to the target variable, while a negative SHAP value represents a negative contribution. These importances are ranked in descending order, suggesting that the main features that contributed to the target variable are the WBC, PLT and the EOS. 
In addition to this information, the coloring of the points on the chart is related to the normalized values of the blood parameters of the patient, such as the number of WBC. The closer to blue, the lower the value of the characteristic and the closer to pink, the higher its value. Thus, a low value of the number of WBC, as well as of the number of PLT, seen in blue, tends to positively impact the positive COVID-19 output. To corroborate this result, Fig. 8 shows the kernel density estimate for each of these two variables, for visualizing the distribution of observations of the SARS-CoV-2 exam result across the dataset. For WBC and PLT there is a central tendency around the lowest normalized values of these characteristics. This is consistent with the literature, which suggests that the platelet count may reflect the pathological changes of patients with COVID-19 [51]. This tendency is also observed for EOS, and eosinopenia, characterized by low EOS levels, appears to be related to disease severity [52]. In the case of CRP, higher values of this marker tend to positively impact the positive COVID-19 output. Fig. 9 plots the features presented in Fig. 7a as a function of the corresponding SHAP value, which represents the marginal effect that these features have on the predicted result of the model. Values of the normalized number of WBC, PLT and EOS above the highlighted lines tend to contribute to increasing the probability of the positive class.
Table 5 shows the rules of the decision tree-based explainer for 12 positive COVID-19 patients, which reflect the model's behaviour. Since the explanations are local and built with high fidelity to the high-accuracy model, differently from [8] and [21], one does not have to compromise accuracy. Also, decision trees allow us to represent non-linear behaviour, which is an advantage over LIME. It can be seen that the model uses different criteria to "diagnose" each patient. 
This indicates that COVID-19 affects a number of parameters in the blood and that the variation of these parameters is individual dependent. There is a lot of overlap between the two rankings. Although WBC does not figure in the top five attributes, it has two nodes in the graph. That means that the WBC was important for the inference but its threshold value was not very clear. Thus, it seems to make sense that as a whole the attribute loses strength. The graph also shows a strong relationship between the criteria PLT ≤ medium, MPV > low and EOS ≤ low, pointing to a route towards a more reliable diagnosis procedure. Increasing the number of patients used to produce the graph may increase the strength of the identified patterns. Nevertheless, the Criteria Graph provides information that other explanation methods lack, and this information may be extremely useful for the application expert. For instance, neither SHAP [27] nor LIME [26] presents information about feature interactions. In Figures 7b and 7c it can be seen that LIME presents information about the thresholds used in the classification. However, as happens with the DTX, the information is only local (individual dependent). The Criteria Graph addresses this drawback by aggregating the results of all the explanations. SHAP can inform the user about possible feature thresholds with the marginal effects plot, as shown in Figure 9. Such an approach can be cumbersome if the number of features is high. In this context, the Criteria Graph is able to show the robustness of the thresholds more clearly by compressing the information about each feature into a few nodes, which are all displayed in a single plot. Thus, the number of artefacts presented to the user is reduced, which tends to reduce the analysis time. As stated above, the RF and heterogeneous ensemble models achieved the best results. 
Looking for the simplest model (a principle often called parsimony), we follow with the RF as the preferred one, plus the Criteria Graph for global explanations and DTX for local ones. Using a web application, the healthcare professional may be able to input the patient's blood test results (similar, for instance, to the one available in [13]). The system may be able, for instance, (i) to provide the decision-maker with the result (infected or not), (ii) to show the rules to facilitate her/his interpretation regarding local and global explanations, (iii) to be pre-configured to streamline the medical work and provide faster and more reliable diagnostics and (iv) to offer intelligent prescription, which can be filled in automatically according to the correct standards of the medical prescription. The implementation must focus on code reuse since, as new strains of the virus appear, adaptations in the code/system may be required to keep it useful in the future. There are many advantages of using electronic medical records, such as security and availability of patient information, standardization/integration of data, and automation of procedures, to name a few. We know that SARS-CoV-2 is highly transmissible and rapid tests are already in place to diagnose the disease. Therefore, we emphasize that the proposed solution has the objective of supporting the decision making of clinicians by providing them with more information. Moreover, a considerable differential of the proposed methodology is the presentation of explanations of the model, which makes such information comprehensible to the health professional and can assist her/him in reaching the final diagnosis.
Recent research suggests that some parameters assessed in routine blood tests are indicative of COVID-19. It is well known that machine learning techniques excel in finding correlations in all sorts of data. 
Thus, it seems natural to try these techniques for the problem of COVID-19 screening through routine blood test data. However, there is a significant barrier to the application of such methods in the real world due to their lack of transparency, meaning that human specialists may find it difficult to trust the ML decisions. In this context, in this work, we search for an accurate machine learning model for COVID-19 screening based on hematological data and propose two methods to improve the interpretability of the ML decisions, a Decision Tree Explainer and a Criteria Graph. The Decision Tree Explainer is used to provide an individual explanation for each classified sample in terms of if ... then rules. The Criteria Graph is used to aggregate the set of rules produced by the decision tree to provide a global picture of the criteria that guided the model decisions and show the interactions among these criteria. From the tested ML techniques, the best results were obtained with an RF, which is an opaque model. It presented an accuracy of 0.88 ± 0.02, F1-score of 0.76 ± 0.03, sensitivity of 0.66 ± 0.10, and specificity of 0.91 ± 0.02. The Decision Tree was then used to produce explanations for the classification of twelve confirmed COVID-19 cases and, finally, the Criteria Graph was used to aggregate the explanations and portray a global picture of the model results. The obtained Criteria Graph was in accordance with the well-known interpretability techniques SHAP and LIME, indicating its adequacy and the adequacy of the Decision Tree Explainers. In addition, it could be seen that the Criteria Graph presents valuable information, such as the interaction among different criteria and the robustness of a criterion with respect to its threshold value, which is not provided by other techniques. Given the urgency of the pandemic and the need to generate immediate results, much of the research has been published in repositories such as arXiv or medRxiv. 
Some methodologies discussed in the literature review are not clear enough to be reproducible, or their model decisions are not comprehensible. Lastly, we compared our proposed work with others from the literature that have not yet been peer-reviewed and published in the scientific literature. Nevertheless, their data confirm our finding that ML models using routine blood parameters are useful in the diagnosis of COVID-19.

We employed hematological data from the Hospital Israelita Albert Einstein in São Paulo, Brazil, which is available as public data. However, this dataset is arguably not large, and it is normalized (using z-normalization). Since we do not have access to the values used to normalize the data, the original values of the features are not accessible. Applying the proposed methods to larger datasets is an important step in our future work. Still, the solution we offer brings good results, is reproducible, and the model is explainable. Additionally, we intend to integrate it with other fronts, such as chest X-rays and CT scans. In this way, ML models may serve as a way to support the diagnosis of the disease, regardless of the stage of contagion, and can help in the validation of RT-PCR.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
World Health Organization, Coronavirus disease (covid-19) pandemic
Technical aspects of quantitative competitive pcr
Correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases
Development and utilization of an intelligent application for aiding covid-19 diagnosis
Mapping the landscape of artificial intelligence applications against covid-19
Routine blood tests as a potential diagnostic tool for covid-19
Essentials for radiologists on covid-19: an update-radiology scientific expert panel
Detection of covid-19 infection from routine blood exams with machine learning: a feasibility study
The role of biomarkers in diagnosis of covid-19-a systematic review
Covid-19 diagnosis by routine blood tests using machine learning
Elevated exhaustion levels and reduced functional diversity of t cells in peripheral blood may predict severe progression in covid-19 patients
Rapid and accurate identification of covid-19 infection through machine learning based on clinical available blood test results, medRxiv
Heg.ia: An intelligent system to support diagnosis of covid-19 based on blood tests, medRxiv
Covid-19 rapid test by combining a random forest based web system and blood tests
Severity detection for the coronavirus disease 2019 (covid-19) patients using a machine learning model based on the blood and urine tests
Covid-19: Review and hematologic impact
Prediction of covid-19 from hemogram results and age using machine learning
Prediction of covid-19 from hemogram results and age using machine learning
Laboratory parameters in detection of covid-19 patients with positive rt-pcr: a diagnostic accuracy study
Prediction of criticality in patients with severe covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in wuhan
Logistic regression analysis to predict mortality risk in covid-19 patients from routine hematologic parameters
Lactate dehydrogenase levels predict coronavirus disease 2019 (covid-19) severity and mortality: A pooled analysis
Applying genetic programming to improve interpretability in machine learning models
Interpretable Machine Learning
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
A unified approach to interpreting model predictions
A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19
Covid-abs: An agent-based model of covid-19 epidemic to simulate health and economic effects of social distancing interventions
Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
Risk factors of critical & mortal covid-19 cases: A systematic literature review and meta-analysis
Predictors of mortality in hospitalized covid-19 patients: A systematic review and meta-analysis
Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19
Utilization of machine-learning models to accurately predict the risk for critical covid-19, Internal and emergency medicine
Blood groups and red cell antigens
Full blood count (fbc) reference ranges
A simple laboratory parameter facilitates early identification of covid-19 patients, medRxiv
Biochemistry, lactate dehydrogenase (ldh), in: StatPearls
Armitage, Williams manual of hematology
Normal range of mean platelet volume in healthy subjects: Insight from a large epidemiologic study
Mosby's Diagnostic and Laboratory Test Reference-E-Book
Clinical Methods: The History, Physical, and Laboratory Examinations
Diagnosis of covid-19 and its clinical spectrum -ai and data science supporting clinical decisions (from 28th mar to 3st apr
Scikit-learn: Machine learning in python, the
Applied logistic regression
Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining
Random decision forests
On over-fitting in model selection and subsequent selection bias in performance evaluation
Humaine association conference on affective computing and intelligent interaction
Smote: synthetic minority over-sampling technique
Early decrease in blood platelet count is associated with poor prognosis in covid-19 patients-indications for predictive, preventive, and personalized medical approach
Characteristics and prognostic factors of disease severity in patients with covid-19: The beijing experience

Highlights of the paper:
• A literature review of ML methods applied to COVID-19 screening in routine blood tests
• Results from different ML techniques (including an ensemble) to support the diagnosis of COVID-19 using usual blood exams
• A decision tree-based methodology for the explanation of the model which can be given to the health teams