key: cord-0429190-udw1tat8
authors: Liz, Helena; S'anchez-Montan'es, Manuel; Tagarro, Alfredo; Dom'inguez-Rodr'iguez, Sara; Dagan, Ron; Camacho, David
title: Ensembles of Convolutional Neural Networks models for pediatric pneumonia diagnosis
date: 2020-09-29
journal: nan
DOI: 10.1016/j.future.2021.04.007
sha: e3cff9897cdbdeafb62459e96e7255c41ec4b80a
doc_id: 429190
cord_uid: udw1tat8

Pneumonia is a lung infection that causes 15% of childhood mortality worldwide, killing over 800,000 children under five every year. This pathology is mainly caused by viruses or bacteria. X-ray imaging analysis is one of the most widely used methods for pneumonia diagnosis. These clinical images can be analyzed using machine learning methods such as convolutional neural networks (CNNs), which learn to extract critical features for the classification. However, the usability of these systems in medicine is limited by their lack of interpretability: these models cannot generate an explanation, understandable from a human perspective, of how they reached their results. Another problem that hinders the impact of this technology is the limited amount of labeled data in many medical domains. The main contributions of this work are twofold. The first is the design of a new explainable artificial intelligence (XAI) technique based on combining the individual heatmaps obtained from each model in the ensemble. This makes it possible to overcome the explainability and interpretability problems of CNN "black boxes" by highlighting the areas of the image that are most relevant to the classification. The second is the development of new ensemble deep learning models to classify chest X-rays that achieve highly competitive results using small training datasets. We tested our ensemble model on a small dataset of pediatric X-rays (950 samples) with low quality and anatomical variability (which represents one of the biggest challenges). We also tested other strategies such as single CNNs trained from scratch and transfer learning using CheXNet. Our results show that our ensemble model outperforms these strategies, obtaining highly competitive results. Finally, we confirmed the robustness of our approach using another pneumonia diagnosis dataset [1].

An ensemble is designed to improve on the performance of the individual models that compose it. This technique combines the individual predictions to produce a consensus prediction. Its advantages over individual models are performance, because combining multiple models can improve on their individual power so that the final model better approximates the optimal solution [32]; and robustness, because ensemble models reduce the variance of the prediction errors made by the contributing models by adding bias, which helps avoid overfitting of the final model [33]. For this reason, this technique can be critical for small datasets like ours, where the information is particularly limited in both size and quality.

However, CNNs, like many other Deep Learning and machine learning methods, are considered "black-box" algorithms: both the input and the output can be easily analysed and interpreted by users, but the inference process carried out by the algorithm is opaque. This opacity undermines end-users' confidence in the results and therefore negatively affects decision-making, because the essential question of "how" and "why" the algorithm obtained a given outcome is uninterpretable for a human being [34]. This may limit its application in fields such as medicine, where practitioners need to know how the algorithm inferred the output for each specific patient [35] (e.g. why is the algorithm assigning a 90% probability to alveolar pneumonia?). This limitation can be overcome, or at least alleviated, using automatic explanatory systems, called explainable AI (XAI) [36], which allow us to visualize which areas of the image (features) were used to obtain the outcome generated by the algorithm. These XAI-based systems generate new images highlighting the areas of highest interest that the system uses to obtain the result (e.g. in our case, to predict a particular kind of disease) [37].

The combination of Deep Learning models with medical knowledge allows the development of new clinical decision support systems (CDSS). These automatic systems can help in medical diagnosis reducing some typical clinical problems such as subjectivity in the interpretation of medical tests or human errors (fatigue, distraction, etc.) [38] .

This combination of machine learning methods with human knowledge can improve the performance of the diagnosis process, as stated in [39]. That project, led by Dr. Andrew Beck, demonstrated that combining pathologists and Deep Learning models provides a significant reduction of the error rate for breast cancer diagnosis. In the initial results, pathologists obtained an error rate of 3.5% when classifying the pathology, whereas the Deep Learning algorithm obtained a slightly better error rate of 2.9%. However, when humans and the AI model were combined, the error decreased to an impressive 0.5% (that is, 99.5% of cases were correctly classified) [39].

The main contribution of this work can be briefly summarized as the design and development of a novel Machine Learning system based on ensembles for pneumonia diagnosis in childhood. Our system estimates the probability that the X-ray shows a consolidation or other infiltrates, which would be a helpful tool for unclear cases where professionals disagree, or in situations of work overload. The result generated by the system should be user-friendly from a health professional's perspective. To achieve that, the system creates a graphical visualization using an explainable AI technique named heatmap, which highlights the areas of the image that are most relevant for the diagnosis according to the AI system [40].

An interesting point of this work is to understand how the neural network infers the pneumonia type (alveolar versus non alveolar) from the chest X-ray. This could help to expedite the treatment of patients who require medication. On the other hand, this could also help to avoid giving antibiotics to patients who do not need them. This is crucial since an incorrect use of antibiotics (that is, using them in patients who do not have a bacterial infection), or an excessive use of broad-spectrum antibiotics, can cause antibiotic resistance. This can become a global problem making it more difficult to treat patients, the only solution being the development of new, more powerful antibiotics [13] . Therefore, in order to reduce the overuse of antibiotics in viral pneumonia, a correct diagnosis of bacterial pneumonia is crucial.

This article is structured as follows: Section 2 provides a short description of some relevant works in the area of AI-based detection of lung diseases; Section 3 describes the methodology followed to design and train our CNN models; Section 4 shows the experimental results; and Section 5 presents the conclusions and some lines of future work.

AI techniques have been intensively applied in medicine, especially in diagnosis, forecasting and prediction of disease evolution, global health impact, etc. There is a very large number of examples, such as the diagnosis of brain abnormalities with magnetic resonance imaging (MRI) [41], cardiac disorders using clinical data [42], breast lesions using radiographs (X-rays) [43] or COVID-19 with chest X-rays [44]. Within the field of pneumonia there is also a multitude of systems working with different kinds of data, such as clinical data [45], ultrasound [46], computed tomography [47] or X-rays [1, 48], among many others. There are also numerous studies focused on other aspects, such as predicting the evolution of patients [49] or classifying patients according to severity using chest X-rays [50, 51]. Given the main contributions of this work, only some relevant works in the area of Deep Learning and its application to the pathology considered (pneumonia) will be briefly described.

Most of the previous works in this field use a pediatric dataset published by Kermany et al. [1] , consisting of a total of 5856 radiographs divided in three classes (2780 bacterial, 1493 viral and 1583 normal) of patients between 1 and 5 years of age. Other works in the field [38] use other datasets and a classification hierarchy. First, radiographs are classified as "pneumonia" or "normal". Then, radiographs labeled as pneumonia are classified as "viral" or "bacterial". Kermany et al. obtained an AUC of 0.85 for the first classification (pneumonia versus normal) and 0.81 for the second (viral versus bacterial). We can also find the same classification system in [1] , which will be explained later.

The first step that must be carried out is the pre-processing of the images. Different algorithms are used depending on the basic features of each dataset and on the pre-processing needs of the target algorithms. The most common are histogram equalization, which enhances pathological signs [52, 53], and lung segmentation, which removes irrelevant data from X-rays and retains the useful information (consolidation signs only appear inside the lungs, so the rest of the image is irrelevant for the models). Lung segmentation is applied in a large number of models [54, 38, 55].

Currently, one of the most successful AI methods for automatic image classification is the Convolutional Neural Network (CNN) [56]. Two main problems in CNN training for pneumonia diagnosis (and in most medical diagnosis problems) are low dataset availability and small dataset size. To solve these problems several works use Transfer Learning, a technique that takes advantage of the knowledge obtained from models used in other (similar) areas and trained with bigger datasets. The idea is that part of the CNN (the first layers) is "inherited" from a model previously trained on another dataset, while the rest of the CNN is trained with the pneumonia dataset. This technique was used in the work by Kermany et al. [1], where part of the CNN was trained with the ImageNet dataset and the rest of the CNN was trained with medical images including pediatric pneumonia. This resulted in an AUC of 0.96 for the classification task "pneumonia" versus "normal", and an AUC of 0.94 for the classification task "viral pneumonia" versus "bacterial pneumonia". Unlike the other networks, this work also reports an interesting classification among three categories (normal, bacterial and viral), achieving an AUC of 0.918. Other approaches, such as CheXNet [57], use a CNN called "DenseNet" which was also trained using the ImageNet dataset and re-trained with a dataset of 14 different lung diseases (including pneumonia). This model has an AUC for pneumonia of 0.768; the AUC values for cardiomegaly and emphysema are higher (0.925 and 0.937 respectively). Another technique to solve this problem is Data augmentation: a data-space solution to the problem of limited data, which is quite common in this field. It encompasses a suite of techniques that enhance the size and quality of training datasets so that better models can be built using them [58]. It artificially (synthetically) increases the amount of training data using only the information already contained in it, which helps to avoid overfitting. There are several techniques, such as geometric transformations (flipping, cropping, rotation, translation), color space transformations or noise injection [59].

Finally, CNN ensembles can improve on the performance of a single model based on a unique architecture. This technique is based on the idea of combining predictions from multiple statistical models (e.g. CNNs) to form one final prediction (i.e. a new model made by combining selected models). There are different ways to generate these ensembles [60]. The first is averaging: ensemble averaging creates a group of models, each with low bias and high variance, and then combines them into a new model with (hopefully) low bias and low variance. Some relevant examples are Shin et al. [61], who classify two different datasets (Thoracoabdominal Lymph Node and Interstitial Lung Disease), compare the performance of different state-of-the-art architectures (CifarNet, GoogLeNet, AlexNet, etc.) and define a new architecture that averages the outputs of the state-of-the-art methods considered; and Christodoulidis et al. [62], who design a system that classifies seven different interstitial lung diseases from CT using transfer learning. The second is majority voting, in which the new model is built from the outcomes of the different models considered. There are two approaches to majority voting for classification: hard voting and soft voting. Hard voting involves summing the predictions for each class label and predicting the class label with the most votes. Soft voting involves summing the predicted probabilities (or probability-like scores) for each class label and predicting the class label with the largest probability. An example of a majority voting ensemble can be found in Yan et al. [63], who design a CAD system for lung nodule malignancy risk classification from CT, using different CNNs with the objective of learning different levels of image spatial context and improving detection performance. Finally, another popular ensemble method is weighted averaging, a weighted variant of the averaging method in which a particular weight is given to each model before the final model is generated. Approaches that follow this method include Bermejo-Peláez et al. [64], who design a model to classify computed tomography scans into 8 classes within Interstitial Lung Disease, and Sirazitdinov et al. [29], who build an object detection model for pneumonia detection and localization from chest X-rays.
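The difference between hard voting, soft voting and weighted averaging can be made concrete with a small numerical sketch. The probabilities and weights below are invented for illustration only (they do not come from any of the cited works), and the helper names `hard_vote` and `soft_vote` are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities from three models for four samples
# and two classes (columns: class 0, class 1). Illustrative values only.
probs = np.array([
    [[0.45, 0.55], [0.80, 0.20], [0.30, 0.70], [0.55, 0.45]],  # model A
    [[0.40, 0.60], [0.60, 0.40], [0.60, 0.40], [0.52, 0.48]],  # model B
    [[0.95, 0.05], [0.70, 0.30], [0.55, 0.45], [0.10, 0.90]],  # model C
])

def hard_vote(probs):
    """Each model votes for its most likely class; the most voted class wins."""
    votes = probs.argmax(axis=2)                       # shape (n_models, n_samples)
    n_classes = probs.shape[2]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def soft_vote(probs, weights=None):
    """Average the probabilities (optionally weighted), then pick the largest."""
    mean_probs = np.average(probs, axis=0, weights=weights)
    return mean_probs.argmax(axis=1)

hard = hard_vote(probs)                                # majority of class labels
soft = soft_vote(probs)                                # plain averaging
weighted = soft_vote(probs, weights=[0.5, 0.4, 0.1])   # weighted averaging
```

With these numbers the three schemes disagree on some samples, because a single very confident model can outweigh two weakly confident ones under soft voting but not under hard voting.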

As was previously mentioned, CNNs are considered black-box algorithms, which increases the difficulty of applying them in areas such as Medicine. The reason is that even though these algorithms can be very precise when classifying radiographs or making predictions, the end-users (medical staff in our case) need to interpret and understand how and why the algorithm has reached its conclusion (the outcome provided to the end-user). For this reason, it is important to develop eXplainable AI (XAI) systems that allow end-users to understand how the system works (i.e. in our case, how and why it classifies the radiograph).

We can distinguish between two main techniques, the first being object detection systems. These systems are based on CNN models that locate different objects in images. An example of this type of model is CoupleNet, which classifies and locates signs of pneumonia on a chest radiograph and produces a visualization of the original image with bounding boxes in the lung areas that show signs of disease [65]. Such algorithms are used in different works to detect and locate pneumonia signs, for example in [29], which combines two CNNs for the detection and localization of pneumonia signs in the lungs.

The second kind of method uses additional techniques to visualize how the model classifies, such as heatmap generation. This method is used in a wide variety of problems including pneumonia diagnosis, for example in [38], where different lung areas are shown with different colour intensities according to their relevance to the prediction made by the model. Another similar work is presented in Zech et al. [66].

A substantial difference between the two previous methods is that the first one requires a training dataset in which the areas with signs of the disease have been marked by experts, whereas the second one only requires each radiograph of the training set to be labeled with the classification given by the experts (consolidation / non-consolidation in our case). After training, medical staff need to review the heatmaps generated by the model to validate that they make clinical sense.

This section describes the methodology carried out to design and develop our model for pneumonia diagnosis.

The system has been designed following the three basic stages shown in Figure 1 .

All the processing was implemented in Python using well-known libraries such as numpy (mathematical computing) [67] , matplotlib (visualization) [68] , Keras (Deep Learning, [69] ), and Keras-Vis (heatmaps calculation) [70].

The first step is the data preprocessing stage, where the X-ray dataset is divided into training and test subsets. In this stage, individual X-ray images are also normalized. Data augmentation of the training images was performed for robust model construction [71]. The second stage is the generation of a model that classifies each X-ray into one of two classes (consolidation / non-consolidation). The accuracy of this model is validated using a k-fold cross-validation technique (where k, the number of folds, has been fixed to 5). Finally, the last step is the application of the explainable AI technique that we have selected to increase the interpretability of our system, heatmap creation [72]. To evaluate the quality of our system we follow two different strategies: 1) generating the heatmap using a single model, and 2) generating the heatmap using an ensemble of models with the same architecture, but trained with different data folds. The second strategy allows us to compute an uncertainty level (given by the standard deviation) associated with each pixel, which in turn allows us to analyze the robustness of the heatmap.

In this work we use two different datasets. The first one is an X-ray pediatric-pneumonia (XrPP) dataset provided by Ben-Gurion University (Israel), and the second one is a public pediatric dataset of chest X-rays [1] , which will be used to analyze the generalization capability of our model. X-rays have been labeled by experts using one of the following two mutually exclusive classes:

• Consolidation, denoting a chest x-ray image with signs of consolidation (alveolar pneumonia).

• Non-consolidation, denoting a chest x-ray image with other infiltrates signs that correspond with non-alveolar pneumonia.

Figure 1: The three stages of the system. (1) Individual X-ray images are divided into training and test subsets, normalized and augmented. (2) A CNN is built to classify the X-rays into consolidation or non-consolidation. (3) The selected explainable AI technique is applied to generate heatmaps and a final report (shown in the figure as "92%", consolidation / non-consolidation).

The first dataset is formed by 950 labeled chest X-rays of children (between one month and 16 years old), 403 cases of consolidation (42.42%) and 547 cases of non-consolidation (57.58%). These chest X-rays are posteroanterior (PA) radiographs showing the posterior view of the chest. Each case was classified by a panel of experts consisting of two senior pediatricians and a radiologist from Hospital 12 de Octubre Research Institute (Madrid, Spain). The experts also had access to lateral radiographs of the patients to increase the precision of the labeling of each case. During this classification, 50 samples were withdrawn due to lack of consensus among the expert panel on the diagnosis. The authors have obtained the necessary permission from the ethical boards of Ben-Gurion University and Hospital 12 de Octubre Research Institute to work with these data.

Note that the distribution of classes in the dataset is relatively balanced. This is important, especially in small datasets like this one, since an unbalanced distribution of classes can severely affect the model's performance (e.g. accuracy, true positive rate (TPR), false positive rate (FPR)) [73].

Another interesting feature of the dataset is the size of the images. The average size of the images is similar in both classes (approximately 200,000 pixels), a very low value for X-ray images, which implies that image resolution (and therefore quality) is limited. Therefore, another contribution of our work is the study of the reliability of CNNs when trained with small and low-resolution datasets.

The second dataset studied in our work is composed of 5,856 X-rays of children between one and five years of age. There are 2,780 bacterial cases (47.5%), 1,493 viral cases (25.5%) and 1,583 normal cases (27.0%) [1], so the class distribution is not balanced. As described in that paper, all chest radiographs were screened for quality control, removing low quality or unreadable scans. Then, diagnoses for the images were graded by two expert physicians. Since the sizes of the images are between 1,000,000 and 2,000,000 pixels, both the image resolution (quality) and the number of images are clearly superior to those of the first dataset.

The X-ray images were provided by medical centers in jpg format. This format codes the colour at each pixel using three values, the "RGB" components. In our dataset these values are redundant since RGB components are identical in greyscale images. Therefore we keep only the first one. The original images do not have the same size, so we normalize their shapes to 150x150 pixels. On the other hand, the pixel values of each image are normalized by dividing them by the average pixel value of the image.
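The preprocessing steps described above can be sketched in a few lines of numpy. This is only an illustration under stated assumptions: the function name `preprocess` is hypothetical, the nearest-neighbour resize stands in for whatever interpolation was actually used, and applying the mean normalization after the resize is also an assumption, since the text does not specify the order.

```python
import numpy as np

def preprocess(rgb_image, size=150):
    """Greyscale extraction, resize and mean normalization for one X-ray.

    rgb_image: array of shape (H, W, 3) whose R, G and B channels are identical.
    """
    grey = rgb_image[:, :, 0].astype(np.float64)   # keep only the first channel
    h, w = grey.shape
    # Nearest-neighbour resize to size x size (assumed interpolation method).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = grey[rows][:, cols]
    # Divide by the average pixel value of the image (fails on an all-zero image).
    return resized / resized.mean()

# Usage with a synthetic greyscale image stored as three identical channels.
rng = np.random.default_rng(0)
grey = rng.integers(1, 256, size=(400, 500))
xray = np.stack([grey, grey, grey], axis=-1)
normalized = preprocess(xray)
```

After this normalization every image has an average pixel value of 1, which removes global brightness differences between radiographs.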

As we mentioned before, CNNs usually need a very large number of training images to avoid overfitting; however, our available dataset is particularly small. To overcome this problem we used a popular technique named Data Augmentation, which allows us to increase the size of our dataset [71]. It generates batches of images with real-time data augmentation: during each epoch, a different set of variations of the original training images is generated using different types of transformations [69]. In this work, shearing (0.2), zoom (0.05), rotation (0.2), horizontal shift (0.1), vertical shift (0.1) and horizontal flip transformations have been used with a batch size of 32.
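To illustrate the idea of real-time augmentation, the following simplified numpy sketch applies only two of the listed transformations (random horizontal/vertical shifts of up to 10% and a random horizontal flip) and regenerates the variations each time an epoch is requested. It is not the Keras generator cited above: shearing, zoom and rotation are omitted, the function names are hypothetical, and filling the uncovered border with edge values is an assumption.

```python
import numpy as np

def random_shift_flip(image, rng, max_shift=0.1, flip=True):
    """Return a randomly shifted (and possibly mirrored) copy of a 2-D image."""
    h, w = image.shape
    max_dy, max_dx = int(max_shift * h), int(max_shift * w)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    # Pad with edge values, then crop a window displaced by (dy, dx).
    padded = np.pad(image, ((abs(dy), abs(dy)), (abs(dx), abs(dx))), mode="edge")
    top, left = abs(dy) - dy, abs(dx) - dx
    out = padded[top:top + h, left:left + w]
    if flip and rng.random() < 0.5:
        out = out[:, ::-1]                       # horizontal flip
    return out

def augmented_batches(images, rng, batch_size=32):
    """Yield one epoch of freshly transformed batches from a list of images."""
    order = rng.permutation(len(images))
    for start in range(0, len(order), batch_size):
        chosen = order[start:start + batch_size]
        yield np.stack([random_shift_flip(images[i], rng) for i in chosen])

# Usage: 70 synthetic 150x150 images give two full batches and one partial batch.
rng = np.random.default_rng(0)
images = [rng.random((150, 150)) for _ in range(70)]
epoch = list(augmented_batches(images, rng))
```

Calling `augmented_batches` again produces a different set of variations, so the network never sees exactly the same batch twice across epochs.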

In order to robustly evaluate the different architectures for our models, we generated a pool of different training / validation / test partitions of the dataset. For each of those partitions, we first divided the dataset randomly into construction (70%) and test (30%) subsets using stratified partitioning. Then the construction subset was randomly divided into training (80%) and validation (20%) subsets using stratified partitioning. Therefore each partition is a division of the original dataset into training (56%), validation (14%) and test (30%) subsets. These are mutually exclusive, so each image in each partition is included in only one subset (training, validation or test). The training subset is used to learn the CNN's weights, whereas the validation subset is used to monitor the CNN's metrics throughout the model's learning and to avoid overfitting. Finally, the test subset is used to estimate the model's generalization capabilities (its performance when new radiographs are considered).
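The two-level stratified partitioning can be sketched as follows. The helper `stratified_split` is a hypothetical, minimal implementation (not the authors' code); the label vector only reproduces the class counts reported for the first dataset (403 consolidation, 547 non-consolidation), and the random seed is arbitrary.

```python
import numpy as np

def stratified_split(labels, first_fraction, rng):
    """Split indices into two groups while preserving class proportions."""
    labels = np.asarray(labels)
    first, second = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        cut = int(round(first_fraction * len(idx)))
        first.extend(idx[:cut])
        second.extend(idx[cut:])
    return np.array(first), np.array(second)

rng = np.random.default_rng(0)
labels = np.array([1] * 403 + [0] * 547)        # 1 = consolidation, 0 = non-consolidation

# Level 1: construction (70%) versus test (30%).
construction, test = stratified_split(labels, 0.70, rng)
# Level 2: training (80%) versus validation (20%) inside the construction subset.
train_rel, val_rel = stratified_split(labels[construction], 0.80, rng)
train, val = construction[train_rel], construction[val_rel]

fractions = [len(part) / len(labels) for part in (train, val, test)]
```

Because 80% of 70% is 56% and 20% of 70% is 14%, `fractions` matches the training / validation / test proportions stated above, and each class keeps roughly the same share in every subset.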

We considered different architectures for our CNN model. The number of convolutional layers in them was in the range 3-4 (see Table 1), because a low number of convolutional layers may not be able to extract all the information from the images, while a high number can lead to overfitting. Each convolutional layer has 32 kernels and a ReLU activation function. The output of the last convolutional layer is flattened and then perturbed by a Dropout layer with a rate of 70%. This information is then processed by a dense layer ("FC layer") with a number of neurons that depends on the architecture (Table 1) and a ReLU activation function. Finally, the classification layer of our CNN consists of a dense layer of two neurons with Softmax activation. Therefore, we consider a total of six architectures.

In order to increase the performance and robustness of our system, ensembles are considered. Each ensemble is composed of five different CNNs, each constructed using a different partition with different training / validation subsets but the same test subset. This ensures model diversity, which is crucial for the ensemble performance [32].

The partitions were created as follows: first we randomly divided the dataset into construction (70%) and test (30%) subsets. Then we generated five different training / validation random partitions of the construction subset (80% / 20%). For each of these five partitions a CNN was trained from scratch. The predictions for test subset of the ensemble formed by these five CNNs were then computed and the performance metrics evaluated. Ensemble's prediction for the probability of consolidation (and non consolidation) was computed as the average probability prediction across the five CNNs in the ensemble.

In summary, in order to compute the metrics for the ensemble in a robust and consistent way, the entire process described in this subsection was repeated using five different construction / test divisions. Since five CNN models were built for each division, a total of 25 models were constructed.

To benchmark our system, CheXNet [57] was selected as our reference model. CheXNet is a CNN model trained to classify chest X-rays over 14 different lung diseases, which is a problem very similar to ours (one of these lung diseases is precisely pneumonia). For this reason, and because of its high performance on a similar problem, it was selected as the best model against which to compare ours. To make an adequate comparison between both models, the number of neurons in CheXNet's last layer was changed from 14 (the number of classes in the original paper) to two (the number of classes in our work: consolidation / non-consolidation). Once the final layer was modified, all of CheXNet's layers were frozen except the last two, which we retrained using the target dataset. Taking a model previously trained in a similar domain and retraining it on the current dataset is a popular strategy in Deep Learning called transfer learning.

To analyze the performance of the different models, we used different metrics obtained from the ROC (Receiver Operating Characteristic Curve), a graphical representation that illustrates how the diagnostic capacity of a binary classifier system changes as its discrimination threshold is varied. More specifically, this curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values [74] .

A standard metric derived from the ROC is the Area Under the Curve (AUC), which measures the overall quality and accuracy of the classifier. Finally, we also measured the TPR, which in our problem corresponds to the fraction of positive (consolidation) cases that are correctly detected by the model; that is, the fraction of patients who need antibiotics immediately and are correctly detected by the model.
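As a worked illustration of these metrics, the sketch below computes TPR, FPR and AUC for a toy set of scores. The labels and scores are invented, the function names are hypothetical, and the 0.5 decision threshold used for the TPR is an assumption (the text does not state which threshold was used). The AUC is computed through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
import numpy as np

def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates at a given decision threshold."""
    y_true = np.asarray(y_true).astype(bool)
    predicted = np.asarray(scores) >= threshold
    tpr = (predicted & y_true).sum() / y_true.sum()
    fpr = (predicted & ~y_true).sum() / (~y_true).sum()
    return tpr, fpr

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]            # every positive against every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Toy example: 1 = consolidation, 0 = non-consolidation.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]          # predicted consolidation probabilities
tpr, fpr = tpr_fpr(y_true, scores, threshold=0.5)
area = auc(y_true, scores)
```

Here two of the three consolidation cases exceed the threshold and seven of the nine positive-negative pairs are ranked correctly, so the TPR is 2/3 and the AUC is 7/9.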

As previously mentioned, in order to generate an interpretable output for medical staff, we decided to generate a set of heatmaps. A heatmap is a matrix with the same size as the input image, where the value of each pixel is proportional to the importance of that pixel in the classification performed by the model. Heatmaps were generated using the "Keras Vis" package [70]. They are shown overlaid on the original X-ray to make them more interpretable by medical staff. To overlay the images, both the original X-ray and the obtained heatmap are given a transparency degree of 50%, which allows us to see the X-ray under the heatmap.
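The 50% overlay can be pictured as a simple alpha blend. The sketch below is an assumption about how such an overlay might be computed, not the plotting code used in the paper: the function name `overlay` is hypothetical, and drawing the heatmap in the red channel is an arbitrary colour choice.

```python
import numpy as np

def overlay(xray, heatmap, alpha=0.5):
    """Blend a greyscale X-ray with a same-sized heatmap into an RGB image."""
    def to_unit(a):
        # Rescale to [0, 1]; a constant array maps to all zeros.
        a = np.asarray(a, dtype=np.float64)
        span = a.max() - a.min()
        return (a - a.min()) / span if span > 0 else np.zeros_like(a)

    x, h = to_unit(xray), to_unit(heatmap)
    rgb = np.stack([x, x, x], axis=-1) * (1 - alpha)   # X-ray at (1 - alpha) opacity
    rgb[..., 0] += alpha * h                            # heatmap at alpha opacity, in red
    return rgb

# Usage with a tiny synthetic image and a heatmap with a single hot pixel.
xray = np.arange(12).reshape(3, 4)
heatmap = np.zeros((3, 4))
heatmap[1, 2] = 5.0
blended = overlay(xray, heatmap)
```

With `alpha=0.5` both layers contribute equally, so the anatomy remains visible under the highlighted regions.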

Typically, binary classification problems use a single output neuron. However, we used two in order to obtain a separate heatmap and generate the desired visualization for each of the two classes. Therefore our classifier layer has two output neurons, each one estimating the probability of the corresponding class (consolidation / non-consolidation). The first step in heatmap generation is changing the activation function of the last layer of the network (the "output layer") from Softmax to Linear [70]. The two heatmaps were generated so that both classifications can be compared, allowing the end-user to compare the areas that the model considers relevant when determining whether the patient presents consolidation or non-consolidation.

As explained above, ensembles are formed by five models. We generated ensemble heatmaps by averaging the individual heatmaps generated by those models. We also computed for each pixel the uncertainty of the heatmap by calculating the standard deviation of the individual heatmaps.
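The ensemble heatmap and its per-pixel uncertainty reduce to a mean and a standard deviation taken across the models. The sketch below uses synthetic arrays in place of real heatmaps; the function name is hypothetical, and the use of the population standard deviation (numpy's default, `ddof=0`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def ensemble_heatmap(heatmaps):
    """Per-pixel mean and standard deviation over a list of same-sized heatmaps."""
    stack = np.stack(heatmaps, axis=0)            # shape (n_models, H, W)
    return stack.mean(axis=0), stack.std(axis=0)

# Usage: five synthetic 2x2 "heatmaps", one per model in the ensemble.
individual = [np.full((2, 2), value) for value in (1.0, 2.0, 3.0, 4.0, 5.0)]
individual[0][0, 0] = 3.0                         # make one pixel less dispersed
mean_map, std_map = ensemble_heatmap(individual)
```

Pixels where the five models agree have a low standard deviation, while pixels where they disagree stand out in `std_map`.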

As described in Section 3.4, we considered six architectures for the CNN. These architectures were compared with CheXNet, which was retrained on our dataset using transfer learning (see Section 3.6). Table 2 shows the performance metrics of the different models. The statistics of AUC and TPR were calculated for all architectures using five different training / validation / test partitions.

Firstly, we can observe that our reference model, CheXNet, obtains an AUC value similar to that of our CNN architectures trained from scratch (see Figure 2). However, the TPR obtained by CheXNet is 0.43 (Figure 3). The reason might lie in the fact that CheXNet was pre-trained using adult X-rays, and that the differences between the original diseases are greater than the differences between alveolar and non-alveolar pneumonia. Furthermore, we can observe that the best architecture is Arch 1 (Architecture 1), which achieves the highest AUC and TPR (see Table 2). Since the differences in AUC are statistically significant (p-value = 0.04), Arch 1 was selected as the best architecture.

In order to analyze Arch 1 in depth, we generated five different divisions of the dataset into construction / test subsets, and for each construction subset we generated five different training / validation splits. For each training / validation / test partition a model was trained from scratch using Arch 1, obtaining a total of 25 models. This allowed us to analyze the robustness and statistics of the architecture across different partitions. Table 3 shows the performance of the individual Arch 1 models. We can observe in Table 3 and Figure 4 that the performance of the architecture across different construction / test partitions is very similar, and that the variance of the metrics is low. Therefore, we can conclude that this architecture is robust over our dataset.

We will now analyze the performance of an ensemble consisting of five individual models. Recall that for each construction / test partition we generated five different training / validation partitions and trained a different CNN from scratch for each of them; each ensemble is formed by those five individual models. As shown in Table 4, a clear improvement in performance is obtained, both in AUC and in TPR. These values are higher than those of the individual models, with a difference of 9% / 7% for AUC / TPR respectively. Therefore, it can be concluded that the ensemble is more robust against overfitting and provides better generalization capacity than the individual CNNs.

Table 4: AUC and TPR values for Arch 1 ensembles.

The last step in our system is the generation of heatmaps. Figure 6 shows the prediction and heatmaps of an individual CNN for a non-consolidation test sample. The non-consolidation probability estimated by the model is 99.9%. If we analyse and compare the heatmaps, we can see that the neuron 0 heatmap (non-consolidation) is clearly brighter than the neuron 1 heatmap (consolidation), and that the neuron 1 heatmap marks some areas that should not be marked, corresponding to the clavicle and diaphragm. The conclusion is that the CNN has not found any signs of consolidation, but the aforementioned areas of the neuron 1 heatmap should not be highlighted. In Figure 8 we can see in the neuron 1 heatmap that there is a consolidation sign in the upper left lung, but the consolidation probability estimated by the model is only 26.3%. The conclusion is that the CNN model incorrectly classified the X-ray as a non-consolidation sample, but the heatmap is able to find the evidence of consolidation. Figure 7 shows that there are signs of consolidation in the left lung. We can observe that both the probability estimated by the CNN model, 98.7%, and the heatmap corresponding to the consolidation class show that the system has correctly classified the radiograph. However, the heatmap only marks the left side of the consolidation area, which does not cover all the consolidation signs. We can conclude that the CNN model has correctly classified the sample but has not found all the areas of interest in the image.

The ensembles are composed of five individual CNNs; therefore, for each radiograph we have five different heatmaps for each class (consolidation / non-consolidation). This allows us to compute the average heatmap and the uncertainty at each pixel (the standard deviation at each pixel) for each class.
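The pixel-wise combination described above can be sketched with NumPy as follows. This is an illustrative sketch rather than our exact implementation: it assumes that the five heatmaps of one class have already been computed and resized to a common shape, it uses the population standard deviation (NumPy's default, `ddof=0`), and the function name and the random input are hypothetical.

```python
import numpy as np

def combine_heatmaps(heatmaps):
    """Combine the per-model heatmaps of one class.

    heatmaps: array-like of shape (n_models, height, width).
    Returns (average heatmap, standard deviation heatmap), each (height, width).
    """
    stack = np.asarray(heatmaps, dtype=float)
    if stack.ndim != 3:
        raise ValueError("expected an array of shape (n_models, height, width)")
    mean_map = stack.mean(axis=0)  # areas most informative according to the ensemble
    std_map = stack.std(axis=0)    # per-pixel disagreement among the individual CNNs
    return mean_map, std_map

# Hypothetical heatmaps of five models for one class (random values for illustration).
rng = np.random.default_rng(0)
heatmaps = rng.random((5, 4, 4))
mean_map, std_map = combine_heatmaps(heatmaps)
print(mean_map.shape, std_map.shape)
```

The same function would be applied once per class, giving one average heatmap and one standard deviation heatmap for the non-consolidation neuron and another pair for the consolidation neuron.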

In Figures 9-11 we can see from left to right, and from top to bottom, the original X-ray, the average heatmap for the "non-consolidation" class, the average heatmap for the "consolidation" class, and the standard deviation heatmaps.

This visual representation provides medical staff with relevant information. First, the probabilities of each class predicted by the model are shown in the title of each average heatmap. Second, the average heatmaps show the areas of the X-ray that are most informative according to the ensemble. Finally, the standard deviation heatmaps show the areas of greatest disagreement among the individual CNNs (that is, the areas with the greatest uncertainty). This suggests that medical staff should pay more attention to areas with higher values in the average heatmaps and in the standard deviation heatmaps. In Figure 9, the average heatmap for neuron 0 (non-consolidation class) is much brighter than the average heatmap for neuron 1 (consolidation). Compared with the visualization obtained by an individual CNN (Figure 6), the ensemble analyzes the left lung in greater depth than the stand-alone model and focuses on the most relevant features for classifying the X-ray. It also has a low standard deviation in this area and, unlike the visualization obtained by the individual CNN, it barely focuses on the heart area. Therefore, it can be concluded that the ensemble result is better than the one obtained by the individual CNN.

In Figure 10, the average heatmaps of both neurons light up, whereas only the neuron 1 heatmap should light up, because signs of consolidation appear in the upper area of the left lung. The standard deviation heatmap of neuron 0 presents higher values than that of neuron 1, which means that the results of neuron 1 (consolidation class) are more robust than those of neuron 0 (non-consolidation class). In the consolidation class average heatmap, the marked area corresponds to the signs of pathology, and it is larger than the area marked in the same heatmap of the individual CNN (Figure 7). Furthermore, comparing the probability estimated by the ensemble (81%) with the one obtained with the individual CNN (26.7%), we can see that ensembles give greater robustness to the system and avoid possible misclassifications made by the stand-alone models. Figure 11 shows that the X-ray has consolidation signs in the left lung. The average heatmap marks practically the whole area affected by the pathology, unlike the visualization of an individual CNN, which only marks a small part of the affected area. The standard deviation heatmap for neuron 1 highlights the areas adjacent to the one of interest. The probability of this class (consolidation) is 96.6%, lower than that obtained by the individual CNN (98.7%). This is because the result is the average of the five models, and not all of them produce the optimal result; however, averaging over all the models increases the robustness of the system.

The average and standard deviation heatmaps (Figures 9-11) provide relevant information about the ensemble output and about how the system made a particular decision. However, the main drawback of this approach is its computational cost and the time it requires. On our computer system, it took about 400 seconds to calculate the heatmaps for the five models in the ensemble.

Finally, we investigate the robustness of our approach using the dataset of Kermany et al. [1]. As in that paper, two classification problems were considered: normal versus pneumonia, and bacterial versus viral pneumonia. We used the same training / test partition as in that work, and randomly subdivided the training set into training (70%) and validation (30%) partitions. We trained five CNN models, each with a different training / validation partition. We used the Arch 1 architecture for the individual models and followed an approach similar to that of the previous experiments. Table 5 shows the results for the normal versus pneumonia classification problem. Individual Arch 1 models achieve AUCs similar to that reported in [1]. On the other hand, the TPR of the individual models is lower (0.91 versus 0.932). However, the ensemble formed by our individual models achieves an AUC of 0.976 and a TPR of 1.0. Therefore, the ensemble shows better robustness and metrics than both our individual models and the model presented in [1], which is based on DenseNet and transfer learning.
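The random 70% / 30% subdivision used to train the five ensemble members can be sketched as follows. This is a minimal illustration under stated assumptions: the splits are drawn independently and without stratification by class, and the function name, the seed and the sample count are hypothetical.

```python
import random

def train_validation_splits(n_samples, n_splits=5, train_fraction=0.7, seed=0):
    """Return n_splits independent random (train, validation) index partitions."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_samples)
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random permutation for each ensemble member
        train = sorted(indices[:n_train])
        validation = sorted(indices[n_train:])
        splits.append((train, validation))
    return splits

# Hypothetical training set of 100 images; one model would be trained per split.
splits = train_validation_splits(n_samples=100)
for i, (train, validation) in enumerate(splits):
    print(f"model {i}: {len(train)} training / {len(validation)} validation samples")
```

Because every member sees a different training / validation partition of the same training set, the members differ from one another, while the test set stays fixed and is never used for training or validation.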

Note that the metrics are better than those obtained on our previous dataset (see Section 4.1), which could be because the Kermany et al. dataset is larger than ours (5,856 X-rays versus 950) and the original images are of higher quality (1-2 million pixels versus 200K), so the training set contains much more information for constructing the models. Table 6 shows the results for the bacterial versus viral pneumonia classification problem. Again, individual Arch 1 models achieve AUC values similar to that reported in [1]. On the other hand, the ensemble formed by the individual models achieves an AUC value of 0.964 [1].

These results are remarkable, as we obtained better AUCs using simple CNN ensembles than with very deep CNN transfer learning techniques such as those in [1], where transfer learning with a DenseNet architecture with 121 convolutional layers was used. Applying ensembles to the Kermany et al. dataset also improves the AUC for the bacterial versus viral pneumonia classification, with a difference of 2%. However, this improvement is smaller than the one obtained on our dataset. The difference between the datasets is probably due to their quality. The dataset provided by Kermany et al. is larger and of higher quality, so individual models alone can achieve better values and generalize adequately, whereas with our more limited dataset the individual models are not able to obtain the best possible results.

A large number of Convolutional Neural Network architectures have recently been proposed to support medical staff in the pneumonia diagnosis task. In this work, a new Machine Learning system based on ensembles, which combines XAI techniques and CNN models, has been designed for childhood pneumonia diagnosis. When an individual model is used on the target dataset, an AUC of 0.81 and a TPR of 0.67 are obtained. Applying ensemble techniques improves the performance to an AUC of 0.92 and a TPR of 0.73. These results are in line with the current state of the art, although in some cases (as described in Section 2.1) they are lower. However, unlike previous approaches from the state of the art, our CNN models classify between the presence and the absence of consolidation. The objective of this work was to analyze and study the applicability of XAI techniques in this domain, applying our approach to a particular dataset (950 X-rays). Three main problems have been overcome: the small number of samples in our dataset (fewer than 1,000), the low quality of the images, and the anatomical variability within the age range of our dataset (between one month and 16 years old) [75], [76].

Regarding the visualizations (heatmaps) used as the main XAI technique, they give adequate results and provide much more information to medical staff than other methods. However, as described before, the quality of the model depends on the quality of the data; for this reason, in some cases, as shown in Figure 8, the model does not work correctly and highlights areas outside the lungs. Therefore, the model could be improved by using other techniques, such as segmentation methods to focus only on the lungs, and by increasing the quality of the dataset. Finally, we compared the results of models trained with our dataset and with the dataset published by Kermany et al.; the results show that we can obtain similar values without very deep convolutional neural networks or transfer learning techniques.

First, we would like to explore different preprocessing techniques, such as lung segmentation to force the model to focus only on the lungs, or different data augmentation techniques to improve the quality of the dataset. These techniques are expected to improve the performance of our model. Second, we will study the performance of this technique on other datasets: for example, datasets covering other thoracic diseases such as COVID-19, including Covid Data Save Lives (from HM hospitals) [77], which contains different kinds of clinical images and clinical data, and BIMCV-COVID19 [78], which is composed of different kinds of medical imaging and COVID-19 testing; or multi-label datasets such as PadChest [79], composed of different clinical images of lung and heart diseases, and the COVID-19 Image Data Collection [80], which contains X-rays of five lung diseases. We would also like to compare our system with other architectures, such as COVID-Net [81], the model used by Kermany et al. [1], and the Multichannel Convolutional Neural Network [82], among others, to better understand and study the robustness of the approach presented.

Other complementary visualization-based XAI techniques will be considered with the aim of facilitating the work of medical staff. Finally, to test the generality of the proposed method, other domains, such as industrial domains [83], where the combination of CNN ensemble models and XAI methods could increase the capabilities (analysis, prediction, explainability) of currently used methods, will be explored and tested in the near future.

Identifying medical diagnoses and treatable diseases by image-based deep learning

Community-acquired pneumonia

Unicef data: Monitoring the situation of children and women

Role of viral and bacterial pathogens in causing pneumonia among western australian children: a case-control study protocol

Associations between pathogens in the upper respiratory tract of young children: interplay between viruses and bacteria

Community-acquired pneumonia in children: the challenges of microbiological diagnosis

Discovering drugs to treat coronavirus disease 2019 (covid-19)

Reliability of radiographic findings and the relation to etiologic agents in community-acquired pneumonia

Viral pneumonia. The Lancet

The need to look at antibiotic resistance from a health systems perspective

Combination of clinical symptoms and blood biomarkers can improve discrimination between bacterial or viral community-acquired pneumonia in children

Association of c-reactive protein with bacterial and respiratory syncytial virus-associated pneumonia among children aged< 5 years in the perch study

World Health Organization et al. Standardization of interpretation of chest radiographs for the diagnosis of pneumonia in children

Standardized interpretation of paediatric chest radiographs for the diagnosis of pneumonia in epidemiological studies

The definition and classification of pneumonia

Optimising convolutional neural networks using a hybrid statistically-driven coral reef optimisation algorithm

Deep learning. Nature

Deep residual learning for image recognition

3d convolutional neural networks for human action recognition

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

Evolving deep neural networks architectures for android malware classification

Evodeep: a new evolutionary approach for automatic deep neural networks parametrisation

Jumping nlp curves: A review of natural language processing research

Recent trends in deep learning based natural language processing

Bram Van Ginneken, and Jeroen Van Der Laak. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis

Ramil Kuleev, and Bulat Ibragimov. Deep neural network ensemble for pneumonia localization from a large-scale chest x-ray database

A convolutional neural network cascade for face detection

Very deep convolutional networks for large-scale image recognition

Ensemble methods in machine learning

Hydra: An ensemble of convolutional neural networks for geospatial land classification

A survey of methods for explaining black box models

What do we need to build explainable ai systems for the medical domain

Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai

Interpretable explanations of black boxes by meaningful perturbation

Computer-aided diagnosis for world health organization-defined chest radiograph primary-endpoint pneumonia in children

Deep learning drops error rate for breast cancer diagnoses by 85%

Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models

Application of deep transfer learning for automated brain abnormality classification using mr images

Usefulness of machine learning-based detection and classification of cardiac arrhythmias with 12-lead electrocardiograms

Evaluation of deep learning detection and classification towards computer-aided diagnosis of breast lesions in digital x-ray mammograms. Computer methods and programs in biomedicine

Deep learning approaches for covid-19 detection based on chest x-ray images

Establishing classifiers with clinical laboratory indicators to distinguish covid-19 from community-acquired pneumonia: Retrospective cohort study

Classification of pediatric pneumonia prediction approaches

Artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct

Comparison and validation of deep learning models for the diagnosis of pneumonia

Initial chest radiographs and artificial intelligence (ai) predict clinical outcomes in covid-19 patients: analysis of 697 italian patients

Covidgr dataset and covid-sdnet methodology for predicting covid-19 based on chest x-ray images

Deep transfer learning artificial intelligence accurately stages covid-19 lung disease severity on portable chest radiographs

Detection of pneumonia clouds in chest x-ray using image processing approach

Automatic detection of major lung diseases using chest radiographs and classification by feed-forward artificial neural network

Attention-guided convolutional neural network for detecting pneumonia on chest x-rays

Automatic lung segmentation on chest x-rays using self-attention deep neural network

An efficient deep learning approach to pneumonia classification in healthcare

Radiologist-level pneumonia detection on chest x-rays with deep learning

The effectiveness of image augmentation in deep learning networks for detecting covid-19: A geometric transformation perspective

A survey on image data augmentation for deep learning

Ensemble methods: Foundations and algorithms [book review]

Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning

Multisource transfer learning with convolutional neural networks for lung pattern analysis

Classification of lung nodule malignancy risk on computed tomography images using convolutional neural network: A comparison between 2d and 3d strategies

Classification of interstitial lung abnormality patterns with an ensemble of deep convolutional neural networks

The DeepRadiology Team. Pneumonia detection in chest radiographs

Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study

A guide to NumPy

Matplotlib: A 2d graphics environment

Improving deep learning using generic data augmentation

Peeking inside the black-box: A survey on explainable artificial intelligence (xai)

The role of balanced training and testing data sets for binary classifiers in bioinformatics

An introduction to roc analysis

Advances in imaging chest tuberculosis: blurring of differences between children and adults

Paediatric anatomy

Covid data save lives -hm hospitales

Bimcv covid-19+: a large annotated dataset of rx and ct images from covid-19 patients

Padchest: A large chest x-ray image dataset with multi-label annotated reports

Covid-19 image data collection: Prospective predictions are the future

Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images

A novel method to identify pneumonia through analyzing chest radiographs employing a multichannel convolutional neural network

Conformance checking for time-series-aware processes

This work has been supported by Spanish Ministry of Science and Education under TIN2017-85727-C4-3-P (DeepBio) grant, by Agencia Estatal de Investigación AEI/FEDER Spain, Project PGC2018-095895-B-I00, by CHIST-ERA 2017 BDSI PACMEL Project (PCI2019-103623), and by Comunidad Autónoma de Madrid under S2018/TCS-4566 (CYNAMON), S2017/BMD-3688 grants. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.