key: cord-0455430-rsvzbh06
authors: Karim, Md. Rezaul; Dohmen, Till; Rebholz-Schuhmann, Dietrich; Decker, Stefan; Cochez, Michael; Beyan, Oya
title: DeepCOVIDExplainer: Explainable COVID-19 Predictions Based on Chest X-ray Images
date: 2020-04-09
journal: nan
DOI: nan
sha: 79b3361a41bc17eb7bab0128327b302e914d986c
doc_id: 455430
cord_uid: rsvzbh06

Amid the coronavirus disease (COVID-19) pandemic, humanity is experiencing a rapid increase in infection numbers across the world. A challenge hospitals face in the fight against the virus is the effective screening of incoming patients. One methodology is the assessment of chest radiography (CXR) images, which usually requires expert radiologists' knowledge. In this paper, we propose an explainable deep neural network (DNN)-based method for the automatic detection of COVID-19 symptoms from CXR images, which we call 'DeepCOVIDExplainer'. We used 16,995 CXR images across 13,808 patients, covering normal, pneumonia, and COVID-19 cases. The CXR images are first comprehensively preprocessed before being augmented and classified with a neural ensemble method, followed by highlighting class-discriminating regions using gradient-guided class activation maps (Grad-CAM++) and layer-wise relevance propagation (LRP). Further, we provide human-interpretable explanations of the predictions. Evaluation results based on hold-out data show that our approach can identify COVID-19 confidently with a positive predictive value (PPV) of 89.61% and a recall of 83%, improving over recent comparable approaches. We hope that our findings will be a useful contribution to the fight against COVID-19 and, more generally, towards an increasing acceptance and adoption of AI-assisted applications in clinical practice.

The ongoing coronavirus pandemic has already had a devastating impact on the health and well-being of the global population [10, 32]. As of April 10, 2020, more than 1.6 million infections of COVID-19 and 97,000 fatalities due to the disease were reported 2. Recent studies show that COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [5], often, but by no means exclusively, affects elderly persons with pre-existing medical conditions [2, 8, 12, 26, 34] 3. While hospitals are struggling to scale up capacities to meet the rising number of patients, it is important to make use of the screening methods at hand to identify COVID-19 cases and discriminate them from other conditions [32]. The definitive test for COVID-19 is the reverse transcriptase-polymerase chain reaction (RT-PCR) test [5], which has to be performed in specialized laboratories and is a labour-intensive process. COVID-19 patients, however, show several distinctive clinical and paraclinical features, e.g., abnormalities in medical chest imaging, commonly with bilateral involvement. These features were shown to be observable on chest X-ray (CXR) and CT images [12], but they are only moderately characteristic to the human eye [26] and not easy to distinguish from regular pneumonia features. AI-based techniques have been utilized in numerous scenarios, including automated diagnosis and treatment in clinical settings [18]. Deep neural networks (DNNs) have recently been employed for the diagnosis of COVID-19 from medical images, leading to promising results [12, 26, 32]. However, many current approaches are 'black box' methods that provide no insight into the decisive image features.
Let us imagine a situation where resources are scarce, e.g., a hospital runs out of confirmatory tests or the necessary radiologists are occupied. An AI-assisted tool could then potentially help less specialized general practitioners to triage patients by highlighting the critical chest regions that led to the automated diagnosis decision [32]. A fully automated method without the possibility of human verification would, however, at the current state of the art, be unconscionable and potentially dangerous in a practical setting. As a first step towards an AI-based clinical assistance tool for COVID-19 diagnosis, we propose 'DeepCOVIDExplainer', a novel diagnosis approach based on neural ensemble methods. The pipeline of 'DeepCOVIDExplainer' starts with histogram equalization, filtering, and unsharp masking of the original CXR input images, followed by the training of DenseNets, ResNets, and VGGNets in a transfer learning (TL) setting, creating respective model snapshots. Those are incorporated into an ensemble, using Softmax class posterior averaging (SCPA) and prediction maximization (PM) over the best-performing models. Finally, class-discriminating attention maps are generated using gradient-guided class activation maps (Grad-CAM++) and layer-wise relevance propagation (LRP) to provide explanations of the predictions and to identify the critical regions on patients' chests. We hope that 'DeepCOVIDExplainer' will be a useful contribution towards the development and adoption of AI-assisted diagnosis applications in general, and for COVID-19 in particular. To allow for the reproduction of our results and for derivative works, we will make the source code, documentation, and links to the used data publicly available.

The rest of the paper is structured as follows: Section 2 outlines related works and points out their potential limitations. Section 3 describes our proposed approach, before experiment results are demonstrated in section 4. Section 5 summarizes the work and provides an outlook before concluding the paper.

Bullock et al. [4] provide a comprehensive overview of recent application areas of AI against COVID-19, mentioning medical imaging for diagnosis first, which emphasizes the prevalence of the topic. Although PCR tests offer many advantages over CXR and CT [2], patient samples must be shipped to a laboratory, whereas X-ray or CT machines are readily available even in rather remote areas. In a recent study by K. Lee et al. [34], CXR and CT images from nine COVID-19-infected patients were analyzed by two radiologists to assess the correspondence of abnormal findings on X-rays with those on CT images. The proportion of patients with abnormal initial radiographic findings was 78.3% to 82.4% for SARS and 83.6% for MERS, while being only 33% for COVID-19 cases [34]. Chest CT images, in contrast, showed double lung involvement in eight out of nine patients. In other words, judging by this small cohort of nine patients, X-ray may not be the best imaging method for detecting COVID-19 [34]. Another study, by Yicheng Fang et al. [8], supports those findings and argues in favour of the effectiveness of CT over X-ray. CT should hence cautiously be considered as the primary imaging source for COVID-19 detection in epidemic areas [2]. Nevertheless, the limited patient cohort sizes leave room for statistical variability and, in contrast to those findings, a few other studies have reported rather promising results for diagnosis based on CXR imaging [9, 25, 32].
Narin et al. [25] evaluated different convolutional neural network (CNN) architectures for the diagnosis of COVID-19 and achieved an accuracy of 98% using a pre-trained ResNet50 model. However, the classification problem is overly simplified by only discriminating between healthy and COVID-19 patients, disregarding the problem of distinguishing regular pneumonia conditions from COVID-19 conditions. Wang et al. [32] proposed COVID-Net to detect distinctive abnormalities in CXR images of COVID-19 patients among samples of patients with non-COVID-19 viral infections, bacterial infections, and healthy patients. On a test sample containing 10 positive COVID-19 cases among approx. 600 other cases, COVID-Net achieved a PPV of 88.9% and a sensitivity of 80%. The small sample size does not yet enable generalizable statements about the reliability of the method, and the regions highlighted using 'GSInquire' are not well localized to critical areas. Overall, training on imbalanced data, a lack of thorough image preprocessing, and poor decision visualization hinder this approach. Biraja G. et al. [9] employed uncertainty estimation and interpretability based on a Bayesian approach to CXR-based COVID-19 diagnosis, with interesting results. The results may, however, be impaired by a lack of appropriate image preprocessing, and the resulting attention maps show rather imprecise areas of interest.

To overcome these shortcomings of state-of-the-art approaches, our approach first enriches existing datasets with more COVID-19 samples, followed by a comprehensive preprocessing pipeline for CXR images and data augmentation. The COVID-19 diagnosis of 'DeepCOVIDExplainer' is based on a snapshot neural ensemble method with a focus on fairness, algorithmic transparency, and explainability, under the following assumptions:

• By maximum (or average) voting from a panel of independent radiologists (i.e., an ensemble), we obtain a final prediction that is fairer and more trustworthy than that of a single radiologist.
• By localizing class-discriminating regions with Grad-CAM++ and LRP, we can not only mitigate the opaqueness of the black-box model by providing more human-interpretable explanations of the predictions [19], but also identify the critical regions on patients' chests.

In this section, we discuss our approach in detail, covering network construction and training, followed by the neural ensemble and decision visualizations.

Depending on the device type, radiographs almost always have dark edges on the left and right side of the image. Hence, we would argue that preprocessing is necessary, both to ensure that the model does not merely learn to check whether the edges contain black pixels and to improve its generalization. We perform contrast enhancement, edge enhancement, and noise elimination on the entire CXR images by employing histogram equalization (HGE), the Perona-Malik filter (PMF), and unsharp-masking edge enhancement. Since images with distinctly darker or brighter regions impact the classification [27], we perform global contrast enhancement of the CXR images using HGE. By merging gray levels with low frequencies into one and stretching frequent intensities over a wide range of gray levels, HGE achieves a close-to-uniform intensity distribution [1], where the probability density function $p(X_k)$ of an image $X$ is defined as [1]:

$p(X_k) = n_k / N, \quad k = 0, 1, \dots, L,$

where $k$ is the grey-level index of an input image $X$ varying from 0 to $L$, $n_k$ is the frequency of grey level $X_k$ appearing in $X$, and $N$ is the total number of pixels of the input image.
A plot of $n_k$ vs. $X_k$ is specified as the histogram of $X$, while the equalization transform function $f(X_k)$ is tightly related to the cumulative density function $c(X_k)$ [1]:

$c(X_k) = \sum_{j=0}^{k} p(X_j), \qquad f(X_k) = X_0 + (X_L - X_0)\, c(X_k).$

The output of HGE, $Y = \{Y(i,j)\}$, is finally synthesized as follows [1]:

$Y = f(X) = \{ f(X(i,j)) \mid \forall\, X(i,j) \in X \}.$

The image filters 'edge enhance' and 'sharpen' are applied, with convolution matrices as kernels $g(\cdot)$. The PMF is used to preserve edges and detailed structures along with noise reduction, provided the fitting diffusion coefficient $c(\cdot)$ and gradient threshold $K$ are chosen appropriately [16]. As a non-linear anisotropic diffusion model, the PMF smoothens a noisy image $\theta(x,y)$ according to the partial differential equation [16]:

$\frac{\partial u(x,y,t)}{\partial t} = \mathrm{div}\big(c(x,y,t)\, \nabla u(x,y,t)\big), \qquad u(x,y,0) = \theta(x,y),$

where $u(x,y,t)$ is the filtered image after $t$ diffusion iterations, and div and $\nabla$ are the divergence and gradient operators w.r.t. the spatial variables $x$ and $y$. The diffusion coefficient $c(\cdot)$ is classically computed as [28]:

$c_1(\|\nabla u\|) = \exp\big(-(\|\nabla u\|/K)^2\big) \quad \text{or} \quad c_2(\|\nabla u\|) = \frac{1}{1 + (\|\nabla u\|/K)^2}.$

To determine whether the local gradient magnitude is strong enough for edge preservation, the diffusion coefficient function $c(\cdot)$ can alternatively be computed as [28]:

$c_3(\|\nabla u\|) = \frac{1}{2}\big[1 - (\|\nabla u\|/K)^2\big]^2 \ \text{for } \|\nabla u\| \le K, \ \text{and } 0 \text{ otherwise},$

where $c_3$ is Tukey's biweight function. Since the boundary between noise and edge is minimal, $c_3$ is applied as the fitting diffusion coefficient [16].

Further, we attempt to remove textual artefacts from the CXR images; e.g., a large number of images annotate the right and left sides of the chest with white 'R' and 'L' characters. To remove them, we first threshold the images to eliminate very bright pixels, and the missing regions are in-painted. Afterwards, image standardization and normalization are performed. For image standardization, the mean pixel value is subtracted from each pixel and divided by the standard deviation of all pixel values; the mean and standard deviation are calculated on the whole dataset and adopted for the training, validation, and test sets. For image normalization, pixel values are rescaled to [0, 1] using a pixel-wise multiplication factor of 1/255, giving a set of grey-scale images. Finally, the CXR images are resized to 224 × 224 × 3 before training starts.
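To make the preprocessing pipeline concrete, the following is a minimal Python sketch of the steps described above (histogram equalization, unsharp masking, rescaling, and resizing), assuming OpenCV; the function name and parameter values are our own illustrative choices, and the Perona-Malik filtering step is omitted for brevity:

```python
import cv2
import numpy as np

def preprocess_cxr(path, size=(224, 224)):
    """Sketch of the CXR preprocessing described above (PMF step omitted)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Global contrast enhancement via histogram equalization (HGE).
    img = cv2.equalizeHist(img)
    # Unsharp masking: sharpen by subtracting a Gaussian-blurred copy.
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=3)
    img = cv2.addWeighted(img, 1.5, blurred, -0.5, 0)
    # Normalize pixel values to [0, 1].
    img = img.astype(np.float32) / 255.0
    # Resize and replicate the grey channel to obtain 224 x 224 x 3.
    img = cv2.resize(img, size)
    return np.stack([img] * 3, axis=-1)
```

In a full implementation, the PMF would be applied between the equalization and sharpening stages, with the gradient threshold K tuned on held-out data.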
We train VGG, ResNet, and DenseNet architectures and create several snapshots during a single training run with cyclic cosine annealing (CAC) (see fig. 2) [22], before combining their predictions into an ensemble prediction [13, 17]. We pick VGG-16 and VGG-19 due to their general suitability for image classification. Based on the dense evaluation concept [30], the VGG variants convert the last three fully connected layers (FCLs) into 2D convolution operations to reduce the number of hyperparameters. We keep the last two layers fixed with a 1×1 kernel, leaving the final one equipped with a Softmax activation. However, owing to the computational complexity of VGG-16 caused by its consecutive FCLs, the revised VGG-19 is trained with a reduced number of hidden nodes in the first two FCLs. Next, we pick the ResNet-18 [33] and ResNet-34 [11] architectures. Apart from the common building blocks, two bottlenecks are present in ResNets in the form of channel reduction. A series of convolution operators without pooling is placed in between and recognized as a stack, as shown in fig. 1. The first conv layer of each stack in ResNets (except for the first stack) is down-sampled at stride 2, which provokes the channel difference between identity and residual mappings. ResNets are lightweight stack-based CNNs, with their simplicity arising from small filter sizes (i.e., 3×3) [30]; w.r.t. regularisation, a 7×7 conv filter is decomposed into a stack of three 3×3 filters with non-linearity injected in between [30]. Lastly, the DenseNet-161 and DenseNet-201 architectures are picked. While ResNets merge feature maps through summation, DenseNets concatenate additional inputs from preceding layers, which not only strengthens feature propagation and moderates information loss, but also increases the feature-reuse capability while cutting down the number of parameters [14].

To avoid possible overfitting, L2 weight regularization, dropout, and data augmentation (rotating the training CXR images by up to 15°) are employed. We did not initialize the network weights with any pretrained (e.g., ImageNet) models. The reason is that ImageNet contains photos of general objects, which would activate the internal representations of the network's hidden layers with geometrical forms, colorful patterns, or irrelevant shapes that are usually not present in CXR images. We set the number of epochs (NE), the maximum learning rate (LR), the number of cycles, and the current epoch number, where the initial LR and NE are two hyperparameters. CAC starts with a large LR that rapidly decreases to a minimum value before being dramatically increased again at the start of the next cycle. During each model training, CAC changes the LR aggressively but systematically over the epochs to produce different network weights [13]:

$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\pi\, \frac{\mathrm{mod}(t-1,\, \lceil T/C \rceil)}{\lceil T/C \rceil}\right) + 1\right),$

where $\alpha(t)$ is the LR at epoch $t$, $\alpha_0$ is the maximum LR, $T$ is the total number of epochs, $C$ is the number of cycles, and mod is the modulo operation. After training a network for $C$ cycles, the best weights at the bottom of each cycle are saved as a model snapshot ($m$), giving $M$ model snapshots, where $m \le M$.

Especially when a single practitioner makes a COVID-19 diagnosis, there is a chance of a false diagnosis. In case of doubt, a radiologist should therefore ask for a second or third opinion from other experts. Analogous to this principle, we employ model ensembles, which combine the 'expertise' of different prediction algorithms into a consolidated prediction and thereby reduce the generalization error [13]. Research has shown that a neural ensemble combining several deep architectures is more effective than a structure based solely on a single model [13, 17]. Inspired by [13, 31], we apply both the SCPA and the PM of the best-performing models from the list of snapshot models, ensemble their predictions, and propagate them through the Softmax layer, where the class probability of the ground truth $j$ for a given image $x$ is inferred as follows [31]:

$P(y = j \mid x) = \frac{1}{M} \sum_{m=1}^{M} \hat{P}_m(y = j \mid x), \quad j \in \{1, \dots, K\},$

where $m$ indexes the snapshot models, $M$ is the number of models, $K$ is the number of classes, and $\hat{P}_m(y = j \mid x)$ is the class-posterior distribution of model $m$.
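As an illustration of the snapshot-ensemble machinery, here is a small Python sketch of the CAC schedule and the two ensembling rules defined above; the variable names and the commented usage are hypothetical placeholders, not the authors' released code:

```python
import math
import numpy as np

def cac_lr(t, alpha0, T, C):
    """Cyclic cosine annealing: learning rate at epoch t (1-indexed),
    given maximum LR alpha0, T total epochs, and C cycles."""
    cycle_len = math.ceil(T / C)
    return (alpha0 / 2.0) * (math.cos(math.pi * ((t - 1) % cycle_len) / cycle_len) + 1.0)

def scpa_ensemble(prob_list):
    """Softmax class posterior averaging (SCPA): mean of the per-model
    (N, K) class-probability arrays."""
    return np.mean(np.stack(prob_list), axis=0)

def pm_ensemble(prob_list):
    """Prediction maximization (PM): element-wise maximum of the
    per-model class probabilities."""
    return np.max(np.stack(prob_list), axis=0)

# Hypothetical usage: probs_per_model holds the softmax outputs of the
# selected snapshot models; the final label is the argmax over K classes.
# ensembled = scpa_ensemble(probs_per_model)
# y_pred = ensembled.argmax(axis=1)
```

With, e.g., alpha0 = 0.01, T = 100 epochs, and C = 5 cycles, the LR restarts every 20 epochs, and a snapshot would be saved at the bottom of each cycle.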
To improve the transparency of the COVID-19 detection, class-discriminating regions on the subject's chest are generated by employing Grad-CAM [29], Grad-CAM++ [6], and LRP [15]. The idea is to explain where the model pays most attention for the classification. CAM computes a weight for each feature map (FM) of the final conv layer to calculate its contribution to the prediction $y^c$ at location $(i, j)$, the goal being to obtain $L_{ij}^c$ such that $y^c = \sum_{i,j} L_{ij}^c$. The last FM $A_{ijk}$ and the prediction $y^c$ are in a linear relationship, realized by linear layers consisting of global average pooling (GAP) and FCLs: i) the GAP outputs $F_k = \sum_{i,j} A_{ijk}$; ii) the FCL holding the weights $w_k^c$ generates the following output [21]:

$y^c = \sum_k w_k^c F_k,$

where $L_{ij}^c = \sum_k w_k^c A_{ijk}$ [21]. Since CAM discards the non-linear classifier layers and is restricted to architectures with a GAP layer directly preceding the classifier, it is an unsuitable method here. Hence, we employ Grad-CAM, which globally averages the gradients of the FMs as weights instead of pooling. While the heat maps (HMs) are plotted, the class-specific weights are collected from the final conv layer through the globally averaged gradients (GAG) of the FMs [6]:

$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k},$

where $Z$ is the number of pixels in an FM, $\partial y^c / \partial A_{ij}^k$ is the gradient of the class score, and $A_{ij}^k$ is the value of the $k$-th FM at $(i, j)$. Having gathered the relative weights, the coarse saliency map (SM) $L^c$ is computed as the ReLU of the weighted sum $\sum_k \alpha_k^c A^k$. The ReLU restricts the linear combination of the FMs to the features with a positive influence on the respective class [6]; the negative pixels, which belong to other categories in the image, are discarded [29]:

$L^c = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big).$

Grad-CAM++ (see fig. 3) replaces the GAG with a weighted average of the pixel-wise gradients, where the weights of the pixels contributing to the final prediction are computed with the iterators $(i, j)$ and $(a, b)$ running over the same activation map $A^k$ [6]:

$\alpha_k^c = \sum_{i,j} \frac{\partial^2 y^c / (\partial A_{ij}^k)^2}{2\, \partial^2 y^c / (\partial A_{ij}^k)^2 + \sum_{a,b} A_{ab}^k\, \partial^3 y^c / (\partial A_{ij}^k)^3}\; \mathrm{ReLU}\Big(\frac{\partial y^c}{\partial A_{ij}^k}\Big).$

Even though CXR images rarely contain multiple targets, revealing the particular image parts that contributed to the prediction, rather than the entire chest area, is still helpful. CAM variants do not backpropagate the gradients all the way to the input; they are essentially propagated only to the final conv layer. Besides, CAM methods are limited to specific architectures, where an average-pooling layer connects the conv layers with an FCL. LRP is another robust technique for propagating relevance scores (RSs); in contrast to CAM, it redistributes relevance proportionally to the activations of the previous layers. LRP assumes that the class likelihood can be traced backwards through the network to the individual layer-wise nodes of the input [15]. For a network of $L$ layers, with nodes $1, 2, \dots, N$ in layer $l$ and nodes $1, 2, \dots, M$ in layer $l+1$, the RS $R_n^{(l)}$ at node $n$ in layer $l$ is recursively defined as [15]:

$R_n^{(l)} = \sum_{m=1}^{M} \frac{z_{nm}}{\sum_{n'=1}^{N} z_{n'm}}\, R_m^{(l+1)}, \qquad z_{nm} = a_n w_{nm},$

where $a_n$ is the activation of node $n$ and $w_{nm}$ is the weight connecting node $n$ to node $m$. For nodes with ReLU activations, the node-level RS is calculated from the positive contributions only [15]:

$R_n^{(l)} = \sum_{m=1}^{M} \frac{z_{nm}^{+}}{\sum_{n'=1}^{N} z_{n'm}^{+}}\, R_m^{(l+1)}, \qquad z_{nm}^{+} = \max(0,\, a_n w_{nm}).$

The RS of the output layer is finally initialized with the predicted class score before being back-propagated [15]:

$R_t^{(L)} = y_t, \qquad R_n^{(L)} = 0 \ \text{for } n \neq t.$

First, an image $x$ is classified in a forward pass, whereupon LRP identifies the important pixels. The backward pass is a conservative redistribution procedure of the relevance $R_t^{(L)}$, using back-propagation with deep Taylor decomposition [24], generating a relevance map $R_{lrp}$ in which the nodes that contribute most to the higher layer also receive the most relevance. Finally, heat maps are generated for all test samples based on the trained models, indicating the relevance of each region for the classification decision.
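For illustration, here is a minimal PyTorch sketch of the Grad-CAM computation above; the hook-based capture and the feature_layer argument are our own illustrative choices rather than the authors' implementation, and Grad-CAM++ would additionally require the second- and third-order gradient terms:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class, feature_layer):
    """Grad-CAM sketch: weight the final conv feature maps by their
    globally averaged gradients (GAG) and pass the sum through ReLU."""
    activations, gradients = [], []
    h1 = feature_layer.register_forward_hook(
        lambda mod, inp, out: activations.append(out))
    h2 = feature_layer.register_full_backward_hook(
        lambda mod, gin, gout: gradients.append(gout[0]))
    score = model(x)[0, target_class]   # forward pass, class score y^c
    model.zero_grad()
    score.backward()                    # gradients of y^c w.r.t. the FMs
    h1.remove()
    h2.remove()
    A, dA = activations[0], gradients[0]           # shape (1, K, H, W)
    alpha = dA.mean(dim=(2, 3), keepdim=True)      # GAG weights alpha_k^c
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))  # weighted sum + ReLU
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False)       # upsample to input size
    return cam / (cam.max() + 1e-8)     # normalize to [0, 1] for overlay
```

An LRP relevance map can be obtained analogously without hand-rolling the propagation rules, e.g., via the LRP implementation in the captum library (captum.attr.LRP).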
In this section, we discuss the evaluation results both quantitatively and qualitatively. Experiments were carried out on a machine with an Intel(R) Xeon(R) E5-2640 CPU, 256 GB of RAM, and Ubuntu 16.04 OS. To tackle class imbalance, we apply class weighting to penalize the model when it misclassifies a positive sample. Although accuracy is an intuitive evaluation criterion for many bio-imaging problems, e.g., osteoarthritis severity prediction [3], it is most suitable for balanced class scenarios. Given the imbalanced class scenario, with widely varying distributions between the classes, we report precision, recall, F1, and positive predictive value (PPV), produced through random search and 5-fold cross-validation tests, i.e., for each hyperparameter group of a certain network structure, five repeated experiments are conducted.

We consider three different versions of the COVIDx dataset. The COVIDx v1.0 dataset had a total of 5,941 CXR images from 2,839 patients, based on the COVID-19 image dataset curated by Joseph P. et al. [7] and the Kaggle CXR Pneumonia dataset 6 by Paul Mooney; it was used in some early works, e.g., [32]. However, the Kaggle CXR images are of children. Therefore, to avoid possible prediction bias (e.g., the model might be prone to predict based on mere chest size), we enriched COVIDx v2.0 with CXR images of adult subjects from the RSNA Pneumonia Detection Challenge 7 and with original and augmented versions of COVID-19 examples 8; the resulting dataset, which we call 'COVIDx v3.0', is used in our approach. An additional 59 CXR images were collected from: i) Italian Radiological Case 9, and ii) Radiopaedia.org (provided by Dr. Fabio Macori) 10. The 'COVIDx v3.0' images are categorized as normal, pneumonia, and COVID-19 viral. Table 1 and table 2 show the distributions of classes, images, and patients.

The overall results are summarized in table 3. As can be seen, VGG-19 and DenseNet-161 performed best on both the balanced and imbalanced datasets, while VGG-16 turned out to be the lowest performer. In direct comparison, the diagnosis of VGG-19 yields much better results than that of VGG-16, which might be explained by the fact that a classifier with more formations requires better-fitting FMs, which in turn depend on the conv layers. The architecture modification of VGG-19, setting two conv layers and a filter size of 16, visibly enhances the performance. ResNet-18 performed better, although its larger counterpart ResNet-34 shows quite unexpectedly low performance: evidently, despite the structured residual blocks, the accumulation of layers could not improve the FMs extracted from the CXR images. Both DenseNet architectures show consistent performance owing to their clearer image composition. DenseNet-161 outperforms not only DenseNet-201 but also all the other models; in particular, it achieves precision, recall, and F1 scores of 0.94, 0.95, and 0.945, respectively, on the balanced CXR images. On the imbalanced image sets, both DenseNet-161 and ResNet-18 perform consistently. Although VGG-19 and ResNet-18 show competitive results on the balanced dataset, their misclassification rates for normal and pneumonia samples are slightly higher than those of DenseNet-161, which poses a risk for clinical diagnosis. In contrast, DenseNet-161 is found to be resilient to the imbalanced class scenario. Hence, models like DenseNet-161, which can handle moderately imbalanced class scenarios, seem better suited for the clinical setting, where COVID-19 cases are rare compared to pneumonia or normal cases. The ROC curve of the DenseNet-161 model in fig. 4 shows consistent AUC scores across folds, indicating stable predictions that are much better than random.

Nevertheless, bad snapshot models can contaminate the overall predictive power of the ensemble model. Hence, we employ WeightWatcher [23] at two levels: i) at level 1, we choose the top-5 snapshots to generate a full model; ii) at level 2, we choose the top-3 models for the final ensemble model. At level 2, WeightWatcher is used to compare the top models (excluding VGG-16, ResNet-34, and DenseNet-201) and to choose the ones with the lowest log norm and highest weighted alpha (refer to section 4 in the supplementary for details), where a low (weighted/average) log norm signifies better generalization of the network weights [23]. Figure 5 illustrates the choice between VGG-16 and VGG-19 with WeightWatcher in terms of weighted alpha and log norm.
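As a sketch of this selection step, the open-source weightwatcher package can be used roughly as follows; the helper function is hypothetical, and the exact metric names in the returned summary may differ between package versions:

```python
import weightwatcher as ww

def rank_models(named_models):
    """Analyze each trained network and collect the WeightWatcher summary
    metrics discussed above (log norm, weighted alpha) for model selection."""
    rows = []
    for name, model in named_models:
        watcher = ww.WeightWatcher(model=model)
        details = watcher.analyze()              # per-layer power-law fits
        summary = watcher.get_summary(details)   # aggregate quality metrics
        rows.append((name, summary.get("log_norm"), summary.get("alpha_weighted")))
    # Prefer models with a low log norm (better generalization per [23]).
    return sorted(rows, key=lambda r: r[1])
```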
We perform the ensemble on the following top-3 models: VGG-19, ResNet-18, and DenseNet-161; to ensure a variation of network architectures within the ensemble, VGG-19 is included as well. As presented in table 4, the ensemble based on the SCPA method moderately outperforms the ensemble based on the PM method. The reason is that the PM approach is easily influenced by outliers with high scores, whereas averaging the per-class probabilities largely dampens the effect of such outliers. For the SCPA-based ensemble, the combination of VGG-19 + DenseNet-161 outperforms the other ensemble combinations. The confusion matrix of the best ensemble's performance on the balanced data is shown in fig. 6; the results show that the majority of samples are classified correctly. Since we primarily want to limit the number of missed COVID-19 instances, the achieved recall of 83% is still an acceptable figure, although it means that a certain fraction of all infected patients will be missed. To determine how many of the positively classified patients actually have the disease, we calculate the positive predictive value (PPV). Out of our test set with 77 COVID-19 patient samples, only six were misclassified as pneumonia and two as normal, which results in a PPV of 89.61% for COVID-19 cases, slightly outperforming a comparable approach [32]. In our case, the results are backed up by a larger test set, which contributes to the reliability of our evaluation. It should be noted that the PPV was reported for a low prevalence of COVID-19 in the cohort; in a setting with high COVID-19 prevalence, the likelihood of false positives is expected to shrink further in favour of correct COVID-19 predictions.
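For reference, here is a small sketch of how per-class recall and PPV follow from a confusion matrix; the matrix values below are illustrative placeholders, not the exact counts in fig. 6:

```python
import numpy as np

def per_class_metrics(cm, cls):
    """Recall and PPV for class `cls` from a KxK confusion matrix,
    where cm[i, j] counts samples of true class i predicted as class j."""
    tp = cm[cls, cls]
    fn = cm[cls, :].sum() - tp   # class-cls samples predicted as another class
    fp = cm[:, cls].sum() - tp   # other classes predicted as cls
    recall = tp / (tp + fn)      # fraction of true cases that are caught
    ppv = tp / (tp + fp)         # fraction of positive calls that are correct
    return recall, ppv

# Illustrative 3-class matrix (rows/cols: normal, pneumonia, COVID-19):
cm = np.array([[95, 4, 1],
               [6, 90, 4],
               [2, 6, 69]])
print(per_class_metrics(cm, cls=2))
```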
Precise localization of the decisive features is vital not only for the explanation but also for rapid confirmation of the reliability of the outcome, especially for potentially false-positive cases [6]. Attention maps highlighting critical regions on the chest advocate transparency and trustworthiness to clinicians and help them leverage their screening skills to make faster and yet more accurate diagnoses [32]. In general, the more accurate a model is, the more consistent the visualizations of Grad-CAM and Grad-CAM++ will be; key features can then easily be identified where the activation maps overlap. The critical regions of some CXR images of COVID-19 cases are demonstrated in fig. 7, fig. 8, and fig. 9, where class-discriminating areas within the lungs are localized. As can be seen, the HMs generated by Grad-CAM and Grad-CAM++ are fairly consistent and alike, but those of Grad-CAM++ are more accurately localized: instead of certain isolated parts, Grad-CAM++ highlights conjoined features more precisely. On the other hand, although LRP highlights regions much more precisely, it fails to direct attention to the critical regions. It turned out that Grad-CAM++ generates the most reliable HMs when compared to Grad-CAM and LRP. To provide more human-interpretable explanations, consider the following examples (based on ResNet-18):

• Example 1: the CXR image is classified as a confirmed COVID-19 case with a probability of 58%; the true class is COVID-19, as shown in fig. 7.
• Example 2: the CXR image is classified as a confirmed COVID-19 case with a probability of 58%; the true class is COVID-19, as shown in fig. 8.
• Example 3: the CXR image is classified as a COVID-19 case with a classification score of 10.5; the true class is COVID-19, as shown in fig. 9.

Based on the above analyses, 'DeepCOVIDExplainer' disseminates the following recommendations. Firstly, even if a specific approach does not perform well, an ensemble of several models can still outperform the individual models. Secondly, since accurate diagnosis is a mandate, models trained on imbalanced data can provide distorted or wrong predictions at inference time due to possible overfitting during training; in such a case, even a high accuracy score can be achieved without predicting the minor classes and may hence be uninformative. Thirdly, taking the COVID-19 diagnosis context into account, the risk resulting from a pneumonia diagnosis is much lower than that from a COVID-19 diagnosis; hence, it is more reasonable to make a decision based on the maximum score among all single-model predictions. Fourthly, due to the nature of neural networks, decision visualizations cannot readily be provided for ensemble models; for the decision visualization, it is therefore recommended to pick the single best model as a basis and to employ Grad-CAM++ for the most reliable localization.

In this paper, we proposed 'DeepCOVIDExplainer' to leverage explainable COVID-19 prediction based on CXR images. Evaluation results show that our approach can identify COVID-19 with a PPV of 89.61% and a recall of 83%, outperforming a recent approach. Further, as Curtis Langlotz 11 stated, "AI won't replace radiologists, but radiologists who use AI will replace radiologists who don't". In the same vein, we would argue that 'DeepCOVIDExplainer' is not meant to replace radiologists but to be evaluated in a clinical setting; it is by no means a suitable replacement for a human radiologist. We would even argue that human judgement is indispensable when the lives of patients are at stake. However, we hope our findings will be a useful contribution to the fight against COVID-19 and towards an increasing acceptance and adoption of AI-assisted applications in clinical practice.

Lastly, we want to outline potential areas of enhancement. Firstly, since only a limited number of CXR images of COVID-19 infection cases were at hand, it would be unfair to claim that we can rule out overfitting of our models; more unseen data from similar distributions is necessary for further evaluation, to avoid possible out-of-distribution issues. Secondly, due to external conditions, we have not yet been able to verify the diagnoses and localization accuracy with radiologists. Thirdly, accurate predictions do not depend on a single imaging modality alone, but could also build on additional modalities like CT and on other decisive factors, such as patient demographics and symptomatic assessment reports [31]. Nevertheless, we would argue that explaining predictions with plots and charts is useful for exploration and discovery [20].
Explaining them to patients, however, may be tedious and may require more human-interpretable decision rules in natural language. In future work, we intend to overcome these limitations by: i) collecting more data (e.g., patient CT, phenotype, and history) and training a multimodal convolutional autoencoder, and ii) incorporating domain knowledge with neuro-symbolic reasoning to generate decision rules and make the diagnosis fairer.

This work was supported by the German Ministry for Research and Education (BMBF) as part of the SMITH consortium (grant no. 01ZZ1803K). This work was conducted jointly by RWTH Aachen University and Fraunhofer FIT as part of the PHT and GoFAIR implementation network, which aims to develop a proof-of-concept information system to address current data reusability challenges occurring in the context of so-called data integration centers that are being established as part of ongoing German Medical Informatics BMBF projects.

Acronyms and their full forms used in this paper are as follows: CXR: chest X-ray; CT: computed tomography; DNN: deep neural network; CNN: convolutional neural network; TL: transfer learning; HGE: histogram equalization; PMF: Perona-Malik filter; CAC: cyclic cosine annealing; LR: learning rate; NE: number of epochs; SCPA: Softmax class posterior averaging; PM: prediction maximization; FCL: fully connected layer; FM: feature map; GAP: global average pooling; GAG: globally averaged gradients; SM: saliency map; HM: heat map; RS: relevance score; Grad-CAM: gradient-guided class activation map; LRP: layer-wise relevance propagation; PPV: positive predictive value; RT-PCR: reverse transcriptase-polymerase chain reaction.

References
[1] Modified histogram based contrast enhancement using homomorphic filtering for medical images.
[2] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases.
[3] Simple Definition and Calculation of Accuracy.
[4] Mapping the Landscape of Artificial Intelligence Applications against COVID-19.
[6] Grad-CAM++: Generalized gradient-based visual explanations for convolutional networks.
[7] COVID-19 image data collection. arXiv.
[8] Sensitivity of chest CT for COVID-19: comparison to RT-PCR.
[9] Estimating Uncertainty and Interpretability in Deep Learning for Coronavirus (COVID-19) Detection.
[10] Rapid AI development cycle for the coronavirus pandemic: Initial results for automated detection and patient monitoring using deep learning CT image analysis.
[11] Deep residual learning for image recognition.
[12] Clinical features of patients infected with novel coronavirus in Wuhan.
[13] Snapshot Ensembles: Train 1, get m for free.
[14] Densely connected convolutional networks.
[15] Explaining Convolutional Neural Networks using Softmax Gradient Layer-wise Relevance Propagation.
[16] Image denoising using variations of Perona-Malik model with different edge stopping functions.
[17] A Snapshot Neural Ensemble Method for Cancer-type Prediction Based on Copy Number Variations.
[18] Deep learning-based clustering approaches for bioinformatics.
[19] OncoNetExplainer: Explainable Predictions of Cancer Types Based on Gene Expression Data.
[20] Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data.
[21] Extending Class Activation Mapping Using Gaussian Receptive Field.
[22] SGDR: Stochastic Gradient Descent with Warm Restarts.
[23] Traditional and heavy-tailed self regularization in neural network models.
[24] Explaining nonlinear classification decisions with deep Taylor decomposition.
[25] Automatic Detection of Coronavirus Disease using X-ray Images and Deep Convolutional Neural Networks.
[26] Imaging profile of COVID-19 infection: radiologic findings and literature review.
[27] A combined effect of local and global method for contrast image enhancement.
[28] Scale-space and edge detection using anisotropic diffusion.
[29] Grad-CAM: Visual explanations from deep networks via gradient-based localization.
[30] Very deep convolutional networks for large-scale image recognition.
[31] Automatic Knee Osteoarthritis Diagnosis from Plain Radiographs: A Deep Learning-based Approach.
[32] COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest X-ray Images.
[33] Aggregated residual transformations for deep neural networks.
[34] X-ray may be missing COVID cases found with CT.