key: cord-0225277-20ft84pi
authors: Altaf, Fouzia; Islam, Syed M.S.; Akhtar, Naveed
title: Resetting the baseline: CT-based COVID-19 diagnosis with Deep Transfer Learning is not as accurate as widely thought
date: 2021-08-12
journal: nan
DOI: nan
sha: f4459cca1b0172f4eb6e5567203543ce0846429a
doc_id: 225277
cord_uid: 20ft84pi

Deep learning is gaining instant popularity in computer aided diagnosis of COVID-19. Due to the high sensitivity of Computed Tomography (CT) to this disease, CT-based COVID-19 detection with visual models is currently at the forefront of medical imaging research. Outcomes published in this direction frequently claim highly accurate detection under deep transfer learning. This is leading medical technologists to believe that deep transfer learning is the mainstream solution for the problem. However, our critical analysis of the literature reveals an alarming performance disparity between different published results. Hence, we conduct a systematic, thorough investigation to analyze the effectiveness of deep transfer learning for COVID-19 detection with CT images. Exploring 14 state-of-the-art visual models with over 200 model training sessions, we conclusively establish that the published literature frequently overestimates transfer learning performance for the problem, even in prestigious scientific sources. The roots of the overestimation trace back to inappropriate data curation. We also provide case studies that consider more realistic scenarios, and establish transparent baselines for the problem. We hope that our reproducible investigation will help in curbing hype-driven claims for the critical problem of COVID-19 diagnosis, and pave the way for a more transparent performance evaluation of techniques for CT-based COVID-19 detection.

I. INTRODUCTION

Having likely originated in Wuhan, China in December 2019 [1], a novel coronavirus disease, later dubbed COVID-19 [2], was declared a pandemic in March 2020 [3]. Since then, it has disrupted nearly all aspects of daily life across the globe. The onus is now on the scientific community to rapidly develop solutions to curb this disease. This has spawned numerous interdisciplinary efforts across multiple scientific fields. Whereas cross-domain research is essential to meet the challenges of COVID-19, it can also produce misleading findings that fail to account for detailed knowledge of the contributing domains. This concern is especially valid for COVID-19 research, which is driven by the highest level of urgency and is therefore prone to superficial investigations. This paper exposes a scenario marred by this issue in the medical image analysis literature on COVID-19 detection using CT images. It then addresses the concern by providing an extensive, transparent investigation.

The Reverse Transcription Polymerase Chain Reaction (RT-PCR) test is currently considered the gold standard for diagnosing COVID-19. Nevertheless, it can still be complemented for improved detection with imaging techniques, e.g. radiography and tomography. The latter can also be a viable diagnostic tool in the absence of RT-PCR. It is claimed that Computed Tomography (CT) has even higher sensitivity to COVID-19 than RT-PCR [4]-[6]. This has led to significant interest of the medical imaging community in exploring CT images for COVID-19 detection [7]-[12]. Medical imaging researchers currently rely heavily on developments in the fields of machine learning and computer vision [13], where deep learning [14] is the key technology providing numerous breakthroughs.
This has invited medical researchers to develop deep learning solutions for CT-based COVID-19 detection. Unfortunately, deep learning requires a large amount of annotated (training) data to learn computational models that can accurately detect COVID-19 in CT images. Such a large volume of expert-annotated data is currently not available for this task. This makes 'deep transfer learning' the preferred solution for the problem [13]. Put simply, transfer learning allows one to transfer a deep learning model trained for one domain (e.g. natural images) to another domain (e.g. CT images). Due to the easy availability of models trained on large-scale annotated natural images, it is common practice to transfer natural image models to the domain of CT images for our problem of interest. This broad strategy is currently highly popular in the related literature; however, we also find it marked by confusing results.

In the current literature, at one end we find contributions that claim highly precise detection of COVID-19 with transfer learning using a very limited amount of annotated training data [9], [15], [16]. These contributions make their claims based on vanilla transfer learning. Contrastingly, there are also works that enhance transfer learning for better performance with rather sophisticated procedures. Surprisingly, most of these methods do not claim very high accuracies despite their enhancements [12], [17], [11]. At the other extreme, we also find contributions that train models from scratch [18], [19], [20]. When a large amount of training data is available, this strategy is expected to give the best performance. However, we often find methods in this category reporting comparable [18], [20], [19] or lower [21], [22] performance than the vanilla transfer learning methods. These methods often use orders of magnitude more data than the transfer learning methods, with few exceptions, e.g. [19], that claim high performance with small datasets. We capture a summary of this contradiction in the published literature with representative examples in Table I.

Contrasting results in the literature call for an in-depth investigation that can provide a transparent baseline for the critical task of COVID-19 detection with CT imaging, which is widely preferred for its high sensitivity to COVID-19 [4], [5], [6]. Such an investigation of deep transfer learning is vital for two main reasons. First, it will halt premature deployment of this technology in practice, which is currently encouraged by frequent overestimation of its capabilities for the problem at hand. Second, a transparent perspective on deep transfer learning abilities for CT-based COVID-19 detection will allow the scientific community to better address the weaknesses of this strategy, which are currently not apparent due to the high performance claims. In this paper, we address this issue with a reproducible, comprehensive study that resets the baseline of deep transfer learning for CT-based COVID-19 detection. To that end, we extensively analyse the performance of 14 state-of-the-art models, pre-trained on ImageNet [24], a large-scale database of natural images, and transferred to the CT image domain using the COVID-CT-Dataset (CCD) [25]. Our choice of CCD is based on the recent comprehensive study published in Nature Scientific Reports [9]. Our analysis is also mainly anchored by the experimental setup of [9].
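Operationally, 'transferring' a pre-trained model in such studies follows the standard recipe of replacing the network's ImageNet classification head with a two-class layer and fine-tuning on CT images. The sketch below illustrates this recipe, assuming PyTorch and torchvision; it is our own minimal illustration (the backbone choice is arbitrary), not the exact script of [9] or of our experiments.

```python
# Minimal transfer learning sketch (illustrative only): repurpose an
# ImageNet pre-trained CNN for the binary COVID-19 decision on CT images.
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int = 2) -> nn.Module:
    # Load a backbone pre-trained on ImageNet (natural images).
    model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
    # Replace the 1000-way ImageNet classifier with a 2-way head.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    # All layers remain trainable, i.e. the whole network is fine-tuned
    # ("vanilla" transfer learning); freezing earlier layers is a common
    # alternative not shown here.
    return model
```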
However, we find that their evaluation does not strictly follow the standard practices of the machine learning community. When those practices are followed, the claimed transfer learning performance estimates reduce considerably. Our investigation identifies cursory data curation as the major cause of a drastic overestimation of performance for the problem in the broader literature. We establish a transparent baseline for CCD, and also analyse two practicable scenarios to enable a more informed expectation of deep transfer learning based COVID-19 detection with CT images in practice. The contributions of this paper are summarized as follows:
• We identify the overestimation of transfer learning abilities in the literature for CT-based COVID-19 detection.
• We perform a transparent investigation to reset the baseline of COVID-19 detection using transfer learning on CCD [25]. We demonstrate the performance overestimation by comparing results with a study published on the same topic in Nature Scientific Reports [9].
• We establish inappropriate data curation as the source of overestimated results.
• We also provide transfer learning results for COVID-19 detection under pragmatic scenarios.

II. BACKGROUND AND RELATED WORK

To comprehend the results and claims in the literature on CT-based COVID-19 detection with transfer learning, it is imperative to understand the interaction between the involved scientific sub-fields and the metrics used in evaluation. Before discussing the related works, we first provide a short discussion on these topics.

Background: Deep learning [14] is the key technology around which imaging based COVID-19 research currently revolves. In itself, it is a representation learning technique, which is a core topic in the field of machine learning [26]. However, computer vision researchers have also found deep learning with Convolutional Neural Networks (CNNs) to be extremely effective for their problems [27]. From the computer vision literature, CNNs paved their way into medical image analysis [13]. Medical image analysis is already a domain where experts in medical science and image processing interact. Currently, the claims of highly accurate computer aided diagnosis of COVID-19 are based on deep learning techniques, mainly found in the medical imaging literature. As is clear from the discussion above, deep learning is a tool explored in multiple domains. Nearly all of those domains claim it to be highly accurate [14]. When this tool is used by non-experts of machine learning (the originating field of modern deep learning) in scenarios dictated by urgency, hype-driven overestimation of its capabilities is highly likely. Our analysis in the subsequent sections reveals that this phenomenon has also affected CT-based COVID-19 diagnosis research.

In the text to follow and Table I, we refer to different evaluation metrics to discuss performance. For the convenience of readers, we provide formal definitions of these metrics upfront, and follow them throughout the paper. For the considered binary classification problem, let us denote true positive predictions by TP, true negatives by TN, false positives by FP and false negatives by FN. The Specificity (Spec.), Sensitivity (Sens.) and F1-Score are then computed as

Spec. = TN/(TN+FP), Sens. = TP/(TP+FN), F1-Score = 2(PPV × TPR)/(PPV + TPR),

where PPV = TP/(TP+FP) and TPR = TP/(TP+FN). Based on these metrics, we compute Accuracy (Acc.) as

Acc. = (TP+TN)/(TP+TN+FP+FN).
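These definitions map directly to code. The following small Python helper (our own illustrative sketch, not taken from [9] or from our experiment scripts) computes all four metrics from confusion-matrix counts.

```python
# Illustrative helper: evaluation metrics for the binary COVID-19
# classification task, computed from confusion-matrix counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # recall = sensitivity
    return {
        "Spec.": tn / (tn + fp) if (tn + fp) else 0.0,
        "Sens.": tpr,
        "F1-Score": 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0,
        "Acc.": (tp + tn) / (tp + tn + fp + fn),
    }

# Example: TP=60, TN=70, FP=10, FN=9 gives Acc. = 130/149 ≈ 0.872.
print(classification_metrics(tp=60, tn=70, fp=10, fn=9))
```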
Related work: Deep learning has been actively explored for COVID-19 screening in CT scans, e.g. [28], [29]. The authors of [30] proposed a deep learning framework, ai-corona, to assist radiologists in diagnosing COVID-19 in CT images. In their experiments, they used more than five thousand CT scans. Mei et al. [31] claim that their model outperformed radiologists in diagnosing COVID-19 positive patients in CT scans. Their approach has three stages, including a CNN followed by an SVM and multi-layer perceptrons. They also combine other clinical information in their predictions. More closely related to our study are the works that focus fully on transfer learning for CT-based COVID-19 diagnosis [7], [8], [9], [10], [15], [16]. The main goal of such contributions is often to empirically establish the best-performing state-of-the-art visual models for transfer learning on CT images for COVID-19. As identified in Table I, so far, these works have claimed a variety of popular visual models, e.g. Xception, ResNets, DenseNet, VGG-19, as the top-performing models. We also frequently witness very high performance reported by these methods while using very limited training data. Of particular relevance to our work is the study presented in [9]. We closely follow [9] in our evaluations; however, we arrive at very different results that demystify the claims of high performance of transfer learning in the context of CT-based COVID-19 diagnosis. Besides the above-mentioned literature, other works are also appearing in this direction. We refer interested readers to [32] for a recent survey.

To provide concrete evidence of the overestimation of transfer learning abilities in the medical imaging literature for our problem, we focus on [9] as the representative existing study. Published in Nature Scientific Reports, this work provides a baseline for transfer learning based COVID-19 detection using CT images of CCD [25]. The authors made their code public. By splitting CCD samples into five different 80%-20% training-validation sets per experiment, they performed 5 experiments per model and provided the validation results. On the unseen validation samples, they reported up to 96.20% mean prediction accuracy of the transferred models. We copy the results of Pham [9] from the original paper in Table II. We defer further discussion on the training and validation data sets of [9] to Section IV, where we also explore other aspects of the data. Here, it is sufficient to note that the results are reported using fewer than 600 annotated CT images in each training session of a given model. The high performance reported in [9] with such limited data seems an excellent prospect for transfer learning, because a ∼100% prediction accuracy appears possible here by using a (relatively) small number of additional training images. However, surprisingly, when we reproduced the results of [9] with the standard five-fold cross-validation protocol, there is a drastic reduction in the claimed performance. The comparison of our results with [9] is given in Table II. We note that, except for the used data samples, we follow [9] in our experiments down to every single detail. That is, as per [9], we first convert the original images to RGB and resize them to the input dimensions of the used CNN. We use stochastic gradient descent with momentum for model optimization, for which the momentum value is set to 0.9. The gradient threshold method with ℓ2-norm is used in the training. We use a batch size of 10, and train the models for 6 epochs to complete the transfer, with a weight decay of 0.0001. The learning rate is set to a fixed value of 0.0003, and the training and validation samples are shuffled before every epoch.
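For readers who wish to replicate this configuration in another toolchain, the following PyTorch sketch mirrors the stated hyper-parameters. It is a minimal illustration under our own assumptions (gradient clipping by global ℓ2-norm stands in for the gradient threshold method, and the clipping threshold value is assumed), not a verbatim copy of any training script used in [9] or in our experiments.

```python
# Illustrative training loop mirroring the stated hyper-parameters:
# SGD with momentum 0.9, fixed learning rate 3e-4, weight decay 1e-4,
# batch size 10, 6 epochs, gradient clipping by total L2-norm.
import torch
from torch import nn
from torch.utils.data import DataLoader

def transfer_train(model: nn.Module, train_set, grad_threshold: float = 1.0):
    # NOTE: grad_threshold is an assumed value; the threshold used with
    # the gradient threshold method is not specified in the text above.
    loader = DataLoader(train_set, batch_size=10, shuffle=True)  # reshuffled every epoch
    optimizer = torch.optim.SGD(model.parameters(), lr=3e-4,
                                momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(6):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            # Clip gradients by their global L2-norm before the update.
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_threshold)
            optimizer.step()
    return model
```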
In all our experiments reported in this paper, we keep this hyper-parameter setting fixed. We explicitly note the exceptions whenever they occur. In Table II, the only difference between our experiments and [9] is in the used data splits. Instead of the random 80%-20% training-validation splits of [9], we perform a more systematic split. In our data splitting, we first sort the samples in CCD (for both the COVID positive and negative sets) by their names used in the dataset. Then, we remove the first 20% of the samples from the data and consider those as the validation set, where the remainder is the training set. This gives us the training and validation sets for the 1st fold. For the 2nd fold, we put back the 20% of samples taken out for the 1st fold, and take out the next 20% of samples for the validation set; the remainder is the training set. We repeat this to construct the training-validation sets for all five folds. This procedure reflects the typical understanding of the 'five-fold' evaluation protocol in the machine learning literature.

Normally, one would expect the mean performance under the standard five-fold protocol and random five splits to roughly coincide. This is because both protocols evaluate performance on unseen data of the same proportion. However, as apparent in Table II, this is not the case here. An overestimation of 25.83% accuracy, 36.77% sensitivity, 30.12% specificity and 0.3 F1-score is identifiable in the table. It is emphasized that these values are simple differences of the means of the metric scores under the two protocols. Converting them to percentages of the original results [9] would lead to even higher numbers. Also notice that, not only does the corresponding performance of the same models differ significantly between the two evaluations, the difference in the best performances across all models for any evaluation metric is also remarkably high. Even the minimum differences for the same models (highlighted gray) are not minor, despite being anomalies in most cases. The results in Table II provide clear proof of performance overestimation.
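For reproducibility, the sequential fold construction described above can be expressed in a few lines of Python. This is our own sketch of the protocol (function and variable names are illustrative), assuming per-class lists of CCD filenames.

```python
# Illustrative sketch of the standard five-fold split used in Table II:
# sort the samples by filename, then use mutually exclusive 20% chunks
# as the validation set of each fold.
def five_fold_splits(filenames):
    ordered = sorted(filenames)                # sort by dataset file names
    n = len(ordered)
    folds = []
    for k in range(5):
        lo, hi = (k * n) // 5, ((k + 1) * n) // 5
        val = ordered[lo:hi]                   # k-th 20% chunk for validation
        train = ordered[:lo] + ordered[hi:]    # remainder for training
        folds.append((train, val))
    return folds

# Applied separately to the COVID-19 positive and negative subsets of CCD,
# so every image appears in exactly one validation set across the folds.
```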
This leads to a natural question: what really caused such a huge disparity in performance? We do not hold the evaluation protocol of [9] responsible for this issue. Provided that the images in a dataset are representative random samples of the same image distribution, the average results of random data splitting and sequential data splitting (as in the five-fold protocol) cannot be expected to differ much. Hence, the evaluation protocol of [9] is reasonable in general. However, here the issue is with how the COVID-CT-Dataset (CCD) [25] is curated. The dataset consists of 349 CT images of COVID-19 positive cases and 397 CT images of negative cases. The samples in the dataset are assembled from different COVID-19 papers published in medRxiv, bioRxiv, MedPix, LUNA and PubMed Central (PMC), etc. The curation process often extracts samples directly from the digital copies of the publications themselves. We refer to [25] for the exact details of the extraction process. Here, we are mainly interested in the nuisance patterns emerging in the dataset due to such a process of sample acquisition. Although it is claimed that a senior radiologist from Tongji Hospital, Wuhan, China has confirmed the practicality of this dataset [25], our opinion from the machine learning viewpoint differs. For the case of learning computational models, this dataset can cause misleading results, unless carefully (pre-)processed.

To corroborate our claim, we show representative examples of samples in the dataset in Fig. 1. To generate the figure, we picked random samples from a testing (i.e. validation) set that we created under the random data splitting of [9]. These images are bounded in red boxes in Fig. 1. We then performed a manual selection of green-boxed images from the remaining training set. These images were picked by a human participant by looking at the corresponding red-boxed images. This participant was given the full training set without labels, and asked to pick the closest match(es) by visual inspection. The participant was unaware of even the fact that these are CT images, let alone having any expertise in COVID-19 diagnosis. Interestingly, the participant picked the corresponding green-boxed images in each cluster (identified by different background colours in Fig. 1) from the correct classes. It is easy to see that the similarities between the red-boxed images and their green-boxed counterparts are hardly COVID-19 relevant.

The dominant COVID-19 irrelevant patterns implicated by our brief experiment in the preceding paragraph cause the performance overestimation in [9]. In CCD [25], it is often the case that multiple samples of a class are acquired as a set from the same source. The source and the acquisition process change for different subsets of samples. Here, we use the term 'source' in a broad manner, i.e. including the CT scanner in the process. Thus, very similar looking images often appear in clusters in CCD, where the similarity is irrelevant to the symptoms of COVID-19. For the two classes, different kinds of clusters appear in the dataset; observe Fig. 1 for representative examples. Within clusters, the dominant COVID-19 irrelevant similarities are often so obvious that a simple visual inspection by a non-expert is sufficient to identify the right cluster, and hence perform accurate classification. Interestingly, in the related literature, we generally do not see data normalization or sophisticated pre-processing [9] to avoid these irrelevant peculiarities of the images. We recommend such pre-processing for this problem; however, evaluating effective strategies for it is not in the scope of this paper. We leave that for future work.
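As a pointer in that direction, one cheap screen for such nuisance clusters is near-duplicate detection before splitting the data. The sketch below is purely illustrative of this idea and is not a method we evaluate in this paper: it hand-rolls an 8×8 average hash and flags image pairs whose hashes nearly coincide.

```python
# Illustrative near-duplicate screen (not evaluated in this paper):
# images whose average-hashes differ in only a few bits are likely
# acquired from the same source and should not straddle a data split.
import numpy as np
from PIL import Image

def average_hash(path: str, size: int = 8) -> np.ndarray:
    gray = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(gray, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()   # 64-bit binary signature

def near_duplicates(paths, max_bits: int = 5):
    hashes = {p: average_hash(p) for p in paths}
    pairs = []
    for i, p in enumerate(paths):
        for q in paths[i + 1:]:
            # Hamming distance between the two binary signatures.
            if int(np.sum(hashes[p] != hashes[q])) <= max_bits:
                pairs.append((p, q))
    return pairs
```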
In the preceding sections, we exposed the overestimated baseline results and their causes for the CCD dataset [25]. To fully address the problem, it is imperative to provide a transparent evaluation for resetting the baseline of transfer learning for COVID-19 detection using CCD CT images. To that end, we perform three sets of experiments that are discussed below.

A. Five-fold evaluation with data augmentation

In [9], we also encounter a counter-intuitive claim that data augmentation considerably degrades the performance of transfer learning for the problem at hand. For deep learning, data augmentation is known to regularize models for better generalization [14]. In the case of limited data, as in CCD, the best practices in deep learning clearly recommend data augmentation [13]. On the other hand, [9] reports significant accuracy reduction across all the models in Table II with data augmentation, claiming up to 15.8% degradation. This can mislead the research community to avoid data augmentation for CT-based COVID-19 diagnosis with deep learning. Our investigation reveals that this result is also an undesired by-product of inappropriate evaluation. We already provided the baseline results for CCD under the standard five-fold protocol in Table II. In Table III, we report the results of our experiments with data augmentation. For each evaluation metric, we also provide the percentage gain over the corresponding results reported in Table II. The results in Table III indicate that data augmentation is indeed helpful in general. For the convenience of readers, we highlight the positive gains in green and performance reductions in red in the table. The best results are also boldfaced. To transparently refute the claims of [9], we follow the exact data augmentation strategy of [9]. That is, we use random reflection in the top-bottom direction, such that the images are reflected vertically with 0.5 probability. Horizontal and vertical translations are applied in the range [-30, 30] pixels, where the distance is selected randomly from a continuous uniform distribution. Random scaling is performed in the range [0.9, 1.1].
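Under our assumptions about the framework, this augmentation strategy corresponds roughly to the following torchvision-style sketch; the 224×224 input size is assumed for translating the pixel ranges into fractions, and the actual experiments may use a different framework's equivalents.

```python
# Illustrative torchvision counterpart of the augmentation strategy above
# (our own sketch, not the original augmentation code).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomVerticalFlip(p=0.5),      # top-bottom reflection
    transforms.RandomAffine(
        degrees=0,                             # no rotation
        translate=(30 / 224, 30 / 224),        # up to ±30 px for 224×224 inputs
        scale=(0.9, 1.1),                      # random scaling
    ),
    transforms.ToTensor(),
])
```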
Interestingly, after data augmentation, DenseNet-201 is no longer the best performing model for any metric, whereas this network is originally claimed to perform the best in [9]. Also, we find that under the correct five-fold evaluation protocol, data augmentation mitigates the over-fitting problem, which is encountered early in the training without data augmentation. Notice that we used 6 training epochs for the results in Table II. Besides following [9], this is because more epochs showed clear signs of over-fitting. With data augmentation, we are able to easily extend our training to 10 epochs without over-fitting. This is the only hyper-parameter we change in Table III (besides data augmentation) in comparison to Table II. We provide further discussion on this topic in the supplementary material of the paper. An acute reader may also notice that whereas we do encounter performance reduction in a few scores in Table III, the standard deviations of most of the scores are lower than those of the 'Actual' scores in Table II. This also indicates appropriate model regularization. Our results in Table III provide the new baseline for five-fold cross-validation on CCD using transfer learning. They also establish data augmentation to be useful for the problem, which is in contrast to [9].

B. Case study I

To further elaborate on the effectiveness of transfer learning for CT-based COVID-19 detection, in this case study we provide results for a near-ideal practical scenario. That is, we assume that for any test sample, our model has already seen a visually similar training sample. The visual similarity is based on non-COVID dominant features, as identified in Section IV. This scenario closely represents the situation where all training data is locally acquired at a medical facility (with the same set of apparatus). Moreover, it automatically accounts for the provision that some samples of the same patients may be included in the training set. We create the test set for this scenario by asking a non-medical expert to identify clusters of similar images by eye-balling CCD samples. Then, we randomly pick one image out of each cluster to use as a test sample. In all, we separate 10% of the samples of both the COVID-19 positive and negative subsets in CCD for testing. We provide the full list of the separated images in the supplementary material of the paper for reproducibility. Following the hyper-parameter settings and data augmentation used in Section V-A, the summary of the results achieved for this case study is reported in Table IV (left). We conduct three experiments for this study, across which the test set remains the same, as permitted by the samples available in CCD. Due to the common test set, we witness small standard deviations in the table, resulting from data augmentation and the random selection of batches. Interestingly, ResNet-101 also performs the best in this case in terms of the overall accuracy. Notice that the considered scenario does have some resemblance to the 'random' splitting of [9]. Hence, we also witness high metric scores for this case study. In our opinion, the results in Table IV (left) provide an optimistic estimate of the upper bound on transfer learning performance in a practical scenario where the model is trained locally by a medical facility using its own limited training data.

C. Case study II

This case study represents a scenario where there is an equal probability that the model training 'has' and 'has not' seen images similar to the test samples. We emulate this scenario by creating a test set that has 50% images chosen following the procedure of case study I. Model training will have seen such images. For the remaining 50% images, we directly select the complete clusters of images that visually appeared unique in the dataset. Again, the similarity criterion is based on medical non-expert perception. Upon reflection, it may be apparent that this case is easier than the five-fold validation in Table III. Since we created the test sets in those experiments from mutually exclusive subsets of sorted images, most of the images in every subset were from unique clusters. Hence, we can expect better performance of the models for case study II as compared to the five-fold experiments. This is exactly what we achieve in Table IV (right).

VI. DISCUSSION AND CONCLUSION

Our investigation has exposed multiple interesting facts about CT-based COVID-19 diagnosis in the context of deep transfer learning. We provide a conclusive summary of these facts in the light of our results below.
1) The literature on CT-based COVID-19 diagnosis with deep transfer learning suffers widely from performance overestimation. Taking a work published in Nature Scientific Reports [9] as a representative example, we expose a drastic overestimation of more than 25% accuracy. Even larger overestimated performance margins persist for the other metric scores. We also found that the problem of overestimation is more common in the 'transfer learning' based diagnosis techniques as compared to the literature developing more sophisticated methods using deep learning. Nevertheless, instances of overestimation can also be found among those methods [19].
2) We establish that the major source of overestimation is inappropriate data curation. It is tempting to blame evaluation protocols for the issue; however, the aptness of an evaluation strongly depends on the data. Intriguingly, a recent review published in Nature Machine Intelligence [32] identifies "bias" in small data and its "poor integration" for image-based COVID-19 diagnosis with machine learning. Our investigation provides the breakthrough in demonstrating how these issues have led to gross performance overestimation of deep transfer learning for CT-based COVID-19 detection.
3) We establish data augmentation to be useful for the problem. Data augmentation is an important tool for deep learning with limited data. Losing this option is highly undesirable for COVID-19 problems that lack large-scale annotated data. Pham [9] showed significant performance reduction due to data augmentation.
Our investigation finds their report to be a by-product of performance overestimation with un-augmented data. Our experiments identify reasonable regularization of the models with data augmentation, which should generally be expected. Data augmentation reduces the erratic performance of the models on the training data across different networks, see Fig. 2. Very high training data accuracy with minimal variation, but considerably lower validation accuracy, indicates over-fitting. Data augmentation successfully avoids that for CT-based COVID-19 detection with transfer learning.

Fig. 2. Training accuracies for the five-fold evaluation of all models without and with data augmentation on CCD [25]. Without augmenting the training data, the models generally achieve higher mean accuracies, often with very small variance. However, their accuracies on the validation sets remain low, see 'Actual' results in Table II. With data augmentation, the training accuracies across the models are less erratic. Despite slightly lower mean training accuracies, the performance of the models generally improves on the unseen validation data, see Table III. This confirms the expected benefits of data augmentation for CT-based COVID-19 detection.

Overall, our comprehensive investigation provides vital information in demystifying transfer learning abilities for COVID-19 detection. Based on our findings, we urge the research community to make every effort to avoid hype-driven outcomes. Despite the urgency of the matter, researchers must ensure thorough investigation before communicating results. Superficial investigations for the sake of demonstrating rapid developments are counter-productive for COVID-19 research. We hope that our work provokes the medical imaging community in general, and COVID-19 researchers in particular, to evaluate modern technological tools more critically for their problems.

REFERENCES

[1] Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China.
[2] WHO director-general's remarks at the media briefing on 2019-nCoV.
[3] WHO director-general's opening remarks at the media briefing on COVID-19, 11 March 2020.
[4] Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal.
[5] Diagnosis of the coronavirus disease (COVID-19): rRT-PCR or CT?
[6] Sensitivity of chest CT for COVID-19: comparison to RT-PCR.
[7] Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: results of 10 convolutional neural networks.
[8] COVID-19 pneumonia diagnosis using a simple 2D deep learning framework with a single chest CT image: model development and validation.
[9] A comprehensive study on classification of COVID-19 on computed tomography with pretrained convolutional neural networks.
[10] Deep transfer learning based classification model for COVID-19 disease.
[11] Rapid AI development cycle for the coronavirus (COVID-19) pandemic: initial results for automated detection & patient monitoring using deep learning CT image analysis.
[12] Boosting deep transfer learning for COVID-19 classification.
[13] Going deep in medical image analysis: concepts, methods, challenges, and future directions.
[14] Deep learning.
[15] Deep transfer learning-based automated detection of COVID-19 from lung CT scan slices.
[16] Diagnosis of COVID-19 using CT scan images and deep learning techniques.
[17] A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images.
[18] A fully automated deep learning-based network for detecting COVID-19 from a new and large lung CT scan dataset.
[19] Deep learning for detecting corona virus disease 2019 (COVID-19) on high-resolution computed tomography: a pilot study.
[20] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
[21] End-to-end automatic differentiation of the coronavirus disease 2019 (COVID-19) from viral pneumonia based on chest CT.
[22] Deep learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: a multicentre study.
[23] Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images.
[24] ImageNet: a large-scale hierarchical image database.
[25] COVID-CT-Dataset: a CT scan dataset about COVID-19.
[26] Representation learning: a review and new perspectives.
[27] ImageNet classification with deep convolutional neural networks.
[28] Deep learning-based detection for COVID-19 from chest CT using weak label.
[29] Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography.
[30] ai-corona: radiologist-assistant deep learning framework for COVID-19 diagnosis in chest CT scans.
[31] Artificial intelligence-enabled rapid diagnosis of patients with COVID-19.
[32] Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans.

ACKNOWLEDGMENT

This work was supported by an Australian Government Research Training Program Scholarship. Dr. Naveed Akhtar is the recipient of an Office of National Intelligence Postdoctoral Grant funded by the Australian Government.

[Table IV caption fragment: Results on CCD [25]. Case Study I allows visually similar images in the training and test set. Case Study II allows 50% test images for which the training data may not contain visually similar samples. The 'similarity' is from the perspective of salient image ...]