key: cord-1014651-h0a2q8g5
authors: Santosh, KC; Ghosh, Sourodip
title: Covid-19 Imaging Tools: How Big Data is Big?
date: 2021-06-03
journal: J Med Syst
DOI: 10.1007/s10916-021-01747-2
sha: cee6c5e22c08369367267acabd41ec85e7eeca3d
doc_id: 1014651
cord_uid: h0a2q8g5

In this paper, considering the year 2020 and Covid-19, we analyze medical imaging tools and their performance scores in accordance with the dataset size and their complexity. For this, we mainly consider AI-driven tools that employ two different types of image data, namely chest Computed Tomography (CT) and X-ray. We elaborate on their strengths and weaknesses by taking the following important factors into account: i) dataset size; ii) model fitting criteria (over-fitting and under-fitting); iii) transfer learning in the deep learning era; and iv) data augmentation. Medical imaging tools do not explicitly analyze model fitting. Also, using transfer learning with fewer data, one could possibly build a Covid-19 deep learning model, but such models are limited to education and training. We observe that, in both image modalities, neither dataset size nor data augmentation works well for Covid-19 screening purposes, because a large dataset does not guarantee coverage of all possible Covid-19 manifestations and data augmentation does not create new Covid-19 cases.

The novel coronavirus (nCoV), originally known as SARS-nCoV-2, has become one of the most threatening viruses to human lives in the last hundred years [1]. Due to the exponential rise in the number of cases, the World Health Organization (WHO) declared Covid-19 a pandemic in March 2020 [2]. The primary symptoms of Covid-19 are headaches, muscle pain, cough, common cold, occasional fevers, and, in several vulnerable cases, breathing problems [3, 4]. The disease can also be asymptomatic. Therefore, detecting its presence by clinical prognosis becomes cumbersome.
It is currently confirmed with a Reverse Transcription Polymerase Chain Reaction (RT-PCR) test, which is considered the gold standard [5]. However, it is expensive and time consuming, as it requires adequate testing centers and clinical experts. Medical experts and clinicians have tirelessly contributed towards the early results of screening trials of this virus. Acquiring test results quickly offers two main advantages: i) the subject can be moved to a diagnosis care center sooner, preventing further spread; and ii) the recovery chances improve with a faster diagnostic time.

Artificial Intelligence (AI) has promoted countless contributions in the field of medical imaging. Healthcare tools have advanced the quality of screening procedures in the Covid-19 era [6-8]. Machine Learning (ML) and Deep Learning (DL) based tools for Covid-19 prognosis and diagnosis have utilized statistical approaches to extract normal/abnormal patterns in chest Computed Tomography (CT) and/or X-rays [9]. This is done to predict the possibility of a Covid-19 affected lung region, which reduces the prognosis time and helps determine the need for an RT-PCR test. Computer-Aided Diagnosis (CADx) tools built on DL using CT and X-ray images, with custom Neural Networks (NNs) and with or without transfer learning, have been proposed [10-14]. Training and validating Covid-19 screening-based CADx tools typically involve acquisition of image data (positive and negative classes) and feature-based pattern analysis using imaging tools [15]. Up-to-date ML and/or DL models are deployed to prevent possible risks to human lives [16, 17].

We consider both types of chest image data, CT and X-ray, and elaborate on the performance of imaging tools in accordance with the data size. We are aware of thousands of research articles published in the year 2020 [18].
We, however, consider only medical imaging tools that employ chest CT and X-ray image data, excluding pre-prints from sources such as arXiv, medRxiv, and TechRxiv.

The remainder of the paper is organized as follows. In "Medical imaging tools: Chest CT scans and X-rays", we review Covid-19 screening models using chest CT images (ref. Chest CT imaging) and X-ray images (ref. Chest X-ray imaging). We then discuss how big data is big in "How big data is big?" by taking both image modalities into account. "Conclusion" concludes the paper.

As mentioned earlier, for Covid-19, we elaborate on the use of chest CT imaging methods based on their performance by taking dataset size into account. In what follows, we consider 16 different research articles that contributed to detecting Covid-19 positive cases in 2020 (see Table 1). Farid et al. [19] devised a Convolutional Neural Network (CNN) based approach to classify Covid-19 and SARS images (51 in each class). Using 10-fold cross-validation, they reported an accuracy of 94.11%. Singh et al. [20] developed a CNN using a multi-objective differential evolution (MODE) technique. Using 150 CT images (75 in each class) and hold-out validation (90:10), an accuracy of 93.25% was reported. Hasan et al. [21] used handcrafted features from Q-deformed entropy to distinguish between healthy lung, pneumonia, and Covid-19 CT slices. A long short-term memory (LSTM) architecture enabled them to achieve 99.68% accuracy on 321 subjects. A notable study was conducted by Mukherjee et al. [22]. [...] The authors of [25] analyzed 495 CT subjects that were collected from three different hospitals in China. They used a DL-based multi-view fusion model and classified Covid-19 and pneumonia with an accuracy of 0.76 and an AUC of 0.819 on the testing set, which comprised 50 subjects. Pathak et al. [26] conducted an experiment with Covid-19 CT images using a deep transfer learning method, taking a pre-trained ResNet50 architecture as the baseline.
Using a 10-fold cross-validation approach on a balanced dataset of size 826, they achieved an accuracy of 93.01%. Amyar et al. [27] optimized segmentation and classification performance by training/validating on 1,369 images, of which 449 were Covid-19 CT images. They achieved a Dice coefficient score of 0.88 and an AUC of 97%. Li et al. [28] used CT data collected across 6 different hospitals. Using a ResNet50 architecture on a dataset of 3,322 subjects, they achieved an AUC score of 0.96. Ardakani et al. [29] utilized 1,020 Covid-19 affected CT images. They studied 10 different DNN architectures and achieved the best accuracy of 99.51% (with AUC = 0.994 and sensitivity = 100%) from the ResNet101 model. Ko et al. [30] used four DNNs, namely VGG16, ResNet50, InceptionV3, and Xception. With access to 3,993 CT images (Covid-19: 1,194; other pneumonia: 1,357; non-pneumonia: 1,442) across two hospitals and a public database, ResNet50 achieved the best accuracy of 99.87%. Alshazly et al. [31] experimented on two different CT datasets and used seven different DNNs. They used 5-fold cross-validation and achieved accuracies of 99.4% and 92.9% on the two datasets, respectively. Ni et al. [32] implemented a deep learning model, trained and validated with CT data acquired from 14,435 subjects. The method detects, segments, and locates lesions, with a sensitivity and F1-score of 100% and 97%, respectively, on a per-patient basis. Zhou et al. [33] combined the AlexNet, GoogleNet, and ResNet18 architectures in a majority-voting ensemble. With a transfer learning approach and a 5-fold cross-validation training procedure involving 7,500 CT images, equally distributed among the lung tumor, Covid-19 positive, and normal classes, they achieved an accuracy of 99.05%. Chen et al. [34] developed a Covid-19 CT screening tool validated on 46,096 images from Renmin Hospital of Wuhan University.
Using a model pre-trained on the ImageNet dataset, they achieved accuracies of 95.24% and 96% on internal and external test datasets, respectively.

As with CT imaging tools/techniques, we review 24 different works on chest X-ray imaging, as shown in Table 2. Alqudah et al. [35] used a CNN to extract features from 79 images in total and reported an accuracy of 95.2%. [...] Marques et al. [56] employed a DNN architecture known as EfficientNet to detect Covid-19 positive cases. In their test on 1,508 images (Covid-19 cases = 504), they achieved an accuracy of 96.70% (multi-class). Das et al. [57] used chest X-rays of different categories (TB, Covid-19 positive, pneumonia, and control) and divided them into six different datasets. They trained a truncated Inception-V4 architecture and tested it on these six datasets separately using a cross-validation approach. This allowed them to achieve an average accuracy of 98.77% with a standard deviation of ±0.702.

Needless to say, the aforementioned research articles (see Tables 1 and 2) have used different feature extractors, decision-making processes, and experimental setups. More importantly, for Covid-19, their dataset sizes varied over time, and so did the sources. For a fair analysis, we do not discuss their methodologies and/or techniques; rather, we focus on dataset size. We then elaborate on the strength of machine learning and deep learning algorithms by taking the following factors into account: model fitting, transfer learning in the era of deep learning, and data augmentation.

1. Dataset: For easy understanding, we organize the research articles in both Tables 1 and 2 in accordance with the dataset size. In machine learning, it is commonly stated that the bigger the data, the better the performance. This does not hold true here, as what matters is collecting all possible Covid-19 manifestations rather than just increasing the number of images. We have not observed better results from bigger datasets.
We are aware that collecting Covid-19 data at the beginning of 2020 was not trivial. Authors, however, worked on a fairly large dataset of 46,096 images (chest CT) in late 2020, as compared to a dataset of 100 images or so in early 2020. This, again, does not guarantee that imaging tools are ready for mass screening. If so, then how big is big data? Machine learning tools need to learn all possible manifestations related to a particular disease (Covid-19, in our case); dataset size alone is not enough. A larger dataset does open the possibility of including new cases (i.e., manifestations), but this is not always the case. Apart from model fitting issues, multiple works suggest using deep CNNs. However, comparing them with shallow CNNs, we find only marginal differences in performance. The advantages of computer vision tools in this modern era have allowed researchers to leverage datasets of any size and focus on methods that guarantee better performance in validation and testing, both internal and external.

2. Model fitting: Traditionally, in machine learning, under-fitting and over-fitting situations are explicitly discussed/analyzed. These, however, have not been analyzed well in Covid-19 screening tools (see Tables 1 and 2). More often, authors were engaged in producing better performance scores by tuning (hyper)parameters. If so, better results may be due to the test set containing images similar to those in the training set. Above all, the hold-out validation approach is one of the issues. Also, performance can be biased when imbalanced datasets are used.

3. Transfer learning: In the deep learning era, transfer learning plays a crucial role in the computer vision field. It focuses on gaining knowledge while solving one problem and applying it to different but related problems.
The primary idea is to initially train models on a larger dataset to understand basic details (e.g., visual cues such as edges, nodes, and shapes). The trained models can then be used on the target dataset, so that such basic features need not be learned from scratch. For Covid-19 imaging tools, we observe that a handful of authors used transfer learning. They, however, reported only better scores rather than providing explainable features/models. This brings up an open question: do their performance scores show that their imaging tools (with transfer learning) are robust enough to generalize?

4. Data augmentation: Availability of data is a serious challenge in deep learning, especially in healthcare. Even when sufficient data are collected in one domain, the trained model may not necessarily generalize to another application (even a different application in the exact same domain). This requires domain adaptation, a sub-field of transfer learning that helps alleviate the domain shift in such cases. Covid-19 is no exception to this. Data augmentation is often used in data analysis to increase the available raw data by adding slightly modified copies of the source or, in some cases, synthetic images generated from existing data. In general, it includes horizontal or vertical flips, rotation, noise injection, cropping, color modification, and random erasing. Although data augmentation has contributed greatly to general object detection and recognition, it faces challenges when clinical experts seek clinical implications. Even though the process seems trivial in the computer vision domain, augmented data may not carry clinical significance (e.g., for Covid-19 +ve, lung cancer, pneumonia, or normal classes).

In this paper, for Covid-19 screening, we have analyzed 40 research articles (16 CT + 24 X-ray), excluding pre-prints and conference proceedings.
Our analysis is limited to whether the performance scores of medical imaging tools depend on dataset size. In both image modalities, CT and X-ray, we observed that performance did not improve in accordance with the dataset size. In addition, we noticed the possibility of over-fitting in early 2020. On the other hand, we did not observe that a large dataset improved results, since it did not guarantee that we had all possible Covid-19 manifestations. Besides, we observed that data augmentation worked well in improving results. We, however, did not find that the augmentation process can create new Covid-19 manifestations. As reported in the computer vision domain, transfer learning could possibly make a Covid-19 deep learning model ready with fewer data. This did not hold true for Covid-19 cases, as most such models are limited to education and training. Therefore, for an outbreak such as Covid-19, we need to deploy AI-driven Covid-19 screening tools that consider active learning with the aim of developing cross-population train/test models [15]. Active learning helps models learn from data over time, so we need not wait weeks, months, or years to build AI-driven tools.

Ethical approval: This article does not contain any studies with human participants performed by any of the authors.

The authors declare no conflicts of interest.
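As a supplementary illustration of the model-fitting concern raised in the discussion (performance can be biased when imbalanced datasets are used), the following is a minimal Python sketch. It is not taken from any of the reviewed studies: the class counts (900 negative, 100 Covid-19 positive), the always-negative classifier, and the helper names `confusion_counts` and `screening_scores` are all hypothetical and chosen only to make the effect visible.

```python
"""Sketch: why accuracy alone can mislead on an imbalanced test set."""


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary screening problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def screening_scores(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity and balanced accuracy."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn
    # Guard against an empty class so the ratios stay defined.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical imbalanced test set (assumption, not data from the paper):
# 900 negative cases (label 0) and 100 Covid-19 positive cases (label 1).
y_true = [0] * 900 + [1] * 100

# A trivial "model" that labels every image as negative.
y_pred = [0] * len(y_true)

scores = screening_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed counts, the trivial classifier reaches an accuracy of 0.90 while detecting no positive case at all (sensitivity 0.00, balanced accuracy 0.50), which is why reporting accuracy alone on an imbalanced dataset says little about screening readiness.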
[1] A new coronavirus associated with human respiratory disease in China
[2] WHO declares Covid-19 a pandemic
[3] Clinical features of patients infected with 2019 novel coronavirus in
[4] Clinical features of Covid-19
[5] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (Covid-19) in China: A report of 1014 cases
[6] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for Covid-19
[7] Covid-19 and artificial intelligence: Protecting healthcare workers and curbing the spread
[8] Artificial intelligence (AI) applications for Covid-19 pandemic
[9] Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (Covid-19): A systematic review
[10] Artificial intelligence-enabled rapid diagnosis of patients with Covid-19
[11] Serial quantitative chest CT assessment of Covid-19: Deep-learning approach
[12] Deep learning Covid-19 features on CXR using limited training data sets
[13] Inf-Net: Automatic Covid-19 lung infection segmentation from CT images
[14] Using X-ray images and deep learning for automated detection of coronavirus disease
[15] AI-driven tools for coronavirus outbreak: Need of active learning and cross-population train/test models on multitudinal/multimodal data
[16] Covid-19 prediction models and unexploited data
[17] Revisited Covid-19 mortality and recovery rates: Are we missing recovery time period?
[18] The published scientific literature on Covid-19: An analysis of PubMed abstracts
[19] A novel approach of CT images feature analysis and prediction to screen for corona virus disease (Covid-19)
[20] Classification of Covid-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks
[21] Classification of Covid-19 coronavirus, pneumonia and healthy lungs in CT scans using Q-deformed entropy and deep learning features
[22] Deep neural network to detect Covid-19: One architecture for both CT scans and chest X-rays
[23] A deep learning system to screen novel coronavirus disease
[24] A deep transfer learning model with classical data augmentation and CGAN to detect Covid-19 from chest CT radiography digital images
[25] Deep learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A multicentre study
[26] Deep transfer learning based classification model for Covid-19 disease
[27] Multi-task deep learning based CT imaging analysis for Covid-19 pneumonia: Classification and segmentation
[28] Artificial intelligence distinguishes Covid-19 from community acquired pneumonia on chest CT
[29] Application of deep learning technique to manage Covid-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks
[30] Covid-19 pneumonia diagnosis using a simple 2D deep learning framework with a single chest CT image: Model development and validation
[31] Explainable Covid-19 detection using chest CT scans and deep learning
[32] A deep learning approach to characterize 2019 coronavirus disease (Covid-19) pneumonia in chest CT images
[33] The ensemble deep learning model for novel Covid-19 on CT images
[34] Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography
[35] Covid-2019 detection using X-ray images and artificial intelligence hybrid systems
[36] Covidiagnosis-Net: Deep Bayes-SqueezeNet based diagnostic of the coronavirus disease 2019 (Covid-19) from X-ray images
[37] Within the lack of chest Covid-19 X-ray dataset: A novel detection model based on GAN and deep transfer learning
[38] Automated detection of Covid-19 cases using deep neural networks with X-ray images
[39] Shallow convolutional neural network for Covid-19 outbreak screening using chest X-rays
[40] A deep learning framework for coronavirus disease (Covid-19) detection in X-ray images
[41] Deep learning system for Covid-19 diagnosis aid using X-ray pulmonary images
[42] A modified deep convolutional neural network for detecting Covid-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2
[43] Deep learning approaches for Covid-19 detection based on chest X-ray images
[44] Deep learning Covid-19 detection bias: Accuracy through artificial intelligence
[45] Application of deep learning for fast detection of Covid-19 in X-rays using nCOVnet
[46] A novel medical diagnosis model for Covid-19 infection detection based on deep features and Bayesian optimization
[47] Covid-19: Automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
[48] Convolutional CapsNet: A novel artificial neural network approach to detect Covid-19 disease from X-ray images using capsule networks
[49] Explainable deep learning for pulmonary disease and coronavirus Covid-19 detection from X-rays
[50] A deep learning approach to detect Covid-19 coronavirus with X-ray images
[51] CoroNet: A deep neural network for detection and diagnosis of Covid-19 from chest X-ray images
[52] Attention-based VGG-16 model for Covid-19 chest X-ray image classification
[53] New bag of deep visual words based features to classify chest X-ray images for Covid-19 diagnosis
[54] Covid-Net: A tailored deep convolutional neural network design for detection of Covid-19 cases from chest X-ray images
[55] The investigation of multiresolution approaches for chest X-ray image based Covid-19 detection
[56] Automated medical diagnosis of Covid-19 through EfficientNet convolutional neural network
[57] Truncated Inception Net: Covid-19 outbreak screening using chest X-rays

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.