key: cord-0216608-vm3rs9m4
authors: Altaf, Fouzia; Islam, Syed M.S.; Janjua, Naeem K.; Akhtar, Naveed
title: Boosting Deep Transfer Learning for COVID-19 Classification
date: 2021-02-16
journal: nan
DOI: nan
sha: dc312deef34a28f374c73dde455aea3de4bc93ef
doc_id: 216608
cord_uid: vm3rs9m4
note: This work was supported by an Australian Government Research Training Program Scholarship.

COVID-19 classification using chest Computed Tomography (CT) has been found pragmatically useful by several studies. Due to the lack of annotated samples, these studies recommend transfer learning and explore the choices of pre-trained models and data augmentation. However, it is still unknown if there are better strategies than vanilla transfer learning for more accurate COVID-19 classification with limited CT data. This paper provides an affirmative answer, devising a novel 'model' augmentation technique that allows a considerable performance boost to transfer learning for the task. Our method systematically reduces the distributional shift between the source and target domains and considers augmenting deep learning with complementary representation learning techniques. We establish the efficacy of our method with publicly available datasets and models, along with identifying contrasting observations in the previous studies.

COVID-19 classification with images is receiving increasing attention [1], with Computed Tomography (CT) as the leading modality to leverage the super-human predictive abilities of deep learning [2] for this critical task [3], [4]. CT scans are widely used for assessing the severity and progression of lung infections [5]. This makes reliable computer-aided predictions with CT scans highly relevant to eventually curbing COVID-19. Consequently, there have been multiple studies exploring practices to maximize deep learning performance for this task.

Considering the current lack of clean annotated data, transfer learning with ImageNet [6] pre-trained models is the most widely adopted strategy in the current literature. Zhao et al. [7] provided a baseline for COVID-19 classification with public CT-scan images, employing transfer learning. Similarly, [8] used transfer learning to report results for ten pre-trained models on a dataset of 106 COVID-19 and 86 non-COVID-19 patients. The results are reported on images pre-processed for region-of-interest identification. Building on the pre-trained ResNet50 [9], Dadario et al. [10] proposed COVNet to detect COVID-19 using 4,356 3D CT scans of 3,322 patients. More examples of employing natural image-based pre-trained deep visual models for COVID-19 detection with CT-scans can also be found, e.g. [11], [12]. Except for a very few, e.g. [7], the datasets used by the existing works are private. Moreover, the requirement of pre-processing for region-of-interest extraction makes their techniques less attractive. Pham [5] provided a comprehensive study of transfer learning for 16 ImageNet models using a public dataset [7]. Besides reporting DenseNet201 [13] as a promising architecture for the task, Pham also reported that data augmentation often has a deteriorating effect on vanilla transfer learning for the problem. This finding further caps the training data size for the task, where correctly annotated data is already limited. To circumvent the above issue, we investigate if it is possible to augment the classification 'model' (instead of the training data) to boost COVID-19 classification performance under transfer learning.
We provide an affirmative answer to this question with the help of a technique that leverages the fundamentals of machine learning for the performance gain. Our method focuses on systematically reducing the distributional shift between the pre-trained model of natural images [6] and COVID-19 CT-scan images. Moreover, we view deep learning through the lens of representation learning, and augment the overall prediction model with sparse [14] and dense collaborative representation learning [15]. We demonstrate that our technique is able to considerably boost the accuracy of COVID-19 classification with limited training data.

[Fig. 1. (Left) The patterns in ImageNet [6] samples, besides being colored, are very different from CT-scans, identifying a large distributional shift between the domains. The distributional shift is expected to be much smaller between CT scans and X-rays due to the apparent similarities in the images, besides both being grey-scale domains. (Right) We propose to first transfer an ImageNet model M_s(.) to an intermediate model M_i(.) of chest X-rays by adding extra layers that can process grey-scale images. We transfer M_s(.) to M_i(.) with a relatively large amount of data [16]. Then, we transfer M_i(.) to the target model M_t(.) with the available small COVID-19 CT-scan data. We also augment the predictions with sparse [14] and dense collaborative representations [15].]

Before introducing the proposed technique, we first highlight the bottleneck of transfer learning for CT-scan-based COVID-19 classification, which has so far kept researchers from achieving the desired level of accuracy with deep learning. For the discussion, let us denote a deep neural model as a function M(x ∼ X, Θ), where x is a sample of the distribution X and Θ is the set of model parameters, a.k.a. weights. Under the classification learning objective, the model aims at encoding the distribution X, which is possible by optimising Θ over a considerably large set of samples from X. If the sample size is small, M struggles to model X faithfully. Transfer learning is then employed, which aims at computing a mapping Ψ : M_s(z ∼ Z, Θ_s) ↦ M_t(x ∼ X, Θ_t), where M_s(z ∼ Z, Θ_s) is a model pre-trained on a large number of samples of a source distribution Z, and the target model M_t(.) must be estimated from a training set X that is only a small subset of the observed samples of the distribution X. Given a fixed training set X, the efficacy of the mapping Ψ is mainly governed by the distributional shift ||Z − X||. The smaller the shift, the more representative M_s(.) is of the distribution X, which is desired for better classification by M_t(.) in X's domain. Unfortunately, the distributional shift between the colored natural images of ImageNet [6] and the grey-scale images of CT-scans is too large, see Fig. 1 (left), which compromises the mapping Ψ. Clearly, increasing the size of the training set X could help, because a larger distributional shift entails a larger ||Θ_s − Θ_t||, which can be accounted for with a more comprehensive representation of the distribution X in the set X. However, [5] demonstrates that increasing X synthetically does not help for this task. Under our systematic treatment of the problem, we can remark that the data augmentation techniques used in [5] are not able to make X more representative of the distribution X. Provided that improving X is implausible, we aim at improving the mapping function itself. Namely, we route Ψ through an intermediate model M_i(y ∼ Y, Θ_i) whose domain Y has a much smaller shift with X, and we can still compute a reasonable approximation of M_i(.) by transferring M_s(.) to it, because we can arrange for a larger number of samples of Y.
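For concreteness, the two-stage transfer described above can be written as a composition of mappings. This is only a sketch of the formulation; the composition symbols Ψ_1 and Ψ_2 are introduced here for illustration.

```latex
% Direct transfer: a single mapping from the ImageNet model to the CT-scan model.
% Its quality is limited by the large shift \|\mathcal{Z}-\mathcal{X}\| and the small training set X.
\Psi:\; \mathcal{M}_s(z \sim \mathcal{Z}, \Theta_s) \;\mapsto\; \mathcal{M}_t(x \sim \mathcal{X}, \Theta_t)

% Proposed: route the transfer through an intermediate chest X-ray model.
\Psi \;=\; \Psi_2 \circ \Psi_1, \qquad
\Psi_1:\; \mathcal{M}_s(z \sim \mathcal{Z}, \Theta_s) \mapsto \mathcal{M}_i(y \sim \mathcal{Y}, \Theta_i), \qquad
\Psi_2:\; \mathcal{M}_i(y \sim \mathcal{Y}, \Theta_i) \mapsto \mathcal{M}_t(x \sim \mathcal{X}, \Theta_t)

% Rationale: \|\mathcal{Y}-\mathcal{X}\| \ll \|\mathcal{Z}-\mathcal{X}\|, and \Psi_1 can be
% estimated well because ample labelled samples of \mathcal{Y} are available [16].
```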
Thus, we reduce the distributional shift between the source and target models with an intermediate model that has a smaller shift with the target model, while also allowing a better transfer of the source model due to the availability of more training data. We give details of the exact procedure in Sec. 3.

Our second major inspiration comes from looking at deep visual models from the representation learning viewpoint. The model M_t(.) learns a representation of X to map its samples onto a discriminative feature space for classification. Incidentally, deep learning is not the only representation learning technique available for that purpose. Sparse [14] and dense collaborative representations [15] have also been used effectively for this task. In contrast to the highly non-linear representation learned by deep learning, these methods focus on linear spaces for data modeling. Hence, one can expect them to augment deep learning with their complementary representations. Our results in Sec. 4 verify this.

We illustrate the proposed method in Fig. 1 (right) and describe it below following the provided schematics.

Source model M_s(.): For the underlying transfer learning task, we follow the common practice of using natural images as the source domain [5], [8], [11]. The models are pre-trained on 1 million labelled images of ImageNet [6], mapping 224 × 224 × 3 color images to 1,000 class labels.

Intermediate model M_i(.): Considering that our target domain of CT-scans consists of large grey-scale images, we first introduce slight architectural modifications to M_s(.), while preserving its original weights. Concretely, we enforce a larger single-channel input of size 448 × 448 by adding an additional convolutional layer whose output is a 224 × 224 × 3 tensor. For this modification, our strategy is to keep the kernel size and stride similar to those of the first convolutional layer of the original model, and to use three filters so that the layer outputs a 3-channel feature map. We use the original activation functions and employ Batch Normalisation when the original model uses it. We aim at training the new layer and also fine-tuning the remaining model for an 'intermediate' domain to obtain the intermediate model M_i(.). We choose chest radiography images as our intermediate domain, for which large-scale annotated data, ChestX-ray14 [16], is available for thoracic disease classification. Being large grey-scale medical images, this data domain is closer to the CT-scan images, see Fig. 1 (left). From [16], we select a balanced subset of 775 images per class for 10 classes, and alter the output layer of the modified network to predict those classes. We tune the resulting network in a three-step scheme. First, we only learn the newly added input layer and the modified output layer for 5 epochs with a learning rate of 0.001 using the Adam optimizer. This step is intended only for a reasonable initialization. We then reduce the learning rate by a factor of 10 and fine-tune these layers for 5 more epochs, augmenting the data with random rotation in [-7, 7] degrees, horizontal flipping and cropping. For cropping, we select the central 850 × 850 region of the 1024 × 1024 images. The network is fed with 448 × 448 × 1 inputs. In the end, we again reduce the learning rate by a factor of 10 and allow 5 more epochs to fine-tune the complete model with the augmented data. Note that data augmentation here is used only as a regularization mechanism for M_i(.), to avoid over-fitting to the intermediate domain.
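The following PyTorch snippet is a minimal sketch of the input-adapter modification described above, assuming a ResNet50 backbone (so the new layer copies its 7 × 7 kernel and stride of 2). The names GrayInputAdapter and build_intermediate_model are hypothetical and only illustrate the idea; they are not from the original implementation.

```python
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 weights API assumed

class GrayInputAdapter(nn.Module):
    """Extra layer mapping a 448x448 single-channel input to a 224x224x3 tensor.

    Kernel size and stride mirror the backbone's first convolution (7x7, stride 2
    for ResNet50); three filters produce the 3-channel map the pre-trained model expects.
    """
    def __init__(self, kernel_size=7, stride=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 3, kernel_size, stride=stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(3)        # the backbone also uses BatchNorm
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (B, 1, 448, 448)
        return self.act(self.bn(self.conv(x)))   # -> (B, 3, 224, 224)

def build_intermediate_model(num_classes=10):
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new output layer
    return nn.Sequential(GrayInputAdapter(), backbone)

# Step 1 of the three-step scheme: train only the new input and output layers.
model = build_intermediate_model()
for p in model[1].parameters():            # freeze the original backbone ...
    p.requires_grad = False
for p in model[1].fc.parameters():         # ... except its new output layer
    p.requires_grad = True
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```

The later steps would unfreeze the remaining layers and reduce the learning rate by a factor of 10, as described in the text.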
Target model M_t(.): To transfer M_i(.) to the target domain of CT-scan images, we use 448 × 448 × 1 inputs obtained by resizing the grey-scale CT-scan images. Besides the advantage of transferring a model of grey-scale medical images to the CT-scan domain, notice that we are also able to use a larger input size (i.e. 448 × 448 vs 224 × 224). This is beneficial because larger images contain more information, providing more discriminative patterns. We obtain M_t(.) by further fine-tuning M_i(.) for 6 epochs with the grey-scale images from the target domain. We use 5e-4 as the learning rate for the whole model, except for the output layer, for which the rate is 10 × 5e-4 because that layer is added anew to account for the binary classification problem at hand.

Sparse & dense collaborative representation: Sparse representation [14] encodes a sample, say s ∈ R^m, as a sparse linear combination of the columns of a dictionary D ∈ R^{m×n}, such that Dα ≈ s and ||α||_0 ≤ k, where ||.||_0 denotes the ℓ0 pseudo-norm of a vector. The constraint ||α||_0 ≤ k does not allow α to have more than k non-zero coefficients; hence, the representation vector α is sparse. Removing the sparsity constraint renders α dense. In order to make these representations collaborative, we must construct D such that its columns (i.e. the basis vectors) form discriminative subspaces for each class label involved in the problem. We treat the activation vector before the logits of our final model as a basis vector for D. Extracting these vectors for the training samples and arranging them class-wise in a matrix constructs D in our approach. Using that, we compute the sparse representation vector of s with the well-established Orthogonal Matching Pursuit (OMP) technique [18]. Here, s is the activation vector of M_t(.) for a test sample. For the dense representation vector, we let α = (D^T D + λI)^{-1} D^T s, where I is an identity matrix and λ is a scalar. From the linear algebra viewpoint, the computed α is a regularized least-squares projection of s onto the discriminative subspace formed by D. We fuse the two representation vectors by simply normalizing and adding them.

Label prediction: The sparse and dense representations are computed only at the prediction stage. The fused representation vector for a test sample is further combined with the prediction of the target model M_t(.), for which a simple strategy is adopted. That is, we first add all the coefficients of the fused representation vector that belong to the same class; these can be identified because our dictionary is arranged class-wise. Then, we add the resulting vector to the softmax activations of M_t(.). The intuition is simple: a representation vector for a sample of a given class tends to use the dictionary columns belonging to that class more actively. Thus, the corresponding coefficients of the vector take higher values, which we can use to amplify the softmax scores of M_t(.). In the end, we choose the maximum augmented softmax score to decide the prediction label.
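The following NumPy/scikit-learn snippet is a minimal sketch of the collaborative representation and score fusion described above. The function names (build_dictionary, augmented_prediction) are hypothetical, and normalizing the dictionary columns is our assumption rather than something stated in the text; the values k = 50 and λ = 2 are those reported in the experiments.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def build_dictionary(train_feats, train_labels):
    """Arrange pre-logit activation vectors class-wise as the columns of D.

    train_feats: (N, m) activations of the target model for the training samples.
    Returns D of shape (m, N) with columns grouped by class, and the column labels.
    """
    order = np.argsort(train_labels, kind="stable")
    D = train_feats[order].T
    # Column normalization is a common choice for such dictionaries (our assumption).
    D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)
    return D, np.asarray(train_labels)[order]

def fused_class_scores(D, col_labels, s, k=50, lam=2.0):
    """Sparse (OMP) + dense (regularized least-squares) codes, summed per class."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    alpha_sparse = omp.fit(D, s).coef_                    # ||alpha||_0 <= k
    alpha_dense = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ s)
    fused = (alpha_sparse / (np.linalg.norm(alpha_sparse) + 1e-12)
             + alpha_dense / (np.linalg.norm(alpha_dense) + 1e-12))
    classes = np.unique(col_labels)                       # sorted class labels
    return np.array([fused[col_labels == c].sum() for c in classes]), classes

def augmented_prediction(softmax_scores, D, col_labels, s, k=50, lam=2.0):
    """Add class-wise fused coefficients to the model's softmax and take the argmax.

    softmax_scores must be ordered consistently with np.unique(col_labels).
    """
    class_scores, classes = fused_class_scores(D, col_labels, s, k, lam)
    return classes[np.argmax(softmax_scores + class_scores)]
```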
We evaluate our technique on two public datasets for COVID-19 classification using CT-scans. The first dataset is the SARS-CoV-2 CT-scan (SC2C) database [17]. It contains a total of 2,482 CT images, comprising 1,252 images of COVID-19 positive cases from 60 patients and 1,230 images from 60 COVID-19 negative patients. The data was collected from different hospitals in Sao Paulo, Brazil. The second dataset is the COVID-CT-Dataset (CCD) [7]. It consists of 349 CT images of COVID-19 infected patients and 397 CT images of non-infected patients. The image sizes in both datasets vary significantly; however, most of the images are much larger than the 224 × 224 grid size.

We transfer the popular ImageNet models Inception-v3, ResNet50, DenseNet201 and VGG16 to our target domain using the training details discussed in the previous section. Table 1 summarizes the results of our experiments on the two datasets. We include the results of vanilla 'Transfer learning' as the baseline, which is claimed to be highly accurate in [5]. Results for the 'Boosted' transfer learning are achieved by transferring our chest X-ray model, which was altered for the larger grey-scale inputs. We use 5 training epochs with a 5e-4 learning rate for this transfer. We can see a consistent, large performance gain with this improvement over vanilla transfer learning. For 'Boosted + Data Aug.', we also include data augmentation with random scaling in the range [0.9, 1.1], random translation in the range [-5, 5] and reflection. It is worth noticing that data augmentation generally results in a slight performance gain, which is expected. However, this differs from the findings of [5]. We discuss this further in Sec. 5. Lastly, for the 'Combined' results, the proposed sparse and dense collaborative representation is also used on top of the 'Boosted + Data Aug.' setting. Again, an increasing trend in performance is generally observed. We use 50 as the sparsity threshold for the OMP algorithm [18], and λ = 2 to compute the dense representation vector. These values are selected empirically by cross-validation. Due to space restrictions, we only provide 'Combined' results for the CCD dataset, noting similar trends for the remaining methods.

Contrary to [5], our results do not particularly favor DenseNet201. Instead, shallower networks seem to have a slight advantage. We report results as averages over five draws from the dataset, where random chunks of consecutive images were selected as the test data, forming 10% of the overall dataset. We note that this strategy and data division differ from [5]. For the CCD dataset, we have consistently observed an increase of more than 5% over vanilla transfer learning with our method across all models.

We introduced a novel method to make transfer learning with deep models of natural images much more effective for CT-scan-based COVID-19 classification. However, despite a large accuracy gain across all models, the achieved results on public datasets still cannot be categorized as 'acceptable' for automated prediction of this infection. Our results indicate that larger annotated datasets are still required to achieve that target; until then, human experts should not fully rely on the automated results. Interestingly, our findings do not align well with the existing claims of very high predictive performance of transfer learning on the same datasets, e.g. Pham's claim [5] of 96% accuracy with vanilla transfer learning of ImageNet models on [7]. We conjecture that such studies over-estimate the performance of transfer learning for this task. The apparent high accuracies seem not to result from accurate modeling of COVID-19 features; instead, they result from encoding data idiosyncrasies, causing a form of over-fitting to the dataset. This argument is supported by two counter-intuitive observations about such studies:
(a) Data augmentation results in significant performance degradation instead of better generalisation. (b) Deeper models perform better than shallower ones despite the small training data size. In our separate experiments, we also observed a large performance degradation in the transfer learning results of [5] when slightly changing the training/testing data selection strategy. We refrain from drawing conclusive statements about this observation here, and stress the need for a more careful evaluation of transfer learning for this task by the research community. Compared to [5], our analysis does not suffer from counter-intuitive observations. However, it also does not support the notion that highly effective transfer learning from natural image models is possible with a limited number of CT-scans. A further investigation into unbiased and fair evaluation of transfer learning for this task is implicated by our study, and is planned for the future. Nevertheless, our method does ascertain the possibility of a considerable performance boost for transfer learning on this task.

References
[1] The role of imaging in 2019 novel coronavirus pneumonia
[2] Deep learning
[3] How might AI and chest imaging help unravel COVID-19's mysteries?
[4] CT imaging changes of corona virus disease 2019 (COVID-19): a multi-center study in southwest China
[5] A comprehensive study on classification of COVID-19 on computed tomography with pretrained convolutional neural networks
[6] ImageNet: A large-scale hierarchical image database
[7] COVID-CT-Dataset: A CT scan dataset about COVID-19
[8] Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks
[9] Deep residual learning for image recognition
[10] Regarding "Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT"
[11] A deep learning algorithm using CT images to screen for corona virus disease
[12] A deep learning system to screen novel coronavirus disease 2019 pneumonia
[13] Densely connected convolutional networks
[14] Dictionary learning
[15] Efficient classification with sparsity augmented collaborative representation
[16] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
[17] SARS-CoV-2 CT-scan dataset: A large dataset of real patients CT scans for SARS-CoV-2 identification
[18] Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition