key: cord-0113281-jycbw7nz
authors: Morani, Kenan; Unay, Devrim
title: Deep Learning Based Automated COVID-19 Classification from Computed Tomography Images
date: 2021-11-22
journal: nan
DOI: nan
sha: 1df8cbb62f27d751b0985b081ceb5f52fc722195
doc_id: 113281
cord_uid: jycbw7nz

The paper presents a Convolutional Neural Network (CNN) model for image classification, aiming at increasing predictive performance for COVID-19 diagnosis while avoiding deeper and thus more complex alternatives. The proposed model includes four similar convolutional layers followed by a flattening layer and two dense layers. This work proposes a less complex solution based on simply classifying 2D CT-scan slices using their pixels via a 2D CNN model. Despite the simplicity of its architecture, the proposed model showed improved quantitative results exceeding the state of the art on the same dataset of images, in terms of the macro F1 score. In this case study, extracting features from images, segmenting parts of the images, or other more complex techniques, ultimately aiming at image classification, do not yield better results. With that, this paper introduces a simple yet powerful deep learning-based solution for automated COVID-19 classification.

The COVID-19 virus, or severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is believed to have originated in bats and to have been transmitted to human beings in December 2019. The virus spread rapidly around the world, affecting many people and claiming lives [1]. COVID-19-infected individuals have experienced fever at onset, generalized fatigue, dry coughing, and diarrhea, among other possible symptoms [2]. Early detection and isolation are vitally important to successfully handle the COVID-19 pandemic. Studies have shown the importance of lung imaging for that cause [3]. With that, automated solutions were proposed for COVID-19 detection through medical images such as computed tomography (CT) scans using different algorithmic methods [4].

The proposed methods report classification performance in different metrics, including accuracy, precision, recall, specificity, and F1 score [5]. When the classes contain unequal numbers of observations (unbalanced data), accuracy alone might not be enough to describe the performance of a solution. In that case, the model can be assessed in terms of its precision and recall. If the former is high, the model gives more relevant results than irrelevant ones. If the latter is high, the model returns most of the relevant results (whether or not irrelevant ones are also returned). Therefore, for unbalanced classification problems, the weighted average of the two scores, or the macro F1 score, can be used to evaluate the classification performance of a model in a more reliable manner [6].
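The effect of class imbalance on accuracy can be illustrated with a small sketch. The labels below are purely illustrative (they are not taken from the dataset), and the macro F1 is computed here with the common definition, the unweighted mean of the per-class F1 scores, which is an assumption about the variant meant in this paragraph.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1), treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_scores(y_true, y_pred, c)[2] for c in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative data only: 90 samples of class 1, 10 of class 0, scored
# against a degenerate classifier that always predicts class 1.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 3))
```

In this toy case the accuracy looks high although the minority class is never detected, while the macro F1 drops below 0.5 and exposes the problem.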

In this paper, the macro F1 score was used to compare the performances of different deep learning models validated on the same dataset. Those models are designed to provide an automated solution for COVID-19 diagnosis via CT-scan images. The classification solution proposed in this paper is a deep learning model consisting of four similar 2D convolutional layers followed by a flattening layer and two dense layers. The model was then used to make diagnosis predictions at the patient level using different methods and thresholds, via class probabilities and voting over the slices.

The main contributions of this work can be listed as follows:

• We present a less complex deep neural network solution, with five layers, to achieve COVID-19 diagnosis from CT images.

• We show that processing CT images with a Region of Interest (ROI) dedicated to the lung region improves diagnostic performance.

• We propose deriving patient-level diagnosis from slice-level processing via the proposed method.

• We evaluate the performance of the proposed solution on a recent, relatively large, challenging dataset.

Recently, deep transfer learning and customized deep learning-based decision support systems have been proposed for COVID-19 diagnosis using either CT or X-ray modalities [7, 8, 9]. Some of these systems were developed based on pre-trained models with transfer learning [10, 11], while a few others were introduced using customized networks trained from scratch [12, 13, 14].

One approach proposed a novel COVID-19 lung CT infection segmentation network, named Inf-Net [13] . The work utilized implicit reverse attention and explicit edge-attention aiming at identification of infected regions in CT images. The work also introduced a semi-supervised solution, Semi-Inf-Net, aiming at alleviating shortage of high-quality labeled data. The proposed method was designed to be effective in case of low contrast regions between infections and normal tissues.

Another approach used a deep learning-based automated method validated on chest X-ray images collected from different sources. Different pre-trained CNN models were compared, and the impact of several hyperparameters was analyzed, in order to obtain the best performing model. The ResNet-34 model outperformed other competitive networks, and thus the development of effective deep CNN models (using residual connections) proved to give a more accurate diagnosis of COVID-19 infection [14].

Using a series of CT scan images, named the COV19-CT-DB database [15], a baseline approach introduced a deep neural network based on a CNN-Recurrent Neural Network (RNN) architecture. The CNN part of the model extracts features from the images, while the following RNN part takes the final diagnostic decision [16, 17, 18].

Another study, which used the same database (COV19-CT-DB) for validation, introduced a different method [19]. In this study, 2D deep CNN models were trained on individual slices of the database. The performances of the following pre-trained models were compared: VGG, ResNet, MobileNet, and DenseNet. Evaluation of the models was reported both at the slice level (2D) and at the patient/volumetric level (3D), the latter using different thresholding values for voting at the patient level. The best results were achieved using the ResNet14 architecture (referred to as the AutoML model) via 2D images.

In another prior work, a 3D CNN-based network with BERT was used to classify slices of CT scans [20]. The model used only part of the images from the COV19-CT-DB database. The training and validation sets of images were first passed through a lung segmentation process to filter out images of closed lungs and to remove the background. After the segmentation process, a resampling method was used to select a fixed number of slices for training and validation. The 3D CNN-based model was followed by a second-level MLP classifier to capture the information of all the slices from the 3D volumetric images. The final model architecture achieved improved accuracy and macro F1 score on the validation set.

Another study introduced 2D and 3D deep learning models to predict COVID-19 cases [21]. The 2D model, named Deep Wilcoxon signed-rank test (DWCC), adopts non-parametric statistics for deep learning, making the predicted result more stable and explainable by finding a series of slices with the most significant symptoms in a CT scan. The 3D model, on the other hand, was based on pixel- and slice-level context mining. It was termed CCAT (Convolutional CT scan Aware Transformer) and was designed to further explore the intrinsic features in the temporal and spatial dimensions.

More work on the same dataset involved deploying a hybrid deep learning framework named CTNet, which combines a convolutional neural network and a transformer for the detection of COVID-19. The method deploys a CNN feature extractor module with SE attention to extract features from the CT scans, together with a transformer model to capture the discriminative features of the 3D CT scans. CTNet provides an effective and efficient method to perform COVID-19 diagnosis via 3D CT scans with a data resampling strategy. The method's macro F1 score exceeded the baseline method on the test partition of the COV19-CT-DB database [22].

Additionally, two experimental methods that customize and combine deep neural networks to classify series of 3D chest CT scans were deployed on COV19-CT-DB. The proposed methods experimented with two backbones: DenseNet 121 and ResNet 101. The experiments were separated into two tasks: one combined the ResNet and DenseNet backbones, and the other combined DenseNet backbones [23]. The method's macro F1 score on the test partition of COV19-CT-DB also exceeded the baseline model score, as can be seen on the leaderboard [24].

The deep learning approaches in the literature summarized above achieved high macro F1 scores on the COV19-CT-DB database. Our work further explores the database and introduces a less hand-engineered and more efficient deep learning-based solution for COVID-19 diagnosis. Our model's performance is compared to the state of the art on the same dataset.

COV19-CT-DB is the dataset used for validating the CNN model proposed in this paper as well as other state-of-the-art models compared to it. The CT images in the database were manually annotated by experts and distributed for academic research purposes via the "AI-enabled Medical Image Analysis Workshop and Covid-19 Diagnosis Competition" [25].

The database consists of about 5000 3D chest CT scans acquired from more than 1000 patients. The training set contains 1560 scans in total, 690 of which are COVID cases while the rest (870) belong to the Non-COVID class. The validation set contains 374 scans in total, of which 165 are COVID cases and 209 are Non-COVID cases. The number of slices per CT scan varies widely, ranging from 50 to 700. Please note that the validation set of the COV19-CT-DB database may be referred to as the test set in our work from this point onward.

The data is unbalanced in terms of the number of 2D slices in the COVID and Non-COVID classes. The images, which are input to the model, were all grayscale in JPEG (Joint Photographic Experts Group) format with 8-bit depth. All images were kept at their original size of 512x512.

The numbers of 2D slices used in our work were 335672 in the training set and 75532 in the validation set. Fig. 1 shows the distribution of the slices with respect to the classes in the training and test (validation) sets used in this study. The test set with unseen images is called the "test partition".

[Fig. 1. Distribution of the slices per class in the training and test (validation) sets; Class1 is Non-COVID.]

The proposed model's architecture consists of four similar convolutional layers followed by a flatten layer and two dense layers. The numbers of filters in the convolutional layers are 16, 32, 64, and 128, in order, all with a 3x3 filter size. Padding was applied in all four convolutional layers to match input and output image sizes (Padding="same"). Each of the four layers used batch normalization, 2x2 max pooling, and a ReLU (rectified linear unit) activation function, and the model produces a binary output for the final diagnosis. Fig. 2 shows the proposed CNN model's architecture.

The four convolutional layers are followed by a flattening layer and then by a dense layer with a dimensionality of 256, batch normalization, a ReLU activation function, and a dropout of 0.1. The model ends with a dense layer using a sigmoid activation function.
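To make the size of this architecture concrete, the sketch below traces the feature-map shape through the four blocks and counts trainable weights. It assumes a 512x512 single-channel input, a single sigmoid output unit, and ignores batch-normalization parameters; the function names are illustrative and not taken from the original implementation.

```python
def conv_block_shapes(height, width, filters=(16, 32, 64, 128)):
    """Trace (H, W, C) after each conv(3x3, 'same') + max-pool(2x2) block."""
    shapes = []
    h, w = height, width
    for f in filters:
        # 'same' padding keeps H and W unchanged; 2x2 pooling halves them (floor).
        h, w = h // 2, w // 2
        shapes.append((h, w, f))
    return shapes


def count_parameters(height, width, channels=1,
                     filters=(16, 32, 64, 128), dense_units=256):
    """Return (flattened size, weights + biases), excluding batch normalization."""
    total = 0
    in_channels = channels
    for f in filters:
        total += 3 * 3 * in_channels * f + f  # 3x3 kernels plus one bias per filter
        in_channels = f
    h, w, c = conv_block_shapes(height, width, filters)[-1]
    flattened = h * w * c
    total += flattened * dense_units + dense_units  # dense layer with 256 units
    total += dense_units + 1                        # single sigmoid output unit
    return flattened, total


print(conv_block_shapes(512, 512))
print(count_parameters(512, 512))
```

Under these assumptions the last feature map is 32x32x128, so almost all of the weights sit in the first dense layer rather than in the convolutional blocks.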

The model was compiled using the Adaptive Moment Estimation (Adam) optimizer with all its default values in Keras [26]. A learning rate scheduler, learning rate decay, and step decay were not employed on the original dataset, in order to avoid designing a complex model. A batch size of 128 was used.

The motivation behind the model architecture is to adopt a simple model of four similar layers with standard components: an increasing number of filters, padding, max pooling, and regularization. The four layers are followed by two standard dense layers with dropout.

The activation visualization results of classification on the database show room for improvement in terms of accuracy based on the classification mechanism. Following the Grad-CAM visualization in Fig. 8, one can hypothesize that restricting the images to the lung area should improve performance, as the model can better learn to discriminate COVID from Non-COVID slices. To test this hypothesis and improve the performance, a fixed-size rectangular Region Of Interest (ROI) was defined manually to localize the area of interest in the slices. The rectangular area was used to crop the lung areas of each CT slice, so that both the right and left lungs were included in the crop. Fig. 3 shows this ROI overlaid on an original slice from the training set.

After cropping, thresholding was applied to identify and remove the uppermost and lowermost slices of the CT scans (non-representative slices of the volume), aiming to achieve better performance in patient-level diagnosis. The non-representative slices were identified based on the number of bright pixels in a binarized slice. This procedure is explained below.

First, the cropped images were blurred using a Gaussian filter to suppress noise and thus enhance large structures in the image. A Gaussian function with a standard deviation of one was convolved with the cropped image's pixel intensity values. The Gaussian function can be expressed in two dimensions as in Equation 1:

$$G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} \quad (1)$$

where x and y are the distances from the origin along the horizontal and vertical axes, respectively, and σ is the standard deviation (σ = 1).

Second, a histogram-based binarization was applied to the resulting blurred images. By inspecting the slices' histograms, a threshold of 0.45 was chosen for the binarization. This fixed threshold was chosen after applying scale normalization to the image's pixel intensities. Fig. 4 illustrates an exemplary histogram of one of the Gaussian-blurred images in the database and the corresponding binarized image.

Finally, the binarized image's pixels were used to find a threshold for removing non-representative slices of the CT volume. To choose the threshold, four candidate CT scans were selected from the training set (CT scans 5, 6, 7, and 8), and random slices from them were processed as explained above. To indicate the importance of the slices in the CT volume, numbers from one to three were used, with three marking the most representative (most important) slice and one the least representative slice. An important slice is a representative slice, i.e. one that shows a large area of the lung. Similarly, slices of less importance are slices that show little to no lung area. Fig. 5 shows the results for slices of the four candidate CT scan volumes. The chosen filtering threshold for the number of white pixels was 4500 out of a total of 68100 pixels (227x300) in each image. Consequently, if the binarized image has more white pixels than the threshold, the corresponding slice is kept in the CT scan; otherwise it is removed. The threshold was chosen so as to keep at least one representative slice of every CT scan volume in the final diagnosis.

The slice processing methodology reduces the number of slices in the dataset by including only the representative slices. Accordingly, the numbers of training and test slices were reduced to 280462 (corresponding to a 16% reduction) and 63559 (15.3% reduction), respectively. Please note that the original numbers of slices are as shown in Fig. 1 above.
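The slice-filtering steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the min-max rescaling to [0, 1], the kernel radius of 3, the border clamping, and all function names are assumptions, while the values σ = 1, 0.45, and 4500 come from the text.

```python
import math


def gaussian_kernel(sigma=1.0, radius=3):
    """1D Gaussian kernel, normalised so that its weights sum to one."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + i - radius, 0), width - 1)]
                 for i, k in enumerate(kernel))
             for x in range(width)]
            for row in image]


def transpose(image):
    """Swap rows and columns of a nested-list image."""
    return [list(column) for column in zip(*image)]


def count_white_pixels(image, threshold=0.45, sigma=1.0):
    """Blur, rescale to [0, 1], binarise, and count the pixels above threshold."""
    kernel = gaussian_kernel(sigma)
    # A 2D Gaussian is separable: blur along rows, then along columns.
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    lo = min(min(row) for row in blurred)
    hi = max(max(row) for row in blurred)
    scale = (hi - lo) or 1.0  # avoid division by zero on constant images
    return sum(1 for row in blurred for v in row if (v - lo) / scale > threshold)


def keep_slice(image, min_white=4500):
    """Keep a cropped slice only if it has more white pixels than the threshold."""
    return count_white_pixels(image) > min_white
```

A cropped slice is passed as a list of pixel rows; a mostly dark slice from the top or bottom of the volume yields few white pixels and is dropped.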

After slice processing, hyperparameters were readjusted using the same CNN model architecture.

To stabilize the increase in validation accuracy during training and prevent fluctuation, a learning rate scheduler was used. The scheduler applied an exponential decay function to the steps of the SGD (Stochastic Gradient Descent) optimizer, as in Equation 2:

$$\text{decayed LR} = \text{initial LR} \times \text{decay rate}^{\,\text{steps} / \text{decay steps}} \quad (2)$$

The initial learning rate (initial LR) was set to 0.1 and a 0.96 decay rate was used. The value of steps divided by decay steps is an integer division, i.e. the decayed learning rate follows a staircase function.

The optimizer's steps were defined using floor division as in Equation 3:

$$\frac{\text{steps}}{\text{decay steps}} = \left\lfloor \frac{\text{steps}}{\text{decay steps}} \right\rfloor \quad (3)$$

Decay steps were set every 100000 steps [27].
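The staircase behaviour described above can be sketched in a few lines. The function name is illustrative; the default values are the ones stated in the text, and the formula is the standard exponential decay with floor division of the step count.

```python
def staircase_exponential_decay(step, initial_lr=0.1, decay_rate=0.96,
                                decay_steps=100_000):
    """Learning rate after `step` optimizer steps with staircase decay."""
    # Floor division keeps the rate constant within each block of decay_steps.
    return initial_lr * decay_rate ** (step // decay_steps)


for step in (0, 99_999, 100_000, 250_000):
    print(step, staircase_exponential_decay(step))
```

The rate stays at 0.1 for the first 100000 steps and is then multiplied by 0.96 at each further multiple of 100000 steps.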

On the other hand, class weights were used to compensate for the imbalance between the input image classes. The class-weight ratio in our case study came to 1.197:1 (COVID to Non-COVID) in the training set. The class weights were calculated using the formula in Equation 4:
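One common way to obtain such weights is the "balanced" heuristic, in which each class is weighted by the total number of samples divided by the number of classes times the class count. The sketch below assumes this heuristic, which may differ from the formula used by the authors, and the slice counts are hypothetical values chosen only so that the resulting ratio equals 1.197:1.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * n_class_samples)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical slice counts (not the dataset's real per-class counts).
counts = {"COVID": 1000, "Non-COVID": 1197}
weights = balanced_class_weights(counts)
ratio = weights["COVID"] / weights["Non-COVID"]
print(weights, round(ratio, 3))
```

With this heuristic the weight ratio is simply the inverse of the class-count ratio, so the minority class contributes proportionally more to the loss.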

Finally, image augmentation (mainly horizontal and vertical flipping) was applied to the processed slices. These flipping techniques aimed at improving the accuracy by smoothing the effects of the content variations present in the slices [25] [28].
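As a minimal sketch of the two flips, assuming a slice is stored as a list of pixel rows (the function names are illustrative):

```python
def flip_horizontal(image):
    """Mirror a slice left-to-right by reversing every row."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror a slice top-to-bottom by reversing the order of the rows."""
    return image[::-1]


def augment(image):
    """Return the original slice together with its two flipped versions."""
    return [image, flip_horizontal(image), flip_vertical(image)]


tiny_slice = [[1, 2], [3, 4]]
print(augment(tiny_slice))
```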

At the patient level, different class probability thresholds were tried and compared to find the method with the best performance. The thresholds were applied to the predicted probability of class 1 (Non-COVID): if the output probability for class 1 is greater than the chosen threshold, the slice is predicted as Non-COVID; otherwise, the slice is predicted as COVID. This slice-level decision can be expressed as follows:

if Class1 probability > class probability threshold:
    Predict slice as Non-COVID
else:
    Predict slice as COVID

After slice-level predictions are obtained, a patient is diagnosed based on the presence or absence of COVID slices in his/her CT: if the patient's CT data contains more Non-COVID-predicted slices than COVID-predicted slices, the patient is diagnosed as Non-COVID; if the two counts are equal, the patient is also diagnosed as Non-COVID; otherwise the patient is diagnosed as COVID (majority voting method).
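The slice-level rule and the patient-level vote can be sketched as below. The function names and the default threshold of 0.40 (the best-performing value reported later in the paper) are choices made for this illustration, and ties are resolved as Non-COVID following the rule stated for equal slice counts.

```python
def predict_slice(p_non_covid, threshold=0.40):
    """Slice-level rule: Non-COVID if P(class 1) exceeds the threshold."""
    return "Non-COVID" if p_non_covid > threshold else "COVID"


def diagnose_patient(slice_probabilities, threshold=0.40):
    """Majority vote over the slice predictions; ties go to Non-COVID."""
    votes = [predict_slice(p, threshold) for p in slice_probabilities]
    covid = votes.count("COVID")
    non_covid = len(votes) - covid
    return "COVID" if covid > non_covid else "Non-COVID"


# Illustrative class 1 probabilities: 3 COVID-like and 7 Non-COVID-like slices.
print(diagnose_patient([0.1] * 3 + [0.9] * 7))
```

In this example 30% of the slices are predicted as COVID, and the majority vote still returns Non-COVID.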

The above approach to predicting patient cases, although computationally sound, should also be considered from a medical standpoint. Doctors and radiologists implementing our method should take the patient's illness into account qualitatively. For example, assume that a patient has 30% lung damage due to COVID. The network would then classify around 30% of the slices as COVID and the rest as Non-COVID, and the final result would be Non-COVID.

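The proposed model was evaluated on the COV19-CT-DB database. The macro F1 score was calculated after averaging the precision and recall metrics, as in Equation 5:

$$\text{macro } F1 = \frac{2 \times \text{average precision} \times \text{average recall}}{\text{average precision} + \text{average recall}} \quad (5)$$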

Average precision and average recall can be taken from the test set over the number of training epochs.

Furthermore, to report the confidence intervals of the results obtained, binomial proportion confidence intervals for the macro F1 score are used. The confidence intervals were calculated from the following formulation [26] [29]:

$$\text{interval} = \text{macro } F1 \pm \text{radius}$$

The radius of the interval is defined as in Equation 6:

$$\text{radius} = z \sqrt{\frac{\text{macro } F1 \times (1 - \text{macro } F1)}{n}} \quad (6)$$

where n is the number of samples used.

In the above formulation, z is the number of standard deviations from the mean of the Gaussian distribution, which is taken as z = 1.96 for a confidence level of 95%.
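A small sketch of this interval is given below, using the standard binomial proportion (normal approximation) formula. The function names are illustrative, and taking n = 75532 (the number of validation slices) as the sample count is an assumption made for the usage example.

```python
import math


def binomial_ci_radius(score, n_samples, z=1.96):
    """Radius of the normal-approximation binomial proportion interval."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return z * math.sqrt(score * (1.0 - score) / n_samples)


def confidence_interval(score, n_samples, z=1.96):
    """Return the (lower, upper) bounds of the interval around the score."""
    radius = binomial_ci_radius(score, n_samples, z)
    return score - radius, score + radius


# Assumed sample count: the 75532 validation slices.
print(round(binomial_ci_radius(0.94, 75532), 4))
```

Under that assumption the radius for a score of 0.94 rounds to 0.0017.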

Evaluation of our method is conducted at the slice level and at the patient level. Slice-level results take every slice of all CT images into consideration, both quantitatively and qualitatively. Patient-level results, in contrast, consider each CT scan as a whole rather than its 2D slices individually, so the prediction reflects the 3D CT scan (the patient) rather than each 2D slice's predicted value.

Our model achieved average recall and precision rates of 0.95 and 0.93 on the test set, respectively. The batch size used for these scores was 128, and training was conducted over 70 epochs. The number of training epochs was chosen so that the training results, such as accuracy, precision, and recall, could be monitored over a sufficient number of epochs. With these rates, the macro F1 score reached 0.94 for this binary classification, obtained using the CNN model on the original images in the database, i.e. without any slice processing or hyperparameter tuning. The Adam optimizer was used for these results. Other optimizer options, such as SGD, did not yield a higher macro F1 score. Adding a learning rate scheduler or employing step decay was also tried and found to add only fractional improvement. On the other hand, the effect of increasing the batch size was explored to increase the accuracy of the model. The macro F1 score increased from 0.903 with a batch size of 32 to 0.927 with a batch size of 64, and a batch size of 128 raised it to the 0.94 reported above. Fig. 6 shows the evolution of the recall and precision rates on the test set over the epochs. The results were obtained on all slices of the dataset with a batch size of 128, using the images at their original sizes.
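As a quick arithmetic check, substituting the reported average precision (0.93) and average recall (0.95) into the harmonic-mean form of the F1 score reproduces the reported value. This assumes that the slice-level macro F1 is computed from the averaged precision and recall.

```latex
\text{macro } F1 = \frac{2 \times 0.93 \times 0.95}{0.93 + 0.95} = \frac{1.767}{1.88} \approx 0.94
```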

As the batch size increased, the computation time also increased. Training the CNN model with a batch size of 128 required about two and a half days. The model was trained on a GNU/Linux operating system with 62 GiB of system memory and an Intel(R) Xeon(R) W-2223 CPU @ 3.60GHz processor.

The interval of the reported macro F1 score at the 95% confidence level is calculated as in Equation 7 and, as listed in Table I, comes to 0.94 ± 0.0017. The results show a narrow deviation from our reported macro F1 score.

The proposed model was compared to other state-of-the-art models on the COV19-CT-DB database in terms of the macro F1 score and the confidence intervals. Our proposed CNN model, although it has a simple architecture, achieved improved performance in terms of macro F1 score with similar or better confidence intervals compared to the other models. Table I shows a comparison of the proposed model and the state-of-the-art models mentioned in the related work section. The comparison in the table is at the patient level or at the slice level, depending on the method presented. The results in the table encourage using the CNN model at the patient level with a batch size of 128. The accuracy of the model is considered next.

Table I. Comparison of the proposed model with the state-of-the-art and the baseline model on the same dataset.

Method                                      Macro F1   Confidence interval
ResNet50-GRU (baseline model) [15]          0.70       ± 0.0032
AutoML model (best reported result) [19]    0.88       ± 0.0023
3D-CNN-Network with BERT [20]               0.92       ± 0.0018
CCAT and DWCC [21]                          0.93       ± 0.0017
Our proposed methodology                    0.94       ± 0.0017

Our proposed model reached a test accuracy of about 80% on the original dataset. Fig. 7 shows the training and test accuracies. The model's architecture allows sharp learning during the initial epochs and a steadier trend throughout the rest of the epochs. Furthermore, hyperparameter tuning was not used to train the model, which could explain the fluctuations and spikes in the results on the test set. However, the model gave similar results on the validation partition and on the test partition (unseen images) in terms of accuracy and macro F1 score; these results are presented in the results section. This indicates that overfitting is not significantly present in our proposed model.

On the other hand, to understand how our proposed model performs the classification, Guided Grad-CAM class activation visualization was used at the last convolutional layer of the model, i.e. the layer followed by a (256) flatten layer [30]. Fig. 8 shows the Grad-CAM visualization for a slice in the validation set. The color map used in the figure is the Viridis colormap [31]. The slice belongs to a COVID case and was correctly classified by the model. The outputs for the correct and incorrect classifications are adapted to the input image. They clearly show that the model pays attention to:

- the lung area, and
- the posterior and anterior walls (with the anterior walls getting very strong attention values).

The image on the left is the color map. The image in the center is the image slice from the dataset. The image on the right is the overlapping of the colormap on the image slice.

We can observe a similar attention distribution on the COVID cases incorrectly classified and Non-COVID cases (correct and incorrectly classified slices) as well.

As for the slice-level decision, the proposed model can sometimes incorrectly predict the uppermost and lowermost slices as Non-COVID (specifically, 20 out of 24 extreme slices in the validation partition are misclassified). These extreme slices correspond to anatomical regions where COVID involvement is not seen, and can therefore be considered the least representative slices for diagnosing the disease. Fig. 9 shows exemplary slices that are correctly classified by our proposed model, while Fig. 10 depicts exemplary slices that are incorrectly classified, among which the extreme slices can be observed. To increase the validation accuracy, the slices were processed as described in Section 3.3 and the parameters were tuned as described in Section 4.3. With that, the CNN model reached a test accuracy of 84%, up from the 80% achieved without slice processing and hyperparameter tuning. The final model, including slice processing and hyperparameter tuning, was used for patient diagnosis.

In order to obtain patient-level diagnosis from slice-level decisions, different class probability thresholds (for slice prediction) in the range [0,1] were tried as explained in Section 3.5, and the corresponding macro F1 scores were compared. Majority voting was used at the patient level (for CT prediction). As observed in Fig. 11, the model achieves the highest macro F1 score with a class probability threshold of 0.40, followed by a class probability threshold of 0.15. The testing accuracies when using these thresholds are 88.5% and 87.7%, respectively. A class probability threshold of 0.40 therefore gives the best performance, in terms of macro F1 score, when used with majority voting at the patient level. The patient-level macro F1 score achieved using the proposed method reaches 0.882 on the validation set, which comfortably exceeds that of the baseline model on the same set and database (0.70, as reported in [15]). Overall, the model misclassifies 13 Non-COVID cases out of 209 and 30 COVID cases out of 165. Class-specific F1 scores of the proposed method are 0.86 for the COVID class and 0.90 for the Non-COVID class. Table II shows the confusion matrix of the proposed method at the patient level for the best threshold value. Although misclassifying 13 Non-COVID cases out of 209 is less problematic than misclassifying 30 COVID cases out of 165, the model aimed mainly at improving the quantitative results. Further, this work emphasizes greater automation of the models and less reliance on predictions by radiologists or others.
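The patient-level figures quoted above can be re-derived from the reported misclassification counts. The sketch below uses only those counts; the variable and function names are illustrative, and the macro F1 is taken here as the unweighted mean of the two class-specific F1 scores.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


# Patient-level counts reported for the best threshold on the validation set.
covid_total, covid_missed = 165, 30
non_covid_total, non_covid_missed = 209, 13

covid_tp = covid_total - covid_missed              # COVID cases found: 135
non_covid_tp = non_covid_total - non_covid_missed  # Non-COVID cases found: 196

# A missed case of one class is a false positive for the other class.
f1_covid = f1_from_counts(covid_tp, fp=non_covid_missed, fn=covid_missed)
f1_non_covid = f1_from_counts(non_covid_tp, fp=covid_missed, fn=non_covid_missed)
patient_macro_f1 = (f1_covid + f1_non_covid) / 2
patient_accuracy = (covid_tp + non_covid_tp) / (covid_total + non_covid_total)

print(round(f1_covid, 2), round(f1_non_covid, 2),
      round(patient_macro_f1, 3), round(patient_accuracy, 3))
```

These values round to 0.86, 0.90, 0.882, and 0.885, matching the class-specific F1 scores, the macro F1 score, and the 88.5% accuracy reported for the 0.40 threshold.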

To further validate the results, the method was tested on the test partition of the COV19-CT-DB database (unseen images), which was provided to the teams within the context of the MIA-COVID19 competition. On this unseen set of images, the model achieved a macro F1 score of 0.82, with an F1 score of 0.96 for Non-COVID and 0.68 for COVID. This score is above the baseline macro F1 score of 0.67.

Our proposed method exceeded not only the baseline macro F1 score (0.67) but also several other alternatives submitted to the competition on the COV19-CT-DB test partition [24].

Method                                           Macro F1
ResNet50-GRU (baseline model) [15]               0.67
A hybrid deep learning framework (CTNet) [22]    0.78
Custom Deep Neural Network [23]                  0.78
Our proposed methodology                         0.82
CCAT and DWCC [21]                               0.88

This paper proposes a verified CNN-based methodology for the classification of CT scans of COVID-19 cases. The proposed model architecture provides a less complex and efficient method for the classification task. The model achieved a state-of-the-art macro F1 score with similar or better confidence intervals for classification on the given dataset (COV19-CT-DB) at the slice level. Further, slice processing along with parameter tuning allows the model to achieve a macro F1 score that exceeds the baseline score on the database.

More complex techniques or methods do not reach macro F1 scores as high as that of the model trained in this paper at the slice level. With that, the paper generally encourages researchers, programmers, and others to consider a simpler deep learning model trained from scratch, with different modifications, and shows that more complex deep learning models might not give better results.

On the other hand, using the rectangular region selection for slice processing improved the performance over the previous method, which was applied to the original dataset. This shows that limiting the region of interest to the lung volumes, instead of processing the whole CT scan, is a promising approach. Therefore, segmenting the lung parenchyma prior to classification could further improve the diagnostic performance of the proposed method.

Inferring the ecological niche of bat viruses closely related to SARS-CoV-2 using phylogeographic analyses of Rhinolophus species

Longitudinal symptom dynamics of COVID-19 infection

The Importance of Diagnostic Testing during a Viral Pandemic: Early Lessons from Novel Coronavirus Disease (COVID-19)

Visual Transformer with Statistical Test for COVID-19 Classification

A review on deep learning techniques for the diagnosis of novel coronavirus (covid-19)

Review on machine and deep learning models for the detection and prediction of Coronavirus

Artificial intelligence-enabled rapid diagnosis of patients with COVID-19

Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal

Serial quantitative chest CT assessment of COVID-19: a deep learning approach

Application of deep learning for fast detection of COVID-19 in X-Rays using nCOVnet

Using X-ray images and deep learning for automated detection of coronavirus disease

Deep learning covid-19 features on cxr using limited training data sets

Inf-net: Automatic covid-19 lung infection segmentation from ct images

Application of deep learning techniques for detection of COVID-19 cases using chest X-ray images: A comprehensive study

MIA-COV19D: COVID-19 Detection through 3-D Chest CT Image Analysis

Deep transparent prediction through latent representation analysis

International Workshop on the Foundations of Trustworthy AI Integrating Learning, Optimization and Reasoning

Deep neural architectures for prediction in healthcare

COVID19 Diagnosis Using Automl from 3D CT Scans. 1, TechRxiv

A 3D CNN Network with BERT For Automatic COVID-19 Diagnosis From CT-Scan Images

Visual Transformer with Statistical Test for COVID-19 Classification

A hybrid deep learning framework for covid-19 detection via 3d chest ct images

Custom deep neural network for 3d covid chest ct-scan classification

Differential data augmentation techniques for medical imaging classification tasks

A study of colormaps in network visualization

The authors acknowledge the work of all the medical staff and others who manually annotated the images in the COV19-CT-DB database and shared them in a relatively big dataset.

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.