key: cord-0785937-ehq645ya
authors: Yang, Yuan; Zhang, Lin; Du, Mingyu; Bo, Jingyu; Liu, Haolei; Ren, Lei; Li, Xiaohe; Deen, M. Jamal
title: A comparative analysis of eleven neural networks architectures for small datasets of lung images of COVID-19 patients toward improved clinical decisions
date: 2021-09-24
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2021.104887
sha: b6fdcc5e19687f3bf96e61ef56d94b87e8f069f5
doc_id: 785937
cord_uid: ehq645ya

The 2019 novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes the disease commonly known as COVID-19, is highly infectious and has endangered the health of many people around the world. COVID-19, which infects the lungs, is often diagnosed and managed using X-ray or computed tomography (CT) images. For such images, rapid and accurate classification and diagnosis can be performed using deep learning methods that are trained using existing neural network models. However, at present, there is no standardized method or uniform evaluation metric for image classification, which makes it difficult to compare the strengths and weaknesses of different neural network models. This paper used eleven well-known convolutional neural networks, including VGG-16, ResNet-18, ResNet-50, DenseNet-121, DenseNet-169, Inception-v3, Inception-v4, SqueezeNet, MobileNet, ShuffleNet, and EfficientNet-b0, to classify and distinguish COVID-19 and non-COVID-19 lung images. These eleven models were applied with different batch sizes and numbers of epochs, and their overall performance was compared and discussed. The results of this study can provide decision support in guiding research on processing and analyzing small medical datasets to understand which model choices can yield better outcomes in lung image classification, diagnosis, disease management and patient care.

COVID-19, a highly infectious lung disease, has caused an extremely serious pandemic that has spread worldwide. Some papers forecast the long-term trajectories of COVID-19 cases using mathematical modeling approaches [1] and stochastic forecasting models [2]. Three important symptoms of COVID-19 are shortness of breath or difficulty breathing, fever, and dry cough [3]. However, in many younger persons, these symptoms might not be present; as a result, other means of detecting infected individuals should be used. Nasal or throat swabs are collected from asymptomatic infected persons, which is uncomfortable and invasive, and pathological tests such as reverse transcription-polymerase chain reaction (RT-PCR) tests or rapid antigen tests (RAT) are then performed on those samples. In addition, diagnosis based on X-ray and computed tomography (CT) chest images is commonly used to assess the severity of the disease and in disease management and patient care [4]. However, identifying COVID-19 from these medical images is time-consuming, challenging, and prone to human errors. As a result, researchers in computer science have developed many automated diagnostic models based on machine learning (ML) or deep learning (DL) to help radiologists improve the accuracy of diagnoses [5] and achieve consistent performance [6]. In artificial intelligence (AI) methodologies, DL networks are more popular than traditional ML methods. The reason is that, unlike ML techniques, all feature extraction, feature selection, and classification stages are automated in a DL model. DL generally requires a large amount of training data to enable its network to learn the data characteristics.
However, currently, there are two major limitations to using DL on COVID-19 datasets. First, the CT datasets used cannot be shared with the public due to privacy concerns. As a consequence, the DL results cannot be reproduced, and the trained models cannot be used in other hospitals. In addition, the lack of an open-source, annotated COVID-19 CT dataset hinders the research and development of advanced AI methods that can test COVID-19 CT images more accurately. Second, to achieve a performance level that meets clinical standards, a DL method requires a large number of CT scans to be collected during model training. This requirement is stringent and might not be met by many hospitals, especially since health care professionals are busy caring for COVID-19 patients and are unlikely to have the time to collect and annotate large numbers of COVID-19 CT scans. In this research, an important finding is that in most papers, it is difficult to quantitatively compare the strengths and weaknesses of the various DL models used on COVID-19 CT scans. This difficulty arises from the lack of standard datasets, networks, indicators, and experimental methods. Another important issue is how to identify a neural network model that can effectively classify small CT datasets. Therefore, eleven well-known convolutional neural networks (CNNs), VGG-16, ResNet-18, ResNet-50, DenseNet-121, DenseNet-169, Inception-v3, Inception-v4, SqueezeNet, MobileNet, ShuffleNet, and EfficientNet-b0, were used to investigate the merits of detecting lung problems in small datasets of COVID-19 patients. This paper notes that these neural network options are not mutually exclusive. Rather, they can help to guide research or development efforts to understand which model choices can yield better results on small datasets. For the model evaluation and comparison, this research used uniform datasets, data augmentation, hyperparameter training, and consistent optimal-weight selection during the training process. By conducting comparative experiments on the application of the eleven DL models to CT images for COVID-19 diagnosis and disease classification, and on their variabilities, this research makes the following contributions:
• A comprehensive comparative analysis of five performance metrics, namely, accuracy (Acc), recall, precision (Pre), F1-score, and area under the curve (AUC), was performed on the eleven DL models.
• For these eleven models, different batch sizes and epochs and the same five metrics were employed to assess their merits and limitations.
• For the traditional neural network models used (ResNet-18, ResNet-50, DenseNet-121, DenseNet-169, Inception-v3, and Inception-v4), this research compared their performance differences under different parameter cases, including different batch sizes and epochs.
• The comparative analysis of CNN models conducted in this research on the small COVID-19 datasets can help to guide decision-making and planning recommendations, and help to understand which model choices could yield better transfer learning.

2.1.1. VGG
Since finishing as the first runner-up in the 2014 ImageNet competition, the VGG model has been widely used for image classification. The VGG architecture consists of multiple convolutional layers activated by ReLU (rectified linear unit), and the kernel size of the VGG convolutional layers is chosen to be 3x3. VGG-11, VGG-16, and VGG-19 are three variants of the VGG model, which are not very different from each other in terms of the model structure.
They consist of successive convolutional and pooling layers, followed by three fully connected layers [7]. They differ only in the number of convolutional layers (11, 16, or 19), which is directly reflected in their names. In [8], the researchers collected 777 CT images from 88 COVID-19 patients and trained and tested them using VGG. The model had an Acc of 84% with an F1 index of 84% and an AUC of 91%. In [9], the 150 collected CT images were cut into smaller parts and labeled to form the dataset. The constructed dataset was then trained using the VGG-16 network, and two sets of test results were obtained depending on the setting of the dataset, with the optimal set achieving 96.93% accuracy, 99.20% sensitivity, and 94.67% specificity. For this study, VGG-16 was selected.

ResNet is a widely used and favored DL network for the identification of COVID-19 CT images. In ResNet and other DL networks, there is a tendency for the accuracy of the model prediction to decrease as the depth increases beyond a certain number, and thus the model depth must be carefully selected. In [10], this problem was solved by passing features from the lower layers to the higher layers and adding an identity mapping between the higher and lower layers of the network. The main difference between ResNet-18 and ResNet-34 is the multiplier of the block usage, while the main difference between ResNet-34 and ResNet-50 is the internal structure of the block. For this study, the ResNet-18 and ResNet-50 neural network models were employed. In [11], an automated ResNet-based CT image analysis tool for detecting and distinguishing between COVID-19 patients and non-patients was developed. The results showed an AUC of 99.6%, a sensitivity of 98.2%, and a specificity of 92.2%. In [12], a total of 618 CT images were collected and used to train an improved neural network based on ResNet, with a final accuracy of 86.7%. In [13], the 3D UNet++-ResNet50 combined model was used to classify and identify patients with COVID-19. The final sensitivity and specificity were 97.4% and 92.2%, respectively, and the AUC was 99.1%. The core of the ResNet model is to train deeper CNNs by establishing shortcuts (skip connections) between the front and back layers, which helps to backpropagate the gradient during training.

The DenseNet model is based on the same basic idea as ResNet, but it establishes dense connections between all of the previous and subsequent layers, which is reflected in its name [14]. These features allow DenseNet to achieve better performance than ResNet with fewer parameters and less computational cost [15]. DenseNet-121 and DenseNet-169 use the same bottleneck layer structure, i.e., BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3). The main difference between DenseNet-121 and DenseNet-169 is the multiplier used by the dense block. The DenseNet-121 and DenseNet-169 neural network models were used for this study. In [15], DenseNet was combined with Nu-SVM (support vector machine) to detect COVID-19 pneumonia and achieved a final recall of 90.8%, a precision of 89.7%, and an accuracy of 95.0%. In [16], DenseNet-121 was used as a control group to compare the results of pneumonia disease classification using MoCo self-supervised learning, and the final model achieved an accuracy of 85.5%.
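To make the skip-connection idea concrete, the following is a minimal PyTorch sketch of a ResNet-style basic block; it is only an illustration of the mechanism described above, not the exact block configuration of the models compared in this paper. DenseNet differs in that it concatenates the feature maps of all preceding layers instead of adding an identity shortcut.

    import torch
    import torch.nn as nn

    class BasicResidualBlock(nn.Module):
        """Illustrative ResNet-style block: two 3x3 convolutions plus an identity shortcut."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(out + x)  # the skip connection adds the input back to the output

    x = torch.randn(1, 64, 56, 56)
    print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])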
GoogLeNet, the 2014 ImageNet winner, mainly uses the Inception structure. The main feature of Inception is that it extracts information from different scales of the image through multiple convolutional kernels and finally concatenates them to obtain a better representation of the image [17]. Inception-v2 differs from Inception-v1 in two main ways. The first is the decomposition of the 5x5 convolution into two 3x3 convolutions. The second is the decomposition of an n x n convolutional kernel into two convolutions of 1 x n and n x 1. Inception-v3 primarily uses BatchNorm [18]. Inception-v4 introduces a dedicated reduction block, which is used to change the network width and height. Inception-v4 has a more unified and simplified architecture and more Inception modules than Inception-v3 [18]. Inception-v3 was used in [19] and achieved a final recall of 80.08%, a precision of 80.07%, and an accuracy of 81.63%. In this study, the Inception-v3 and Inception-v4 neural network models were used.

The core of SqueezeNet is the proposed fire module, which consists of two parts, the squeeze part and the expand part. The squeeze part is a convolutional layer with 1x1 kernels. The expand part consists of convolutional layers with 1x1 and 3x3 kernels, and in the expand layer, the 1x1 and 3x3 feature maps are concatenated. A comparison on the ImageNet dataset shows that the accuracy of SqueezeNet and AlexNet is roughly equal [20]. In [21], a lightweight CNN model based on SqueezeNet for the recognition of lung CT images was proposed. The improved model achieved 83% accuracy, 85% sensitivity, 81% specificity, and an F1 value of 0.833 on the test dataset.

The basic unit of MobileNet is the depthwise separable convolution, which can be broken down into two smaller operations: depthwise convolution and pointwise convolution. Depthwise convolution uses a different convolution kernel for each input channel, i.e., one convolution kernel per input channel; thus, depthwise convolution is a depth-level operation. The pointwise convolution is just a normal convolution, but it uses 1x1 convolution kernels [22] (a minimal sketch of this unit is given at the end of this section). In [19], the experiments used MobileNet-V1 and obtained a final recall of 88.53%, precision of 88.64%, and accuracy of 89.14%. With MobileNet-V2, the recall was 87.66%, the precision was 82.84%, and the accuracy was 85.52%.

The design of ShuffleNet accounts for mobile devices with low computing power. The core of ShuffleNet is composed of two operations, pointwise group convolution and channel shuffling, which significantly reduce the computational load of the model while maintaining accuracy. The basic unit of ShuffleNet is built on a residual unit [23]. In [24], ShuffleNet was used on X-ray images as raw data, and the final model achieved accuracy, sensitivity, FPR (false-positive rate) and F1-score of 65.26%, 65.26%, 17.36% and 58.79%, respectively.

To balance speed and accuracy, EfficientNet combines several dimensions of model scaling: network depth, network width, and image resolution. EfficientNet uses a compound scaling method to find the best combination of these three dimensions, which affect one another [25]. In [26], the EfficientNet model achieved an accuracy of 0.7840 on the test set of 1248 CXR (lung X-ray) images of COVID-19 patients, patients with non-COVID-19-induced pneumonia, and healthy individuals from 2 publicly available datasets.
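As a small illustration of the depthwise separable convolution that MobileNet is built on (referred to above), the following PyTorch sketch shows a 3x3 depthwise convolution followed by a 1x1 pointwise convolution; it is a simplified example, not the exact MobileNet block used in the experiments.

    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        """Illustrative MobileNet-style unit: per-channel 3x3 depthwise conv, then a 1x1 pointwise conv."""
        def __init__(self, in_channels, out_channels):
            super().__init__()
            # groups=in_channels gives one 3x3 kernel per input channel (depthwise convolution)
            self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                                       groups=in_channels, bias=False)
            self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
            self.bn1 = nn.BatchNorm2d(in_channels)
            self.bn2 = nn.BatchNorm2d(out_channels)

        def forward(self, x):
            x = torch.relu(self.bn1(self.depthwise(x)))     # filter each channel independently
            return torch.relu(self.bn2(self.pointwise(x)))  # the 1x1 convolution mixes the channels

    print(DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 112, 112)).shape)  # [1, 64, 112, 112]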
An analysis of the datasets from the relevant literature revealed that, except for a few studies that used publicly available datasets, most studies did not provide a detailed description of the chosen data sources. To make the comparison more explicit, this study investigated the datasets that were used in many existing studies. The results are shown in Table 1. First, due to patient privacy concerns, hospitals cannot share CT images in their original format, which makes it difficult to reproduce many of the findings. Second, the medical images used in a significant portion of the studies include other forms of imaging, such as X-rays. For many models not trained on a uniform dataset, the trained models can show excellent classification results in some cases, but they might not be robust. Therefore, to conduct a comprehensive comparative analysis of the 11 neural networks, this study trained and tested them on a public dataset to ensure the reproducibility of the model training. When researching and analyzing datasets in the relatively small data regime, it also helps to understand which model to choose to obtain desirable results.

Because some models have many variants, it is difficult to determine the exact network model structure used in some publications, and a typical example is ResNet. For example, in [12], the classical ResNet-18 network structure was used for image feature extraction. The output of the convolutional layer was flattened to a 256-dimensional feature vector. Then, it was converted to a 16-dimensional feature vector using a fully connected network. In [19], the ResNet50 model was used to extract features from images. Initially, the ResNet50 model was used to obtain a 1024-dimensional feature map. Then, an SVM was applied to the extracted feature map to classify the samples into two categories. Therefore, in many cases, what is used is a specific neural network model that is a custom variant of the original neural network structure. The usual approach is to remove the last few layers of the original neural network and replace them with fully connected layers. In addition, batch normalization layers, dropout layers, and so on can also be added (a short sketch of this common modification is given at the end of this subsection).

Data augmentation techniques can improve the size and quality of training datasets in such a way that they can be used to build better deep neural network models. In particular, for medical images, creating large medical datasets is very challenging due to the low numbers of patients with specific diseases and the privacy issues of patient data. Therefore, it is necessary to perform data augmentation on medical datasets. Conventional data augmentation methods include geometric transformations, flipping, color space transformations, cropping and rotation. There are also ways to enhance data by developing models, for example, using the popular generative modeling framework to form a generative adversarial network (GAN). In Table 2, several previously studied image augmentation methods are summarized.

To assess the performance of each DL model, different metrics were applied in different studies to measure their misclassification of COVID-19 in the tested CT images. In Table 3, the metrics used to evaluate the COVID-19 diagnostic models are summarized. The most commonly used metrics are accuracy and AUC.
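As a hedged illustration of the custom-variant approach described above (removing the last layers of a pretrained backbone and replacing them with new fully connected and dropout layers), the following PyTorch/torchvision sketch adapts an ImageNet-pretrained ResNet-18 to a two-class COVID-19/non-COVID-19 task; the dropout rate and the choice of ResNet-18 are illustrative assumptions, not the exact configurations used in the surveyed papers.

    import torch.nn as nn
    from torchvision import models

    # Load an ImageNet-pretrained backbone and replace its classification head.
    # (Newer torchvision versions use the weights=... argument instead of pretrained=True.)
    model = models.resnet18(pretrained=True)
    model.fc = nn.Sequential(
        nn.Dropout(p=0.5),                   # optional dropout layer, as mentioned above
        nn.Linear(model.fc.in_features, 2),  # two outputs: COVID-19 vs. non-COVID-19
    )

The same pattern applies to other torchvision backbones, with the attribute name of the final layer differing between architectures (e.g., model.classifier for VGG and DenseNet).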
Table 1: Models and datasets used in the surveyed studies
Paper | Model(s) | Dataset
[24] | ResNet/ShuffleNet/DenseNet/MobileNet | These 127 COVID-19 X-ray images were shared by a postdoctoral fellow at the University of Montreal
[26] | VGG-16 | A total of 1248 CXR images were obtained from two public datasets, which included 215 images of COVID-19 patients
[9] | VGG-16/GoogleNet/ResNet-50 | 53 CT images of infected persons provided by the Italian Radiology Association
[27] | ResNet/DenseNet/Inception | CXR images were obtained from two public datasets, which include 236 images of COVID-19 patients
[28] | Inception/ResNet50/MobileNet | Images of 349 confirmed patients and 397 healthy people
[11] | ResNet | The lung CT image data of 157 patients from Chinese hospitals and the United States
[12] | ResNet18 | The 618 CT images used were collected from the First Affiliated Hospital of Zhejiang University, including 219 images of COVID-19 patients
[13] | Inception/ResNet50/Attention ResNet50 | The 1136 (723 COVID-19 positive) training samples were collected from five hospitals, including Wuhan Leishenshan Hospital
[16] | ResNet/DenseNet | The dataset was provided by the Italian Society of Medical and Interventional Radiology
[21] | SqueezeNet | The dataset was provided by the Italian Society of Medical and Interventional Radiology

Table 2: Different methods of data augmentation
Paper | Data augmentation
[8] | Each set of 3D CT images was equally divided into 15 slices. The slices with incomplete lungs were removed. The lung region in each slice was automatically extracted. The images were then filled with a background composed of 10 translational and rotational lungs
[24] | None
[26] | The conventional data augmentation method included ±15° rotation, ±15% x-axis shift, ±15% y-axis shift, horizontal flipping, and 85%-115% scaling and shear transformation. The parameter of mixup was set to 0.1
[9] | The original image is divided into 16x16 and 32x32 blocks to build two datasets
[27] | All of the images were initially preprocessed to have the same size. To make the image size uniform throughout the dataset, each of the images was interpolated using bicubic interpolation
[28] | The image size was resized to 224x224x3
[29] | GAN was used for data augmentation. First, the image was resized to 286x286, and then it was cropped to 256x256 by patchGAN
[11] | U-net was used to remove the irrelevant areas. Image rotation, horizontal flipping and clipping were used to enhance the data
[12] | A total of 3957 candidate cubes were generated from the 3D segmentation model
[13] | Image rotation, horizontal flipping and clipping were used to enhance the data
[16] | Random cropping with color distortion was used to augment the data, and the size was adjusted
[21] | Rotation (random angle between 0 and 90 degrees), scaling (random value between 1.1 and 1.3) and Gaussian noise in the original image were used for data augmentation

Table 3: Evaluation metrics used in the surveyed studies
Paper | Performance criteria
[8] | AUC, Recall, Precision, F1-score, Accuracy
[24] | Accuracy, Sensitivity, FPR, F1-score
[26] | Accuracy, Sensitivity
[9] | TP, TN, FP, FN, Accuracy, Sensitivity, Specificity, Precision, F1-score, Matthews Correlation Coefficient (MCC)
[27] | F1-score, Recall, Precision, Specificity
[28] | AUC, Recall, Precision, F1-score, Accuracy
[29] | AUC, Recall, Precision, F1-score, Accuracy
[11] | AUC, Sensitivity, Specificity
[12] | Recall, Precision, F1-score
[13] | AUC, Sensitivity, Specificity
[16] | AUC, Recall, Precision, Accuracy
[21] | AUC, Specificity, Precision, F1-score, Accuracy

An important problem with training neural networks on small datasets is that the trained models do not perform well on the validation and test datasets. To solve the overfitting problem of these models, a variety of methods have been developed, the simplest of which is to add regularization terms to the weighting scheme [30]. Another popular technique is dropout, which is achieved by probabilistically removing neurons from a given layer during training or by discarding certain connections [31]. Data augmentation is another way to reduce the overfitting of models. Currently, a widespread and well-accepted practice of image data augmentation is geometric and color augmentation [32], such as reflecting the image, cropping and translating the image, changing the color palette of the image, color processing, and geometric transformations (rotation, resizing, and so on). Image augmentation algorithms [33] include geometric transformations, color space augmentations, kernel filters, random erasing, adversarial training, and meta-learning. Among them, the basic methods of image processing data augmentation are geometric transformations, flipping, cropping, rotation, and color space transformations. In [32], the dataset from tiny-imagenet-200 was used in one experiment to select pictures of dogs and cats in a binary classification task. The results show that without any data augmentation, the accuracy was 85.5% on the validation set. After using traditional data augmentation methods, the accuracy was improved to 89%, which indicates that traditional data augmentation has some limited effect on improving the accuracy.

The image augmentation methods used in this study are all basic methods. The input images were standardized to have zero mean and unit standard deviation. Then, they were cropped to 224x224x3. For the UCSD-AI4H dataset and the Italiancase dataset, the data augmentation methods and values used for each image are shown in Table 6. Fig. 4 shows several examples after data augmentation, including changes in brightness and contrast. As shown in Table 7, this paper lists the number of parameters of the eleven selected models. In addition to image preprocessing, hyperparameters are an essential part of neural network training. The hyperparameters of the final model used in this work are listed in Table 8.

(1) Training process
a) A batch of data is obtained from the training dataset and fed into the neural network model for training.
b) After each training generation (epoch), the evaluation metrics are computed and stored in a dictionary.
c) The current values of accuracy, AUC, and F1-score are compared with their corresponding optimal historical values.
If a value is greater than its corresponding optimal historical value, the optimal historical value is updated. At the same time, the optimal weight file for this generation of training is saved. Therefore, after the final training, three optimal weight files are obtained: the Accuracy weight file, the AUC weight file, and the F1 weight file.

The confusion matrix illustrated in Table 9 is determined. The confusion matrix has four possible outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP is the number of samples predicted as positive (e.g., predicted as having the disease) that are actually positive (e.g., actually having the disease). TN is the number of samples predicted as negative (e.g., predicted as not having the disease) that are actually negative (e.g., not having the disease). FP is the number of samples predicted as positive but actually negative. FN is the number of samples predicted as negative but actually positive. To judge the training results of this research, the following five metrics were selected, which are the supporting data for calculating the overall performance metrics.

Precision is the proportion of samples predicted as positive that are actually positive:
Precision = TP / (TP + FP) (2)
Recall is the proportion of actual positive samples that are correctly predicted. The higher the recall is, the more accurately the positive samples are predicted, and the less likely it is that a positive sample will be missed.
Recall = TP / (TP + FN) (3)
The F1-score measures the accuracy of a test and is the harmonic mean of the precision and recall. In general, there is a trade-off between precision and recall; as a result, the F1-score is introduced as a composite index to balance the effects of precision and recall and to evaluate the classifier more accurately.
F1-score = 2 x Precision x Recall / (Precision + Recall) (4)
Accuracy is the ratio of the number of correctly classified samples to the total number of samples. In our study, since it is a binary classification problem and the number of positive and negative samples is not balanced, the pursuit of high accuracy alone might not reflect the classification effect objectively.
Accuracy = (TP + TN) / (TP + FP + FN + TN) (5)
AUC (area under the curve) is defined as the area under the ROC (receiver operating characteristic) curve, and it is not greater than 1. The ROC curve and AUC are often used to evaluate a binary classifier's effectiveness.

To make a comprehensive comparison of the performance of the 11 neural networks on the COVID-19 datasets, this study analyzed and compared these models with different batch sizes and epochs. The batch size affects the direction of the gradient descent during backpropagation. The larger the batch size is, the more representative it is of the dataset's overall characteristics, and the faster the training converges. However, in terms of computing power, it also requires more memory capacity and more time. In summary, this study chose batch sizes of 10 and 25. The number of epochs, another important hyperparameter, has no clear selection criterion in the training of neural networks. When the number of epochs is too small, the model cannot be adequately trained, which leads to poor performance. In addition, when the number of epochs is too large, an overfitting issue can arise. In this case, the model tends to perform very well on the training set. However, in fact, it does not learn the actual features of the image, and the classification performance on the test set is significantly reduced.
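As a small sanity check of the metric definitions in Eqs. (2)-(5) and the AUC, the sketch below computes the five scores with scikit-learn on a tiny made-up set of labels and predicted probabilities; the numbers are illustrative only and are not results from this study.

    from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score

    # y_true: ground-truth labels (1 = COVID-19, 0 = non-COVID-19); y_prob: predicted probability of class 1
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]
    y_pred = [int(p >= 0.5) for p in y_prob]              # threshold probabilities at 0.5

    print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
    print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
    print("Accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN)
    print("AUC:      ", roc_auc_score(y_true, y_prob))    # area under the ROC curve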
Figures 5 to 12 show the results of five metrics (precision, recall, F1-score, accuracy and AUC) for the comparison of the 11 models using different optimal weights on the 2 COVID-19 datasets, UCSD-AI4H and Italiancase. For the same parameter set, the results of using the three different optimal weights were compared horizontally. In most cases, using the three different optimal weights with the same epoch and batch size has little effect on the five metrics' results. However, in certain situations, it can have a large effect. For example, on the UCSD-AI4H dataset with epoch = 800 and batch size = 25, when EfficientNet-b0 uses the weights of the optimal accuracy, the accuracy is 76%, and when it uses the weights of the optimal AUC, the accuracy is 67%. Longitudinally, for the same dataset, the final results of the five metrics on the test set using different epoch and batch size parameters are different. However, in the overall comparison of the 11 models, the relative performance of the models on the five metrics is consistent.

Overall, on the UCSD-AI4H dataset, EfficientNet-b0 achieved the best performance. On the Italiancase dataset, EfficientNet-b0, ResNet-18, ResNet-50, DenseNet-121, DenseNet-169, Inception-v3 and Inception-v4 all achieved good performance.
Figure 9: The overall performance comparison of 11 neural networks on the Italiancase dataset, with epoch=800, batch-size=10. (a) Best-Acc, (c) Best-AUC.

To determine the performance of the 11 models in a comprehensive and accurate way, this research considered how to use these five metrics (precision, recall, F1-score, accuracy, and AUC) in combination. However, several of these five metrics are related to each other. Among them, the F1-score is a combined indicator of the precision and recall. It was also observed that some models performed well according to the recall but poorly according to the accuracy and precision, which indicates that the models actually performed poorly. Therefore, to evaluate the merits of the models in a more comprehensive way, the standard deviation (std) and the dispersion of the 4 indicators (precision, recall, accuracy, and AUC) were introduced. This research first added up (sum) the four indicators for each model, then obtained their std, and added a constant k = 0.02 to the obtained std (to make std + 0.02). The sum was then divided by (std + 0.02) to obtain the comprehensive evaluation indicator (sum/(std + 0.02)). The process is as follows.
Step 1: Delete the F1-score.
Step 2: Calculate Sum = Accuracy + Precision + Recall + AUC.
Step 3: Calculate std = std(Accuracy, Precision, Recall, AUC).
Step 4: Calculate Sum / (std + 0.02). This value is the comprehensive indicator required.
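A minimal sketch of the comprehensive indicator defined in Steps 1-4 is given below; whether the standard deviation is the sample or population form is not specified in the text, so the sample standard deviation is assumed here.

    import statistics

    def composite_indicator(accuracy, precision, recall, auc, k=0.02):
        """Steps 1-4: sum the four metrics and divide by (their standard deviation + k)."""
        values = [accuracy, precision, recall, auc]
        return sum(values) / (statistics.stdev(values) + k)  # sample std assumed

    # Example with hypothetical metric values for one model
    print(round(composite_indicator(accuracy=0.85, precision=0.83, recall=0.88, auc=0.90), 2))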
For the image classification task, in addition to the classification effect being the most important index, the number of model parameters was also used as an index to evaluate the merits of the models. Therefore, this research combined these two factors in an efficiency-effects plot (Figure 13), where the horizontal coordinate is the number of parameters of the model and the vertical coordinate is the overall performance index of the model. The closer the point representing a model is to the upper left corner of the efficiency-effects plot, the better and more efficient the model is. The opposite is true for models near the lower right corner. It can be seen that the EfficientNet-b0 model had the best performance in terms of overall metrics and had fewer model training parameters. The ResNet-18, ResNet-50, DenseNet-121, DenseNet-169, Inception-v3 and Inception-v4 models had moderate performance in terms of overall metrics. The VGG and SqueezeNet models had the worst performance.

This research employed the composite evaluation indicator mentioned above to compare the performance of each model. By running the 11 models on the UCSD-AI4H dataset and the Italiancase dataset with combinations of 2 parameters (batch size and epochs) and evaluating their performance on the four resulting parameter sets, this research obtained the result for the UCSD-AI4H dataset (Figure 14(a)) and the result for the Italiancase dataset (Figure 14(b)). From the above observations, the 11 neural networks were grouped into four categories.
• The first category is VGG-16, as a baseline methodology.
• The second category is ResNet-18, ResNet-50, DenseNet-169, DenseNet-121, Inception-v3, and Inception-v4.
• The third category is the SqueezeNet, MobileNet, and ShuffleNet-v2 lightweight models.
• The fourth category is EfficientNet-b0, which can scale the model along three parameters: depth, width, and input resolution.
Based on the above categorization of the models and the results shown in Figure 14, the following five conclusions were made.
1. VGG-16 had the worst overall performance.
2. SqueezeNet had the worst performance among the SqueezeNet, MobileNet, and ShuffleNet-v2 lightweight models.
3. MobileNet and ShuffleNet both outperformed SqueezeNet. MobileNet even achieved performance comparable to that of the ResNet, DenseNet, and Inception series but had the advantage of one order of magnitude fewer parameters.
4. The ResNet, DenseNet, and Inception series had no significant advantages over MobileNet and ShuffleNet under certain circumstances. However, these three classes of models required larger numbers of parameters.
5. The EfficientNet-b0 model performed well on a variety of metrics. On the two small datasets, the EfficientNet model performed better than the ResNet, DenseNet, and Inception series of networks in terms of accuracy, overall performance, and efficiency.
Similar results were obtained when compared to those shown in [34]. MobileNet achieved a performance comparable to ResNet, DenseNet, and the Inception series on the two small datasets.

According to Figure 15, the following two conclusions were obtained. First, a model with more layers might not have better performance; e.g., on the UCSD-AI4H dataset, the overall performance of ResNet-18 was better than that of ResNet-50 in all four cases. On the Italiancase dataset, the overall performance of ResNet-18 was better than that of ResNet-50 in 2 out of 4 cases. Second, a larger number of model parameters does not necessarily produce better overall model performance. For example, for the Inception series, the number of model parameters of Inception-v4 is greater than that of Inception-v3. On the UCSD-AI4H dataset, in all four cases, Inception-v3 performed better than Inception-v4 in terms of overall performance. On the Italiancase dataset, in two of the four cases, Inception-v3 performed better than Inception-v4. In [10], the top-1 errors of ResNet-18 and ResNet-34 on the ImageNet dataset were compared; ResNet-34 had a lower error rate than ResNet-18, indicating that ResNet-34 performed better. In [15], on the ImageNet dataset, DenseNet-169 was less error-prone than DenseNet-121. In [35], on the ImageNet dataset, Inception-v4 was less error-prone than Inception-v3.
Through the two small datasets used in this study, it can be found that, for the ResNet, DenseNet, and Inception models, a larger number of layers does not necessarily give better performance. To further investigate these two findings, this research evaluated the image quality of the ImageNet dataset, the UCSD-AI4H dataset, and the Italiancase dataset. Image quality assessment can generally be divided into two types: subjective quality scores given by human observers and objective quality scores given by an image quality model. Subjective quality assessment methods are more accurate. Nevertheless, because they are expensive, time-consuming, and unsuitable for large-scale data, algorithms should be investigated to predict the image quality. Current objective quality assessment methods are roughly divided into three categories: full-reference image quality assessment (FRIQA), reduced-reference image quality assessment (RRIQA), and no-reference image quality assessment (NRIQA) [36]. Among them, NRIQA is a so-called blind image quality assessment (BIQA). Compared to other image assessment methods, the NRIQA method does not require the original distortion-free reference image, which fits most application scenarios; therefore, the NRIQA method was employed in this study. Compared with the BIQA method, the FRIQA method has a well-developed theoretical system and assessment model. The most commonly used indicators in FRIQA are the mean square error (MSE) based on pixel statistics, the peak signal-to-noise ratio (PSNR), and the structural similarity (SSIM) based on structural information [37]. The generic BIQA algorithm learns to map from image features to the corresponding quality scores or to split the image into different distortion categories before mapping. Since the first use of natural scene statistics (NSS) [38] for image quality assessment in 2005, many experiments have shown that there is a close relationship between NSS features and image quality. In 2012, Mittal et al. proposed another model for extracting NSS features in the spatial domain: the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [39].

The data quality of the ImageNet dataset, the UCSD-AI4H dataset, and the Italiancase dataset was evaluated using four metrics: MSE, PSNR, SSIM, and BRISQUE, and the results are shown in Table 10. From Table 10, it can be seen that the PSNR and SSIM values are low. The reference image (ref) was selected from the LIVE dataset. The smaller the MSE result is, the smaller the gap between the evaluated image and the reference image. The BRISQUE result is a number between 0 and 100, and the smaller the number is, the better the quality. According to the MSE and BRISQUE metrics, the image data quality of the ImageNet dataset is better than that of the UCSD-AI4H and Italiancase datasets. Therefore, for the UCSD-AI4H dataset and the Italiancase dataset, more layers and more model parameters do not mean that the overall performance of the model is better. The likely reason is the poor quality of the datasets, which ultimately leads to overfitting of the models. This result can be used to extend classification studies on small image datasets to other areas.
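For readers who want to reproduce the full-reference part of this quality check, the following sketch computes MSE, PSNR, and SSIM with scikit-image on synthetic stand-in arrays; the reference image and the degradation are placeholders for the LIVE reference image and the dataset images, and BRISQUE is omitted here because it requires a separate package.

    import numpy as np
    from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

    # Stand-ins: a reference image and a noisy version of it, grayscale, values in [0, 1]
    rng = np.random.default_rng(0)
    ref = rng.random((224, 224))
    img = np.clip(ref + 0.05 * rng.standard_normal((224, 224)), 0.0, 1.0)

    print("MSE: ", mean_squared_error(ref, img))                       # lower means closer to the reference
    print("PSNR:", peak_signal_noise_ratio(ref, img, data_range=1.0))  # higher is better
    print("SSIM:", structural_similarity(ref, img, data_range=1.0))    # 1.0 means identical structure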
There are still some limitations to this study. First, the image quality of the two datasets was not high, and it was difficult for the neural network models to learn the features of the local pneumonia foci. Second, there were no clinical features associated with novel coronavirus pneumonia (COVID-19) available to examine the correlations between the symptoms and the pneumonia lesion characteristics.

A comparison experiment with and without data augmentation was performed on the UCSD-AI4H dataset. The boxplots of precision, recall, F1, accuracy and AUC are plotted in Fig. 16. The experimental results show that, after data augmentation, the accuracy on the test set and each of the metrics improved relative to the case without data augmentation.
Figure 16: Boxplots of precision, recall, F1, accuracy and AUC for the UCSD-AI4H dataset in 2 experiments, (a) with data augmentation and (b) without data augmentation.

In recent years, deep neural networks (DNNs) have made great achievements in natural language processing, computer vision, and other applications. Their performance is not only better than a number of existing machine learning methods but also outstanding when addressing actual tasks. With the intention of opening the black boxes of DNNs, a number of scholars have paid attention to the interpretability of the models. Although many studies have explored this topic, there is currently no unified definition of interpretability. Moreover, the definitions and motivations of interpretability that they propose are usually diverse or even significantly inconsistent with one another. It can be noted that several papers have distinguished between explainability and interpretability. In this research, the minute variance between these two concepts was not considered. As defined above, this research considered the explanation to be the essence of interpretability and used understandability, explainability, and interpretability interchangeably. Specifically, this research attempted to study the interpretability of DNNs, with the purpose of providing an explanation of their internal operations as well as input-output mappings.

The main functions of using feature visualization to explore the working mechanism of a deep convolutional neural network are as follows:
1. It is helpful to understand and analyze the working principle and decision-making process of the neural network to better select or design the network. For example, for classification networks, CAM places higher requirements on the network in addition to the classification accuracy. Specifically, it not only requires high prediction accuracy but also requires the network to extract the required features.
2. It makes use of visual information to guide the network to achieve better learning results.

Among the interpretation forms of neural networks, methods based on saliency are the most commonly used. These methods assign importance weights to each pixel of the input image to indicate the significance of each pixel to the predicted category of the image. The saliency map [40] can be considered a feature map, which demonstrates the influence of the pixels in the image on the result of image classification. The full name of CAM is Class Activation Mapping [41], also known as the category heat map. In general, it is represented by a grayscale map with the same size as the original picture, with the value at each position normalized to the range 0 to 1. It can be understood as the contribution distribution to the prediction output. The higher the value is, the higher the response and the greater the contribution of the corresponding area of the original picture to the network. The visualization of CAM can be presented in the form of a superposition of the heat map and the original image. The darker the red is, the greater the value. It can be considered that when the network predicts the "COVID-19" category, the red highlighted area is the primary basis for its judgment. An intuitive visualization is to draw the weights of the target layer; the weight visualizations [42] of the first layer are presented in Appendix A. In general, the coverage areas of the heat map and the CAM are similar, as in Fig. 17.
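The following is a minimal Grad-CAM sketch in PyTorch, included only to make the heat-map computation concrete; it hooks the last convolutional stage of a torchvision ResNet-18 (an illustrative choice, not the exact visualization code used for Fig. 17) and weights the activations by their spatially averaged gradients.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet18().eval()   # in practice, load pretrained or fine-tuned weights here
    store = {}

    def forward_hook(module, inputs, output):
        store["act"] = output.detach()
        # capture the gradient flowing back into this activation during backward()
        output.register_hook(lambda grad: store.update(grad=grad.detach()))

    model.layer4.register_forward_hook(forward_hook)

    x = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed CT slice
    scores = model(x)
    scores[0, scores[0].argmax()].backward()   # backpropagate the top predicted class score

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalized heat map in [0, 1]

Overlaying the resulting map on the input slice gives the kind of superimposed visualization described above.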
Therefore, the renderings produced by the CAMs of each neural network were discussed separately.
• The saliency maps of VGG, ResNet, and DenseNet pay more attention to local features; MobileNet, ShuffleNet, and SqueezeNet do not perform well in extracting key features; EfficientNet performs well not only in attending to global features but also in distinguishing key features.
• The Grad-CAMs of VGG-16 and SqueezeNet do not cover the entire object. In contrast, those of ResNet, DenseNet, Inception, and EfficientNet have more comprehensive coverage. This finding further illustrates that the performance of the ResNet, DenseNet, Inception, and EfficientNet models is better than that of the VGG-16 and SqueezeNet models.
• Compared with Grad-CAM, the objects covered by Grad-CAM++ [43] are more comprehensive. The objects covered by Grad-CAM are only partial, while Grad-CAM++ covers almost all objects. In particular, the Grad-CAM++ of the ResNet, DenseNet, and EfficientNet models can essentially cover all objects.

This research studied the learning behavior of 11 neural networks on the COVID19-CT dataset and evaluated the performance of the randomly initialized networks. In addition, the differences in the final classification performance of the neural network models on the COVID19-CT dataset were compared. The results of this research can guide researchers and help them determine the most suitable model and understand the conditions under which the models will produce better results. This paper contributes a systematic comparison and evaluation of the performance of 11 traditional neural network models in a relatively small data regime. For the relatively small data regime, a neural network model with deeper layers does not necessarily provide better overall performance. In general, choosing neural networks with residual connectivity (e.g., ResNet) and automatic architecture search capability (e.g., EfficientNet) gives better results. It should be noted that different hyperparameter settings affect the performance of the neural network models. However, in general, neural networks with residual connections (e.g., ResNet) and automatic search capabilities (e.g., EfficientNet) have better transfer performance.
References
Numerical analysis approach for models of Covid-19 and other epidemics
An integrated deterministic-stochastic approach for forecasting the long-term trajectories of COVID-19
World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19)
Development of a Machine-Learning System to Classify Lung CT Scan Images into Normal/COVID-19 Class
Computer Vision For COVID-19 Control: A Survey (2020)
Deep learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A multicentre study
Very deep convolutional networks for large-scale image recognition
Deep learning Enables Accurate Diagnosis of Novel Coronavirus (COVID-19) with CT images
Coronavirus (COVID-19) Classification using Deep Features Fusion and Ranking Technique (2020)
Deep Residual Learning for Image Recognition
Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection & Patient Monitoring using Deep Learning CT Image Analysis
Deep learning system to screen coronavirus disease 2019 pneumonia, Applied Intelligence (2020)
AI-assisted CT imaging analysis for COVID-19 screening: Building and deploying a medical AI
Deeply-supervised nets
Momentum Contrastive Learning for Few-Shot COVID-19 Diagnosis from Chest CT Images
Going deeper with convolutions
Rethinking the Inception Architecture for Computer Vision
A Novel and Reliable Deep Learning Web-Based Tool to Detect COVID-19 Infection from
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
A Light CNN for detecting COVID-19 from CT scans of the chest
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Practical guidelines for efficient cnn architecture design
Detection of coronavirus disease (COVID-19) based on deep features and support vector machine
Rethinking model scaling for convolutional neural networks
Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: combination of data augmentation methods in a small dataset
Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images
A Novel and Reliable Deep Learning Web-Based Tool to Detect COVID-19 Infection from
Radiologist-Level COVID-19 Detection Using CT Scans with Detail-Oriented Capsule Networks
Regularization for Unsupervised Deep Neural Nets
Compacting Neural Network Classifiers via Dropout Training
The Effectiveness of Data Augmentation in Image Classification using Deep Learning
A survey on Image Data Augmentation for Deep Learning
Searching for MobileNetV3
Inception-v4, Inception-ResNet and the impact of residual connections on learning
Modern Image Quality Assessment, Synthesis Lectures on Image, Video, and Multimedia Processing
No-reference quality assessment using natural scene statistics: JPEG2000
No-Reference Image Quality Assessment in the Spatial Domain
Deep inside convolutional networks: Visualising image classification models and saliency maps
Visual Explanations from Deep Networks via Gradient-Based Localization
Visualizing and understanding convolutional networks
Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks

To deeply understand the behavior of the 11 neural networks using weight visualization, visual explanations of the predictions of the convolutional neural networks are provided. The weight visualizations of the first layer are presented in A.18 to A.27.
As output, the weights of the current layer were obtained as grayscale images, and 16 of them were plotted. According to these images, certain pixels at the edges of the images are brighter than others.