key: cord-0725355-yxmy2c58
authors: Wang, Shui-Hua; Zhu, Ziquan; Zhang, Yu-Dong
title: PSCNN: PatchShuffle Convolutional Neural Network for COVID-19 Explainable Diagnosis
date: 2021-10-29
journal: Front Public Health
DOI: 10.3389/fpubh.2021.768278
sha: 8c81f0f1b154c3eb8bce6b0c92ad186e0a9e3011
doc_id: 725355
cord_uid: yxmy2c58

Objective: COVID-19 is a sort of infectious disease caused by a new strain of coronavirus. This study aims to develop a more accurate COVID-19 diagnosis system. Methods: First, the n-conv module (nCM) is introduced. Then we built a 12-layer convolutional neural network (12l-CNN) as the backbone network. Afterwards, PatchShuffle was introduced to integrate with 12l-CNN as a regularization term of the loss function. Our model was named PSCNN. Moreover, multiple-way data augmentation and Grad-CAM are employed to avoid overfitting and locating lung lesions. Results: The mean and standard variation values of the seven measures of our model were 95.28 ± 1.03 (sensitivity), 95.78 ± 0.87 (specificity), 95.76 ± 0.86 (precision), 95.53 ± 0.83 (accuracy), 95.52 ± 0.83 (F1 score), 91.7 ± 1.65 (MCC), and 95.52 ± 0.83 (FMI). Conclusion: Our PSCNN is better than 10 state-of-the-art models. Further, we validate the optimal hyperparameters in our model and demonstrate the effectiveness of PatchShuffle.

COVID-19 is a form of infectious disease triggered by a new strain of coronavirus. CO means corona, VI virus, and D disease. Till 19/Sep/2021, this disease has led to more than 228.58 million confirmed cases and more than 4.69 million death tolls, shown in Figure 1 .

Two popular methods are commonly used to diagnose COVID-19. The first is real-time reverse-transcriptase polymerase chain reaction (rRT-PCR) (1) , which harnesses nasopharyngeal swab samples to examine the presence of ribonucleic acid (RNA) bits of the COVID-19 virus. The second is the so-called chest imaging that directly checks the radiological evidence of COVID-19 patients.

The chest imaging technologies exhibit five advantages to traditional rRT-PCR technologies. (i) The swab will possibly be polluted (2) . (ii) Chest imaging examines the lesions of lungs, called ground-glass opacity (GGO), which is distinguishing evidence to differentiate COVID-19 from healthy fellows. (iii) Publication reported that chest computed tomography (CCT), one type of chest imaging technology, is able to spot 97% of COVID-19 contagions (3) . (iv) Chest imaging is able to deliver an instant outcome once the imaging procedure is done. (v) Some COVID-19 variants/mutations could muddle the rRT-PCR tests since the variants/mutations may evade primer-probe sets. (8) combined wavelet entropy (WE) and biogeography-based optimization (BBO) to detect COVID-19. El-kenawy et al. (9) proposed a feature selection and voting classifier (FSVC) algorithm to classify COVID-19 in CT images. Chen (10) combined gray-level cooccurrence matrix (GLCM) and support vector machine (SVM) to detect COVID-19. Khan (11) used Pseudo Zernike Moment (PZM) technique to extract features from CT images for COVID-19 diagnosis. Pi (12) combined GLCM and extreme learning machine (ELM) for COVID-19 diagnosis. Wang (13) applied the Jaya algorithm to detect PatchShuffle was proposed by Kang et al. (14) . It can be embedded in any classification-oriented convolutional neural network (CNN) model. Through producing images and feature maps via interior order-less patches, PatchShuffle (PS) makes rich local variations, decreases the danger of network overfitting, and can be regarded as a useful addition to diverse kinds of training regularization practices. Based on PS, this study proposes a novel PatchShuffle convolutional neural network (PSCNN). The contributions are shown in Figure 2 , which comprises the following points:

1. The "n-conv module (nCM)" is introduced. 2. A 12-layer convolutional neural network (12l-CNN) is created as the backbone network. 3. A PSCNN is proposed where PS serves as the regularization term of the loss function. 4. Multiple-way data augmentation (MDA) is employed to assist in evading overfitting. 5. Grad-CAM is utilized to disclose the explainable heat map that indicates the locations of lung lesions.

The dataset in this study is described in reference (15) where they provided two datasets. The first dataset is smaller. The second one comprises a larger dataset of 320 COVID and 320 healthy control (HC) images. We use the latter dataset since it is bigger, and the results on the bigger dataset will be more reliable than those on the smaller dataset.

First, the raw dataset set

is extracted from reference (15) , where |N| is the number of images in dataset N 0 . The size of each image is size n 0 k = 1024 × 1024 × 3. The raw images look grayscale, however, those images are deposited in the format of RGB at the store servers of hospitals.

Second, all those raw images n 0 k are grayscaled to new images n 1 k . The equation is: Algorithm 1 | Pseudocode of five-step preprocessing.

Step A Import the raw image set N 1 . See Equation (1).

Step B RGB to grayscale: N 1 → N 2 . See Equation (2).

Step C Run histogram stretching:

Step D Margin crop: N 3 → N 4 . See Equation (4).

Step E Downscaling:

where f red , f green , and f blue extract the red, green, and blue channels from the raw image. Third, histogram stretching (HS) (16) is harnessed to improve the contrast of all grayscaled images N 1 = n 1 k . For the k-th image n 1 k , suppose its upper bound and lower bound grayscale values are n U 1 (k) and n L 1 k . The new HS-enhanced image n 2 k can be computed as

where n range 1 (k) is the grayscale range of the image n 1 k , x, y the indexes of width and height dimension, respectively, and (W1, H1) the width and height of the image n 1 , respectively. The HS-enhanced image n 2 k occupies the full grayscale range as [r min , r max ], where r min and r max mean the minimum and maximum grayscale values, respectively, as shown on the righthand side of Figure 3 .

Fourth, the scripts at the right region and the check-up bed at the bottom region are cropped, the cropping values of which are set to g 1 , g 2 , g 3 , g 4 , which stand for the pixels to be cropped from four positions: top, left, bottom, and right, respectively. The output image n 3 k is written as

where (W3, H3) mean the weight and height of any image n 3 , respectively, and x ′ , y ′ two ranges with the format of a : b, which means from integer a to integer b. Fifth, downsampling is implemented to decrease the image size and eradicate unneeded information. Assume the final size is (W, H), and the last image set N = n k is defined as where f ds is the downscaling function defined as

In all, the pseudocode of this five-step preprocessing is itemized in Algorithm 1. The input is the raw image set N 1 , and the output is the preprocessed image set N within which each image has the size of [W, H]. Figure 4A shows one preprocessed image of COVID-19 and Figure 4B delineated the corresponding lesions, which are outlined by red curves. Figure 4C presents one sample of an HC subject.

n-conv Module Table 1 presents the abbreviations and their explanations. An "n-conv module" (nCM) is introduced, comprising n-repetitions of a conv layer and a batch normalization (17) layer tailed by a max pooling (MP) (18) layer. The activation functions are ignored here. Figure 5 displays the schematic of our nCM 

where n m is the maximum integer of n. We find n m = 3 can achieve the best performances. We also test results using n = 4, but the performances do not improve.

Convolutional neural network is a new type of neural network (19, 20) that is particularly for analyzing visual images. An αlayer convolutional neural network is proposed as the backbone network based on the nCM concept. Its structure is listed in Table 2 , where α is defined as the number of weighted layers (NWL)-either convolutional layer or fully connected layer (FCL) (21) . The total layers of the backbone network are calculated as α = 9 i=1 α i = 12 (see Table 2 ) via trialand-error method. Hence, our backbone network is a 12-layer convolutional neural network (12l-CNN). We did not choose the transfer learning method since we found the backbone network developed from scratch can realize better performances than traditional transfer learning models.

The HC in Table 2 represents the hyperparameter configuration. In the nCM stage, the expression is in the format of which represents n repetitions of c 1 kernels with sizes of c 2 × c 2 , followed by an MP with a stride of c 3 . See Figure 5 to recap the structure of nCM.

In the FCL stage, the expression of HC is in the format of

which represents the size of the weight matrix in d 1 × d 2 , and the size of the bias vector in d 1 × 1. Finally, the last column in Table 2 shows the size of the feature map (SFM). Figure 6 shows the diagram of SFMs of each layer/module of this proposed 12l-CNN backbone network.

Kang et al. (14) proposed a novel PatchShuffle (PS) technique. Both input images and feature maps (FMs) undertake the PS transformation within each minibatch, so the pixels with the corresponding patch are shuffled. Through producing counterfeit images or FMs via interior order-less patches, PS generates local changes, and thus reducing the likelihood of overfitting. Long story short, PS is a helpful complement to present training regularization techniques (14) .

where f B stands for Bernoulli distribution. We can conclude that v = 1 with probability ε, and v = 0 with probability 1 − ε.

The resultant matrix after PSX is expressed aŝ

where G PS is defined as the PS function.

In a closer look, supposing the size of each patch {x} is q × q, i.e., x ∈ R q×q , we can rephrase the matrix X as

where x ij means a non-overlapping patch at i-th row and j-th column. The PS transformation runs on all patches as

where the PatchShuffled patch G PS x i,j is written as

where e ij stands for the row permutation matrix, and e ij ′ for the column permutation matrix. In routine computation, a randomly shuffle process is harnessed to substitute the row and column permutation processes. Each patch x i,j undertakes one of the q 2 ! doable permutations. For example, if q = 2, there are 2 2 ! = 24 possible shuffle operations as listed in Table 3 .

We propose a PatchShuffle convolutional neural network (PSCNN). It adds the PS operations on both the input image layer and the FMs of all the convolutional layers of the proposed backbone network 12l-CNN. See the results of PS on a grayscale image (Figure 7) and a color image (Figure 8) with discrete values of q = 2, 3, . . . , 8.

The diagram of building PSCNN from 12l-CNN is shown in Figure 9 , where both input images and feature maps of nCM (See dash arrows in Figure 9 ) are randomly picked up to undertake the PS operation. To grab the best bias-variance trade-off, merely a trivial percentage (ε) of the images or FMs will undertake G PS process.

For ease of reading, we analyze the mathematical mechanism by only considering running PS on input images. Supposing Frontiers in Public Health | www.frontiersin.org means the loss function (LF), the training LF of the proposed PSCNN is written as where represents the ordinary LF, PSCNN the LF of PSCNN, X the raw images, y the label, W the weights, and G PS (X) the PatchShuffled images.

Considering two extreme situations of v = 0 ∨ 1, we can deduce

which means the LF of PSCNN PSCNN X, y, W degrades to ordinary LF if v = 0, while the LF of PSCNN equals to training all images, PatchShuffled if v = 1.

If we take the mathematical expectation of v, Equation (15) is transformed to

where ε 1−ε G PS (X) , y, W serves as a regularization term.

The multiple-way data augmentation (MDA) method is used to help create fake training images so as to make our AI model avoid overfitting (22) . Compared to traditional data augmentation (DA), MDA can provide more diverse images than DA. In Reference (22) , nine data augmentation (DA) methods are applied to the raw training image e (w) and its horizontally mirrored image (HMI) e ′ (w). The diagram of MDA is shown in Figure 10 .

Step A. R 1 different DA methods (23) are utilized to e (w). Let Y r , r = 1, . . . , R 1 be each DA operation Frontiers in Public Health | www.frontiersin.org (24), we make X 1 augmented sets from the raw image e (w) as:

Algorithm 2 | Pseudocode of our 18-way DA on w-th raw image.

Input a raw preprocessed w-th training image e (w).

Step A We attain Y r [e (w)] , r = 1, . . . , R 1 . See Equation (18) . Each enhanced set comprises R 2 new images. See Equation (19) .

Step B An HMI is produced as e ′ (w) = η 1 [e (w)]. See Equation (20) .

Step C we obtain Y r [e ′ (w)] , r = 1, · · · , R 1 . See Equation (21).

Step D e (w), e ′ (w), Y r [e (w)] , r = 1, . . . , R 1 , and Y r e ′ (w) , r = 1, · · · , R 1 are combined via η 2 . See

Equation (22). (23). Let R 2 stand for the size of produced new images of each DA operation:

Step B. HMI is produced by:

where η 1 means horizontal mirror function.

Step C. All R 1 different DA methods run on the HMI e ′ (w), and produce R 1 new sets as:

Step D. The raw image e (w), the HMI e ′ (w), all R 1 -way DA results Y r [e (w)] of the raw image, and all R 1 -way DA results Y r e ′ (w) of HMI are combined. The final dataset from e (w) is 

where η 2 stands for the combination function. Let augmentation factor be R 3 that stands for the number of images in M (w), which is deduced as

Algorithm 2 recapitulates the pseudocode of our 18-way DA, which sets R 1 = 9 to yield an 18-way DA.

where D a (v) stands for the v-th fold of the whole dataset at a-th run (26) .

At v-th (1 ≤ v ≤ V) trial, the v-th fold is pinched out as the test set, and the remained V − 1 folds are selected as the training set:

Note: the training set is augmented via the MDA method described in section Multiple-way Data Augmentation. The PSCNN model is trained on the augmented training set. The trained model is dubbed M (a, v) , and the corresponding confusion matrix is dubbed L (a, v). After all the V-fold trials, the confusion matrix of a-th run is summarized as

Based on which, K indicators I a, k , k = 1, 2, . . . , K are deduced, which will be explained in the next section. Based on A runs, the mean and standard deviation (MSD) of all K measures are calculated as the form of I m (k) ± I SD k , which is defined as: Figure 11 shows the schematic of V-fold cross validation. Moreover, the V-fold cross-validation runs A times. At each run, the data division is reset randomly. Algorithm 3 summarizes the pseudocode of A-run of V-fold cross-validation. Bold means the best. 

Accuracy (29) is defined as:

F1 score reflects both the precision and the sensitivity. It is the harmonic mean of the preceding two measures: precision and sensitivity (30) . F1 score is defined as 

The minimum value of FMI is 0, corresponding to the worst binary classification, where all samples are misclassified. The maximum value of FMI is 1, corresponding to the best binary classification, where all samples are classified correctly. The receiver operating characteristic (ROC) curve (32) and the area under the curve (AUC) are introduced to provide a graphical plot and a quantitative value of measuring the proposed PSCNN model, respectively. ROC and AUC are obtained through the following two procedures: (i) ROC plot is firstly generated by charting the TP rate against the FP rate at different threshold degrees (33) . (ii) AUC is then estimated by measuring the complete 2D area beneath the ROC curve from point (0, 0) to point (1, 1) (34).

At last, gradient-weighted class activation mapping (Grad-CAM) (35) is harnessed to deliver explanations on how our PSCNN model creates the decision. The output of nCM-5 in Figure 9 is chosen for Grad-CAM.

The parameters and their values are itemized in Table 5 The maximum value of n in each nCM is set to 3. The backbone network contains α = 12 weighted layers. We use R 1 = 9 DA for each raw training image and its HMI. Each DA generates R 2 = 30 images. The augmentation factor is R 3 = 542, V = 10-fold cross-validation is employed, and 10 runs are performed on our cross-validation. In total K = 7 indicators are utilized. The PS probability is set to 0.05, and the patch size is 2 × 2. Results of Multiple-Way Data Augmentation (MDA) Figure 12 shows the results of MDA if choosing Figure 4A as the raw training image e (w). The 9-way results of the raw image are displayed while the HMI and its MDA results are not displayed due to the page limit. From Figure 12 , it is clear that MDA proliferates the varying degree of the training set. 

We validate the optimal parameters of PS in this experiment. The validation settings are the same as the previous experiment, but we change the probability ε and patch size q. Note that here patch size q may be either a square or a rectangle. The results with sundry combinations of ε and q are disclosed in Table 7 , and the three-dimensional bar plot is illustrated in Figure 13 .

The optimal parameter set unearthed from the 10-fold crossvalidation is the combination of the probability of ε = 0.05 and the patch size of q = 2 × 2, which are consistent with reference (14) .

This ablation experiment studies the effectiveness of PS. Suppose we remove the PS module from our PSCNN model; the remaining is the backbone network 12l-CNN. The results of the backbone network are shown in Table 8 . After comparing Tables 6, 8, we can conclude that PS can effectively increase the performances of the diagnosis model. The error bar plot of this comparison is shown in Figure 14 .

Furthermore, the ROC curves of the two models and their corresponding AUC results are illustrated in Figure 15 . The AUC of the 12l-CNN model is 0.9503, and the AUC of the PSCNN model is 0.9610. The results also indicate that PS is effective in our PSCNN model.

This proposed PSCNN model is compared with ten state-of-theart models: CSSNet (4), SMO (5), COVNet (6), WSF (7), WEBBO (8), FSVC (9), SVM (10), PZM (11), GLCM-ELM (12) , and Jaya (13) . The implementation of all the state-of-the-art models is the same as in previous experiments.

The comparison results are itemized in Table 9 . The corresponding three-dimensional bar plot is displayed in Figure 16 , in which all the models are sorted in terms of MCC. We can observe our PSCNN model achieves better performances than the other 10 stateof-the-art COVID-19 diagnosis models in terms of all seven measures. The reason can be found from previous Figure 2 , where we combine stacked nCMs and FCLs to build the backbone network 12l-CNN, based on which we integrate MDA, PS, and Grad-CAM to form the final network PSCNN.

We take Figure 4A as an example. Remember that the nCM-5 feature map in PSCNN is employed to create heatmaps via the Grad-CAM technology. In the previous experiments, we run our PSCNN model 10 times, generating 10 different models with different heatmaps. Due to the page limit, only the first three heatmaps are offered in Figures 17B-D and the manual delineation is shown in Figure 17A .

Traditional artificial intelligence (AI) is concerned as a black box (BB) that impedes its pervasive practice, in other words, the BB characteristic of old-fashioned AI is awkward for the approval of the Food and Drug Administration (FDA). Nonetheless, with the help of explainability of Grad-CAM, the physicians, radiologists, and/or patients shall gain confidence in the proposed PSCNN model, as the heatmaps deliver understandable interpretations of how our PSCNN model differentiates COVID-19 from healthy subjects. Recently, a load of new explainable-AIbased diagnosis systems are now approved by FDA (36) , because the doctors are aware of the relationships between the diagnosis labeling and the underlying reasons via the explainable heatmaps.

Our team proposes the PSCNN model for developing a more accurate COVID-19 diagnosis system. After introducing the nCM module, we develop a 12l-CNN backbone network and a PSCNN to diagnose COVID-19. Moreover, multiple-way DA is employed to avoid overfitting, and Grad-CAM is utilized to locate the lung lesions. The MSD values of the seven measures of our model are: 95.28 ± 1.03 (sensitivity), 95.78 ± 0.87 (specificity), 95.76 ± 0.86 (precision), 95.53 ± 0.83 (accuracy), 95.52 ± 0.83 (F1 score), 91.07 ± 1.65 (MCC), and 95.52 ± 0.83 (FMI).

Reflecting on this proposed model, there are three weak sides. First, the seven measures indicate the model can still be improved. Second, the edge of the heatmap is blurry. Third, our dataset is relatively small.

In future studies, we shall aim to use other advanced DL techniques, such as graph convolutional networks, to check whether we can further the performance of our models. Besides, more precise explainable AI techniques will be studied to provide more accurate heatmaps. Optimization algorithms (37) can help optimize the structures of networks. Finally, we shall test our model on other public datasets.

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

S-HW: conceptualization, methodology, software, investigation, writing-original draft, writing-review and editing, visualization, supervision, project administration, and funding acquisition. ZZ: methodology, validation, formal analysis, data curation, writing-review and editing, and visualization. Y-DZ: conceptualization, software, validation, formal analysis, resources, writing-review and editing, supervision, project administration, and funding acquisition. All authors contributed to the article and approved the submitted version. 

Multiple-way data augmentation (MDA) results. (A) Image rotation. (B) Salt-and-pepper noise

Gaussian noise. (I) Speckle noise. Frontiers in Public Health | www.frontiersin.org This paper is partially supported by Medical Research Council Confidence in Concept Award, UK (MC_PC_17171)

Seroprevalence of immunoglobulin M and G antibodies against SARS-CoV-2 virus: a systematic review and meta-analysis study

Delayed laboratory response to covid-19 caused by molecular diagnostic contamination

Correlation of chest CT and RT-PCR testing for coronavirus disease 2019. (COVID-19) in China: a report of 1014 cases

Predicting COVID-19 pneumonia severity on chest X-ray with deep learning

COVID-19 detection using deep learning models to exploit Social Mimic Optimization and structured chest X-ray images using fuzzy color and stacking approaches

Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy

A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT

COVID-19 detection via wavelet entropy biogeographybased optimization

Novel feature selection and voting classifier algorithms for COVID-19 classification in CT images

Covid-19 classification based on gray-level co-occurrence matrix support vector machine

Pseudo zernike moment and deep stacked sparse autoencoder for COVID-19 diagnosis

Gray level co-occurrence matrix and extreme learning machine for Covid-19 diagnosis

Covid-19 detection by wavelet entropy and jaya

ANC: Attention network for COVID-19 explainable diagnosis based on convolutional block attention module

Enhancement algorithm for high visibility of underwater images

Graphics forgery recognition using deep convolutional neural network in video for trustworthiness

Predicting defibrillation success in out-of-hospital cardiac arrested patients: moving beyond feature design

Distributed recurrent neural networks for cooperative control of manipulators: a game-theoretic perspective

Kinematic control of redundant manipulators using neural networks

Improved algorithm for multiple sclerosis diagnosis in MRI using convolutional neural network

MIDCAN: a multiple input deep convolutional attention network for Covid-19 diagnosis based on chest CT and chest X-ray

A machine learning-based clinical tool for diagnosing myopathy using multicohort microarray expression profiles

Detection and diagnosis of brain tumors using deep learning convolutional neural networks

Predicting self-harm within six months after initial presentation to youth mental health services: a machine learning study

An error detection method for real-time EPID-based treatment delivery quality assurance

Functional correlations in the pursuit of performance assessment of classifiers

Chronological discrimination of silver coins based on inter-elemental ratios using laser ablation inductively coupled plasma mass spectrometry (LA-ICP-MS)

Smart NIR tomoscopy to predict oxidative stress in rabbits

A computer-aided diagnosis of brain tumors using a fine-tuned YOLO-based model with transfer learning

Land usage identification with fusion of thepade SBTC and sauvola thresholding features of aerial images using ensemble of machine learning algorithms

Validation of the khorana score for prediction of venous thromboembolism after robot-assisted radical cystectomy

Software defect prediction using K-PCA and various kernel-based extreme learning machine: an empirical study

A predictive model of intrinsic factors associated with long-stay nursing home care after hospitalization

Beach state recognition using argus imagery and convolutional neural networks

Expert-augmented computational drug repurposing identified baricitinib as a treatment for COVID-19

Beetle antennae search algorithm for optimization problems

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpubh. 2021.768278/full#supplementary-material