key: cord-0200487-ej7juowp authors: He, Xin; Wang, Shihao; Chu, Xiaowen; Shi, Shaohuai; Tang, Jiangping; Liu, Xin; Yan, Chenggang; Zhang, Jiyong; Ding, Guiguang title: Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans date: 2021-01-14 journal: nan DOI: nan sha: 6316b4bc0a8cf6564d0f5a2ea5de007179681777 doc_id: 200487 cord_uid: ej7juowp The COVID-19 pandemic has spread globally for several months. Because its transmissibility and high pathogenicity seriously threaten people's lives, it is crucial to accurately and quickly detect COVID-19 infection. Many recent studies have shown that deep learning (DL) based solutions can help detect COVID-19 based on chest CT scans. However, most existing work focuses on 2D datasets, which may result in low quality models as the real CT scans are 3D images. Besides, the reported results span a broad spectrum on different datasets with a relatively unfair comparison. In this paper, we first use three state-of-the-art 3D models (ResNet3D101, DenseNet3D121, and MC3_18) to establish the baseline performance on the three publicly available chest CT scan datasets. Then we propose a differentiable neural architecture search (DNAS) framework to automatically search for the 3D DL models for 3D chest CT scans classification with the Gumbel Softmax technique to improve the searching efficiency. We further exploit the Class Activation Mapping (CAM) technique on our models to provide the interpretability of the results. The experimental results show that our automatically searched models (CovidNet3D) outperform the baseline human-designed models on the three datasets with tens of times smaller model size and higher accuracy. Furthermore, the results also verify that CAM can be well applied in CovidNet3D for COVID-19 datasets to provide interpretability for medical diagnosis. 
The Coronavirus Disease 2019 (COVID-19) pandemic is an ongoing pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The SARS-CoV-2 virus spreads easily among people via small droplets produced by coughing, sneezing, and talking. COVID-19 is not only highly contagious but also a severe threat to human lives. Infected patients usually present pneumonia-like symptoms, such as fever, dry cough, and dyspnea, as well as gastrointestinal symptoms, followed by a severe acute respiratory infection. The usual incubation period of COVID-19 ranges from one to 14 days. Many COVID-19 patients show no symptoms and do not even know that they have been infected, which can easily delay treatment and lead to a sudden exacerbation of the condition. Therefore, a fast and accurate method of diagnosing COVID-19 infection is crucial.

Currently, there are two commonly used methods for COVID-19 diagnosis. One is viral testing, which uses real-time reverse transcription polymerase chain reaction (rRT-PCR) to detect viral RNA fragments. The other is diagnosis based on characteristic imaging features in chest X-rays or computed tomography (CT) scan images. Ai et al. (2020) compared the effectiveness of the two diagnosis methods and concluded that chest CT detects the change from initial negative to positive faster than rRT-PCR. However, manually analyzing CT images and diagnosing from them relies heavily on professional knowledge and is time-consuming. Therefore, many recent studies have tried to use deep learning (DL) methods to assist COVID-19 diagnosis with chest X-rays or CT scan images. However, the reported accuracy of the existing DL-based COVID-19 detection solutions spans a broad spectrum because they were evaluated on different datasets, which makes a fair comparison difficult. Besides, most studies focus on 2D CT datasets (Singh et al. 2020; Ardakani et al. 2020; Alom et al. 2020), whereas real CT scans are usually 3D data, so it is necessary to use 3D models to classify them. To this end, we use three state-of-the-art (SOTA) 3D DL models to establish the baseline performance on three open-source 3D chest CT scan datasets: CC-CCII (Zhang et al. 2020b), MosMedData (Morozov et al. 2020), and COVID-CTset (Rahimzadeh, Attar, and Sakhaei 2020). The details are shown in Table 2.

In addition, designing a high-quality model for a specific medical image dataset is time-consuming and requires much expertise, which hinders the development of DL technology in the medical field. Recently, neural architecture search (NAS) has become a prevalent topic, as it can efficiently discover high-quality DL models automatically. Many studies have applied the NAS technique to image classification and object detection tasks (Pham et al. 2018; Liu, Simonyan, and Yang 2019; Tan et al.

• We use three manually designed 3D models to establish the baseline performance on the three open-source COVID-19 chest CT scan datasets.
• To the best of our knowledge, we are the first to apply the NAS technique to search for 3D DL models for COVID-19 chest CT scan datasets. Our DNAS framework can efficiently discover competitive neural architectures that outperform the baseline models on the three CT datasets.
• We use the Class Activation Mapping (CAM) (Zhou et al. 2016) algorithm to add interpretability to our DNAS-designed models, which can help doctors quickly locate the discriminative lesion areas on the CT scan images.

In recent years, DL techniques have proven effective in diagnosing diseases with X-ray and CT images (Litjens et al. 2017). To enable DL techniques to help detect COVID-19, an increasing number of publicly available COVID-19 datasets have been proposed, as shown in Table 1.
We separate the publicly available datasets into two categories, the pre-pandemic datasets and the post-pandemic datasets, which mainly differ in quality and quantity.

In the pre-pandemic period, gathering datasets for COVID-19 was a tough job because there was not enough data to collect. Most datasets in this period were gathered from medical papers or uploaded by the public. The IEEE8023 Covid-chestxray-dataset (Cohen, Morrison, and Dao 2020) is a dataset of COVID-19 cases with chest X-ray and CT images collected from public sources. However, its quality is not guaranteed, since the images are not verified by medical experts. Covid-ct-dataset (Zhao et al. 2020) is another CT dataset of COVID-19, mainly composed of CT images extracted from COVID-19 research papers, and its quality is low. The dataset only contains 2D information because each patient has only one to several CT images instead of a complete 3D scan volume.

Post-pandemic Datasets
During the pandemic, the number of confirmed cases of COVID-19 has been rising rapidly, which has made many high-quality COVID-19 chest CT scan datasets available, such as CC-CCII (Zhang et al. 2020b) and COVID-CTset (Rahimzadeh, Attar, and Sakhaei 2020). Some of them are annotated by doctors, e.g., COVID-19-CT-Seg-Dataset (Jun et al. 2020) and MosMedData (Morozov et al. 2020). The three datasets we use in this work are all from this category.

Much research is conducted on CT images, but the 3D information of CT images is under-explored, as in the work by (Mobiny et al. 2020; Singh et al. 2020). These works mainly propose 2D DL models for COVID-19 detection. Ardakani et al. (2020) benchmark ten 2D CNNs and compare their performance in classifying 2D CT images on a private dataset with 102 testing images. On the other hand, studies that use 3D CT images are relatively rare, mainly due to the lack of 3D COVID-19 CT scan datasets. Zheng et al. (2020) propose 3D CNNs with their private 3D CT datasets.
There are also some other studies conducted on X-ray images. For example, Narin, Kaya, and Pamuk (2020) propose three 2D DL models for COVID-19 detection. Zhang et al. (2020a) introduce a deep anomaly detection model for fast and reliable screening. Ghoshal and Tucker (2020) investigate the estimation of uncertainty and interpretability by a Bayesian CNN on X-ray images. Alom et al. (2020) use both X-ray images and CT images for segmentation and detection.

In recent years, NAS has produced many SOTA results by automatically searching for neural architectures for many tasks (He, Zhao, and Chu 2021; Elsken, Metzen, and Hutter 2018). Due to the success of NAS in natural image recognition (such as ImageNet (Deng et al. 2009)), researchers have also tried to extend it to medical datasets, for example magnetic resonance imaging (MRI) segmentation (Kim et al. 2019). Faes et al. (2019) use five public datasets (MESSIDOR, OCT images, HAM 10000, Paediatric images, and CXR images) to search for and train models with the Google Cloud AutoML platform. Their experimental results demonstrate that AutoML can generate classifiers that are competitive with manually designed DL models. However, to the best of our knowledge, no study has applied the NAS technique to search for 3D DL models for COVID-19 chest CT scan datasets. To this end, we apply the NAS technique to the three open-source COVID-19 chest CT scan datasets and successfully discover high-quality 3D models whose performance is comparable to that of the human-designed SOTA 3D models.

In this section, we first describe our search space for 3D CT scan classification models. Then, we introduce the differentiable neural architecture search (DNAS) method combined with the Gumbel Softmax technique (Jang, Gu, and Poole 2017; Dong and Yang 2019). There are two critical points to be considered before designing the search space.
One is that all datasets we use are composed of 3D CT scans; therefore, the searched model should be good at extracting information from three-dimensional spatial data. The other is that the model should be lightweight, as processing 3D data takes much longer than processing 2D image data. Yang 2019) is one of the most commonly used search spaces, but it has several problems: 1) the final model is built by stacking the same cells, which precludes layer diversity; 2) many searched cells are very complicated and fragmented and are therefore inefficient for inference. MobileNetV2 (Sandler et al. 2018) is a lightweight model manually designed for efficient inference on mobile and embedded devices. Several NAS studies (Tan et al. have successfully used its layer modules (Sandler et al. 2018), including inverted residuals and linear bottlenecks, to search for neural architectures and achieved SOTA results on 2D image datasets. Therefore, we use MobileNetV2 as a reference to design our 3D search space.

As shown in Fig. 1, we represent the search space by a supernet, which consists of a stem layer, a fixed number of cells, and a linear layer. The stem layer performs convolutional operations, and the last linear layer follows a 3D global average pooling operation (Zhou et al. 2016). Each cell is composed of several blocks, and the structures of all blocks need to be searched. The number of channels and the number of blocks differ between cells and are hand-picked empirically. By default, all blocks have a stride of 1. However, if a cell's input and output resolutions differ, its first block has a stride of 2. The blocks within the same cell have the same number of input/output channels. Inspired by MobileNetV2 (Sandler et al. 2018), each block is an MBConv-like module (see Fig. 1).
It consists of three sub-modules: 1) a point-wise (1×1×1) convolution; 2) a 3D depthwise convolution with a K × K × K kernel, where K is a searchable parameter; 3) another point-wise (1×1×1) convolution. Each convolution is followed by 3D batch normalization and a ReLU6 activation function (Howard et al. 2017), denoted by Conv3D-BN3D-ReLU6, except that the last convolution has no ReLU6 activation. Another searchable parameter is the expansion ratio e, which controls the ratio between the size of the input bottleneck and the inner size. For example, 5 × 5 × 5 MBConv6 denotes an MBConv with a 5 × 5 × 5 kernel and an expansion ratio of 6.

In our experiments, the search space is a fixed macro-architecture supernet consisting of 6 cells, each of which has 4 blocks, except for the last cell, which has only 1 block. We empirically collect the following set of candidate operations:
• Skip connection

Therefore, with 21 searchable blocks (5 × 4 + 1) and eight candidate operations per block, the search space contains $8^{21} \approx 9.2 \times 10^{18}$ possible architectures. Finding an optimal architecture in such a huge search space is a formidable task. We introduce our search strategy in the following.

According to (He, Zhao, and Chu 2021), gradient descent (GD) based NAS is an efficient method, and many studies use it to find competitive models in much less time and with fewer computational resources (Dong and Yang 2019; Wu et al. 2019) than other NAS methods. Hence, in this paper, we use the GD-based method and combine it with the Gumbel Softmax (Jang, Gu, and Poole 2017) technique to discover models for COVID-19 detection.

Preliminary: DARTS
DARTS (Liu, Simonyan, and Yang 2019) was one of the first studies to use a GD-based method to search for neural architectures. Each cell is defined as a directed acyclic graph (DAG) of N nodes, where each node is a network layer, and each edge between node i and node j indicates a candidate operation (i.e., block structure) that is selected from the predefined operation space O.
To make the search space continuous, DARTS (Liu, Simonyan, and Yang 2019) uses a Softmax over all possible operations to relax the categorical choice of a particular operation, i.e.,

$$\bar{o}_{i,j}(x) = \sum_{k=1}^{K} P^{k}_{i,j}\, o^{k}(x), \qquad P^{k}_{i,j} = \frac{\exp(\alpha^{k}_{i,j})}{\sum_{l=1}^{K} \exp(\alpha^{l}_{i,j})}, \tag{1}$$

where $o^{k}$ indicates the k-th candidate operation performed on input x, $\alpha^{k}_{i,j}$ indicates the weight for the operation $o^{k}$ between a pair of nodes (i, j), and K is the number of predefined candidate operations. The training and the validation loss are denoted by $L_{train}$ and $L_{val}$, respectively. Therefore, the task of searching for architectures is transformed into a bilevel optimization problem over the neural architecture $\alpha$ and the weights $\omega_{\alpha}$ of the architecture:

$$\min_{\alpha} \; L_{val}(\omega^{*}_{\alpha}, \alpha) \quad \text{s.t.} \quad \omega^{*}_{\alpha} = \arg\min_{\omega_{\alpha}} L_{train}(\omega_{\alpha}, \alpha). \tag{2}$$

Differentiable Model Sampling by Gumbel Softmax
In DARTS, as Fig. 2 (left) shows, the output of each node is the weighted average of the mixed operation during the whole search stage. As a result, the required computational resources grow linearly with the number of candidate operations. To alleviate this problem, we follow the same idea as (Dong and Yang 2019). Specifically, for each layer, only one operation is sampled and executed, according to the sampling probability distribution $P_{\alpha}$ defined in Equation 1. For example, the probabilities of being sampled for the three operations in Fig. 2 (left) are 0.1, 0.2, and 0.7, respectively, but only one operation is sampled at a time. Therefore, the sampling distribution $P_{\alpha}$ of all layers is encoded into a one-hot random distribution Z, e.g., $P_{\alpha} = [0.1, 0.2, 0.7]$ is encoded as $Z = [0, 0, 1]$ when the third operation is sampled.

However, each operation is sampled from a discrete probability distribution Z, so we cannot back-propagate gradients through Z to $\alpha$. To enable back-propagation, we use a reparameterization trick named Gumbel Softmax (Jang, Gu, and Poole 2017), which can be formulated by

$$Z^{k}_{i,j} = \frac{\exp\big((\log P^{k}_{i,j} + G^{k}_{i,j})/\tau\big)}{\sum_{l=1}^{K} \exp\big((\log P^{l}_{i,j} + G^{l}_{i,j})/\tau\big)}, \tag{3}$$

where $G^{k}_{i,j} = -\log(-\log(u^{k}_{i,j}))$ is the k-th Gumbel sample, $u^{k}_{i,j}$ is a uniform random variable, and $\tau$ is the softmax temperature. When $\tau \to 0$, the probability distribution over all operations between each pair of nodes approaches the one-hot distribution; as $\tau$ grows, it approaches the uniform distribution.
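The Gumbel Softmax sampling step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the `rng` argument, and the clamping constant are assumptions made for illustration. It draws Gumbel noise, forms the temperature-scaled softmax, and takes the argmax to obtain a one-hot choice.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(alpha, tau, rng):
    """Relaxed one-hot sample over the candidate operations of one layer.

    alpha: architecture parameters (logits), one per candidate operation.
    tau:   softmax temperature, must be positive.
    rng:   object with a random() method (hypothetical argument, used so
           that the sketch is reproducible).
    """
    # log P, where P = softmax(alpha), computed via log-sum-exp.
    peak = max(alpha)
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in alpha))
    log_p = [a - log_norm for a in alpha]
    # Gumbel noise G = -log(-log(u)) with u ~ Uniform(0, 1);
    # u is clamped away from 0 so that log(u) is always defined.
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in alpha]
    return softmax([(lp + g) / tau for lp, g in zip(log_p, noise)])


def hard_sample(relaxed):
    """One-hot vector that selects the argmax of a relaxed sample."""
    winner = max(range(len(relaxed)), key=relaxed.__getitem__)
    return [1 if k == winner else 0 for k in range(len(relaxed))]


rng = random.Random(0)
alpha = [math.log(0.1), math.log(0.2), math.log(0.7)]  # P = [0.1, 0.2, 0.7]
relaxed = gumbel_softmax(alpha, tau=0.5, rng=rng)
one_hot = hard_sample(relaxed)
```

Because the argmax of the perturbed log-probabilities follows the categorical distribution P, repeated calls select the third operation in roughly 70% of the draws, while lower temperatures push `relaxed` itself closer to a one-hot vector.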
Note that we apply the argmax function to Equation 3 during the forward pass but return the gradients according to Equation 3 during the backward pass.

As mentioned above, the last linear layer follows a 3D global average pooling layer, which enables us to use the class activation mapping (CAM) algorithm to generate 3D activation maps for our model. CAM exploits the global average pooling layer to calculate the activation map $M_c$ for class c, where each spatial element is given by

$$M_c(x, y, z) = \sum_{k} w^{c}_{k}\, f_k(x, y, z),$$

where, for a given image, $f_k(x, y, z)$ is the activation of unit k at the last convolutional layer before the global average pooling layer at spatial location (x, y, z), and $w^{c}_{k}$ is the corresponding linear layer weight of class c for unit k. After obtaining the class activation map, we can simply upsample it to the size of the input scan images to visualize and identify the regions most relevant to the specific class.

Figure 3: The pipeline of training 3D deep learning models. All CT scans need to be pre-processed by the slice sampling strategy to make sure that each scan contains the same number of slices. The input size of the network is bs × 1 × d × h × w, where bs is the batch size, d is the number of slices, and h and w indicate the height and width, respectively.

In this paper, we use three publicly available datasets: CC-CCII (Zhang et al. 2020b), MosMedData (Morozov et al. 2020), and COVID-CTset (Rahimzadeh, Attar, and Sakhaei 2020). The three datasets are all chest CT volumes. However, since the data format varies across the three datasets, each dataset must be pre-processed so that all of them can be read in a unified way. The original CC-CCII dataset contains a total of 617,775 slices of 6,752 CT scans from 4,154 patients, but it has five main problems (i.e., damaged data, non-unified data type, repeated and noisy slices, disordered slices, and non-segmented slices) that would have a strongly negative impact on model performance.
To solve these problems, we manually remove the damaged, repeated, and noisy data. Then we segment the lung part of the unsegmented slice images and convert the whole dataset to PNG format. After addressing the above problems, we build a clean CC-CCII dataset named Clean-CC-CCII, which consists of 340,190 slices of 3,993 scans from 2,698 patients.

Scan images construction
Each CT scan contains a different number of slices, but DL models require inputs of the same dimensions. To this end, we propose two slice sampling algorithms: random sampling and symmetrical sampling. Specifically, the random sampling strategy is applied to the training set and can be regarded as data augmentation, while the symmetrical sampling strategy is applied to the test set to avoid introducing randomness into the testing results. The symmetrical sampling strategy samples from the middle to both sides at equal intervals. The relative order between slices remains the same before and after sampling.

We use three manually designed 3D neural architectures as the baseline methods: DenseNet3D121 (Diba et al. 2017), ResNet3D101 (Tran et al. 2017), and MC3_18 (Tran et al. 2017). As shown in Fig. 3, after building the scan images with the sampling algorithm, we further apply transformations to the scans, including resize, center-crop, and normalization. Besides, for the training set, we also perform a random flip in the horizontal or vertical direction. The other implementation details are as follows: we use the Adam (Kingma and Ba 2015) optimizer with a weight decay of 5e-4. The learning rate starts at 0.001 and is annealed down to 1e-5. All baseline models are trained for 200 epochs.

To verify the efficiency of the method, we apply the DNAS method combined with the Gumbel Softmax technique to search for neural architectures on the three datasets. The pipeline of our method is shown in Fig. 4 and contains two sequential stages: the search stage and the evaluation stage.
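The two slice sampling strategies from the scan images construction step can be sketched as follows. This is only one plausible reading of the description: the exact interval rule, the function signatures, and the decision to reject scans with fewer slices than the target depth are assumptions, since the text does not specify them.

```python
import random


def symmetrical_sampling(num_slices, target):
    """Slice indices taken outward from the middle slice at a fixed interval.

    Deterministic, so it suits the test set. Assumes num_slices >= target;
    how shorter scans are handled is not described in the text.
    """
    if num_slices < target:
        raise ValueError("scan has fewer slices than the target depth")
    step = num_slices // target      # equal interval between picked slices
    middle = num_slices // 2
    first = middle - step * (target // 2)
    return [first + step * k for k in range(target)]


def random_sampling(num_slices, target, rng):
    """Random subset of slice indices, sorted to keep the original order.

    rng is a random.Random instance (hypothetical argument); a different
    subset per call acts as data augmentation on the training set.
    """
    if num_slices < target:
        raise ValueError("scan has fewer slices than the target depth")
    return sorted(rng.sample(range(num_slices), target))


test_indices = symmetrical_sampling(num_slices=100, target=16)
train_indices = random_sampling(num_slices=100, target=16, rng=random.Random(0))
```

Both functions return indices in increasing order, so the relative order between slices is preserved, and every scan ends up with the same depth d expected by the network input.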
Search stage
In our experiments, the supernet consists of 6 cells whose numbers of blocks are [4, 4, 4, 4, 4, 1]. The blocks within the same cell have the same number of channels. We test two settings, small-scale and large-scale, in which the numbers of channels of the blocks in the 6 cells are [24, 40, 80, 96, 192, 320] and [32, 64, 128, 256, 512, 1024], respectively. We name the models searched under the two settings CovidNet3D-S and CovidNet3D-L, respectively. The stem block is a Conv3D-BN3D-ReLU6 sequential module with the number of output channels fixed to 32. To improve searching efficiency, we set the input resolution to 64×64 and the number of slices in a scan to 16. We run three independent search experiments, one for each of the three datasets. During the search stage, we split the training set into a training set $D_T$ and a validation set $D_V$. In each step, we first use $D_V$ to update the architecture parameters $\alpha$, and then use $D_T$ to update the sampled architecture weights $\omega_{\alpha}$. The architecture parameters $\alpha$ are optimized by the Adam (Kingma and Ba 2015) optimizer, and the architecture weights are optimized by the SGD optimizer with a momentum of 3e-4. The initial learning rate for both optimizers is 0.001. Each experiment is conducted on four Nvidia Tesla V100 GPUs (the 32GB PCIe version) and finishes in about 2 hours. After each epoch, we save the sampled architecture and its performance (e.g., accuracy). Therefore, we obtain 100 neural architectures for each experiment after the search stage.

Evaluation stage
As Fig. 4 shows, the search stage records the performance of the sampled architectures. In the evaluation stage, we select the top-10 architectures and train them on the training set for several batches; the best-performing architecture is then retrained for 200 epochs on the full training set and evaluated on the test set.
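The selection logic of the evaluation stage can be sketched in a few lines. This is a simplified illustration rather than the authors' code: `search_records` and `short_train_score` are hypothetical names, and the short training run is abstracted into a callable that returns a score.

```python
def select_best_architecture(search_records, short_train_score, top_k=10):
    """Pick the architecture that will be retrained from scratch.

    search_records:    list of (architecture, search_accuracy) pairs, one
                       per search epoch (hypothetical data layout).
    short_train_score: hypothetical callable that trains an architecture
                       for a few batches and returns its score.
    """
    # Step 1: keep the top-k architectures by their search-stage accuracy.
    ranked = sorted(search_records, key=lambda rec: rec[1], reverse=True)
    candidates = [arch for arch, _ in ranked[:top_k]]
    # Step 2: briefly train each candidate and keep the best-scoring one.
    scores = [short_train_score(arch) for arch in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]
```

Only the architecture returned by this step would go through the full 200-epoch retraining, which keeps the evaluation cost at one long run plus ten short ones.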
We set different input resolutions for the three datasets to evaluate the generalization of the searched architectures. Besides, since the number of slices per CT scan differs across datasets, we set an intermediate value for each dataset, as shown in Table 3. Each evaluation experiment uses the same settings: we use the Adam (Kingma and Ba 2015) optimizer with an initial learning rate of 0.001, apply the cosine annealing scheduler (Loshchilov and Hutter 2016) to adjust the learning rate, and use cross-entropy as the loss function.

Our experimental results are summarized in Table 3. We compare our searched models with SOTA efficient models. We use several commonly used evaluation metrics to compare the model performance, as follows:

$$\text{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$$

Note that the positive and negative cases are assigned to the COVID-19 class and the non-COVID-19 class, respectively. Specifically, TP and TN indicate the numbers of correctly classified COVID-19 and non-COVID-19 scans, respectively. FP and FN indicate the numbers of scans wrongly classified as COVID-19 and as non-COVID-19, respectively. For the Clean-CC-CCII dataset, the non-COVID-19 class includes both normal and common pneumonia. The accuracy is the micro-averaged value over all test data and evaluates the overall performance of the model. Besides, we also take the model size as an evaluation metric to compare model efficiency.

Table 3 divides the results according to the datasets. We can see that our searched CovidNet3D models outperform all baseline models on the three datasets in terms of accuracy. Specifically, CovidNet3D-L models achieve the highest accuracy on the three datasets.
Besides, all CovidNet3D-S models are much smaller than the baseline models, but they can also achieve similar or even better performance. In summary, the results demonstrate that our DNAS method can consistently discover well-performing models across different network sizes, input sizes, and scan depths (numbers of slices).

We can also see that the performance of both the baseline models and our CovidNet3D on the MosMedData dataset is not as good as that on the other two datasets. There are two possible reasons. One is that the original data format of the MosMedData dataset is NIfTI, but none of our models converge when trained with NIfTI files; therefore, we convert NIfTI to Portable Network Graphics (PNG) format, and this conversion may lose information from the input files. The other possible reason is that the MosMedData dataset is imbalanced (shown in Table 2), which increases the difficulty of model training.

We also find through experiments that the random seed greatly influences the training of the searched CovidNet3D model. In other words, the results obtained by using different seeds for the same model can differ significantly. Hence, how to improve the robustness of NAS-based models is worth further exploration.

Although our model achieves promising results in detecting COVID-19 in CT images, the classification result itself does not help clinical diagnosis unless the inner mechanism that leads to the final decision is shown to make sense. To inspect the inner mechanism of our CovidNet3D model, we apply Class Activation Mapping (CAM) (Zhou et al. 2016) to it. CAM is an algorithm that can visualize the discriminative lesion regions that the model focuses on. In Fig. 5, we apply CAM to each slice of a whole 3D CT scan volume from the Clean-CC-CCII dataset. Regions that appear red and brighter have a larger impact on the model's decision to classify the scan as COVID-19.
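As a concrete illustration of how such a map is computed, the following sketch builds a 3D class activation map as a weighted sum of the last-layer feature maps. It is a minimal pure-Python version under assumptions: feature maps are stored as nested lists indexed as [slice][row][column], and the names are hypothetical rather than taken from the authors' code.

```python
def class_activation_map(features, class_weights):
    """Weighted sum of feature maps for one class.

    features:      K feature maps from the last convolutional layer, each a
                   nested list of shape depth x height x width (assumed layout).
    class_weights: K linear-layer weights of the class of interest.
    """
    depth = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    cam = [[[0.0] * width for _ in range(height)] for _ in range(depth)]
    for w_k, f_k in zip(class_weights, features):
        for z in range(depth):
            for y in range(height):
                for x in range(width):
                    cam[z][y][x] += w_k * f_k[z][y][x]
    return cam


def mean_value(volume):
    """Average over all spatial positions of a depth x height x width volume."""
    flat = [v for plane in volume for row in plane for v in row]
    return sum(flat) / len(flat)
```

Because the classifier is a linear layer on top of global average pooling, the spatial mean of the map equals the class score before the bias term, which is why large values in the map mark the regions that drive the prediction. In practice the map would then be upsampled to the input resolution and overlaid on each slice.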
From the perspective of the scan volume, we can see that some slices have more impact on the model's decision than others. In terms of a single slice, the areas that CovidNet3D focuses on show ground-glass opacity, which has been shown to be a distinctive feature of COVID-19 chest CT images. CAM provides interpretability for our searched models (CovidNet3D), helping doctors quickly locate the discriminative lesion areas.

Figure 5: Regions colored in red and brighter have more impact on the model's decision for the COVID-19 class, while blue and darker regions have less.

In this work, we introduce the differentiable neural architecture search (DNAS) framework combined with the Gumbel Softmax technique to search for 3D models on three open-source COVID-19 CT scan datasets. The results show that CovidNet3D, a family of models discovered by DNAS, can achieve results comparable to the baseline 3D models with a smaller size, which demonstrates that NAS is a powerful tool for assisting in COVID-19 detection. In the future, we will apply NAS to the task of 3D medical image segmentation to locate the lesion areas in a more fine-grained manner.

Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: results of 10 convolutional neural networks
Performance of radiologists in differentiating COVID-19 from non-COVID-19 viral pneumonia at chest CT
COVID-19 image data collection
ImageNet: A large-scale hierarchical image database
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Automated deep learning design for medical image classification by health-care professionals with no coding experience: a feasibility study
Sample-efficient deep learning for COVID-19 diagnosis based on CT scans. medRxiv
Categorical reparameterization with Gumbel-Softmax
Scalable neural architecture search for 3D medical image segmentation
Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
Progressive neural architecture search
SGDR: Stochastic gradient descent with warm restarts
A fully automated deep learning-based network for detecting COVID-19 from a new and large lung CT scan dataset. medRxiv
MobileNetV2: Inverted residuals and linear bottlenecks
Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks. European Journal of Clinical Microbiology & Infectious Diseases: official publication of European Society of Clinical Microbiology
FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search
COVID-19 screening on chest X-ray images using deep learning based anomaly detection
Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell
Deep learning-based detection for COVID-19 from chest CT using weak label. medRxiv
Neural architecture search with reinforcement learning
Learning transferable architectures for scalable image recognition

The research was supported by the grant RMGS2019 1 23 from Hong Kong Research Matching Grant Scheme. We would like to thank the anonymous reviewers for their valuable comments.