key: cord-1049317-p7f6yh4v
title: Automated COVID-19 Grading With Convolutional Neural Networks in Computed Tomography Scans: A Systematic Comparison
authors: nan
date: 2021-10-08
journal: IEEE Trans Artif Intell
DOI: 10.1109/tai.2021.3115093
sha: e1f94ac97ca70d4c6b5b2b9e553cdf1b01a24010
doc_id: 1049317
cord_uid: p7f6yh4v

Amidst the ongoing pandemic, the assessment of computed tomography (CT) images for COVID-19 presence can exceed the workload capacity of radiologists. Several studies addressed this issue by automating COVID-19 classification and grading from CT images with convolutional neural networks (CNNs). Many of these studies reported initial results of algorithms that were assembled from commonly used components. However, the choice of the components of these algorithms was often pragmatic rather than systematic, and systems were not compared to each other across papers in a fair manner. We systematically investigated the effectiveness of using 3-D CNNs instead of 2-D CNNs for seven commonly used architectures, including DenseNet, Inception, and ResNet variants. For the architecture that performed best, we furthermore investigated the effect of initializing the network with pretrained weights, providing automatically computed lesion maps as additional network input, and predicting a continuous instead of a categorical output. A 3-D DenseNet-201 with these components achieved an area under the receiver operating characteristic curve (AUC) of 0.930 on our test set of 105 CT scans and an AUC of 0.919 on a publicly available set of 742 CT scans, a substantial improvement in comparison with a previously published 2-D CNN. This article provides insights into the performance benefits of various components for COVID-19 classification and grading systems. We have created a challenge on grand-challenge.org to allow for a fair comparison between the results of this and future research.

Imaging of COVID-19 with chest computed tomography (CT) has been found to be helpful for diagnosis of this disease in the current pandemic [1]. With the aim of reducing the workload of radiologists, various machine learning techniques have been proposed to automatically grade and classify the presence of COVID-19 in CT images [2]-[23]. Automatic COVID-19 classification methods have already been deployed in several medical centers [8]. By far the most common technique for automatic COVID-19 classification from CT images is the convolutional neural network (CNN) [24], [25], which is the current state of the art for image classification [26]. The works that use this approach can be divided into those that use 2-D CNNs [2], [6], [7], [11], [13], [15], [18]-[20], [22] and those that use 3-D CNNs [4], [9], [10], [12]-[14], [16], [17], [23]. While 3-D CNNs are directly capable of exploiting 3-D information present in CT volumes, 2-D CNNs can only indirectly use 3-D information by aggregating their output for individual slices of the image to produce an image-level prediction. 3-D CNNs are typically more memory intensive than 2-D CNNs, but graphics processing units (GPUs) with sufficient memory to train 3-D models are becoming increasingly available. Moreover, radiologists are specifically instructed to take 3-D information into account by inspecting different orthogonal views when assessing the suspicion of COVID-19 in CT scans [27]. This indicates that 3-D information is essential for radiologists in assessing the patterns indicative of COVID-19.
Additionally, the slice thickness of CT scans is becoming increasingly smaller [28], so that the scans contain more detailed 3-D information. Therefore, we hypothesize that 3-D CNNs are more suitable for COVID-19 classification from CT scans than 2-D CNNs.

A major issue that inhibits the utilization of artificial intelligence in real-world applications, such as COVID-19 diagnosis from CT, is the excessive focus of research on novel architectures, while scientifically sound comparisons and proper evaluations on external datasets are lacking. Often, small additions and adaptations to model architectures for incremental improvements on specific datasets are proposed that do not generalize well to other datasets. This issue is increasingly being recognized, and simple baselines have been proposed that perform comparably to or better than overengineered solutions [29], [30]. The goal of this article is therefore not to introduce novel architectural tweaks, but instead to perform a comparative study that evaluates existing approaches.

To indicate the generalization capabilities of automatic COVID-19 classification systems, some methods have been validated on data from different centers than the data that were used for training [4], [14]. Also, the same evaluation metrics, such as receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC), have been reported across different studies [2], [4], [6]-[10], [12]-[15], [18]-[20], [22], [23]. However, since each study used different datasets for training and for validation, the need for fair, direct comparisons of the performance of these algorithms remains unmet. Recently, the "CT images and clinical features for COVID-19" (iCTCF) dataset was made publicly available [31], enabling a fair comparison of COVID-19 classification methods.

This article compares a variety of 2-D and 3-D CNN architectures for COVID-19 classification. We trained and evaluated the approaches on the same internal dataset. Moreover, in an ablation study, we investigated performance changes due to 1) using transfer learning for 2-D and 3-D COVID-19 classification models, 2) using prior information in the form of COVID-19 related lesion segmentations as additional input to the network, and 3) replacing the categorical output with a continuous output. We furthermore created a public challenge [32] for evaluating and comparing different COVID-19 classification algorithms. Algorithms can be submitted to the challenge as Docker containers and are evaluated on the iCTCF dataset that we used in this article. This allows their performance to be compared to the methods presented in this article, as well as to other COVID-19 grading and classification algorithms that are submitted to the challenge.

3-D CNNs were initially proposed for processing video data [25], where the third dimension of the convolutional layers dealt with the temporal dimension. In later works, 3-D CNN architectures were derived from 2-D CNN architectures by expanding the 2-D filters into 3-D [33]. Methods based on these inflated 3-D CNNs, in particular the Inflated Inception-v1 (I3D) model, have recently been successfully employed for lung nodule detection and scan-level classification tasks from thorax CT scans [34], [35].
The large majority of the architectures used for COVID-19 classification from CT scans in previous works [2], [4]-[10], [12], [14]-[20], [22], [23], [36] are heavily or completely based on the ResNet [37], DenseNet [38], or Inception [39] architecture families. ResNet architectures in particular have been used frequently [2], [6], [8]-[10], [15]-[20], [36]. Some works did not use a full ResNet architecture, but did incorporate residual blocks into their model [5], [22]. Architectures from the DenseNet [4], [19], [23] and Inception [7], [14] families have been used less frequently. Other architectures such as VGG-19 [40], Inception-ResNet-v2 [41], NASNet [42], and EfficientNet [43] have also been used in research for COVID-19 classification from CT scans [36], [44]-[47]. Due to the lack of standardized data for testing across different works, previous research does not identify which architecture produces the best performance for COVID-19 classification from CT.

Fine-tuning is a widely used technique in research on deep learning in medical imaging [48] and COVID-19 classification specifically [49]. With fine-tuning, models are initialized with pretrained weights from models trained on a different task or dataset. They are commonly pretrained on the ImageNet [50] dataset, which contains a large variety of 2-D natural images. Afterward, the models are trained for the task at hand. Pretraining speeds up training and can offer performance gains for large models [48]. It has been used in several 2-D CNN COVID-19 classification methods [2], [6], [7], [18], [20]. Pretrained weights have also been used for 3-D CNN-based methods. Wang et al. [4] pretrained their model for COVID-19 classification on a large number of CT scans from lung cancer patients. Inflated 3-D CNNs can conveniently be initialized by inflating 2-D weights. 2-D weights have been used to pretrain I3D models for video classification [33] and chest CT classification [34] tasks.
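The inflation of pretrained 2-D weights mentioned above can be illustrated with a short sketch. The following is a minimal PyTorch example under the I3D-style scheme [33]: each 2-D kernel is repeated along the new depth axis and rescaled so that the inflated layer initially responds like the 2-D layer on an input that is constant along that axis. The function name and layer settings are illustrative and not taken from the article's code.

```python
import torch
import torch.nn as nn


def inflate_conv2d(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    """Inflate a pretrained 2-D convolution into a 3-D convolution (I3D-style).

    The 2-D kernel is repeated `depth` times along the new axis and divided by
    `depth`, so the inflated layer initially gives the same response as the 2-D
    layer on an input that is constant along the new axis.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(depth, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


# Example: inflate a 7 x 7 stem convolution, such as the first layer of an
# ImageNet-pretrained 2-D backbone, into a 7 x 7 x 7 3-D convolution.
stem2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
stem3d = inflate_conv2d(stem2d, depth=7)
print(stem3d.weight.shape)  # torch.Size([64, 3, 7, 7, 7])
```

Applying this to every convolution of a 2-D backbone yields a 3-D network that starts from the ImageNet solution rather than from random weights.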
Before presenting CT images to the CNN, they are often preprocessed by extracting the lung region using lung or lobe segmentation algorithms. These lung regions are then used to crop and center the image around the lungs [4], [6], [14], [16], [18], to suppress nonlung tissue [2], [4], [6], [8]-[10], [12], [15], [17], or both. Yang et al. [19] used a lung segmentation as an additional input channel and used lesion masks as extra information by training their model to perform lesion segmentation and COVID-19 classification simultaneously. Lessmann et al. [14] also added a lesion segmentation to the input of their model.

Most studies on automated detection of COVID-19 employ a categorical classification output format that uses a softmax or sigmoid activation [49]. Previous works have trained models to discern between COVID-19 positive and negative patients [4]-[6], [12], [15], [16], [18]-[20], [22], [23], between COVID-19 positive patients and patients with other types of pneumonia [4], [7], [9], and among all three classes [2], [10], [17]. In this work, we followed Lessmann et al. [14] and trained our models to produce CO-RADS [27] scores on chest CT scans of suspected COVID-19 patients. The CO-RADS score denotes the suspicion of COVID-19 on a scale from 1 to 5 and was developed to standardize reporting of CT scans of patients suspected of having COVID-19 [27].

Scoring systems, like CO-RADS, have been advocated for better communication between radiologists and other healthcare providers [14], [27]. CO-RADS scores were reported by a radiologist as part of routine interpretation of the scans. CO-RADS 1 was used for normal or noninfectious etiologies, indicating a very low level of suspicion. CO-RADS 2 was used if the CT scan was typical of infections other than COVID-19, indicating a low level of COVID-19 suspicion. CO-RADS 3 implies equivocal findings: features compatible with COVID-19 are present, but characteristics of other diseases are also found. CO-RADS 4 and 5 indicate a high and very high level of COVID-19 suspicion, respectively. We randomly split the dataset into a development set with 616 patients and an internal test set of 105 patients. The patients in the development set were split into 75% for training and 25% for validation using data stratification based on the CO-RADS scores. The distribution of CO-RADS scores over the different splits is displayed in Table I. All data splits were made such that all scans from a patient with multiple visits ended up in the same split.

2) External Test Data: For external evaluation, we used the publicly available "CT images and clinical features for COVID-19" (iCTCF) dataset [13], [31]. Since we focused on comparing architectures for CT image processing for COVID-19 classification, we did not incorporate the clinical features from this dataset into the input for our models. In iCTCF, patients were categorized with a Chinese grading system that distinguishes the classes control, mild, regular, severe, critically ill, and suspected. Since there was no etiological evidence available for the presence of COVID-19 in suspected cases [13], we did not use them for testing our models. The distribution of the other classes is displayed in Table II. The grading system uses etiological laboratory confirmation and other factors such as clinical features and CT imaging [13]. The control cases include both healthy patients and patients with community acquired pneumonia. Most of the iCTCF data has been made publicly available, but some CT scans were not available at the time of conducting this study. We validated our models with all available data from the first iCTCF cohort for which etiological evidence for the presence of COVID-19 was available [31].

We compared the performance of a variety of popular 2-D and 3-D CNN architectures for the task of COVID-19 classification from CT. More specifically, we compared vanilla 2-D and 3-D versions of DenseNet-121, DenseNet-169, DenseNet-201, Inception-v1, ResNet-18, ResNet-34, and ResNet-50. Section II describes previous works that have used many of these architectures. Since we used scan-level labels for training and testing these models, the 2-D architectures required the integration of a slice-wise reduction step, while the 3-D architectures did not. For the 2-D architectures, we therefore integrated the slice-wise reduction step presented by Li et al. [2]. First, the 2-D CNN extracts features of individual axial slices. A global max pooling step reduces these features to a 1-D vector, to which a fully connected layer is applied with an output size equal to the number of classes. We investigated whether additional model components had an effect on COVID-19 classification performance in an ablation study. Fig. 1 shows a summary of the processing pipeline that was used.
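As a concrete illustration of the slice-wise reduction step described above, the following is a minimal PyTorch sketch: a 2-D backbone extracts feature maps per axial slice, a global max pooling step reduces them to a single vector, and a fully connected layer produces the scan-level prediction. The use of a torchvision DenseNet-201 backbone and the exact pooling arrangement are our own assumptions; the article's implementation may differ in detail.

```python
import torch
import torch.nn as nn
import torchvision


class SliceWiseClassifier(nn.Module):
    """2-D backbone with a slice-wise reduction step (after Li et al. [2]).

    A 2-D CNN extracts feature maps for every axial slice, a global max pooling
    step reduces them to a single 1-D vector per scan, and a fully connected
    layer maps this vector to the class scores.
    """

    def __init__(self, num_classes: int = 5, pretrained: bool = True):
        super().__init__()
        backbone = torchvision.models.densenet201(
            weights="IMAGENET1K_V1" if pretrained else None)
        self.features = backbone.features                    # 2-D feature extractor
        self.fc = nn.Linear(backbone.classifier.in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, slices, channels, height, width)
        b, s, c, h, w = x.shape
        feats = self.features(x.view(b * s, c, h, w))         # per-slice feature maps
        feats = feats.view(b, s, *feats.shape[1:])            # (b, s, C, h', w')
        pooled = feats.amax(dim=(1, 3, 4))                    # global max pool over slices and space
        return self.fc(pooled)                                # scan-level prediction


if __name__ == "__main__":
    model = SliceWiseClassifier(num_classes=5, pretrained=False)
    scan = torch.randn(1, 8, 3, 240, 240)   # 8 axial slices here; the article uses 128
    print(model(scan).shape)                # torch.Size([1, 5])
```

Because the pooling over slices is a simple maximum, the same network can process scans with different numbers of slices without architectural changes.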
Since performing the ablation study for all 2-D and 3-D architectures would require substantial computational resources, the ablation study was instead performed with only the best performing architecture in terms of quadratic weighted kappa (QWK).

To aid the model in localizing COVID-19 related parenchymal lesions, we provided a lesion segmentation map as additional input in a separate input channel. More specifically, the CT image was fed into the first input channel, the lesion segmentation into the second channel, and the third channel was filled with zeros. When training models without the additional lesion segmentation input, the CT image was fed into all three input channels. A 3-D nnU-Net [29] trained by Lessmann et al. [14], which segments ground-glass opacities (GGOs) and consolidations, provided the lesion segmentations. GGOs and consolidations are biomarkers of major importance in diagnosing COVID-19 [27].

2) Dimensionality: Since various components were added to the models in the ablation study, we trained both the 2-D and 3-D variants of the best performing architecture. This allows for an analysis of the performance difference solely due to the dimensionality of the model in our complete processing pipeline. We also compared a continuous output to a categorical output in the ablation study; Section III-C4 describes the continuous output in detail. The categorical output replaces the continuous output in one of the models in the ablation study and in all models in the architecture search, but it is not incorporated in the main approach (indicated by the dashed line in Fig. 1).

We investigated the performance changes due to pretraining on a natural image classification task. The 2-D models were initialized with weights pretrained on ImageNet. The 3-D models were initialized with the same weights by inflating the pretrained 2-D convolution kernels to 3-D.

The standard output format of CNNs used for categorical classification does not capture the ordinal nature of the CO-RADS scoring system. Furthermore, although the CO-RADS scoring system allows for a higher level of interpretability than a binary system, a CO-RADS suspicion score of 3 indicates that it is unclear whether COVID-19 is present, which makes it difficult to decide on the onset of the positive class for the predicted scores in ROC analyses. For these reasons, we considered the CO-RADS classification to be a regression task. Hence, the model had one output node that was forced to the range (0,1) using the sigmoid function. CO-RADS scores were mapped to target values in the range [0,1] with a uniform spacing between CO-RADS classes, such that CO-RADS scores of 1 and 5 were assigned target values of 0 and 1, respectively. As the network had one output node, binary cross-entropy was used as loss function. With this method, unlike a standard categorical approach with a softmax layer and categorical cross-entropy loss, predictions that are further off from the target are penalized more heavily than predictions that are closer. To obtain a CO-RADS score during inference, the sigmoid output was multiplied by 4, rounded to the nearest integer, and added to 1. De Vente et al. [51] explored this approach for prostate cancer grading and found that it outperformed other regression and categorical output methods.
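A minimal sketch of this continuous-output encoding follows, assuming CO-RADS labels 1-5 and a single sigmoid output node as described above. The helper names are ours, and the numerically stable with-logits form of the binary cross-entropy is an implementation choice rather than a detail stated in the article.

```python
import torch
import torch.nn.functional as F


def corads_to_target(corads: torch.Tensor) -> torch.Tensor:
    """Map CO-RADS scores 1-5 to regression targets in [0, 1] with uniform spacing."""
    return (corads.float() - 1.0) / 4.0


def output_to_corads(sigmoid_output: torch.Tensor) -> torch.Tensor:
    """Map the sigmoid output back to a CO-RADS score (multiply by 4, round, add 1)."""
    return torch.round(sigmoid_output * 4.0) + 1.0


# Training step with a single output node and binary cross-entropy:
logits = torch.randn(4, 1)                                   # raw network outputs for a batch of 4 scans
targets = corads_to_target(torch.tensor([1., 3., 4., 5.])).unsqueeze(1)
loss = F.binary_cross_entropy_with_logits(logits, targets)

# Inference:
predicted_scores = output_to_corads(torch.sigmoid(logits))
print(loss.item(), predicted_scores.flatten().tolist())
```

With this encoding, a prediction that is two CO-RADS categories off incurs a larger loss than a prediction that is one category off, which is the ordinal behavior motivated above.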
The CT scans were clipped between −1100 and 300 Hounsfield units, normalized between 0 and 1, and resampled to an isotropic voxel spacing of 1.5 mm using linear interpolation. The scans were further preprocessed using a lung segmentation algorithm that was trained on data from patients with and without COVID-19 [52]. More specifically, any slices with a distance of 10 mm or more to the lung mask were discarded, and the remaining slices were cropped to 240 × 240 pixels around the center of the mask. Following previous research with I3D models [33]-[35], we trained our models with a fixed 3-D input size. To achieve this without adding extra slices that do not contain information regarding the presence of COVID-19, we uniformly sampled 128 axial slices along the z-axis.

We trained all networks with a batch size of 2, the Adam optimizer with β₁ = 0.9, β₂ = 0.999, and a learning rate of 10⁻⁴. Data augmentation consisted of random zooming between −20% and +20%, rotation between −15% and +15%, shearing between −10% and +10%, and elastic deformations in the axial plane; translation between −2 and +2 voxels in the z-direction and between −20 and +20 voxels in both the x- and y-directions; and additive Gaussian noise with a mean of 0 and a standard deviation between 0 and 0.01 (after intensity normalization between 0 and 1). To correct for the class imbalance, we monitored the performance on the validation data in the development set during training using samples balanced according to the distribution of CO-RADS classes in the training set. We used early stopping with a patience of 10 000 training batches and the QWK on the validation set as the stopping criterion. Gradient checkpointing [53] reduces GPU memory requirements for training deep neural networks without affecting performance. This technique was used when necessary to enable a batch size of 2 for the 2-D models. To rule out the possibility that performance differences between the 3-D and 2-D approaches were due to other factors such as preprocessing or data augmentation, we kept all hyperparameters the same during training. Each model was trained on a single GPU, using NVIDIA GeForce GTX TITAN X, GeForce GTX 1080, GeForce GTX 1080 Ti, GeForce RTX 2080 Ti, TITAN Xp, and A100 SXM4 cards.

The models were sensitive to the randomness of the training process introduced by initialization of weights without pretraining, sample selection, and data augmentation. In order to enable stable comparisons, we obtained ensembles by training 10 instances of the same model with different random seeds. The ensemble output was obtained by simply taking the mean of the individual model outputs. For categorical model ensembles, the output was the mean of the probability output vectors of the individual models. All results presented in Section IV were obtained from ensembles unless stated otherwise.

We evaluated the CO-RADS scoring performance using the QWK score. This measure accounts for the ordinal nature of the CO-RADS score by weighting mismatches between true and predicted labels differently based on the magnitude of the error. Following previous works on COVID-19 classification and grading [2], [4], [6]-[10], [12]-[14], diagnostic performance was evaluated using the AUC and ROC curves. We calculated 95% confidence intervals (CIs) with nonparametric bootstrapping and 1000 iterations [54]. Statistical significance was computed with the same bootstrapping method [55]. The AUCs that our models achieved on the external test set are additionally listed on the grand challenge platform [32] to allow for a direct comparison between our and future COVID-19 grading and classification solutions.
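A minimal sketch of a nonparametric bootstrap confidence interval of the kind used in this evaluation (1000 resamples with replacement) is shown below. The scikit-learn metric functions, helper names, and synthetic example data are our own assumptions, not the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score


def bootstrap_ci(y_true, y_pred, metric, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Nonparametric bootstrap confidence interval for an evaluation metric.

    Resamples cases with replacement n_boot times and returns the point estimate
    together with the (alpha/2, 1 - alpha/2) percentiles of the resampled metric.
    """
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:        # skip degenerate resamples (single class)
            continue
        scores.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), (lo, hi)


# Example with synthetic predictions: AUC for binary COVID-19 labels and
# quadratic weighted kappa (QWK) for CO-RADS scores.
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, 200)
probs = np.clip(labels * 0.6 + rng.normal(0.3, 0.25, 200), 0, 1)
print(bootstrap_ci(labels, probs, roc_auc_score))

corads_true = rng.integers(1, 6, 200)
corads_pred = np.clip(corads_true + rng.integers(-1, 2, 200), 1, 5)
qwk = lambda t, p: cohen_kappa_score(t, p, weights="quadratic")
print(bootstrap_ci(corads_true, corads_pred, qwk))
```

The same resampled distributions can also be used for significance testing by comparing the metric differences of two models over the bootstrap iterations.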
Inference duration was calculated on the same machine for each architecture, using a GeForce RTX 2080 Ti card. The reported durations were averaged over 50 forward passes of a batch with one sample. Table III shows the number of trainable parameters, single-model inference time for one sample, and FLOP count for each architecture.

All 2-D architectures were outperformed by their 3-D counterparts in terms of both QWK and AUC. The 3-D DenseNet-201 architecture performed best in terms of QWK, followed by the 3-D Inception-v1 architecture. In terms of AUC, the 3-D DenseNet-169 obtained the best performance, again followed by the 3-D Inception-v1 architecture. In the architecture selection, on average, training of the individual 3-D models required approximately 26 700 iterations, while it required about 29 800 iterations for the 2-D models. Since the QWK takes into account the ordinal nature of the CO-RADS score, this metric was used to select the architecture with which to execute the ablation study. In the rest of this section, we refer to the 3-D DenseNet-201 ensemble as the 3-D model and to the 2-D DenseNet-201 ensemble as the 2-D model. On the internal dataset, both the AUC and the QWK scores were significantly higher for the full 3-D model (with transfer learning, lesion segmentation input, and continuous output) than for the full 2-D model.

We also trained an ensemble with the COVNet pipeline from Li et al. [2], which contains a ResNet-50 backbone that was pretrained on ImageNet. With COVNet, we obtained a lower performance on the internal test set than when we applied the 3-D model in our own pipeline. COVNet obtained a QWK of 0.567 (95% CI: 0.411-0.703, p = 0.004) and a lower AUC of 0.828 (95% CI: 0.741-0.906, p = 0.017). Our 2-D model also outperformed COVNet in terms of both the QWK (p = 0.074) and AUC (p = 0.179). Fig. 5 shows confusion matrices for the two dimensionalities. For 13 scans, the full 3-D approach had predictions that were more than one CO-RADS category off. For the full 2-D approach, this was the case for 19 scans. Furthermore, the full 3-D approach and the 2-D approach both had two cases that were off by more than two categories.

The results of an ablation study to investigate the effect of each of the additional components added to the 3-D CNN are shown in Fig. 3. The 3-D model without ablations obtained an AUC of 0.930 (95% CI: 0.872-0.971) and a QWK of 0.785 (95% CI: 0.705-0.852). Removing any of the additions had a smaller effect on these performance metrics than changing the dimensionality of the architecture to 2-D. Removing pretraining reduced the QWK to 0.770 (95% CI: 0.682-0.789, p = 0.278), but increased the AUC to 0.932 (95% CI: 0.857-0.977, p = 0.428). When the lesion segmentation input was removed from the model, the QWK was increased to 0.812 (95% CI: 0.738-0.875, p = 0.091) and the AUC was reduced to 0.920 (95% CI: 0.859-0.969, p = 0.292). Replacing the regression approach with a categorical target changed the QWK to 0.799 (95% CI: 0.680-0.863, p = 0.421) and reduced the AUC to 0.919 (95% CI: 0.868-0.964, p = 0.324). Fig. 4 shows prediction examples from the ablation study models in black. The 3-D model required 31 550 iterations for training on average. The 2-D model, the network without pretraining, and the model with the categorical output all required fewer iterations (25 650, 31 000, and 22 450, respectively). The model without lesion input required more iterations (32 750).

For a single patch, the lesion segmentation model inference time was 178.66 ms ± 14.56 ms, using 9.41 × 10¹¹ FLOPs. The CT scans in the test set contained 12.8 patches on average. The model had 29.69 × 10⁶ parameters. Performance metrics for this model were reported by Lessmann et al. [14].
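Timings like the ones reported above can be obtained with a simple GPU benchmarking loop. The sketch below averages over 50 forward passes of a single-sample batch, as described at the beginning of this section; the warm-up passes, the use of torch.cuda.synchronize, and the stand-in backbone are our own choices rather than the authors' measurement code.

```python
import time
import torch
import torchvision


def mean_inference_time_ms(model: torch.nn.Module, sample: torch.Tensor,
                           n_passes: int = 50, n_warmup: int = 5) -> float:
    """Average forward-pass time in milliseconds for a batch with one sample."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    sample = sample.to(device)
    with torch.no_grad():
        for _ in range(n_warmup):                 # warm-up passes (not timed)
            model(sample)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_passes):
            model(sample)
        if device.type == "cuda":
            torch.cuda.synchronize()              # wait for all queued GPU work to finish
    return (time.perf_counter() - start) / n_passes * 1000.0


# Example with a 2-D backbone and a single 240 x 240 slice batch.
net = torchvision.models.densenet201(weights=None)
print(mean_inference_time_ms(net, torch.randn(1, 3, 240, 240)))
```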
In this article, we identified and tested components of CNN-based automated COVID-19 grading models. More specifically, we investigated how the performance of such models is affected by using different 2-D and 3-D CNN architectures, adopting pretrained weights, using automatically computed lesion maps as additional network input, and predicting a continuous output instead of a categorical output. We evaluated all models with the same datasets to allow for a fair comparison between models. Based on the architectures used in earlier automated COVID-19 classification research, we selected and compared the performance of the 2-D and 3-D variants of seven CNN architectures for this task. We found that for all architecture types, the 2-D models were outperformed by their 3-D counterparts. The best performing model was a 3-D DenseNet-201. In the rest of this section, we refer to the 3-D DenseNet-201 as the 3-D model and to the 2-D DenseNet-201 as the 2-D model. The full 3-D model (with transfer learning, lesion maps, and continuous output) outperformed the full 2-D model in terms of AUC and QWK score on the internal test set for COVID-19 classification and CO-RADS grading.

We compared our 2-D model with COVNet, an architecture previously used in a similar COVID-19 classification task in CT [2], for which the authors reported an AUC of 0.96 for differentiating between COVID-19 positive and negative patients. The substantial difference between this result and our observations with COVNet illustrates the importance of using the same dataset when comparing different approaches.

We also observed a better diagnostic performance for COVID-19 classification by the 3-D model on the external test set, although this performance increase was not statistically significant for a significance level of 0.05. The AUC was 0.919 for the full 3-D model. Using a superset of the external set used in this article for evaluation, a previously published model obtained an AUC of 0.919, which is the same as the AUC of our 3-D model, even though our 3-D model was trained with weaker labels and on data from a different population. This further emphasizes the importance of using 3-D rather than 2-D models.

The internal test set consisted of data from the same population as the data the model was trained on, while the external test set consisted of data from a different population. For the full 2-D model, a lower AUC was obtained on the internal test set than on the external test set. This difference might be due to population differences between the internal and external test set, or due to the different definitions of the positive class, which were presence of COVID-19 and high suspicion of COVID-19 for the internal and external test sets, respectively. On the external test set, the full 3-D model outperformed the full 2-D model by a smaller margin in terms of AUC than on the internal dataset. This difference could be partly due to the different definitions of the positive class. However, we also found that it partly arises from the larger overall slice thickness in the external test set. All scans in the internal test set had a slice thickness of 0.5 mm. In contrast, 207 scans (40 COVID-19 positive, 167 negative scans) in the external test set had a slice thickness larger than 1.5 mm, which was the input resolution in our training and testing pipeline.
When evaluating only on these scans, we obtained an AUC of 0.885 (95% CI: 0.835-0.931) for the full 3-D model and an AUC of 0.891 (95% CI: 0.843-0.932) for the full 2-D model. The external test set contained 535 scans (167 COVID-19 positive, 368 negative) with a slice thickness smaller than or equal to 1.5 mm. On these scans, we obtained an AUC of 0.926 (95% CI: 0.902-0.947) for the full 3-D model and an AUC of 0.918 (95% CI: 0.892-0.941) for the full 2-D model. The performance of both models was lower for scans with a large slice thickness, but this effect was more apparent for the 3-D model. Taking into account the increasingly smaller slice thickness of CT scans [28], this observation further supports our hypothesis that 3-D models are better suited for COVID-19 grading applications than 2-D models.

A possible explanation for why adding the extra dimension to the convolutions improves the performance is that it allows the CNN to take into account the 3-D structure and full volume of individual lesions. This explanation is in line with the fact that radiologists typically use both the axial and coronal views to visualize the spread of COVID-19 related lesions across the lungs in CT scans, such as GGOs [27]. We could not directly compare the CO-RADS classification performance on the external set, since CO-RADS labels were not available. Moreover, the CO-RADS grading cannot be directly translated to the system used in the iCTCF dataset, since the former measures the probability of COVID-19 presence, while the latter quantifies the severity of the disease.

The ablation study on the internal test set showed that the further additions to the network and training procedure did not have a significant effect on the performance. Regardless of performance increases, using a continuous output removes the disadvantage of having to decide on the onset of the positive class for the predicted CO-RADS scores. Adding lesion maps as input and using inflated ImageNet weights for pretraining might both be ineffective for automated 3-D CNN-based COVID-19 grading methods. The full 2-D DenseNet-201 model obtained a better performance than the 2-D DenseNet-201 model without pretraining, additional lesion map input, and continuous output. This indicates that some of these additional components positively affected the performance of the 2-D model. However, even with all additional components, it was still outperformed by the vanilla 3-D DenseNet-201. We did not use the clinical features available for the external dataset as input to the models trained in this work, since the main goal of this article was to demonstrate the effect on performance of different COVID-19 grading and classification algorithm components.

We compared a variety of 2-D and 3-D CNN architectures for COVID-19 classification from CT scans and found that for all architectures considered, the 3-D variants outperformed their 2-D counterparts. We investigated how the performances of the best performing architecture and its 2-D counterpart were affected by including COVID-19 related lesion segmentations as additional input, using pretrained weights, and replacing the categorical output with a scalar continuous output. We intentionally did not develop novel nontrivial architectural tweaks for small performance improvements, as many of them have been shown to be unnecessary and to not generalize well to other datasets and tasks [29], [30].
We leave systematic comparisons that explore other transfer learning schemes, make use of slice-level annotations, and use clinical features as model input for future work.

Radiologists can be aided in assessing CT scans for the presence of COVID-19 by automatic COVID-19 grading systems. This article advances and speeds up the development of such systems in the following ways. First, our findings aid in advancing the performance of automated COVID-19 grading systems and provide insight into the performance benefits of several of their components. These insights primarily indicate that future research and clinical applications should move towards using 3-D CNNs for COVID-19 grading in CT scans. Second, the models and the automatic evaluation method used in this article have been made available on the online grand challenge platform [32]. This allows researchers to obtain and compare the performance of their COVID-19 grading and classification solutions to other solutions on the platform. Third, the output of all models used in this article adheres to the standardized CO-RADS reporting system to facilitate easier integration into the clinical workflow.

This publication is made possible in part by funding from the European Regional Development Fund (ERDF) East Netherlands.

References
[1] The role of imaging in 2019 novel coronavirus pneumonia (COVID-19)
[2] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
[3] Coronavirus (COVID-19) classification using CT images by machine learning methods
[4] A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis
[5] Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks
[6] Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images
[7] A deep learning algorithm using CT images to screen for corona virus disease (COVID-19), medRxiv
[8] AI-assisted CT imaging analysis for COVID-19 screening: Building and deploying a medical AI system in four weeks
[9] Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia
[10] Prior-attention residual learning for more discriminative COVID-19 screening in CT images
[11] Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: A prospective study
[12] Deep learning-based detection for COVID-19 from chest CT using weak label
[13] iCTCF: An integrative resource of chest computed tomography images and clinical features of patients with COVID-19 pneumonia
[14] Automated assessment of CO-RADS and chest CT severity scores in patients with suspected COVID-19 using artificial intelligence
[15] Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
[16] Efficient and effective training of COVID-19 classification networks with self-supervised dual-track learning to rank
[17] Deep learning in CT images: Automated pulmonary nodule detection for subsequent management using convolutional neural network
[18] Rapid AI development cycle for the coronavirus (COVID-19) pandemic: Initial results for automated detection & patient monitoring using deep learning CT image analysis
[19] CT image dataset about COVID-19
[20] Coronavirus detection and analysis on chest CT with deep learning
[21] Hypergraph learning for identification of COVID-19 with CT imaging
[22] COVID-AL: The diagnosis of COVID-19 with deep active learning
[23] Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets
[24] Handwritten digit recognition with a backpropagation network
[25] 3D convolutional neural networks for human action recognition
[26] ImageNet classification with deep convolutional neural networks
[27] CO-RADS - A categorical CT assessment scheme for patients with suspected COVID-19: Definition and evaluation
[28] Comparing and combining algorithms for computer-aided detection of pulmonary nodules in computed tomography scans: The ANODE09 study
[29] nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation
[30] A comprehensive analysis of deep regression
[31] CT images and clinical features for COVID-19
[32] Grand challenge - COVID-19 CT classification challenge
[33] Quo vadis, action recognition? A new model and the Kinetics dataset
[34] End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography
[35] Lung nodule detection and classification from thorax CT-scan using RetinaNet with transfer learning
[36] Multi-task contrastive learning for automatic CT and X-ray diagnosis of COVID-19
[37] Deep residual learning for image recognition
[38] Densely connected convolutional networks
[39] Going deeper with convolutions
[40] Very deep convolutional networks for large-scale image recognition
[41] Inception-v4, Inception-ResNet and the impact of residual connections on learning
[42] Learning transferable architectures for scalable image recognition
[43] EfficientNet: Rethinking model scaling for convolutional neural networks
[44] Artificial intelligence augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other origin at chest CT
[45] COVID-19 detection in computed tomography images with 2D and 3D approaches
[46] FBSED based automatic diagnosis of COVID-19 using X-ray and CT images
[47] COVID-19 diagnosis on CT scan images using a generative adversarial network and concatenated feature pyramid network with an attention mechanism
[48] Transfusion: Understanding transfer learning for medical imaging
[49] Automated detection and forecasting of COVID-19 using deep learning techniques: A review
[50] ImageNet: A large-scale hierarchical image database
[51] Deep learning regression for prostate cancer detection and grading in bi-parametric MRI
[52] Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans
[53] Training deep nets with sublinear memory cost
[54] Bootstrap estimation of diagnostic accuracy with patient-clustered data
[55] Advantages and examples of resampling for CAD evaluation

Coen de Vente received the B.S. and M.S. degrees in biomedical engineering from the Eindhoven University of Technology, the Netherlands, in 2017 and 2019, respectively. There, he followed a track on medical imaging, a collaborative effort with the University Medical Center Utrecht, the Netherlands. Since 2019, he has been a Ph.D. candidate with the Diagnostic Image Analysis Group, Department of Medical Imaging, Radboudumc, Nijmegen, the Netherlands. His current research focuses on deep learning techniques for medical image analysis and screening of eye diseases from retinal imaging.

Luuk H. Boulogne received the B.S. and M.S. degrees in artificial intelligence from the University of Groningen, Groningen, the Netherlands, in 2016 and 2018, respectively. Since 2019, he has been a Ph.D. candidate with the Diagnostic Image Analysis Group, Department of Medical Imaging, Radboudumc, Nijmegen, the Netherlands. His research focuses on predicting the effects of lung volume reduction surgery on a patient's lung function.

Kiran Vaidhya Venkadesh received the B.Tech. and M.Tech. degrees from the Engineering Design Department, Indian Institute of Technology, Madras, India, in 2016, with a specialization in biomedical design.
After his graduation, he worked with Predible Health and developed deep learning solutions for medical image analysis for the Indian healthcare system. Since 2019, he has been a Ph.D. candidate with the Diagnostic Image Analysis Group, Department of Medical Imaging, Radboudumc, Nijmegen, the Netherlands. There, he is working on early lung cancer detection with an emphasis on temporal analysis of chest CT scans using deep learning.

Cheryl Sital received the B.S. and M.S. degrees in biomedical engineering from the Eindhoven University of Technology, Eindhoven, the Netherlands, in 2017 and 2019, respectively. There, she followed a track focused on medical imaging, a collaborative effort with the University Medical Center Utrecht, the Netherlands. In 2020, she joined the Diagnostic Image Analysis Group, Department of Medical Imaging, Radboudumc, Nijmegen, the Netherlands, as a Ph.D. candidate. She works on deep learning techniques for improved assessment of oncological CT scans.

Since 2019, he has been a tenure-track researcher with the Diagnostic Image Analysis Group, Department of Medical Imaging, Radboudumc, Nijmegen, the Netherlands, where he is leading the research group on musculoskeletal image analysis with machine learning and artificial intelligence.