Intra-model Variability in COVID-19 Classification Using Chest X-ray Images

Brian D. Goodwin, Corey Jaskolski, Can Zhong, and Herick Asmani

2020-04-30

Abstract

X-ray and computed tomography (CT) scanning technologies for COVID-19 screening have gained significant traction in AI research since the start of the coronavirus pandemic. Despite these continuous advancements in COVID-19 screening, many concerns remain about model reliability when used in a clinical setting. Much has been published, but with limited transparency about expected model performance. We set out to address this limitation through a set of experiments to quantify baseline performance metrics and variability for COVID-19 detection in chest x-ray for 12 common deep learning architectures. Specifically, we adopted an experimental paradigm controlling for train-validation-test split and model architecture, where the sources of prediction variability are model weight initialization, random data augmentation transformations, and batch shuffling. Each model architecture was trained 5 separate times on identical train-validation-test splits of a publicly available x-ray image dataset provided by Cohen et al. (2020). Results indicate that, even within a single architecture, model behavior varies in a meaningful way between trained models. The best performing models achieve a false negative rate of 3 out of 20 for detecting COVID-19 in a hold-out set. While these results show promise in using AI for COVID-19 screening, they further support the urgent need for diverse medical imaging datasets for model training in a way that yields consistent prediction outcomes. It is our hope that these modeling results accelerate work in building a more robust dataset and a viable screening tool for COVID-19.

Introduction

The spread of the novel coronavirus, which causes COVID-19, has caught most of the world off guard, resulting in severely limited testing capabilities. For example, as of April 15, 2020, almost 3 months since the first case in the US, only about 3.3 million tests had been administered [1], which equates to approximately 1% of the US population. Reverse transcription-polymerase chain reaction (RT-PCR) is an assay commonly used to test for COVID-19, but it is available in extremely limited capacity [2], [3]. In an effort to offer a minimally invasive, low-cost COVID-19 screen via x-ray imaging, AI engineers and data scientists have begun to collect datasets [4] and apply computer vision and deep learning algorithms [5]. All of these efforts seek to leverage an available medical imaging modality for both diagnosis and, in the future, prediction of case outcome.

Clinical observations have largely propelled AI research in computer vision for screening COVID-19, and these reports cite differentiable lung abnormalities of COVID-19 patients in chest CT [6], x-ray [7], [8], and even ultrasound [9]. Current research also shows that COVID-19 is correlated with specific biomarkers in x-ray [10]. Though these recent efforts are valuable in that they will lay the foundation for future work in this area, there are significant flaws in both the methodology and the behavior of the resultant models. Much of the initial work on COVID-19 prediction from chest x-ray used a training set that included a little over 100 images with 10 test images (which were, in fact, identical to the validation set).
Though such small test sets do not allow for sweeping claims of diagnostic value, popular media articles have unfortunately hyped the value of these models with hopeful titles like "Coronavirus Neural Network can help spot COVID-19 in Chest X-rays" [11], "How AI Is Helping in the Fight Against COVID-19" [12], and "A.I. could help spot telltale signs of coronavirus in lung X-rays" [13]. Moreover, network weights for these published models are not publicly available. We have responded to this shortcoming by providing pre-trained weights for many of the most common deep learning architectures for computer vision, and we have made the code for pre-training freely available. To our knowledge, this repository of pre-trained model weights is the first of its kind in response to the current crisis and the first to report prediction results across multiple architectures on a test set that is held out from both the validation and training sets. Our goal is to facilitate the advancement of screening technology for COVID-19 and to highlight the need for larger, more diverse datasets.

The urgency for a clinical methodology to screen for COVID-19 cannot be overstated [14]. Our hope is twofold: 1) that the community advances computer vision for COVID-19 detection via x-ray before recommending use in a clinical setting, and 2) that pre-trained model weights will help accelerate ongoing development in AI to augment the decision-making process for clinicians during a time when healthcare workers are under a severe amount of stress.

Methods

We carried out a series of experiments to quantify baseline machine-learning performance in detecting COVID-19 from chest x-ray images using a series of common, openly available neural network architectures. Computational benchmarking was outside our experimental scope, since it has been studied extensively [15]. In this study, we focused on quantifying the expected variability in prediction outcomes and sought to quantify the reliability of predictions with respect to the chest x-ray data that is currently available to the public and contains COVID-19 positive scans.

All computational experiments were executed on Lambda Blade (lambdalabs.com) hardware (8x RTX 8000 + NVLink GPUs with 48 GB GDDR6 RAM, 2x Intel Xeon Gold 5218 processors with 512 GB RAM) using the PyTorch framework [16]. An identical train-validation-test split (TVTS) was employed for all experiments, and each model architecture was trained 5 separate times (creating 5 separate models each). We controlled for TVTS and model architecture while allowing randomness in weight initialization and data augmentation during batch training for each experiment. We designed our approach to elucidate the expected variability in model behavior for COVID-19 detection in chest x-ray.

Because COVID-19 datasets are becoming more abundant and existing sets are constantly growing and evolving, it has become important to cite both the data source and the day it was acquired. Specifically, this study uses data from the Cohen et al. [4] chest x-ray dataset as acquired on 2020-04-17. This dataset contains three classes: 1) healthy, 2) community acquired pneumonia (CAP), and 3) COVID-19 (examples in Figure 1). We employed a modified 80-10-10 TVTS (Table I): the split was modified to maximize the number of COVID-19 training samples, and the dataset version at the time of this study was large enough to accommodate the addition of a COVID-19 validation set and a test hold-out set 2x larger than in previous work (from 10 hold-out images to 20) [5].
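To make this experimental paradigm concrete, the following is a minimal sketch of the protocol, not the authors' released code: the TVTS is fixed once, while each of the 5 runs per architecture re-randomizes only the final-layer initialization, augmentation draws, and batch order. The names architectures, build_model, and train_and_evaluate are hypothetical.

```python
N_RUNS = 5  # five independently trained models per architecture

def run_experiments(architectures, dataset, fixed_split):
    """Train every architecture N_RUNS times on one fixed train-val-test split."""
    results = {}
    for arch_name, build_model in architectures.items():
        runs = []
        for _ in range(N_RUNS):
            # No random seed is fixed on purpose: run-to-run variability comes
            # from the random init of the new final layer, random augmentation
            # draws, and batch shuffling; the data split never changes.
            model = build_model()  # fresh model with a newly initialized head
            runs.append(train_and_evaluate(model, dataset, fixed_split))  # hypothetical helper
        results[arch_name] = runs
    return results
```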
We elected to generate baseline results for the following commonly used architectures: Resnet-18, -50, -101, and -152 [17]; WideResnet-50 and -101 [18]; ResNeXt-50 and -101 [19]; MobileNet-v2 [20]; and Densenet-121, -169, and -201 [21]. Adam optimization was used to train only the last fully connected layer in each network with a batch size of 128, resulting in a mean compute time of 156.7 ± 50.7 sec/epoch across all architectures. Since all models had been pre-trained on ImageNet [22], we elected to freeze the convolutional layers to retain the higher-level learned features; i.e., all weights were frozen except those in the final layer of each network. Models were trained on chest x-ray images of size 3 × 512 × 512 px. All images were treated as 3-channel RGB despite their inherent grayscale property; therefore, no manipulations were made to the networks to accommodate single-channel x-ray imaging data.

All models were trained for up to 100 epochs with stopping criteria, and the weights from the epoch with the lowest validation loss were saved; COVID-19 recall was not considered during training. All experiments were carried out using weighted cross entropy loss (wCEL), where the contribution to the loss from a given class is weighted based on its representative proportion in the total dataset. Our decision to use wCEL (as opposed to unweighted CEL) was based on the objective of achieving high recall in the underrepresented class (the COVID-19 class), since the AI task is straightforward: detect COVID-19. The stopping criterion was a plateau in the validation wCEL. Performance metrics were then calculated using only the test hold-out set. Our first iteration of testing (not reported in this paper) used CEL; performance metrics improved dramatically when models were instead trained using wCEL (as reported in this paper). Validation losses followed a common trend across models during training, with a distribution illustrated in Figure 2.

Each training experiment was carried out on an identical TVTS of the dataset. For each architecture, 5 training runs were carried out to gather a small distribution of models (and prediction outcomes). The class prediction (and therefore COVID-19 detection) was based solely on the maximum value output from the softmax layer (3 total classes). Sources of variability across all experiments included image augmentation transformations based on random draws from a binomial probability distribution, shuffling for batch allocation (i.e., each batch ID does not contain identical images across all experiments), and random weight initialization (last layer only).

Datasets were prepared in a manner consistent with Wang et al. [5] (see github.com/lindawangg/COVID-Net), and a data augmentation protocol was implemented. Given the consistent format of a chest x-ray, only modest translational and rotational augmentations were applied, along with brightness jitter and a possibility of horizontal flip (Table II).
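A minimal PyTorch sketch of this training configuration, under stated assumptions: the augmentation magnitudes, class counts, and learning rate below are illustrative placeholders (the study's exact values are in Tables I and II), and the inverse-frequency weighting is one common scheme for the class-proportion weighting described above; Densenet-121 stands in for any of the 12 backbones.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Modest augmentation per the protocol above (magnitudes are assumptions)
train_tf = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomHorizontalFlip(p=0.5),                      # possible horizontal flip
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),  # modest rotation/translation
    transforms.ColorJitter(brightness=0.1),                      # brightness jitter
    transforms.ToTensor(),  # x-rays are loaded as 3-channel RGB, per the paper
])

# ImageNet-pretrained backbone with every layer frozen except a new 3-class head
model = models.densenet121(pretrained=True)
for p in model.parameters():
    p.requires_grad = False
model.classifier = nn.Linear(model.classifier.in_features, 3)  # healthy, CAP, COVID-19

# wCEL: weight each class by its inverse frequency in the training set
# (counts below are placeholders, not the dataset's actual class counts)
class_counts = torch.tensor([800.0, 500.0, 100.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Adam on the final fully connected layer only (learning rate is an assumption)
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
```

Because only the final layer receives gradients, each run's 5 trained variants differ solely through the random head initialization, augmentation draws, and batch order noted above.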
Multiple measures of accuracy and uncertainty were calculated to quantify baseline performance expectations for each network. We report statistical tests, model performance characteristics, and common accuracy metrics with the aim of quantifying the expected performance, and the variability in performance, given the size of current COVID-19 chest x-ray datasets. Specifically, we report Type I (false positive) and Type II (false negative) error rates, and we compare prediction behavior between models using McNemar's test [24]. This statistical test was carried out using the method described by Dietterich [23] for a binary classification task (COVID-19 vs. non-COVID-19). All statistical analyses were carried out using R [25], and figures were built using ggplot2 [26].

Results

Among all tested models, Densenet-169 was found to have the highest false negative rate (FNR), Densenet-121 had the highest FNR variance, and Resnet-18 had the highest mean false positive rate (FPR) (Figure 3). All plot labels for model architectures are ordered by number of tunable parameters, increasing from left to right (order reversed in Figure 5B). The lowest FNR (0.15) was achieved by Densenet-201 and ResNeXt-101. Overall, Type I and II error rates varied across models and varied modestly within model architecture (Figure 3).

Networks had consistently lower softmax output probabilities for COVID-19 in the event of a Type I error (FP), while TP probability distributions consistently extended well into those from FP outputs (Figure 4). No significant difference across architectures was found between softmax probabilities in the event of the more serious Type II error (FN; failing to correctly detect COVID-19). For the binary COVID-19 detection task, mean softmax outputs were 0.601 ± 0.161 and 0.213 ± 0.135 for FP and FN, respectively. Interestingly, 16.5% of all TP predictions had softmax probability outputs below 0.667.

Prediction behavior was often inconsistent, to a significant degree based on McNemar's test, both between models and among models with identical architectures (Figure 5). P-values of McNemar's chi-squared test statistic were used to estimate the consistency of prediction behavior (in the test set only) between all models for the binary classification task of detecting COVID-19 versus non-COVID-19. Under the null hypothesis, a low p-value suggests that the two models in question have inconsistent prediction behaviors. A large portion of model comparisons show low p-values and therefore high prediction inconsistency (Figure 5B). Similarly, low p-values were common when comparing models having identical architectures (Figure 5C). If prediction behavior were largely consistent between models, the distributions shown in Figures 5B and 5C would have large spikes near 1.0; instead, the distribution of p-values is more uniform than expected. Given that our experimental design controlled for the TVTS and model architecture across all experiments, a high quality dataset would be expected to produce a distribution of p-values indicating similar behavior between models with identical architectures (Figure 5C).

McNemar's test suggests that several architectures have reliably inconsistent behavior. The most apparent differences are between MobileNet-v2 and WideResnet-101, Resnet-152 and WideResnet-101, and Resnet-152 and WideResnet-50 (Figure 5A). MobileNet-v2 was found to have the fewest inter- and intra-model differences, which could be explained by its relatively small number of tunable parameters yielding a more "general" model fit. Predictions from Densenet-121 models had the most consistency, on average, with all other trained models, including those with identical architecture. Models sharing the WideResnet-101 architecture had the most intra-model differences, followed by ResNeXt-101.
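For reference, here is a sketch of the pairwise comparison behind Figure 5, assuming McNemar's chi-squared test with the continuity correction described by Dietterich [23]; the inputs are per-image correctness vectors for two models on the shared binary (COVID-19 vs. non-COVID-19) test hold-out set.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_p(correct_a, correct_b):
    """P-value of McNemar's chi-squared test comparing two models' predictions."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(correct_a & ~correct_b))  # images model A got right, B wrong
    c = int(np.sum(~correct_a & correct_b))  # images model A got wrong, B right
    if b + c == 0:
        return 1.0                           # identical behavior on every image
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected statistic
    return float(chi2.sf(stat, df=1))        # low p-value => inconsistent behavior
```

A low p-value between two runs of the same architecture is precisely the intra-model inconsistency reported above.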
Comparisons between MobileNet-v2, Densenet-121, and Resnet-18 indicate that these architectures had the most similar prediction behavior (Figure 5B), and their FNR values fall in the middle of the pack. Conversely, models with deeper architectures were responsible for the highest FNR values (Table III).

Using the ensemble of models trained during this study, the FPR and FNR become 0.009 and 0.20, respectively. Ensemble predictions were carried out by simply summing the output from the last layer (softmax) of each network for a given image in the test hold-out set. The model ensemble offers no improvement over the best performing models in FNR or FPR, but it improves the F1 score and multiclass accuracy to 0.640 (from 0.625) and 0.894 (from 0.884), respectively.
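A minimal sketch of this ensembling step, assuming trained_models holds all networks (already moved to the device) and test_loader iterates over the fixed test hold-out set:

```python
import torch

@torch.no_grad()
def ensemble_predict(trained_models, test_loader, device="cuda"):
    """Sum softmax outputs across all trained models, then take the argmax."""
    predictions = []
    for images, _ in test_loader:
        images = images.to(device)
        summed = torch.zeros(images.size(0), 3, device=device)  # 3 classes
        for model in trained_models:
            model.eval()
            summed += torch.softmax(model(images), dim=1)  # accumulate softmax outputs
        predictions.append(summed.argmax(dim=1))           # ensemble class per image
    return torch.cat(predictions)
```

Summing softmax outputs is equivalent to averaging them before the argmax, so no single model's confidence dominates the ensemble decision.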
Discussion

The AI community is responding to the COVID-19 crisis and releasing publications at a rapid pace. Studies show promise in using AI in a clinical setting as a screening tool for diseases like COVID-19. Our results suggest that machine-learning techniques should be used only for COVID-19 screening, not for diagnostic purposes; our intent with this study was to accelerate existing work for clinical augmentation purposes and to provide insight into model selection for COVID-19 use cases. The use of AI in diagnostic procedures should be limited to decision support during screening, and diagnoses should not rely on AI results alone.

Given the length of the incubation period and the variability in symptom onset latency after infection, it is difficult to control for the time at which an image was acquired relative to the time of infection. However, efforts have been made to include an offset in COVID-19 imaging datasets by accounting for the number of days since the start of symptoms or hospitalization [4]. Those who curate these datasets must also deal with inherent ambiguity in medical records, such as image acquisition "after a few days" of symptoms (for example, Cohen et al. [4] assume 5 days). With the currently available datasets, AI engineers rely completely on clinical diagnoses and therefore assume that no false positive images exist; i.e., it is assumed that patients who tested positive for COVID-19 are indeed COVID-19 positive. Given that the estimated false negative rate for COVID-19 tests is high (approximately 10%) [27], it is perhaps safe to assume that those who test positive are indeed infected with the virus (currently, false positive RT-qPCR tests are not reported).

Deep learning architectures like those in this study have been translated for use on upper body CT images [28], and current studies demonstrate more reliable predictions than those from chest x-rays [29], [30]. Reasons for this performance relate to the image detail that can be obtained from a chest CT, as well as the size of the dataset used by Li et al. [29], which contained images from 1296 COVID-19 patients, allowing for a larger hold-out test set than what is currently available in x-ray (127 images at the time of this study) [29]. While CT scans provide enhanced screens, they are less accessible, more expensive, and less efficient than chest x-ray due to the preparation and scan time required. Consequently, CT falls short as a COVID-19 screening tool due to limitations including the complex mechanics and calibrations required for 3D geometrical renderings. Furthermore, only approximately 25 CT scanners exist in the United States per million population [31], and COVID-19 screens would place increased demand on top of the existing demand for CT scans. Moreover, increased infection risk is an unfortunate corollary for non-COVID-19 patients requiring CT scans [3], especially given the amount of essential CT equipment that interfaces with the patient compared to that for chest x-ray, not to mention the increased exposure for clinicians and technicians. If COVID-19 screens are needed en masse, the chest x-ray is a promising, low-cost service that requires no moving parts and could be scaled to meet a spike in demand.

While this study suffers from several limitations, accuracy metrics for a few tested architectures achieved results either consistent with or better than the current state-of-the-art in both x-ray and CT studies [5], [29], [32]-[35]. The results of this study, though limited to specific neural network architectures, suggest that transfer learning provides an efficient means to achieve high accuracy in detecting COVID-19. However, model behavior remains highly dependent on architecture and varies with initial conditions and data augmentation steps. To better quantify the reliability of AI predictions in this context, our next goals are to implement segmentation techniques [36], to carry out a cross-validation protocol for each architecture, and to use a richer dataset.

Conclusion

Model accuracy metrics indicate that further advancements are necessary before AI can be used for COVID-19 screening via x-ray. In our opinion, clinicians should not rely on solutions derived from architectures that have a high (and highly variable) FNR. Furthermore, inconsistent prediction behavior between models with identical architectures casts doubt on the efficacy of the approach, which is unlikely to resolve until dataset limitations are worked out. Without a more abundant dataset, we do not expect deep learning approaches for COVID-19 screening to gain the reliability needed for clinical implementation. Finally, our aim is to encourage a focus on advancing the quality of x-ray screens for COVID-19, given their efficiency relative to other means, and to accelerate workflows that seek to leverage AI.

We have made pre-trained model weights and the code for training freely available to the community through our GitHub repository under a Creative Commons Attribution-NonCommercial 4.0 License (covidresearch.ai/datasets; github.com/synthetaic). We expect these model weights to provide significant improvements in model training efficiency as these public datasets continue to grow and evolve. We believe that AI results have the potential to achieve a degree of reliability that alleviates skepticism within the medical community regarding the use of chest x-ray and computer vision to screen for COVID-19.

Models trained for this study had the task of distinguishing COVID-19 from community acquired pneumonia and from non-COVID-19 (healthy) cases, which limits the feature space to which the models are exposed. Future work should include many more image classes to enable the networks to learn features specific to a given pathology, which could provide the means to elucidate differentiable features of COVID-19 in chest x-ray. We aim to solve the data limitation problem in future work through numerical methods and data collection.
References

[1] The COVID Tracking Project (covidtracking.com).
[2] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases.
[3] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19.
[4] COVID-19 Image Data Collection.
[5] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images.
[6] COVID-19 chest CT image segmentation: a deep convolutional neural network solution.
[7] Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.
[8] Imaging profile of the COVID-19 infection: radiologic findings and literature review.
[9] POCOVID-Net: Automatic detection of COVID-19 from a new lung ultrasound imaging dataset (POCUS).
[10] Extracting possibly representative COVID-19 biomarkers from x-ray images with deep learning approach and image data related to pulmonary diseases.
[11] A neural network can help spot COVID-19 in chest x-rays.
[12] How AI is helping in the fight against COVID-19.
[13] A.I. could help spot telltale signs of coronavirus in lung x-rays.
[14] Essentials for radiologists on COVID-19: an update (Radiology scientific expert panel).
[15] Benchmark analysis of representative deep neural network architectures.
[16] PyTorch: An imperative style, high-performance deep learning library.
[17] Deep residual learning for image recognition.
[18] Wide residual networks.
[19] Aggregated residual transformations for deep neural networks.
[20] MobileNets: Efficient convolutional neural networks for mobile vision applications.
[21] Densely connected convolutional networks.
[22] ImageNet: A large-scale hierarchical image database.
[23] Approximate statistical tests for comparing supervised classification learning algorithms.
[25] R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
[26] ggplot2: Elegant Graphics for Data Analysis.
[27] Evaluation of COVID-19 RT-qPCR test in multi-sample pools.
[28] Coronavirus detection and analysis on chest CT with deep learning.
[29] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
[30] Automatic detection of coronavirus disease (COVID-19) in x-ray and CT images: A machine learning-based approach.
[31] The industry of CT scanning.
[32] Deep learning system to screen coronavirus disease 2019 pneumonia.
[33] Large-scale screening of COVID-19 from community acquired pneumonia using infection size-aware classification.
[34] Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study.
[35] Development and evaluation of an AI system for COVID-19 diagnosis.
[36] MiniSeg: An extremely minimum network for efficient COVID-19 segmentation.