key: cord-192409-vhd7gjmf
authors: Goldstein, Elisha; Keidar, Daphna; Yaron, Daniel; Shachar, Yair; Blass, Ayelet; Charbinsky, Leonid; Aharony, Israel; Lifshitz, Liza; Lumelsky, Dimitri; Neeman, Ziv; Mizrachi, Matti; Hajouj, Majd; Eizenbach, Nethanel; Sela, Eyal; Weiss, Chedva S; Levin, Philip; Benjaminov, Ofer; Bachar, Gil N; Tamir, Shlomit; Rapson, Yael; Suhami, Dror; Dror, Amiel A; Bogot, Naama R; Grubstein, Ahuva; Shabshin, Nogah; Elyada, Yishai M; Eldar, Yonina C
title: COVID-19 Classification of X-ray Images Using Deep Neural Networks
date: 2020-10-03
journal: nan
DOI: nan
sha:
doc_id: 192409
cord_uid: vhd7gjmf

In the midst of the coronavirus disease 2019 (COVID-19) outbreak, chest X-ray (CXR) imaging is playing an important role in the diagnosis and monitoring of patients with COVID-19. Machine learning solutions have been shown to be useful for X-ray analysis and classification in a range of medical contexts. The purpose of this study is to create and evaluate a machine learning model for the diagnosis of COVID-19, and to provide a tool for searching for similar patients according to their X-ray scans. In this retrospective study, a classifier was built using a pre-trained deep learning model (ResNet50), enhanced by data augmentation and lung segmentation, to detect COVID-19 in frontal CXR images collected between January 2018 and July 2020 in four hospitals in Israel. A nearest-neighbors algorithm was implemented, based on the network's results, to identify the images most similar to a given image. The model was evaluated using accuracy, sensitivity, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and of the precision-recall (P-R) curve. The dataset sourced for this study includes 2326 CXRs, balanced for positive and negative COVID-19, from 1384 patients (63 ± 18 years, 552 men). Our model achieved 89.7% (314/350) accuracy and 87.1% (156/179) sensitivity in the classification of COVID-19 on a test dataset comprising 15% (350 of 2326) of the original data, with an ROC AUC of 0.95 and a P-R AUC of 0.94. For each image we retrieve the images with the most similar DNN-based image embeddings; these can be used for comparison with previous cases.

The Coronavirus Disease 2019 (COVID-19) pandemic, caused by the SARS-CoV-2 virus, poses tremendous challenges to healthcare systems around the world, and requires physicians to make clinical decisions with limited prior knowledge. Medical decisions are also based on imaging, and can be supported by a method for automatically retrieving prior patients with similar imaging findings. Moreover, an ongoing concern is to rapidly identify and isolate SARS-CoV-2 carriers in order to contain the disease. The current standard diagnostic test is Reverse Transcription Polymerase Chain Reaction (RT-PCR) (1, 2). However, a recent study suggests that RT-PCR tests result in up to 30% false negatives, depending on the respiratory specimens (3), possibly due to non-specific amplification and sample contamination. Taken together, this substantial undetected fraction of active patients inevitably leads to uncontrolled viral dissemination and obscures essential epidemiological data (4-6). Additionally, RT-PCR testing kits are expensive, and processing them requires dedicated personnel and can take days. Characteristics of COVID-19 such as consolidations and ground-glass opacities can be identified in both CXRs and CT scans (5, 7, 8). Both are often used to support RT-PCR diagnosis, and are strong candidates for alternative means of COVID-19 testing.
Portable X-ray machines play a central role in COVID-19 handling (9), and most available CXRs of patients with COVID-19 in Israel come from portable X-rays. While COVID-19 is easier to detect in CT (10), CT is more expensive, exposes the patient to higher radiation, and its decontamination process is lengthy, causing severe delays between patients. Deep learning models have shown impressive abilities in image-related tasks, including in many radiological contexts (11, 12). They have great potential in assisting COVID-19 management efforts, but require large amounts of training data. When training neural networks for image classification, images from different classes should differ only in the task-specific characteristics; it is therefore important that all images are taken from the same machines. Otherwise, the network could learn differences between the machines associated with different classes, rather than identifying physiological and anatomical COVID-19 characteristics.

This study aims to provide machine learning tools for COVID-19 identification and management. A large dataset of portable X-ray images was sourced and used to train a network that detects COVID-19 in the images with high reliability, and to develop a tool for retrieving CXR images that are similar to each other. The network affords a detection accuracy of 89.7% and a sensitivity of 87.1%.

This retrospective study was approved by the Institutional Review Board (IRB) and the Helsinki committee of the participating medical centers, in compliance with public health regulations and the provisions of the current harmonized international guidelines for good clinical practice (ICH-GCP), and in accordance with Helsinki principles. Informed consent was waived by the IRB for the purpose of this study. Code development and analysis were performed by six of the authors, who are not radiologists.

This study includes CXR images from 1384 patients, 360 with a positive COVID-19 diagnosis and 1024 negative, totaling 2427 CXRs. Patients' COVID-19 labels were determined by a combination of RT-PCR testing and clinical assessment by physicians. The COVID-19-positive images include all CXRs performed with portable X-ray machines on patients admitted to four hospitals in Israel during the pandemic's first wave (December 2019 through April 2020). For the control dataset we obtained CXRs taken by the same X-ray machines prior to December 2019; these are patients without COVID-19, typically with another respiratory disease. The test set was taken from the full CXR dataset and contains 350 CXRs (15%), of which 179 (51%) are positive for COVID-19 and 171 (49%) are negative. To prevent the model from identifying patient-specific image features (e.g., medical implants) and associating them with the label, each patient's images were assigned either to the training set or to the test set, but never to both. All images were used at the highest available resolution without lossy compression (e.g., JPEG); 4% (101/2427) of the images were excluded due to lateral positioning or rectangular artifacts in the image, and of these, 98 were COVID-19 positive. No additional selection criteria were used to exclude CXR images based on clinical radiological findings.

The model pipeline (Figure 1) begins with a series of preprocessing steps, including augmentation, normalization, and segmentation of the images. Augmentations are transformations that change features such as image orientation and brightness.
These properties are irrelevant for correct classification, but they may vary during image acquisition and can affect the training performance of the network, since the network is rigidly registered with respect to orientation and pixel values. Augmentations serve to enlarge the dataset by creating a diverse set of images, increasing model robustness and generalizability (13, 14). Importantly, augmentations should correspond to normal variation in CXR acquisition; to ensure this, we consulted with radiologists when defining the augmentation parameters (see Appendix).

The normalization process aims to standardize image properties and scale. It consists of cropping black edges, standardizing the brightness, and scaling each image to 1024×1024 pixels using bilinear interpolation. To enhance performance, we created an additional image channel using lung segmentation via a U-net (15) pre-trained on a different dataset. This network produces a pixel mask of the CXR indicating the probability that each pixel belongs to the lungs, allowing the classification network to access this information while training. Input images contain 3 channels: the original CXR, the segmentation map, and one filled with zeros. This accommodates the pre-trained models we used, which expect 3-channel RGB images.

We compared five network models: ResNet34, ResNet50, ResNet152 (16), VGG16 (17), and CheXpert (11). We additionally classify the images by aggregating the results of these networks using a majority vote. The general approach of these architectures is to reduce images from a high-dimensional to a low-dimensional space such that a simple boundary can be used to separate the image classes. The models were trained using transfer learning, i.e., starting from pre-trained weights and subsequently retraining them on our data. Training was performed with the Adam optimizer with an initial learning rate of 1e-6, which was exponentially decreased as epochs progressed. We used cross-entropy as the loss function with an L2 regularizer with regularization coefficient 1e-2. The best test accuracy scores were achieved after 32 epochs. The models were built and trained using PyTorch 1.6; all code will be made available upon publication.

In addition to classification, we propose a method for retrieving the CXR images most similar to a given image. The activations of layers of the neural network serve as embeddings of the images into a vector space, and should capture information about clinical indications observed in the images. We use these embeddings to measure the similarity between the resulting vectors and retrieve the nearest neighbors of each image. For model evaluation we used accuracy, precision, and the area under the curve (AUC) of the receiver operating characteristic (ROC) and precision-recall (P-R) curves.

The patient data included in this study are summarized in Table 1. The performance of the network was tested on 15% (350 of 2326) of the images, which were taken from the total dataset and set aside before training. The metrics we used are accuracy, namely the proportion of successful classifications overall; sensitivity (also called recall), the proportion of positive images that the network classified correctly; and specificity, the proportion of correctly classified negative images. We trained five deep network models, whose accuracy and sensitivity rates can be seen in Table 2.
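As a minimal sketch of the transfer-learning setup described above, assuming torchvision's pretrained ResNet50 (the stand-in data, the reduced image size, and the decay factor gamma are illustrative assumptions, not values from the study):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Stand-in data: random 3-channel "images" (CXR, lung mask, zero channel)
# with binary labels. In the study these come from the preprocessing
# pipeline at 1024x1024; the size is reduced here to keep the sketch light.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=4)

# Pre-trained ResNet50 with its final layer replaced by a two-class head.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

# Adam with initial learning rate 1e-6; the L2 coefficient 1e-2 is passed
# as weight decay. Cross-entropy is the loss, as described in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(32):  # best test accuracy was reported after 32 epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # exponential learning-rate decay once per epoch
```

Passing the L2 coefficient as Adam's weight decay is the conventional way to add an L2 penalty in PyTorch; whether the authors used this mechanism or an explicit penalty term is not specified in the text.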
We selected ResNet50 for the rest of the analysis, as it achieved the best performance in our task, with an accuracy of 89.7% (314/350) and a sensitivity of 87.1% (156/179). We then trained ResNet50 on the dataset with and without the preprocessing stages. As seen in Table 2(c), preprocessing yields an improvement of 4% in accuracy and 5% in sensitivity.

In addition to the binary decision of whether a patient has COVID-19, we provide a score between 0 and 1, corresponding to the probability the network assigns to the positive label. It is given by the activation of the network's last layer before it is passed through the activation function that produces the binary output. Whenever this score is above the threshold of 0.5, the image is classified as positive for COVID-19. We generated a histogram of these scores, as can be seen in Figure 3, and observed that the majority of the correctly classified points accumulate at the edges, while the wrongly classified images are spread more widely along the x-axis.

We additionally visualize the distinction made by the model using t-distributed Stochastic Neighbor Embedding (t-SNE) (18). t-SNE uses a nonlinear method to reduce high-dimensional vectors to two dimensions, making it possible to visualize the data points and reveal similarities and dissimilarities between them. We used one of the last layers of the network, which essentially provides an embedding of the images into a vector space; these vector embeddings are given as input to the t-SNE. Figure 4 shows the arrangement of the resulting points, colored by their ground-truth (GT) labels. The figure depicts two distinct clusters, revealing a similarity between most of the images with the same GT.

In order to test the model on a more difficult task, we were supplied with 22 CXRs, 9 positive for COVID-19 and 13 controls, classified by radiologists as difficult to diagnose, and used them as an additional test of our model. The accuracy on this test was 77%, with a sensitivity of 77%. In Figure 5, three correctly classified images from this test are shown, along with the network's classification score and the GT of each image.

Finally, we applied K-Nearest Neighbors (KNN) to the image embeddings in order to retrieve images that are similar to each other, as shown in Figure 6. For each image we retrieve the 4 images with the closest image embeddings; averaging over these images' predictions achieves 87% accuracy (305/350) and 83.2% sensitivity (149/179), meaning that the nearest images typically have the same labels.

Several previous studies have trained COVID-19 classifiers on publicly available datasets, some with as few as 473 COVID-19 X-ray images. The performance reported in these research papers is generally high, with accuracy ranging from 89% to 99% and specificity ranging from 80% to 100% (19, 21). However, these results were obtained by testing solely on subsets of the data available to the researchers. Such datasets have a number of drawbacks. They include a limited number of positive COVID-19 CXR images, which can cause the model to overfit, as it is exposed to a relatively small number of characteristics of the data, impairing its ability to generalize to external datasets. These models' reliability still needs to be verified on external data.
As machine learning models tend to improve and generalize better as the amount of data increases (22), a dataset with more positive COVID-19 images, such as the one used in this study with 1191 positive CXRs, tends to yield a more stable model. In addition, these public datasets were compiled from various sources, often using one source only for COVID-19-positive images and another only for COVID-19-negative images. Positive and negative images in these datasets may therefore be produced by different X-ray machines, in particular portable versus fixed machines, which give rise to images with different optical features. This can allow the network's predictions to rely on features related to the source rather than on the relevant medical information (23). In this research we used CXRs from the same machines for patients with both positive and negative COVID-19 outcomes.

As future work, we intend to deploy our model for testing in a clinical setting. We will further investigate the scoring process for the image similarities we provide; we would ideally like to compare the disease progression of patients who were found by our tool to have similar lung findings. Additionally, we will examine how CXRs are influenced by the progression of the disease: lung damage may remain after the virus leaves the body, leading to false-positive classifications in later stages of the disease. Lastly, our classifier is tailored towards portable X-rays within the four Israeli hospitals that provided the data; it may need further fine-tuning to be used in other hospitals or diagnostic settings.

In summary, we presented a deep neural network that is able to reliably detect patients with coronavirus disease 2019. Even though medical imaging has not yet been approved as a standalone diagnostic tool (9), we believe it can be used as an aid to medical judgment, with the advantage of an immediate result. We also created a tool for X-ray image retrieval based on lung similarities. This tool can help physicians draw connections between patients with similar disease manifestations by referring them to images with similar lung characteristics. These images can be linked internally to the corresponding patients, and the treatment and outcome of those patients can then inform the decision on treatment for the current patient.

A novel aspect of our model architecture is the addition of an extra input channel to each image in the form of a probability map, which indicates for each pixel the probability that it belongs to the lung. These probabilities are obtained by applying a pre-trained U-net to segment the lung area in the image. Adding this mask as an additional channel to the X-ray image helps the network focus on the lung area while training. An example of segmentation can be seen in Figure 2. Based on that, our network architecture consists of two main parts: a feature extractor and a decision head. The feature extractor is a neural network based on a ResNet50 architecture that takes an image as input (in our case, a 2D image), performs mathematical operations on it, and outputs a feature map, namely a matrix of numbers that describes the image. This matrix of features is flattened into a vector (with the same values) and passed into the decision head, which is a simple neural network consisting of 3 fully connected layers. The output of the decision head is two numbers that describe the algorithm's confidence in the classification result: COVID-positive or COVID-negative.
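A minimal sketch of this two-part design, assuming torchvision's ResNet50 as the backbone; the hidden widths of the decision head are assumed values, while the three fully connected layers, the two-number output, and the exposed embedding follow the description above:

```python
import torch.nn as nn
from torchvision import models

class CovidClassifier(nn.Module):
    """ResNet50 feature extractor followed by a 3-layer decision head."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        num_features = backbone.fc.in_features  # 2048-dim feature vector
        backbone.fc = nn.Identity()  # keep the flattened feature vector
        self.feature_extractor = backbone
        # Decision head: three fully connected layers (hidden sizes 256
        # and 64 are assumptions); the final layer outputs two numbers,
        # the confidences for COVID-positive and COVID-negative.
        self.hidden = nn.Sequential(
            nn.Linear(num_features, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        self.output = nn.Linear(64, 2)

    def forward(self, x):
        features = self.feature_extractor(x)  # feature map -> vector
        embedding = self.hidden(features)     # vector fed to t-SNE / KNN
        return self.output(embedding), embedding
```

The returned embedding is what the retrieval tool indexes; for example, scikit-learn's NearestNeighbors could be fit on the embeddings of the training images to return the closest CXRs for a query image, in the spirit of the KNN retrieval described in the text.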
In addition, the last layer (a vector) of the decision head is referred to as the "embedding" and is used as the input to the t-SNE and KNN algorithms described in the text.

References

Analytical sensitivity and efficiency comparisons of SARS-CoV-2 qRT-PCR assays. medRxiv.
Diagnosing COVID-19: The Disease and Tools for Detection.
Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections. medRxiv.
Modes of contact and risk of transmission in COVID-19 among close contacts. medRxiv.
False-negative results of initial RT-PCR assays for COVID-19: a systematic review. medRxiv.
Sensitivity of chest CT for COVID-19: comparison to RT-PCR.
Chest Imaging Appearance of COVID-19 Infection. Radiol Cardiothorac Imaging.
Initial CT findings and temporal changes in patients with the novel coronavirus pneumonia (2019-nCoV): a study of 63 patients in Wuhan.
ACR Recommendations for the use of Chest Radiography and Computed Tomography (CT) for Suspected COVID-19 Infection. American College of Radiology.
Imaging Profile of the COVID-19 Infection: Radiologic Findings and Literature Review. Radiol Cardiothorac Imaging.
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.
Deep Learning in Ultrasound Imaging.
Data augmentation in training deep learning models for medical image analysis. Intell Syst Ref Libr.
The Effectiveness of Data Augmentation in Image Classification using Deep Learning.
U-Net: Convolutional Networks for Biomedical Image Segmentation.
Deep Residual Learning for Image Recognition.
Very Deep Convolutional Networks for Large-Scale Image Recognition.
Visualizing Data using t-SNE.
COVID-19 Image Data Collection: Prospective Predictions Are the Future.
COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest X-Ray Images.
The Unreasonable Effectiveness of Data.
A Critic Evaluation of Methods for COVID-19 Automatic Detection from X-Ray Images.

We decided to apply left-to-right flips, as COVID-19 is known to affect the lungs symmetrically; thus, flipping does not change the characteristic manifestation of the disease. Moreover, some X-ray images may be taken from the back, and we do not always have clear labels as to the direction in which the X-ray was taken. In order to increase the number of images, which can improve training performance, several different transformations are performed, each with a certain probability.

We would like to acknowledge Avithal Elias and Nadav Nehmadi for their helpful comments and contributions in the initial stages of the project.

In this appendix, we elaborate further on the data processing and the neural network design. Before training, each image goes through a preprocessing pipeline. We start by cropping out areas that contain only text around the images themselves. We then unify the image sizes, preserving the original aspect ratios via padding, and apply a CLAHE filter (which has been seen to enhance images and improve deep learning performance (10)). On the training data, we also apply a series of augmentations. Augmentations are transformations performed on the data that serve a dual purpose. First, applying the augmentations creates an additional, diverse set of images from the existing ones, enabling one to artificially enlarge a dataset to improve performance (11).
Augmentations are therefore very commonly used on medical images, where datasets tend to be relatively small (12). Second, these transformations can help the network generalize better (13), as they alter features that are unimportant to the identification of COVID-19 in the lungs. This way, the network can learn the important features and ignore the irrelevant ones. Crucially, the transformations must preserve the image labels: a coronavirus patient must still be identifiable as one. To ensure this, we consulted with radiologists when defining the transformations and their parameter ranges. The augmentations are performed randomly, with parameters chosen uniformly within the defined ranges, as seen in Figure 1. Not all augmentations are applied each time; rather, each augmentation is applied with a certain probability, represented by p below.
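As an illustrative sketch only, such a pipeline can be written with torchvision transforms; every probability p and parameter range below is an assumed placeholder, not a value from the study:

```python
import torchvision.transforms as T

# Each augmentation is applied with probability p; parameters are drawn
# uniformly within their ranges. All values here are assumed placeholders.
augmentations = T.Compose([
    # Left-right flip: COVID-19 affects the lungs symmetrically, so the
    # label is preserved (see discussion above).
    T.RandomHorizontalFlip(p=0.5),
    # Small rotations/translations, mimicking variation in bedside setup.
    T.RandomApply([T.RandomAffine(degrees=10, translate=(0.05, 0.05))], p=0.5),
    # Brightness/contrast jitter, mimicking exposure differences.
    T.RandomApply([T.ColorJitter(brightness=0.2, contrast=0.2)], p=0.3),
])
```

Applied on the fly during training (e.g., augmented = augmentations(image) for each loaded image), such a pipeline yields a different variant of each CXR in every epoch.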