key: cord-0815252-tp6xkivc
authors: Danilov, Viacheslav V.; Proutski, Alex; Karpovsky, Alex; Kirpich, Alexander; Litmanovich, Diana; Nefaridze, Dato; Talalov, Oleg; Semyonov, Semyon; Koniukhovskii, Vladimir; Shvartc, Vladimir; Gankin, Yuriy
title: Indirect supervision applied to COVID-19 and pneumonia classification
date: 2021-12-28
journal: Inform Med Unlocked
DOI: 10.1016/j.imu.2021.100835
sha: 4e43e42355d48808d3ba650eae49749e8e601b86
doc_id: 815252
cord_uid: tp6xkivc

The novel coronavirus disease 2019 (COVID-19) continues to have a devastating effect around the globe, leading many scientists and clinicians to actively seek to develop new techniques to assist with the tackling of this disease. Modern machine learning methods have shown promise in their adoption to assist the healthcare industry through data- and analytics-driven decision making, inspiring researchers to develop new angles from which to fight the virus. In this paper, we aim to develop a CNN-based method for the detection of COVID-19 by utilizing patients' chest X-ray images. Building upon a convolutional backbone, the proposed method makes use of indirect supervision based on Grad-CAM. This technique is used in the training process, where Grad-CAM's attention heatmaps support the network's predictions. Despite recent progress, the scarcity of data has thus far limited the development of a robust solution. We extend existing work by combining publicly available data across 5 different sources and carefully annotating the constituent images across three categories: normal, pneumonia, and COVID-19. To achieve a high classification accuracy, we propose a training pipeline based on indirect supervision of traditional classification networks, where the guidance is directed by an external algorithm. With this method, we observed that widely used, standard networks can achieve an accuracy comparable to tailor-made models, specifically for COVID-19, with one network in particular, VGG-16, outperforming the best of the tailor-made models.

Since its introduction into the human population in late 2019, COVID-19 continues to have a devastating effect on the global populace, with the number of infected individuals steadily rising [1]. With widely available treatments still outstanding and the continued strain placed on many healthcare systems across the world, efficient screening of suspected COVID-19 patients and their subsequent isolation is of paramount importance to mitigate further spread of the virus. Presently, the accepted gold standard for patient screening is reverse transcriptase-polymerase chain reaction (RT-PCR), where the presence of COVID-19 is inferred from the analysis of respiratory samples [2]. Despite its success, RT-PCR is a highly involved manual process with slow turnaround times, with results becoming available up to several days after the test is performed. Furthermore, its variable sensitivity, lack of standardized reporting, and widely ranging total positive rate [3-5] call for alternative screening methods. Chest radiography imaging (such as X-ray or computed tomography (CT) imaging) has gained traction as a powerful alternative, where the diagnosis is administered by expert radiologists who analyze the resulting images and infer the presence of COVID-19 through subtle visual cues [6-10]. Of the two imaging methods, X-ray imaging has distinct advantages with regard to accessibility, availability, and rate of testing [11].
Furthermore, portable X-ray imaging systems do not require patient transportation or physical contact between healthcare professionals and suspected infected individuals, thus allowing for efficient virus isolation and a safer testing methodology. Despite its obvious promise, the main challenge facing radiography examination is the scarcity of trained experts who could conduct the analysis at a time when the number of possible patients continues to rise. As such, a computer system that could accurately analyze and interpret chest X-ray images could significantly alleviate the burden placed on expert radiologists and further streamline patient care. Image identification techniques are readily adopted in Artificial Intelligence (AI) and could prove to be a powerful solution to the problem at hand. Deep learning models, such as convolutional neural networks (CNNs), have gained traction in the field of medical imaging [12,13], and here we train 10 promising CNNs for the purpose of COVID-19 classification in chest X-ray images. To assist the models, we utilize a purpose-built extraction of a soft mask as part of a three-stage procedure. To better quantify the performance of our proposed framework, we benchmark our results against recently developed COVID-Net models [14]. To ensure consistency, we utilize our dataset to output predictions across an array of different COVID-Net models. The structure of the rest of this paper is as follows: section "Related Work" briefly discusses some of the existing work used to diagnose COVID-19 in radiographic imaging; section "Data" summarizes the data collected from the 5 most studied datasets; section "Methods" describes the proposed three-stage workflow using an indirect attention mechanism; section "Results" presents the results obtained during all 3 stages, outlines further improvements of the proposed workflow and its advantages over other models, and showcases possible implementations; section "Conclusion" summarizes the key points of the developed model based on the indirect attention mechanism.

The necessity for faster turnaround times when interpreting radiographic images has led to a substantial effort to adopt CNN-based techniques, with a concentrated effort on distinguishing COVID-19 infected patients with the aid of both CT [15-21] and X-ray [14,22-34] imaging. Several overviews of the application of CNN techniques to aid in COVID-19 diagnosis have been conducted, and we refer the reader to [35-37] for more details. The authors in [34] propose DeepCOVID-XR, an ensemble of CNNs, to detect the presence of COVID-19 on frontal chest radiographs, with an accuracy of 82% reported on a test set of 300 images (194 of which were from COVID-19 infected patients). Studying 5,090 images (1,979 of which were COVID-19 positive), the authors in [33] were able to achieve a binary classification accuracy of 99.5% by making use of a HOG+CNN architecture for feature extraction and VGG for classification. In [32], the pre-trained CNN models VGG-16, VGG-19, MobileNet, and Inception ResNet V2 are used to achieve a classification accuracy of at least 90.8% across 545 images (181 of which are COVID-19 positive). Patients diagnosed with COVID-19 present findings consistent with pneumonia in their X-ray images, necessitating the ability to distinguish between COVID-19 and non-COVID-19 pneumonia. Mahmud et al.
[29] proposed CovXNet and reported results for both binary and multi-class classification when making use of 2,331 images (231 of which were COVID-19 positive). Mansour et al. [39] introduced an unsupervised deep-learning-based variational autoencoder model for COVID-19 prediction, with resultant accuracies of 98.7% and 99.2% for binary and multi-class classification, respectively. The authors tested their model against the X-ray dataset found in [40], split across normal, COVID-19, SARS, and ARDS classes. Khan et al. [41] developed CoroNet, a CNN model based on the Xception architecture. When tasked with classifying X-ray images as either normal, COVID-19, bacterial pneumonia, or viral pneumonia, the model achieved an accuracy of 89.6%, based on a dataset consisting of 1,251 images (284 of which belonged to COVID-19 positive cases). Chandra et al. [42] introduced an automatic COVID-19 screening system that uses a two-phase classification approach (normal vs abnormal, and then COVID-19 vs pneumonia). The implemented classifier ensemble makes use of majority voting across five benchmark classification algorithms. Making use of 2,346 X-ray images (782 of which were COVID-19 positive), the authors report accuracies of 98.1% and 91.3% for each phase, respectively. Ozturk et al. [30] developed DarkCovidNet, a model that obtained an accuracy of 87.0% when distinguishing between COVID-19, normal, and pneumonia in 1,127 images (127 of which are from COVID-19 positive patients). Wang et al. [14] developed a state-of-the-art model, called COVID-Net, that attains an accuracy of 93.3% when classifying a patient's image across three categories: normal, pneumonia, and COVID-19. Despite recent progress in the development of CNN-based algorithms, several fundamental challenges remain: the scarcity of publicly available data, overfitting of models, and model sizes that make their adoption within a healthcare setting cumbersome. We extend upon existing works by combining various publicly available data sources and carefully annotating the images across three classes: normal, pneumonia, and COVID-19. The data is then divided into training, validation, and testing subsets with an 8:1:1 split, with a strict class balance maintained across all sets. Furthermore, we make use of widely adopted CNNs whose size is a fraction of that of some purpose-built models.

We collected data from different publicly available sources to train a high-precision classifier and to estimate its generalization properties. At the time of publication, we identified the following five datasets: COVID Chest X-Ray Dataset (CCXD) [40,43], Actualmed COVID-19 Chest X-Ray Dataset (ACCD) [44], Figure 1 COVID-19 Chest X-Ray Dataset (FCCD) [45], COVID-19 Radiography Database (CRD) [46,47], and RSNA Pneumonia Detection Dataset (RSNA) [48]. Since the datasets include different labels for their findings, we reassigned the labels to maintain consistency across the global dataset. We assigned viral and bacterial cases of pneumonia to the "Pneumonia" label; SARS, MERS-CoV, COVID-19, and COVID-19 (ARDS) to the "COVID-19" label; and "no findings" and "normal" diagnoses to the "Normal" label. It should be noted that the RSNA dataset includes only normal and pneumonia cases. Originally, this dataset consisted of 20,672 normal cases and 9,555 cases of pneumonia. To maintain class balance in our dataset, we incorporated a total of 800 normal and 700 pneumonia cases.
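To make the relabeling step concrete, the minimal Python sketch below maps source-specific finding strings onto the three study classes. The exact label strings used by each source dataset are assumptions for illustration, not a reproduction of the authors' code.

```python
# Hypothetical label harmonization across the five source datasets.
# The source-label strings below are illustrative assumptions.
LABEL_MAP = {
    # viral and bacterial pneumonia -> "Pneumonia"
    "pneumonia/viral": "Pneumonia",
    "pneumonia/bacterial": "Pneumonia",
    # SARS, MERS-CoV, COVID-19, and COVID-19 (ARDS) -> "COVID-19"
    "SARS": "COVID-19",
    "MERS-CoV": "COVID-19",
    "COVID-19": "COVID-19",
    "COVID-19 (ARDS)": "COVID-19",
    # "no findings" / "normal" -> "Normal"
    "no findings": "Normal",
    "normal": "Normal",
}

def harmonize(finding: str) -> str:
    """Map a source-specific finding string onto one of the three study classes."""
    try:
        return LABEL_MAP[finding]
    except KeyError:
        raise ValueError(f"Unmapped finding: {finding!r}")
```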
It is worth noting that normal and pneumonia cases from the CRD dataset were excluded because they duplicated images from the CCXD dataset. The final dataset includes images acquired from the anterior-posterior (AP) and posteroanterior (PA) directions only, as lateral CXR has no clinical applicability for distinguishing COVID-19 patients [49]. During network training, validation, and testing, the dataset was split in an 8:1:1 ratio, i.e., the training subset includes 2,122 images (80%), the validation subset 242 images (10%), and the testing subset 267 images (10%). The split of the data within the training, validation, and testing phases was performed according to the distribution shown in Table 2.

The proposed workflow in this study is divided into three stages. It should be noted that different COVID-Net models [14] are considered in this study. To date, COVID-Net models represent the state of the art for distinguishing COVID-19 and pneumonia cases. All COVID-Net models are abbreviated to CXR in the remainder of the paper. As mentioned previously, 10 deep learning networks were selected to determine which network architectures are most effective in recognizing COVID-19 and pneumonia. The networks vary in the number of weights, architecture topology, data processing, etc. Additionally, CXR models are used for comparison purposes. To allow comparison of the investigated networks, we provide an overview of the networks used during the first stage in Table 3. To train the aforementioned networks, we used the bodies of these networks with frozen ImageNet weights. The optimal version of each model was obtained through a series of training jobs performed on the collected dataset using Amazon SageMaker. Having performed hyperparameter tuning based on a Bayesian optimization strategy, a set of hyperparameter values for the best performing model, as measured by validation accuracy, was found. We chose the following pool of hyperparameters for the investigation:

- The number of blocks, where each block is constructed of densely connected, activation, and dropout layers, was varied from 1 to 5.
- Activation functions were chosen from a set of ReLU, ELU, Leaky ReLU, and SELU.
- The dropout rate was varied from 0.00 to 0.50 with a step of 0.05.

It is worth noting that an architecture including 3 densely connected and 2 dropout layers was the optimal solution for all networks. However, the number of neurons varied slightly from network to network. The optimal number of neurons for the first and second densely connected layers varied from 112 to 136 and from 56 to 72, respectively. A similar situation was observed for the dropout rate, which varied from 0.05 to 0.15 for the first dropout layer and from 0.05 to 0.10 for the second dropout layer. In this regard, we chose the optimal architecture of all network classifiers to consist of the following layers:

- Densely connected layer with 128 neurons and ELU activation;
- Dropout layer with a dropout rate of 0.10;
- Densely connected layer with 64 neurons and ELU activation;
- Dropout layer with a dropout rate of 0.05;
- Densely connected layer with 3 neurons;
- Softmax activation layer.

It is important to note that for the first stage, only the classification heads were trained, with the body weights frozen. According to the results of the hyperparameter tuning procedure, the SGD gradient descent optimizer with a learning rate of 10^-4 proved to be optimal.
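A minimal Keras sketch of this Stage I configuration (a frozen ImageNet body plus the tuned classification head) is given below. The VGG-16 backbone, input resolution, and pooling choice shown here are illustrative assumptions, not a reproduction of the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stage I sketch: frozen ImageNet feature extractor + tuned classification head.
# Backbone, input size, and pooling are assumptions made for illustration.
body = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3), pooling="avg")
body.trainable = False  # first stage: only the head is trained

head = tf.keras.Sequential([
    layers.Dense(128, activation="elu"),  # first densely connected layer
    layers.Dropout(0.10),
    layers.Dense(64, activation="elu"),   # second densely connected layer
    layers.Dropout(0.05),
    layers.Dense(3),                      # Normal / Pneumonia / COVID-19
    layers.Activation("softmax"),
])

model = tf.keras.Sequential([body, head])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, ...)  # hypothetical dataset objects
```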
Having trained several state-of-the-art networks, we found that most of them diverged. As a result, L2 regularization with λ = 0.001 was applied to all training networks. All networks were trained with a batch size of 32. To avoid overfitting during network training, we applied Early Stopping regularization, monitoring the validation loss with a patience of 10 epochs. For training the networks in both the first and second stages, we used cross-entropy, calculated as follows:

$L_{CE} = -\sum_{c=1}^{C} y_c \log(p_c + \epsilon)$,

where $C$ is the number of classes (3 in our study), $y_c$ is the ground-truth label (ternary indicator), $p_c$ is the softmax probability for the $c$-th class, and $\epsilon$ is a small positive constant used to avoid the undefined case of $\log(0)$.

During the second stage, we took the four best performing networks with their trained heads from the first stage, namely MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16, unfroze their body weights (the weights of the feature extractors), and retrained them using the SGD optimizer with a learning rate of 10^-5. As seen, we decreased the learning rate by a factor of 10 compared to that used in the first stage. It is important to lower the learning rate at this stage since a larger model with more unfrozen weights is trained, and this requires the readaptation of the pretrained weights. Otherwise, unfreezing all weights without changes in the training policy may lead to quick model overfitting. Once the performance and accuracy metrics of all networks were estimated, the four networks that showed the best results during the first stage were chosen for fine-tuning.

In the third stage, besides training both the bodies and heads of the networks, we introduced an indirect supervision mechanism for the considered networks. We were inspired by [59], where the authors proposed a guided attention inference framework. For a class $c$, a trainable attention map $A^c$ is computed from $f$, the representation from the last convolutional layer, whose features have the best compromise between high-level semantics and detailed spatial information. The attention map has the same size as the convolutional feature maps (see the column with the size of the output feature matrix in Table 3). Using the trainable attention map, we generate a soft mask that is applied to an input image. This procedure allows us to obtain the regions $I^{*c}$ which are beyond the network's current attention for class $c$ and are calculated as follows:

$I^{*c} = I - T(A^c) \odot I$,

where $I$ is an input image, $T(A^c)$ is a masking function based on the thresholding operation, and $\odot$ denotes element-wise multiplication. Since standard thresholding is not differentiable, $T(A^c)$ is approximated using a sigmoid function:

$T(A^c) = \frac{1}{1 + \exp\left(-\omega (A^c - \sigma)\right)}$,

where $\sigma$ is the thresholding matrix filled with the threshold value and $\omega$ is a scale parameter ensuring that $T(A^c)_{i,j}$ is approximately equal to 1 when $A^c_{i,j}$ is larger than the threshold, and approximately 0 otherwise. Having obtained the soft-masked image $I^{*c}$, the attention block of the pipeline uses it to compute the prediction scores for all classes. Since the indirect supervision mechanism is used to guide the network to focus its attention on all parts of a given class, $I^{*c}$ has to contain as few features belonging to the target class as possible, because the regions beyond the high-responding area of the attention map should not include even single-pixel areas that can trigger the network to recognize an object of class $c$. The attention loss function is designed to minimize the prediction score on $I^{*c}$ and is calculated as follows:

$L_{att} = \frac{1}{n} \sum_{c} s^c(I^{*c})$,

where $s^c(I^{*c})$ is the prediction score for class $c$ on the masked image $I^{*c}$ and $n$ is the number of ground-truth class labels for an input image $I$.
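To make the attention-mining step concrete, below is a minimal Python/TensorFlow sketch of the soft-mask and attention-loss computation under the equations above. The `grad_cam_fn` helper is hypothetical (a batched variant of the Grad-CAM procedure described in the next section), and the ω and σ values are illustrative assumptions rather than the paper's settings.

```python
import tensorflow as tf

OMEGA, SIGMA = 100.0, 0.5  # illustrative scale and threshold, not the paper's values

def soft_mask(attn):
    """Differentiable thresholding: T(A) = sigmoid(omega * (A - sigma))."""
    return tf.sigmoid(OMEGA * (attn - SIGMA))

def attention_loss(model, images, labels, grad_cam_fn):
    """L_att = (1/n) * sum_c s^c(I*^c), averaged over a single-label batch."""
    # A^c: attention maps for the ground-truth classes, values in [0, 1].
    attn = grad_cam_fn(model, images, labels)                # shape (B, h, w)
    attn = tf.image.resize(attn[..., None], tf.shape(images)[1:3])
    # I*^c = I - T(A^c) ⊙ I: erase the regions the network currently attends to.
    masked = images - soft_mask(attn) * images
    # Minimize the score the network still assigns to class c on I*^c;
    # gradients are propagated through this masked forward pass.
    probs = model(masked, training=True)                     # softmax scores
    class_scores = tf.gather(probs, labels, axis=1, batch_dims=1)
    return tf.reduce_mean(class_scores)
```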
While modern neural networks enable superior performance, their lack of decomposability into intuitive and understandable components makes them hard to interpret. In this regard, achieving model transparency is useful for explaining their predictions. Class Activation Mapping (CAM) is a modern technique used for model interpretation [60]. Though CAM is a good technique for demystifying the workings of CNNs, it suffers from several drawbacks. For example, CAM requires the feature maps to directly precede the softmax layer, so it applies only to a particular kind of network architecture that performs global average pooling over convolutional maps immediately before prediction. Such architectures may achieve inferior accuracies compared to general networks on some tasks, or simply be inapplicable to new tasks. Deeper representations of a CNN capture higher-level features. Furthermore, CNNs naturally retain spatial information, which is lost in fully connected layers, so we expect the last convolutional layer to have the best tradeoff between high-level semantics and detailed spatial information. In this regard, a popular technique known as Grad-CAM, published in [50], aims to address the shortcomings of CAM and claims to be compatible with any kind of architecture. The technique does not require any modifications to the existing model architecture, which allows its application to any CNN-based architecture. Unlike CAM, Grad-CAM uses the gradient information flowing into the last convolutional layer of a CNN to understand the importance of each neuron for a decision of interest. Grad-CAM improves on its predecessor, providing better localization and clearer class-discriminative saliency maps. As such, we created heatmap images using the following equations:

$\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial f_{ij}^k}$, $\qquad L^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c f^k\right)$,

where the algorithm takes the gradient of the output $y^c$ with respect to a feature map $f^k$ and averages the result over all $Z$ spatial locations to obtain a weight $\alpha_k^c$ for each feature map $f^k$. Finally, Grad-CAM takes a linear combination of the weights $\alpha_k^c$ and feature maps $f^k$, which gives us the heatmaps.

Having trained 10 neural networks, we found that two networks tend to overfit more than the others. This is likely connected with their normalization layers: networks such as MobileNet V2 and VGG-16 do not have Batch/Instance/Layer/Group Normalization layers in their architecture. In this regard, these networks start overfitting (MobileNet V2) or hit a validation loss/accuracy plateau (VGG-16) after approximately 100 epochs, while the training accuracy keeps increasing. Regularization techniques such as Ridge Regression (L2 regularization), ElasticNet (L1-L2 regularization), Dropout, and Early Stopping may help to avoid this problem. In this regard, we applied Ridge Regression, Dropout layers, and Early Stopping in our training pipeline. As for the remaining networks, they did not suffer from overfitting; however, they could not reach better validation loss/accuracy values. When a given model reached its best validation loss, we saved the associated model weights using a checkpoint-saving callback. Since loss values are difficult to interpret directly, we compared commonly used network metrics such as accuracy and F1-score. Table 4 and Table 5 summarize these metrics estimated during the first stage. As seen, MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16 achieved better results than the other networks. Additionally, we provide all obtained metrics (Accuracy, F1-score, Precision, and Recall), computed over different subsets, classes, and stages, in Appendix A.
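For illustration, a minimal Keras sketch of this Grad-CAM computation is given below. It assumes a functional Keras model whose layers are directly accessible; the `block5_conv3` layer name applies to a VGG-16 backbone and would need adapting for other architectures.

```python
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name="block5_conv3"):
    """Grad-CAM heatmap for one image; `conv_layer_name` is backbone-specific."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])  # add batch dimension
        score = preds[:, class_index]                   # y^c
    grads = tape.gradient(score, conv_out)              # dy^c / df^k
    # alpha_k^c: average the gradients over the spatial dimensions.
    alpha = tf.reduce_mean(grads, axis=(1, 2))          # shape (1, K)
    # L^c = ReLU(sum_k alpha_k^c * f^k): weighted combination of feature maps.
    cam = tf.nn.relu(tf.einsum("bk,bijk->bij", alpha, conv_out))[0]
    return cam / (tf.reduce_max(cam) + 1e-8)            # normalize to [0, 1]
```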
Comparing the results of the first (Table 4 and Table 5) and second stages (Table 6 and Table 7), we can state that MobileNet V2 and VGG-16 gained a larger boost in accuracy than the EfficientNet models. Once full training was performed, MobileNet V2 and VGG-16 obtained +6% and +9% accuracy changes on the validation subset and +1% and +4% accuracy changes on the testing subset. On the other hand, EfficientNet B1 and EfficientNet B3 displayed +2% and +3% accuracy changes on the validation subset and -1% and +6% accuracy changes on the testing subset. It should also be noted that the largest boost in the classification of COVID-19 was achieved by VGG-16. This network had an +11% boost, while MobileNet V2, EfficientNet B1, and EfficientNet B3 reached levels of +2%, 0%, and +6%, respectively. Inception-based networks integrate multiple kernels of different sizes (1×1, 3×3, and 5×5), which should assist in detecting area-specific features. However, the 3×3 convolutional kernels integrated into VGG-16 and MobileNet V2 turned out to provide a better solution, allowing for better generalization and an improved ability to distinguish healthy patients from those diagnosed with COVID-19 or pneumonia. Additional visualization results are provided in Appendix D.

Due to the nature of the task at hand, we utilize Grad-CAM for training and visualization purposes only. As we do not segment the COVID-19 affected regions, we have insufficient image information to compute associated metrics such as the Dice coefficient or the Jaccard distance. However, based on the obtained results, we may state that training the models using the soft masks obtained by the indirect supervision mechanism (Stage III) has a positive effect on the models' ability to find the correct patterns. Networks such as MobileNet V2 (Fig 5c and Fig 6c) and VGG-16 (Fig 5f and Fig 6f) identify the affected areas correctly, despite inaccuracies in the location of the heatmaps. On the other hand, interpretation of the EfficientNet networks showed that they do not activate around the proper patterns in the image. This suggests that EfficientNet B1 and EfficientNet B3 have not properly learned the underlying patterns in our dataset and/or that we may need to collect additional data for more complex training.
References

[1] COVID-19 Virus Pandemic - Worldometer, n.d.
[2] Detection of SARS-CoV-2 in Different Types of Clinical Specimens.
[3] Estimating false-negative detection rate of SARS-CoV-2 by RT-PCR.
[4] Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections.
[5] Sensitivity of Chest CT for COVID-19: Comparison to RT-PCR.
[6] Clinical Characteristics of Coronavirus Disease 2019 in China.
[7] Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.
[8] Imaging Profile of the COVID-19 Infection: Radiologic Findings and Literature Review.
[9] Essentials for radiologists on COVID-19: An update-radiology scientific expert panel.
[10] Correlation of Chest CT and RT-PCR Testing for Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases.
[11] The Role of Chest Imaging in Patient Management During the COVID-19 Pandemic: A Multinational Consensus Statement From the Fleischner Society.
[12] Comparison of Deep Learning Approaches for Multi-Label Chest X-Ray Classification.
[13] Identifying pneumonia in chest X-rays: A deep learning approach.
[14] COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images.
[15] Classification of the COVID-19 infected patients using DenseNet201 based deep transfer learning.
[16] Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy.
[17] COVID-Net CT-2: Enhanced Deep Neural Networks for Detection of COVID-19 from Chest CT Images Through Bigger, More Diverse Learning.
[18] An Open-Source Deep Learning Approach to Identify Covid-19 Using CT Image.
[19] COVIDNet-CT: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases From Chest CT Images.
[20] Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks.
[21] Densely connected convolutional networks-based COVID-19 screening model.
[22] Metaheuristic-based Deep COVID-19 Screening Model from Chest X-Ray Images.
[23] PDCOVIDNet: a parallel-dilated convolutional neural network architecture for detecting COVID-19 from chest X-ray images.
[24] Deep Convolutional Neural Networks to Diagnose COVID-19 and other Pneumonia Diseases from Posteroanterior Chest X-Rays.
[25] Detecting Coronavirus from Chest X-rays Using Transfer Learning.
[26] A Fine-tuned deep convolutional neural network for chest radiography image classification on COVID-19 cases.
[27] Deep-COVID: Predicting COVID-19 from chest X-ray images using deep transfer learning.
[28] COVID-ResNet: A Deep Learning Framework for Screening of COVID19 from Radiographs.
[29] CovXNet: A multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization.
[30] Automated detection of COVID-19 cases using deep neural networks with X-ray images.
[31] Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network.
[32] Transfer Learning-Based Automatic Detection of Coronavirus Disease 2019 (COVID-19) from Chest X-ray Images.
[33] COVID-19 Detection from Chest X-ray Images Using Feature Fusion and Deep Learning.
[34] DeepCOVID-XR: An Artificial Intelligence Algorithm to Detect COVID-19 on Chest Radiographs Trained and Tested on a Large U.S. Clinical Data Set.
[35] Medical Imaging with Deep Learning for COVID-19 Diagnosis: A Comprehensive Review.
[36] An Overview of Deep Learning Techniques on Chest X-Ray and CT Scan Identification of COVID-19.
[37] Overview of current state of research on the application of artificial intelligence techniques for COVID-19.
[38] Convolutional capsnet: A novel artificial neural network approach to detect COVID-19 disease from X-ray images using capsule networks.
[39] Unsupervised Deep Learning based Variational Autoencoder Model for COVID-19 Diagnosis and Classification.
[40] COVID-19 Image Data Collection: Prospective Predictions Are the Future.
[41] CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest x-ray images.
[42] Coronavirus disease (COVID-19) detection in Chest X-Ray images using majority voting based classifier ensemble.
[43] COVID-19 Image Data Collection.
[44] Actualmed COVID-19 Chest X-ray Dataset Initiative, 2020.
[45] Figure 1 COVID-19 Chest X-ray Dataset Initiative.
[46] COVID-19 Radiography Database | Kaggle, n.d.
[47] Can AI Help in Screening Viral and COVID-19 Pneumonia?
[48] RSNA Pneumonia Detection Dataset.
[49] Review of Chest Radiograph Findings of COVID-19 Pneumonia and Suggested Reporting Language.
[50] Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization.
[51] Inverted Residuals and Linear Bottlenecks.
[52] Densely connected convolutional networks.
[53] Rethinking Model Scaling for Convolutional Neural Networks. 36th Int Conf Mach Learn ICML.
[54] Very deep convolutional neural network based image classification using small training sample size.
[55] Identity Mappings in Deep Residual Networks.
[56] Rethinking the Inception Architecture for Computer Vision.
[57] Inception-v4, Inception-ResNet and the impact of residual connections on learning.
[58] COVID-Net Open Source.
[59] Tell Me Where to Look: Guided Attention Inference Network.
[60] Learning Deep Features for Discriminative Localization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.