title: Vision Transformer for COVID-19 CXR Diagnosis using Chest X-ray Feature Corpus
authors: Park, Sangjoon; Kim, Gwanghyun; Oh, Yujin; Seo, Joon Beom; Lee, Sang Min; Kim, Jin Hwan; Moon, Sungjun; Lim, Jae-Kwang; Ye, Jong Chul
date: 2021-03-12

Under the global COVID-19 crisis, developing a robust diagnosis algorithm for COVID-19 using CXR is hampered by the lack of a well-curated COVID-19 dataset, although CXR data with other diseases are abundant. This situation is suitable for the vision transformer architecture, which can exploit abundant unlabeled data through pre-training. However, the direct use of an existing vision transformer that relies on a corpus generated by a ResNet is not optimal for correct feature embedding. To mitigate this problem, we propose a novel vision Transformer that uses a low-level CXR feature corpus obtained by extracting abnormal CXR features. Specifically, the backbone network is trained using large public datasets to obtain the abnormal features relevant to routine diagnosis, such as consolidation, ground-glass opacity (GGO), etc. The embedded features from the backbone network are then used as a corpus for vision Transformer training. We examine our model on various external test datasets acquired from entirely different institutions to assess its generalization ability. Our experiments demonstrate that our method achieves state-of-the-art performance and has better generalization capability, both of which are crucial for widespread deployment.

The novel coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus-2, is an ongoing pandemic that has resulted in 113,695,296 infections and 2,526,007 deaths worldwide as of 1 March 2021. In the face of the unprecedented COVID-19 pandemic, public health care systems have confronted challenges in many aspects, including a critical shortage of medical resources, while many health care providers have themselves been infected [15]. Because of the highly transmissible and pathologic nature of COVID-19, early screening is becoming increasingly important to prevent further spread of the disease and to lessen the burden on health care systems. Currently, real-time polymerase chain reaction (RT-PCR) is the gold standard for COVID-19 confirmation due to its high sensitivity and specificity [20], but it takes several hours to produce a result. As many patients with confirmed COVID-19 present radiological findings of pneumonia, radiologic examinations may be useful for fast diagnosis [18]. Although chest computed tomography (CT) has superior sensitivity and specificity for the diagnosis of COVID-19 [2], the routine use of CT places a huge burden on health care systems due to its high cost and relatively longer scan time compared with the chest radiograph (CXR). Therefore, there are practical advantages to using CXR as a primary screening tool during a global pandemic. The common CXR findings of COVID-19 include bilateral involvement and peripheral and lower zone dominance of ground-glass opacities and patchy consolidations [7]. Even though the sensitivity and specificity of COVID-19 diagnosis with CXR alone have been reported to be lower than with CT or RT-PCR [25], CXR retains its potential for the fast screening of COVID-19 during patient triage, determining the priority of patients' care to help saturated health care systems in a pandemic situation.
Accordingly, many approaches using deep learning to diagnose COVID-19 with CXR have been proposed [11, 14, 16, 22], but they suffer from the common problem of a limited number of labelled COVID-19 data, resulting in poor generalization ability [12, 27]. Reliable generalization performance on unseen, totally different datasets is crucial for real-world adoption of such a system. In general, the most common approach to this problem is to build an adversarially robust model with millions of training samples [4]. However, constructing a well-curated dataset containing a large number of labelled COVID-19 cases is difficult due to the saturation of health care systems in many countries. Although previous studies have tried to mitigate the problem either by using transfer learning from other large datasets like ImageNet [1], or by utilizing weakly-supervised learning methods [24, 29] and anomaly detection [28], their performance is often suboptimal and does not guarantee the ability to generalize. In addition, as COVID-19 usually involves both lung fields with lower zone dominance, the model should extract features based on the global manifestation of the disease.

The Transformer, first introduced in the field of natural language processing (NLP), is a deep neural network based on a self-attention mechanism that yields significantly large receptive fields [21]. After achieving astounding results in NLP, it has inspired the vision community to study its applications in computer vision, since it enables modeling long-range dependencies within images. The Vision Transformer (ViT) first showed how the Transformer can completely replace standard convolution operations in a deep neural network, achieving state-of-the-art (SOTA) performance [10]. However, training a Vision Transformer from scratch requires a huge amount of data, so the authors also suggested a hybrid model in which a conventional convolutional neural network (e.g. ResNet) backbone produces the initial feature embedding. In this way, the Transformer, trained on the feature corpus generated by the ResNet backbone, can focus mainly on learning the global attention. Empirical results show that the hybrid model performs better on small-sized datasets.

Although these preliminary results are promising, a remaining concern is that the corpus generated by the ResNet may not be an optimal input feature embedding for diagnosis with CXRs. Fortunately, several publicly available large-scale datasets for CXR classification were built before the COVID-19 outbreak. Among them, the CheXpert [13] dataset consists of labeled abnormal observations, including low-level CXR features (e.g. opacity, consolidation, edema, etc.) useful for the diagnosis of infectious diseases. Furthermore, there exist many advanced CNN architectures, including the model proposed by [26], which utilizes probabilistic class activation map (PCAM) pooling to explicitly leverage class activation maps to improve both classification and localization of these low-level features. Therefore, we propose a novel vision Transformer that utilizes this existing network as a backbone for low-level CXR feature embedding, upon which the vision Transformer is trained using the corpus generated by the backbone network.
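Since PCAM pooling is central to the backbone, the following is a minimal PyTorch sketch of a single-class PCAM head, assuming the formulation of [26]: a 1 × 1 convolution scores each spatial location, a sigmoid turns the scores into a per-pixel probability map, and the normalized probabilities attention-pool the feature map before the image-level logit is computed. The class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class PCAMPooling(nn.Module):
    """Single-class PCAM pooling head (sketch): a 1x1 conv scores every
    spatial location, sigmoid yields a per-pixel probability map, and the
    normalized probabilities attention-pool the backbone feature map."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W) backbone feature map
        logit_map = self.score(feat)                   # (B, 1, H, W) per-pixel logits
        prob_map = torch.sigmoid(logit_map)            # probabilistic CAM
        attn = prob_map / prob_map.sum(dim=(2, 3), keepdim=True)
        pooled = (feat * attn).sum(dim=(2, 3))         # (B, C) attention-pooled feature
        # Image-level logit from the pooled feature, reusing the 1x1 conv weights
        # (an assumption about how the pooled feature is classified).
        img_logit = self.score(pooled[:, :, None, None]).flatten(1)  # (B, 1)
        return img_logit, prob_map                     # logit for the loss, map for localization
```

One such head per labelled observation would yield both the per-class classification logits and the localization maps described above.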
It is remarkable that our network fundamentally resembles the text classification task using the Transformer, which classifies an entire sentence by aggregating the meaning, location, and relationship of words within it, starting from low-level word embeddings. Furthermore, our network emulates clinical experts who make a CXR image-level diagnosis (e.g. normal, tuberculosis, COVID-19 pneumonia, etc.) by collating the information of low-level features in terms of their pattern, multiplicity, distribution, and location.

Contribution. In this paper, we propose a Vision Transformer model tailored for COVID-19 CXR diagnosis, leveraging a low-level CXR feature corpus obtained from a prebuilt large-scale public dataset of CXR features such as opacity, consolidation, edema, etc. We show that our method is superior to both the baseline Vision Transformer and SOTA models, especially in terms of generalization performance on unseen datasets. We also provide clinically interpretable visualization results of the model, which are of great help for COVID-19 diagnosis and localization. The merit of our model is that the Transformer can exploit the low-level CXR feature corpus obtained from a backbone network trained to extract abnormal CXR features from a publicly available, large, well-curated CXR dataset. The overall framework of our method is illustrated in Fig. 1.

Pre-training Backbone Network for Low-level Feature Corpus. As a backbone network to extract the low-level CXR feature corpus from an image, we adopted a modified version of the model proposed by [26], which utilizes probabilistic class activation map (PCAM) pooling to explicitly leverage class activation maps to improve both classification and localization ability (see Fig. 1(A)). The backbone network was trained beforehand on a prebuilt public CXR dataset to classify 10 labelled observations: no finding, cardiomegaly, lung opacity, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, and support devices. As shown in Fig. 1(A), there are several layers from which one could extract the feature embedding, and we found that the intermediate-level embedding before the PCAM operation contains the most useful information. However, care should be taken, since the PCAM unit trained with specific low-level CXR features (e.g. cardiomegaly, lung opacity, edema, consolidation) was essential for improving the accuracy of the intermediate-level feature embedding by guiding the features to align with the optimal PCAM maps. Specifically, with the pre-trained backbone network G, an input image x ∈ R^{H×W×C} is encoded into an intermediate feature map f ∈ R^{H'×W'×C'}. We used the C'-dimensional feature vectors of f at each of the H' × W' pixel locations as encoded representations of the low-level features, and constructed the low-level CXR feature corpus from them.

Vision Transformer. Similar to BERT [9], the Vision Transformer model is an encoder-only architecture (see Fig. 1(B)). As the Transformer encoder uses a constant latent vector of dimension D, we first projected the encoded features f of dimension C' into features f_p of dimension D using a 1 × 1 convolution kernel. Similar to the [class] token of BERT, we prepended an additional learnable embedding vector f_cls to the projected features f_p, so that the last (L-th) layer output of this [class] token, z_L^0, represents the diagnosis of the whole CXR image (= y) when the classification head is attached to z_L^0.
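As a concrete illustration, the following is a minimal PyTorch sketch of this embedding step, assuming the 16 × 16 × 1024 backbone feature map described in the implementation details below and an illustrative Transformer width D = 768; it also includes the positional embedding E_pos introduced next. The module and variable names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class FeatureCorpusEmbedding(nn.Module):
    """Projects the backbone feature map f to the Transformer width D,
    flattens it into a token sequence f_p, prepends a learnable [class]
    token f_cls, and adds a learnable positional embedding E_pos."""

    def __init__(self, c_in: int = 1024, d_model: int = 768, grid: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(c_in, d_model, kernel_size=1)            # f -> f_p
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))      # f_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid + 1, d_model))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, 1024, 16, 16) intermediate feature map from the backbone
        fp = self.proj(f).flatten(2).transpose(1, 2)    # (B, 256, D) token sequence
        cls = self.cls_token.expand(f.size(0), -1, -1)  # (B, 1, D)
        z0 = torch.cat([cls, fp], dim=1) + self.pos_embed
        return z0                                       # (B, 257, D) encoder input
```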
In addition, we added a positional embedding E_pos to encode a notion of sequential order into the projected features f_p. The Transformer encoder layers used in our model are the same as the standard Transformer encoder, consisting of alternating multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks, with layer normalization (LN) and residual connections in each block, which can be described as follows:

z_0 = [f_cls; f_p] + E_pos,
z'_l = MSA(LN(z_{l-1})) + z_{l-1},  l = 1, ..., L,
z_l = MLP(LN(z'_l)) + z'_l,  l = 1, ..., L,
y = head(z_L^0),

where z_L^0 denotes the final-layer output at the [class] token position.

Model Interpretability. For model interpretability, we adopted a saliency map visualization method tailored for the Vision Transformer. Different from other methods relying on attention maps or heuristic propagation of attention, the method proposed by [3] assigns local relevance with deep Taylor decomposition and propagates the local relevance throughout the layers. Through this relevance propagation, the method overcomes the challenges posed by attention layers and skip connections.

Datasets and Partitioning. For the pre-training of the feature encoder network, we used CheXpert [13]. The total dataset was curated into three classes: normal, other infections (including bacterial pneumonia and tuberculosis), and COVID-19. The numbers of images in each class are summarized in Table 1. To evaluate the generalization ability of the models in various institutional settings, we set aside three institutional datasets (CNUH, KNUH, YNU), collected from totally different institutions with different devices and settings, as external test datasets, since they contain cases of all three label classes.

Our pre-processing method includes histogram equalization, Gaussian blurring with a 3 × 3 kernel, normalization, and resizing to 512 × 512. As the backbone, we adopted the network architecture that took first place in the CheXpert challenge 2019 [26], which consists of a DenseNet-121 backbone followed by PCAM operations for each class. In particular, we used the intermediate feature map of size 16 × 16 × 1024 before the PCAM operation as input to the Transformer, since this feature map contains a common representation of all the abnormal findings. As this feature map already encodes the representations of the important CXR findings, we adopted a relatively simple Transformer architecture with four encoder layers and eight attention heads. For training the backbone network, we used the Adam optimizer with a learning rate of 0.0001; the backbone was trained for 160,000 optimization steps with a step decay scheduler and a batch size of 8. For training the COVID-19 diagnosis model, we used the SGD optimizer (momentum = 0.9) with the max gradient norm set to 1 and a learning rate of 0.001; the model was trained for 10,000 optimization steps with a cosine warm-up scheduler (warm-up steps = 500) and a batch size of 8. These hyperparameters were determined experimentally. We adopted the area under the ROC curve (AUC) as our evaluation metric, but also calculated sensitivity, specificity, and accuracy after adjusting the threshold to meet a sensitivity ≥ 80% where possible. Pre-processing, development, and evaluation of the algorithm were performed with Python 3.7 and PyTorch 1.6 on an Nvidia Tesla V100.

The results demonstrate that our model retains stable performance (AUC ≥ 0.900) irrespective of the external settings, confirming the excellent generalization ability of our model.

Comparison with baseline and SOTA models. To compare the diagnostic performance of our model with baseline and SOTA models, ResNet-50 was used as the baseline, and ViT (ViT-B/16) and the hybrid ViT model (R50-ViT-B/16) were used as SOTA models.
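As a concrete illustration of the pre-processing and optimization recipe above, the following is a minimal sketch using OpenCV and PyTorch. The exact normalization scheme (zero-mean, unit-variance here) and the shape of the cosine warm-up schedule are assumptions, and the stand-in model is a placeholder, not the actual diagnosis network.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def preprocess(img_u8: np.ndarray) -> np.ndarray:
    """Histogram equalization, 3x3 Gaussian blurring, normalization, and
    resizing to 512x512, applied to a single-channel uint8 CXR image."""
    img = cv2.equalizeHist(img_u8)
    img = cv2.GaussianBlur(img, (3, 3), 0)
    img = cv2.resize(img, (512, 512)).astype(np.float32)
    return (img - img.mean()) / (img.std() + 1e-8)  # zero-mean/unit-variance (assumed)

# Optimization recipe from the text: SGD (momentum 0.9, lr 0.001),
# gradient norm clipped to 1, 10,000 steps with 500 cosine warm-up steps.
model = nn.Linear(512 * 512, 3)  # placeholder for the actual diagnosis network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
WARMUP, TOTAL = 500, 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda s: (s + 1) / WARMUP if s < WARMUP
    else 0.5 * (1.0 + np.cos(np.pi * (s - WARMUP) / (TOTAL - WARMUP))))

def training_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max grad norm = 1
    optimizer.step()
    scheduler.step()
    return loss.item()
```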
All models except ours were trained using pre-trained ImageNet weights, since this significantly improves their performance, and were subjected to the same training settings as our model for a fair comparison. As shown in Table 2, our model outperformed the SOTA models as well as the baseline on all of the external test datasets, demonstrating superior performance and stability for real-world application.

Model Interpretability Results. Fig. 2 illustrates examples of the saliency map visualization for each disease class. As shown in Fig. 2(A), our model well localized a focal cavitary lesion caused by a bacterial infection, while it was also able to delineate the multifocal areas of viral involvement that are common in COVID-19 infection, as in Fig. 2(B).

Self-supervised Contrastive Pre-training. Since it has been suggested that Transformer-based models may benefit from pre-training in a self-supervised manner to learn the sequence structure before fine-tuning for downstream tasks [5, 9, 17], we evaluated the benefit of self-supervised pre-training for the Transformer-based SOTA (hybrid ViT) model and our model. We used SimCLR [6], a self-supervised contrastive learning method with a data augmentation framework, as our self-supervised pre-training method. As provided in Table 3, the experimental results reveal that self-supervised pre-training was superfluous or even detrimental to our model, since it is already equipped with a well-trained backbone network, although it slightly improved the performance of the Transformer-based SOTA model. Nonetheless, our model still outperformed the SOTA model, which suggests that the corpus generated by the ResNet may not be an optimal input feature embedding for CXR classification.

Ablation. We conducted an ablation study to determine whether to keep the backbone network fixed or trainable after training on the CheXpert dataset. The results in Table 3 suggest that keeping the backbone weights trainable is better than fixing them on all three external test datasets, which is thought to result from the improved capacity afforded by the trainable parameters of the backbone network, dispelling the concern about overfitting.

In this work, we proposed a novel Vision Transformer model for COVID-19 CXR diagnosis that uses a low-level CXR feature corpus. The novelty of this study lies in leveraging a backbone network, trained to find low-level abnormal CXR findings in a prebuilt large-scale dataset, to embed a feature corpus suitable for high-level disease classification with the Transformer model. The experimental results on various external test datasets confirm that our model not only achieves SOTA performance in the diagnosis of COVID-19 and other infectious diseases, but also retains stable performance irrespective of the external settings, which is a sine qua non for widespread deployment of such a system. In addition, we provided interpretable results with an improved visualization method that goes beyond attention, which is expected to be of great help to clinicians.
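For reference, the following is a minimal sketch of the NT-Xent contrastive objective at the core of SimCLR [6], as used in the pre-training experiment above; the function name and temperature value are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.
    z1, z2: (N, D) projection-head outputs for two augmented views of
    the same N images; matching rows form the positive pairs."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # exclude self-similarity
    # Row i < N is positive with row i + N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```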
References

[1] COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
[2] Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection
[3] Transformer interpretability beyond attention visualization
[4] More data can expand the generalization gap between adversarially robust and standard models
[5] Generative pretraining from pixels
[6] A simple framework for contrastive learning of visual representations
[7] Chest X-ray in new coronavirus disease 2019 (COVID-19) infection: findings and correlation with clinical outcome
[8] BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
[9] BERT: pre-training of deep bidirectional transformers for language understanding
[10] An image is worth 16x16 words: transformers for image recognition at scale
[11] COVIDX-Net: a framework of deep learning classifiers to diagnose COVID-19 in X-ray images
[12] The challenges of deploying artificial intelligence models in a rapidly evolving pandemic
[13] CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison
[14] Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
[15] COVID-19 and the risk to health care workers: a case report
[16] Deep learning COVID-19 features on CXR using limited training data sets
[17] Improving language understanding by generative pre-training
[18] Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
[19] End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays
[20] Real-time RT-PCR in COVID-19 detection: issues affecting the results
[21] Attention is all you need
[22] COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[23] ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
[24] A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT
[25] Frequency and distribution of chest radiographic findings in patients positive for COVID-19
[26] Weakly supervised lesion localization with probabilistic-CAM pooling
[27] Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study
[28] COVID-19 screening on chest X-ray images using deep learning based anomaly detection
[29] Deep learning-based detection for COVID-19 from chest CT using weak label