Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification

Sangjoon Park, Gwanghyun Kim, Yujin Oh, Joon Beom Seo, Sang Min Lee, Jin Hwan Kim, Sungjun Moon, Jae-Kwang Lim, Jong Chul Ye

Date: 2021-04-15

Developing a robust algorithm to diagnose and quantify the severity of COVID-19 using chest X-rays (CXR) requires a large number of well-curated COVID-19 datasets, which are difficult to collect under the global COVID-19 pandemic. On the other hand, CXR data with other findings are abundant. This situation is ideally suited for the Vision Transformer (ViT) architecture, in which a large amount of unlabeled data can be exploited through structural modeling by the self-attention mechanism. However, the use of the existing ViT is not optimal, since the feature embedding through direct patch flattening or a ResNet backbone in the standard ViT is not intended for CXR. To address this problem, we propose a novel Vision Transformer that utilizes a low-level CXR feature corpus obtained from a backbone network that extracts common CXR findings. Specifically, the backbone network is first trained with large public datasets to detect common abnormal findings such as consolidation, opacity, edema, etc. Then, the embedded features from the backbone network are used as corpora for a Transformer model for the diagnosis and severity quantification of COVID-19. We evaluate our model on various external test datasets from entirely different institutions to assess the generalization capability. The experimental results confirm that our model achieves state-of-the-art performance in both the diagnosis and severity quantification tasks with superior generalization capability, which is a sine qua non for widespread deployment.

The novel coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has emerged as one of the deadliest viruses of the century, resulting in about 137 million people infected and over 2.9 million deaths worldwide as of April 2021. In light of the unprecedented COVID-19 pandemic, public health systems have faced many challenges, including scarce medical resources, which are pushing healthcare providers to face the threat of infection (Ng et al., 2020). Considering its ominously contagious nature, the early screening of COVID-19 infection is crucial in the pandemic situation. In addition, CXR is useful for follow-up to assess the response to treatment, which should be inexpensive and low in radiation exposure. Consequently, many studies have reported early applications of deep learning on CXR for the diagnosis (Hemdan et al., 2020; Narin et al., 2020; Oh et al., 2020) or severity quantification of COVID-19 (Cohen et al., 2020a; Signoroni et al., 2020a; Zhu et al., 2020a; Wong et al., 2020), but they suffered from the ineradicable drawback of poor generalization capability stemming from scanty labeled COVID-19 data (Hu et al., 2020; Zech et al., 2018; Roberts et al., 2021). Stable generalization performance on unseen data is indispensable for widespread adoption of such a system (Roberts et al., 2021). One of the most common measures to address this problem is to build a robust model with a vast amount of training data (Chen et al., 2020a), but it is difficult to construct a large-scale dataset of labeled COVID-19 cases under the current pandemic situation.
As a result, several methods have been proposed to mitigate the problem by transfer learning (Apostolopoulos and Mpesiana, 2020), weakly supervised learning (Zheng et al., 2020a; Wang et al., 2020b), and anomaly detection, but their performances are still suboptimal. These previous studies mostly utilize convolutional neural network (CNN) models, which were not specifically designed for the manifestations of COVID-19, characterized by bilateral involvement, peripheral and lower zone dominance of ground-glass opacities, and patchy consolidations (Cozzi et al., 2020). Although the CNN architecture has been shown to be superb in many vision tasks, it may not be optimal for high-level CXR disease classification, where global characteristics such as multiplicity, distribution, and patterns have to be considered. This is due to the intrinsic locality of pixel dependencies in the convolution operation.

To overcome a similar limitation of CNNs in computer vision problems that require the integration of global relationships between pixels, the Vision Transformer (ViT), equipped with the Transformer architecture (Vaswani et al., 2017), was proposed to model long-range dependencies among pixels through the self-attention mechanism, showing state-of-the-art (SOTA) performance in image classification (Dosovitskiy et al., 2020). Since the Transformer was originally invented for natural language processing (NLP) to attend to different positions of an input sequence within a corpus and compute a representation of that sequence, the choice of an appropriate corpus is a prerequisite for the Transformer design. In the original paper (Dosovitskiy et al., 2020), two ViT models were suggested, utilizing either direct pixel-patch embedding or feature embedding by a ResNet backbone as the corpora for the Transformer. A problem arises here, however, in that neither the direct pixel-patch embedding nor the feature embedding from ResNet may be the optimal input embedding for CXR diagnosis of COVID-19.

Fortunately, several large-scale CXR datasets were constructed before the COVID-19 pandemic and are publicly available. For example, CheXpert (Irvin et al., 2019), a large dataset that contains over 220,000 CXR images, provides labels for common low-level CXR findings (e.g., consolidation, opacity, edema, etc.), which are also useful for the diagnosis of infectious disease. Moreover, an advanced CNN architecture has been suggested using the same dataset, which uses probabilistic class activation map (PCAM) pooling to leverage the class activation map to enhance the localization ability as well as the classification performance. To take maximum advantage of both the dataset and the network architecture for COVID-19, we propose a novel ViT architecture that utilizes this advanced CNN architecture as a feature extractor for a low-level CXR feature corpus, upon which a Transformer is trained for the downstream diagnosis task using the self-attention mechanism. It is worth mentioning that our task is basically identical to text classification with the Transformer architecture, in which the Transformer not only adds up the meanings of words but also considers their locations and relationships to make a sentence-level classification. Moreover, our method emulates clinical experts, who determine the final diagnosis of a CXR (e.g., normal, bacterial pneumonia, COVID-19 infection, etc.) by comprehensively considering the low-level features together with their pattern, multiplicity, location, and distribution (e.g.,
multiple opacities and patchy consolidations with lower lung zone dominance: high probability of COVID-19), as illustrated in Fig. 1.

Another important contribution of this paper is to show that our ViT framework can also be used for COVID-19 severity quantification and localization, enabling serial follow-up of severity and thereby assisting the treatment decisions of clinicians (Cohen et al., 2020b). The severity of COVID-19 can be determined by quantifying the extent of COVID-19 involvement. Recently, simple array-based severity annotations, in which 1 or 0 is assigned to each of six subdivisions of the lungs, were proposed by Toussie et al. (2020), and we are interested in utilizing this weak labeling approach for severity quantification. As the Transformer output already incorporates the long-range relationships between regions through self-attention, we use it to design a lightweight network that can accurately quantify and localize the extent of COVID-19 from weak labels. Specifically, we adopt region-of-interest (ROI) max-pooling of the output Transformer features to bridge the severity map and the simple array. Consequently, in addition to a global severity score from 0 to 6, our model can create an intuitive severity map in which each pixel value explicitly represents the likelihood of the presence of a COVID-19 lesion, using only the weak array-based labels.

In summary, our main contributions are as follows.
• A novel ViT model for COVID-19 is proposed by leveraging a low-level CXR feature corpus that contains the representations of common CXR findings learned from a pre-built large-scale dataset.
• We do not limit our model to diagnosis, but expand it to quantify severity, providing clinicians with clinical guidance for making treatment decisions.
• We experimentally demonstrate that our method outperforms other Transformer-based models as well as CNN-based models, especially in terms of generalization on unseen data.

The remainder of this paper is organized as follows. Section 2 summarizes the related works. Sections 3 and 4 describe the proposed framework and datasets, respectively. Experimental results are presented in Section 5. Finally, we conclude this work in Section 6.

The Transformer (Vaswani et al., 2017), which was originally invented for NLP, is a deep neural network based on the self-attention mechanism, which facilitates appreciably large receptive fields. After demonstrating its astounding performance, the Transformer has not only become the de facto standard in NLP, but has also motivated the computer vision community to explore its applications by taking advantage of long-range dependencies between pixels (Khan et al., 2021). The ViT was the first major attempt to apply a pure Transformer directly to images, suggesting that it can completely replace standard convolution operations while attaining SOTA performance. However, the experimental results showed that training the vanilla ViT model requires a huge computational cost. Therefore, the authors also suggested a hybrid architecture that couples a CNN backbone (e.g., ResNet) to the Transformer. With the features extracted by ResNet, the Transformer can focus mainly on modeling global attention. The experimental results suggest that the hybrid approach was able to achieve higher performance with a relatively small amount of computation.
After the introduction of the ViT, the application of the Transformer in computer vision has become an active area of investigation, resulting in many ViT variants showing SOTA performance in a variety of vision tasks, including object detection, classification (Dosovitskiy et al., 2020; Chen et al., 2020b), segmentation, and so on.

The class activation map (CAM) is a type of class-specific saliency map obtained by quantifying the contribution of a particular area of an image to the network's prediction. The most useful aspect of CAM is that it enables localization of the important area with only weak labels, namely image-level supervision. Despite its excellent localization ability, most previous works utilized CAM only to generate heatmaps for lesion localization and visualization during inference. To leverage the localization ability of CAM to enhance the performance of the network itself, one recent study utilized CAM during training in CXR classification and localization tasks (Ye et al., 2020). They devised a novel global pooling operation, known as probabilistic-CAM (PCAM) pooling, that explicitly leverages the CAM in a probabilistic manner. Different from standard approaches that use CAM for direct localization, they bound it with an additional fully-connected layer and a sigmoid function to obtain probabilities for each CXR finding. Then, normalized attention weights were obtained from these output probabilities to produce weighted feature maps containing a more useful representation for each class. They showed that the PCAM pooling operation can enhance both the localization and diagnostic performance of the model, and achieved first place in the 2019 CheXpert challenge.

To build an automated algorithm for severity quantification, pixel-level annotations such as lesion segmentation labels can offer plentiful information. However, this type of labeling is labor-intensive, and collecting a large amount of data with pixel-level annotations is not feasible under the global COVID-19 pandemic. To alleviate the problem, simplified severity annotation methods, such as score-based and array-based methods, have been proposed. For example, Cohen et al. (2020a) suggested a geographic extent score and a lung opacity score based on a rating system for lung oedema proposed by Warren et al. (2018). The geographic extent score assigns scores ranging from 0 to 4, while the lung opacity score assigns values of 0 to 3, based on the severity of involvement in each lung area. Borghesi and Maroldi (2020) designed the Brixia score, another array-type severity labeling method, dividing the lungs with anatomic landmarks and assigning a score of 0-3 to each subdivision. Similarly, Toussie et al. (2020) suggested an array-based severity score for COVID-19. After dividing both lungs into six subdivisions, each area is assigned a value of 0 or 1 depending on the presence of COVID-19 involvement, which adds up to an overall severity score of 0 to 6. We adopted the array-based annotation method suggested by Toussie et al. (2020) for severity quantification of COVID-19.

One of the novel contributions of our approach is to show that we can maximize the performance of the Transformer model by using a low-level CXR corpus that comes from a backbone network trained with a large, well-curated public dataset to detect common CXR findings. As the backbone network is trained with a large amount of data, it is less prone to overfitting and the generalization capability can thereby be improved.
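To make the pooling mechanism described above concrete, the following is a minimal PyTorch sketch of a PCAM-style pooling block for a single finding: a 1 × 1 convolution (acting as the per-pixel fully-connected layer) produces a class activation map, a sigmoid converts it to per-pixel probabilities, and the normalized probabilities serve as attention weights for pooling the feature map. Module and variable names, and the exact way the image-level probability is derived, are illustrative assumptions rather than the implementation of Ye et al. (2020).

```python
import torch
import torch.nn as nn

class PCAMPooling(nn.Module):
    """Sketch of probabilistic-CAM (PCAM) pooling for a single CXR finding.

    A 1x1 convolution produces a class activation map, a sigmoid turns it into
    per-pixel probabilities, and the normalized probabilities act as attention
    weights that pool the feature map into one attention-weighted vector.
    """

    def __init__(self, in_channels: int):
        super().__init__()
        # per-pixel scoring layer (a hypothetical choice standing in for the FC layer)
        self.cam_conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W) low-level feature map from the CNN backbone
        cam = self.cam_conv(feat)                                     # (B, 1, H, W) class evidence map
        prob = torch.sigmoid(cam)                                     # per-pixel probability of the finding
        weight = prob / (prob.sum(dim=(2, 3), keepdim=True) + 1e-8)   # normalized attention weights
        pooled = (feat * weight).sum(dim=(2, 3))                      # (B, C) attention-weighted feature
        return pooled, prob                                           # pooled feature and probability map
```

A downstream classifier head would map the pooled feature to an image-level probability for the finding; in our setting, the feature maps produced just before this pooling step constitute the low-level feature corpus fed to the Transformer.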
As the backbone network to extract low-level features, we used a modified version of the network proposed by Ye et al. (2020). First, the backbone network was pre-trained to classify 10 common low-level findings with a large public dataset. As depicted in Fig. 2, the feature maps at each layer are candidates for the feature embedding used by the subsequent Transformer, and we experimentally found that the common embedding before the PCAM operation contains the most useful information. Nevertheless, care should be exercised, since the PCAM operation for specific low-level CXR findings (e.g., lung opacity, consolidation, etc.) turns out to be crucial for achieving an optimal embedding at the intermediate level, as PCAM aligns these features to obtain better classification results. More detailed experimental results on the choice of feature map level are provided in the ablation studies in Section 5.6.2.

The overall framework and architecture of our ViT model are provided in Fig. 3. Specifically, for a given $H \times W$ input image $x \in \mathbb{R}^{H \times W}$, the backbone network $G$ generates $H' \times W'$ feature maps $F$:
$$F = G(x).$$
Here, the feature tensor $F \in \mathbb{R}^{H' \times W' \times C}$ can be reshaped as
$$F = [f_1, f_2, \dots, f_N], \quad N = H' W',$$
where $f_n \in \mathbb{R}^C$ denotes the $C$-dimensional embedded representation of low-level features at the $n$-th encoded block. These feature vectors are used to construct the low-level CXR feature corpora for the Transformer.

Then, similar to Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), our Vision Transformer applies Transformer encoder layers to the input embedding. Specifically, since the Transformer encoder utilizes a constant latent vector dimension $D$, each extracted $C$-dimensional feature $f_n \in \mathbb{R}^C$ is first projected to a $D$-dimensional feature $\tilde{f}_n \in \mathbb{R}^D$ using a $1 \times 1$ convolution kernel. We then prepend a learnable [class] token embedding vector $f_{cls} \in \mathbb{R}^D$ to the projected feature tensor. This leads to the following composite projected feature tensor:
$$\tilde{F} = [f_{cls}, \tilde{f}_1, \tilde{f}_2, \dots, \tilde{f}_N].$$
A positional embedding $E_{pos}$ with the same shape as $\tilde{F}$ is then added to encode a notion of the sequential order:
$$Z^{(0)} = \tilde{F} + E_{pos}.$$
This is then used as the input to a Transformer composed of $L$ successive encoder layers:
$$Z^{(l)} = \mathcal{T}^{(l)}(Z^{(l-1)}), \quad l = 1, \dots, L,$$
where $\mathcal{T}^{(l)}$ denotes the $l$-th encoder layer. The encoder layers used in our model are the same as in the standard Transformer, consisting of repeated blocks of multi-head self-attention (MSA), a multilayer perceptron (MLP), layer normalization (LN), and residual connections, as shown in Fig. 3. Then, the first column $z^{(L)}_0$ of $Z^{(L)}$ represents the Transformer-attended feature vector with respect to the [class] token. Therefore, by simply adding a linear classifier as the classification head, we can obtain the diagnosis result for the input CXR image $x$ (see Fig. 3(A)).

For the interpretability of the classification model, we adopted a saliency map visualization method tailored for ViT, suggested by Chefer et al. (2020), which computes relevancy for Transformer networks. Specifically, unlike traditional gradient propagation methods (Selvaraju et al., 2017; Smilkov et al., 2017; Srinivas and Fleuret, 2019) or attribution propagation methods (Bach et al., 2015; Gu et al., 2018), which rely on heuristic propagation along the attention graph or on the obtained attention maps, the method of Chefer et al. (2020) calculates the local relevance with deep Taylor decomposition, which is then propagated throughout the layers. This relevance propagation method is especially useful for models based on the Transformer architecture, as it overcomes the problems posed by self-attention operations and skip connections.
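Before turning to severity quantification, the diagnosis path defined by the equations above can be summarized in a compact PyTorch-style sketch: backbone features are projected to dimension $D$ with a 1 × 1 convolution, flattened into $N$ tokens, prepended with a learnable [class] token, combined with positional embeddings, and passed through $L$ encoder layers. PyTorch's built-in encoder layer is used here as a stand-in for the standard ViT block, and the hyperparameter values and module names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class FeatureCorpusViT(nn.Module):
    """Sketch of the diagnosis path: backbone features -> token corpus ->
    [class] token + positional embedding -> Transformer encoder -> linear head."""

    def __init__(self, backbone: nn.Module, in_ch: int = 1024, dim: int = 768,
                 depth: int = 12, heads: int = 12, num_classes: int = 3,
                 num_tokens: int = 16 * 16):
        super().__init__()
        self.backbone = backbone                               # e.g. DenseNet-121 + PCAM feature extractor
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=1)       # 1x1 conv: C -> D projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))              # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, dim)) # positional embedding E_pos
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)      # L encoder layers (MSA + MLP + LN)
        self.head = nn.Linear(dim, num_classes)                # linear classification head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)                                # (B, C, H', W') low-level feature corpus
        tokens = self.proj(feat).flatten(2).transpose(1, 2)    # (B, N, D) with N = H' * W'
        cls = self.cls_token.expand(tokens.size(0), -1, -1)    # broadcast [class] token over the batch
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed   # prepend token, add positions
        z = self.encoder(z)                                    # Z^(L)
        return self.head(z[:, 0])                              # classify from the [class] token output
```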
In the classification task, only the Transformer output at the [class] token position is used. However, the remaining Transformer outputs also provide feature embeddings at each block position, taking into account the long-range relations between blocks. Therefore, we conjecture that this information is useful for severity quantification, as severity is determined by both local and global manifestations of the disease. Accordingly, as shown in Fig. 3, these outputs are combined by an additional lightweight network to produce the COVID-19 severity map.

Specifically, as shown in Figs. 3(b) and 4, we first extract the Transformer outputs $Z_{res}$ from $Z^{(L)}$, excluding the [class] token position:
$$Z_{res} = [z^{(L)}_1, z^{(L)}_2, \dots, z^{(L)}_N],$$
which is used as the input to the map head network $N$:
$$S = N(Z_{res}).$$
Then, the network output $S$ is multiplied pixel-wise with the segmentation mask $M$, after which ROI max-pooling (RMP) is applied to generate the severity array $Y_{sev} \in \mathbb{R}^{3 \times 2}$:
$$Y_{sev} = \mathrm{RMP}(S \otimes M),$$
where $\otimes$ denotes the Hadamard product. In detail, the lungs were divided into a total of six subdivisions by splitting each of the right and left lungs into three zones (upper, middle, lower) at the 5/12 and 2/3 lines. Next, the largest value within each of the six subdivisions was assigned as the predicted value of the corresponding entry of the severity array. The map head network is then trained by minimizing the error of the estimated severity array with respect to the weakly annotated severity label, as in Fig. 4.

To generate the lung segmentation mask, we used the method introduced by Oh and Ye (2021). In contrast to existing approaches, which are prone to under-segmentation of severely infected lungs with large consolidations, this approach enables accurate segmentation of abnormal as well as normal lung areas by learning common features using a single generator with AdaIN layers. Since a single generator is used for all these tasks by simply changing the AdaIN codes, it can synergistically learn the common features to improve segmentation performance for abnormal CXR data.
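To make the bridge between the pixel-wise severity map and the weak 3 × 2 array concrete, the following is a minimal sketch of the ROI max-pooling step, assuming a single-image severity map and lung mask on a common grid. The exact zone boundaries, the row/column convention of the array, and the splitting of the two lungs at the image midline are assumptions for illustration only.

```python
import torch

def roi_max_pool_severity(sev_map: torch.Tensor, lung_mask: torch.Tensor) -> torch.Tensor:
    """Convert a pixel-wise severity map into the 3x2 severity array by
    ROI max-pooling over six lung subdivisions (rows: upper/middle/lower
    zones split at the 5/12 and 2/3 lines; columns: the two lungs)."""
    masked = sev_map * lung_mask                             # Hadamard product with the lung segmentation mask
    h, w = masked.shape[-2:]
    row_cuts = [0, round(h * 5 / 12), round(h * 2 / 3), h]   # zone boundaries (assumed fractions of the grid)
    col_cuts = [0, w // 2, w]                                # left/right half of the image (assumption)
    sev_array = torch.zeros(3, 2)
    for i in range(3):
        for j in range(2):
            region = masked[..., row_cuts[i]:row_cuts[i + 1], col_cuts[j]:col_cuts[j + 1]]
            sev_array[i, j] = region.max()                   # max value = presence score for the subdivision
    return sev_array

# The global severity (0-6) can then be read off by summing the (rounded) six entries
# and compared against the weak array-based label during training.
```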
The datasets used for this study can be divided into three groups: the dataset for pre-training the backbone, the dataset for classification, and the dataset for severity quantification. For pre-training the backbone network to extract low-level CXR features, we used the CheXpert dataset containing 10 labeled CXR findings: no finding, cardiomegaly, opacity, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, and support device. Of a total of 224,316 CXR images from 65,240 subjects, the 32,387 lateral view images were excluded, leaving 29,420 PA and 161,427 AP view images available. With this large number of CXRs, we were able to train a backbone network robust to variation across subjects, which is one of the key strengths of our model.

For the classification task, CXR data from multiple sources were integrated, including institutional datasets (CNUH, YNU, KNUH; Korea) labeled by board-certified radiologists for this study. Finally, the integrated dataset was divided into three label classes, namely normal, other infection (e.g., bacterial infection, tuberculosis), and COVID-19 infection, considering the application in a real clinical setting. For PA view images, we set the three institutional datasets (CNUH, YNU, KNUH) aside as external test datasets to evaluate the generalization capability using data collected from independent hospitals with different devices and settings. On the other hand, only the CNUH data was used as the external test dataset for AP view images, since it was the only dataset containing all three label classes.

Table 3 summarizes the dataset resources and global severity levels. Different from the diagnosis task, the PA and AP view data were integrated and used without division for the severity quantification task, since follow-up images may be obtained in both PA and AP views even for a single patient. Two board-certified radiologists labeled the severity for the three institutional datasets (CNUH, YNU, KNUH) using the array-based severity labeling method of Toussie et al. (2020). We also utilized publicly available data, the Brixia dataset, after translating its severity scores into the same format as those of the institutional datasets. We alternately used one institutional dataset as the external test set and trained the models with the two remaining datasets together with the Brixia dataset, to evaluate the generalization capability in various external settings. In addition, 12 COVID-19 cases from the BIMCV dataset were used to compare the severity maps generated by our model with those annotated by clinical experts.

The CXR images were preprocessed via histogram equalization, Gaussian blurring with a 3 × 3 kernel, and normalization, and were finally resized to 512 × 512. As our backbone network, we used the modified version of the network proposed by Ye et al. (2020), which comprises a DenseNet-121 baseline followed by PCAM operations. Among several layers of intermediate feature maps, we used the feature map of size 16 × 16 × 1024 just before the PCAM operation. For the subsequent Transformer, we used the standard Transformer model with 12 layers and 12 heads per layer; a comparison with Transformer architectures of different network sizes is provided in Section 5.6.1.

For pre-training of the backbone network, the Adam optimizer with a learning rate of 0.0001 was used. We trained the backbone network for 160,000 optimization steps with a step decay scheduler and a batch size of 8. For training of the classification model, the SGD optimizer with momentum 0.9 and a learning rate of 0.001 was used. A maximum gradient norm of 1 was applied to stabilize training. We trained the model for 10,000 optimization steps with a cosine warm-up scheduler (warm-up steps = 500) and a batch size of 16. Two individual classification models were trained for PA and AP view images, respectively. For severity quantification, a map head with four up-sizing convolutional blocks is used, with the last block followed by a sigmoid non-linearity that squashes the output into the [0, 1] range. The severity quantification model was trained with the SGD optimizer at a constant learning rate of 0.003 for 12,000 optimization steps with a batch size of 4. These hyperparameters were determined experimentally.

We used the area under the receiver operating characteristic curve (AUC) as the main evaluation metric for the diagnostic performance of the classification model, and also calculated the sensitivity, specificity, and accuracy after adjusting the thresholds to meet a sensitivity of ≥ 80% where possible. As evaluation metrics for severity quantification, we used the mean squared error (MSE) as the main metric, while the mean absolute error (MAE), correlation coefficient (CC), and R² score were also measured and compared.
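For reference, the severity metrics reported below can be computed from predicted and ground-truth global severity scores as in the following sketch. The definitions follow the usual conventions (Pearson correlation, coefficient of determination), which are assumptions since the paper's exact evaluation code is not shown.

```python
import numpy as np

def severity_metrics(pred, true) -> dict:
    """Compute MSE, MAE, correlation coefficient, and R^2 between predicted
    and labeled global severity scores (0-6)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    mse = np.mean((pred - true) ** 2)                      # mean squared error
    mae = np.mean(np.abs(pred - true))                     # mean absolute error
    cc = np.corrcoef(pred, true)[0, 1]                     # Pearson correlation coefficient
    r2 = 1.0 - np.sum((true - pred) ** 2) / np.sum((true - true.mean()) ** 2)  # R^2 score
    return {"MSE": mse, "MAE": mae, "CC": cc, "R2": r2}
```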
For the classification model, the comparisons between models and the ablation studies were performed with PA view data, since the PA view usually offers better diagnostic quality and is the standard position for CXR diagnosis of lung disease, while both PA and AP view CXRs were used for the severity quantification model. All experiments, including preprocessing, model development, and evaluation, were performed using Python version 3.7 and the PyTorch library version 1.7 on NVIDIA Tesla V100 and NVIDIA RTX 3090 GPUs.

The diagnostic performance of the proposed model for PA view images is provided in Table 4. The performance for AP view images (Table 5) was slightly decreased compared with that for PA view images but was still fair (AUC ≥ 0.800) on the external test dataset, considering that diagnosing infectious disease using only the AP view is not standard and usually deteriorates the diagnostic performance. Fig. 5 exemplifies the visualization of saliency maps for each disease class in the external test datasets. As shown in the examples, our model well localized a focal infected area caused either by bacterial infection (Fig. 5(a)) or tuberculosis (Fig. 5(b)), while it was also able to delineate the multi-focal lesions in the periphery of both lower lungs in Fig. 5(c), which are typical findings of COVID-19 pneumonia.

The severity quantification results of our model are shown in Table 6. Our model showed MSEs of 1.682, 1.677, and 1.607, MAEs of 1.028, 1.102, and 0.930, correlation coefficients of 0.781, 0.777, and 0.682, and R² scores of 0.572, 0.572, and 0.432 for the three external institutions. The Brixia dataset contains a consensus subset of 150 CXR images labeled by five independent radiologists. Within this subset, the average MSE between the consensus severity score calculated by majority voting and each radiologist's rating is 1.683. As a result, the MSEs of 1.657, 1.696, and 1.676 in the three external institutions show that our model's performance is comparable to that of experienced radiologists and that it generalizes to the clinical environment. Fig. 6 illustrates examples of severity quantification, including the predicted scores, arrays, maps, and lesion contours in one of the external test datasets, confirming that our model not only correctly predicts the global severity but also generates an intuitive severity map that highlights the affected areas and can be used to contour lesions. Finally, Fig. 7 exemplifies the comparison between the ground-truth segmentation labels of the involved areas and the model's predictions of involvement in the BIMCV dataset. As shown in the figure, the model generally localized the areas of involvement well.

To compare the performance with other CNN-based and Transformer-based SOTA models, ResNet-50, ResNet-152, DenseNet-121, the standard ViT, and hybrid ViT models were tested on the same external test datasets. All models except ours were initialized with ImageNet pre-trained weights, as this significantly improved their performance. Other training and evaluation settings were kept the same for a fair comparison. As shown in Table 7 and Table 8, our model outperformed not only the CNN-based models but also the Transformer-based models on all of the external test datasets for both the classification and severity quantification tasks, demonstrating superb and stable performance for real-world application.
The performance improvement was not simply the result of increased model complexity, considering that our model also considerably outperformed the model with increased complexity (ResNet-152). To gain a better understanding of our model, we conducted a series of ablation studies, as provided in Table 9 and Table 10. More details are as follows.

Since the network size can vary according to the number of Transformer encoder layers and self-attention heads, we evaluated the effect of network size by constructing models of different sizes: 2 layers with 4 heads per layer, 4 layers with 8 heads per layer, and 12 layers with 12 heads per layer. The experimental results show that the standard Transformer architecture with 12 layers and 12 heads per layer works best.

In our model, an intermediate feature map of the backbone network was used as the input feature corpus for the Transformer. Here, we tested various options for the intermediate feature maps. To determine whether the features before or after the PCAM operations are better, the common feature map before the PCAM operation and the weighted feature maps after the PCAM operation were compared. In addition, in the case of the weighted feature maps after the PCAM operation, there are several options: using all 10 weighted feature maps for the individual CXR findings to utilize all relevant features, using the weighted feature maps of the CXR findings related to infectious disease (opacity, consolidation, and pneumonia), or using only the single most related feature (pneumonia) to reduce redundancy. As shown in Table 9, the common feature before the PCAM operation achieved the best performance.

Since our model utilizes a backbone network pre-trained with a large-scale dataset, we evaluated whether freezing the backbone weights or leaving them trainable yields better results. The experimental results show that it is better to train the weights than to freeze them on all three external test datasets. This may be due to the increased capacity of a trainable backbone network, which dispels the fear of overfitting.

Regarding the input image size, we compared input CXRs of size 1024 × 1024 and 512 × 512 with the same model and experimental settings. The size of the intermediate feature map (16 × 16 × 768) remained the same by increasing the stride of the patch embedding layer two-fold. As shown in Table 9, increasing the resolution of the input CXR deteriorates rather than improves the performance, leading to severe overfitting and poor generalization.

Since previous works on Transformer-based models have suggested that self-supervised pre-training to model the structural sequence may be beneficial (Devlin et al., 2018; Radford et al., 2018; Chen et al., 2020b), we evaluated whether the Transformer-based models (standard ViT, hybrid ViT, and our model) benefit from self-supervised pre-training. We adopted SimCLR (Chen et al., 2020c), a contrastive learning technique for visual representations with a data augmentation framework, as our self-supervised pre-training method. As shown in Table 9, the benefit of self-supervised pre-training was not prominent for either the hybrid ViT or our model, though it slightly improved the performance of the standard ViT model.
Nevertheless, our model still outperformed the other Transformer-based models pre-trained with self-supervised learning, which suggests that the low-level CXR feature corpus generated by the proposed method may be a more suitable input feature embedding than the direct pixel-patch embedding (standard ViT) or the feature embedding from the ResNet backbone (hybrid ViT) for the CXR classification task.

The ROI max-pooling method is rooted in the annotation rule, in which 1 or 0 is assigned to each subdivision according to the presence or absence of the abnormality, rather than the average degree of opacity in the subdivision. Therefore, if there are pixels with a high probability of abnormality in a small region, the value of 1 should be assigned to the subdivision even if most of the pixels have a low probability. Thus, max-pooling is more appropriate for bridging the severity map and the array than average pooling. To verify this hypothesis, we compared the severity quantification performance of models using average-pooling and max-pooling to convert the severity map into the array. As shown in Table 10, ROI max-pooling outperforms ROI average-pooling.

As the Brixia dataset constitutes the majority of the data for severity quantification, we evaluated whether the model can show stable performance with only the Brixia dataset. However, as shown in Table 10, the performance deteriorated when the institutional datasets were all excluded from the training data, demonstrating the necessity of including at least one institutional dataset to obtain a stable model. This degradation in performance may result from a variety of reasons, including the discrepancy in labeling details between the Brixia dataset and the institutional datasets (e.g., the anatomic landmarks determining the horizontal lines), the anatomical differences between races (e.g., Caucasian vs. Asian), and so on.

In this study, we developed a novel ViT model that leverages a low-level CXR feature corpus for the diagnosis and severity quantification of COVID-19. The novelty of this work is in decoupling the overall framework into two steps: the first is to pre-train the backbone network to classify low-level CXR findings with a pre-built large-scale dataset to embed an optimal feature corpus, which is then leveraged in the second step by the Transformer for the high-level diagnosis of diseases including COVID-19. By maximally utilizing the benefit of a large-scale dataset containing more than 220,000 CXR images, the overfitting problem of neural networks trained with limited numbers of COVID-19 cases can be substantially alleviated. The extensive experimental results on various external institutions demonstrate that our model not only outperforms other CNN- and Transformer-based SOTA models but also retains outstanding generalization performance irrespective of the external setting, which is a sine qua non for widespread adoption of the system. In addition, we adapted the proposed method to the severity quantification problem, demonstrating performance similar to that of clinical experts and thereby expanding its application in the clinical setting. In the current pandemic situation, our method holds great promise in a variety of clinical scenarios (see Fig. 8).
Primarily, it can be used for patient triage along with RT-PCR to isolate suspected subjects waiting for RT-PCR results, as it has been reported that positive radiological findings precede positive RT-PCR results in a substantial portion (308 out of 1,014) of patients (Ai et al., 2020). In addition, it is possible to guide treatment decisions or to evaluate the response by applying our severity prediction algorithm to consecutive CXRs. Given the lack of experienced radiologists and the unavailability of examinations with higher sensitivity, the applications of our model could be of great value in underprivileged countries. Finally, the concept of making a higher-level diagnosis by aggregating a low-level feature corpus, which is readily available from pre-built datasets, can be applied to quickly develop a robust algorithm against a newly emerging pathogen, since it is expected to share common low-level CXR features with existing diseases.

References

Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection
COVID-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression
Transformer interpretability beyond attention visualization
More data can expand the generalization gap between adversarially robust and standard models
Generative pretraining from pixels
A simple framework for contrastive learning of visual representations
Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
COVID-19 image data collection: prospective predictions are the future
Chest X-ray in new coronavirus disease 2019 (COVID-19) infection: findings and correlation with clinical outcome
BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
BERT: pre-training of deep bidirectional transformers for language understanding
An image is worth 16x16 words: transformers for image recognition at scale
Understanding individual decisions of CNNs via contrastive backpropagation
COVIDX-Net: a framework of deep learning classifiers to diagnose COVID-19 in X-ray images
The challenges of deploying artificial intelligence models in a rapidly evolving pandemic
CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison
Transformers in vision: a survey
Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
COVID-19 and the risk to health care workers: a case report
Deep learning COVID-19 features on CXR using limited training data sets
Unifying domain adaptation and self-supervised learning for CXR segmentation via AdaIN-based knowledge distillation
Improving language understanding by generative pre-training
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Grad-CAM: visual explanations from deep networks via gradient-based localization
Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays
SmoothGrad: removing noise by adding noise
Full-gradient representation for neural network visualization
Real-time RT-PCR in COVID-19 detection: issues affecting the results
Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19
Attention is all you need
COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT
ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS
COVIDNet-S: towards computer-aided severity assessment via training and validation of deep neural networks for geographic extent and opacity extent scoring of chest X-rays for SARS-CoV-2 lung disease severity
Weakly supervised lesion localization with probabilistic-CAM pooling
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study
COVID-19 screening on chest X-ray images using deep learning based anomaly detection
Deep learning-based detection for COVID-19 from chest CT using weak label
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
Deep transfer learning artificial intelligence accurately stages COVID-19 lung disease severity on portable chest radiographs
Deformable DETR: deformable transformers for end-to-end object detection