key: cord-0998045-kod4qdm9 authors: Park, Sangjoon; Kim, Gwanghyun; Oh, Yujin; Seo, Joon Beom; Lee, Sang Min; Kim, Jin Hwan; Moon, Sungjun; Lim, Jae-Kwang; Ye, Jong Chul title: Multi-task Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification date: 2021-11-04 journal: Med Image Anal DOI: 10.1016/j.media.2021.102299 sha: 8fd324c62ef5aade241f477cad2a2eb9132caefc doc_id: 998045 cord_uid: kod4qdm9 Developing a robust algorithm to diagnose and quantify the severity of the novel coronavirus disease 2019 (COVID-19) using Chest X-ray (CXR) requires a large number of well-curated COVID-19 datasets, which are difficult to collect under the global COVID-19 pandemic. On the other hand, CXR data with other findings are abundant. This situation is ideally suited for the Vision Transformer (ViT) architecture, where a large amount of unlabeled data can be exploited through structural modeling by the self-attention mechanism. However, the use of the existing ViT may not be optimal, as the feature embedding by direct patch flattening or a ResNet backbone in the standard ViT is not designed for CXR. To address this problem, here we propose a novel multi-task ViT that leverages a low-level CXR feature corpus obtained from a backbone network that extracts common CXR findings. Specifically, the backbone network is first trained with large public datasets to detect common abnormal findings such as consolidation, opacity, edema, etc. Then, the embedded features from the backbone network are used as corpora for a versatile Transformer model for both the diagnosis and the severity quantification of COVID-19. We evaluate our model on various external test datasets from entirely different institutions to assess the generalization capability. The experimental results confirm that our model can achieve state-of-the-art performance in both diagnosis and severity quantification tasks with outstanding generalization capability, which is a sine qua non of widespread deployment. The novel coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has emerged as one of the deadliest viruses of the century, resulting in about 137 million people infected and over 2.9 million deaths worldwide as of April 2021. In the light of the unprecedented COVID-19 pandemic, public health systems have faced many challenges, including scarce medical resources, which push healthcare providers to face the threat of infection (Ng et al., 2020). Considering its ominously contagious nature, the early screening of COVID-19 infection is becoming increasingly important to avert the further spread of disease and thereby reduce the burden on the saturated health care system. Currently, the real-time polymerase chain reaction (RT-PCR) is considered the gold standard in the diagnosis of COVID-19 for its high sensitivity and specificity (Tahamtan and Ardebili, 2020), but it takes several hours or even days, depending on the region, to obtain the exam results due to overstressed laboratories. Since the majority of patients with confirmed COVID-19 present positive radiological findings, radiologic examinations can be useful for rapid screening of the disease (Shi et al., 2020). Although computed tomography (CT) has excellent sensitivity and specificity for COVID-19 diagnosis, the use of CT is a major burden because of its high cost and potential for cross-contamination in the radiology suite.
Therefore, Chest X-ray (CXR) holds many practical advantages as a primary screening tool in the pandemic situation. In addition, CXR is inexpensive and low in radiation exposure, making it also useful for follow-up to assess response to treatment. Consequently, many studies have reported early applications of CXR deep learning for diagnosis (Hemdan et al., 2020; Narin et al., 2020; Oh et al., 2020) or severity quantification of COVID-19 (Cohen et al., 2020a; Signoroni et al., 2020a; Zhu et al., 2020a; Wong et al., 2020), but they suffered from the ineradicable drawback of poor generalization capability stemming from scanty labeled COVID-19 data (Hu et al., 2020; Zech et al., 2018; Roberts et al., 2021). Stable generalization performance on unseen data is indispensable for widespread adoption of the system (Roberts et al., 2021). One of the most commonly used measures to solve this problem is to build a robust model with a very large amount of training data (Chen et al., 2020a). However, although plenty of COVID-19 CXRs are taken all around the world every day, available datasets are still limited due to the lack of expert labels and the difficulty of sharing patient data outside the hospital for privacy reasons. The situation becomes even worse under the current pandemic, which hinders collaboration between hospitals in different countries. As a result, several methods have been proposed to mitigate the problem by transfer learning (Apostolopoulos and Mpesiana, 2020), weakly supervised learning (Zheng et al., 2020a; Wang et al., 2020b), and anomaly detection, but their performances are still suboptimal. The previous studies mostly utilize convolutional neural network (CNN) models, which were not specifically designed for the manifestations of COVID-19, which are characterized by bilateral involvement, peripheral and lower zone dominance of ground-glass opacities, and patchy consolidations (Cozzi et al., 2020). Although the CNN architecture has been shown to be superb in many vision tasks, it may not be optimal for problems requiring high-level CXR disease classification, where global characteristics like multiplicity, distribution, and patterns have to be considered. This is due to the intrinsic locality of pixel dependencies in the convolution operation. To overcome a similar limitation of CNNs in computer vision problems that require the integration of global relationships between pixels, the Vision Transformer (ViT), equipped with the Transformer architecture (Vaswani et al., 2017), was proposed to model long-range dependencies among pixels through the self-attention mechanism, showing state-of-the-art (SOTA) performance in the image classification task (Dosovitskiy et al., 2020). Since the Transformer was originally invented for natural language processing (NLP) to attend to different positions of the input sequence within a corpus and compute a representation of that sequence, the choice of an appropriate corpus is a prerequisite for the Transformer design. In the original paper (Dosovitskiy et al., 2020), two ViT models were suggested, utilizing either direct pixel-patch embedding or feature embedding by a ResNet backbone as corpora for the Transformer. The problem, however, is that neither the direct pixel-patch embedding nor the feature embedding from ResNet may be the optimal input embedding for the CXR diagnosis of COVID-19. Fortunately, several large-scale CXR datasets were constructed before the COVID-19 pandemic and are publicly available.
For example, CheXpert (Irvin et al., 2019), a large dataset that contains over 220,000 CXR images, provides labels for common low-level CXR findings (e.g. consolidation, opacity, edema, etc.), which are also useful for the diagnosis of infectious disease. Moreover, an advanced CNN architecture has been suggested using the same dataset, which uses probabilistic class activation map (PCAM) pooling to leverage the class activation map to enhance the localization ability as well as the classification performance. To take maximum advantage of both the dataset and the network architecture for COVID-19, here we propose a novel ViT architecture that utilizes this advanced CNN architecture as a feature extractor for the low-level CXR feature corpus, upon which the Transformer is trained for the downstream diagnosis tasks using the self-attention mechanism. It is worth mentioning that our framework is basically analogous to text classification with the Transformer architecture, where the Transformer takes into account not only the meaning but also the location and relationship of words to classify at the sentence level. Moreover, our method emulates the clinical experts who determine the final diagnosis of a CXR (e.g. normal, bacterial pneumonia, COVID-19 infection, etc.) by comprehensively considering the low-level features with their pattern, multiplicity, location, and distribution (e.g. multiple opacities and patchy consolidations with lower lung zone dominance: high probability of COVID-19), as illustrated in Fig. 1. Another important contribution of this paper is to show that our ViT framework can also be used for COVID-19 severity quantification and localization, enabling the serial follow-up of severity and thereby assisting the treatment decisions of clinicians (Cohen et al., 2020b). The severity of COVID-19 can be determined by quantifying the extent of COVID-19 involvement. Recently, a simple array-based severity annotation, in which 1 or 0 is assigned to each of six lung subdivisions, was proposed by Toussie et al. (2020), and we utilize this weak labeling approach for severity quantification. As the Transformer output already incorporates the long-range relationships between regions through self-attention, we use this Transformer output to design a lightweight network that can accurately quantify and localize the COVID-19 extent from weak labels. Specifically, we adopt region of interest (ROI) max-pooling of the output Transformer feature to bridge the severity map and the simple array. Consequently, in addition to the global severity score from 0 to 6, our model can create an intuitive severity map in which each pixel value explicitly represents the likelihood of the presence of a COVID-19 lesion, using only the weak array-based labels. Finally, we have integrated the developed classification and severity quantification models into a multi-task learning (MTL) framework so that a single versatile model performs classification and severity quantification simultaneously, which offers a more straightforward application of the developed system and improves the performance of the individual tasks by sharing robust representations between related tasks. In summary, our main contributions are as follows. • A novel ViT model for COVID-19 is proposed by leveraging the low-level CXR feature corpus that contains the representations for common CXR findings learned from a prebuilt large-scale dataset.
• We do not limit our model to classification but extend it to severity quantification, providing clinicians with guidance for treatment decisions. • The classification and severity quantification models were integrated into a single multi-task model for straightforward applicability, which also improved the performance of both tasks. • We experimentally demonstrated that our method outperforms the previous models for COVID-19 as well as other CNN and Transformer-based architectures, especially in terms of generalization to unseen data. The remainder of this paper is organized as follows. Section 2 summarizes the related works. Section 3 and Section 4 describe the proposed framework and datasets, respectively. Experimental results are presented in Section 5. Finally, we conclude this work in Section 6. The Transformer (Vaswani et al., 2017), which was originally invented for NLP, is a deep neural network based on a self-attention mechanism that facilitates appreciably large receptive fields. After demonstrating its astounding performance, not only has the Transformer become the de facto standard in NLP, but it has also motivated the computer vision community to explore its applications by taking advantage of the long-range dependency between pixels (Khan et al., 2021). The ViT was the first major attempt to apply a pure Transformer directly to an image, suggesting that it can completely replace the standard convolution operations while achieving SOTA performance. However, the experimental results showed that training the vanilla ViT model requires huge computational cost. Therefore, the authors also suggested a hybrid architecture that conjugates a CNN backbone (e.g. ResNet) with the Transformer. With the features extracted by ResNet, the Transformer can focus mainly on modeling the global attention. The experimental results suggest that the hybrid approach achieved higher performance with a relatively small amount of computation. After the introduction of ViT, the application of the Transformer in computer vision has become an active area of investigation, resulting in many variant models of ViT showing SOTA performance in a variety of vision tasks including object detection, classification (Dosovitskiy et al., 2020; Chen et al., 2020b), segmentation, and so on. The class activation map (CAM) is a kind of class-specific saliency map obtained by quantifying the contribution of a particular area of an image to the prediction of the network. The most useful aspect of CAM is that it enables the localization of the important area with only weak labels, namely image-level supervision. Despite its excellent localization ability, most previous works utilized CAM only to generate heatmaps for lesion localization and visualization during inference. To leverage the localization ability of CAM to enhance the performance of the network itself, one recent study utilized the CAM during training in CXR classification and localization tasks. They devised a novel global pooling operation, known as PCAM pooling, that explicitly leverages the CAM in a probabilistic manner. Different from standard approaches that use CAM for direct localization, they bound it with an additional fully connected layer and a sigmoid function to obtain probabilities for each CXR finding. Then, normalized attention weights were obtained from these output probabilities to make weighted feature maps containing more useful representations for each class.
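To make the PCAM mechanism described above concrete, the following is a minimal PyTorch sketch for a single finding; the module and variable names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class PCAMPooling(nn.Module):
    """Sketch of probabilistic class activation map (PCAM) pooling for one finding.

    A 1x1 convolution maps the backbone feature map to a per-pixel logit, the
    sigmoid turns it into a probability map, and the normalized map is used as
    attention weights to pool the feature map before the final classifier.
    """
    def __init__(self, in_channels: int):
        super().__init__()
        self.score_conv = nn.Conv2d(in_channels, 1, kernel_size=1)  # per-pixel logit
        self.classifier = nn.Linear(in_channels, 1)                 # finding probability

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H', W') feature map from the CNN backbone
        logit_map = self.score_conv(feat)                              # (B, 1, H', W')
        prob_map = torch.sigmoid(logit_map)                            # CAM-like probability map
        weights = prob_map / prob_map.sum(dim=(2, 3), keepdim=True)    # normalize to sum to 1
        pooled = (feat * weights).sum(dim=(2, 3))                      # weighted pooling -> (B, C)
        finding_prob = torch.sigmoid(self.classifier(pooled))          # (B, 1)
        return finding_prob, prob_map
```

In a multi-finding setting, one such head per finding (or a shared convolution with one output channel per finding) would be used, which is the design choice the weighted feature maps above refer to.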
They showed that the PCAM pooling operation can enhance both the localization and the diagnostic performance of the model, achieving first place in the 2019 CheXpert challenge. For a detailed description of the PCAM operation, please refer to Appendix A. To build an automated algorithm for severity quantification, pixel-level annotation such as lesion segmentation labels can offer plentiful information. However, this type of labeling is labor-intensive, and collecting large datasets with pixel-level annotations is not feasible under the global COVID-19 pandemic. To alleviate the problem, simplified severity annotation methods, such as score-based and array-based methods, have been proposed. For example, Cohen et al. (2020a) suggested a geographic extent score and a lung opacity score based on the rating system for lung edema proposed by Warren et al. (2018). The geographic extent score ranges from 0 to 4, while the lung opacity score ranges from 0 to 3, based on the severity of involvement in each lung area. Borghesi and Maroldi (2020) designed the Brixia score, another array-type severity labeling method, which divides the lungs with anatomic landmarks and assigns a score of 0-3 to each subdivision. Similarly, Toussie et al. (2020) suggested an array-based severity score for COVID-19. After dividing both lungs into six subdivisions, each area is assigned a value of 0 or 1 depending on the presence of COVID-19 involvement, which adds up to an overall severity of 0 to 6. We adopted the array-based annotation method suggested by Toussie et al. (2020) for severity quantification of COVID-19. With the rapid spread of COVID-19, numerous approaches have been proposed for automated diagnosis and severity prediction of COVID-19. For diagnosis, Wang et al. (2020a) proposed COVIDNet, which adopted a lightweight projection-expansion-projection-extension design and long-range connectivity to improve representational capacity, and showed good performance compared with standard CNN models. Khan et al. (2020) proposed CoroNet, based on the Xception (Chollet, 2017) network pre-trained on ImageNet and subsequently fine-tuned with COVID-19 data. Similarly, Minaee et al. (2020) proposed Deep-COVID, in which various data augmentations were used and the last layer of standard CNNs was fine-tuned with COVID-19 data. Using DarkNet-19 (Redmon and Farhadi, 2017), originally employed in an object detection framework, Ozturk et al. (2020) proposed DarkCOVIDNet. To quantify the severity of COVID-19 infection on CXR, Cohen et al. (2020a) devised a network pre-trained to classify 7 pathologies and trained to perform linear regression to predict the severity scores. Kwon et al. (2020) proposed CheXNet, which was pre-trained on ImageNet and subsequently trained to predict COVID-19 severity with their custom dataset. Finally, Li et al. (2020) introduced the PXS score, based on a convolutional Siamese network pre-trained on the CheXpert dataset, where two separate images are taken as inputs and passed through twinned CNNs, and the Euclidean distance between the two outputs is used to calculate the severity score. As described above, however, the previous approaches are mainly based on standard CNN model pre-training and transfer learning from an irrelevant dataset (e.g. ImageNet), and therefore do not guarantee optimal generalization performance for COVID-19.
One of the novel contributions of our approach is to show that we can maximize the performance of the Transformer model by using a low-level CXR corpus that comes from a backbone network trained on a large, well-curated public dataset to detect common CXR findings. As the backbone network is trained with a large amount of data, the subsequent models using this backbone for the classification and severity quantification tasks are less prone to overfitting, even with a smaller number of labeled cases. This is shown to improve the generalization capability of the network. After devising the models for classification and severity quantification of COVID-19, we further integrated these two models into a single multi-task model that performs both tasks simultaneously, offering better applicability as well as improved performance for the individual tasks. As the backbone network to extract low-level features, we used a modified version of the network proposed by Ye et al. (2020). First, the backbone network was pre-trained to classify 10 common low-level findings on a large public dataset. As depicted in Fig. 2, the feature maps in each layer are candidates for the feature embedding used by the subsequent Transformer, and we experimentally found that the common embedding just before the PCAM operation contains the most useful information. Nevertheless, care should be exercised, since the PCAM operation for the specific low-level CXR findings (e.g. lung opacity, consolidation, etc.) turns out to be crucial for achieving the optimal embedding at the intermediate level, as PCAM aligns these features to yield better performance. Through this operation, more prominent feature representations are embedded for each low-level entity, and combining these low-level feature representations with the Transformer to yield high-level classification and severity quantification results is one of the key ideas of our method. More detailed experimental results on the role of the PCAM operation are provided in the ablation studies of Section 5.6.2. The overall framework and the architecture of our ViT model are provided in Fig. 3. Since our model uses the same pre-trained backbone and Transformer architecture for the two tasks, the shared layers can be defined as in Fig. 3(A). Specifically, for a given H × W input image x ∈ R^(H×W), the backbone network G generates an H' × W' feature map F: F = G(x). Here, the feature tensor F ∈ R^(H'×W'×C) is defined as F = [f_1, f_2, ..., f_(H'W')]^T, where f_n ∈ R^C denotes a C-dimensional embedded representation of low-level features at the n-th encoded block. These feature vectors are used to construct the low-level CXR feature corpora for the Transformer. Then, similar to Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), our ViT applies Transformer encoder layers to the input embedding. Specifically, since the Transformer encoder utilizes a constant latent vector dimension D, each extracted C-dimensional feature f_n ∈ R^C is first projected to a D-dimensional feature f̃_n ∈ R^D using a 1 × 1 convolution kernel. We then prepend a learnable [class] token embedding vector f_cls ∈ R^D to the projected feature tensor. This leads to the following composite projected feature tensor: F̃ = [f_cls, f̃_1, f̃_2, ..., f̃_(H'W')]^T. A positional embedding E_pos of the same shape as the projected feature tensor F̃ is then added to encode a notion of sequential order: Z^(0) = F̃ + E_pos. This is then used as the input to a Transformer composed of L successive encoder layers: Z^(l) = T^(l)(Z^(l-1)), l = 1, ..., L, where Z^(l) ∈ R^((H'W'+1)×D) and T^(l) denotes the l-th encoder layer.
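As a concrete illustration of how the low-level feature corpus is turned into the Transformer input described above, the following is a minimal PyTorch sketch. The 16 × 16 × 1024 feature-map size and the 12-layer, 12-head encoder follow the text, while the latent dimension D = 768, the zero initializations, and the use of PyTorch's built-in encoder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LowLevelFeatureEmbedding(nn.Module):
    """Builds the Transformer input Z^(0) from the backbone's low-level feature map."""
    def __init__(self, in_channels: int = 1024, dim: int = 768, num_tokens: int = 16 * 16):
        super().__init__()
        self.project = nn.Conv2d(in_channels, dim, kernel_size=1)           # C -> D projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))  # positional embedding

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H', W') low-level feature map from the pre-trained backbone
        x = self.project(feat)                          # (B, D, H', W')
        x = x.flatten(2).transpose(1, 2)                # (B, H'*W', D) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)  # one [class] token per sample
        x = torch.cat([cls, x], dim=1)                  # prepend [class] token
        return x + self.pos_embed                       # Z^(0), fed to the encoder stack

# The encoder stack itself can be a standard Transformer encoder, for example:
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
```

A forward pass would then be `Z_L = encoder(LowLevelFeatureEmbedding()(feat))`, with `Z_L[:, 0]` used by the classification head and the remaining tokens by the severity map head.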
The encoder layers used in our model are the same as in the standard Transformer, consisting of repeated blocks of multi-head self-attention (MSA), a multi-layer perceptron (MLP), layer normalization (LN), and residual connections, as shown in Fig. 3(A). Then, the first column z^(L)_0 of Z^(L) represents the Transformer-attended feature vector with respect to the [class] token, which is used for the classification task. The rest of the Transformer output also provides a feature embedding at each block position, taking into account long-range relations between the blocks. Therefore, we conjecture that this information is useful for severity quantification, as severity is determined by both local and global manifestations of the disease. By simply adding a linear classifier to the [class] token as the classification head, we can obtain the diagnosis result y of the input CXR image x (see Fig. 3(A)). For the interpretability of the classification model, we adopted a saliency map visualization method tailored for ViT, suggested by Chefer et al. (2020), which computes relevancy for the Transformer network. Specifically, unlike the traditional gradient propagation methods (Selvaraju et al., 2017; Smilkov et al., 2017; Srinivas and Fleuret, 2019) or attribution propagation methods (Bach et al., 2015; Gu et al., 2018), which rely on heuristic propagation along the attention graph or on the obtained attention maps, the method of Chefer et al. (2020) calculates the local relevance with deep Taylor decomposition, which is then propagated throughout the layers. This relevance propagation method is especially useful for models based on the Transformer architecture, as it overcomes the problems posed by self-attention operations and skip connections. As shown in Fig. 3(B), the reshaped output features, excluding the [class] token, are combined by an additional lightweight network to produce the COVID-19 severity map. Specifically, as shown in Figs. 3(B) and 4, we first extract the Transformer output Z̃^(L), i.e. Z^(L) without the [class] token position, which is used as the input to the map head network N. Then, the network output S = N(Z̃^(L)) ∈ R^(512×512) is multiplied pixel-wise with the segmentation mask M ∈ R^(512×512), generating the severity map S ⊗ M, where ⊗ denotes the Hadamard product. Finally, ROI max-pooling (RMP) is applied to provide the severity array Y_sev ∈ R^(3×2): Y_sev = RMP(S ⊗ M). In detail, the lungs were divided into a total of six subdivisions by dividing the right and left lungs into three zones (upper, middle, lower) with the 5/12 and 2/3 lines. Next, the largest value within each of the six subdivisions was assigned as the predicted value of the corresponding entry of the severity array. Then, the map head network is trained by minimizing the error of the estimated severity array with respect to the weakly annotated severity label, as in Fig. 4. For details of the model output and the post-processing of the severity array, refer to Appendix B. To generate the lung segmentation mask, we used the method introduced by Oh and Ye (2021). In contrast to existing approaches, which are prone to under-segmentation for severely infected lungs with large consolidations, this approach enables the accurate segmentation of abnormal as well as normal lung areas by learning common features using a single generator with AdaIN layers. Since a single generator is used for all these tasks by simply changing the AdaIN codes, the generator can synergistically learn the common features to improve segmentation performance for abnormal CXR data.
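A minimal sketch of the ROI max-pooling step described above is given below. The zone boundaries here are a simplification, assumed for illustration: rows are cut at 5/12 and 2/3 of the image height and columns at the midline, whereas in practice the six subdivisions follow the segmented lung fields.

```python
import torch

def roi_max_pool_severity(sev_map: torch.Tensor, lung_mask: torch.Tensor) -> torch.Tensor:
    """Turn a pixel-wise severity map into a 3x2 severity array by ROI max-pooling.

    sev_map, lung_mask: (H, W) tensors with values in [0, 1].
    """
    masked = sev_map * lung_mask                          # keep scores inside the lungs only
    h, w = masked.shape
    row_cuts = [0, int(h * 5 / 12), int(h * 2 / 3), h]    # upper / middle / lower zones
    col_cuts = [0, w // 2, w]                             # right lung (image left) / left lung

    sev_array = torch.zeros(3, 2)
    for i in range(3):
        for j in range(2):
            zone = masked[row_cuts[i]:row_cuts[i + 1], col_cuts[j]:col_cuts[j + 1]]
            sev_array[i, j] = zone.max()                  # RMP: largest value in each subdivision
    return sev_array                                      # global severity ~ sev_array.round().sum()
```

During training, each entry of this 3 × 2 array is compared against the corresponding 0/1 label of the weak array-based annotation.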
Since the classification and severity quantification models share all layers other than the task-specific heads, we trained and evaluated the model with both MTL and single-task learning (STL) for both tasks. With the MTL framework, we aimed not only to offer a simpler configuration for better applicability, but also to improve the performance of the two related tasks, COVID-19 classification and severity quantification, by learning a more robust shared feature representation, as suggested in previous studies (Zhang and Yang, 2017). The datasets used for this study can be divided into three groups: the dataset for pre-training the backbone, the datasets for classification, and the datasets for severity quantification. For the pre-training of the backbone network to extract the low-level CXR features, we used the CheXpert dataset containing 10 labeled CXR findings: no finding, cardiomegaly, opacity, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, and support device. Of a total of 224,316 CXR images from 65,240 subjects, the 32,387 lateral view images were excluded, leaving 29,420 posterior-anterior (PA) and 161,427 anterior-posterior (AP) view images available. With this large number of CXRs, the backbone network could be trained to be robust to variation between subjects, which is one of the key strengths of our model. For the classification task, institutional CXR data from four Korean hospitals (CNUH, YNU, KNUH, and AMC) were labeled by board-certified radiologists for this study. Finally, the integrated dataset was divided into three label classes, namely normal, other infections (e.g. bacterial infection, tuberculosis), and COVID-19 infection, considering the application in the real clinical setting. Both PA and AP view CXRs were utilized to build and evaluate our model in a view-agnostic setting. We used three institutional datasets (CNUH, YNU, KNUH) as external test datasets to evaluate the generalization capability on data collected from independent hospitals with different devices and settings, and the remaining data for training and internal validation of the models. Table 2 summarizes the dataset resources and global severity levels. As for diagnosis, the PA and AP view data were integrated and used without division for the severity quantification task, since follow-up images may be obtained in both PA and AP views even for a single patient. Two board-certified radiologists labeled the severity for the three institutional datasets (CNUH, YNU, KNUH) using the array-based severity labeling method of Toussie et al. (2020), as in Fig. 6. We also utilized publicly available data, the Brixia dataset, after translating its severity scores into the same format as those of the institutional datasets. We alternately used one institutional dataset as an external test set and trained the models with the two remaining datasets together with the Brixia dataset, to evaluate the generalization capability in various external settings. In addition, 12 COVID-19 cases from the BIMCV dataset were used to compare the severity maps generated by our model with those annotated by clinical experts. Details of the patient and CXR image characteristics of the four hospital datasets (CNUH, YNU, KNUH, AMC) are provided in Appendix C. The CXR images were preprocessed via histogram equalization, Gaussian blurring with a 3 × 3 kernel, and normalization, and were finally resized to 512 × 512. Our backbone network, a modified version of the network proposed by Ye et al. (2020), comprises a DenseNet-121 baseline followed by PCAM operations.
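The preprocessing pipeline described above can be sketched as follows with OpenCV; the interpolation mode and the [0, 1] normalization range are assumptions for illustration.

```python
import cv2
import numpy as np

def preprocess_cxr(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Histogram equalization, 3x3 Gaussian blur, normalization, and resize to 512x512.

    img: single-channel CXR as a uint8 array (equalizeHist expects 8-bit input).
    """
    img = cv2.equalizeHist(img)                        # histogram equalization
    img = cv2.GaussianBlur(img, (3, 3), 0)             # Gaussian blurring with a 3x3 kernel
    img = img.astype(np.float32) / 255.0               # normalization to [0, 1] (assumed range)
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)
    return img
```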
Among the several layers of intermediate feature maps, we used the feature map of size 16 × 16 × 1024 just before the PCAM operation. For the subsequent Transformer architecture, we used a standard Transformer model with 12 layers and 12 heads per layer. For pre-training of the backbone network, the Adam optimizer with a learning rate of 0.0001 was used. We trained the backbone network for 160,000 optimization steps with a step decay scheduler and a batch size of 8. Data augmentations including random flipping, rotation, and translation were performed to increase the variability of the training data during pre-training. For the classification task, the stochastic gradient descent (SGD) optimizer with momentum 0.9 and a learning rate of 0.001 was used. A maximum gradient norm of 1 was applied to stabilize training. We trained the model for 10,000 optimization steps with a cosine warm-up scheduler (warm-up steps = 500) and a batch size of 16. For the severity quantification task, a map head with five upsizing convolution layers was used, with the last block followed by a sigmoid non-linearity that squashes the output into the [0, 1] range. Training of the severity quantification model was done with the SGD optimizer at a constant learning rate of 0.003 for 12,000 optimization steps with a batch size of 4. These optimal hyperparameters were determined experimentally. As in pre-training, various data augmentations (horizontal flipping, rotation, translation, and scaling) were performed to increase the training data for both tasks. As the loss functions, binary cross-entropy (BCE) losses were used for each class label for the pre-training and classification tasks, while BCE losses for each entry of the location array within a CXR were used for the severity quantification task. In the MTL setting, the shared layers were trained with the same optimizer, scheduler, and hyperparameters as those of the classification task. Considering the scales of the losses from each task, the losses from the task-specific heads were scaled 1:5 for classification and severity quantification to balance their influence on the shared network layers. Since our model was trained using both PA and AP CXRs, the classification and severity quantification performances were evaluated in a view-agnostic manner with both PA and AP images. However, we also evaluated and report the model performance on PA and AP images separately for the classification task, in which the diagnostic performance can differ significantly according to the CXR view. We used the area under the receiver operating characteristic curve (AUC) as the evaluation metric for the diagnostic performance of the classification model, and also calculated sensitivity, specificity, and accuracy after adjusting the thresholds to meet a sensitivity of ≥ 80%, if possible. As evaluation metrics for severity quantification, we used the mean squared error (MSE) as the main metric, while the mean absolute error (MAE), correlation coefficient (CC), and R^2 score were also measured and compared. The performance metrics are reported with estimated 95% confidence intervals (CIs). Model performances were compared statistically using the DeLong test (DeLong et al., 1988) on AUC for the classification task and a paired t-test on MSE for the severity prediction task. Statistically significant differences were defined as p < 0.05.
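The multi-task loss weighting and gradient clipping described above can be sketched as follows; `model` is assumed to return the post-sigmoid classification probabilities and the 3 × 2 severity array from its two heads, and all names are illustrative.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # predictions are assumed to be post-sigmoid probabilities

def multitask_loss(cls_pred, cls_label, sev_pred, sev_label, sev_weight: float = 5.0):
    """Combined loss for the shared network: per-class BCE for classification plus
    per-zone BCE for the severity array, scaled 1:5 as described in the text."""
    loss_cls = bce(cls_pred, cls_label)        # classification head
    loss_sev = bce(sev_pred, sev_label)        # severity head (3x2 location array)
    return loss_cls + sev_weight * loss_sev

def train_step(model, optimizer, batch):
    """One MTL optimization step with the max-gradient-norm clipping of 1."""
    cls_pred, sev_pred = model(batch["image"])
    loss = multitask_loss(cls_pred, batch["cls_label"], sev_pred, batch["sev_label"])
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# Classification-task optimizer settings from the text: SGD, momentum 0.9, lr 0.001.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```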
All experiments, including preprocessing, development, and evaluation of the model, were performed using Python 3.7 and the PyTorch library version 1.7 on NVIDIA Tesla V100, Quadro RTX 6000, RTX 3090, and RTX 2080 Ti GPUs. We first evaluated whether the model trained with the MTL approach provides better performance than two task-specific models trained with the standard STL approach. As shown in Tables 3 and 4, the multi-task model outperformed the expert models trained exclusively for each task with statistical significance, for both the classification and severity prediction tasks. Hence, the following experiments were mainly conducted under the MTL setting, and the other models used for comparison were also implemented with the MTL approach for a fair comparison. The detailed diagnostic performances of the proposed model are provided in Table 5. Averaged over the 3 label classes (normal, other infection, COVID-19), our model showed stable performances regardless of the external data, with mean AUCs of 0.949, 0.931, 0.907, sensitivities of 90.2%, 87.0%, 85.1%, specificities of 84.9%, 86.2%, 83.7%, and accuracies of 86.8%, 86.5%, 84.1% in the three external institutions, which confirms the stability in performance even in a view-agnostic setting and the outstanding generalization capability in clinical situations with different devices and settings. The diagnostic performances of our model evaluated separately on PA and AP view images are also provided in Appendix D. Fig. 5 exemplifies the visualization of saliency maps for each disease class in the external test datasets. As shown in the examples, our model well localized a focal area infected either by a bacterial infection (Fig. 5(a)) or tuberculosis (Fig. 5(b)), while it was also able to delineate the multi-focal lesions in the periphery of both lower lungs in Fig. 5(c), which are typical findings of COVID-19 pneumonia. The results of severity quantification of our model are shown in Table 6. Our model showed MSEs of 1.441, 1.435, 1.458, MAEs of 0.843, 0.943, 0.890, correlation coefficients of 0.800, 0.830, 0.731, and R^2 scores of 0.634, 0.633, 0.485 in the three external institutions. The Brixia dataset contains a consensus subset of 150 CXR images labeled by five independent radiologists. Within this subset, the average MSE between the consensus severity score calculated from majority voting and each radiologist's rating is 1.683. Therefore, the MSEs of 1.441, 1.435, and 1.458 in the three external institutions show that our model performs comparably to or better than experienced radiologists and generalizes well in the clinical environment. Fig. 6 illustrates examples of severity quantification, including the predicted scores, arrays, maps, and lesion contours in one of the external test datasets, which confirms that not only can our model correctly predict the global severity, but it also generates an intuitive severity map that highlights the affected area, which can also be used to contour lesions. Finally, Fig. 7 exemplifies the comparison between the ground truth segmentation label of the involved area and the model's prediction of involvement in the BIMCV dataset. As shown in the figure, the model generally localized the areas of involvement well. To compare the performance with other baseline and SOTA CNN-based models, we adopted the following models: ResNet-50, ResNet-152, and DenseNet-121 as the baseline CNN-based models, and EfficientNet-B7, NASNet-A-Large, and SENet-154 as the SOTA CNN-based models.
For comparison with other Transformer-based models, we used ViT (ViT-B-16) and hybrid ViT (R50-ViT-B-16) models. All models underwent the same pre-training process on the CheXpert dataset and were subsequently trained and evaluated with the same datasets and settings as the proposed model for a fair comparison. As shown in Tables 7 and 8, our model outperformed, or was at least comparable to, both the baseline and the SOTA CNN-based models, with statistical significance. When compared to the Transformer-based models, our model showed statistically better performance. Note that our model showed superior performance not only to the models with less complexity (e.g. ResNet-50, DenseNet-121) but also to those with more complex architectures (e.g. NASNet-A-Large, SENet-154, ViT models). These results suggest that our model offers better generalization performance in both the classification and severity quantification tasks compared with the existing model architectures, and that this does not result from increased complexity. We also compared our model with the tailored models in the related works of Section 2.4. The tailored models for comparison were implemented and trained using the settings proposed in the original papers (e.g. pre-training, hyperparameters, etc.) on the same dataset as the proposed model for a fair comparison. As shown in Tables 9 and 10, our model considerably outperformed the previous models proposed in the related works for both COVID-19 classification and severity quantification. Although a few models showed reasonable performances on some test datasets (e.g. DarkCOVIDNet on the YNU dataset and CheXNet on the CNUH dataset), they failed to show stable performances across the various external test datasets. The unstable performance of previous COVID-19 models across various external test settings explains why deep learning models readily developed for automated diagnosis and severity prediction of COVID-19 have not led to widespread application. For the diagnostic model, the results should be interpreted with caution, since the actual prevalence of the disease is not the same as in the experimental dataset collected for this study. That is to say, in our case, the prevalence of 26.9% for COVID-19 in the external test set is much higher than the real-world prevalence of COVID-19 in any country. Therefore, we evaluated the performance metrics under a range of COVID-19 disease prevalences in the external test datasets using bootstrapping with replacement. As shown in Fig. 8, the proportion of predicted negatives and the negative predictive value (NPV) for COVID-19 increase drastically as the COVID-19 prevalence decreases to real-world reported ranges (Yiannoutsos et al., 2021), from an NPV of 93.7% to 99.1% and a negatively predicted proportion of 63.9% to 78.6%. Thus, this simulation suggests that under the real-world prevalence of COVID-19, about 80% of RT-PCR tests could be spared by applying the proposed model as a screening tool with an NPV over 99%.
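The prevalence simulation can be sketched as follows. The exact resampling scheme of the paper is not detailed here, so this is only an illustrative assumption: positives and negatives are resampled with replacement so that their ratio matches a target prevalence, and the NPV is recomputed on each bootstrap sample; the analytic Bayes-rule relation is included for reference.

```python
import numpy as np

def npv_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value at a given disease prevalence (Bayes' rule)."""
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tn / (tn + fn)

def bootstrap_npv(y_true, y_pred, prevalence: float, n_boot: int = 1000, seed: int = 0) -> float:
    """Estimate NPV at a target prevalence by resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = np.where(y_true == 1)[0], np.where(y_true == 0)[0]
    n = len(y_true)
    n_pos = int(round(n * prevalence))          # number of positives at the target prevalence
    npvs = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos, n_pos, replace=True),
                              rng.choice(neg, n - n_pos, replace=True)])
        t, p = y_true[idx], y_pred[idx]
        pred_neg = p == 0
        if pred_neg.sum() == 0:                 # skip degenerate resamples
            continue
        npvs.append((t[pred_neg] == 0).mean())  # fraction of predicted negatives that are true negatives
    return float(np.mean(npvs))
```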
To better understand the model's mispredictions, we examined failure cases of the proposed model for both the classification and severity quantification tasks. As shown in Fig. 9, although our model failed to offer the correct predictions for these cases, its confusion could be explained with cogent interpretations, and it attends to the lesion of interest in many cases. Similarly, for severity quantification, it provided a severity array close to the label annotation even in case of a wrong prediction, as in Fig. 10, where the model confused a faint opacity in the right middle lobe with COVID-19 involvement, yielding an overall score higher than the label; nevertheless, its prediction came close to the label annotation. In addition, for comparison, we present cases in which the previous classification models failed while the proposed model offered the correct predictions in Appendix E. To better understand the contribution of the individual components of our model, we conducted a series of ablation studies, as provided in Tables 11 and 12. More details are as follows. Pre-training the backbone on a pre-built large CXR dataset (the CheXpert dataset) to extract low-level features is one of the key ideas of our method. Therefore, we conducted experiments to compare the performance of the proposed model with and without CheXpert pre-trained weights, both on the internal validation dataset and on the external test datasets. As shown in Tables 13 and 14, the performance increases with pre-training were prominent in the external test datasets, whereas in the internal validation dataset the improvement was not prominent and even better performance without the pre-trained backbone was observed. Taken together, these results demonstrate that the model without CheXpert pre-trained weights is more prone to overfitting, supporting our argument that pre-training the backbone on large-scale CXR data is a crucial component of the model in terms of generalization capability. To support our claim that the PCAM operation enables the backbone network to embed better feature representations for subsequent tasks, we conducted an ablation study with and without the PCAM operation. The experimental results in Tables 11 and 12 show that the model with the PCAM operation performs better for both the classification and severity prediction tasks, although the benefit was more prominent for the classification task. These findings are consistent with the intuition that the PCAM operation would be more useful in classification tasks, where robust representations of the various low-level features are more directly related to the final diagnosis of a given CXR. Another key idea of our approach is that the Transformer is capable of properly combining the extracted low-level features to yield high-level outputs. To validate this argument, an ablation study without the Transformer was conducted, which amounts to training and evaluating the CNN backbone (DenseNet-121 equipped with PCAM) without a Transformer body, trained in a multi-task manner for the classification and severity quantification tasks.
As provided in Tables 11 and 12, the performance deteriorated significantly without the Transformer architecture, for both the classification and severity prediction tasks, showing that the Transformer plays a key role in our method. Since a recent study suggested that the ViT model works decently without positional embeddings (Chen et al., 2021), we performed an ablation study with and without the positional embeddings. As shown in Tables 11 and 12, the model without positional embeddings showed no statistical difference for the severity prediction task, but provided slightly lower performance for the classification task on some datasets. This is consistent with the intuition that positional information is meaningful for the diagnosis of disease (e.g. tuberculosis often involves the lung apices, whereas COVID-19 more often presents in the lower periphery), but may not be important for a severity score summed over all lung areas, which can be considered permutation-invariant. Increasing concern about the overestimation of deep learning models for COVID-19 has brought the real-world applicability of such models into question. As pointed out in the recent literature (Wynants et al., 2020), although hundreds of deep learning models for automated diagnosis of COVID-19 have been suggested so far, most of them do not work well in real-world application. Most of them were sensitive to specific image acquisition settings and overfit to unimportant image findings (Roberts et al., 2021), and therefore showed unpardonable performance deterioration in different settings. Similarly, in this study, we observed that previously suggested models for both COVID-19 classification and severity quantification showed unsatisfactory generalization performance on various external data. Our model, on the other hand, showed stable performance on various external test datasets with different settings, and even regardless of PA and AP view (see Appendix D and Appendix F). This finding is important since it will broaden the actual applicability of the developed model in the clinical setting. In the current pandemic situation, our method holds great promise as a screening tool. As shown in the simulation of real-world COVID-19 prevalence (see Fig. 8), it could reliably deprioritize the population with a low risk of infection using readily obtainable CXRs. With an NPV over 99%, the model could spare up to 80% of the tested population from the molecular test, thereby prioritizing the limited medical resources for subjects more likely to have COVID-19. In this respect, the application of our model would be of great value in resource-constrained areas. Used alongside the molecular test, it could also help isolate suspected subjects waiting for RT-PCR results, as it has been reported that positive radiological findings precede positive RT-PCR results in a substantial portion (308 out of 1,014) of patients. In addition, since our model also provides the estimated severity of COVID-19 infection, it can be used to guide treatment decisions or to evaluate treatment response by predicting the severity of consecutive CXRs. In summary, we developed a novel ViT model that leverages a low-level CXR feature corpus for the diagnosis and severity quantification of COVID-19.
The novelty of this work lies in decoupling the overall framework into two steps: the first pre-trains the backbone network to classify low-level CXR findings with a prebuilt large-scale dataset to embed an optimal feature corpus, which is then leveraged in the second step by the Transformer for high-level diagnosis of diseases including COVID-19. By maximally utilizing the benefit of the large-scale dataset containing more than 220,000 CXR images, the overfitting problem of neural networks trained with limited numbers of COVID-19 cases can be substantially alleviated. In addition, we also adapted the proposed method to the severity quantification problem, demonstrating performance similar to that of clinical experts, thereby expanding its application in the clinical setting. Rather than devising a separate model for each task, we made the novel ViT model a multi-task model that can be used for both classification and severity prediction, offering a simpler configuration and better performance for the individual tasks. We performed extensive experiments on data from various external institutions to demonstrate the superior generalization performance of the proposed model over the existing models for COVID-19 as well as over other CNN and Transformer-based architectures, which is the sine qua non of widespread adoption of the system. Finally, we believe that the novel concept of making higher-level diagnoses by aggregating a low-level feature corpus, which is readily available from pre-built datasets, can be applied to quickly develop a robust algorithm against a newly emerging pathogen, since such a pathogen is expected to share common low-level CXR features with existing diseases. This work was supported in part by the National Research Foundation. First, the feature map from the backbone network is transformed into a probability map using a 1 × 1 convolution and a sigmoid layer. This probability map is then normalized and multiplied pixel-wise with the feature map to generate a weighted feature map. Finally, the weighted feature map is reduced with global average pooling and passed to the final classifier to provide the prediction probability. Using the feature map from the Transformer, a map head with five upsizing convolution layers followed by a sigmoid layer generates an output in the range [0, 1]. This is subsequently multiplied by the lung mask to provide a severity map conforming to the shape of the lungs, as shown in Fig. B1. The details of the patient and CXR characteristics of the four hospital datasets deliberately collected for this study are provided in Table C1. Tables D.2 and D.3 show the classification results evaluated exclusively on PA and AP view CXRs, respectively. For both PA and AP view images, our model provided stable performances, although the diagnostic performance with AP view images was slightly lower than with PA view images. Nonetheless, it still showed good performance (AUC ≥ 0.800) in the external test datasets, considering that diagnosing infectious disease using only AP view images is not standard and usually deteriorates diagnostic performance. We additionally analyzed the failure cases of the previous models, using our model for comparison. The previous models were visualized with the methods proposed in the original papers. Note that the visualizations for COVIDNet and CoroNet could not be implemented, since the original papers did not provide the details of the model visualizations.
As shown in Fig. E1, our model successfully predicted the correct label and localized the lesion in the failure cases of the previous models, for both COVID-19 and other infections. Similarly, for severity quantification, our model predicted the ground truth severity annotation more correctly than the previous models, as in Fig. E2. In addition, the severity map generated by our model predicted the locations of COVID-19 involvement with high agreement. We further evaluated the generalization performance of our model on other publicly available datasets. For classification, we used the Actualmed COVID-19 CXR Dataset (DarwinAI et al.) containing 155 PA and 30 AP CXRs. This dataset contains 58 COVID-19 cases and 127 non-COVID-19 cases. Note that the classification metrics could only be calculated for COVID-19, since the dataset contains only COVID-19 and non-COVID-19 labels. For severity quantification, the COVID-19 Image Data Collection (Cohen et al., 2020c) was used, which contains 163 images annotated with the Brixia severity score. These scores were converted in accordance with our severity scoring method. After conversion, the dataset contains 14 (8.6%), 16 (9.8%), 16 (9.8%), 19 (11.7%), 19 (11.7%), and 79 (48.5%) cases of severity scores 1, 2, 3, 4, 5, and 6, respectively. As shown in Tables F1 and F2, our model provided good performance in both the COVID-19 classification and severity quantification tasks on these datasets from other sources.
References:
Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection
COVID-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression
Transformer interpretability beyond attention visualization
More data can expand the generalization gap between adversarially robust and standard models
Generative pretraining from pixels
An empirical study of training self-supervised vision transformers
Xception: Deep learning with depthwise separable convolutions
Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
COVID-19 image data collection: Prospective predictions are the future
COVID-19 image data collection: Prospective predictions are the future
Chest X-ray in new coronavirus disease 2019 (COVID-19) infection: findings and correlation with clinical outcome
BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach
BERT: Pre-training of deep bidirectional transformers for language understanding
An image is worth 16x16 words: Transformers for image recognition at scale
Understanding individual decisions of CNNs via contrastive backpropagation
Covidx-net: A framework of deep learning classifiers to diagnose COVID-19 in X-ray images
The challenges of deploying artificial intelligence models in a rapidly evolving pandemic
CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images
Transformers in vision: A survey
Combining initial radiographs and clinical variables improves deep learning prognostication in patients with COVID-19 from the emergency department
Automated assessment and tracking of COVID-19 pulmonary disease severity on chest radiographs using convolutional Siamese neural networks
Deep-COVID: Predicting COVID-19 from chest X-ray images using deep transfer learning
Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
COVID-19 and the risk to health care workers: a case report
Deep learning COVID-19 features on CXR using limited training data sets
Unifying domain adaptation and self-supervised learning for CXR segmentation via AdaIN-based knowledge distillation
Automated detection of COVID-19 cases using deep neural networks with X-ray images
YOLO9000: better, faster, stronger
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Grad-CAM: Visual explanations from deep networks via gradient-based localization
Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays
End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays
SmoothGrad: removing noise by adding noise
Full-gradient representation for neural network visualization
Real-time RT-PCR in COVID-19 detection: issues affecting the results
Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19
Attention is all you need
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS
COVIDNet-S: Towards computer-aided severity assessment via training and validation of deep neural networks for geographic extent and opacity extent scoring of chest X-rays for SARS-CoV-2 lung disease severity
Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal
Weakly supervised lesion localization with probabilistic-CAM pooling
Bayesian estimation of SARS-CoV-2 prevalence in Indiana by random testing
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study
COVID-19 screening on chest X-ray images using deep learning based anomaly detection
A survey on multi-task learning
Deep learning-based detection for COVID-19 from chest CT using weak label
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
Deep transfer learning artificial intelligence accurately stages COVID-19 lung disease severity on portable chest radiographs
Deformable DETR: Deformable transformers for end-to-end object detection
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.