key: cord-0538475-91lu3b9u authors: Kim, Gwanghyun; Park, Sangjoon; Oh, Yujin; Seo, Joon Beom; Lee, Sang Min; Kim, Jin Hwan; Moon, Sungjun; Lim, Jae-Kwang; Ye, Jong Chul title: Severity Quantification and Lesion Localization of COVID-19 on CXR using Vision Transformer date: 2021-03-12 journal: nan DOI: nan sha: ecda3471f8bb82d73c80660b635308c395d82e5f doc_id: 538475 cord_uid: 91lu3b9u

Under the global COVID-19 pandemic, building an automated framework that quantifies the severity of COVID-19 and localizes the relevant lesions on chest X-ray images has become increasingly important. Although pixel-level lesion labels such as lesion segmentations are the ideal target for building a robust model, collecting enough data with such labels is difficult because the annotation is time- and labor-intensive. Instead, array-based severity labeling, which assigns integer scores to six subdivisions of the lungs, is an alternative that enables quick labeling. Several groups have proposed deep learning algorithms that quantify the severity of COVID-19 using these array-based labels and localize the lesions with explainability maps. To further improve accuracy and interpretability, we propose a novel Vision Transformer tailored for both severity quantification and clinically applicable localization of COVID-19-related lesions. Our model is trained in a weakly-supervised manner to generate full probability maps from weak array-based labels. Furthermore, a novel progressive self-training method enables us to build a model with a small labeled dataset. Quantitative and qualitative analysis on external test sets demonstrates that our method achieves performance comparable to that of radiologists on both tasks, with the stability required for real-world application.

The ongoing coronavirus disease 2019 (COVID-19) pandemic has resulted in 115,028,175 confirmed cases and 2,551,329 deaths worldwide as of March 2, 2021 [13]. As pneumonia is commonly present in COVID-19 patients, radiological examinations are often used for the diagnosis of COVID-19 [17]. In particular, chest X-ray (CXR) offers advantages over chest computed tomography (CT) in terms of short scan time, low cost, and low radiation dose [11]. Therefore, CXR has great potential for analyzing a patient's condition, for example through severity quantification or lesion localization. In particular, a deep learning-based algorithm that quantifies severity and localizes COVID-19 lesions on CXR images may assist radiologists during the global pandemic. Although pixel-level segmentation labels carry the most information toward this goal, it is hard to collect a large dataset with them because of the time-consuming annotation. To mitigate this issue, simple array-based severity labeling methods have been introduced, in which integer-valued severity scores are assigned to six or eight subdivisions of the CXR image [1, 20]. With these array labels, several algorithms [2, 19] quantify the severity of COVID-19 and generate explainability maps using convolutional neural networks and visualization methods such as Grad-CAM [16] and LIME [15]. However, the values on such explainability maps, usually based on normalized activations [16], are not directly related to the true probability of lesion presence. Accordingly, the saliency maps are rarely compared with true lesion annotations from radiologists.
To provide clinically meaningful severity quantification and localization of COVID-19 lesions, we propose a novel Vision Transformer (ViT) trained in a weakly-supervised manner using severity array labels. Recently, the Vision Transformer (ViT) [7] was shown to attain state-of-the-art (SOTA) performance on image classification tasks by learning long-range dependencies among pixels with a self-attention mechanism [22]. Training a vanilla ViT requires a vast dataset to learn inductive biases, so the authors of [7] suggest a hybrid ViT that uses a convolutional neural network (CNN) as a feature embedding network for smaller datasets. Extending this idea, our Vision Transformer is trained on a low-level CXR feature corpus generated by a feature extraction network pretrained on a large CXR dataset. Additionally, we use an ROI max-pooling layer that bridges pixel-level predictions and the severity array labels in a weakly-supervised manner [14]. One important advantage of our ViT scheme for severity quantification and lesion localization is that the global attention of the Transformer yields full lesion maps in which each pixel value directly represents the probability of a COVID-19 abnormality. Moreover, our novel progressive self-training, inspired by [23], allows a large unlabeled dataset to be utilized in addition to the small severity-labeled dataset. Through quantitative and qualitative evaluation on external test data, we validate the model's performance and its generalization capability to data from different institutions.

Our model is trained on frontal CXR images with severity score arrays annotated following the method in [20]. Specifically, each lung is first subdivided into three areas in the vertical direction. The lower area extends from the intercostal groove to the lower hilar mark, the middle area extends from the lower hilar mark to the upper hilar mark, and the upper area extends from the upper hilar mark to the lung apex. Then, each area is divided into two regions along the horizontal direction across the spine. A binary score of 0/1 is assigned to each region according to the absence/presence of opacity [20]. Accordingly, the completed label takes the form of a 3×2 array, and the global severity score, defined as the sum of all elements, ranges from 0 to 6.

The proposed model's overall architecture is illustrated in Fig. 1(a). First, an input CXR image is preprocessed and passed to the lung segmentation network. The segmented lung image is fed into the feature embedding network, followed by the Vision Transformer. The final features from the Vision Transformer are provided to the map head, which generates the full COVID-19 probability map. By ROI max pooling, the 3×2 COVID-19 severity array is estimated as the final output of our model.

Hybrid Vision Transformer Backbone. As the feature embedding network that extracts the low-level CXR feature corpus, we employ the network of [24], which won first place in the CheXpert [10] challenge; it applies probabilistic class activation map (PCAM) pooling to the output of a DenseNet-121-based feature extractor to enhance both localization and classification performance. The feature extractor is pretrained on the large public CXR dataset [10] to classify 10 radiological findings: pneumonia, consolidation, lung opacity, pleural effusion, cardiomegaly, edema, atelectasis, pneumothorax, support devices, and no finding.
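To make the hybrid embedding concrete, the following is a minimal PyTorch sketch of the idea rather than the authors' released code: torchvision's DenseNet-121 stands in for the CheXpert-pretrained extractor of [24], nn.TransformerEncoder stands in for ViT-B/16, and all class and variable names are illustrative assumptions. The truncation point and token handling follow the description in this paper; the weights here are untrained.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121


class HybridViTBackbone(nn.Module):
    """Low-level CXR feature corpus -> transformer tokens (illustrative sketch)."""

    def __init__(self, embed_dim: int = 768, depth: int = 12, num_heads: int = 12):
        super().__init__()
        feats = densenet121(weights=None).features  # in practice, CheXpert-pretrained weights
        # Truncate the CNN so it emits the 1024-channel feature map of denseblock3
        # (16x16 for a 256x256 input), mirroring the "before transition layer 3" choice.
        self.cnn = nn.Sequential(
            feats.conv0, feats.norm0, feats.relu0, feats.pool0,
            feats.denseblock1, feats.transition1,
            feats.denseblock2, feats.transition2,
            feats.denseblock3,
        )
        self.proj = nn.Conv2d(1024, embed_dim, kernel_size=1)        # c -> c_p via 1x1 conv
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, 16 * 16 + 1, embed_dim))
        # Stand-in for ViT-B/16: 12 pre-norm transformer encoder blocks.
        block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.cnn(x)                                    # (B, 1024, 16, 16) feature corpus
        c_p = self.proj(c).flatten(2).transpose(1, 2)      # (B, 256, D), one token per location
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, c_p], dim=1) + self.pos_embed  # prepend [class], add E_pos
        z = self.encoder(z)
        return z[:, 1:]                                    # keep per-location outputs only


tokens = HybridViTBackbone()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # -> torch.Size([1, 256, 768])
```

Freezing the CNN part after loading pretrained CheXpert weights would correspond to the frozen feature extractor used during ViT training in this paper.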
Specifically, we use the 16×16×1024 features before transition layer 3 of DenseNet-121. Formally, the segmented lung image x ∈ R^{H×W×C} is projected into the feature map c ∈ R^{H'×W'×C'} through the feature embedding network F, and the feature vector at each pixel location is c ∈ R^{C'}. For the Vision Transformer, we adopt the ViT-B/16 architecture of [7], whose input here is the 16×16 grid of 1024-dimensional feature vectors rather than raw image patches. We first embed the projected features c ∈ R^{C'} into c_p ∈ R^D using a 1×1 convolution kernel. A learnable vector c_cls, which plays the role of the [class] token of BERT [6], is included for training our ViT; however, we utilize the final-layer outputs of ViT-B/16 except for the output at the [class] token position. Also, a positional embedding E_pos is added so as not to lose the positional information of the feature map. Multi-head self-attention (MSA), multi-layer perceptron (MLP), layer normalization (LN), and residual connections in each block, which are essential parts of the standard Transformer [6], are used in our model as well. Formally, this procedure can be written as

z_0 = [c_cls; c_p,1; c_p,2; ...; c_p,N] + E_pos,
z'_ℓ = MSA(LN(z_{ℓ-1})) + z_{ℓ-1}, ℓ = 1, ..., L,
z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ, ℓ = 1, ..., L,

where N = H'·W' is the number of feature locations and L denotes the number of layers in the ViT (i.e., L = 12 for ViT-B/16).

Probability Map Generation and ROI Max Pooling. The map head, which takes the output of the ViT, is composed of four upsizing convolutional blocks and generates a map of the same size as the input; its detailed architecture is illustrated in Fig. 1(b). By multiplying the output of the map head with the lung mask m ∈ R^{H×W}, the COVID-19 lesion probability map y ∈ R^{H×W} is generated. ROI max pooling (RMP) is used to convert the COVID-19 lesion map into the severity array a ∈ R^{3×2}, as depicted in Fig. 1(a). Specifically, the lungs are separated into the right and left lung by computing the connected components of the lung mask. Next, the lines that split each lung into three zones are estimated at the 5/12 and 2/3 positions of the vertical line between the highest and lowest points of the lung mask. Then, the maximum value within each subdivision is assigned to the corresponding element of the 3×2 array. To optimize the model, a binary cross-entropy loss is computed between the predicted severity array and the label severity array. The line estimation and max-pooling steps are the key to the weakly-supervised learning scheme, whose validity we evaluate through both quantitative and qualitative external tests in Section 3.

Under the pandemic situation, it is often difficult to collect enough severity labels even when the labeling method is simple. Motivated by [23], we employ progressive self-training, which utilizes a larger severity-unlabeled dataset in addition to the small severity-labeled dataset to improve model performance. The detailed procedure of the self-training method is shown in Fig. 2. First, a teacher network is trained on the labeled dataset. In the second step, a new student network, a copy of the teacher, is trained on the previous dataset combined with new data from a subset of the unlabeled dataset. The student is optimized using the true labels of the labeled inputs and the pseudo labels generated by the teacher network for the unlabeled inputs. Next, the student becomes the new teacher, and the process is iterated by returning to the second step.

Dataset. For segmentation network training, a total of 247 normal CXR images from the publicly available JSRT dataset [18], together with their segmentation labels from the publicly available SCR dataset [21], are used.
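Before continuing with the datasets, the ROI max-pooling geometry described above can be illustrated with the following rough NumPy/SciPy sketch; it is based on our own reading of the description, is not the authors' implementation, and the assignment of zones to rows of the 3×2 array is an assumption.

```python
import numpy as np
from scipy import ndimage


def roi_max_pool(prob_map: np.ndarray, lung_mask: np.ndarray) -> np.ndarray:
    """Pool an (H, W) probability map into a 3x2 severity array (sketch only)."""
    # Separate the two lungs via connected components of the lung mask.
    labeled, n = ndimage.label(lung_mask > 0)
    assert n >= 2, "expected at least two lung components"
    sizes = ndimage.sum(np.ones_like(labeled), labeled, index=range(1, n + 1))
    keep = np.argsort(sizes)[-2:] + 1                           # two largest components
    cols = [ndimage.center_of_mass(labeled == k)[1] for k in keep]
    lungs = [labeled == k for _, k in sorted(zip(cols, keep))]  # order left-to-right

    severity = np.zeros((3, 2), dtype=float)
    for col, lung in enumerate(lungs):
        rows = np.where(lung.any(axis=1))[0]
        top, bottom = rows.min(), rows.max()
        span = bottom - top
        # Split lines at 5/12 and 2/3 of the vertical span of this lung.
        cuts = [top,
                top + int(round(span * 5 / 12)),
                top + int(round(span * 2 / 3)),
                bottom + 1]
        for zone in range(3):                                   # zone order is an assumption
            region = lung.copy()
            region[:cuts[zone]] = False
            region[cuts[zone + 1]:] = False
            severity[zone, col] = prob_map[region].max() if region.any() else 0.0
    return severity
```

During training, applying the same pooling to the predicted map lets the binary cross-entropy loss on the 3×2 array supervise the full-resolution probability map.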
To provide reliable segmentation masks even for abnormal CXRs, the segmentation network is then fine-tuned in a semi-supervised manner using a total of 680 CXR images of pneumonia and tuberculosis from public archives [3]. For pretraining the feature embedding network that generates the low-level CXR feature corpus, we used 190,847 frontal CXR images from the publicly available CheXpert dataset [10], in which the presence of 14 types of radiological findings is annotated (no finding, enlarged cardiomediastinum, cardiomegaly, lung lesion, lung opacity, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture, support devices). For training the ViT and the map head, the domestic dataset from two independent institutions (Yeungnam University Hospital [YNU], Daegu, Korea; Kyungpook National University Hospital [KNUH], Daegu, Korea) is combined with the publicly available Brixia dataset [19]. Two board-certified radiologists labeled the severity of the domestic dataset by consensus, following the labeling method in [20]. The Brixia labeling [1, 19] differs from the method of [20] in the severity score scale (0-3) for each subdivision and in the anatomic landmarks that determine the split lines. However, the authors of [19] note that, with slight differences, this severity label can be converted to the labeling of [20] that we use by mapping scores in {1, 2, 3} to 1 for each zone, and we followed this procedure. For external testing of the model's quantification and localization performance, CXR images from another independent domestic institution, Chungnam National University Hospital [CNUH], are used for the quantitative evaluation. The severity labels are annotated by the same two radiologists who labeled the CXR images of the domestic training dataset. In the publicly available BIMCV dataset [4], COVID-19 lesion segmentation labels are annotated on 12 frontal images; we use these images for the qualitative analysis of our model. The numbers of images used for severity quantification and COVID-19 lesion localization are summarized in Table 1.

Implementation details. Our preprocessing consists of resizing to 256×256, Gaussian blurring with a 3×3 kernel, histogram equalization, and normalization. For the segmentation model, we use the Adam optimizer [12] with a learning rate of 0.0001 and a batch size of 1 for 20,000 training steps. For the feature extraction network, the Adam optimizer [12] with a learning rate of 0.0001, a batch size of 8, and a binary cross-entropy loss are used for the binary classification of 10 pathologies over 160,000 training steps. For training the ViT and the map head, the weights of the segmentation and feature extraction networks are frozen, and the SGD optimizer with a momentum of 0.9, a learning rate of 0.004, and a batch size of 8 is used for 12,000 steps. The models at 12,000, 11,500, and 11,000 steps are ensembled for testing. We adopt the mean squared error (MSE) as the main metric for regression of the global severity score ranging from 0 to 6, and also calculate the mean absolute error (MAE), correlation coefficient (CC), and R² score for the global score regression, as well as the mean area under the ROC curve (AUC) for the binary classification of each region of the severity array. The proposed networks are implemented with the PyTorch framework and trained on an NVIDIA V100 GPU.

Comparison of our ViT-based model with CNN-based models. To validate the superiority of the ViT-based architecture of our model (D121+ViT-B/16), we compare it against CNN-based models such as ResNet [8], DenseNet-121 [9], and NasNet-Large [25].
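For completeness, a schematic sketch of the progressive self-training loop described in the method section is given below; it assumes the teacher has already been trained on the labeled data, and the function names, data structures, and growth schedule are illustrative rather than the authors' code.

```python
import copy
import torch


def train_on(model, pairs, optimizer, loss_fn):
    """One training pass over (input, target) pairs."""
    model.train()
    for x, y in pairs:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()


def progressive_self_training(teacher, labeled, unlabeled, rounds, make_optimizer, loss_fn):
    """labeled: list of (image, severity_array); unlabeled: list of images.
    The teacher is assumed to be pretrained on the labeled set."""
    chunk = max(1, len(unlabeled) // rounds)
    for r in range(rounds):
        student = copy.deepcopy(teacher)          # student starts as a copy of the teacher
        subset = unlabeled[: (r + 1) * chunk]     # progressively growing unlabeled subset
        teacher.eval()
        with torch.no_grad():                     # pseudo-labels from the frozen teacher
            pseudo = [(x, teacher(x)) for x in subset]
        train_on(student, labeled + pseudo, make_optimizer(student), loss_fn)
        teacher = student                         # the student becomes the new teacher
    return teacher
```

In the setting of this paper, loss_fn would be the binary cross-entropy on the ROI max-pooled 3×2 severity arrays, and the unlabeled pool corresponds to the Brixia images used without their labels.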
All models except ours were trained from ImageNet [5] pretrained weights, and the same training settings as for our proposed model were applied. The quantitative comparison of severity quantification performance on the CNUH external test set is presented in Table 2. Our model outperformed the CNN-based models on most of the metrics, demonstrating superior performance and generalizability. We also perform a qualitative comparison of localization performance on the BIMCV [4] external test set between our ViT-based model and the DenseNet-121 [9] based model. The results are shown in Fig. 3. Our proposed model localizes the abnormal regions on the CXR images more accurately than the DenseNet-121 [9] based model.

Performance of progressive self-training. To evaluate the usefulness of progressive self-training, we compare models trained with four different training configurations depending on how the Brixia dataset is utilized, as described in Table 3. In the default setting (not included), the model is trained only on the domestic training dataset. For the third model, which is the proposed method, the amount of unlabeled Brixia data included is increased progressively after every 2,000 of the total 12,000 steps, as described in Fig. 2. For the last model, after 2,000 steps of training only on the domestic dataset, the whole unlabeled Brixia dataset [19] is included all at once, and another 10,000 steps of training follow. External test results on the CNUH test set are presented in Table 3. The model trained with progressive self-training outperforms all other methods, proving the effectiveness of the approach; the weaker results of the other Brixia settings also suggest a labeling discrepancy between the domestic dataset and the Brixia dataset [19]. The Brixia dataset [19] provides a consensus subset in which five different radiologists annotated 150 CXR images, and the MSE between the majority-voting gold standard and each radiologist's global score is calculated to be 1.683. Therefore, the MSE of 1.296 obtained by our model can be considered reasonable given the inter-radiologist MSE.

In this study, we presented a Vision Transformer tailored to quantifying the severity and localizing COVID-19-related lesions on CXR. Our novel Vision Transformer using a low-level CXR feature corpus enabled us to encode long-range dependencies between pixels, which is crucial for lesion localization. Furthermore, our method enabled the generation of a full COVID-19 lesion map in which pixel values directly represent the probability of abnormality. In addition, progressive self-training allowed us to exploit both the small severity-annotated dataset and the large unlabeled dataset. Our model achieved performance comparable to that of expert radiologists on the external test set, validating its generalizability.
[1] COVID-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression
[2] Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
[3] COVID-19 image data collection: prospective predictions are the future
[4] BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
[5] ImageNet: a large-scale hierarchical image database
[6] BERT: pre-training of deep bidirectional transformers for language understanding
[7] An image is worth 16x16 words: transformers for image recognition at scale
[8] Deep residual learning for image recognition
[9] Densely connected convolutional networks
[10] CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison
[11] Portable chest X-ray in coronavirus disease-19 (COVID-19): a pictorial review
[12] Adam: a method for stochastic optimization
[13] COVID-19 and the risk to health care workers: a case report
[14] Is object localization for free? Weakly-supervised learning with convolutional neural networks
[15] "Why should I trust you?" Explaining the predictions of any classifier
[16] Grad-CAM: visual explanations from deep networks via gradient-based localization
[17] Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
[18] Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules
[19] End-to-end learning for semi-quantitative rating of COVID-19 severity on chest X-rays
[20] Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19
[21] Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database
[22] Attention is all you need
[23] Self-training with noisy student improves ImageNet classification
[24] Weakly supervised lesion localization with probabilistic-CAM pooling
[25] Learning transferable architectures for scalable image recognition