key: cord-0176830-8nclims3
authors: Park, Sangjoon; Kim, Gwanghyun; Oh, Yujin; Seo, Joon Beom; Lee, Sang Min; Kim, Jin Hwan; Moon, Sungjun; Lim, Jae-Kwang; Park, Chang Min; Ye, Jong Chul
title: AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation
date: 2022-02-13
journal: nan
DOI: nan
sha: 620245bc6f9e3027b334609ef3d632555906dd10
doc_id: 176830
cord_uid: 8nclims3

Although deep learning-based computer-aided diagnosis systems have recently achieved expert-level performance, developing a robust deep learning model requires large, high-quality data with manual annotation, which is expensive to obtain. As a result, the chest X-rays collected annually in hospitals often cannot be used, owing to the lack of manual labeling by experts, especially in deprived areas. To address this, here we present a novel deep learning framework that uses knowledge distillation through self-supervised learning and self-training, showing that the performance of a model trained with a small number of labels can be gradually improved with more unlabeled data. Experimental results show that the proposed framework maintains impressive robustness in real-world environments and applies generally to several diagnostic tasks such as tuberculosis, pneumothorax, and COVID-19. Notably, we demonstrate that our model performs even better than models trained with the same amount of labeled data. The proposed framework holds great potential for medical imaging, where plenty of data accumulates every year but ground-truth annotations are expensive to obtain.

With the early success of deep learning for medical imaging [1] [2] [3], the application of artificial intelligence (AI) to medical images has rapidly accelerated in recent years [4] [5] [6].
In particular, many deep learning-based computer-aided diagnosis (CAD) software tools have been introduced into routine practice 7-10 for various imaging modalities such as chest X-ray (CXR). These deep learning-based AI models have demonstrated the potential to dramatically reduce the workload of clinicians in a variety of contexts when used as assistants, leveraging their power to handle a large corpus of data in parallel. This advantage can be maximized in resource-limited settings such as underdeveloped countries, where diseases such as tuberculosis prevail while experts able to provide an accurate diagnosis are scarce. Most existing AI tools are based on convolutional neural network (CNN) models built with supervised learning, but collecting large, well-curated data with ground-truth annotations is difficult in underprivileged areas, even where the raw data themselves are abundant. In particular, although the amount of data in these areas grows every year, the lack of ground-truth annotation hinders the use of this growing data to improve the performance of AI models. Given the limitation in label availability, an important line of machine learning research is self-supervised and semi-supervised learning, which relies less on a corpus of labeled data. The orthodox view has been that a model trained with a supervised learning approach sets the upper bound on performance. However, it was recently shown that self-training with knowledge distillation between a teacher and a noisy student, a type of semi-supervised learning approach, can substantially improve the robustness of the model to adversarial perturbations. The key idea of this method is to train two separate models, a student and a teacher, so that the student is trained on images with various forms of noise to match the teacher's prediction on the same but clean image.
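A minimal sketch of this teacher-student objective, assuming toy logits in place of the actual CXR models: the teacher's softmax output on the clean image serves as a soft pseudo-label for the student's prediction on the noised image.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    """Cross-entropy between the teacher's soft prediction on the clean
    image and the student's prediction on its noised counterpart."""
    p_teacher = softmax(teacher_logits, tau)                 # soft pseudo-label
    log_p_student = np.log(softmax(student_logits, tau) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Toy logits standing in for model outputs on clean vs. noised CXRs.
rng = np.random.default_rng(0)
t_logits = rng.normal(size=(4, 3))                  # teacher on clean images
s_logits = t_logits + 0.3 * rng.normal(size=(4, 3)) # student on noised images
loss = distillation_loss(s_logits, t_logits)
```

By Gibbs' inequality, the loss is minimized when the student's distribution matches the teacher's, which is exactly what drives the student toward the teacher's clean prediction.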
Experimental results suggest that knowledge distillation with sufficient noise can perform better in various external validation settings than a traditionally supervised model. In addition, the recently developed Vision Transformer (ViT) 11 was successfully utilized in a method called distillation with no labels (DINO) 12, which exploits knowledge distillation between student and teacher via local-to-global view correspondence for self-supervised learning. Besides achieving new state-of-the-art performance among self-supervised learning approaches, the powerful self-attention mechanism of ViT can segment objects without supervision, showing that the model is capable of higher-level image understanding. Inspired by the fact that both methods are based on knowledge distillation between teacher and student, here we propose a ViT-based self-evolving framework for CXR diagnosis that can gradually improve its performance simply with an increasing amount of unlabeled data, amalgamating the distinct strengths of self-supervised learning and self-training through knowledge distillation. Our method, dubbed distillation for self-supervised and self-train learning (DISTL), gradually improves the performance of the AI model in various external validation settings by maximally exploiting the common ground between self-supervision and self-training, knowledge distillation, as the amount of unlabeled data increases (Fig. 1a). Of note, it even outperforms a supervised model trained with the same amount of labeled data in external validation. Furthermore, the proposed self-evolving method is substantially robust to real-world data corruption, and our model offers a more straightforward visualization of its attention for locating lesions.
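The local-to-global matching in DINO-style training can be sketched as a cross-entropy between each sharpened teacher (global-view) output and every student view of the same image. The temperatures, view counts, and the simplified loss below are illustrative assumptions, not DINO's exact recipe (which also includes centering of the teacher outputs):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_style_loss(teacher_global, student_views, t_tau=0.04, s_tau=0.1):
    """Average cross-entropy between each sharpened teacher global view and
    every student view of the same image, skipping identical view pairs.
    Assumes the first len(teacher_global) student views are the global crops."""
    total, pairs = 0.0, 0
    for ti, t in enumerate(teacher_global):
        p_t = softmax(t / t_tau)  # low temperature sharpens the target
        for si, s in enumerate(student_views):
            if si == ti:          # do not match a global view against itself
                continue
            total += -(p_t * np.log(softmax(s / s_tau) + 1e-12)).sum()
            pairs += 1
    return total / pairs

rng = np.random.default_rng(1)
dim = 8
teacher_global = [rng.normal(size=dim) for _ in range(2)]                  # 2 global views
student_views = teacher_global + [rng.normal(size=dim) for _ in range(4)]  # + 4 local views
loss = dino_style_loss(teacher_global, student_views)
```

Because the local crops cover only part of the image while the teacher sees global crops, minimizing this loss forces the model to infer the global content from local evidence.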
We argue that the distillation of knowledge through self-training and self-supervised learning, even without knowledge of the lesion, results in a high correlation of attention with the lesion, which may explain the superior diagnostic performance.

Overview of the proposed framework. As shown in Fig. 1a, to stably evolve model performance by leveraging unlabeled cases, two identical models, a teacher and a student, are utilized for distillation, encouraging the student to match its prediction on a noised CXR to the teacher's clean prediction on the same CXR. However, unlike the previous noisy self-training approach, our method leverages both self-supervision and self-training. Specifically, self-supervision plays a key role in incentivizing the model to learn task-agnostic semantic features of the CXR by giving it a stronger shape bias toward the image content (Supplementary Fig. 1), while self-training enables the model to directly learn task-specific information, for example the diagnosis of tuberculosis. To verify this hypothesis, we conducted ablation studies removing each component, demonstrating that both components are imperative to attain optimal performance (Supplementary Fig. 2). Details of our algorithm and ablation studies can be found in the Supplementary Material. In our method, to gradually evolve model performance given the increasing unlabeled data accumulated over time (Fig. 1b), the initial model is first built using supervised learning with the small labeled dataset. Then, this initial model serves as the teacher to train the student on the large unlabeled data. In this process, the teacher is slowly co-distilled from the updated student. In addition, to prevent the student from deteriorating due to imperfect teacher estimates, a correction with the initial small labeled dataset is performed at predefined steps.
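Slow co-distillation of the teacher from the student is commonly implemented as an exponential moving average (EMA) over weights. The momentum value (0.996) and the one-weight toy models below are illustrative assumptions, not the paper's reported settings; the periodic correction step with labeled data would be interleaved as ordinary supervised updates (not shown):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """Slowly co-distill the teacher from the student: each teacher weight
    drifts toward the corresponding student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy single-weight models: the teacher starts at 0 while the student stays
# at 1, so repeated EMA updates gradually pull the teacher toward the student.
teacher = [np.array(0.0)]
student = [np.array(1.0)]
for _ in range(500):
    teacher = ema_update(teacher, student)
# After 500 steps the teacher weight is 1 - 0.996**500, roughly 0.87.
```

The high momentum makes the teacher a smoothed, lagged version of the student, which stabilizes the pseudo-labels the student is asked to match.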
The updated model is then used as the starting point of the next-generation model, similar to the previous self-training approach with increasing time T 13. We evaluated the proposed framework in three CXR tasks, including the diagnosis of tuberculosis, by using only a small corpus of labeled data for supervision and gradually increasing the amount of unlabeled data, simulating real-world data accumulation over time.

Figure 1: Overview of the proposed framework for the self-evolving AI model. (a) The distillation for self-supervised and self-train learning (DISTL) method is composed of two components, for self-supervision and self-training. (b) The initial model is trained with small labeled data. Then, using this model as the teacher, the student is trained with the DISTL method under an increasing amount of data over time T.

In particular, to confirm whether our AI model can gradually self-evolve in a data-abundant but label-insufficient situation, we set our main task as the diagnosis of tuberculosis (TB), as it is in high demand in clinics after the World Health Organization identified AI-based CAD for CXR screening of tuberculosis as a potential solution in resource-limited settings 14. We collected the CXR data as described in Supplementary Fig. 3 and the "Details of datasets for diagnosis" section. After collection, a total of 35,985 CXRs were divided into a labeled subset of 3,598 (10% of total data) and an unlabeled subset of 32,387 (90% of total data). Next, assuming the clinical situation in which the number of unlabeled cases increases over time, the unlabeled subset was further divided into three folds. Using these three folds, we increased the total amount of available unlabeled data to 30%, 60%, and 90% of the total data at times T = 1, 2, 3, as shown in Fig. 2a. During this process, the labeled subset remained fixed at the initial 3,598 CXRs (10% of total data) (Fig. 2b).

For pneumothorax diagnosis, we used the SIIM-ACR pneumothorax data 15 for model development and internal validation. As it contains CXRs and segmentation masks for pneumothorax and normal cases, we recast it as a binary classification problem for pneumothorax diagnosis. Similar to the tuberculosis task, we partitioned the data into labeled and unlabeled subsets, and the unlabeled subset was further divided into three folds to simulate gradually accumulating data over time (Fig. 3a and "Details of datasets for diagnosis" section).

Our TB diagnosis model can self-evolve with increasing unlabeled data. We first evaluated whether the performance of TB diagnosis can gradually be improved with the proposed framework given an increasing amount of unlabeled data. As shown in Fig. 4a, the performance improved progressively as more unlabeled data became available. Not confined to the metric itself, we also observed an interesting finding that the attention of the ViT model gets refined with increasing time T (Fig. 4d). As the AI model evolves over time T, its self-attention becomes better at localizing the target lesion as well as semantic structures within the given CXR image. Notably, the gradual improvement in performance was more prominent for the ViT model equipped with self-attention than for the CNN-based models (Fig. 5a and b). The ViT model showed a linear increase as well as the best performance among the models, although the CNN-based models also improved under the proposed framework with increasing unlabeled data. In addition, the ViT model showed no sign of the overfitting observed in some CNN-based models at later T. We further evaluated whether the existing self-supervised and semi-supervised learning methods, which can also exploit the plentiful unlabeled data at increasing T, improve the performance of the AI model as gradually as the proposed framework does (Fig. 5c and d).
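The partitioning used above (a fixed 10% labeled subset, with the remaining unlabeled data released cumulatively at T = 1, 2, 3) can be simulated with a short sketch; the function name and index-based bookkeeping are our own illustration:

```python
import numpy as np

def simulate_accumulation(n_total, labeled_frac=0.1, n_stages=3, seed=0):
    """Split indices into a fixed labeled subset and unlabeled folds that
    become available cumulatively at T = 1..n_stages."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    n_labeled = int(n_total * labeled_frac)   # truncation: 35,985 -> 3,598
    labeled = idx[:n_labeled]
    folds = np.array_split(idx[n_labeled:], n_stages)
    unlabeled_at_T = {t: np.concatenate(folds[:t]) for t in range(1, n_stages + 1)}
    return labeled, unlabeled_at_T

# Reproduces the tuberculosis setup: 3,598 labeled and 32,387 unlabeled CXRs,
# with ~1/3, ~2/3, and all of the unlabeled pool available at T = 1, 2, 3.
labeled, unlabeled = simulate_accumulation(35985)
```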
With the same experimental settings, the existing methods showed significant performance degradation at T = 1, where the number of unlabeled data is relatively small, while their performance improved slightly with more data at increasing T. Even with this increase, none of the existing self-supervised and semi-supervised methods showed prominent performance improvement. Next, to verify the robustness of the framework to real-world data corruption, we first added CXRs of classes unseen by the initial model to the unlabeled pool while increasing the number of unlabeled data over time (Fig. 6a). Notably, the performance improved as stably as in the experiments without these other-class data (Fig. 6b and c), suggesting that the framework is robust and that the AI model is not confused by classes unfamiliar to the initial model trained only with normal and tuberculosis data. Secondly, we randomly corrupted the labels with a probability of 5% for supervised learning and evaluated whether the performance decreased (Fig. 6d). The model trained with supervised learning on the corrupted labels showed significant deterioration, while the model trained with the proposed framework was not altered, as it does not depend on labels for the added data (Fig. 6e and f). Taken together, these results confirm the robustness of the proposed framework to real-world data corruption.

Verifying applicability of the proposed framework in other tasks. We further analyzed whether the gradual performance improvement with the proposed framework can also be observed in CXR tasks other than tuberculosis diagnosis. First, for pneumothorax diagnosis, similar to the observation in the tuberculosis task, the model trained with the proposed framework improved gradually over increasing time T (Fig. 8a and b). Notably, its performance was lower than the supervised model's when the available unlabeled data were relatively small (T = 1), but it ultimately outperformed the supervised model as the number of unlabeled data increased (T = 3).
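The 5% label-corruption experiment described above can be simulated with a short sketch; the helper function and the all-normal toy labels are our own illustration (binary normal-vs-tuberculosis labels assumed):

```python
import numpy as np

def corrupt_labels(labels, p=0.05, n_classes=2, seed=0):
    """Replace each label with a different class with probability p."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape) < p                      # ~5% of labels
    offset = rng.integers(1, n_classes, size=labels.shape)   # shift by 1..n-1
    labels[flip] = (labels[flip] + offset[flip]) % n_classes
    return labels, flip

y = np.zeros(10000, dtype=int)     # all "normal" for illustration
y_noisy, flipped = corrupt_labels(y)
```

A supervised baseline trained on `y_noisy` sees wrong targets for the flipped samples, whereas DISTL only consumes the added data without labels, which is why label noise cannot propagate into its training signal.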
Similarly, for COVID-19 diagnosis, the proposed framework provided stable performance improvement over time, whereas the model trained with the same amount of labeled data showed a substantial performance drop at later T in external validation (Fig. 8c and d), suggesting that overfitting to the training data degraded the generalization performance of the supervised model.

Given the striking results of early studies showing that AI can keep up with or even surpass the performance of experienced clinicians in various medical imaging applications 1, 2, we have entered an era flooded with AI models for medical imaging. However, these models share a common drawback: they depend heavily on the quantity and quality of both labels and data. If the labeled corpus does not contain enough data points to represent the entire distribution, the resulting model can be biased and its generalization performance can deteriorate severely. In medical imaging, a large amount of raw data is accumulated each year without label annotation, and supervised learning approaches cannot readily exploit this large corpus of unlabeled data. Several methods based on unsupervised learning 19, 20, self-supervised learning 21, and semi-supervised learning 22 have therefore been proposed to cope with this problem, but their performance remains sub-optimal. To cope with this problem, the proposed framework rests on two key components: self-supervised learning and noisy self-training with knowledge distillation, which together offer stably evolving performance simply with an increasing amount of unlabeled data. The first component is similar to that proposed in previous work 12, which encourages the model to learn task-agnostic semantic information of the image through local-global correspondence.
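The local-global correspondence can be made concrete with a multi-crop sketch: a couple of large crops act as global views (fed to the teacher) and several small crops act as local views (fed to the student). The crop sizes, counts, and helper names below are illustrative assumptions, not the paper's exact augmentation settings:

```python
import numpy as np

def random_crop(img, size, rng):
    """Take a random square crop of the given size from a 2-D image."""
    h, w = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def multi_crop(img, n_local=4, global_size=96, local_size=32, seed=0):
    """Two large 'global' crops plus several small 'local' crops, so the
    student can be asked to match its local views to the teacher's global
    views of the same image."""
    rng = np.random.default_rng(seed)
    g_crops = [random_crop(img, global_size, rng) for _ in range(2)]
    l_crops = [random_crop(img, local_size, rng) for _ in range(n_local)]
    return g_crops, l_crops

img = np.zeros((128, 128))  # toy stand-in for a CXR
g_crops, l_crops = multi_crop(img)
```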
In our preliminary experiment, the model built only with this self-supervision attends noticeably well to the image layout and, in particular, to object boundaries, as shown in Supplementary Fig. 1. Secondly, the semi-supervised component enables the model to directly learn task-specific features, the diagnosis of tuberculosis, similar to noisy student self-training 13. Under the continuity and clustering assumptions 23, learning with a soft pseudo-label along with student-side noise increases not only the performance but also the robustness to adversarial samples. Interestingly, we have found an analogy between the proposed framework and the training of radiologists during their junior years. When a junior radiologist learns to read CXR, a common practice is to first read the CXR and then confirm the reading with the computed tomography image of the same patient, which usually offers a more accurate diagnosis. This procedure is analogous to the learning process of the student in our framework, in which the model learns to match its prediction on the noisy augmented image to the teacher's prediction on the clean original image, which offers a more accurate target. It is also common practice for the junior radiologist to learn by referring to the senior radiologist's readings, which is similar to the teacher-student distillation used in our framework. Finally, during the learning process, junior radiologists occasionally refer to textbooks containing a small number of typical cases, which prevents them from being biased by recently seen atypical cases. In our framework, the correction step with the small number of initially labeled data plays a similar role.
As a result, the proposed framework, unlike the existing self-supervised and semi-supervised learning approaches, offered gradually evolving performance simply by increasing the amount of unlabeled data, with substantial robustness to corruption from data of unseen classes or from corrupted labels. In addition, in experiments extending the application of our framework to pneumothorax and COVID-19 diagnosis, we found that its benefits apply generally across a variety of tasks. Practically, our method holds great potential for the screening of diseases like tuberculosis, especially when applied in underprivileged areas. In a simulation applying the model under real-world prevalence, it yielded a negatively predicted portion of 72.5% and a negative predictive value (NPV) of 0.977. That is to say, it can rule out 72.5% of the screened population from further evaluation by a clinician with 97.7% reliability, resulting in a substantial decrease of workload in resource-limited settings. In addition, the AI model can improve its performance by itself, using the proposed DISTL method and the iterative self-evolving framework, without any further supervision by human experts. This is another important merit for underprivileged areas, where plenty of data is available owing to the high prevalence of disease but experts are scarce. This study has several limitations. First, details concerning patient demographics and CXR characteristics were not available in some of the open-source data used for training and internal validation. Second, although we simulated robustness to unseen-class data assuming real-world data collection, it was not possible to consider all the other minor classes that could appear during real-world data accumulation.
Third, we utilized a total of 35,985 CXRs to demonstrate the benefit of the proposed framework, dividing them into a small labeled and a large unlabeled subset, but this number may be insufficient to draw a firm conclusion. Further studies are warranted to verify the proposed framework on a data corpus large enough to represent the general distribution. Nevertheless, with the data-abundant but label-insufficient condition being common in medical imaging, we believe the framework may offer great applicability to a broad range of medical imaging tasks.

Details of datasets for pre-training. To pretrain the model to learn task-relevant CXR features from a large corpus of CXRs, we used the CheXpert dataset 24 containing 10 common CXR classes: no finding, cardiomegaly, lung opacity, consolidation, edema, pneumonia, atelectasis, pneumothorax, pleural effusion, and support device. Among these 10 classes, five classes considered related to the manifestation of infectious disease, namely lung opacity, consolidation, edema, pneumonia, and pleural effusion, were selected as task-relevant CXR features, and the model was first trained to classify these five classes with the CheXpert data. Of a total of 224,316 CXRs from 65,240 subjects, 29,420 posterior-anterior (PA) and 161,427 anterior-posterior (AP) view CXRs were used after excluding the 32,387 lateral-view CXRs. Thanks to this large number of cases, the model became a robust extractor of task-relevant CXR features, independent of variations in patients and image acquisition settings. As suggested in the ablation study of pre-training (see Supplementary Fig. 2), this pre-training step brought a substantial increase in performance and is one of the key components of our model.

Code Availability. The code is available at the following GitHub repository.
https://github.com/depecher/

References

Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs
Clinically applicable deep learning for diagnosis and referral in retinal disease
Radiologist-level pneumonia detection on chest X-rays with deep learning
AI for medical imaging goes deep
Machine learning in medical imaging
Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine
Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks
Efficient deep network architectures for fast chest X-ray tuberculosis screening and visualization
A systematic review of the diagnostic accuracy of artificial intelligence-based computer programs to analyze chest X-rays for pulmonary tuberculosis
Tuberculosis detection from chest X-rays for triaging in a high tuberculosis-burden setting: an evaluation of five artificial intelligence algorithms
An image is worth 16x16 words: Transformers for image recognition at scale
Emerging properties in self-supervised vision transformers
Self-training with noisy student improves ImageNet classification
SIIM-ACR pneumothorax segmentation (Kaggle)
BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays
Grad-CAM: Visual explanations from deep networks via gradient-based localization
A tour of unsupervised deep learning for medical image analysis
Unsupervised deep transfer feature learning for medical image classification
Self-supervised learning for medical image analysis using image context restoration
Semi-supervised medical image classification with relation-driven self-ensembling model
Recent deep semi-supervised learning approaches and related works
CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
Tbxpredict - browse /data at SourceForge
Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery
A large chest X-ray image dataset with multi-label annotated reports
Rethinking computer-aided tuberculosis diagnosis
Weakly supervised lesion localization with probabilistic-CAM pooling