key: cord-0616326-zdv1gcyn authors: Sirshar, Mehreen; Hassan, Taimur; Akram, Muhammad Usman; Khan, Shoab Ahmed title: An Incremental Learning Approach to Automatically Recognize Pulmonary Diseases from the Multi-vendor Chest Radiographs date: 2022-01-07 journal: nan DOI: nan sha: f0c77e92680960df95ead1510352ad33b2981eb4 doc_id: 616326 cord_uid: zdv1gcyn

Pulmonary diseases can cause severe respiratory problems, leading to sudden death if not treated timely. Many researchers have utilized deep learning systems to diagnose pulmonary disorders using chest X-rays (CXRs). However, such systems require exhaustive training efforts on large-scale data to effectively diagnose chest abnormalities. Furthermore, procuring such large-scale data is often infeasible and impractical, especially for rare diseases. With the recent advances in incremental learning, researchers have periodically tuned deep neural networks to learn different classification tasks with few training examples. Although such systems can resist catastrophic forgetting, they treat the knowledge representations independently of each other, and this limits their classification performance. Also, to the best of our knowledge, there is no incremental learning-driven image diagnostic framework that is specifically designed to screen pulmonary disorders from CXRs. To address this, we present a novel framework that can learn to screen different chest abnormalities incrementally. In addition, the proposed framework is penalized through an incremental learning loss function that applies Bayesian inference to recognize structural and semantic inter-dependencies between incrementally learned knowledge representations, enabling it to diagnose pulmonary diseases effectively regardless of the scanner specifications. We tested the proposed framework on five public CXR datasets containing different chest abnormalities, where it outperformed various state-of-the-art systems across multiple evaluation metrics.

The lungs are the fundamental organs of the human respiratory system, responsible for respiration. They are enclosed within the chest, and their pathology is chiefly observed through chest radiographs, also known as chest X-rays (CXRs) [1]. Many pulmonary disorders are diagnosed through CXRs by observing abnormal pathological patterns within the posteroanterior, anteroposterior, and lateral projections of the thoracic cavity [2]. Moreover, CXR imagery is an effective and low-cost modality for detecting edema, tuberculosis, single or multiple nodules, and pneumonia [3] (as shown in Figure 1). Among these pathologies, the most fatal is pneumonia (especially COVID-19 pneumonia [4]), which is clinically identified by observing airspace opacities, lobar consolidation, and interstitial opacities [4]. Edema, on the other hand, is identified through cephalization of the pulmonary vessels, septal lines, patchy shadowing (with air bronchograms), and increased cardiac size [5]. Tuberculosis is identified on CXRs by observing consolidations and cavities that are often seen in the upper lung zones (with or without mediastinal or hilar lymphadenopathy) [6]. In contrast, nodules appear as spots within the lung zones on CXRs [7]. Many researchers have proposed autonomous frameworks to screen pulmonary diseases using CXRs [8]. The initial methods employed machine learning to recognize different chest abnormalities at the inference stage [9].
However, these methods were confined to limited datasets and experimental settings due to the subjectivity of their handcrafted features [10]. With the advances in deep learning, the recent wave of diagnostic frameworks utilizes convolutional neural networks to screen and grade different chest abnormalities such as pneumonia [11]. Although deep learning has increased the diagnostic performance of such frameworks many-fold, they still require exhaustive training efforts involving high computational power and large-scale (and well-annotated) data, which limits their applicability to screening new types of pathologies in a clinical setting. The incremental learning paradigm addresses this inherent limitation of deep neural networks. However, incremental learning systems are vulnerable to catastrophic forgetting, defined as the tendency of a classification model to lose its prior knowledge upon learning new tasks [12, 13]. To overcome catastrophic forgetting within an incremental learning framework, many researchers have proposed strategies based on knowledge distillation [14] and contrastive learning [15]. These schemes, however, ignore the structural similarities and inter-dependencies between different knowledge representations, the exploitation of which can significantly boost the classification performance of incremental learning systems while maintaining high resistance to catastrophic forgetting.

Deep learning has increased the performance of medical image diagnostic frameworks [16, 17] many-fold, especially for predicting abnormal lung pathologies from CXRs [11]. The majority of these systems are based on transfer learning or fine-tuning approaches, which utilize pre-trained models such as VGG-19 [18] and MobileNet [19] to classify abnormal chest pathologies [20, 21, 22] using CXRs. Although transfer learning systems can fulfill the large-scale data requirement (to some extent) for recognizing a limited set of chest pathologies [8, 1, 4], they tend to forget their source-domain knowledge while being tuned for target-domain tasks [23]. Due to this, they cannot be readily deployed in hospitals for screening purposes. In clinical practice, a classification model is expected to recognize new types of pathologies (while retaining its previously learned knowledge) using very few training examples [24, 25]. Incremental learning enables deep learning models to meet this requirement. More specifically, the concept of incremental learning originates from the need to periodically learn scarce yet inter-related tasks effectively without re-training the model from scratch [26]. Furthermore, by showing high resistance to catastrophic forgetting, deep incremental learning models eliminate the need for re-training on large datasets to learn a limited number of classification categories [27, 28]. The initial strategies to address the catastrophic forgetting phenomenon were based on distillation [29], in which the knowledge of the previously trained instance of the model is compressed and transferred to the new instance. The new instance then retains these representations by explicitly minimizing a distillation loss function (through the training examples which correspond to the previously added classes) [29]. Apart from this, the new model instance is also penalized during training to learn the new classes from the provided small-scale set of training examples [30].
Here, the work of Li et al. [12] is notable, in which they proposed the learning without forgetting (LwF) scheme. LwF [12] optimizes the distillation and cross-entropy loss functions in each training increment to enable the model to retain its prior knowledge while simultaneously learning new representations. Moreover, Rebuffi et al. [14] improved upon LwF [12] for class-incremental learning tasks by indefinitely learning the feature representations related to the distilled and newly added categories via joint optimization of distillation and classification loss functions. Aljundi et al. [31] introduced gating auto-encoders that incrementally learn the feature representations for the task at hand; based upon the nature of the test sample, the processing request is automatically forwarded to the relevant gate to perform the appropriate classification task. Castro et al. [32] presented an end-to-end incremental learning scheme in which a cross-distilled loss function retains prior learned knowledge while the candidate network is penalized via a cross-entropy loss to learn new classification tasks. Similarly, Roy et al. [33] presented a hierarchically structured CNN architecture that is incrementally trained to perform various classification tasks with minimal training effort. Tian et al. [15] presented a contrastive learning strategy that outperformed knowledge distillation and other cutting-edge distillers on various knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer; they proposed an objective function that minimizes the Kullback-Leibler (KL) divergence between the outputs of a teacher and a student network. Mirzadeh et al. [34] addressed the gap between teacher and student networks through a multi-step knowledge distillation process. Lee et al. [35] introduced a global distillation strategy to reduce catastrophic forgetting using massive unlabeled data. Lopez-Paz et al. [36] proposed utilizing episodic memories in their framework, dubbed Gradient Episodic Memory (GEM), to resist catastrophic forgetting during incremental training. Chaudhry et al. [37] proposed averaged GEM (A-GEM), which replaces the quadratic programming in GEM [36] with dot products so that the gradient calculation is performed only once; this modification significantly reduces the computational resources required by the original GEM [36] framework. Apart from this, Hegde et al. [38] presented a compact and sparse student network that replicates the compressed representation of the teacher model to incrementally learn various classification tasks via knowledge distillation.

Although the state-of-the-art incremental learning strategies give decent classification performance while overcoming the catastrophic forgetting phenomenon, these schemes treat the previously learned representations and newly stacked classes independently of each other, which caps the performance of deep classification networks when performing these tasks simultaneously, especially when the tasks are highly correlated and inter-related [15]. To address these limitations, we present a novel incremental learning loss function (L_IL) that not only penalizes the classification network to learn newly added class representations while distilling previously learned knowledge, but also ensures that the network understands their complex relationships and structural dependencies so that it can recognize them effectively at the inference stage.
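For reference, the temperature-scaled distillation objective that this line of work builds upon is commonly written in the following standard form (given here for context; it is not quoted verbatim from any of the cited papers):

$$\mathcal{L}_{KD} = \tau^{2}\,\mathrm{KL}\!\left(\sigma\!\left(\frac{z_t}{\tau}\right)\,\Big\Vert\,\sigma\!\left(\frac{z_s}{\tau}\right)\right),$$

where $z_t$ and $z_s$ denote the teacher and student logits, $\sigma(\cdot)$ is the softmax function, and $\tau$ is the distillation temperature; the $\tau^{2}$ factor keeps the gradient magnitude of the distillation term comparable to that of the standard cross-entropy term.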
To summarize, the main contributions of the paper are:

• This paper presents the first attempt towards utilizing incremental learning to periodically screen different pulmonary disorders from CXRs irrespective of their scanner specifications.
• Unlike state-of-the-art incremental learning approaches [14, 29], the proposed framework analyzes the structural and semantic relationships between periodically learned chest abnormalities (via the proposed L_IL loss function), which enables it to recognize them effectively at the inference stage.
• The proposed framework is rigorously evaluated on five public datasets, where it achieves an accuracy and F1 score of up to 0.8405 and 0.8303, respectively. Furthermore, it outperforms the state-of-the-art schemes on all five datasets, as evident from Section 5.

The rest of the sections are organized as follows: Section 3 presents the proposed framework in detail, Section 4 enlists the experimental protocols, Section 5 presents the results of the proposed framework and its comparison with the state-of-the-art schemes, and Section 6 concludes the paper and envisages future directions.

The block diagram of the proposed framework is shown in Figure 2. Here, we can see that the proposed framework is trained in two phases. In the first phase, the classification model (within the proposed framework) is penalized via L_IL to periodically recognize various chest abnormalities; we dub this phase disease incremental learning. In the second phase, the proposed framework performs dataset incremental learning to train the candidate network (via L_IL) to recognize chest diseases from different datasets. At the inference stage, the proposed framework can identify different pulmonary disorders from CXR imagery regardless of their scanner specifications. More details about the training and testing phases are presented in the subsequent sections.

The incremental training of the proposed framework is done in two phases: the first phase is related to disease incremental learning, and the second to dataset incremental learning. In the first training phase (disease incremental learning), we train the candidate classification model incrementally to recognize different chest abnormalities, such as effusion, pulmonary edema, and pleural thickening, from the first dataset. Here, in each training increment, the L_IL loss function ensures that the network minimizes the classification loss while learning newly added disease categories. Furthermore, L_IL also penalizes the classification network to distill its prior learned knowledge representations through a subset of training examples that were used in the previous iteration. Moreover, unlike other incremental learning schemes [15, 14, 12, 29], the proposed L_IL also minimizes the mutual distillation objective function L_MD, which ensures that the classification model resolves the complex inter-dependencies between incrementally learned knowledge representations via Bayesian inference. More details about the loss functions are presented in Section 3.3. A schematic of one such training increment is sketched below.
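The following is a minimal, framework-agnostic sketch of how one disease-incremental training increment can be organized. It is illustrative only: the function and variable names, the exemplar-memory size, and the `train_step` callable are our assumptions and are not taken from the authors' implementation.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Tuple[object, int]  # placeholder for (CXR image, disease label)

def run_increment(train_step: Callable[[List[Sample], List[Sample]], None],
                  new_classes: Dict[int, List[Sample]],
                  memory: Dict[int, List[Sample]],
                  exemplars_per_class: int = 50,
                  epochs: int = 20) -> Dict[int, List[Sample]]:
    """One disease-incremental increment: minimize L_IL on a mix of exemplars
    of previously learned classes (distillation) and few-shot samples of the
    newly added classes (classification), then update the exemplar memory."""
    old_subset = [s for samples in memory.values() for s in samples]
    new_subset = [s for samples in new_classes.values() for s in samples]
    for _ in range(epochs):
        train_step(old_subset, new_subset)  # one epoch minimizing L_IL
    # retain a small subset of the new classes for distillation in later increments
    for label, samples in new_classes.items():
        memory[label] = random.sample(samples, min(exemplars_per_class, len(samples)))
    return memory
```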
In the second training phase, the proposed framework is trained to recognize pulmonary disorders across different datasets. The training strategy remains the same, i.e., the classification model within the proposed framework is penalized in each training increment (via L_IL) to recognize different disease categories from CXR scans. However, these CXR scans are acquired with different scanners and belong to different datasets, showcasing different types of abnormal disease patterns. The proposed framework learns such diversified classification tasks incrementally, without catastrophically forgetting its prior knowledge, by minimizing the L_MD objective function within L_IL. Owing to L_MD, the L_IL loss function optimizes the classification model such that it yields better performance than state-of-the-art incremental learning schemes. A detailed comparison of the proposed framework with existing approaches is presented in Section 5.

At the testing (inference) stage, the proposed framework possesses the capacity to recognize different pulmonary diseases from CXR scans regardless of their scanner specifications. Unlike conventional transfer learning or fine-tuning approaches, the proposed framework can be further modified to recognize more chest pathologies (via few-shot training) using the proposed L_IL loss function. This makes the proposed framework scalable and an ideal choice for clinical screening, as it can be easily adapted to recognize emerging diseases using very few annotated training examples.

We propose a novel incremental learning loss function, dubbed L_IL, which penalizes the candidate classification model to learn new disease classification tasks while simultaneously retaining its prior learned knowledge. Furthermore, the proposed L_IL loss function enables the classification model to show high resistance to the catastrophic forgetting phenomenon compared to recently introduced incremental learning approaches. This is because L_IL considers the knowledge representations which the classification model learns to be non-mutually exclusive. Through the proposed L_MD objective function, L_IL penalizes the candidate network to exploit the structural and semantic similarities between different incrementally learned knowledge representations, which results in superior classification performance as compared to the state-of-the-art approaches. Mathematically, the L_IL loss function is expressed as

$$L_{IL} = \alpha L_{D} + \beta L_{C} + \gamma L_{MD}, \tag{1}$$

where

$$L_{D} = -\frac{1}{b_{s}\,n_{o}} \sum_{k=1}^{b_{s}} \sum_{m=1}^{n_{o}} q\!\left(t_{o}^{\tau}(x_{k,m})\right) \log p\!\left(l_{o}^{\tau}(x_{k,m})\right), \tag{2}$$

$$L_{C} = -\frac{1}{b_{s}\,n_{n}} \sum_{k=1}^{b_{s}} \sum_{m=1}^{n_{n}} q\!\left(t_{n}^{\tau}(y_{k,m})\right) \log p\!\left(l_{n}^{\tau}(y_{k,m})\right), \tag{3}$$

and b_s denotes the batch size, n_o and n_n denote, respectively, the number of training examples associated with the previously learned (o) and newly added (n) disease categories in the current training increment, and τ is the temperature constant that scales the ground truth labels and the output logits. Apart from this, t_o^τ in Eq. 2 represents the scaled ground truth labels for the training examples x belonging to the previously learned categories o, and p(l_o^τ(x_{k,m})) is the predicted softmax probability obtained from the scaled logits l_o of the training examples x belonging to the previously learned categories. In Eq. 3, q(t_n^τ(y_{k,m})) denotes the true softmax distribution of the scaled ground truth labels t_n^τ associated with the training examples y, which enables the classification network to learn the newly added classes n, and p(l_n^τ(y_{k,m})) represents the predicted softmax probability of the scaled output logits l_n generated from the training examples y associated with the newly added disease categories in the current training increment. From Eq. 2 and 3, we can see that L_D is a distillation loss function which ensures that the network does not forget previously learned classes in the current training increment, whereas L_C is a classification loss function that penalizes the candidate model to learn the newly added disease categories through their respective training examples.
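As a concrete illustration, a minimal NumPy sketch of the distillation and classification terms described above is given below, assuming both terms are temperature-scaled cross-entropies per Eq. 2 and 3; the function names and the way the soft targets are supplied are our assumptions rather than the authors' implementation, and the α and β defaults follow the weight values reported in the next paragraph.

```python
import numpy as np

def scaled_softmax(logits, tau=2.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Mean cross-entropy between target and predicted distributions."""
    return float(-np.mean(np.sum(target_probs * np.log(pred_probs + eps), axis=-1)))

def l_il_without_md(old_logits, old_targets, new_logits, new_targets,
                    tau=2.0, alpha=0.5, beta=0.25):
    """L_D (on exemplars of previously learned classes) plus L_C (on samples of
    the newly added classes); the mutual-distillation term L_MD is omitted here."""
    l_d = cross_entropy(scaled_softmax(old_targets, tau), scaled_softmax(old_logits, tau))
    l_c = cross_entropy(scaled_softmax(new_targets, tau), scaled_softmax(new_logits, tau))
    return alpha * l_d + beta * l_c
```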
Moreover, the parameters α, β, and γ represent the loss weights, which are empirically determined to be 0.5, 0.25, and 0.25, respectively, across all the datasets.

Contrary to existing knowledge distillation approaches, we introduce a novel mutual distillation objective function (L_MD) within the proposed L_IL function. L_MD bridges the gap between newly added and previously learned knowledge representations by exploiting their complex structural and semantic dependencies. Incrementally learned knowledge representations are generally non-mutually exclusive in nature [39]. Therefore, penalizing the classification network to recognize their complex relationships and dependencies is crucial for achieving robust performance and high resistance to catastrophic forgetting during incremental training. To handle this, we introduce the L_MD function, which utilizes Bayesian inference to analyze the extent of similarity between incrementally learned knowledge representations so that inter-related disease categories are recognized effectively at the inference stage.

Consider a classification network with input z during incremental training, such that z = {x, y | x, y ∈ R^2}, where x and y denote the training examples of the previously learned and newly added disease categories, respectively. To analyze the extent of similarity, we compute the joint probability distribution between the scaled output logits l_o^τ(x) and l_n^τ(y), such that p(l_o^τ(x), l_n^τ(y)) = p(l_o^τ(x) | l_n^τ(y)) × p(l_n^τ(y)). Similarly, p(l_n^τ(y), l_o^τ(x)) = p(l_n^τ(y) | l_o^τ(x)) × p(l_o^τ(x)). Adding the notion of disease categories d to these definitions yields the class-conditional likelihoods p(l_o^τ(x), l_n^τ(y) | d = d_i) and p(l_n^τ(y), l_o^τ(x) | d = d_i) (Eq. 4 and 5). Afterwards, the posterior for each class d_i ∈ d is computed through Bayes' rule:

$$p(d = d_{i} \mid l_{o}^{\tau}(x), l_{n}^{\tau}(y)) = \frac{p(l_{o}^{\tau}(x), l_{n}^{\tau}(y) \mid d = d_{i})\; p(d = d_{i})}{p(l_{o}^{\tau}(x), l_{n}^{\tau}(y))}, \tag{6}$$

$$p(d = d_{i} \mid l_{n}^{\tau}(y), l_{o}^{\tau}(x)) = \frac{p(l_{n}^{\tau}(y), l_{o}^{\tau}(x) \mid d = d_{i})\; p(d = d_{i})}{p(l_{n}^{\tau}(y), l_{o}^{\tau}(x))}, \tag{7}$$

where the prior p(d = d_i) is obtained from n_{d_i}, the total number of training examples of disease d_i, and n_d, the total number of disease categories d. Moreover, in Eq. 6 and 7, p(d = d_i | l_o^τ(x), l_n^τ(y)) and p(d = d_i | l_n^τ(y), l_o^τ(x)) denote the posterior probabilities, while p(l_o^τ(x), l_n^τ(y) | d = d_i) and p(l_n^τ(y), l_o^τ(x) | d = d_i) denote the respective likelihoods. Since these likelihoods derive from numerical representations, we model them in the proposed framework through a multivariate Gaussian distribution:

$$\mathcal{N}(\hat{z}_{i}; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\hat{z}_{i} - \mu)^{T} \Sigma^{-1} (\hat{z}_{i} - \mu)\right),$$

where ẑ_i ∈ z, D represents the multivariate dimension, and μ and Σ denote the mean and covariance of the output logits, respectively. Afterwards, L_MD is computed from these posteriors and the scaled ground truth labels t_o^τ(x) and t_n^τ(y) corresponding to the training samples x and y of the previously learned and newly added classes, respectively. A sketch of this posterior computation is given below.
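A minimal NumPy sketch of this posterior computation, assuming Gaussian class-conditional likelihoods as described above; the function names, the use of log-densities, and the way per-class statistics are supplied are our illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_logpdf(z, mu, cov):
    """Log-density of a multivariate Gaussian N(mu, cov) evaluated at z."""
    d = z.shape[-1]
    diff = z - mu
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ cov_inv @ diff)

def class_posteriors(z, class_stats, priors):
    """Bayes posterior p(d_i | z) with Gaussian class-conditional likelihoods.

    z           : feature/logit vector for one sample.
    class_stats : list of (mu_i, cov_i) per disease category.
    priors      : array of prior probabilities p(d_i).
    """
    # unnormalized log-posterior: log-likelihood + log-prior
    log_post = np.array([gaussian_logpdf(z, mu, cov) + np.log(p)
                         for (mu, cov), p in zip(class_stats, priors)])
    log_post -= log_post.max()        # numerical stability
    post = np.exp(log_post)
    return post / post.sum()          # normalize to obtain p(d_i | z)
```

Here, z would be, for example, the concatenation of the temperature-scaled logits of the previously learned and newly added output heads for one sample, with each (mu, cov) pair estimated from the logits of the corresponding disease category's training examples.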
This section presents a detailed discussion of the experimental protocols (including the dataset descriptions, training strategies, and evaluation metrics) followed in the proposed study.

The proposed scheme is evaluated on five publicly available datasets containing high-resolution chest radiographs; a detailed description of each dataset is presented below. The first dataset used for the evaluation of the proposed framework is the Indiana dataset [40], collected from various hospitals affiliated with the Indiana University School of Medicine [40]. The complete dataset contains 7,470 frontal and lateral CXRs depicting normal and abnormal pathologies such as cardiac hypertrophy, pulmonary edema, opacity, or pleural effusion. The second dataset on which we tested the proposed system is the Montgomery County (MC) dataset [41], collected by the Department of Health and Human Services in partnership with Montgomery County, Maryland, United States. It consists of 138 frontal chest radiographs from the Montgomery County Tuberculosis Screening Program, of which 80 are normal and 58 show tuberculosis manifestations. The scans within the MC dataset have resolutions of 4020 × 4892 and 4892 × 4020 pixels. The third dataset on which we evaluated the proposed framework is the Shenzhen dataset [42], collected in collaboration with Shenzhen People's Hospital, Guangdong Medical College, Shenzhen, China. It contains 662 CXR scans, of which 326 represent normal pathologies and 336 are tuberculosis-affected. The fourth dataset on which we evaluated the proposed framework is the Japanese Society of Radiological Technology (JSRT) dataset [43], which contains normal scans as well as scans affected by pulmonary nodules. JSRT [43] contains 247 CXRs, of which 154 contain pulmonary nodules (100 malignant and 54 benign) and 93 have no nodules. All CXR scans within the JSRT dataset [43] have a resolution of 2048 × 2048 pixels with a color depth of 12 bits. The last dataset on which the proposed framework is evaluated is the Zhang CXR dataset [44]. The Zhang dataset was originally designed for classifying different retinal diseases via optical coherence tomography (OCT) imagery [44]; however, it also contains CXRs depicting healthy and pneumonic pathologies. In the Zhang CXR dataset [44], 3,883 training scans depict pneumonic pathologies, while 1,349 scans are from healthy subjects. Similarly, the testing set contains 390 pneumonic and 234 healthy scans [44]. All of these scans are arranged within the dataset as per their depicted pathologies.

The proposed framework is implemented using TensorFlow 1.14 and Keras 2.0.0 with Python 3.7.4 on the Anaconda platform. Some of the utility functions are also implemented using MATLAB R2020a. The training of the proposed framework was conducted in two phases, where the candidate classification model minimized the L_IL loss function in each iteration. The number of epochs in each training increment was 20 (and the number of cycles in each epoch varies per dataset). Also, during each training increment, we fed the candidate network with around 20% of the original training data (10% was used for the distillation process and the remaining 10% for learning the newly added classes). Apart from this, we used ADADELTA [45] as the optimizer, and the training was conducted on a machine with a Core i7-9750H @ 2.6 GHz processor, 32 GB DDR4 RAM, and an NVIDIA RTX 2080 Max-Q GPU with cuDNN v7.5 and CUDA Toolkit 10.1.243.

To evaluate the proposed framework and compare it with the state-of-the-art schemes, we used the standard classification metrics, namely accuracy, true positive rate (TPR), positive predictive value (PPV), and F1 score:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad TPR = \frac{TP}{TP + FN}, \quad PPV = \frac{TP}{TP + FP}, \quad F1 = \frac{2 \times PPV \times TPR}{PPV + TPR},$$

where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively.

The proposed framework has been thoroughly evaluated on five public CXR datasets with four different classification networks, i.e., VGG-16 [18], ResNet-50 [46], ResNet-101 [46], and MobileNet [19]. These networks have been trained incrementally, where, in each iteration, they minimized the proposed L_IL loss function to learn different pulmonary disease classification tasks.
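For completeness, a small helper that computes these metrics from raw confusion-matrix counts (an illustrative sketch; not part of the authors' released code):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, TPR (recall/sensitivity), PPV (precision), and F1 score."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0
    return {"accuracy": accuracy, "TPR": tpr, "PPV": ppv, "F1": f1}
```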
To present the evaluation results of the proposed framework clearly, we organize this section as follows: first, we present detailed ablation studies to determine the hyperparameters of the proposed framework on all five datasets; afterward, we present a detailed evaluation of the proposed framework on each dataset. The ablation study for the proposed framework includes 1) the determination of the optimal temperature constant τ, and 2) the evaluation of the capacity of L_IL to resist catastrophic forgetting with and without the inclusion of the L_MD objective function on each dataset.

The temperature constant τ generates soft-target probabilities for each class, enabling deep neural networks to accurately learn the distinct feature representations during the knowledge distillation process. It should be noted that τ is a dataset-dependent parameter, and its optimal value varies across different datasets. Table 1 reports the effect of varying τ in terms of classification error on each dataset. Here, we can see that although the optimal value of τ varies across datasets, it typically lies in the range 2 ≤ τ ≤ 2.5 for each classification network. The second ablation study analyzes the effect of L_MD within the proposed L_IL loss function. From Table 2, we can observe that, for all datasets, including the L_MD objective function within L_IL significantly improved the classification performance of the proposed framework in terms of accuracy. This is because L_MD enables L_IL to analyze the complex inter-dependencies between different incrementally learned knowledge representations, which allows the candidate classification model to effectively screen different inter-related disease categories while showing high resistance to catastrophic forgetting.

The first dataset on which we evaluated the proposed loss function is the Indiana dataset [40], which contains CXR scans depicting cardiac hypertrophy, pulmonary edema, opacity, and pleural effusion. The classification performance of the L_IL-driven classification models is reported in Table 3, where we can see that the best performance is achieved by ResNet-101 [46]. It should also be noted that the performance of the incremental ResNet-101 model is competitive with its fine-tuning variant, i.e., the incremental ResNet-101 model lags its fine-tuned counterpart by only 10.60% in terms of accuracy. After evaluating the proposed framework on the Indiana dataset [40], we trained it incrementally on the MC dataset [41] for screening tuberculosis subjects. The performance of the proposed framework after adapting to the MC dataset is shown in Table 3. Here, we can observe that the classification performance of all backbone models is similar, except for the ResNet-101 model, which lags its fine-tuning variant by 13.93%. The third dataset on which we evaluated the proposed framework is the Shenzhen dataset [42]. It can be observed from Table 3 that, on the Shenzhen dataset [42], all the pre-trained models achieve similar classification performance in terms of accuracy (except for MobileNet [19]). Also, the best-performing incremental ResNet-101 [46] model lags its fine-tuned variant by only 5.40%, which is appreciable. The next dataset on which we evaluated the proposed framework is JSRT [43]. In Table 3, we report the classification performance of the pre-trained models towards recognizing a diverse range of pulmonary disorders.
We can observe that the incrementally trained ResNet-101 [46] model reaches 88.20% of the performance of the conventional fine-tuning approach (i.e., it lags the fine-tuned baseline by only 11.79%). It should also be noted that, using the proposed L_IL loss function, we incrementally trained our model with very few training examples (as described in Section 4.2). Following the same training data quota for the fine-tuning approach would result in a drastic decrease in performance (for all classification networks) due to over-fitting. Therefore, considering that we achieved good generalization over a diverse range of chest pathologies and scanner specifications with few-shot training, we believe that the performance of the L_IL-driven incremental classification models is significant.

The last dataset on which we evaluated the L_IL-driven classification models is the Zhang CXR dataset [44]. Unlike the other datasets, the Zhang CXR dataset [44] can only be utilized for a binary classification task, i.e., to classify healthy and pneumonic pathologies. Therefore, to evaluate the capacity of the proposed loss function on the Zhang CXR dataset [44], we first trained all the classification models to recognize only one of the two categories and then incrementally added the other; the results are reported in Table 3. Here, we can see that the L_IL-driven ResNet-101 [46] lags its fine-tuned variant by only 6.50% in terms of accuracy and 5.45% in terms of F1 score. It should also be noted that the incremental ResNet-101 [46] uses a significantly smaller number of training samples compared to its fine-tuning variant, i.e., it uses just 522 scans (134 healthy and 388 pneumonia-affected) out of 5,232 to identify these pathologies.

Apart from evaluating the applicability of the proposed L_IL loss function with different classification networks, we also compared it with state-of-the-art incremental learning schemes (based on knowledge distillation [34, 14] and contrastive learning [15]). The comparison is reported in Table 4. We can observe that the incremental ResNet-101 model trained using the proposed L_IL loss function outperformed the state-of-the-art schemes on each dataset. Furthermore, the significance of the L_MD objective function within L_IL is also apparent from Table 4, where L_MD allowed the candidate classification model to exploit the mutual information between incrementally learned disease representations, improving its performance many-fold on each dataset (especially on Shenzhen [42], JSRT [43], and Zhang CXR [44]).

The qualitative evaluation of the proposed framework is performed by observing the attention maps (obtained from the best-performing incremental ResNet-101 [46] model), as shown in Figure 3. These attention maps are generated from the latent vectors (feature maps within the deeper layers of the network), which are then resized to the network input size. From Figure 3, we can observe that ResNet-101 [46] (trained incrementally using the proposed L_IL loss function) focuses on the chest abnormalities while predicting the disease categories. For example, in the attention map in Figure 3 (R), the network picked up the consolidations compared to other scan regions; similarly, the network paid attention to the opacities in Figure 3 (X). However, not all the focused areas within the attention maps are clinically relevant; for example, see the focused areas in Figure 3 (L, T, and V).
The incrementally trained network pays attention to these irrelevant areas because it could not differentiate between the lesions and the background regions due to their high spatial similarity within the candidate scan. However, it should also be noted that, although these focused features are not clinically relevant, they do enable the classification model to accurately predict the disease category within each scan.

This paper presents a novel incremental learning scheme that can screen various pulmonary diseases from CXR scans irrespective of their scanner specifications. Furthermore, unlike its competitors based on conventional transfer learning and fine-tuning approaches, the proposed framework can effectively recognize new types of chest pathologies with few-shot training without catastrophically forgetting its previously acquired knowledge. The classification network within the proposed scheme is trained via a novel L_IL loss function that not only penalizes the network to learn new class representations while distilling its previous knowledge, but also ensures that the network understands the complex relationships and inter-dependencies between different knowledge transfer tasks so that it can differentiate them effectively at the inference stage. Compared to the state-of-the-art incremental learning approaches based on knowledge distillation [34, 14] and contrastive learning [15], the proposed framework shows high resistance to the catastrophic forgetting phenomenon, which also results in better classification performance, as evident from Table 4. Furthermore, the proposed framework has the potential to be deployed in a clinical setting. Computer-aided screening systems in clinical practice are expected to recognize new types of pathologies (especially rarely seen ones) from few training examples. The proposed framework is an ideal choice for such situations since, unlike conventional transfer learning approaches, it can be easily modified to incrementally learn different disease patterns within CXR scans. Although, after incrementally learning a diversified range of chest pathologies from five public datasets, the proposed framework highlighted some clinically irrelevant features while diagnosing the disease categories (for example, see the pairs (K, L), (S, T), and (U, V) in Figure 3), it nevertheless manages to correctly identify these diseases at the inference stage. In the future, we envisage investigating the proposed framework for screening COVID-19 and grading its severity as per clinical standards.
[1] Fully Convolutional Neural Network for Lungs Segmentation from Chest X-Rays
[2] Pneumonia Can Be Prevented-Vaccines Can Help
[3] Community Acquired Bacterial Pneumonia: Aetiology, Laboratory Detection and Antibiotic Susceptibility Pattern
[4] COVID-19 Detection Using Deep Learning Models to Exploit Social Mimic Optimization and Structured Chest X-ray Images Using Fuzzy Color and Stacking Approaches
[5] Acute Pulmonary Edema Due to Occult Air Embolism Detected on an Automated Anesthesia Record: Illustrative Case
[6] Robbins Basic Pathology
[7] Benign Lung Tumors and Nodules
[8] Automatic Screening for Tuberculosis in Chest Radiographs: A Survey
[9] The Coding of Roentgen Images for Computer Analysis as Applied to Lung Cancer
[10] Advanced Approaches to Computer-Aided Detection of Thoracic Diseases on Chest X-rays
[11] CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
[12] Learning without Forgetting
[13] Meta-Transfer Learning Driven Tensor-Shot Detector for the Autonomous Localization and Recognition of Concealed Baggage Threats
[14] iCaRL: Incremental Classifier and Representation Learning
[15] Contrastive Representation Distillation
[16] Computer-Aided Detection in Chest Radiography Based on Artificial Intelligence: A Survey
[17] Attention to Lesion: Lesion-Aware Convolutional Neural Network for Retinal Optical Coherence Tomography Image Classification
[18] Very Deep Convolutional Networks for Large-Scale Image Recognition
[19] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[20] COVID-19: Automatic Detection from X-ray Images Utilizing Transfer Learning with Convolutional Neural Networks
[21] Deep Ensemble Learning Based Objective Grading of Macular Edema by Extracting Clinically Significant Findings from Fused Retinal Imaging Modalities
[22] Deep Learning for Classification and Localization of COVID-19 Markers in Point-of-Care Lung Ultrasound
[23] Incremental Learning Through Deep Adaptation
[24] RAG-FW: A Hybrid Convolutional Framework for the Automated Extraction of Retinal Lesions and Lesion-Influenced Grading of Human Retinal Pathology
[25] Deep Structure Tensor Graph Search Framework for Automated Extraction and Characterization of Retinal Layers and Fluid Pathology in Retinal SD-OCT Scans
[26] Representation Learning: A Review and New Perspectives
[27] Less-Forgetting Learning in Deep Neural Networks
[28] An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
[29] Distilling the Knowledge in a Neural Network
[30] Class-Incremental Learning via Deep Model Consolidation
[31] Expert Gate: Lifelong Learning with a Network of Experts
[32] End-to-End Incremental Learning
[33] Tree-CNN: A Hierarchical Deep Convolutional Neural Network for Incremental Learning
[34] Improved Knowledge Distillation via Teacher Assistant
[35] Overcoming Catastrophic Forgetting with Unlabeled Data in the Wild
[36] Gradient Episodic Memory for Continual Learning
[37] Efficient Lifelong Learning with A-GEM
[38] Variational Student: Learning Compact and Sparser Networks in Knowledge Distillation Framework
[39] Continual Learning and Catastrophic Forgetting
[40] Indiana University - Chest X-Rays (PNG Images)
[41] Two Public Chest X-ray Datasets for Computer-Aided Screening of Pulmonary Diseases
[42] Activities of the Korean Institute of Tuberculosis
[43] Development of a Digital Image Database for Chest Radiographs with and without a Lung Nodule: Receiver Operating Characteristic Analysis of Radiologists' Detection of Pulmonary Nodules
[44] Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning
[45] ADADELTA: An Adaptive Learning Rate Method
[46] Deep Residual Learning for Image Recognition
We would like to acknowledge the National University of Sciences and Technology, Pakistan, and Khalifa University, UAE, for providing the resources to conduct this research. All the authors declare that there are no competing interests that could influence the work presented in this article.