key: cord-0481131-qzhlfh5z authors: Wang, Kai; Liu, Xialei; Herranz, Luis; Weijer, Joost van de title: HCV: Hierarchy-Consistency Verification for Incremental Implicitly-Refined Classification date: 2021-10-21 journal: nan DOI: nan sha: f800577d18e1cf9eb07e6a2da30bb11fa2fdf842 doc_id: 481131 cord_uid: qzhlfh5z

Human beings learn and accumulate hierarchical knowledge over their lifetime. New knowledge is associated with previous concepts for consolidation and hierarchical construction. However, current incremental learning methods lack the ability to build a concept hierarchy by associating new concepts with old ones. A more realistic setting tackling this problem is referred to as Incremental Implicitly-Refined Classification (IIRC), which simulates the recognition process from coarse-grained to fine-grained categories. To overcome forgetting in this benchmark, we propose Hierarchy-Consistency Verification (HCV) as an enhancement to existing continual learning methods. Our method incrementally discovers the hierarchical relations between classes. We then show how this knowledge can be exploited during both training and inference. Experiments on three setups of varying difficulty demonstrate that our HCV module improves the performance of existing continual learning methods under the IIRC setting by a large margin. Code is available at https://github.com/wangkai930418/HCV_IIRC.

In the lifetime of a human being, knowledge is continuously learned and accumulated. Deep learning models, however, suffer from knowledge forgetting, also known as catastrophic forgetting [11, 21], when presented with a sequence of tasks. Incremental learning [5, 19, 24], also referred to as continual learning, has been a crucial research direction in computer vision that aims to prevent this forgetting of previous knowledge in neural networks. Another aspect of human learning is the association of new concepts with old ones: people construct a hierarchy of knowledge to better consolidate this information. Recently, the IIRC (Incremental Implicitly-Refined Classification) setup [1] has been proposed as a novel extended benchmark to evaluate lifelong learning methods in a realistic setting where the construction of hierarchical knowledge is key. On the IIRC benchmark (see Fig. 1), each class has multiple granularity levels, but only one label is provided at any time, which requires the model to infer whether the related labels have been observed in previous tasks. This setting is much closer to real-life learning, where a learner gradually improves its knowledge of objects (first it labels roses as a plant, later as a flower, and finally as a rose). Based on this benchmark, Abdelsalam et al. [1] adapted and evaluated several state-of-the-art incremental learning methods to address this problem, including iCaRL [27], LUCIR [9], and AGEM [4]. However, their work does not propose an effective solution specifically designed for the IIRC problem: they do not aim to incrementally learn the hierarchical knowledge that is important to correctly label the data in this setting. Furthermore, there are some further limitations in the current version of the IIRC benchmark: (i) the granularity is limited to two layers, while in reality there are often more layers involved (see the WordNet [22] hierarchy of ImageNet [6]); (ii) the first task always contains a large number of superclasses, which means that the learner encounters data from most classes already in these early stages.
This makes training relatively easy, and the proposed setup less applicable. To overcome catastrophic forgetting under the IIRC setup, we propose a module called Hierarchy-Consistency Verification (HCV). We aim to explicitly learn, in an incremental manner, the hierarchical knowledge that underlies the data. While learning new tasks with new super- and subclasses, we automatically discover relations, e.g. that the class 'flower' is a subclass of 'plant'. Next, we show how this knowledge can be exploited to enhance incremental learning. Principally, in the described example, we would not use images from 'flower' as negative examples for the class 'plant' (a problem from which the methods in [1] suffer). We then also show how the hierarchical knowledge can be used at inference time to improve the predictions. Based on these observations, our main contributions are:
• We propose a Hierarchy-Consistency Verification (HCV) module as a solution to the IIRC setup. It incrementally discovers the hierarchical knowledge underlying the data and exploits it during both training and inference.
• We extend the IIRC benchmark to a challenging 3-layer hierarchy on the IIRC-CIFAR dataset. In addition, we propose a much harder setup where the superclasses are distributed uniformly over the incremental tasks to test the robustness of different methods.
• Experiments show that we successfully acquire hierarchical knowledge, and that exploiting this knowledge leads to significant improvements of existing incremental learning methods under the IIRC setup (with absolute accuracy gains of 3-20%).

2 Related work

Incremental learning methods can be categorized into three types [5, 19] as follows.

Regularization-based methods. The first group of techniques adds a regularization term to the loss function which impedes changes to the parameters deemed relevant to previous tasks. These methods differ mainly in how this relevance is estimated, and can be further divided into data-focused [10, 14, 26, 34] and prior-focused [2, 3, 11, 13, 15, 33] approaches. Data-focused methods use knowledge distillation from previously learned models. Prior-focused methods estimate the importance of model parameters as a prior for the new model.

Parameter isolation methods. This family focuses on allocating different model parameters to each task. These models begin with a simplified architecture and are updated incrementally with new neurons or network layers in order to allocate additional capacity for new tasks. In Piggyback/PackNet [17, 18], the model learns a separate mask on the weights for each task, whereas in HAT [28] masks are applied to the activations. This idea is further developed in [20] to the case where no forgetting is allowed. In general, this branch is restricted to the task-aware (task-incremental) setting; these methods are therefore more suitable for learning a long sequence of tasks when a task oracle is present.

Replay methods. These methods prevent forgetting by including data from previous tasks, stored either in an episodic memory or via a generative model. There are two main strategies: exemplar rehearsal [4, 9, 16, 27, 32] and pseudo-rehearsal [29, 31]. The former stores a small number of training samples (also called exemplars) from previous tasks. The latter uses generative models learned from previous data distributions to synthesize data.
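As a concrete illustration of the exemplar-rehearsal strategy (and of the fixed budget of 20 exemplars per class used later in our experiments), a class-balanced memory buffer could look as follows. This is a generic sketch with our own naming, not the buffer of any particular method; iCaRL, for instance, selects exemplars with a herding strategy rather than at random.

```python
import random
from collections import defaultdict

class ExemplarMemory:
    """Minimal class-balanced rehearsal buffer (generic sketch).

    Keeps at most `per_class` exemplars for every class seen so far.
    Exemplars are picked at random here, whereas methods such as iCaRL
    select them with a herding strategy.
    """
    def __init__(self, per_class=20):
        self.per_class = per_class
        self.buffer = defaultdict(list)   # class label -> list of stored samples

    def add(self, samples, label):
        # Merge new candidates with the stored ones and keep a random subset.
        pool = self.buffer[label] + list(samples)
        random.shuffle(pool)
        self.buffer[label] = pool[: self.per_class]

    def replay_batch(self, k):
        # Draw k stored (sample, label) pairs uniformly for rehearsal.
        flat = [(x, c) for c, xs in self.buffer.items() for x in xs]
        return random.sample(flat, min(k, len(flat)))
```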
Classification problems normally assume that categories do not overlap. However, concepts in real life are connected to each other through hierarchical information. For example, in ImageNet [6], the categories are organized according to the WordNet [22] hierarchy. In hierarchical classification [30], the system groups things according to an explicit hierarchy, which is important in applications such as bioinformatics [7] and COVID-19 identification [25]. Another related area is multi-label classification [35], where each image is associated with multiple labels. Multi-label classification is a generalization of the single-label problem: there is no constraint on how many classes an instance can be assigned to, but there are also no hierarchical constraints among the categories. By comparison, in the IIRC setup [1] the hierarchical information is implicitly defined, and a model developed for this problem should learn the hierarchy by itself and predict the multiple labels of each instance.

The original work that presented the IIRC setup [1] ignores the hierarchical nature of the classes during incremental learning. Consequently, some samples are incorrectly used as negative samples for their superclass labels, potentially resulting in a drop of performance. Here we propose our method to incrementally learn the hierarchy and directly exploit this information to remove said interference. Moreover, we also show how the estimated hierarchy can be exploited at inference time. Our method is general and can be applied to existing incremental learning methods that can be trained with a binary cross-entropy loss (in the experiments we show results for iCaRL [27] and LUCIR [9]).

Given a series of tasks, each task t ∈ [1, T] is composed of data D_t from the current class set C_t, which can contain both super- and subclasses. During training of task t, the model only observes for each sample the single label that is present in C_t. In the proposed setup of [1], the superclass is always learned first and the subclass later (as in Fig. 1). We use lowercase y for a one-hot vector and capital Y for a binary vector possibly with multiple non-zero elements. It is important to note that, even if during training only a single label y^i_t is provided, during testing after task t we consider test data (x^i_t, Y^i_t), i.e., at test time we are expected to predict all non-zero elements of Y^i_t. To apply a standard recognition model to this multi-label case, [1] proposes to replace the conventional cross-entropy loss by a binary cross-entropy loss:

$\mathcal{L}_{\mathrm{BCE}}(x^i_t, y^i_t) = -\sum_{c \in C^{1:t}} \big[\, y^i_{t,c} \log p_c(x^i_t) + (1 - y^i_{t,c}) \log\big(1 - p_c(x^i_t)\big) \big],$

where p_c(x^i_t) denotes the sigmoid output of the network for class c. To exploit the hierarchical knowledge, we verify the hierarchy consistency both at training and at inference time to boost the performance of continual learning models. Our algorithm, called Hierarchy-Consistency Verification (HCV), contains two phases which we describe in the following (see also Fig. 2). Moreover, the learned hierarchy is also exploited at inference.

The goal of the first phase is to estimate the hierarchical relationship between subclasses u^i_t and superclasses v^i_t. This phase takes place before the training of the current task. Suppose we have learned the classifier F_{t−1} for all previous classes. We can use F_{t−1} to classify all accessible training data D^train_t of a new class y_c and average its outputs into a prediction vector p_{y_c}. If the highest entry of p_{y_c} over the previously seen classes exceeds a threshold τ, we record the corresponding class as the superclass of y_c in the hierarchy matrix H_t and complete the training labels accordingly. Then, with the new class label vector Ȳ^i_t, the binary cross-entropy loss is rewritten as:

$\mathcal{L}_{\mathrm{BCE}}(x^i_t, \bar{Y}^i_t) = -\sum_{c \in C^{1:t}} \big[\, \bar{Y}^i_{t,c} \log p_c(x^i_t) + (1 - \bar{Y}^i_{t,c}) \log\big(1 - p_c(x^i_t)\big) \big].$
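The hierarchy-discovery phase just described can be summarized in a short sketch. The code below is a minimal illustration under our own naming (estimate_superclass, prev_model, etc.), not the authors' released implementation; we assume the previous model F_{t−1} outputs per-class logits that are turned into sigmoid probabilities over the previously learned classes.

```python
import torch

@torch.no_grad()
def estimate_superclass(prev_model, loader, num_old_classes, tau=0.6):
    """Hedged sketch of the hierarchy-discovery phase, run before training task t.

    `loader` iterates over the current-task images of one new class y_c;
    `prev_model` plays the role of F_{t-1} and is assumed to output one logit
    per previously learned class. Returns the index of the estimated
    superclass of y_c, or None when no old class is confident enough.
    """
    prev_model.eval()
    probs_sum = torch.zeros(num_old_classes)
    n = 0
    for images, _ in loader:
        p = torch.sigmoid(prev_model(images))   # shape: (batch, num_old_classes)
        probs_sum += p.sum(dim=0)
        n += images.size(0)
    avg = probs_sum / max(n, 1)                 # averaged prediction vector p_{y_c}
    conf, idx = avg.max(dim=0)
    # Record the relation only if the confidence exceeds the threshold tau.
    return idx.item() if conf.item() >= tau else None

# The discovered relations fill the binary hierarchy matrix H_t; during training,
# the single provided label is then completed with its estimated superclass
# (giving the label vector \bar{Y}^i_t) before computing the BCE loss.
```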
Inference with HCV (Infer-HCV). At inference time, if a multi-label prediction vector is not consistent with our estimated hierarchical knowledge H, we mark it as a wrong prediction (e.g., it contains a subclass and superclass combination that is not in accordance with the hierarchical knowledge captured by H). Based on this assumption, we process each prediction Ŷ^i_t with H_t. If the prediction is in accordance with H_t, it remains unchanged. If labels need to be added to Ŷ^i_t to bring it in accordance with H_t, we do so (adding the missing subclass or superclass label). If labels need to be removed from Ŷ^i_t to reach accordance with H_t, we randomly select one of the possible solutions that removes the fewest labels. See the supplementary material for a visual explanation of Infer-HCV.
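The following is a minimal sketch of this verification step, again under our own naming rather than the released implementation. Here H is the estimated binary hierarchy matrix whose rows encode the allowed label combinations, and pred is the thresholded multi-label prediction Ŷ^i_t.

```python
import random
import numpy as np

def verify_prediction(pred, H):
    """Hedged sketch of Infer-HCV.

    pred: binary vector (numpy array) with the predicted labels.
    H:    binary matrix whose rows encode the consistent label combinations
          (e.g. a subclass together with its estimated superclass).
    Returns a prediction consistent with H: unchanged if already consistent,
    otherwise completed with missing labels, or, if labels must be dropped,
    one of the solutions removing the fewest labels chosen at random.
    """
    # Case 1: already matches a row of H -> keep it.
    if any(np.array_equal(pred, row) for row in H):
        return pred
    # Case 2: some row of H contains pred -> add the fewest missing labels.
    supersets = [row for row in H if np.all(row >= pred)]
    if supersets:
        return min(supersets, key=lambda r: int(r.sum()))
    # Case 3: pred contains some row of H -> remove labels, keeping as many as possible;
    # ties are broken at random, as described in the text.
    subsets = [row for row in H if np.all(pred >= row)]
    if subsets:
        best = max(int(r.sum()) for r in subsets)
        return random.choice([r for r in subsets if int(r.sum()) == best])
    # Fallback: no row of H is comparable with pred; leave it unchanged.
    return pred
```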
Datasets. We use the same two datasets as in IIRC [1]: CIFAR100 [12] and ImageNet [6]. For CIFAR100, we take the two-level hierarchy split IIRC-CIFAR from IIRC [1], which we denote as IIRC-2-CIFAR; it is composed of 15 superclasses and 100 subclasses. To further explore the performance of incremental learning methods on a multi-level hierarchy, we extend IIRC-2-CIFAR into a three-level hierarchy dataset, IIRC-3-CIFAR, with two highest-level superclasses (which we call "root" classes): "animals" and "plants". This amounts to 2 rootclasses, 15 superclasses and 100 subclasses. For ImageNet, due to its huge amount of data, we collect 100 subclasses according to the hierarchy proposed in IIRC [1]. In total there are 10 superclasses and 100 subclasses (including those that have no superclass label). We denote this dataset as IIRC-ImageNet-Subset, a simplified version of the original one. The detailed hierarchies and task information are provided in the supplementary material.

Incremental task configurations. For IIRC-2-CIFAR, we adopt the training sequence from IIRC [1], where the first task contains 10 superclasses and each subsequent task contains 5 classes. For IIRC-3-CIFAR, we uniformly distribute the rootclasses and superclasses to form 23 tasks in total; the first task has 7 classes and each following task has 5. For IIRC-ImageNet-Subset, we have 11 tasks with 10 classes each; here the superclasses are also uniformly distributed. We want to stress that the uniform distribution of superclasses (and rootclasses) leads to a more challenging setting than the one proposed in the original IIRC.

Baselines and compared methods. We compare the performance of the following variants:
(1) Incremental Joint learns the model across tasks and has access to all the data from previous tasks with complete information (i.e., all label annotations Y_t). It serves as the upper bound for comparison.
(2) ER-infinite is similar to Incremental Joint but with incomplete information (access only to the current label annotations y_t).
(3) iCaRL-CNN is the original version of the incremental learning method iCaRL [27].
(4) iCaRL-norm is the adapted version of iCaRL [27] in which the L2 distance is replaced by cosine similarity.
(5) LUCIR is the incremental learning method of [9].
(6) ER is the finetuning baseline with 20 image exemplars per class as experience replay.
(7) FT is the finetuning baseline without image replay.

Implementation details. For most implementation details, we follow the IIRC configurations [1]. For all three setups, we use ResNet-32 [8] as the classification backbone. For model training, we use SGD (momentum 0.9) as optimizer, which is commonly used in continual learning [23]. For the IIRC-2-CIFAR and IIRC-3-CIFAR settings, the learning rate starts at 1.0 and decays by a factor of 0.1 when the validation performance plateaus. For IIRC-ImageNet-Subset, the learning rate starts at 0.5 and decays by 0.1 on the plateau. The number of training epochs is 140, 140 and 100 for IIRC-2-CIFAR, IIRC-3-CIFAR and IIRC-ImageNet-Subset, respectively. For all three setups, the batch size is 128 and the weight decay is 1e-5. During training, we apply random resized cropping (of size 32 × 32) to both CIFAR100 and ImageNet images, followed by a random horizontal flip and normalization. For image replay, we keep a fixed number of 20 saved exemplars per class by default. For evaluation, we adopt the precision-weighted Jaccard similarity (pw-JS) proposed in IIRC [1], which jointly takes precision and recall into account. The threshold τ is set to 0.6 in all experiments (except in the ablation study over it).

HCV applied to existing methods. To verify the performance of our proposed HCV, we apply it to iCaRL-CNN, iCaRL-norm and LUCIR. The average pw-JS values are provided in Table 1. We conduct experiments using three different settings: IIRC-2-CIFAR, IIRC-3-CIFAR and IIRC-ImageNet-Subset. On the IIRC-2-CIFAR setting, with the help of our HCV module during the training stage, the average scores increase by nearly 4.3% for all three continual learning methods. When we apply HCV also at inference time, it further improves the consistency of the final predictions, raising the average scores by a further 3.2%, 2.8% and 1.7% for the three methods, respectively. On the IIRC-3-CIFAR setting, which is a much harder setup for incremental learning, all variants suffer a significant drop in performance; LUCIR is much better than iCaRL-CNN and iCaRL-norm. Applying HCV in both the training and inference stages boosts performance by around 6.5% for the two iCaRL variants and 21.1% for LUCIR. The IIRC-ImageNet-Subset setting has much higher image diversity and therefore also poses difficulties for these incremental methods. Under this setting, LUCIR performs worse than iCaRL-CNN and iCaRL-norm even with the improvement from HCV, and iCaRL-CNN performs similarly to iCaRL-norm with marginally better results. Overall, using our proposed HCV during training and inference consistently improves the performance of existing methods across settings.

Final estimated hierarchy graph and visual examples. After learning the last task under the IIRC-2-CIFAR setup when applying our SPL module to iCaRL-CNN, we estimate the full hierarchy and draw a subgraph with 3 superclasses in Fig. 4 (right). We can observe that most subclasses are correctly annotated with their superclasses. However, table is not correctly annotated because its confidence (58%) does not reach the threshold. Interestingly, television is wrongly classified as a subclass of furniture; in real life we could also regard it as a member of furniture, and this relation was learned because televisions often occur in furniture scenes. This kind of information can help human operators in annotating and verifying the dataset hierarchy. Further, we see that house, bridge and castle are false positives, classified as subclasses of vehicles. This could be because vehicle images co-occur with the house, bridge and castle classes as their background. Finally, we also show some visual examples from the IIRC-2-CIFAR setup and in-the-wild images in Fig. 4 (left).
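All scores reported here and in the comparisons below are pw-JS values. As a reading aid, the snippet sketches one plausible way to compute such a metric; we only assume, as the name suggests, a per-sample Jaccard index weighted by the per-sample precision and averaged over the test set, which may differ in detail from the official IIRC implementation [1].

```python
import numpy as np

def pw_jaccard(y_true, y_pred):
    """Approximate precision-weighted Jaccard similarity (assumed definition, see text).

    y_true, y_pred: binary matrices of shape (num_samples, num_classes).
    Per sample: JS = |Y ∩ Ŷ| / |Y ∪ Ŷ| and precision = |Y ∩ Ŷ| / |Ŷ|;
    the returned score is the mean over samples of JS * precision.
    """
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    pred_sz = y_pred.sum(axis=1)
    js = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    prec = np.where(pred_sz > 0, inter / np.maximum(pred_sz, 1), 0.0)
    return float((js * prec).mean())
```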
Comparison with SOTA methods. In Fig. 3 we plot the performance of the different methods over time. The general trends across settings are similar. Incremental Joint always achieves the best results as an upper bound, benefiting from access to all data and labels, while ER-infinite lacks knowledge of the full labels, resulting in worse performance. Our proposed HCV improves existing methods consistently, but the gap between our best results and the two upper bounds (ER-infinite and Incremental Joint) is still large, which shows that IIRC is a very challenging setting requiring further research.

Confusion matrices. Fig. 5 shows the confusion matrices after learning task 11 under the IIRC-2-CIFAR setup, obtained from the ground truth, the original continual learning methods, and HCV applied at both training and inference time. It can be observed that after using HCV the redundant predictions are cleaned up using our learned prior knowledge about the class hierarchy; HCV thus acts as a de-noising procedure for the confusion matrices.

Figure 4: Visual examples of our model applied to the IIRC-2-CIFAR setup (annotated with superclasses and subclasses) and in-the-wild images (annotated with class names). We plot the top-5 predicted superclasses (ranked by percentage) for each query image and take the default threshold τ = 0.6 to distinguish success and failure cases. A subgraph of the final predicted graph under the IIRC-2-CIFAR setup with the iCaRL method is shown on the right, listing the top-1 predicted superclasses with their percentages.

Ablation study over threshold τ. We conduct an ablation study on the threshold τ under the IIRC-2-CIFAR setup. In Fig. 6a, we compare the values τ ∈ {0.4, 0.5, 0.6, 0.7} when applying HCV in both the training and inference stages. We observe that for all these hyper-parameter values HCV improves over iCaRL-CNN consistently. In Fig. 6b, we show how the hierarchy correctness score (HCS) changes as the threshold varies from 0.1 to 0.8; it stays around 75% to 80% when τ is in the range [0.3, 0.7]. In our experiments, we set τ = 0.6 by default.

Ablation study over the hierarchy correctness score (HCS). We also conduct an ablation study over the HCS for the LUCIR and ER methods, as shown in Fig. 6d and Fig. 6e. The hierarchy correctness scores for iCaRL, LUCIR and ER are 76.2%, 56.0% and 34.3%, respectively (the HCS curves per training session are shown in Fig. 6c). The higher hierarchy correctness score of iCaRL-CNN helps it achieve state-of-the-art performance on IIRC-2-CIFAR and IIRC-ImageNet-Subset (Table 1 and Fig. 3), while LUCIR achieves a much lower score even though it is regarded as one of the best methods in continual learning [19]. We also show the performance of the LUCIR and ER methods with the ground-truth hierarchy, i.e., with an HCS of 100% (see Fig. 6d and Fig. 6e). In this case, improvements of 3.0% and 15.0% are observed for LUCIR and ER, respectively. This implies that our HCV module can benefit from a more precise hierarchy estimation to reduce the gap to ER-infinite. To test how a completely wrong class hierarchy influences our model, we randomly generate a hierarchy for IIRC-2-CIFAR and apply it to ER (Fig. 6e): the HCS drops from 34.3% to 0.0%, and the overall performance of ER drops to nearly 7.0%.
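The precise definition of the hierarchy correctness score is not spelled out above; for these ablations we read it as the fraction of subclasses whose estimated superclass agrees with the ground-truth hierarchy. The hypothetical snippet below illustrates this reading (the function name and the toy hierarchy are illustrative only).

```python
def hierarchy_correctness_score(estimated, ground_truth):
    """Hypothetical reading of the hierarchy correctness score (HCS).

    estimated, ground_truth: dicts mapping each subclass name to its
    superclass name (or None when the subclass has no superclass).
    Returns the fraction of subclasses whose estimated superclass
    matches the ground truth.
    """
    total = len(ground_truth)
    correct = sum(1 for sub, sup in ground_truth.items()
                  if estimated.get(sub, None) == sup)
    return correct / total if total else 0.0

# Toy example (class names are illustrative only):
gt  = {"rose": "flower", "oak": "tree", "whale": "aquatic_mammal"}
est = {"rose": "flower", "oak": "tree", "whale": "fish"}
print(hierarchy_correctness_score(est, gt))  # 0.666...
```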
HCV (on LUCIR) performance with 10 orders. In Fig. 6f, experiments are conducted with all 10 task orderings proposed in IIRC [1], and we plot the average performance. Here we apply our SPL and Infer-HCV modules to the LUCIR model. We observe a significant and consistent improvement compared to the ER baseline (≈10.0%) and the basic LUCIR method (≈8.0%). In conclusion, our method improves performance under various orderings and settings.

In this paper, we proposed a Hierarchy-Consistency Verification module for the Incremental Implicitly-Refined Classification (IIRC) problem. With this module, we can boost existing incremental learning methods by a large margin. Our experiments on three different setups demonstrate the effectiveness of the proposed module during both training and inference, and the visualization of confusion matrices shows that our HCV module acts as a de-noising method for the confusion matrices. For future work, we are interested in combining hierarchical classification and multi-label classification with the IIRC problem, in order to obtain a more robust model that overcomes forgetting in more realistic setups.

Here we show the ground-truth hierarchies of IIRC-2-CIFAR/IIRC-3-CIFAR in Table 1 and of IIRC-ImageNet-Subset in Table 2. For the task splits used in our experiments, we select one from the IIRC paper for IIRC-2-CIFAR (Table 3) and propose our own splits for IIRC-3-CIFAR (Table 4) and IIRC-ImageNet-Subset (Table 5) to test the models on more complex hierarchies. In Fig. 10

References
[1] IIRC: Incremental implicitly-refined classification.
[2] Memory aware synapses: Learning what (not) to forget.
[3] Riemannian walk for incremental learning: Understanding forgetting and intransigence.
[4] Efficient lifelong learning with A-GEM.
[5] A continual learning survey: Defying forgetting in classification tasks.
[6] ImageNet: A large-scale hierarchical image database.
[7] A tutorial on hierarchical classification with applications in bioinformatics. Research and trends in data mining technologies and applications.
[8] Deep residual learning for image recognition.
[9] Learning a unified classifier incrementally via rebalancing.
[10] Less-forgetting learning in deep neural networks.
[11] Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.
[12] Learning multiple layers of features from tiny images.
[13] Overcoming catastrophic forgetting by incremental moment matching.
[14] Learning without forgetting.
[15] Rotate your networks: Better weight consolidation and less catastrophic forgetting.
[16] Generative feature replay for class-incremental learning.
[17] PackNet: Adding multiple tasks to a single network by iterative pruning.
[18] Piggyback: Adapting a single network to multiple tasks by learning to mask weights.
[19] Class-incremental learning: Survey and performance evaluation.
[20] Ternary feature masks: Continual learning without any forgetting.
[21] Catastrophic interference in connectionist networks: The sequential learning problem.
[22] WordNet: A lexical database for English.
[23] Understanding the role of training regimes in continual learning.
[24] Continual lifelong learning with neural networks: A review.
[25] COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios.
[26] Encoder based lifelong learning.
[27] iCaRL: Incremental classifier and representation learning.
[28] Overcoming catastrophic forgetting with hard attention to the task.
[29] Continual learning with deep generative replay.
[30] A survey of hierarchical classification across different application domains.
[31] Memory replay GANs: Learning to generate new categories without forgetting.
[32] Large scale incremental learning.
[33] Continual learning through synaptic intelligence.
[34] Class-incremental learning via deep model consolidation.
[35] A review on multi-label learning algorithms.

We acknowledge the support from Huawei Kirin Solution and the Spanish Government funding for projects PID2019-104174GB-I00 and RTI2018-102285-A-I00. Kai Wang acknowledges the Chinese Scholarship Council (CSC) No. 201706170035. Herranz acknowledges the Ramón y Cajal fellowship RYC2019-027020-I.

Fig. 8 shows the confusion matrices after learning task 11 and the last task under the IIRC-2-CIFAR setup. We observe similar trends as in Fig. 5 of the main paper when applying the HCV modules to the training and inference stages of iCaRL-CNN.

We verify the hierarchy consistency also at inference time (see Section 3.2). Here we provide some examples for a better understanding of Infer-HCV. In Fig. 9 there are four examples of how the Infer-HCV module works. We address the examples one column at a time:
I. This example is correctly matched by the first row of H_t, so it remains unchanged.
II. This example does not match any row in H_t. We match it with the first row by removing the second label.
III. This example also does not match. We match it by adding the first class label, making it in accordance with the last row of H_t.
IV. This example also does not match. It can be modified by removing either the 4th or the 5th label, so we randomly choose one of the two to make it compatible with H_t.

Table 5: IIRC-ImageNet-Subset task split configuration. (S) denotes the superclasses.
   | wolf spider, marmoset, squirrel, monkey, guenon, orangutan, macaque, baboon, Madagascar cat, capuchin, soccer ball
10 | howler monkey, siamang, gibbon, gorilla, spider web, red wine, crate, colobus, tennis ball, barn spider
11 | croquet ball, indri, chimpanzee, titi, spider monkey, langur, ping-pong ball, computer keyboard, patas, proboscis monkey