title: Iterative Self Knowledge Distillation -- From Pothole Classification to Fine-Grained and COVID Recognition
authors: Peng, Kuan-Chuan
date: 2022-02-04

Pothole classification has become an important task for road inspection vehicles to save drivers from potential car accidents and repair bills. Given limited computational power and a fixed number of training epochs, we propose iterative self knowledge distillation (ISKD) to train lightweight pothole classifiers. Designed to improve both the teacher and student models over time in knowledge distillation, ISKD outperforms the state-of-the-art self knowledge distillation method on three pothole classification datasets across four lightweight network architectures, which supports that self knowledge distillation should be done iteratively instead of just once. The accuracy relation between the teacher and student models shows that the student model can still benefit from a moderately trained teacher model, and it implies that better teacher models generally produce better student models, which justifies the design of ISKD. In addition to pothole classification, we also demonstrate the efficacy of ISKD on six additional datasets associated with generic classification, fine-grained classification, and a medical imaging application, which supports that ISKD can serve as a general-purpose performance booster without the need of a given teacher model or extra trainable parameters.

Detecting potholes is essential for municipalities and road authorities to repair defective roads. Vehicle repair bills related to pothole damage have cost U.S. drivers $3 billion annually on average [1]. Due to cost constraints, the edge devices installed on road inspection vehicles that run pothole classifiers may have only limited computational power (e.g., no GPU). In such scenarios, lightweight models are needed if real-time inference speed is required. Motivated by this application and the deployment-time constraints of pothole classifiers, we focus on the following problem: given a fixed number of training epochs and a lightweight model to be trained, what can practitioners do to improve pothole classification accuracy?

Given this problem, we explore self knowledge distillation (KD) [2] to tackle pothole classification. By self KD, we refer to KD methods that need no teacher model in advance and introduce no extra trainable parameters. Yuan et al. [2] show that KD is actually a learned label smoothing regularization and propose Tf-KD_self, a teacher-free KD method that uses the pre-trained student model itself as the teacher model. Inspired by [2] and the assumption that better teacher models result in better student models, we propose iterative self knowledge distillation (ISKD), which iteratively performs self KD by using the pre-trained student model as the teacher model. Most KD methods [3, 4, 5, 6, 7] require that the teacher model be available in advance, which is not always the case. Although there are KD methods that need no teacher model [2, 8, 9], these methods typically do not experiment on lightweight models or perform KD multiple times. Utilizing KD iteratively and requiring no teacher model, our proposed ISKD shows its efficacy on lightweight models for pothole classification, generic classification, fine-grained classification, and a medical imaging application.
Although there exist methods utilizing iterative KD [10, 11, 12], they typically require additional constraints and are validated on only a few datasets. For example, Koutini's method [10] requires training multiple models and selecting the best trained model for each class to predict pseudo labels for sound event detection. In contrast, our proposed ISKD only needs to train one model at a time, with no need to select models or predict pseudo labels. In [11, 12], the teacher model is trained until convergence for each KD iteration, and the methods are validated on only a few datasets. In addition, the accuracy gain of [11] comes from the ensemble of all the students in the history, which requires more deployment space at test time. In contrast, we show ISKD's efficacy on a wide variety of datasets even when the teacher model is only moderately trained, and ISKD does not rely on an ensemble, which is more practical for embedded devices. To the best of our knowledge, we are the first in pothole classification to use iterative self knowledge distillation when training lightweight neural networks under limited training epochs.

We make the following contributions:
(1) We propose iterative self knowledge distillation (ISKD), which outperforms the state-of-the-art self KD method Tf-KD_self [2] on the road damage dataset [13], the Nienaber potholes simplex [14] and complex [15] datasets, CIFAR-10 [16], CIFAR-100 [16], Oxford 102 Flower [17], Oxford-IIIT Pet [18], Caltech-UCSD Birds 200 [19], and COVID-19 Radiography [20] datasets, which supports the wide applicability of ISKD from pothole classification to generic, fine-grained, and medical imaging classification.
(2) We provide more evidence showing that even when the teacher model's accuracy is lower than the baseline accuracy of a classifier trained with a larger number of epochs, the student model can still outperform the baseline.
(3) ISKD can outperform the baseline under a wide range of weights balancing the objectives of ISKD, which supports that ISKD is flexible with respect to parameter selection.

Inspired by Yuan et al. [2], we propose iterative self knowledge distillation (ISKD) such that both the teacher and student models can improve over time. We illustrate ISKD in Fig. 1, where we denote the teacher/student model in the k-th KD iteration i_k as T_k/S_k. During i_1, since T_1 is not given in advance, we train S_1 using the softmax cross-entropy loss L_c as the classification loss. During i_k (k > 1, k ∈ N), we use the student model trained in i_{k-1} (i.e., S_{k-1}) as T_k, and train S_k with both L_c and the Kullback-Leibler (KL) divergence loss. Specifically, the total loss function used to train S_k during i_k can be written as

    L = L_c + α · KLD(z_t, z),

where KLD is the KL divergence, z_t / z is the output probability distribution of T_k / S_k, and α is the weight of the KLD term. During i_k, we freeze the parameters of T_k and only train S_k. We pre-train S_k from ImageNet [21], not from S_{k-1}, because we hope to reduce the chance that S_k is trapped in a possibly local optimum associated with S_{k-1}. ISKD stops at i_k if S_k shows no obvious accuracy gain over S_{k-1}. Since Yuan et al. [2] show that the teacher model need not outperform the student model in order to benefit it, we directly use the previously trained student model as the current teacher model, waiving the typical KD requirement that a teacher model be available in advance.
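To make the procedure concrete, below is a minimal PyTorch-style sketch of ISKD under the loss above. It is an illustrative sketch, not the author's released code: the helpers `build_student()` and `train_loader`, the α value, the number of iterations, and the omission of temperature scaling in the KL term are all assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def train_iskd_iteration(student, teacher, train_loader, alpha, epochs, lr, device="cpu"):
    """One ISKD iteration i_k: train S_k with L = L_c + alpha * KLD(z_t, z).
    For i_1, pass teacher=None so that only the cross-entropy term L_c is used."""
    student.to(device).train()
    if teacher is not None:
        teacher.to(device).eval()
        for p in teacher.parameters():          # freeze T_k
            p.requires_grad_(False)
    optimizer = torch.optim.SGD(student.parameters(), lr=lr,
                                momentum=0.9, weight_decay=5e-4)
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            logits = student(images)
            loss = F.cross_entropy(logits, labels)              # L_c
            if teacher is not None:
                with torch.no_grad():
                    teacher_logits = teacher(images)
                # KLD(z_t, z): student log-probabilities vs. frozen teacher probabilities
                kld = F.kl_div(F.log_softmax(logits, dim=1),
                               F.softmax(teacher_logits, dim=1),
                               reduction="batchmean")
                loss = loss + alpha * kld
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

# Outer ISKD loop: the student trained in iteration k-1 becomes the teacher of iteration k.
teacher = None
for k in range(1, 7):                                  # e.g., i_1 ... i_6
    student = build_student(pretrained=True)           # re-initialized from ImageNet weights
    student = train_iskd_iteration(student, teacher, train_loader,
                                   alpha=0.5, epochs=50, lr=0.001)  # alpha is a placeholder
    teacher = copy.deepcopy(student)                    # S_k serves as T_{k+1}
```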
We expect that ISKD improves both T_k and S_k as k increases, under the assumption that using better teacher models in KD generally results in better student models. We experiment on the road damage dataset (termed RDD) [13], the Nienaber potholes simplex (termed simplex) [14] and complex (termed complex) [15] datasets, CIFAR-10 [16], CIFAR-100 [16], the Oxford 102 Flower dataset (termed Oxford-102) [17], the Oxford-IIIT Pet dataset (termed Oxford-37) [18], the Caltech-UCSD Birds 200 dataset (termed CUB-200) [19], and the COVID-19 Radiography dataset (termed COVID) [20]. We choose these datasets to cover a diverse range of task domains, from pothole, generic, and fine-grained classification to medical imaging. The RDD, simplex, and complex datasets provide annotations of whether each image contains any pothole. The Oxford-102, Oxford-37, and CUB-200 datasets provide the images and labels of 102, 37, and 200 different species of flowers, cats and dogs, and birds, respectively. The COVID dataset includes four different types of chest X-rays: normal, COVID, lung opacity, and viral pneumonia. For all the datasets except COVID, we use the official training/testing split of each dataset. Since the COVID dataset does not provide an official training/testing split, we randomly generate the split using a 7:3 ratio.

We use the official PyTorch [22] implementations of ResNet-18 [23], SqueezeNet v1.1 [24], and ShuffleNet v2 x0.5 & x1.0 [25], modify their last layers such that the number of output nodes equals the number of classes, and pre-train them on ImageNet [21]. The four network architectures are selected based on the following criteria: (1) for ease of reproducibility, they are officially supported by PyTorch [22], which provides their weights pre-trained on ImageNet [21]; (2) considering the typically limited computational power on edge devices, we limit the number of network parameters to fewer than 12M.

We first experiment on the four network architectures mentioned previously using the RDD [13], simplex [14], and complex [15] datasets. For each KD iteration, we use the same network architecture for both the teacher and student models. We compare ISKD with the following two baselines that use the same network architecture, total number of training epochs, and learning schedule: (1) training a classifier using L_c without KD (termed the large-epoch baseline), and (2) Tf-KD_self [2], which performs KD only once without multiple KD iterations. For the extended study involving the other six datasets unrelated to potholes, we use ResNet-18 [23] and ShuffleNet v2 x1.0 [25] as the network architectures of ISKD. We conduct the extended study in the same way as the pothole classification task mentioned previously, using the same two baselines. For all the experiments, the training images are resized to 224×224, and the model is pre-trained on ImageNet [21] and fine-tuned with the training data of each dataset. We use the SGD optimizer with momentum 0.9, weight decay 5e-4, and batch size 128 to train the student model for 50 epochs in each KD iteration; the learning rate is fixed within each KD iteration but model-specific. For ResNet-18 [23] and SqueezeNet v1.1 [24], we use an initial learning rate of 0.001, while for ShuffleNet v2 x0.5 & x1.0 [25], we use an initial learning rate of 0.1.
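As a concrete illustration of this setup (a sketch under the configuration stated above, not the author's released code), the snippet below builds the four torchvision backbones with ImageNet pre-trained weights, replaces their last layers to match the number of classes, and records the model-specific initial learning rates; `num_classes` is a placeholder.

```python
import torch.nn as nn
from torchvision import models, transforms

# Training images are resized to 224x224 before being fed to the networks.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def build_backbone(name, num_classes):
    """Instantiate an ImageNet-pre-trained lightweight backbone and resize its last layer."""
    if name == "resnet18":
        net = models.resnet18(pretrained=True)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
    elif name == "squeezenet1_1":
        net = models.squeezenet1_1(pretrained=True)
        # SqueezeNet classifies with a 1x1 convolution instead of a fully connected layer.
        net.classifier[1] = nn.Conv2d(512, num_classes, kernel_size=1)
    elif name == "shufflenet_v2_x0_5":
        net = models.shufflenet_v2_x0_5(pretrained=True)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
    elif name == "shufflenet_v2_x1_0":
        net = models.shufflenet_v2_x1_0(pretrained=True)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return net

# Initial learning rates reported above (fixed within each KD iteration, model-specific).
INITIAL_LR = {
    "resnet18": 0.001,
    "squeezenet1_1": 0.001,
    "shufflenet_v2_x0_5": 0.1,
    "shufflenet_v2_x1_0": 0.1,
}
```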
Following Tf-KD_self [2], we obtain the α values by grid search on validation data sampled from the training set when experimenting on the RDD dataset [13]. Once we determine the α values from the RDD dataset, we fix them and use the same set of α values when experimenting on the other datasets (i.e., the α values are not tuned for any dataset other than RDD). We purposely do so to test whether the α values searched on one dataset are transferable and can be directly applied to other datasets. For all the other parameters, we use the default PyTorch [22] settings unless otherwise specified.

The experimental results are summarized in Table 1. E_1 ∼ E_6 list the performance of S_1 ∼ S_6, and E_7 summarizes the final performance after i_1 ∼ i_6. There is no teacher model for E_1 and the large-epoch baseline (E_8), but for the baseline using Tf-KD_self [2] (E_9), the teacher model is the trained S_1. All the student models are pre-trained from ImageNet [21] for ISKD (E_7) and the two baselines (E_8, E_9). In Table 1, the accuracy of E_2 is higher than that of E_1, which shows that self KD can improve the student model's accuracy. The accuracy of E_p is higher than that of E_q in most cases when 2 ≤ q < p ≤ 6, which validates the assumption that in self KD better teacher models result in better student models, and that self KD can be done iteratively instead of just once. Comparing E_7 with E_8 and E_9, we show that given a fixed number of training epochs, ISKD outperforms the large-epoch baseline and the state-of-the-art self KD method Tf-KD_self [2] in most cases. The fact that E_7 outperforms E_9 also serves as an ablation study supporting that self KD is better done iteratively than just once. Since Table 1 covers diverse task domains, including pothole, generic, fine-grained, and medical imaging classification, our results support that ISKD can serve as a general-purpose performance booster. In addition, we obtain the results in Table 1 by directly using the α values chosen for pothole classification without tuning them for each dataset, which supports that the α values we use are transferable across different datasets.

We also compare the accuracy of ISKD (backbone: ResNet-18 [23]) with prior methods that use backbones with more parameters in Table 2, where ISKD performs on par with or even outperforms the listed methods. This finding supports that ISKD is more parameter-efficient than the listed methods.
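Returning to the α selection described at the start of this section, the following is a minimal sketch of such a grid search; the candidate grid, the `evaluate` and `build_student` helpers, and the reuse of `train_iskd_iteration` from the earlier sketch are illustrative assumptions, not details from the paper.

```python
def search_alpha(teacher, build_student, train_loader, val_loader, evaluate,
                 candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Grid-search the KLD weight alpha on held-out validation data for one self-KD iteration.
    `evaluate(model, loader)` is assumed to return classification accuracy."""
    best_alpha, best_acc = None, -1.0
    for alpha in candidates:
        student = build_student(pretrained=True)        # fresh ImageNet-initialized student
        student = train_iskd_iteration(student, teacher, train_loader,
                                       alpha=alpha, epochs=50, lr=0.001)
        acc = evaluate(student, val_loader)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```

Consistent with the text above, such a search would be run only on validation data sampled from the RDD training set, and the chosen α values would then be reused unchanged on the other datasets.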
Furthermore, we analyze the worst-performing teacher model in self KD that can still make the student model outperform the baseline without KD (E_1 in Table 1). To gain more insight, we design the following experiment to find the accuracy relation between the teacher and student models. Given that E_1 in Table 1 is trained for 50 epochs, we use the 50 models saved after each epoch as the teacher models. We repeat E_2 in Table 1 50 times (each time with one of the 50 teacher models produced during E_1) and record the teacher-student accuracy relation. We perform this experiment on the simplex [14] and complex [15] datasets using ResNet-18 [23] as the network architecture. We show the teacher-student accuracy relation in Fig. 2.

[Fig. 2: The teacher-student accuracy relation on the simplex [14] and complex [15] datasets using ResNet-18 [23]. The gray lines are obtained from linear regression, and the line equation and Pearson's correlation coefficient R are marked in the legend. Given the baseline performance in E_1 of Table 1, the shaded blue areas are the areas where the teacher model performs worse than the baseline but the student model outperforms the baseline.]

Most of the data points are above the red line (x = y), which supports that performing self KD can make the student model outperform the teacher model. Fig. 2 suggests that the accuracy of the teacher and student models has a strong positive correlation (the slopes of the gray lines are positive and R² ≥ 0.5), which again supports the assumption that using better teacher models in KD generally results in better student models. We find that the number of data points falling into the shaded blue areas is not negligible, which serves as statistically more significant evidence than [2] that even if the teacher model is worse than the baseline, the student model can still likely outperform the baseline after KD.

Another experiment not presented in the paper of Yuan et al. [2] is the impact of the α values on accuracy. To study this, we use ResNet-18 [23] as the network architecture of ISKD and the same random seed (1), repeat E_2, E_3, and E_4 on the CUB-200 [19] dataset with different α values, and report the student's accuracy in Fig. 3, where the red lines mark the baseline accuracy (i.e., the student's accuracy in the previous KD iteration reported in Table 1). For each sub-figure of Fig. 3, the teacher model is fixed as the corresponding one used in Table 1. The triangular points in Fig. 3 are the accuracy reported in Table 1 using the α values chosen for pothole classification, so their corresponding accuracy is not necessarily the best. Fig. 3 shows that the student model outperforms the baseline in each KD iteration of ISKD under a wide range of α values, which supports that ISKD is flexible in terms of parameter selection.

[Fig. 3: The triangular points are the accuracy reported in Table 1 using the α values chosen for pothole classification. Shown as the red lines, the baseline for E_i is the accuracy of the student model in E_{i-1} reported in Table 1.]

We propose iterative self knowledge distillation (ISKD) to improve pothole classification accuracy when training lightweight models given a fixed number of training epochs. Experimenting on three pothole classification datasets and six other datasets associated with generic classification, fine-grained classification, and a medical imaging application, we show that ISKD outperforms the state-of-the-art self KD method Tf-KD_self in most cases given the same number of training epochs, and that ISKD is widely applicable to various tasks without the need of a given teacher model or extra trainable parameters. In addition, we show more evidence supporting that the student model can benefit from self KD even when the pre-trained student model (which serves as the teacher model) is only moderately trained. Our study on the impact of the weight balancing the objectives of ISKD shows that even when the weight deviates within a reasonable range from the value we initially use, the student model still improves over the KD iterations, which supports that ISKD is flexible in terms of parameter selection.
[1] American Automobile Association (AAA) pothole fact sheet.
[2] Revisiting knowledge distillation via label smoothing regularization.
[3] Heterogeneous knowledge distillation using information flow modeling.
[4] Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge.
[5] Knowledge distillation meets self-supervision.
[6] Correlation congruence for knowledge distillation.
[7] Knowledge distillation via route constrained optimization.
[8] Online knowledge distillation via collaborative learning.
[9] Regularizing class-wise predictions via self-knowledge distillation.
[10] Iterative knowledge distillation in R-CNNs for weakly-labeled semi-supervised sound event detection.
[11] Born again neural networks, in ICML.
[12] Iterative knowledge distillation for automatic check-out.
[13] Road damage detection and classification using deep neural networks with smartphone images.
[14] Kaggle dataset: Nienaber potholes 1 simplex.
[15] Kaggle dataset: Nienaber potholes 2 complex.
[16] Learning multiple layers of features from tiny images.
[17] Automated flower classification over a large number of classes.
[18] Cats and dogs.
[19] Caltech-UCSD Birds 200.
[20] COVID-19 radiography database.
[21] ImageNet large scale visual recognition challenge.
[22] PyTorch.
[23] Deep residual learning for image recognition.
[24] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.
[25] ShuffleNet V2: Practical guidelines for efficient CNN architecture design.
[26] Attention augmented convolutional networks.
[27] When vision transformers outperform ResNets without pretraining or strong data augmentations.