key: cord-0194930-2axad2to
authors: Wang, Haoliang; Zhao, Chen; Zhao, Xujiang; Chen, Feng
title: Layer Adaptive Deep Neural Networks for Out-of-distribution Detection
date: 2022-03-01
journal: nan
DOI: nan
sha: ce21215364e3a5685f0287b69fd7ea07319c3db6
doc_id: 194930
cord_uid: 2axad2to

During the forward pass of Deep Neural Networks (DNNs), inputs are gradually transformed from low-level features to high-level conceptual labels. While features at different layers can summarize the important factors of the inputs at varying levels, modern out-of-distribution (OOD) detection methods mostly rely on the features of the ending layers. In this paper, we propose a novel layer-adaptive OOD detection framework (LA-OOD) for DNNs that can fully utilize the outputs of the intermediate layers. Specifically, instead of training a unified OOD detector at a fixed ending layer, we train multiple One-Class SVM OOD detectors simultaneously at the intermediate layers to exploit the full spectrum of characteristics encoded at varying depths of DNNs. We develop a simple yet effective layer-adaptive policy to identify the best layer for detecting each potential OOD example. LA-OOD can be applied to any existing DNN and does not require access to OOD samples during training. Using three DNNs of varying depth and architecture, our experiments demonstrate that LA-OOD is robust against OODs of varying complexity and can outperform state-of-the-art competitors by a large margin on some real-world datasets.

Recently, deep neural networks (DNNs) have demonstrated remarkable performance in classification problems. However, DNNs are often designed for a static and closed world, assuming the same data distribution at training and test time. In an open-world environment, it is important to detect examples from novel class distributions in safety-critical applications (e.g., detecting new categories of objects during autonomous driving, or diagnosing unknown diseases such as COVID-19).
It is hence necessary to develop DNNs that can identify OOD examples while at the same time classifying samples from known class distributions with high accuracy. A number of recent methods have been proposed to detect OOD examples based on DNNs. The majority of these methods detect OOD examples based on predictive uncertainty measures of a softmax classifier, such as entropy [15], epistemic uncertainty [12], and others [4, 11, 18, 19]. A more recent work presents Deep-MCDD [7], which estimates a spherical decision boundary for each class based on support vector data description (SVDD). Such boundaries enclose the in-distribution (InD) samples, and OODs are distinguished based on their closest class-conditional distribution. Instead of using the last layer outputs, [1] proposed to find the best intermediate layer based on a holdout validation OOD dataset. However, all of the above methods detect OOD examples at a single level of representation (i.e., the outputs of one single layer), and they hence fail to account for the different representation complexities of OOD examples. In particular, our empirical study indicates that different OODs may be better detected at their appropriate levels of representation (see Section 4.2).

This observation motivates us to propose a novel framework, namely Layer-Adaptive OOD detection (LA-OOD), a generic modification to off-the-shelf DNNs that introduces OOD detectors at intermediate layers. Specifically, we train separate One-Class SVM (OCSVM) OOD detectors using the outputs of different layers and employ a simple yet effective layer-adaptive policy function to identify the best layer for detecting each potential OOD sample (see Figure 1). We tune the OOD detectors through self-adaptive data shifting [16] to improve their accuracy and robustness against unseen OODs, and fine-tune the framework using alternating optimization, in which the DNN classification error and the training errors of the OOD detectors are minimized jointly.
The main contributions are stated as follows:
- We propose a novel layer-adaptive OOD detection framework (LA-OOD) that is practical for any off-the-shelf DNN. Multiple OOD detectors are attached to the intermediate layers of a DNN. Through a simple yet effective layer-adaptive policy, our proposed framework is able to fully utilize the intrinsic characteristics of inputs encoded in the intermediate latent space and hence detect OODs of varying complexity.
- We propose a joint objective that fine-tunes the OOD detectors while maintaining the classification accuracy of the DNN. We also design an OOD confusion metric and a Grad-CAM visualization tool to facilitate decision making and improve model interpretability.
- Extensive experiments have been conducted to demonstrate the effectiveness of our proposed framework. On three DNNs of varying depth and architecture, using two InD datasets and five OOD datasets, LA-OOD outperforms state-of-the-art baseline methods in most settings without any OOD training or validation samples, making it a practical yet effective OOD detection framework for OODs of different complexity.

Dynamic Neural Networks with Early-Exit. Adaptive early-exit is a rising research topic in deep learning. By attaching early exits to a DNN, such methods allow "simple" samples to be output at early layers without "overthinking" [5, 6]. For a given input, an early exit can be determined by either a confidence metric [9] or a learned decision function [2]. However, these methods aim to improve DNN performance on InD samples and give little attention to OODs. In this paper, we adopt the idea of early exits for the out-of-distribution detection problem and propose a novel framework in which each OOD sample is detected at its best layer.

OOD Detection for Deep Neural Networks.
In recent years, researchers have developed a number of OOD detection methods, the majority of which use the final outputs of a DNN to separate the OODs from the InD samples [15]. [4] proposes a baseline method that detects OODs based on the maximum softmax probabilities of a DNN's final outputs. ODIN [11] incorporates temperature scaling and input perturbation into the maximum softmax probabilities to enlarge the margin between InD and OOD samples. More recently, [7] extends Deep-SVDD to a multi-class setting and proposes Deep-MCDD, which integrates multiple SVDDs into a single deep model where each SVDD is trained to enclose the samples of one InD class. However, these works mainly focus on the high-level conceptual features output by the ending layers of DNNs while ignoring the low-level representations at the intermediate layers. Hence, they may "overthink" the problem and fail on OODs of relatively low complexity. In contrast, LA-OOD considers not only the outputs of the ending layers but also those of the intermediate layers to generate more accurate OOD predictions.

Two existing methods [1, 8] utilize intermediate outputs of a DNN for OOD detection. [8] defines the confidence score of an input as a weighted average of the Mahalanobis distances to the closest class-conditional distribution at each layer, where the weighting function is trained using an additional validation set. [1] proposes OODL, which decides an optimal discernment layer based on a holdout OOD dataset. Both methods require OOD samples during training. Such OOD samples are not only hard to obtain in real-world applications, but also make the trained models susceptible to unseen OODs. In this work, we tune the OOD detectors using pseudo-OODs generated through self-adaptive data shifting [16] of the InD training samples, and hence do not require any OOD samples during training.
Since OOD samples are rarely available during training, we formulate OOD detection as a one-class classification problem, in which the OOD detectors only aim to determine whether an input is in-distribution or not. Let $x \in \mathcal{X}$ be an input and $y \in \mathcal{Y} = \{1, \cdots, K\}$ be its label. A deep neural network $M$ with $L$ layers tries to classify each input into one of $K$ classes: $y = M(x) \in \mathcal{Y}$. Given the intermediate output $x^{(\ell)}$ at layer $\ell \in \{1, \cdots, L\}$, its OOD score $s^{(\ell)} = C_\ell(x^{(\ell)})$ is computed by a layer-specific OOD detector $C_\ell$. Separate OOD detectors can be attached to different layers of $M$, and the final OOD score of $x$ is obtained by taking the maximum of the OOD scores output by all the OOD detectors: $s_{\text{final}} = \max \{C_\ell(x^{(\ell)})\}_{\ell=1}^{L}$. This OOD score can then be used to determine whether $x$ is in-distribution or not based on a predefined threshold $\delta$.

In the context of one-class classification, there are many possible choices for the OOD detector (KDE, GMM, k-NN, etc.). In this paper, we use the One-Class Support Vector Machine (OCSVM) [13], which is one of the most commonly used one-class classifiers in the literature. Note that OCSVM could be replaced with any other one-class classifier, as our framework design does not depend on a specific choice. In an OCSVM, a feature mapping $\Phi$ maps each input $x \in \mathbb{R}^d$ into a high-dimensional feature space $\mathcal{F}$. The OCSVM then tries to find the hyperplane that separates all the input samples from the origin such that its distance to the origin is maximized. Normally, the explicit calculation of the feature mapping $\Phi$ is avoided by using the kernel trick $k(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))$. In this paper, we select the commonly used Gaussian Radial Basis Function (RBF) kernel:

$$k(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right),$$

where $\gamma$ is the kernel width parameter.
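The formulation above can be illustrated with a minimal sketch, written in plain Python and not taken from the authors' code. The function names `rbf_kernel`, `final_ood_score`, and `is_ood` are hypothetical, the example scores are made up, and the rule "a score above $\delta$ means OOD" is an assumption that follows the sign convention the paper states for the default zero threshold.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

def final_ood_score(layer_scores):
    """Final score: the maximum of the per-layer OOD scores s^(l)."""
    return max(layer_scores)

def is_ood(layer_scores, delta=0.0):
    """Assumption: an input is flagged as OOD when its final score exceeds delta."""
    return final_ood_score(layer_scores) > delta

# Hypothetical per-layer scores for a single input (illustrative values only).
scores = [-0.30, 0.12, -0.05]
print(final_ood_score(scores), is_ood(scores))
```

Taking the maximum means that a single confident detector at any depth is enough to flag the input, which is what lets an early layer catch a low-complexity OOD that the ending layers miss.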
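Using Lagrange multipliers, optimizing the OCSVM $C_\ell$ at layer $\ell$ is equivalent to solving the following dual Quadratic Programming (QP) problem:

$$\min_{\alpha^{(\ell)}} \ \frac{1}{2} \sum_{i,j} \alpha_i^{(\ell)} \alpha_j^{(\ell)} k(x_i^{(\ell)}, x_j^{(\ell)}) \quad \text{s.t.} \quad 0 \le \alpha_i^{(\ell)} \le \frac{1}{\nu n}, \quad \sum_{i} \alpha_i^{(\ell)} = 1, \tag{1}$$

where $\alpha^{(\ell)} = \{\alpha_i^{(\ell)}\}_{i=1}^{n}$ are the Lagrange multipliers, $n$ is the number of training samples, and $\nu \in (0, 1]$ is the upper bound on the fraction of training errors. Given an input sample $x$ and its layer outputs $x^{(\ell)}$, its OOD score at layer $\ell$ is calculated using the decision function:

$$s^{(\ell)} = C_\ell(x^{(\ell)}) = \rho^{(\ell)} - \sum_{i} \alpha_i^{(\ell)} k(x_i^{(\ell)}, x^{(\ell)}), \tag{2}$$

where the offsets $\rho^{(\ell)}$ can be recovered by $\rho^{(\ell)} = \sum_{j} \alpha_j^{(\ell)} k(x_j^{(\ell)}, x_i^{(\ell)})$ for any $i$ with $0 < \alpha_i^{(\ell)} < 1/(\nu n)$. Positive scores represent OODs, and negative scores represent InDs (assuming the default zero threshold is used, i.e., $\delta = 0$).

Given a pre-trained DNN model $M_\theta$ parameterized by $\theta$, and using the OCSVMs as OOD detectors, we propose a joint objective for training both the backbone model and the OOD detectors:

$$\min_{\theta, \{\alpha^{(\ell)}\}_{\ell=1}^{L}} \ \mathcal{L}(\theta) + \lambda \sum_{\ell=1}^{L} \mathcal{L}_\ell(\theta, \alpha^{(\ell)}). \tag{3}$$

Here the first term $\mathcal{L}(\theta)$ denotes the loss function of the backbone network, and the second term is the sum of the losses $\mathcal{L}_\ell$ of all the OOD detectors, multiplied by a regularization parameter $\lambda > 0$. We aim to fine-tune the layer-dependent feature representations and the parameters of the layer-dependent OCSVMs jointly, so that the training errors of the OOD detectors are minimized while the classification accuracy of the DNN is maintained. To solve Eq. (3), an alternating optimization technique is applied in which $\theta$ and $\{\alpha^{(\ell)}\}_{\ell=1}^{L}$ are updated alternately. In step I, we fix the estimated dual coefficients $\{\alpha^{(\ell)}\}_{\ell=1}^{L}$ of all OCSVMs and re-estimate the backbone model parameter $\theta$:

$$\theta \leftarrow \arg\min_{\theta} \ \mathcal{L}(\theta) + \lambda \sum_{\ell=1}^{L} \mathcal{L}_\ell(\theta, \alpha^{(\ell)}).$$

In step II, we fix the backbone model and update the intermediate outputs for the training samples. Based on the newly generated outputs, we then re-train all the OOD detectors using Eq. (1).

Two important hyper-parameters for OCSVM training are the Gaussian kernel width $\gamma$ and the training error upper bound $\nu$. $\gamma$ controls the smoothness of the decision boundary: the smaller the $\gamma$, the smoother the decision boundary. $\nu$ controls the error ratio; it is often tuned to reject the noisy samples in the training set, and it also determines a lower bound on the fraction of support vectors.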
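To make the roles of $\gamma$ and $\nu$ concrete, the following is a minimal sketch of how a trained OCSVM could score one layer's features. It assumes the score takes the form $\rho - \sum_i \alpha_i k(x_i, x)$, one form consistent with the statement that positive scores represent OODs. The helper names and the toy support vectors are hypothetical and are not taken from the authors' implementation.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

def ocsvm_ood_score(x, support_vectors, alphas, rho, gamma):
    """Assumed score form: rho - sum_i alpha_i * k(x_i, x); positive means OOD."""
    weighted = sum(a * rbf_kernel(sv, x, gamma)
                   for sv, a in zip(support_vectors, alphas))
    return rho - weighted

def alpha_upper_bound(nu, n):
    """Box constraint of the standard nu-OCSVM dual: alpha_i <= 1 / (nu * n)."""
    return 1.0 / (nu * n)

# Toy model (illustrative values only): two support vectors with equal weights.
support_vectors = [[0.0, 0.0], [1.0, 0.0]]
alphas = [0.5, 0.5]  # dual coefficients sum to 1
gamma = 0.5

# Recover the offset rho by evaluating the kernel expansion at a support vector.
rho = sum(a * rbf_kernel(sv, support_vectors[0], gamma)
          for sv, a in zip(support_vectors, alphas))

near = ocsvm_ood_score([0.5, 0.0], support_vectors, alphas, rho, gamma)
far = ocsvm_ood_score([5.0, 5.0], support_vectors, alphas, rho, gamma)
print(near < 0, far > 0)  # the nearby point scores as InD, the distant one as OOD
```

In this sketch $\gamma$ sets how quickly the kernel response decays with distance from the support vectors, and $\nu$ enters only through the cap $1/(\nu n)$ on each dual coefficient.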
These two hyper-parameters are critical for OCSVM to achieve good performance. In general, they are tuned using a held-out validation set that includes both InD and OOD samples. In this work, we adopt self-adaptive data shifting [16] to generate pseudo-OODs for hyper-parameter tuning. Such pseudo-OODs are created purely from InD samples through edge pattern detection [10]. We summarize our LA-OOD training procedure in Algorithm 1.

Given the OOD scores produced by all the OOD detectors for an input $x_i$, we either need to define a threshold for each of these OOD detectors or design a decision policy that consolidates all the OOD scores into a final prediction. Empirically, we found that a layer-adaptive policy performs better than fixed thresholds, as it is very common for the predictions of the OOD detectors to diverge from each other (see Section 4.3). Here we choose a simple yet effective layer-adaptive policy that propagates the most confident opinion among all OOD detectors as the final prediction. Specifically, the policy is designed as

$$s_{\text{final}}(x_i) = \max_{\ell \in \{1, \cdots, L\}} C_\ell(x_i^{(\ell)}).$$

One challenge for such a policy design is that OCSVMs trained on different features generally produce scores on different scales. This effect can be alleviated by normalizing the training features for each OCSVM. Here we simply use standardization: $x' = (x - \bar{x})/\sigma$, with $\bar{x}$ being the sample mean and $\sigma$ being the standard deviation.

Empirical Settings¹. (1) Datasets. Two InD datasets (CIFAR10 and CIFAR100) and five OOD datasets (LSUN, Tiny ImageNet, SVHN, DTD [3], and Pure Color) are considered in the experiments. The "Pure Color" dataset is a synthetic dataset that contains 10,000 randomly generated pure-color images.
¹ The source code and datasets are available at: https://github.com/haoliangwang86/LA-OOD

For each InD-OOD combination, we construct a training set using all the training images in the InD dataset and form a balanced test set using the test images of both the InD and OOD datasets. When the sizes of the two test sets differ, we randomly select images from the larger dataset to match the test sample size of the smaller one. All images are down-sampled to 32 × 32 resolution using Lanczos interpolation. (2) Backbone Models. We evaluate our method using three CNNs that are popular in computer vision and machine learning studies. In particular, we select VGG-16, ResNet-34, and DenseNet-100 to demonstrate the effectiveness of our framework for DNN models of varying depth and architecture. (3) Feature Reduction. A feature reduction operation is applied to the intermediate outputs to maintain scalability [1]. Among the pooling methods we tested (max/average pooling with various sizes and global max/average pooling), global average pooling performs the best. The pooled features are then standardized using the training set mean and standard deviation. (4) Hyper-parameter Tuning. We fix $\nu$ to 0.001 so that only a small number of InD samples are considered as noise. $\gamma$ is tuned using pseudo-OODs generated by self-adaptive data shifting [16] of only the InD training samples. We search $\gamma$ in [0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]. For different InD-backbone pairs, we shrink this value range to accommodate the differences in feature complexity and to reduce training time. (5) Baseline Methods and Evaluation Metrics. We compare our method with four state-of-the-art OOD detection baselines: MSP [4], ODIN [11] (both temperature scaling and input preprocessing are used to achieve optimum performance), OODL [1] (we use iSUN [17] as an additional OOD dataset to find its optimal discernment layer), and Deep-MCDD [7].
Three commonly adopted OOD detection metrics are used: AUROC, AUPR, and FPR at 95% TPR.

The experimental results are reported in Table 1. The mean values of each evaluation metric are also reported to demonstrate the overall performance on OOD datasets of varying complexity. It is worth noting that previous works often use linear interpolation for the down-sampling operation [1, 7, 8, 11]. However, we found that linear interpolation creates severe aliasing artifacts, which make such OOD samples easily detectable. Therefore, to generate more genuine OOD samples, we down-sampled the OOD images using Lanczos interpolation, which is much more sophisticated than linear interpolation. From Table 1, it can be seen that OODs of higher complexity are harder to detect, such as the LSUN and Tiny ImageNet images, which can contain complex backgrounds or multiple objects in a single image. OODs of lower complexity are easier to detect, such as SVHN, which contains cropped street view house numbers, or DTD, which contains images of different textures. The synthetic Pure Color dataset is of the lowest complexity, as it contains limited information. Such dataset complexity can be easily verified using entropy or energy metrics. The OOD detection methods that utilize the features of the ending layers (MSP, ODIN, and Deep-MCDD) generally perform well in detecting OODs of higher complexity, such as the LSUN and Tiny ImageNet datasets. However, they tend to give poor decisions for OODs of lower complexity, such as the SVHN, DTD, and Pure Color datasets.
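As a reference for reading Table 1, the FPR at 95% TPR metric can be sketched as follows. This is a minimal illustration, assuming that OOD is treated as the positive class and that higher scores indicate OOD; the paper does not state which class it treats as positive, and some works use the opposite convention. The function name `fpr_at_tpr` and the example scores are hypothetical.

```python
import math

def fpr_at_tpr(ind_scores, ood_scores, target_tpr=0.95):
    """Fraction of InD samples flagged as OOD at the largest threshold
    that still flags at least target_tpr of the OOD samples.

    Assumption: OOD is the positive class and higher scores mean OOD.
    """
    ranked = sorted(ood_scores, reverse=True)
    k = math.ceil(target_tpr * len(ranked))  # number of OODs that must be caught
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ind_scores if s >= threshold)
    return false_positives / len(ind_scores)

# Illustrative scores only (not results from the paper).
ood_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ind_scores = [-0.5, -0.2, 0.0, 0.06]
fpr = fpr_at_tpr(ind_scores, ood_scores)
print(fpr)
```

A lower value is better: it means fewer InD samples are wrongly rejected when the detector is tuned to catch 95% of the OODs.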
The OODL baseline method can utilize the intermediate features, yet the performance evaluation shows that OODL exhibits the same performance pattern as MSP, ODIN, and Deep-MCDD. This is because LSUN and Tiny ImageNet have a complexity similar to that of the iSUN dataset, which is used to determine the optimal discernment layer for OODL. When the test OODs differ in complexity from iSUN, its performance can degrade significantly. Through multiple intermediate OOD detectors and the layer-adaptive policy, LA-OOD can exploit the full-spectrum characteristics encoded in different intermediate layers. Specifically, by taking the outputs of the early layers into consideration, LA-OOD outperforms the other four baseline methods by a large margin on OOD datasets of lower complexity (SVHN, DTD, and Pure Color). More importantly, LA-OOD achieves the best average AUROC, AUPR, and FPR at 95% TPR for all InD-backbone settings, which indicates that our proposed method is robust against OODs of different complexity. Overall, LA-OOD achieves an 8.21% improvement margin on AUROC, a 7.8% improvement margin on AUPR, and a 29.98% improvement margin on FPR at 95% TPR compared to the second-best baseline method.

As the layers of a DNN go deeper, more complex features can be learned [20]. By attaching OOD detectors to the intermediate layers, we can detect OODs based on features of different complexity. Figure 2 shows the number of OODs identified by the different OOD detectors. For the LSUN and Tiny ImageNet OOD datasets, which are of higher complexity, most samples are identified by the last two OOD detectors, while for the other three OOD datasets, which have relatively lower complexity, samples are mainly detected by the first seven detectors. In Figure 3 we show the Tiny ImageNet samples correctly identified by the OOD detectors at different layers, using the VGG backbone and CIFAR10 as the InD dataset.
It can be seen that the OOD detectors at the initial layers are more sensitive to image colors and textures, which relate to the fine-scale details of the input images, while the OOD detectors at the ending layers tend to detect OODs based on objects or scenes. As the layers go deeper, increasingly complex OODs can be detected. A similar pattern can also be found on the DTD dataset, as shown in Figure 4.

Disagreement between the OOD detectors indicates that their predictions are inconsistent and confused. Here we define a confusion score $D(x) = \sum_{\ell=1}^{L} C_\ell(x^{(\ell)})$ to measure the prediction divergence between the OOD detectors. For a good OOD detector, this confusion score should be negative for most of the InD test samples and positive for predicted OODs; confusion occurs when the confusion score is close to 0. We expect this confusion metric to be a reliable indicator of cases where the framework is unable to make a confident prediction and may have misclassified a test sample. Such an indicator is of significant importance in handling errors, given the possibly severe impact of false positives in real-world applications. We performed a confusion analysis on the VGG backbone, using CIFAR10 as InD and SVHN as OOD. The confusion scores are shown in Figure 5. While the InD samples tend to have small negative values (with an average of -0.16), the OOD samples are more concentrated on the positive side (with an average of 0.02). More importantly, the majority of the InD samples (99.78%) have negative confusion scores, which makes the confusion analysis highly reliable and less prone to false positives. Confusion happens when the confusion score is close to zero; depending on the application, a threshold can be determined based on the tolerance for misclassification. To address this error mitigation problem, we extend the confusion analysis by designing a visualization tool for image OOD detection.
Specifically, we adopt Grad-CAM [14] to show the root causes of the OOD predictions in the input space. The analysis is continued on the VGG backbone with CIFAR10 as the InD setting. As the OOD dataset, we use Tiny ImageNet, since its class definitions are the most closely related to those of CIFAR10. Some examples are shown in Figure 6 to illustrate the disagreement between two OOD detectors, $C_4$ and $C_9$. The numbers below the heatmaps are their corresponding OOD scores, with red representing an OOD prediction and green representing an InD prediction. We can see that the OOD detectors at the early layers are more sensitive to textures and colors, while the OOD detectors at the ending layers are more focused on objects and scenes.

An optimal discernment layer [1] (or best layer) can be found for a particular OOD dataset, but it may not be the optimal choice for OOD datasets of different complexity. In Figure 7 we show the AUROC of SVHN and LSUN at each layer of VGG-16 (using CIFAR10 as InD). The best layer for SVHN is layer 5, while the best layer for LSUN is the last layer. Such a best layer could be estimated using a separate OOD dataset. However, as Table 1 shows, the performance of OODL, which estimates the best layer using the iSUN dataset, can degrade significantly when OODs of different complexity are encountered. Therefore, instead of choosing the best layers for different OODs, LA-OOD propagates the most confident OOD prediction across all layers and can effectively construct a good OOD confidence measure for unseen OODs. For all five OOD datasets considered in this paper, LA-OOD achieves competitive or even better accuracy compared to their corresponding best layers.

Here we evaluate the effectiveness of the "early exits". We compare the results of the proposed LA-OOD with the average performance of each OOD detector used alone on the mixture of the five OOD datasets (LSUN + Tiny ImageNet + SVHN + DTD + Pure Color). Results are shown in Table 2.
Using VGG-16 as an example, for both CIFAR10 and CIFAR100 InD settings, LA-OOD can achieve consistently better performance than any single OOD detector.

We proposed LA-OOD, a layer-adaptive OOD detection framework for deep neural networks. By attaching multiple intermediate OOD detectors to the DNNs, LA-OOD can fully exploit the intrinsic characteristics of the intermediate latent space and reveal OODs of increasing complexity at deeper layers. Extensive experiments have been conducted to verify the effectiveness and interpretability of LA-OOD. On three DNNs of varying depth and architecture, our framework outperforms the state-of-the-art baselines without using any OOD training or validation data, making it a reliable method for detecting unseen OODs.

[1] Detecting out-of-distribution inputs in deep neural networks using an early-layer output
[2] Adaptive neural networks for efficient inference
[3] Describing textures in the wild
[4] A baseline for detecting misclassified and out-of-distribution examples in neural networks
[5] Multi-scale dense convolutional networks for efficient prediction
[6] Shallow-deep networks: Understanding and mitigating network overthinking
[7] Multi-class data description for out-of-distribution detection
[8] A simple unified framework for detecting out-of-distribution samples and adversarial attacks
[9] The cascading neural network: building the internet of smart things
[10] Selecting critical patterns based on local geometrical and statistical information
[11] Enhancing the reliability of out-of-distribution image detection in neural networks
[12] Predictive uncertainty estimation via prior networks
[13] Support vector method for novelty detection
[14] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[15] Out-of-distribution detection using an ensemble of self supervised leave-out classifiers
[16] Hyperparameter selection of one-class support vector machine by self-adaptive data shifting
[17] Turkergaze: Crowdsourcing saliency with webcam based eye tracking
[18] Rank-based multi-task learning for fair regression
[19] Fairness-aware online meta-learning
[20] Interpreting deep visual representations via network dissection