title: Model Rectification via Unknown Unknowns Extraction from Deployment Samples
authors: Abrahao, Bruno; Wang, Zheng; Ahmed, Haider; Zhu, Yuchen
date: 2021-02-08

Model deficiency that results from incomplete training data is a form of structural blindness that leads to costly errors, oftentimes with high confidence. During the training of classification tasks, underrepresented class-conditional distributions that a given hypothesis space can recognize result in a mismatch between the model and the target space. To mitigate the consequences of this discrepancy, we propose Random Test Sampling and Cross-Validation (RTSCV) as a general algorithmic framework that aims to perform post-training model rectification at deployment time in a supervised way. RTSCV extracts unknown unknowns (u.u.s), i.e., examples from the class-conditional distributions that a classifier is oblivious to, and works in combination with a diverse family of modern prediction models. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover u.u.s via cross-validation, without relying on active learning or budgeted queries to an oracle. We contribute a theoretical analysis that establishes performance guarantees based on the design bases of modern classifiers. Our experimental evaluation demonstrates RTSCV's effectiveness, using 7 benchmark tabular and computer vision datasets, by reducing a performance gap as large as 41% from the respective pre-rectification models. Last, we show that RTSCV consistently outperforms state-of-the-art approaches.

Data quality constitutes a critical factor affecting the performance of prediction models. In particular, incomplete training data frequently results in structural mismatches between data-driven trained models and the respective target space in which they are supposed to be deployed, which makes most classifiers susceptible to systematic errors due to their limited ability to rectify a model post-training. In scenarios of increasing dependence on algorithmic decisions in high-stakes situations, deficient models result in costly (sometimes fatal) errors, unfairness, and other problems. For example, an autopilot system may fail to recognize peculiar traffic signs it has never encountered during training, leading to accidents. In the case of automated recruiting, data from industries dominated by a given gender may result in biased classifiers, likely to reject examples of the opposite gender due to the lack of enough successful observations that belong to that class. In addition, unseen joint distributions of features and "data drift" may contribute to high-confidence errors. For instance, when training a classifier to distinguish between white dogs and black cats, the model may predict "cat" with high confidence when presented with a black dog at deployment time [Lakkaraju et al., 2017]. As for "data drift," the structure of the target space may change over time and deviate from the trained model. Take, for example, the anecdotal account at the beginning of the COVID-19 pandemic, where physicians attempted to identify what type of "bacteria" had been causing an unusually high number of "pneumonia" cases, overlooking the fact that there was a new type of agent, i.e., a novel virus affecting the respiratory system.
We focus on classification tasks where a trained model may be oblivious to some of the domain-specific class-conditional distributions that a set of hypotheses can recognize. Data examples from these "invisible" joint distributions of features that a classifier is oblivious to, i.e., "hidden classes", form the unknown unknowns (u.u.s), which cause a prediction model to make errors with high confidence. This definition encompasses other terms researchers use in different contexts. For instance, in the literature that addresses over-confident softmax predictions of neural networks, especially in computer vision, the term out-of-distribution (OOD) samples refers to the same concept [Hendrycks and Gimpel, 2017, Liang et al., 2018, Lee et al., 2018, Liu et al., 2020]. In addition, researchers have named the problem of classification with u.u.s the Open Set Recognition (OSR) problem [Scheirer et al., 2013], due to the contrast between the "open" nature of discovering u.u.s and the traditional closed-set scenario, where the training and test classes match.

We contribute to the mitigation of the u.u.s problem by proposing Random Test Sampling and Cross-Validation (RTSCV), a general algorithmic framework that aims to perform a post-training model rectification of a base classifier at deployment time in a supervised way. RTSCV aims to reduce the structural mismatch between a trained model and the target space by extracting u.u.s from samples of the target space. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover unknown unknowns via cross-validation. Our key insight is that by augmenting a training set that possesses m classes with a dummy class, labeled m + 1, whose examples come from a test set sample, cross-validation is likely to decouple examples that belong to known classes from m + 1, due to the high variance and broad boundary of this dummy class. Conversely, u.u.s coming from separable classes may share greater affinity with the dummy class, as they are expected to be poor fits to the known classes and because the decision boundary around the dummy class may have been established with the contribution of examples from the u.u.s in the test sample. RTSCV bears two advantages compared to previous methods. First, RTSCV can work in combination with a diverse family of modern classifiers. The bulk of existing methods for identifying u.u.s focus on modifying specific classification methods, such as SVM, KNN, and DNNs, in such a way as to include a free parameter that can be learned at the deployment phase. This allows the method to predict u.u.s as possible outputs [Scheirer et al., 2013, Júnior et al., 2016]. However, unlike RTSCV, these methods do not generalize, as they are classifier-specific. We note that RTSCV can be easily paired with any trained classifier. Second, RTSCV relies solely on the use of a sample of the test data, which removes assumptions made by approaches like active learning that are often challenging to operationalize in practice, such as the existence of budgeted queries to an oracle [Vandenhof and Law, 2019, Lakkaraju et al., 2017, Simard et al., 2017]. We contribute a theoretical analysis with performance guarantees based on the design bases of modern classifiers, including Maximum Likelihood Estimation, the Bayes classifier, and the Minimum Mahalanobis Distance.
In an extensive experimental evaluation, we use 7 benchmark datasets, comprising tabular data and more challenging computer vision datasets such as CIFAR-10, CIFAR-100, and SVHN, with ResNet and DenseNet as base models. Our results suggest that RTSCV is a promising direction for post-training rectification of a base classifier, reducing a performance gap (Accuracy, F-measure, AUROC) as large as 41%. Moreover, our results indicate that RTSCV consistently outperforms state-of-the-art approaches and baselines by a significant margin.

The conceptual idea of a "class" is often subjective, and different hypothesis spaces will separate the feature space into different class-conditional distributions. Here we employ a working definition of u.u.s classes via a geometric argument. That is, given a fixed hypothesis space, the u.u.s form separable clusters in the feature space that are distinguishable from the known structures in the target space. This definition is without loss of generality, as it allows for any abstraction of conceptual blindness to examples. We note that u.u.s are not the only source of prediction errors. To delineate the aims of RTSCV, here we discuss different types of errors and the scope in which RTSCV operates. Let f be a classifier and consider the hypothesis space produced by this model, i.e., the set of all functions that can be returned by it. We assume that the model is consistent, that is, if a suitable function exists in the hypothesis space, training will produce it. Further, let E be the Bayes error, or the irreducible error. If E(H) is the lowest error we could produce with hypothesis space H, and E(H, D) is the minimum error we produce with H and the available training data D, then E(H, D) − E represents the overall generalization error given H and D, which can be decomposed as the sum of E(H, D) − E(H) and E(H) − E. We call the first difference the estimation error and the second difference the approximation error. That is, the model may produce errors due either to deficiencies of the model (approximation error) or to the training data (estimation error), such as u.u.s. In this paper, we focus on the latter, i.e., reducing the estimation error under the assumption of a fixed hypothesis space.
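For reference, the decomposition just described can be written compactly as follows (this is a restatement of the text above, with E denoting the Bayes error):

```latex
\underbrace{E(H, D) - E}_{\text{generalization error}}
  \;=\; \underbrace{E(H, D) - E(H)}_{\text{estimation error}}
  \;+\; \underbrace{E(H) - E}_{\text{approximation error}}
```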
We also assume that the data are free of mislabeling errors. We emphasize the distinction between u.u.s detection and outlier detection. Outliers are rare extreme values, produced by the realization of (possibly known) class-conditional distributions. As outliers tend to be isolated from any cluster in the feature space, we distinguish them from u.u.s, which are exemplary of clusters generated by some joint distribution of their features, but whose structure is invisible to the trained model. As such, detecting outliers is beyond the scope of our work. Also related to our approach is zero-shot learning, which assumes that the test set includes classes unseen during training. It aims to discover u.u.s by composing unseen joint distributions of features from prescribed combinations of attributes among known classes. This method is not comparable to our approach, as we do not make use of such side information.

Early work focused on extending classical machine learning algorithms to enable u.u.s prediction. Prominent examples are SVM-based methods, such as [Scheirer et al., 2013], which proposed the 1-vs-Set machine that separates the feature space with an additional hyperplane parallel to the hyperplane obtained from the SVM. It then optimizes the open space risk for this linear kernel slab model. To further reduce the open set risk, follow-up work proposed the W-SVM, which incorporates non-linear kernels under a compact abating probability (CAP) model. Another similar approach is the PI-SVM by [Jain et al.]. Besides modifications to SVM, [Júnior et al., 2016] introduced the OSR version of the Nearest Neighbor classifier (OSNN), based on a threshold method that relies on measurements of the distance of a u.u.s sample from the known space. To address large and high-dimensional datasets, recent approaches proposed to modify Deep Neural Networks (DNNs). A baseline was proposed by [Hendrycks and Gimpel, 2017], formalizing the observation that softmax predictions may assign high confidence to erroneously classified out-of-distribution samples (u.u.s). [Liang et al., 2018] designed the ODIN detector, which better differentiates the confidence scores between in-distribution and out-of-distribution samples in the target space by combining temperature scaling and input perturbation. Using probabilistic modeling, specifically Gaussian discriminant analysis (GDA), [Lee et al., 2018] modeled the softmax outputs of known classes as class-conditional Gaussian distributions, and then used each test sample's closest Mahalanobis distance (MD) to these Gaussian distributions as the confidence score. As a modification to this approach, [Lee et al., 2020] replaced the MD confidence score with the class-conditional log-likelihood. For DNNs, researchers have also focused on changing the network architecture to adapt it to u.u.s detection, such as OpenMax [Bendale and Boult], CROSR [Yoshihashi et al.], and C2AE [Oza and Patel]. The surveys by [Geng et al., 2020] and [Boult et al., 2019] provide a comprehensive review of these methods. For all of the preceding approaches, we argue that the modification of existing classifiers is model-specific. In contrast, RTSCV is a general algorithmic framework capable of working with any classifier.

A different line of research addressed the u.u.s in an incremental or active learning manner. [Rudd et al., 2018] formulated a theoretically sound classifier, the Extreme Value Machine (EVM), grounded in Extreme Value Theory, which is able to perform nonlinear, kernel-free, variable-bandwidth incremental learning. [Vandenhof and Law, 2019] and [Lakkaraju et al., 2017] both proposed hybrid frameworks combining human crowdsourcing and algorithmic methods, in which some priors on the u.u.s are extracted by experts whose feedback guides adjustments of the trained model. By adopting an active learning environment, these approaches can potentially cope with a dynamic feature space. Nevertheless, requiring the presence of an oracle is oftentimes unrealistic, as it may be expensive in time and human labor, and therefore does not scale to many real-world applications. Our proposal offsets these shortcomings by relying only on the analysis of a data sample at deployment time. The idea of using cross-validation to rectify incorrect data was previously explored to identify mislabeled training data [Brodley and Friedl, 1999]. In their noise-reduction approach, cross-validation is performed over the training set, and mislabeled examples are those whose "pseudo-labels" after cross-validation differ from their original labels.
While their work focuses on identifying mislabeled data, RTSCV aims to augment a model with new labels for the data examples generated by class-conditional distributions that were not contemplated at training time, which allows for the correct classification of these examples.

We set our scope on multi-class classification tasks. Let f be the input base classifier, which we treat as a black box. Let X = {X_1, X_2, ..., X_m} be the training set with labels {1, 2, ..., m}. Let Y be the test set (as a representative of the target space). To detect the u.u.s and rectify a trained model, we present RTSCV in Algorithm 1. In summary, we first randomly sample the test set Y, for a given fraction c, to obtain a sample X_s of cardinality c · |Y|, from which we create a new dummy class with label m + 1. Note that X_s may contain examples from both the known and (potentially multiple) u.u.s classes. We then augment the original training set with the examples from X_s, resulting in a new intermediate training set X, to which we apply cross-validation in combination with the base classifier f.

Algorithm 1: RTSCV
Input: training set X with labels {1, 2, ..., m}; test set Y; sample rate c; base classifier f; number of cross-validation folds k.
1. Randomly sample the test set Y to obtain a subset X_s such that |X_s| = c · |Y|.
2. Assign label m + 1 to X_s.
3. Obtain an augmented training set X ← X ∪ X_s with labels {1, 2, ..., m + 1}.
4. Run a k-fold cross-validation on X.
5. Let X_u be the set of samples with predicted label m + 1 during cross-validation.
6. Label the samples in X_u with label m + 1.
7. Obtain the rectified training set X ← X ∪ X_u with labels {1, 2, ..., m + 1}.
8. Train f on X.
9. Return f.

Figure 1: RTSCV performance versus the sample-training ratio on two benchmark datasets. Multiple runs were recorded to calculate the 95% confidence intervals. Plot (a) corresponds to openness 9.3%; in (b), the accuracy on the y-axis stands for the combined, overall classification accuracy.

Intuitively, RTSCV relies on the correct re-classification of the samples in X_s during cross-validation, according to whether they belong to a known or a u.u.s class. Samples from the test set make up a high-variance class X_s, whose boundary encompasses all other classes (i.e., it may contain examples from any class). In light of this, examples that belong to known classes are likely placed in the correct classes due to the conciseness and specificity of their representation. On the contrary, u.u.s are classified as members of m + 1 due to their dissimilarity with all other classes and to their affinity with some of the examples that contributed to the position of the decision boundary around class m + 1. The examples classified during cross-validation with label m + 1 make up a new set X_u (the u.u.s class), which we adjoin to the original training set X to form the rectified training set X. The sample rate c is a critical hyperparameter. A very small sample rate may not yield enough representatives of u.u.s due to a small test set sample, whereas a large c may lead to a sample class X_s that over-represents the structure of the known classes, thereby causing the cross-validation to assign examples of known classes to X_u. In Figure 1 we illustrate the classifier's performance versus the sample-training ratio, i.e., the size of the test set sample relative to the size of the training set, for two of the benchmark datasets we used in our experimental evaluation. We discuss the search for the optimal c in Section 4.
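For concreteness, the following is a minimal sketch of Algorithm 1 for scikit-learn-style estimators. The function name `rtscv`, the integer label encoding (known classes 0..m−1, dummy class m), and the default values of c and k are our illustrative choices, not the authors' reference implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict

def rtscv(base_clf, X_train, y_train, X_test, c=0.1, k=3, random_state=0):
    """Rectify `base_clf` by extracting unknown unknowns from a test sample."""
    rng = np.random.default_rng(random_state)
    m = len(np.unique(y_train))                 # known classes assumed to be 0..m-1
    dummy = m                                   # label of the dummy class ("m + 1")

    # Steps 1-2: randomly sample the test set and assign the dummy label.
    idx = rng.choice(len(X_test), size=int(c * len(X_test)), replace=False)
    X_s = X_test[idx]

    # Step 3: augment the training set with the dummy class.
    X_aug = np.vstack([X_train, X_s])
    y_aug = np.concatenate([y_train, np.full(len(X_s), dummy)])

    # Steps 4-5: k-fold cross-validation; keep the test-sample members that are
    # predicted as the dummy class (the extracted u.u.s).
    y_cv = cross_val_predict(base_clf, X_aug, y_aug, cv=k)
    from_sample = np.arange(len(X_aug)) >= len(X_train)
    X_u = X_aug[from_sample & (y_cv == dummy)]

    # Steps 6-8: adjoin the extracted u.u.s to the training set and retrain.
    X_rect = np.vstack([X_train, X_u])
    y_rect = np.concatenate([y_train, np.full(len(X_u), dummy)])
    return clone(base_clf).fit(X_rect, y_rect)
```

A classifier rectified this way predicts the dummy label m for inputs it takes to be unknown unknowns, e.g., `rtscv(SVC(), X_train, y_train, X_test).predict(X_test)`.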
The number of cross-validation folds k determines the running time of RTSCV, whose time complexity is roughly (k + 1) · T_f (k cross-validation rounds plus the final retraining), where T_f is the running time of the input base classifier f without RTSCV. Figure 2(a) displays the model performance against k for several datasets we used in the evaluation of RTSCV. Note that RTSCV effectively rectifies a trained model even if we use the more computationally economical "holdout validation" approach.

We contribute a theoretical analysis that justifies the correctness of RTSCV and establishes performance guarantees as a function of class separability and the test sample size. As RTSCV is a general framework that may be combined with any base classifier, each employing a disparate approach to establish decision boundaries, there are major challenges in establishing a concise set of mathematical tools that would cover the basis of each specific approach. In light of this, we analyze its behavior through the lens of objectives that modern classifiers aim to optimize, in order to find sufficient conditions for the correct relabeling of the test set sample X_s, namely Maximum Likelihood Estimation (MLE), the Bayes classifier (BC), and the Minimum Mahalanobis Distance (MMD). These objectives are shared by many classification models, such as rule-based and margin-based classifiers.

Figure 2: (b) The influence of known-class separability on the RTSCV framework; we vary the between-class distances and the covariances of the known classes to generate different J1 scores. (c) The influence of u.u.s-class separability on the RTSCV framework; we vary the covariance and the distance of the u.u.s class to the known classes.

We structure the following theorems based on two classification cases under RTSCV for a data point x ∈ X_s: (1) the true label of x is one of the known classes, where the correct decision is to assign the true known label during cross-validation, and (2) x belongs to the u.u.s class, and the process should keep it in X_s. We aim to establish the correctness of RTSCV by showing that this correct decision is the one that optimizes the MLE, the BC, and the MMD. We model all the known and u.u.s classes as non-identical multivariate Gaussian distributions. Specifically, let X_1, X_2, ..., X_m be the known classes with distinct means μ_i ∈ R^d and diagonal covariance matrices Σ_i ∈ R^{d×d}, and let X_u be the u.u.s class with mean μ_u ∈ R^d and diagonal covariance matrix Σ_u ∈ R^{d×d}. Furthermore, we assume all the covariances are isotropic, so Σ_i = σ_i² I for i ∈ {1, 2, ..., m} and Σ_u = σ_u² I, for some σ_i, σ_u ∈ R_+. In this way, since the sample class X_s is obtained by randomly sampling the test set, we can model X_s as a Gaussian mixture of X_1, X_2, ..., X_m, X_u, weighted by their respective percentages in the test set, denoted P(X_i), i ∈ {1, 2, ..., m, u}.

Under the preceding assumptions, via Gaussian discriminant analysis, we first focus on MLE to explore how the total likelihood of the dataset changes under different labeling schemes of the sample class X_s. For a test set sample x ∈ X_s, its likelihood of being in one of the X_i (i ∈ {1, 2, ..., m, u}) is

L_i(x) = (2π)^{-d/2} |Σ_i|^{-1/2} exp( -(1/2) (x − μ_i)^T Σ_i^{-1} (x − μ_i) ),

where |Σ_i| denotes the determinant of Σ_i. Its likelihood of being in the sample class X_s is

L_s(x) = Σ_{i ∈ {1, ..., m, u}} P(X_i) L_i(x),

as X_s is a Gaussian mixture. We can now find sufficient conditions under which the correct classification of x increases the total likelihood.
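As a concrete illustration of L_i(x) and L_s(x), the short sketch below evaluates both quantities with SciPy for a toy setup with two known classes and one u.u.s class; all means, variances, and mixture weights are invented for the example and are not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 2
# Toy parameters (invented): two known classes and one u.u.s class.
means  = [np.zeros(d), 4.0 * np.ones(d), 8.0 * np.ones(d)]   # mu_1, mu_2, mu_u
stds   = [0.5, 0.8, 1.0]                                      # sigma_1, sigma_2, sigma_u
priors = [0.45, 0.45, 0.10]                                   # P(X_1), P(X_2), P(X_u) in the test set

def L(x, mu, sigma):
    """Class-conditional Gaussian likelihood with isotropic covariance sigma^2 * I."""
    return multivariate_normal.pdf(x, mean=mu, cov=sigma**2 * np.eye(d))

def L_s(x):
    """Mixture likelihood of the sample class X_s, weighted by test-set proportions."""
    return sum(p * L(x, mu, s) for p, mu, s in zip(priors, means, stds))

x = np.array([7.5, 8.2])                                      # a point near the u.u.s class
print([L(x, mu, s) for mu, s in zip(means, stds)], L_s(x))
```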
Theorem 3.1 (Maximum Likelihood Estimation). For x ∈ X_s, if x is a sample of some known class, i.e., x ∼ X_k for some k ∈ {1, 2, ..., m}, then for x to be correctly classified as belonging to X_k based on MLE, we require it to have a higher expected class-conditional likelihood for X_k than for X_s, i.e., E_{x∼X_k}[L_k(x)] > E_{x∼X_k}[L_s(x)]. This holds provided that, for all i ∈ {1, ..., m, u}, the separability between X_k and X_i exceeds a threshold characterized by the squared difference of their means and their within-class variances; the exact condition, together with the analogous condition guaranteeing E_{x∼X_u}[L_s(x)] > E_{x∼X_u}[L_k(x)] when x ∼ X_u, is stated in the supplementary materials.

Proof. All proofs are in the supplementary materials.

Theorem 3.1 characterizes sufficient conditions for the re-classification of X_s to maximize the total likelihood. In summary, it says that X_s will be correctly re-classified based on MLE once the class separability is above a threshold characterized by the squared difference of the means and the squared within-class variances. We now turn our attention to the Bayes classifier, which determines the membership of x ∈ X_s by considering its posterior probability [Murty and Devi, 2011]. By Bayes' theorem, the posterior for x to be in class X_i is given by p(X_i | x) = L_i(x) P(X_i) / p(x). In the scenario we consider, P(X_i) stands for the prior of X_i in the augmented training set, i ∈ {1, 2, ..., m, s}. In this way, one can calculate the Bayesian decision boundary of x between X_s and X_k for some k ∈ {1, 2, ..., m}.

Theorem 3.2 (Bayesian Classification). For x ∈ X_s, if x ∼ X_k, then E_{x∼X_k}[p(X_k | x)] > E_{x∼X_k}[p(X_s | x)]; and if x ∼ X_u, then E_{x∼X_u}[p(X_s | x)] > E_{x∼X_u}[p(X_k | x)]. Both hold provided that, for all i ∈ {1, ..., m, u}, the class separability exceeds thresholds that depend on the priors; the exact conditions are stated in the supplementary materials.

Theorem 3.2 (similarly to Theorem 3.1) establishes a performance guarantee on the correct behavior of the framework. That is, when the class separability between the u.u.s class and the known classes is greater than a given threshold, controlled by the sample size, the prior P(X_s), and the dimensionality of the data, the re-classification of X_s will behave as we expect with high probability.

Last, the Minimum Mahalanobis Distance (MMD) rule mimics the goal shared by classifiers of placing a data point in the class whose joint probability distribution of features is the closest to that of the data point. We show that the correct decision by RTSCV is the one that minimizes the MD. For a test set sample x ∈ X_s, its squared MD to some class X_i is given by

D_i²(x) = (x − μ_i)^T Σ_i^{-1} (x − μ_i).

The analogous behavior of a classifier that employs MMD as a metric is to assign x to the class that has the closest MD to x. Assume Σ_s = σ_s² I. For x ∈ X_s, if x ∼ X_k for some k ∈ {1, 2, ..., m}, the expected squared MD from x to X_k is smaller than that to X_s; and if x ∼ X_u, the expected squared MD from x to X_s is smaller than that to any known class, again provided that the class separability exceeds a threshold stated in the supplementary materials. In alignment with Theorem 3.1 and Theorem 3.2, here we establish a requirement of class separability as the sufficient condition for the expected behavior of RTSCV.
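To make the MMD rule concrete, here is a small sketch of the squared Mahalanobis distance and the resulting assignment rule; the class parameters are toy values of our own choosing, not taken from the paper.

```python
import numpy as np

def squared_mahalanobis(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.inv(cov) @ diff)

# Two illustrative known classes with isotropic covariances sigma_i^2 * I.
means = {1: np.array([0.0, 0.0]), 2: np.array([4.0, 4.0])}
covs  = {1: 0.25 * np.eye(2),     2: 0.64 * np.eye(2)}

x = np.array([3.7, 4.4])
distances = {i: squared_mahalanobis(x, means[i], covs[i]) for i in means}
assigned = min(distances, key=distances.get)      # argmin_i D_i^2(x)
print(distances, "->", assigned)
```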
To illustrate the preceding theoretical analysis, here we empirically study the effect of class separability using a synthetic dataset that consists of 10 distinct known classes and one u.u.s class, all sampled from pre-fixed 2-dimensional Gaussian distributions. To measure class separability, we adapted the notion of scatter matrices [Theodoridis and Koutroumbas, 2008] to address the presence of the u.u.s class. Specifically, assume {X_1, X_2, ..., X_m} are the m known classes, with μ_i the mean and Σ_i the covariance matrix of X_i. For the u.u.s class X_u, the mean and covariance matrix are μ_u and Σ_u. The between-class scatter matrix S_b, which measures the separability between different classes, is defined as

S_b = Σ_{i=1}^{m} P(X_i) (μ_i − μ_0)(μ_i − μ_0)^T,

where P(X_i) is the percentage of examples in X_i relative to the total number of examples in the dataset. When measuring the distance between the m known classes and the u.u.s class, we set μ_0 = μ_u. Otherwise, if we want to measure the between-class distance within the known classes, we set μ_0 = (1/m) Σ_{i=1}^{m} μ_i. For the within-class scatter matrix S_w, when measuring the u.u.s class we simply define S_w = Σ_u, the covariance matrix of X_u. For the known classes, S_w is the weighted sum of all their covariance matrices: S_w = Σ_{i=1}^{m} P(X_i) Σ_i. To combine both scatter matrices, we use the J1 criterion [Theodoridis and Koutroumbas, 2008],

J1 = trace(S_b + S_w) / trace(S_w),

which increases as the means of the different classes spread out and the intra-class variances become small. We evaluate RTSCV on different dataset configurations of varying class means and covariances, each corresponding to a unique J1 score. The results are plotted in Figure 2(b) and (c).

We evaluate RTSCV on both tabular datasets, where we pair RTSCV with classical classifiers such as SVM and simple fully connected neural networks, and computer vision datasets, where we use deep convolutional neural networks such as ResNet [He et al., 2016] and DenseNet [Huang et al., 2017]. As established by Theorem 3.2, there is a sweet spot for the sample rate that best balances the classification of x ∈ X_s by governing the prior P(X_s). Let the sample-training ratio be the ratio of the size of the test set sample to the size of the training set. Figure 1 plots RTSCV's performance on COIL-20 and CIFAR10-ImageNet, used in Sections 4.1 and 4.2 respectively, under varying sample-training ratios. The model performance climbs steeply with the sample size before the sample starts to over-represent the known data, and then decreases slowly. We search for an optimal c by assessing the misclassification of known data. The optimal sample rate is dataset-dependent and varies from 0.06 to 0.1. Note that the small required sample sizes make RTSCV practical for efficient sampling during model deployment. We also searched for the optimal number of cross-validation folds k, from 2 to 6, for the Letters, Pendigits, COIL-20, and MNIST datasets, and selected k = 3.

We selected the Letter Recognition and Pendigits datasets, which contain handwritten samples of the 26 English letters and the 10 digits, respectively. Further, we also down-sample the Columbia University Image Library (COIL-20), which contains grey-scale images of 20 objects [Nene et al., 1996], following the PCA-based technique by [Geng and Chen, 2020]. We also select MNIST, which contains 10 digit classes of dimension 28 × 28 [LeCun et al., 2010]. On the tabular datasets except for MNIST, we use a standard SVM implementation as the base model. We first report the results of the pre-rectified model, i.e., the performance of the base classifier under the presence of u.u.s without applying RTSCV. To compare with RTSCV, we select three previously proposed methods from the literature on u.u.s discovery: (1) EVM [Rudd et al., 2018], (2) 1-vs-Set [Scheirer et al., 2013], and (3) W-SVM. In particular, 1-vs-Set and W-SVM are both SVM-based algorithms. For MNIST, we use a simple MLP, a four-layer fully connected network, as the base model. Finally, to create u.u.s in the test set of the chosen datasets, we remove certain classes of data from the training set while keeping the test set unchanged. To be consistent with previous methods, we evaluate the F-measure of our method and the other baselines against the openness [Scheirer et al., 2013].

Openness. The u.u.s may be divided into different classes that span different geometric regions of the feature space. The openness metric proposed by [Scheirer et al., 2013] increases with the number of u.u.s classes; a larger openness indicates a larger number of u.u.s classes relative to that of known classes in the test data:

openness = 1 − sqrt( 2 · C_TR / (C_TA + C_TE) ),

where C_TR, C_TA, and C_TE denote the numbers of training, target, and test classes, respectively.
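For reference, a minimal sketch of the openness computation under the definition above; the class counts in the example are illustrative only.

```python
import math

def openness(n_train_classes, n_target_classes, n_test_classes):
    """openness = 1 - sqrt(2 * C_TR / (C_TA + C_TE))."""
    return 1.0 - math.sqrt(2.0 * n_train_classes
                           / (n_target_classes + n_test_classes))

# e.g., 10 classes appear at test time, but only 9 were seen during training:
print(openness(n_train_classes=9, n_target_classes=10, n_test_classes=10))
```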
F-measure is the harmonic mean of precision and recall. In our multi-class scenario, it is obtained by averaging the class-wise F-measures, combining the classification accuracy of both the known and u.u.s classes. We present the experimental results on all four datasets in Table 1. The F-measure of the classifier is plotted as a function of openness, which we vary by controlling the number of known classes removed from the training set. RTSCV achieves a consistent performance improvement over the pre-rectified model, closing a performance gap as large as 41% in some cases. Coupled with the SVM, RTSCV beats previous u.u.s detection methods in 5 out of 6 settings, while maintaining a small disadvantage to the EVM on COIL-20 with openness 9.3%. RTSCV with the MLP on MNIST performs best at the largest openness, i.e., 42.3%, where 9 out of 10 classes are selected as u.u.s during test. This further attests to the robustness of RTSCV under a disproportionate amount of u.u.s in the test data.

We also evaluate RTSCV on several more challenging pattern recognition tasks, namely computer vision datasets. Specifically, we select CIFAR-10 and CIFAR-100, both containing colored object images [Krizhevsky, 2009], as well as the Street View House Numbers (SVHN) dataset from the Google Street View project [Netzer et al., 2011]. All image dimensions in these datasets are 32 × 32. For each of the computer vision datasets, we separately test our RTSCV framework with ResNet [He et al., 2016] and DenseNet [Huang et al., 2017], two network architectures that have achieved good performance on large benchmark datasets (see the supplementary material for details on the training configuration). For comparison, we include the results of the baseline method by [Hendrycks and Gimpel, 2017], ODIN [Liang et al., 2018], and the Mahalanobis method (MD) [Lee et al., 2018], three previous approaches for detecting OOD samples (u.u.s) with neural networks. Differently from Section 4.1, here we create u.u.s in the test set by introducing additional computer vision datasets to the target space. Specifically, we use the resized versions of Tiny-ImageNet [Deng et al., 2009] and LSUN [Yu et al., 2015], following the techniques in [Liang et al., 2018, Lee et al., 2018]. We present a summary of the known datasets and u.u.s datasets we used in Table 2.

We adopt the following metrics to evaluate performance on both known-class classification and u.u.s detection. Classification accuracy is the accuracy of the known-class classification, i.e., the total number of correct predictions on the known class labels divided by the total number of known-class test samples. Detection accuracy is the number of u.u.s in the test data that are correctly detected by the model divided by the total number of u.u.s. AUROC depicts the relationship between the true positive rate (TPR) and the false positive rate (FPR) [Davis and Goadrich, 2006]; a higher AUROC indicates a higher probability that a positive instance ranks above a negative one.

We display the experimental results in Table 2. RTSCV coupled with ResNet or DenseNet achieves the overall best performance with respect to all of the evaluation metrics. In particular, it yields a significant improvement in u.u.s detection accuracy while maintaining a high classification accuracy for the known classes. This suggests that RTSCV is the most effective in identifying u.u.s, without degrading the original model, even on more complex datasets with more complex classification models.
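The following is a hedged sketch of how the metrics above can be computed with scikit-learn, where the known classes are labeled 0..m−1 and label m marks detected u.u.s; the toy arrays and the per-sample u.u.s confidence scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

m = 3                                     # known classes 0..m-1, u.u.s label = m
y_true = np.array([0, 1, 2, m, m, 1, 0, m])
y_pred = np.array([0, 1, 2, m, 1, 1, 0, m])
uu_scores = np.array([.1, .2, .1, .9, .4, .3, .2, .8])  # confidence of being u.u.

# Macro F-measure over known and u.u.s classes combined.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Classification accuracy on known classes only.
known = y_true < m
cls_acc = np.mean(y_pred[known] == y_true[known])

# Detection accuracy: fraction of true u.u.s predicted with label m.
det_acc = np.mean(y_pred[y_true == m] == m)

# AUROC for u.u.s-vs-known, given a per-sample u.u.s confidence score.
auroc = roc_auc_score((y_true == m).astype(int), uu_scores)
print(macro_f1, cls_acc, det_acc, auroc)
```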
With the goal of reducing deployment errors and model bias due to deficient training data, RTSCV is a proposal for an algorithmic framework that gives a base classification model the flexibility to rectify a trained model at deployment. We provide a rigorous theoretical analysis of the correctness and performance guarantees of the process of minimizing its structural mismatch with a target space, based on objectives that most modern classifiers aim to optimize. RTSCV exhibits consistent performance improvements over both the pre-rectified model and previously proposed approaches that share the same goals on 7 benchmark datasets. Moreover, it does not assume the presence of an oracle, as in the case of active learning. Our ongoing work focuses on improvements to RTSCV, especially due to concerns with the computational cost of cross-validation. We are developing alternatives to cross-validation that could work equally well while reducing the computational cost dramatically. For instance, we are in the process of investigating semi-supervised clustering [Zhu, 2008, Basu et al., 2002]. In the supplementary materials, we discuss our initial efforts in this direction. Our preliminary results suggest that such an approach works as well as RTSCV when the u.u.s consist of only one cluster and have a relatively small covariance. Nevertheless, when the u.u.s form multiple clusters or the clusters have high variance, the performance of this alternative approach drops significantly compared to that of RTSCV.

Proof of Theorem 3.1. First consider the case x ∼ X_k. Expanding E_{x∼X_k}[L_k(x)] and E_{x∼X_k}[L_s(x)] and comparing the two terms shows that, if the stated conditions hold for all classes other than class k, then E_{x∼X_k}[L_k(x)] > E_{x∼X_k}[L_s(x)]. For the case x ∼ X_u, to obtain a sufficient condition for E_{x∼X_u}[L_s(x)] > E_{x∼X_u}[L_k(x)], we simply need to require P(X_u) N(μ_u; μ_u, 2Σ_u) > (1 − P(X_k)) N(μ_k; μ_u, Σ_u + Σ_k), which yields the condition stated in the theorem; here N(a; b, Σ) denotes the Gaussian density with mean b and covariance Σ evaluated at a.

Proof of Theorem 3.2. For x ∼ X_k, to show that E_{x∼X_k}[p(X_k | x)] > E_{x∼X_k}[p(X_s | x)], by Bayes' theorem we just need to show E_{x∼X_k}[L_k(x) P(X_k)] > E_{x∼X_k}[L_s(x) P(X_s)], which follows from the expectations computed in the proof of Theorem 3.1. For x ∼ X_u, we just need to show that E_{x∼X_u}[L_s(x) P(X_s)] > E_{x∼X_u}[L_k(x) P(X_k)]. From the proof of Theorem 3.1, requiring P(X_s) P(X_u) N(μ_u; μ_u, 2Σ_u) > [P(X_k) − P(X_s) P(X_k)] N(μ_k; μ_u, Σ_u + Σ_k) gives the desired sufficient condition.

Figure 3: Comparison between Clustering with Side Information (CSI) and our RTSCV method under different synthetic dataset settings. Top: there is one u.u. cluster in the left plot and two u.u. clusters in the right plot; in both plots the u.u. class is located far away from the 10 known classes, and only the covariance of the u.u. class is altered across trials. Bottom: there is one u.u. cluster, whose distance to the known base classes is altered across trials.

We also believe that it is worth discussing the possibility of resorting to semi-supervised clustering as an alternative to cross-validation during the process of re-classifying the sample class X_s, given its increasing popularity and its potential to be more computationally economical. Given a small amount of labeled data, semi-supervised clustering performs ordinary clustering tasks under the constraints of must-links (two points must be in the same cluster) and cannot-links (two points cannot be in the same cluster) provided by the labeled data [Zhu, 2008]. In our scenario, the objective of re-classifying X_s can be viewed as equivalent to dividing X_s into several clusters, each of which corresponds to either a known or u.u. class, with the assistance of the labeled data from the entire training set.
This is also called clustering with side information (CSI) in the literature [Zhu, 2008]. To test this alternative, we adopt a simple method called Seeded-KMeans [Basu et al., 2002]. Specifically, given M known classes X_1, X_2, ..., X_M in the training set, we run an (M + 1)-Means clustering algorithm on the sample set X_s, with the initial centers of the clusters set to the mean feature vectors of X_1, X_2, ..., X_M and X_s, respectively. After the clustering converges, we assign the label of each cluster of X_s according to the class membership of the initial seeding of the corresponding center. In other words, a cluster initially seeded by the mean of some known class X_k will be labeled as X_k, and a cluster initially seeded by the mean of X_s will be labeled as the u.u. class (see the sketch after the figure captions below). Our preliminary experiments suggest that such an approach works as well as RTSCV when the u.u.s consist of only one cluster (sub-class) and are far away from the known base classes. As illustrated in the top-left plot of Figure 3, in such a setting CSI has OSR performance very similar to that of RTSCV under different covariance levels of the u.u. class. Nevertheless, when the u.u.s form multiple clusters or are close to the base classes, the performance of CSI plunges significantly, as illustrated in the top-right and bottom plots of Figure 3. This is possibly because of the large inconsistency between the mean of X_s, used as the initial seed of the u.u. class, and the true u.u.s distribution. In response to that, one potential improvement of the CSI method might be to incorporate some priors on the distribution of the u.u. class, i.e., the number of its sub-classes or their means, with light involvement of human experts. We believe that this is a very promising direction for future work.

Figure 4: RTSCV decision boundaries after cross-validation using SVM, fitted on the entire augmented training set consisting of 10 known classes and the sample class X_s. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.

Figure 5: RTSCV decision boundaries after cross-validation using KNN, fitted on the entire augmented training set consisting of 10 known classes and the sample class X_s. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.

Figure 6: RTSCV decision boundaries after cross-validation using Decision Tree, fitted on the entire augmented training set consisting of 10 known classes and the sample class X_s. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.
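A minimal sketch of the Seeded-KMeans re-classification of X_s described above, assuming scikit-learn and integer known-class labels 0..M−1; the helper name `seeded_kmeans_relabel` and the use of KMeans with explicit initial centers are our illustration, not the reference implementation of [Basu et al., 2002].

```python
import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans_relabel(X_train, y_train, X_s):
    """Cluster X_s with (M + 1) seeds: one per known-class mean, plus mean(X_s)."""
    classes = np.unique(y_train)                        # assumed to be 0..M-1
    known_means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    seeds = np.vstack([known_means, X_s.mean(axis=0)])  # last seed -> u.u. class

    km = KMeans(n_clusters=len(seeds), init=seeds, n_init=1).fit(X_s)

    # Each cluster keeps the label of the class whose mean seeded it; the cluster
    # seeded by mean(X_s) becomes the u.u. class (label M, i.e., "m + 1").
    seed_labels = np.append(classes, len(classes))
    return seed_labels[km.labels_]
```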
Identifying unknown unknowns in the open world: Representations and policies for guided exploration
A baseline for detecting misclassified and out-of-distribution examples in neural networks
Enhancing the reliability of out-of-distribution image detection in neural networks
A simple unified framework for detecting out-of-distribution samples and adversarial attacks
Energy-based out-of-distribution detection
Toward open set recognition
Probability models for open set recognition
Nearest neighbors distance ratio open-set classifier
Towards open set deep networks
Contradict the machine: A hybrid approach to identifying unknown unknowns
Machine teaching: A new paradigm for building machine learning systems
Multi-class open set recognition using probability of inclusion
Multi-class data description for out-of-distribution detection
Classification-reconstruction learning for open-set recognition
C2AE: Class conditioned auto-encoder for open-set recognition
Recent advances in open set recognition: A survey
Learning and the unknown: Surveying steps toward open world recognition
The extreme value machine
Identifying mislabeled training data
Pattern recognition: An algorithmic approach
Pattern Recognition, Fourth Edition
Deep residual learning for image recognition
Densely connected convolutional networks
Chuanxing Geng and Songcan Chen. Collective decision for open set recognition
MNIST handwritten digit database
Learning multiple layers of features from tiny images
Reading digits in natural images with unsupervised feature learning
ImageNet: A large-scale hierarchical image database
LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop
The relationship between precision-recall and ROC curves
Semi-supervised learning literature survey

This research was partially supported by a National Natural Science Foundation of China (NSFC) grant #61850410536. Abrahao developed part of this research while affiliated with Microsoft Research AI, Redmond.