key: cord-0568886-hgyp9w9p authors: Mallick, Ankur; Dwivedi, Chaitanya; Kailkhura, Bhavya; Joshi, Gauri; Han, T. Yong-Jin title: Probabilistic Neighbourhood Component Analysis: Sample Efficient Uncertainty Estimation in Deep Learning date: 2020-07-18 journal: nan DOI: nan sha: 3aff58b3a6db8fb50c63eb0e78efad97326d6211 doc_id: 568886 cord_uid: hgyp9w9p

While Deep Neural Networks (DNNs) achieve state-of-the-art accuracy in various applications, they often fall short in accurately estimating their predictive uncertainty and, in turn, fail to recognize when these predictions may be wrong. Several uncertainty-aware models, such as Bayesian Neural Networks (BNNs) and Deep Ensembles, have been proposed in the literature for quantifying predictive uncertainty. However, research in this area has been largely confined to the big data regime. In this work, we show that the uncertainty estimation capability of state-of-the-art BNNs and Deep Ensemble models degrades significantly when the amount of training data is small. To address the issue of accurate uncertainty estimation in the small-data regime, we propose a probabilistic generalization of the popular sample-efficient non-parametric kNN approach. Our approach enables a deep kNN classifier to accurately quantify the underlying uncertainties in its predictions. We demonstrate the usefulness of the proposed approach by achieving superior uncertainty quantification as compared to the state of the art on a real-world application of COVID-19 diagnosis from chest X-Rays. Our code is available at https://github.com/ankurmallick/sample-efficient-uq

Deep Neural Networks (DNNs) have achieved remarkable success in a wide range of applications where a large amount of labeled training data is available [10, 9]. However, in many emerging applications of machine learning, such as diagnosis and treatment of novel coronavirus disease (COVID-19) [6], large labeled training datasets may not be available.
Furthermore, test data in these applications may deviate from the training data distribution, e.g., due to sample selection bias or nonstationarity, and in some extreme cases may even be Out-of-Distribution [3]. Note that several of these applications are high-regret in nature, implying that incorrect decisions or predictions have significant costs. Therefore, such applications require not only high accuracy but also accurate quantification of predictive uncertainties. Accurate predictive uncertainty in these applications can help practitioners assess the true performance and risks and decide whether the model predictions should (or should not) be trusted. Unfortunately, DNNs often make overconfident predictions in the presence of distributional shifts and Out-of-Distribution data. As an example, Fig. 1 shows the predictions of different deep learning models trained to detect the presence of COVID-19 from chest X-ray images. All models achieve similar accuracy (∼80%) on in-distribution validation data. However, the quality of their uncertainty estimates varies widely, as explained next. While all models are forced to output some prediction on every input image, we would want a model not to be very confident on input data that is very different from the data used to train it. However, we observe that state-of-the-art deep learning models make highly overconfident predictions on Out-of-Distribution data [17].

Figure 1: The predictions of deep learning models that are trained to detect the presence of COVID-19 in chest X-Ray images from [6]. (a) All models correctly classify in-distribution data from [6]. (b) Uncertainty-aware models (BNN and our model, PNCA) perform better than DNN when test data is from a different source [4]. (c) As opposed to the proposed PNCA, both BNN and DNN make overconfident misclassifications on Out-of-Distribution data [17] (e.g., classifying a shoulder X-Ray as COVID-19).
Interestingly, we found that even popular uncertainty-aware models (e.g., BNNs, deep ensembles) that are designed to address this precise issue perform poorly in the small-data regime. This is extremely problematic, especially given the flurry of papers that have attempted to use DNNs for detecting COVID-19 from chest X-Ray images [15, 5, 21], as real-world test data is almost always different from the training data. While there have been separate efforts on improving the sample efficiency [14] and the accuracy of uncertainty estimation [7] of deep learning, to the best of our knowledge there has not been any effort to study these seemingly different issues in a unified manner. Therefore, this paper takes some initial steps towards (a) studying the effect of the amount of training data on the quality of uncertainty and (b) developing sample-efficient uncertainty-aware predictive models. Specifically, to overcome the challenge of providing accurate uncertainties without compromising accuracy in the small-data regime, we propose a probabilistic generalization of the popular non-parametric kNN approach, referred to as probabilistic neighbourhood component analysis (PNCA). By mapping data into distributions in a latent space before performing classification, we enable a deep kNN classifier to accurately quantify the underlying uncertainties in its predictions. Following [11, 18], for a meaningful and effective performance evaluation, we compare the quality of predictive uncertainty of different models under conditions of distributional shift and Out-of-Distribution data. We empirically show that the proposed PNCA approach achieves significantly better uncertainty estimation performance than state-of-the-art approaches in the small-data regime.

In this section, we describe our model to achieve sample-efficient and uncertainty-aware classification. The details of the algorithm and the proof of Proposition 1 are presented in Appendix B.
Our approach is a generalization of NCA, proposed in [8], wherein the authors learn a distance metric for kNN classification of points $x_1, \dots, x_n \in \mathbb{R}^D$ with corresponding class labels $y_1, \dots, y_n$. A data point $x$ is projected into a latent space $\mathcal{Z} \subseteq \mathbb{R}^d$ to give an embedding $z = g_w(x)$. Here $g$ can be a linear transformation like a $d \times D$ matrix or a non-linear transformation like a neural network with a $d$-dimensional output, and $w$ are the parameters of the transformation. The probability of a point $x_i$ selecting another point $x_j$ as its neighbor is given by applying a softmax activation to the distance between points in the latent space:

$$q_{ij} = \frac{\exp(-\|z_i - z_j\|^2)}{\sum_{t \neq i} \exp(-\|z_i - z_t\|^2)}, \qquad q_{ii} = 0. \tag{1}$$

The probability of $x_i$ selecting a point in the same class as itself is given by $q_i = \sum_{j: y_j = y_i} q_{ij}$, and the optimal model parameters are obtained by minimizing the loss

$$L(w) = -\sum_{i=1}^{n} \log q_i, \tag{2}$$

which is the negative log-likelihood of the data under the model. The authors of [8] experiment with a variety of transformations $g_w(\cdot)$ and classification tasks and show that NCA achieves competitive accuracy.

The lack of data may cause the NCA model to overfit when learning the weights by optimizing the loss in (2). We expect that the uncertainty due to the scarcity of training data can be better captured by probability distributions in the latent space $\mathcal{Z}$ than by individual data samples. Therefore, we propose a probabilistic generalization of the model, PNCA, which learns a distribution over the model parameters $w$ and, thus, deals with both the lack of training data and the task of accurate uncertainty estimation.

Latent Space Mapping using Probabilistic Neural Networks. Each data point $x$ passes through a probabilistic neural network with parameters $W \sim p(W)$ to give a random variable $Z = g_W(x) \in \mathcal{Z}$. Due to the stochasticity of $W$, each data point $x_i$ corresponds to a different distribution $p(Z|x_i)$ in the latent space.

NCA over Latent Distributions.
Observe that the individual terms in the softmax activation correspond to a kernel between latent embeddings $z$, e.g., the squared exponential kernel $k(z, z') = \exp(-\|z - z'\|^2)$. Since in our approach the embedding corresponding to a data point $x_i$ is the probability distribution $p(Z|x_i)$, we propose to use the following kernel between distributions:

$$K_{ij} = \mathbb{E}_{Z \sim p(Z|x_i),\, Z' \sim p(Z|x_j)}\left[k(Z, Z')\right], \tag{3}$$

where $K_{ij}$ corresponds to the inner product between the distributions $p(Z|x_i)$ and $p(Z|x_j)$ in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_k$ defined by the kernel $k$ [16] and, thus, captures similarity between distributions in the same way as $k(z, z')$ captures the similarity between individual embeddings in NCA. The forward pass described above is used to compute a kernel between data points $x_i$ and $x_j$, which can then be used to compute the probability of $x_i$ selecting $x_j$ in an analogous fashion to NCA as

$$q_{ij} = \frac{K_{ij}}{\sum_{t \neq i} K_{it}}. \tag{4}$$

Since the latent embedding $Z$ for a data point $x$ is given by $Z = g_W(x)$, $W \sim p(W)$, we can rewrite (3) as

$$K_{ij} = \mathbb{E}_{W, W' \sim p(W)}\left[k(g_W(x_i), g_{W'}(x_j))\right]. \tag{5}$$

Thus, we can view $K_{ij}$ as a functional of $p(W)$. The negative log-likelihood in (2) is then also a functional of $p$. The optimal distribution over the model parameters can be obtained by solving

$$p^* = \arg\min_{p \in \mathcal{P}} L[p]. \tag{6}$$

The choice of $\mathcal{P}$ is critical to the success of this approach. Following [13], we choose $\mathcal{P}$ to be the set of distributions obtained by applying shifts $u = w + s(w)$ to samples $w \sim p_0$ from an initial distribution $p_0$, where the shift $s$ lies in the RKHS $\mathcal{H}_\kappa$ given by a kernel $\kappa$ between model parameters $w$ (note that this is different from the RKHS $\mathcal{H}_k$ into which distributions in the latent space $\mathcal{Z}$ are embedded and which is given by the kernel $k$). This choice of $\mathcal{P}$ includes all smooth transformations from the initial distribution $p_0$, and the optimization problem (6) now reduces to computing the optimal shift $s^*(w)$. Next, we provide an expression for the functional gradient of the negative log-likelihood under our model with respect to the shift $s$.

where $\hat{L}$ is given by substituting $\hat{K}_{ij} = \frac{1}{m^2} \sum_{l, l'} k(g_{w_l}(x_i), g_{w_{l'}}(x_j))$ in (2).
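As an illustration of the sample-based kernel estimate above, the following NumPy sketch computes the average of $k$ over all pairs of $m$ parameter samples and then turns the resulting matrix into neighbor-selection probabilities by row normalization with the self-term excluded, in analogy with NCA. The linear map `g_w(x) = w @ x`, the function names, and the toy data are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def sq_exp_kernel(za, zb):
    """Squared exponential kernel k(z, z') = exp(-||z - z'||^2) for all pairs.

    za: (a, d), zb: (b, d) -> (a, b)
    """
    diff = za[:, None, :] - zb[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1))

def empirical_kernel(X, weights):
    """K_hat[i, j] = (1/m^2) * sum_{l, l'} k(g_{w_l}(x_i), g_{w_l'}(x_j)).

    X: (n, D) data. weights: (m, d, D) samples of a hypothetical linear
    map g_w(x) = w @ x (an assumption; any network could be used instead).
    """
    m, n = weights.shape[0], X.shape[0]
    # Z[l, i] is the embedding of x_i under the l-th parameter sample.
    Z = np.einsum("ldk,nk->lnd", weights, X)        # (m, n, d)
    flat = Z.reshape(m * n, -1)
    k_all = sq_exp_kernel(flat, flat).reshape(m, n, m, n)
    return k_all.mean(axis=(0, 2))                   # average over l and l'

def neighbor_probs(K):
    """q[i, j] = K[i, j] / sum_{t != i} K[i, t], with q[i, i] = 0."""
    K = K.copy()
    np.fill_diagonal(K, 0.0)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))              # n = 6 toy points in R^4
W = rng.normal(size=(5, 2, 4)) * 0.3     # m = 5 parameter samples, latent dim d = 2
K_hat = empirical_kernel(X, W)
q = neighbor_probs(K_hat)
```

The all-pairs computation costs on the order of $(mn)^2$ kernel evaluations, so it is only meant to make the definition concrete on small inputs.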
To estimate the optimal shift $s^*$ (or optimal distribution $p^*$), we draw an initial set of parameters $w_i \sim p_0(w)$ and iteratively apply the functional gradient descent transformation $u = w - \nabla_s L|_{s=0}$, as described in Algorithm 1 in Appendix B.

We consider two small-data classification tasks: (1) handwritten digit recognition, and (2) COVID-19 detection from chest X-Ray images. For both tasks, we compare the proposed PNCA to 4 baselines: a Deep Neural Network (DNN), a Bayesian Neural Network (BNN) trained using the approach of [13], Deep Ensembles [11], and NCA [8]. For PNCA and NCA, we use a neural network to map the data $x$ to embeddings $z$. For NCA and PNCA, the predicted class label for a test point $x_i$ is $\hat{y}_i = \arg\max_c \sum_{j: y_j = c} q_{ij}$. For the other models, the predicted class label is the one with the highest softmax probability (average softmax probability for BNNs and Ensembles). Following [18], we use $\max_c q(\hat{y} = c|x)$ as a measure of the confidence of the model.

Fig. 2c shows the performance comparison on Out-of-Distribution data, i.e., the not-MNIST dataset [2], which contains letters instead of handwritten digits. We see that PNCA has significantly fewer examples with high confidence than the rest of the approaches on the not-MNIST dataset, illustrating its superior capability in quantifying uncertainty.

COVID-19 detection. There has been an increasing interest in using deep learning to detect COVID-19 from Chest X-Ray (CXR) images [19, 21]. Successful prediction from CXR data can effectively complement the standard RT-PCR test [22]. However, the lack of a large amount of training data and the distributional shift between train and test data are two major challenges in this task [15]. We consider two sources of COVID-19 data: [6], which has been used by most existing works to train their models for COVID-19 classification, and [4], which we use as our unseen test data as it comes from a different source than the images in [6].
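The prediction rule and confidence measure defined earlier in this section for NCA and PNCA can be sketched as follows. The function name `predict_with_confidence`, the array layout, the toy numbers, and the 0.8 confidence threshold are assumptions for illustration only; `q` holds the probabilities of each test point selecting each training point.

```python
import numpy as np

def predict_with_confidence(q, train_labels, num_classes):
    """Sum q over the training points of each class, then take the arg max.

    q: (n_test, n_train) with rows summing to 1.
    train_labels: (n_train,) integer labels in [0, num_classes).
    Returns the predicted labels and the confidence max_c q(y = c | x).
    """
    one_hot = np.eye(num_classes)[train_labels]    # (n_train, num_classes)
    class_probs = q @ one_hot                       # (n_test, num_classes)
    return class_probs.argmax(axis=1), class_probs.max(axis=1)

q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6]])
train_labels = np.array([0, 0, 1])
y_hat, conf = predict_with_confidence(q, train_labels, num_classes=2)

# Fraction of test points predicted with confidence above a hypothetical threshold.
high_conf_fraction = np.mean(conf > 0.8)
```

Counting predictions above a fixed threshold in this way gives one simple summary of how often a model is highly confident on a given test set.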
We follow the transfer learning approach of [15], wherein a ResNet-50 model pre-trained on ImageNet is used as a feature extractor and the last layer of the model is re-trained on [6] with each of the aforementioned approaches (DNN, BNN, Ensemble, NCA, PNCA). We consider a binary classification problem, i.e., each model outputs a probability $q$ of the presence/absence of COVID-19 in a given CXR image. We use the version of [6] available on Kaggle as our training dataset, which contains 275 COVID-19 X-Ray images and 76 non-COVID X-Ray images. On the other hand, [4] is used as our test data, which contains 58 COVID-19 X-Ray images and 127 non-COVID X-Ray images. There is a distributional shift between train and test data, resulting in relatively low test accuracy for all models in Fig. 3a. We also look at the number of examples classified with high confidence for both the test data and completely Out-of-Distribution data (shoulder and hand X-Rays from [17]). As can be seen, on [4], which potentially has a different distribution, BNN has a slightly lower number of examples classified with high confidence than the other models. Next, in Fig. 3c, we can see that as the distribution shift increases, PNCA makes significantly fewer high-confidence predictions than all other models, corroborating its superior uncertainty quantification. In summary, these experiments demonstrate that PNCA achieves much better uncertainty quantification than the baselines without losing accuracy in the small-data regime.

This work serves as a caution to practitioners interested in applying deep learning for disease detection, especially during the current pandemic, since we find that the issues related to overconfident and inaccurate predictions of DNNs become even more severe in the small-data regime.
While our approach appears to be less susceptible to making overconfident misclassifications and to have good uncertainty estimation performance, we acknowledge that there is still room for improvement, especially with respect to the accuracy of the model. With this in mind, we will explore approaches to improve the generalization capability of PNCA in future work. Further, sample-efficient uncertainty calibration approaches such as [24] and more reliable evaluation approaches for the small-data regime can be explored.

All models are implemented in TensorFlow [1] on a Titan X GPU with 3072 CUDA cores. We use the Adam optimizer with Nesterov momentum [20] with a learning rate of 0.001 to train the models for 100 epochs. For DNN, BNN, and Ensemble we use minibatches of size 20, with 1 epoch corresponding to a pass over the entire dataset, while for NCA and PNCA, the entire dataset is used to calculate gradients. Following [13], we use the RBF kernel as the kernel $\kappa$ between model parameters $w$ in PNCA, with bandwidth chosen according to the median heuristic described in their work, since it causes $\sum_j \kappa(w, w_j) \approx 1$ for all $w$, leading $\kappa$ to behave like a probability distribution. We also use Orthogonal Random Features [23] to approximate the kernel between probability distributions in the latent space in (3) for faster computation. We use $10d$ features, where $d$ is the dimensionality of the latent space, and a ReLU activation on the approximate kernel to set any spurious negative values to zero (since the original squared exponential kernel can never be negative).

Algorithm 1
Input: Training points $X$ and targets $y$ along with a set of initial model parameters $\{w_i\}_{i=1}^m \sim p_0(w)$
Output: A set of model parameters $\{w_i\}_{i=1}^m \sim \hat{p}(w)$, where $\hat{p}(w)$ is obtained by performing functional gradient descent

Table 1 contains the accuracy of different models across experiments (MNIST test data [12], Rotated MNIST test data, COVID-19 validation data [6], and COVID-19 test data [4]).
For MNIST, accuracy values are averaged over 10 trials (where each trial corresponds to a different set of 100 training examples). For COVID-19 accuracy values, the training data is split into 5 equal folds, and in each trial we use 4 folds to train the model and the 5th fold to calculate in-distribution (validation) accuracy. Since each training data point is part of the validation data (5th fold) only once, we do not have any standard deviation values for the validation accuracy. The accuracy on COVID-19 test data is averaged across all folds.

Observe that for any smooth one-to-one transform $u = T(w)$, $w \sim p(W)$, the kernel between the latent distributions $p(Z|x_i)$, $p(Z|x_j)$ corresponding to data points $x_i$ and $x_j$ under the transformed distribution $p_{[T]}(u)$ can be written as

$$K_{ij}[T] = \mathbb{E}_{w, w' \sim p(W)}\left[k(g_{T(w)}(x_i), g_{T(w')}(x_j))\right]. \tag{9}$$

Since the above holds for infinitesimal shifts $u = w + s(w)$, a tractable choice of $p_0$ (e.g., Gaussian) enables efficient approximation of $K_{ij}$ by sample averages with samples $u_i$. The expression in (9) is a functional of the transformation $T$ (for fixed $p(W)$), i.e., a functional of the shift $s$ in our case. Therefore, the problem of finding $p^*(W)$ in (6) reduces to the problem of finding the optimal shift (given $p_0(W)$), i.e.,

$$s^* = \arg\min_{s \in \mathcal{H}_\kappa} L[s]. \tag{10}$$

Since $s \in \mathcal{H}_\kappa$, which is the RKHS for the kernel $\kappa$, we can solve (10) via functional gradient descent. Defining $k_{ij}(w, w') = k(g_w(x_i), g_{w'}(x_j))$, we have

$$K_{ij}[s] = \mathbb{E}_{w, w' \sim p(W)}\left[k_{ij}(w + s(w), w' + s(w'))\right]. \tag{11}$$

Assuming that the distributions $p(W)$ and shifts $s(W)$ are functions in an RKHS $\mathcal{H}$ given by the kernel $\kappa$ ($\kappa$, $k$, and $K$ are all different), we have (from the definition of the functional gradient $\nabla_s K_{ij}[s]$)

$$K_{ij}[s + r] = K_{ij}[s] + \langle \nabla_s K_{ij}[s], r \rangle_{\mathcal{H}} + O(\|r\|^2).$$

Thus we need to compute the difference $K_{ij}[s + r] - K_{ij}[s]$, which, from (11), is given by

$$K_{ij}[s + r] - K_{ij}[s] = \mathbb{E}_p\left[k_{ij}(w + s(w) + r(w), w' + s(w') + r(w')) - k_{ij}(w + s(w), w' + s(w'))\right].$$

We use $\mathbb{E}_p$ to denote the expectation when $w, w' \sim p(W)$. The above equation can be rewritten as

where the last line follows from the RKHS property. Similarly,

Since we transform the weights $w$ after every iteration, we only ever need to compute the gradient at $s(w) = 0$.
Thus, finally, we have the expression

samples. These results show that while the performance of other models (including BNN) worsens (lower accuracy, higher confidence on OOD data) as the number of samples decreases, that of our approach, PNCA, is largely unaffected, thus validating its efficacy in small-data settings.

[1] TensorFlow: A system for large-scale machine learning
[3] Anomalous instance detection in deep learning: A survey
[4] Actualmed COVID-19 data
[5] Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
[6] COVID-19 image data collection
[7] Uncertainty in deep learning
[8] Neighbourhood components analysis
[9] Speech recognition with deep recurrent neural networks
[10] ImageNet classification with deep convolutional neural networks
[11] Simple and scalable predictive uncertainty estimation using deep ensembles
[12] MNIST handwritten digit database
[13] Stein variational gradient descent: A general purpose Bayesian inference algorithm
[14] Deep probabilistic kernels for sample-efficient learning
[15] Deep-COVID: Predicting COVID-19 from chest X-ray images using deep transfer learning
[16] Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning
[17] Large dataset for abnormality detection in musculoskeletal radiographs
[18] Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift
[19] Unveiling COVID-19 from chest X-ray with deep learning: a hurdles race with small data
[20] Incorporating Nesterov momentum into Adam
[21] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
[22] Detection of SARS-CoV-2 in different types of clinical specimens
[23] Orthogonal random features
[24] Mix-n-Match: Ensemble and compositional methods for uncertainty calibration in deep learning

This is because $\nabla_{w_1} k_{ij}(u, v) = \nabla_u k_{ij}(u, v) \nabla_{w_1} u + \nabla_v k_{ij}(u, v) \nabla_{w_1} v = \nabla_{w_1} k_{ij}(w_1, w_1) + \nabla_{w_1} k_{ij}$ We can apply the same argument to simplify the terms in (22) that contain gradients with respect to the other weights $w_2, \dots, w_m$ in the same fashion. Therefore,