key: cord-0667874-iahktahr authors: Zhu, Shixiang; Xie, Liyan; Zhang, Minghe; Gao, Rui; Xie, Yao title: Distributionally Robust Weighted $k$-Nearest Neighbors date: 2020-06-07 journal: nan DOI: nan sha: 20dfc5da72a62112185072c4f72dcf179da66f22 doc_id: 667874 cord_uid: iahktahr Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has been focused on developing $k$-nearest neighbor ($k$-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted $k$-nearest neighbors, which aims to find the optimal weighted $k$-NN classifiers that hedge against feature uncertainties. We develop an algorithm, texttt{Dr.k-NN}, that efficiently solves this functional optimization problem and features in assigning minimax optimal weights to training samples when performing classification. These weights are class-dependent, and are determined by the similarities of sample features under the least favorable scenarios. When the size of the uncertainty set is properly tuned, the robust classifier has a smaller Lipschitz norm than the vanilla $k$-NN, and thus improves the generalization capability. We also couple our framework with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared to the state-of-the-art in the few-training-sample setting with various real-data experiments. Machine learning has been proven successful in data-intensive applications but is often hampered when the data set is small. For example, in breast mammography diagnosis for breast cancer screening [Aresta et al.(2019) Aresta, Araújo, Kwok, Chennamsetty, Safwan, Alex, Marami, Prastawa, Chan, Donovan, Fernandez, Zeineh, Kohl, Walz, Ludwig, Braunewell, Baust, Vu, To, Kim, Kwak, Galal, Sanchez-Freire, Brancati, Frucci, Riccio, Wang, Sun, Ma, Fang, Kone, Boulmane, Campilho, Eloy, Polónia, and Aguiar] , the diagnosis of the type of breast cancer requires specialized analysis by pathologists in a highly time-and cost-consuming task and often leads to non-consensual results. As a result, labeled data in digital pathology are generally very scarce; so do many other applications. Figure 2 : An illustrative comparison of Dr.k-NN and vanilla k-NN. Each colored dot is a training sample, where the color indicates its class-membership. The horizontal/vertical bar represents the probability mass of one training sample under the distribution P 1 , P 2 , respectively. In this paper, we aim to tackle the general multi-class classification problem when only very few training samples are available for each class [Wang et al.(2019) Wang, Yao, Kwok, and Ni] . Evidently, k-Nearest Neighbor (k-NN) algorithm [Altman(1992) ] is a natural idea to tackle this problem and shows promising empirical performances. Notable contributions, including seminal work [Goldberger et al.(2005) Goldberger, Hinton, Roweis, and Salakhutdinov] and the follow-up non-linear version [Salakhutdinov & Hinton(2007) Salakhutdinov and Hinton], go beyond the vanilla k-NN and propose neighborhood component analysis (NCA) . NCA learns a distance metric that minimizes the expected leave-one-out classification error on the training data using a stochastic neighbor selection rule. 
Some recent studies in few-shot learning utilize the limited training data using a similar idea, such as matching network [Vinyals et al.(2016) Vinyals, Blundell, Lillicrap, Kavukcuoglu, and Wierstra] and prototypical network [Snell et al.(2017) Snell, Swersky, and Zemel] . They are primarily based on distance-weighted k-NN, which classifies an unseen sample (aka. query) by a weighted vote of its neighbors and uses the distance between two data points in the embedding space as their weights. The classification performance of weighted k-NN critically depends on the choice of weighting scheme. The distance measuring the similarity between samples is typically chosen by metric learning, where a task-specific distance metric is automatically constructed from supervised data [Koch et al.(2015) Koch, Zemel, and Salakhutdinov, Plötz & Roth(2018) Plötz and Roth, Vinyals et al.(2016) Vinyals, Blundell, Lillicrap, Kavukcuoglu, and Wierstra] . However, it has been recognized that they may be not robust to the few-training-samples scenario, where an "outlier" may greatly deteriorate the performance. An example to illustrate this issue is shown in Figure 1 . The training set with only three labeled samples includes two categories we want to classify: mop and Komondor. As we can see, the query image is visually closer to the third sample and thus more likely to be misclassified as a Komondor. Here the third sample in the training set is an "outlier", since it is a Komondor dressing up as a mop, and it misleads the metric learning model to capture irrelevant details (e.g., the bucket and the mop handle) for the Komondor category. Such a problem can become even severe when the sample size is small. The discussion above highlights the importance of choosing a good weighting scheme in weighted k-NN. To develop algorithms that are more robust in the few-training-samples settings, we propose a new formulation of distributionally robust weighted k-nearest neighbors. More specifically, for a given set of features of training samples, we solve a Wasserstein distributionally robust optimization problem that finds the minimax optimal weight functions for the k-nearest neighbors. This infinitedimensional functional optimization over weight functions presents a unique challenge for which existing literature on distributionally robust optimization do not consider. To tackle this challenge, we first consider a relaxed problem that optimizes over all randomized classifiers, which turns out to admit a finite-dimensional convex programming reformulation in spite of being infinite-dimensional (Theorem 1). Next, we show that there is a weighted k-NN classifier achieving the same risk and shares the same least favorable distributions (LFDs) as the robust classifier (Theorem 2). Thereby we prove the optimality of such weighted k-NN classifier for the original distributionally robust weighted k-nearest neighbors problem. Furthermore, we derive the generalization bound of the robust weighted k-NN classifier by relating it to a Lipschitz regularization problem (Theorem 3), and showing that its Lipschitz norm can be smaller than the vanilla k-NN classifier and thus has a better control on the generalization gap (Corollary 1). Based on these theoretical results, we proposed a novel algorithm called Dr.k-NN. Unlike the traditional distance-weighted k-NN that uses the same weight for all label classes, our algorithm introduces a vector of weights, one for each class, for each sample in k-NN and performs a weighted majority vote. 
These weights are determined from the LFDs and reveal the significance of each sample in the worst case, thereby contributing effectively to final decision making. An example is illustrated in Figure 2 . Further, using differentiable optimization [Amos & Kolter(2017) Amos and Kolter, Agrawal et al.(2019)Agrawal, Amos, Barratt, Boyd, Diamond, and Kolter] , we incorporate a neural network into the minimax classifier that jointly learns the feature embedding and the minimax optimal classifier. Numerical experiments show that our algorithm can effectively improve the multi-class classification performance with few training samples on various data sets. Recently, there has been much interest in multi-class classification with few training samples, see [Wang et al.(2019)Wang, Yao, Kwok, and Ni] for a survey. The main idea of our work is related to metric learning [Plötz & Roth(2018) Plötz and Roth, Goldberger et al.(2005) Goldberger, Hinton, Roweis, and Salakhutdinov, Salakhutdinov & Hinton(2007) Salakhutdinov and Hinton], which essentially translates the hidden information carried by the limited data into a distance metric, and has been widely adopted in few-shot learning and meta learning [Finn et al.(2017 )Finn, Abbeel, and Levine, Koch et al.(2015 )Koch, Zemel, and Salakhutdinov, Snell et al.(2017 )Snell, Swersky, and Zemel, Vinyals et al.(2016 )Vinyals, Blundell, Lillicrap, Kavukcuoglu, and Wierstra, Li et al.(2006 . However, unlike few-shot and meta learning, where the goal is to acquire meta knowledge from a large number of observed classes and then predict examples from unobserved classes, we focus on attacking a specific general classification problem where the number of categories is fixed but labeled data are scarce. In this paper, we take a different probabilistic approach to exploit information from the data: we construct an uncertainty set for distributions of each class based on the Wasserstein distance. Wasserstein distributionally robust optimization [Esfahani & Kuhn(2018) Kuhn, Abadeh et al.(2015)Abadeh, Esfahani, and Kuhn, Blanchet & Murthy(2019) Blanchet and Murthy, Gao & Kleywegt(2016) Gao and Kleywegt, Sinha et al.(2017 )Sinha, Namkoong, and Duchi, Blanchet et al.(2019 )Blanchet, Kang, and Murthy, Shafieezadeh-Abadeh et al.(2019 )Shafieezadeh-Abadeh, Kuhn, and Esfahani, Gao et al.(2017 Gao, Chen, and Kleywegt] is an emerging paradigm for statistical learning; see [Kuhn et al.(2019) Kuhn, Esfahani, Nguyen, and Shafieezadeh-Abadeh] for a recent survey. Our work is mostly related to [Gao et al.(2018) Gao, Xie, Xie, and Xu], a framework for Wasserstein robust hypothesis testing, but is different in three important ways. First, we focus on multi-class classification, while [Gao et al.(2018) Gao, Xie, Xie, and Xu] only studied two hypotheses. Second, we focus on directly minimizing the mis-classification error while [Gao et al.(2018) Gao, Xie, Xie, and Xu] used a convex relaxation for the 0-1 loss. Third, we analyze the generalization bound while [Gao et al.(2018) Gao, Xie, Xie, and Xu] does not. Fourth, while [Gao et al.(2018) Gao, Xie, Xie, and Xu] requires sample features as an input, we develop a scalable algorithmic framework to simultaneously learn the optimal feature extractor parameterized by neural networks and robust classifier to achieve the best performance. A recent work [Chen & Paschalidis(2019)Chen and Paschalidis] studies distributionally robust k-NN regression. 
Note that regression and classification are fundamentally different as different performance metrics are used. In [Chen & Paschalidis(2019)Chen and Paschalidis] the objective is to minimize the mean square error, whereas in our work we minimize classification errors. Another well-known work on optimal weighted nearest neighbor binary classifier [Samworth(2012) ] assigns one weight to each sample; the optimal weights minimize asymptotic expansion for the excess risk (regret). In contrast, we consider minimax robust multi-class classification, each training sample is associated with different weights for different classes. In this section, we present our model. We first define weighted k-NN classifier in Section 2.1, then present the proposed framework of distributionally robust k-NN problem in Section 2.2. Let {(x 1 , y 1 ), . . . , (x n , y n )} be a set of training samples, where x i denotes the i-th data sample in the observation space X , and y i ∈ Y := {1, . . . , M } denotes the class (label) of the i-th data sample. Let φ : X → Ξ be a feature extractor that embeds samples to the feature space Ξ (in Section 4.2 we will train a neural network to learn φ). Denote the sample feature vectors and the empirical support as: . . , n, Ξ := {ξ 1 , . . . , ξ n }. S = {(ξ 1 , y 1 ), . . . , (ξ n , y n )}. Define empirical distributions: where δ denotes the Dirac point mass, | · | denotes the cardinality of a set, and I denotes the indicator function. Let π : Ξ → ∆ M be a randomized classifier that assigns class m ∈ {1, . . . , M } with probability π m (ξ) to a query feature vector ξ ∈ Ξ, where ∆ M is the probabilistic simplex ∆ M = {π ∈ R M + : M m=1 π m = 1}. It is worth mentioning that the randomized test is more general than the commonly seen deterministic test. In particular, the random classifier π reduces to the deterministic test if for any ξ, there exists a m such that π m (ξ) = 1. Suppose the features in each class m follows a distribution P m . We define the risk of a classifier π as the total error probabilities 1 Ψ(π; P 1 , . . . , Recall that the vanilla k-NN is performed as follows. Let c : Ξ × Ξ → R + be a metric on Ξ that measures distance between features. For any given query point ξ, let τ S 1 (ξ), . . . , τ S n (ξ) be a reordering of {1, . . . , n} according to their distance to ξ, i.e., c(ξ, ξ τ S i (ξ) ) ≤ c(ξ, ξ τ S i+1 (ξ) ) for all i < n, where the tie is broken arbitrarily. Here the superscript S indicates the dependence on the sample S. In vanilla k-NN, we compute the votes as The vanilla k-NN decides the class for ξ by the majority vote, i.e., accept the class arg max 1≤m≤M p m (ξ). To define a weighted k-NN, let us replace the equal weights in (2) by an arbitrary weight function w m : Ξ × Ξ → R + for each class m = 1, . . . , M : and use a shorthand notation w := (w 1 , . . . , w M ). In the sequel, we define a general tie-breaking rule as follows. For any ξ, denote M 0 (ξ) := arg max 1≤m≤M p m (ξ). When |M 0 (ξ)| > 1, there is a tie at ξ. We denote π m (ξ) as the probability of accepting class m for m ∈ M 0 (ξ) and we have A weighted k-NN classifier involves two parameters: number of nearest neighbors k and weighting scheme w. Particularly, w m (ξ, ξ i ) = I{y i = m} recovers the vanilla k-NN, and w m (ξ, ξ i ) = I{y i = m}/c(ξ, ξ i ) recovers the distance-based weighted k-NN. Note that our definition (3) allows different weighting schemes for different classes, which is more general than the standard weighted k-NN. 
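As a concrete reference, the class-dependent weighted vote in (2)-(3) can be sketched in a few lines of NumPy. The sketch below assumes Euclidean features and represents the weights as a precomputed (M, n) matrix; the function and argument names are illustrative, and query-dependent schemes (such as the distance-based weights) would require the matrix to be recomputed for each query.

```python
import numpy as np

def weighted_knn_vote(query, feats, k, weights):
    """Class-dependent weighted k-NN vote, following (2)-(3).

    query   : (d,)   query feature vector xi
    feats   : (n, d) training feature vectors xi_1, ..., xi_n
    k       : number of nearest neighbors
    weights : (M, n) array with weights[m, i] = w_m(xi, xi_i) >= 0
    Returns the class with the largest weighted vote p_m(xi).
    """
    dists = np.linalg.norm(feats - query, axis=1)   # c(xi, xi_i)
    knn = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    votes = weights[:, knn].sum(axis=1)             # p_m(xi) as in (3)
    return int(np.argmax(votes))                    # tie broken at the smallest index

def vanilla_weights(labels, M):
    """w_m(xi, xi_i) = 1{y_i = m}: recovers the vanilla k-NN majority vote."""
    return np.stack([(labels == m).astype(float) for m in range(M)])
```

The distance-based scheme w_m(ξ, ξ_i) = I{y_i = m}/c(ξ, ξ_i) is obtained by dividing each column of vanilla_weights by the corresponding distance to the query.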
The goal is to find the optimal weighted k-NN classifier π knn (·; k, w) such that the risk Ψ as defined in (1) is minimized. Since the underlying true distributions are unknown, the commonly used loss function is the empirical loss, i.e., substitute the empirical distributions P m into the risk function (1). This leads to the following optimization problem: min 1≤k≤n wm:Ξ×Ξ→R + , 1≤m≤M Ψ(π knn (·; k, w); P 1 , . . . , P M ). It is worth mentioning that this minimization problem is an infinite-dimensional functional optimization, since the weighting schemes w is a function on Ξ × Ξ. For few-training-sample setting, the empirical distributions might not be good estimates for the true distribution since the sample size is small. To hedge against distributional uncertainty, we propose a distributionally robust counterpart of the weighted k-NN problem defined in the previous subsection. Specifically, suppose each class m is associated with a distributional uncertainty set P m , which will be specified shortly. Given P 1 , . . . , P M , define the worst-case risk of a classifier π as the worst-case total error probabilities max Pm∈Pm,1≤m≤M Ψ(π; P 1 , . . . , P M ), Training samples Testing samples End-to-end Learning Framework * ( ) Figure 3 : An overview of the end-to-end learning framework, which consists of two cohesive components: (1) an architecture that is able to produce feature embedding ξ and least favorable distributions P * m for training set; (2) an Dr.k-NN makes decisions for any unseen sample ξ based on the estimated weight vector p m (ξ) (probability mass on least favorable distributions). where Ψ is defined in (1). We consider the following distributionally robust k-NN problem that finds the optimal weighted k-NN classifier minimizing the worst-case risk: Here the optimal solution P * 1 , . . . , P * M to the inner maximization problem is also called least favorable distributions (LFD) in statistics literature [Huber(1965) ]. We summarize the architecture of the proposed distributionally robust k-NN framework in Figure 3 , more details are provided in Section 4. Now we describe the uncertainty set P m . First, since we are going to re-weight the training samples to build the classifier, we restrict the support of every distribution in P m to Ξ, the set of empirical points. Second, the uncertainty set is data-driven, containing the empirical distribution P m and distributions surrounding its neighborhood. Third, to measure the closeness between distributions, we choose the Wasserstein metric of order 1 [Villani (2008)], defined as for any two distributions P and P on Ξ, where the minimization of γ is taken over the set of all probability distributions on Ξ × Ξ with marginals P and P . The main advantage of using Wasserstein metric is that it takes account of the geometry of the feature space by incorporating the metric c(·, ·) in its definition. Given the empirical distribution P m for m = 1, . . . M , we define where P( Ξ) denotes the set of all probability distributions on Ξ; ϑ m ≥ 0 specifies the size of the uncertainty set for the m-th class that specifies the amount of deviation we would like to control. In this section, we analyze the computational tractability and statistical properties of the proposed distributionally robust weighted k-NN classifier found in (6). All proofs are delegated to Appendix A. Observe that similar to (5), the formulation (6) is also an infinite-dimensional functional optimization. 
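For concreteness, the order-1 Wasserstein distance between two discrete distributions on the common support used in (7) is a small transport linear program; the sketch below solves it with scipy, and the function name is illustrative. In the robust formulation this transport problem is never solved on its own: it enters (6) only through the constraints of the reformulation derived next.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(p, q, C):
    """Order-1 Wasserstein distance between two discrete distributions p and q
    (length-n probability vectors) on the same n support points, with ground
    cost C[i, j] = c(xi_i, xi_j).  Solves
        min_{gamma >= 0}  <C, gamma>   s.t.  gamma 1 = p,  gamma^T 1 = q.
    """
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))                 # gamma flattened row-major
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0            # row sums equal p
        A_eq[n + i, i::n] = 1.0                     # column sums equal q
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```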
Let us first relate it to a relaxed robust classification problem, which turns out to be more tractable. Consider the following minimax robust classification problem over all randomized classifiers π (recalling ∆ M is the probability simplex in R M + ): Yet still, (8) is an infinite-dimensional functional optimization, since we are optimizing over the set of all randomized classifiers. We establish the following theorem stating a finite-dimensional convex programming reformulation for the problem (8). Theorem 1. For the uncertainty sets defined in (7), the least favorable distribution of problem (8) can be obtained by solving the following problem: The decision variable γ m ∈ R n×n + can be viewed as a joint distribution on n empirical points with marginal distributions P m and P m , represented by a vector p m ∈ R n + . The inequality constraint controls the Wasserstein distance between P m and P m . Below we give an intuitive explanation for the objective function in (9). Note that max 1≤m ≤M p i m − p i m measures the margin between the maximum likelihood of ξ i among all classes and the likelihood of the m-th class. Thus, the objective in (9) can be equivalently rewritten as minimization of total margin: When M = 2, the total margin reduces to the total variation distance. Also, let y i m ∈ {0, 1} be the class indicator variable of sample ξ i , observe that where the second term on the right side represents the cross-entropy (or negative log-likelihood). Therefore, problem (9) perturbs ( P 1 , . . . , P M ) to LFDs (P * 1 , . . . , P * M ) so as to minimize the total margin as well as an upper bound on cross-entropy of LFDs; the smaller the margin (or cross-entropy) is, the more similar between classes and thus the harder to distinguish among them. In this subsection we study the expressive power of the class of weighted k-NN classifiers The following theorem establishes the equivalence between the original problem (6) and the relaxed robust classification problem (8) studied in Section 3.1. Theorem 2. For the uncertainty set defined in (7), formulations (6) and (8) have identical optimal values. In addition, there exists optimal solutions of (6) and (8) that share common LFDs that are optimal to (9). Theorem 2 implies that the set of weighted k-NN classifiers is exhaustive, in the sense that it achieves the same optimal robust risk as optimizing over the set of all randomized classifiers. In our proof, we show that the weighted 1-NN classifier, with weights equal to the LFDs of (9), is an optimal solution to (6). Therefore, instead of solving (6) directly, by Theorem 1, we can solve the convex program (9) for the LFDs, based on which we construct a robust k-NN classifier. This justifies the Dr.k-NN algorithm to be described in Section 4. Next, we discuss the generalization bound -measured by the population risk under the true distribution -of the proposed distributional robust k-NN framework, by relating (6) and (8) Then by [Gao et al.(2017)Gao, Chen, and Kleywegt] , this problem is upper bounded by the following Lipschitz regularized classification problem where π m Lip := max ξ,ξ∈ Ξ, ξ =ξ is the Lipschitz norm of the function π m for each m = 1, . . . , M . Perhaps surprisingly, the next result shows that (10) and (11) Theorem 3. Formulations (10) and (11) are equivalent, and there exists an optimizer π * of (11) satisfying π * m Lip = λ * m , m = 1, . . . , M , where (λ * m ) m is the optimizer of (10) when π = π * . 
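Here the Lipschitz norm of a classifier component π_m appearing in (11) and in Theorem 3 is taken with respect to the metric c from Section 2.1 over the empirical support; written out in full (a reconstruction from the surrounding argument):

```latex
\|\pi_m\|_{\mathrm{Lip}}
  := \max_{\substack{\xi,\tilde{\xi}\in\widehat{\Xi}\\ \xi\neq\tilde{\xi}}}
     \frac{\bigl|\pi_m(\xi)-\pi_m(\tilde{\xi})\bigr|}{c(\xi,\tilde{\xi})},
  \qquad m=1,\dots,M.
```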
The theorem is proved by using c-transform [Villani (2008)] to show that any optimizer π * m can be modified into a λ * m -Lipschitz classifier while maintaining the optimality. An immediate consequence of Theorem 3 based on is the following generalization bound of (6). Figure 4 : An example of the weights P * 1 , P * 2 , P * 3 yielding from (9) and the corresponding results of Dr.k-NN using a small subset of MNIST (digit 4 (red), 6 (blue), 9 (green) and k = 5). Raw samples are projected on a 2D feature space (d = 2), with the color indicating their true class-membership. In (a)(b)(c), shaded areas indicate the kernel smoothing of P * 1 , P * 2 , P * 3 defined in (13). In (d), big dots represent the training points and small dots represent the query points, and their color depth suggests how likely the sample is being classified into the true category. Corollary 1. There exists an optimal robust 1-NN classifier π knn (·; 1, w * ) with generalization gap controlled by 1 min 1≤m≤M ϑm · R n (Lip(Ξ)), where R n (Lip(Ξ)) is the Rademacher complexity of the 1-Lipschitz functions on Ξ. In comparison, the margin-based generalization gap of the vanilla 1-NN classifier for binary classification is controlled by in the m-th class [von Luxburg & Bousquet(2004) von Luxburg and Bousquet] . Therefore, the generalization gap of the distributionally robust 1-NN classifier can be smaller than that of the vanilla 1-NN classifier by tuning the radii (ϑ m ) m properly. In this section, we present the Distributional robust k-Nearest Neighbor (Dr.k-NN) algorithm, which is a direct consequence of the theoretical justifications in Section 3. Based on Theorem 2, our algorithm contains two steps. Step 1. [Sample re-weighting] For each class m, re-weight n samples using a distribution P * m , where (P * 1 , . . . , P * M ) is the p-component of the minimizer of (9). Step 2. [k-NN ] Given a query point ξ, ordering the training samples according to their distance to ξ: c(ξ, ξ τ S 1 (ξ) ) ≤ c(ξ, ξ τ S 2 (ξ) ) ≤ · · · ≤ c(ξ, ξ τ S n (ξ) ). Compute the weighted k-NN votes, define Decide the class for a query feature point ξ as arg max 1≤m≤M p m (ξ), where the tie is broken according to the rule (4). Figure 4 gives an illustration showing the probabilistic weights (P * 1 , P * 2 , P * 3 ) for three classes and its corresponding decision boundary yielding from the weighted k-NN. Algorithm 1 Learning algorithm for Dr.k-NN Input: S m := {(x i , y i ) : y i = m, ∀i} ⊂ S, m = 1, . . . , M ; Output: The feature mapping φ(·; θ) and the LFD P * 1 , . . . , P * M supported on training samples; Initialization: θ 0 is randomly initialized; n < n is the size of "mini-set"; t = 0; while t < T do for number of mini-sets do Randomly generate M integers n 1 , . . . , n M such that M m=1 n m = n , n m > 0, ∀m; Initialize two ordered sets Ξ = ∅, P = ∅; for m ∈ {1, . . . , M } do X m ← Randomly sample n m points from S m ; Update the probability mass of LFDs P * 1 , . . . , P * M on Ξ by solving (9) given Ξ, P ; end θ t+1 ← θ t − α∇J(θ t ; P * 1 , . . . , P * M ), where α is the learning rate; t ← t + 1; end For the sake of completeness, we also extend our algorithm to non-few-training-sample setting, which is referred to as truncated Dr.k-NN. The key idea is to keep the training samples that are important in deciding the decision boundary based on the maximum entropy principle [Cover & Thomas(2006) Cover and Thomas] . This can be particularly useful for the general classification problem with an arbitrary size of training set. 
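Concretely, Steps 1 and 2 of Dr.k-NN amount to the sketch below: a convex solver (here cvxpy) produces the LFDs of (9), and the resulting per-class probability mass is then used as the weights of the k-NN vote. The objective is written in the equivalent max-form suggested by Lemma 1 and the total-margin discussion in Section 3.1; this max-form, the function names, and the per-class radii being passed as a plain vector are assumptions of the sketch rather than a verbatim transcription of (9).

```python
import numpy as np
import cvxpy as cp

def solve_lfd(C, p_hat, radii):
    """Step 1: least favorable distributions via a convex program in the spirit of (9).

    C     : (n, n) ground costs c(xi_i, xi_j) between training features
    p_hat : (M, n) empirical class distributions P_hat_1, ..., P_hat_M
    radii : length-M Wasserstein radii vartheta_m
    Returns P_star of shape (M, n) with P_star[m, i] = P*_m(xi_i).
    """
    M, n = p_hat.shape
    P = cp.Variable((M, n), nonneg=True)                       # candidate LFDs
    gammas = [cp.Variable((n, n), nonneg=True) for _ in range(M)]
    cons = []
    for m in range(M):
        cons += [cp.sum(gammas[m], axis=1) == P[m],            # marginal: P_m
                 cp.sum(gammas[m], axis=0) == p_hat[m],        # marginal: P_hat_m
                 cp.sum(cp.multiply(C, gammas[m])) <= radii[m]]  # W_1 budget
    # worst-case Bayes risk: sum_i ( sum_m P_m(xi_i) - max_m P_m(xi_i) )
    objective = cp.Maximize(cp.sum(P) - cp.sum(cp.max(P, axis=0)))
    cp.Problem(objective, cons).solve()
    return P.value

def dr_knn_predict(query, feats, P_star, k):
    """Step 2: weighted k-NN vote with weights w_m(., xi_i) = P*_m(xi_i)."""
    dists = np.linalg.norm(feats - query, axis=1)
    knn = np.argsort(dists)[:k]
    votes = P_star[:, knn].sum(axis=1)                         # p_m(query)
    return int(np.argmax(votes))
```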
An illustration ( Figure 6 ) and more details can be found in Appendix B. In this section, we propose a framework that jointly learns the feature mapping and the robust classifier. Let the feature mapping φ(; θ) be a neural network parameterized by θ whose input is a batch of training samples (Figure 3) , and then compose it with an optimization layer that packs the convex problem (9) as an output layer that outputs the LFDs of (8). The optimization layer is adopted from differentiable optimization [Amos & Kolter(2017) To apply the mini-batch stochastic gradient descent, we need to ensure that each batch comprises of multiple "mini-sets", one for each class, containing at least one training sample from each class fed into the convex optimization layer. In light of (9), the objective of our joint learning framework is min θ J(θ; P * 1 , . . . , P * M ), where J(θ; P * 1 , . . . , P * and {P * m (φ(·; θ))} 1≤m≤M are the LFDs generated by the convex solver defined in (9) given input variables {ξ i = φ(x i ; θ)} 1≤i≤n . The algorithm is summarized in Algorithm 1. mini ImageNet CIFAR-10 Omniglot Lung Cancer COVID-19 CT M = 2 M = 5 M = 2 M = 5 M = 2 M = 5 M = 2 M = 5 M = 3 M = 2 K = 5 K = 10 K = 5 K = 10 K = 5 K = 10 K = 5 K = 10 K = 5 K = 10 K = 5 K = 10 K = 5 K = 10 K = 5 K = 10 K = 5 K = 8 K = 5 K = 10 In this section, we evaluate our method and eight alternative approaches on four commonlyused image data sets: MNIST [LeCun & Cortes(2010) Experiment set-up. We compare our method including Dr.k-NN and its truncated version with the following baselines: (1) k-NN based methods with different dimension reduction techniques, including Principal Component Analysis (PCA+k-NN), Singular Value Decomposition (SVD+k-NN), Neighbourhood Components Analysis (NCA+k-NN) [Goldberger et al.(2005)Goldberger, Hinton, Roweis, and Salakhutdinov] , and feature embeddings generated by Dr.k-NN (Feature embedding + k-NN) as a sanity check; (2) matching networks [Vinyals et al.(2016)Vinyals, Blundell, Lillicrap, Kavukcuoglu, and Wierstra]; (3) prototypical networks [Snell et al.(2017)Snell, Swersky, and Zemel] ; (4) MetaOpt-Net [Lee et al.(2019) Lee, Maji, Ravichandran, and Soatto] . To make these methods comparable, we adopt the same naive neural network with a single CNN layer on matching network, prototypical network, and our model, respectively, where the kernel size is 3, the stride is 1 and the width of the output layer is d = 400. In our experiments, we focus on an M -class K-sample (K training samples for each class) learning task. To generate the training data set, we randomly select M classes and for each class we take K random samples. So our training data set contains M K samples overall. We then aim to classify a disjoint batch of unseen samples into one of these M classes. Thus random performance on this task stands at 1/M . We test the average performance of different methods using 1, 000 unseen samples from the same M classes. To obtain reliable results, we repeat each test 10 times and calculate the average accuracy. Other experimental configurations are described as follows: The Adam optimizer [Kingma & Ba(2014) Kingma and Ba] is adopted for all experiments conducted in this paper, where learning rate is 10 −2 . The mini-batch size is 32. The hyper-parameter ϑ m is chosen by cross-validation, which varies from application to application. The differentiable convex optimization layer we adopt is from [Agrawal et al.(2019)Agrawal, Amos, Barratt, Boyd, Diamond, and Kolter] . 
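To illustrate how such a layer can be wired into the feature extractor, here is a minimal sketch using cvxpylayers and PyTorch with the same reconstruction of (9) as above. The toy shapes, the radii, and the loss expression (the worst-case risk evaluated at the returned LFDs) are assumptions of this sketch, and the gradients for this linear-programming layer are the heuristic ones supplied by the underlying differentiable cone solver.

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

def make_lfd_layer(M, n, radii):
    """Differentiable layer mapping (pairwise costs, empirical distributions)
    to least favorable distributions, following the reconstruction of (9)."""
    P = cp.Variable((M, n), nonneg=True)
    gammas = [cp.Variable((n, n), nonneg=True) for _ in range(M)]
    C = cp.Parameter((n, n))          # pairwise costs c(phi(x_i), phi(x_j))
    Phat = cp.Parameter((M, n))       # empirical class distributions
    cons = []
    for m in range(M):
        cons += [cp.sum(gammas[m], axis=1) == P[m],
                 cp.sum(gammas[m], axis=0) == Phat[m],
                 cp.sum(cp.multiply(C, gammas[m])) <= radii[m]]
    obj = cp.Maximize(cp.sum(P) - cp.sum(cp.max(P, axis=0)))
    return CvxpyLayer(cp.Problem(obj, cons), parameters=[C, Phat], variables=[P])

# Toy usage on one "mini-set"; feats stands in for phi(x; theta).
M, n, d = 3, 12, 64
layer = make_lfd_layer(M, n, radii=[0.1] * M)
feats = torch.randn(n, d, requires_grad=True)
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
phat = torch.stack([(labels == m).float() for m in range(M)])
phat = phat / phat.sum(dim=1, keepdim=True)
diff = feats.unsqueeze(1) - feats.unsqueeze(0)
cost = torch.sqrt((diff ** 2).sum(-1) + 1e-9)          # smoothed Euclidean costs
P_star, = layer(cost, phat)                            # LFDs, differentiable in feats
loss = P_star.sum() - P_star.max(dim=0).values.sum()   # worst-case risk J
loss.backward()                                        # gradients reach the features
```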
To make all approaches comparable, we use the same network structure in matching network, prototypical network, and MetaOptNet as we described above. We use the Euclidean distance c(ξ, ξ ) = ξ − ξ 2 throughout our experiment. All experiments are performed on Google Colaboratory (Pro version) with 12GB RAM and dual-core Intel processors, which speed up to 2.3 GHz (without GPU). Results. We present the average test accuracy in Table 1 for the unseen samples with different M = 2, 5 and K = 5, 10 on small subsets of MNIST, mini ImageNet, CIFAR-10, Omniglot, and with M = 3 and K = 5, 8 on lung cancer data. Note that random performance for two-class and five-class classifications are 0.5 and 0.2, respectively. The figures in Table 1 show that Dr.k-NN (k = 5) outperforms other baselines in terms of the average test accuracy on all data sets. We note that 95% confidence interval of our method's performance on all the data sets is smaller than 0.08. The truncated Dr.k-NN also yields competitive results using only 20% training samples (τ = 0.9), compared to standard Dr.k-NN. To confirm that the proposed learning framework will affect the distribution of hidden representation of data points, we show the training and query samples in a 2D feature space and the corresponding decision boundary in Figure 5 . It turns out our framework finds a better feature representation in the 2D space with a smooth decision boundary and a reasonable decision confidence map (indicated by the color depth in Figure 5 (a) ). Comparison to kernel smoothing. We also compare with an approach using kernel-smoothing of the LFDs (in contrast to using k-NN) for performing classification. Consider a Gaussian kernel Step 2 of Dr.k-NN with the following We evaluate both methods on a subset of MNIST, which contains 1,000 testing samples (small dots) and 20 training samples (large dots) from two categories (indexed by blue and red, respectively). As shown in Figure 7 and Figure 8 (Appendix C), our experimental results have shown the importance of using k-NN in our proposed algorithm, where the Dr.kNN significantly outperforms the parallel version using kernel smoothing (even after the kernel bandwidth being optimized). We find that the performance when using kernel smoothing (13) heavily depends on selecting an appropriate kernel bandwidth h as illustrated by Figure 8 . Moreover, the best kernel bandwidth may vary from one dataset to another. Therefore, the cross-validation is required to be carried out to find the best kernel bandwidth in practice, which is quite time-consuming. In contrast, choosing the hyper-parameter k is an easy task, since we only have limited choices of k in few-training-sample scenario and the performance of Dr.k-NN is insensitive to the choices of k (see Figure 7 ). We propose a distributionally robust k-NN classifier (Dr.k-NN) for tackling the multi-class classification problem with few training samples. To make a decision, each neighboring sample is weighted according to least favorable distributions resulting from a distributionally robust problem. As shown in the theoretical results and demonstrated by experiments, our methods achieve outstanding performance in classification accuracy compared with other baselines using minimal resources. The robust classifier layer (9) serves an alternative to the usual softmax layer in a neural network for classification, and we believe it is promising for other machine learning tasks. The proof of Theorem 1 is based on the following two lemmas. 
Furthermore, the optimal classifier π * satisfies that for any ξ i ∈ Ξ, . This lemma gives a closed-form expression for the risk of the optimal classifier if P 1 , . . . , P M are known, and shows that the optimal decision π * accepts the class with the maximum likelihood. Moreover, when there is a tie (i.e., the set arg max 1≤m≤M P m (ξ) is not singleton), the optimal decision π * can break the tie arbitrarily. Proof of Lemma 1. We here prove a more general result for an arbitrary sample space Ξ. Note that each P m , 1 ≤ m ≤ M , is absolutely continuous with respect to P 1 + · · · + P M , hence the Radon-Nikodym derivative dPm d(P 1 +···+P M ) exists. Using the interchangeability principle [Shapiro et al.(2014)Shapiro, Dentcheva, and Ruszczyński] that enables us to exchange the minimization and integration, we have (1 − π m (ξ)) dPm d(P 1 +···+P M ) (ξ) d(P 1 + · · · + P M ) where the first equality is obtained by plugging in the definition of Ψ in (1); the second equality is due the interchangeability principle; and the last equality holds because for any ξ, the inner minimization attains its minimum at one of the vertices of ∆ M . More specifically, note that for each ξ, the objective of the inner minimization problem equals to 1 − M m=1 π m dPm d(P 1 +···+P M ) (ξ). Under the constaint that π ∈ ∆ M , i.e., M m=1 π m = 1, we have: and the equality holds when π are chosen such that . If there is a single maximum in { dPm d(P 1 +···+P M ) (ξ), 1 ≤ m ≤ M }, say at index m * , then this simply implies that the optimal π is chosen as π m * = 1 and π m = 0 for m = m * . If we substitute Ξ with the empirical support Ξ, the above formulation in Equation (14) translates into min π: Ξ→∆ M Ψ(π; P 1 , . . . , therefore the lemma is proved. Lemma 2. For the uncertainty sets defined in (7), the problem max Pm∈Pm,1≤m≤M ψ(P 1 , . . . , P M ) is equivalent to (9). Proof of Lemma 2. Recall that the Wasserstein metric of order 1 is defined as for any two distributions P and P on Ξ, where the minimization of γ is taken over the set of all probability distributions on Ξ × Ξ with marginals P and P , i.e., the set where P(Ξ × Ξ) denotes the joint probability distributions on Ξ × Ξ. Therefore, the Wasserstein metric W(P, P ) can be rewritten as By the definition of uncertainty sets in (7) which contains discrete distributions supported on Ξ, we can introduce additional variables γ m ∈ R n×n + which represents the distribution on Ξ × Ξ, with marginals P m ∈ P m and P m , for 1 ≤ m ≤ M . For any ξ i , ξ j ∈ Ξ, let γ i,j m denotes γ m (ξ i , ξ j ) for simplicity. Thus the objective function in the above reformualtion of W(P m , P m ) is Thereby the problem max Pm∈Pm,1≤m≤M ψ(P 1 , . . . , P M ) is equivalent to the convex optimization formulation in (9). Proof to Theorem 1. By Lemmas 1 and 2, we have To prove Theorem 1, it remains to verify the validity of exchanging max and min. We identify π as (π 1 , . . . , π n ), where π i ∈ R M + satisfies M m=1 π i m = 1. Similar to the proof of Lemma 2, P m , 1 ≤ m ≤ M , can also be identified as a vector in R n . Note that the objective function Ψ(π; P 1 , . . . , P M ) is linear in (π 1 , . . . , π n ) and concave in (P 1 , . . . , P M ), and the Slater condition holds. Hence applying convex programming duality we can exchange max and min and thus the result follows. It is worth mentioning that the optimal solution π * and corresponding LFDs P * 1 , . . . , P * M always exist since they are solutions to a saddle point problem. 
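For reference, the closed-form minimal risk established in Lemma 1 can be written out on the empirical support as follows (a reconstruction from the argument above):

```latex
\min_{\pi:\widehat{\Xi}\to\Delta_M} \Psi(\pi; P_1,\dots,P_M)
  \;=\; \sum_{\xi\in\widehat{\Xi}}
        \Bigl( \sum_{m=1}^{M} P_m(\xi) \;-\; \max_{1\le m\le M} P_m(\xi) \Bigr),
```

with the minimum attained by any classifier that, at each ξ, places all of its probability mass on arg max_{1≤m≤M} P_m(ξ), breaking ties arbitrarily.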
Next we verifyπ is a feasible classifier, i.e., it satisfies 0 ≤π(ξ) ≤ 1 and M m=1π m (ξ) = 1, ∀ξ. First, by definition, π * m (ξ) ≥π m (ξ), ∀ξ ∈ Ξ, thereforeπ m (ξ) ≤ 1 and M m=1π m (ξ) ≤ 1. If we are able to show that M m=1π m (ξ) ≥ 1, ∀ξ ∈ Ξ, then we can showπ is indeed a feasible classifier. To show (16), first note that if π * m (ξ) = 0, then we have by definitionπ m (ξ) = 0. Moreover, for anyξ ∈ Ξ, there is a set M 0 ⊂ {1, . . . , M } such that for all m ∈ M 0 , π * m (ξ) > 0, and the worst-case distribution P * m transports probability mass from supp P m toξ, which suggests that there exitŝ ξ m ∈ supp P m such thatξ ∈ arg max It follows from the definition of φ m that Thereby we have shown (16). The proof is completed by noting that the optimal solutionπ satisfies π m Lip ≤ λ * m and thus v dual ≥ v Lip . Combine with the previous result that v dual ≤ v Lip , we have shown v dual = v Lip and the proof is completed. For the sake of completeness, we extend our algorithm to non-few-training-sample setting. This can be particularly useful for the general classification problem with an arbitrary size of training set. In fact, k-NN methods notoriously suffer from computational inefficiency if the number of labeled samples n is large, since it has to store and search through the entire training set [Goldberger et al.(2005) Goldberger, Hinton, Roweis, and Salakhutdinov]. The main idea is to only keep the training samples that are important in deciding the decision boundary based on the maximum entropy principle [Cover & Thomas(2006) Cover and Thomas] . As a measure of importance, we choose the samples with the largest entropy across all categories, based on the intuition that the samples with higher entropy has larger uncertainty and will be more useful for classification purposes since they tend to lie on the decision boundary. The entropy of a sample is defined as follows. Consider a random variable which takes value m with probability π m , M i=1 π m = 1; then the entropy of this random variable is define as H(π 1 , . . . , π M ) = − M m=1 π m log π m . As a simple example, for Bernoulli random variable (which can represent, e.g., the outcome for flipping a coin with bias p), the entropy function is H(p) = −p log p − (1 − p) log(1 − p), and it is a concave function achieving the maximum at p * = 1/2, which means that the fair-coin has the maximum entropy; this is intuitive as indeed the outcome of a fair coin toss is the most difficult to predict. Now we use this entropy to define the "uncertainty" associated with each training points. With a little abuse of notation, define H( ξ) := H(π 1 ( ξ), . . . , π M ( ξ)). Denote the minimal and maximal entropy of all the training points as H min = min{H( ξ), ξ ∈ Ξ}, H max = max{H( ξ), ξ ∈ Ξ}. Define the τ -truncated training set as Ξ = { ξ ∈ Ξ : (H( ξ) − H min )/(H max − H min ) ≥ τ }, ∀τ ∈ [0, 1]. The truncated Dr.k-NN is obtained similarly as Step 2 of Dr.k-NN by restricting the training set Ξ only to the samples in Ξ (samples with larger entropy). Figure 6 reveals that the most informative samples usually lie in between categories. We can see that a truncated Dr.k-NN classifier with τ = 0.9 only uses 20% samples with little performance loss. More experimental details is presented in Section 5. 
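A minimal sketch of the entropy-based selection above is given below, assuming the per-point class probabilities (π_1(ξ_i), ..., π_M(ξ_i)) have already been computed as described in Appendix B; the function name and the numerical guards are illustrative.

```python
import numpy as np

def truncated_training_mask(probs, tau):
    """Entropy-based truncation of the training set (truncated Dr.k-NN).

    probs : (M, n) array; column i holds (pi_1(xi_i), ..., pi_M(xi_i))
    tau   : threshold in [0, 1]
    Keeps the training points whose normalized entropy
    (H(xi_i) - H_min) / (H_max - H_min) is at least tau.
    """
    p = np.clip(probs, 1e-12, 1.0)                  # avoid log(0)
    entropy = -(p * np.log(p)).sum(axis=0)          # H(xi_i)
    h_min, h_max = entropy.min(), entropy.max()
    scaled = (entropy - h_min) / max(h_max - h_min, 1e-12)
    return scaled >= tau                            # boolean mask over training points
```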
Distributionally robust logistic regression Differentiable convex optimization layers An introduction to kernel and nearest-neighbor nonparametric regression Grand challenge on breast cancer histology images Quantifying distributional model risk via optimal transport Robust Wasserstein profile inference and applications to machine learning Selecting optimal decisions via distributionally robust nearest-neighbor regression Elements of Information Theory UCI machine learning repository: Lung cancer data set Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations Model-agnostic meta-learning for fast adaptation of deep networks Distributionally robust stochastic optimization with Wasserstein distance Robust hypothesis testing using Wasserstein uncertainty sets Neighbourhood components analysis A robust version of the probability ratio test A method for stochastic optimization Siamese neural networks for one-shot image recognition Cifar-10 (canadian institute for advanced research) Wasserstein distributionally robust optimization: Theory and applications in machine learning Human-level concept learning through probabilistic program induction MNIST handwritten digit database Meta-learning with differentiable convex optimization One-shot learning of object categories Neural nearest neighbors networks Learning a nonlinear embedding by preserving class neighbourhood structure Optimal weighted nearest neighbour classifiers Regularization via mass transportation Lectures on stochastic programming: modeling and theory. SIAM Prototypical networks for few-shot learning Optimal transport: old and new Matching networks for one shot learning Distance-based classification with lipschitz functions Generalizing from a few examples: A survey on few-shot learning A CT scan dataset about COVID-19 Proof of Theorem 2. On the one hand, since π knn (·; k, w) can be regarded as a special case of the general classifier π : Ξ → ∆ M , it holds that On the other hand, by Lemma 2, there exists an optimal solution to the minimax problem (8), denoted as (P * 1 , . . . , P * M ), and the optimal classifier π * as given in Lemma 1. Note that there exists 1 ≤ k * ≤ n and weight functions w * 1 , . . . , w * M such thatfor example, by taking k * = 1 and w * m = P * m , 1 ≤ m ≤ M . This implies thatThereby we have shown that formulations (6) and (8) have identical optimal values. Moreover, by the strong duality results in Theorem 1, we know (π * ; P * 1 , . . . , P * M ) is the saddle point for the formulation (8), and by the above arguments, we see that (π knn ; P * 1 , . . . , P * M ) leads to the same optimal value for formulation (6) as (π * ; P * 1 , . . . , P * M ) for formulation (8). Therefore, we show that π knn is indeed the optimal solution to (6). Proof of Theorem 3. We first show the equivalence between the Lipschitz regularized problem (11) and the minimax problem (8). Denote by v Lip the optimal value of (11) and v dual the optimal value of (10).Observe that if π m Lip ≤ λ m , thenIf we can show v dual ≥ v Lip , then we prove the equivalence between (11) and (10), thus the equivalence between (11) and (8). Let (π * ; λ * 1 , . . . , λ * M ) be a dual minimizer of problem (10), whose existence is ensured by [Gao & Kleywegt(2016) Gao and Kleywegt] .DefineThen it follows thatThen by definition, π m Lip ≤ λ * m . Indeed, for any ξ,ξ ∈ Ξ, there exists a ξ 0 ∈ arg maxξ ∈ Ξ {1 − λ * m c(ξ, ξ) − φ m (ξ)} such that:. Recall that (π * ; λ * 1 , . . 
. , λ * M ) is a dual minimizer of problem (10). Since v dual is the minimum value, this means that if π̃ is a feasible solution, then it is also an optimal solution to (10). Proof of Corollary 1. By the proof of Theorem 3, there exists a classifier π̃ satisfying ‖π̃ m ‖ Lip ≤ λ * m that is optimal for problems (11) and (10), and thus is an optimal robust classifier for problem (8). Moreover, by the proof of Theorem 2, the set of weighted k-NN classifiers is exhaustive, so there exist weight functions w * = {w * 1 , . . . , w * M } such that π knn (·; 1, w * ) is equivalent to π̃. Therefore, π knn (·; 1, w * ) is an optimal solution satisfying ‖π knn m (·; 1, w * )‖ Lip ≤ λ * m , m = 1, . . . , M. It was shown in [von Luxburg & Bousquet(2004) von Luxburg and Bousquet] that the generalization gap of Lipschitz classifiers is bounded by the corresponding Rademacher complexity. In our setting, a direct consequence of [von Luxburg & Bousquet(2004) von Luxburg and Bousquet] is that the generalization gap of π knn (·; 1, w * ) is controlled by max 1≤m≤M λ * m · R n (Lip(Ξ)), where max 1≤m≤M λ * m denotes the maximum Lipschitz norm of the classifiers {π knn m } m and R n (Lip(Ξ)) denotes the Rademacher complexity of the class of 1-Lipschitz functions on the sample space Ξ. Moreover, the optimal dual minimizer λ * m satisfies λ * m ≤ 1/ϑ m for all m; indeed, if λ * m > 1/ϑ m , then the m-th term of the objective in (10) exceeds 1, which cannot be optimal. Thereby we have max 1≤m≤M λ * m ≤ 1/ min 1≤m≤M ϑ m , which yields the bound stated in Corollary 1 and completes the proof.