title: Classification with Strategically Withheld Data
authors: Krishnaswamy, Anilesh K.; Li, Haoming; Rein, David; Zhang, Hanrui; Conitzer, Vincent
date: 2020-12-18

Machine learning techniques can be useful in applications such as credit approval and college admission. However, to be classified more favorably in such contexts, an agent may decide to strategically withhold some of her features, such as bad test scores. This is a missing data problem with a twist: which data is missing depends on the chosen classifier, because the specific classifier is what may create the incentive to withhold certain feature values. We address the problem of training classifiers that are robust to this behavior. We design three classification methods: Mincut, Hill-Climbing (HC) and Incentive-Compatible Logistic Regression (IC-LR). We show that Mincut is optimal when the true distribution of data is fully known. However, it can produce complex decision boundaries, and hence be prone to overfitting in some cases. Based on a characterization of truthful classifiers (i.e., those that give no incentive to strategically hide features), we devise a simpler alternative called HC, which consists of a hierarchical ensemble of out-of-the-box classifiers, trained using a specialized hill-climbing procedure which we show to be convergent. For several reasons, Mincut and HC are not effective in utilizing a large number of complementarily informative features. To address this, we present IC-LR, a modification of logistic regression that removes the incentive to strategically drop features. We also show that our algorithms perform well in experiments on real-world data sets, and present insights into their relative performance in different settings.

Applicants to most colleges in the US are required to submit their scores for at least one of the SAT and the ACT. Both tests are more or less equally popular, with close to two million taking each in 2016 [Adams, 2017]. Applicants usually take one of these two tests -- whichever works to their advantage. However, given the growing competitiveness of college admissions, many applicants now take both tests and then strategically decide whether to drop one of the scores (if they think it will hurt their application) or report both. The key issue here is that it is impossible to distinguish between an applicant who takes both tests but reports only one, and an applicant who takes only one test -- for example because the applicant simply took the one required by her school, the dates for the other test did not work with her schedule, or for other reasons that are not strategic in nature.

Say a college wants to take a principled machine learning approach to making admission decisions based on the scores from these two tests. For simplicity, assume no other information is available. Assume that the college has enough historical examples that contain the scores of individuals (on whichever tests are taken, truthfully reported) along with the corresponding ideal (binary) admission decisions. Based on this data, the college has to choose a decision function that determines which future applicants are accepted. If this function is known to the applicants, they are bound to strategize and use their knowledge of the decision function to decide the scores they report.
How can the classifier be trained to handle strategic reporting of scores at prediction time? To see the intricacies of this problem, let us consider a simple example.

Example 1. Say the scores for each of the two tests (SAT and ACT) take one of two values: h (for high) or l (for low). Let * denote a missing value. Then there are eight possible inputs (excluding (*, *) since at least one score is required): (h, h), (h, l), (l, h), (l, l), (h, *), (*, h), (l, *) and (*, l). Assume the natural distribution (without any withholding) over these inputs is known, and so are the conditional probabilities of the label Y ∈ {0, 1}, as shown below.

Table 1: True distribution of inputs and targets:

    X              (h,h)  (h,l)  (l,h)  (l,l)  (h,*)  (*,h)  (l,*)  (*,l)
    Pr(X)           1/8    1/8    1/8    1/8    1/8    1/8    1/8    1/8
    Pr(Y = 1 | X)   0.9    0.7    0.3    0.1    0.6    0.6    0.2    0.2
    Pr(Y = 0 | X)   0.1    0.3    0.7    0.9    0.4    0.4    0.8    0.8

Assume Y = 1 is the more desirable "accept" decision. Then, ideally, we would like to predict Y = 1 whenever X ∈ {(h, h), (h, l), (h, *), (*, h)}. However, the strategic reporting of scores at prediction time effectively means, for example, that an input (*, h) cannot be assigned the accept decision of Y = 1 unless the same is done for (l, h) as well; otherwise, someone with (l, h) would simply not report the first test, thereby misreporting (*, h) and being accepted. Taking this into account, the classifier with minimum error is given by Y = 1 whenever X ∈ {(h, h), (h, l), (h, *)}.

There are many other settings where a similar problem arises. Many law schools now allow applicants to choose between the GRE and the traditional LSAT. Recently, as a result of the COVID-19 pandemic, universities have implemented optional pass/fail policies, where students can choose to take some or all of their courses for pass/fail credit, as opposed to a standard letter grade that influences their GPA. They are often able to decide the status after already knowing their performance in the course. For credit scoring, some individuals might not report some of their information, especially if it is not mandatory by law [Florez-Lopez, 2010].

The ability of strategic agents to withhold some of their features at prediction time poses a challenge only when the data used to train the classifier has some naturally missing components to begin with. For if not, the principal -- e.g., the entity deciding on admissions -- can reject all agents that withhold any of their features, thereby forcing them to reveal all features. We focus on how a principal can best train classifiers that are robust even when there is strategic withholding of data by agents. Our methods produce classifiers that eliminate the incentive for agents to withhold data.

Our contributions. We now describe the key questions we are facing, and how we answer them. Our model is described formally in Section 2. All proofs are in the Supplement.

If the true input distribution is known, can we compute the optimal classifier? (Section 3) We answer this question in the affirmative by showing that the problem of computing the optimal classifier (Theorem 1) in this setting reduces to the classical Min-cut problem [Cormen et al., 2009]. This analysis gives us the MINCUT classifier, which can be computed on the empirical distribution, estimated using whatever data is available. However, since it can potentially give complex decision boundaries, it might not generalize well. Are there simpler classifiers that are robust to strategic withholding of features?
(Section 4) We first characterize the structure of classifiers that are "truthful", i.e., give no incentive to strategically hide features at prediction time (Theorem 2). Using this characterization, we devise a hill-climbing procedure (HC) to train a hierarchical ensemble of out-of-the-box classifiers, and show that the procedure converges (Theorem 4) as long as we have black-box access to an agnostic learning oracle. We also analytically bound the generalization error of HC (Theorem 3). The ensemble of HC can be populated with any of the commonly used classifiers, such as logistic regression, ANNs, etc.

Another truthful classifier we present is a modification of logistic regression. This method, called IC-LR (Incentive-Compatible Logistic Regression), works by encoding all features with positive values and using positive regression coefficients -- whereby it is in every agent's best interest to report all features truthfully. IC-LR uses projected gradient descent for its training. The advantage of this method is that it can be directly applied to a large number of features.

How do our methods perform on real data sets? (Section 6) We conduct experiments on several real-world data sets to test the performance of our methods, comparing them to each other, as well as to other methods that handle missing data but ignore the strategic aspect of the problem. We see that our methods perform well overall, and uncover some interesting insights on their relative performance:
1. When the number of features is small, HC is the most reliable across the board.
2. When the number of features is small, and many of them are discrete/categorical (or suitably discretized), MINCUT and IC-LR perform better.
3. If a large number of features must be used, IC-LR gives the best performance, although HC performs reasonably well with some simple feature selection techniques.

Related work. Our work falls broadly in the area of strategic machine learning, wherein a common assumption is that strategic agents can modify their features (i.e., misreport) in certain ways (normally at some cost), either to improve outcomes based on the classifier chosen by the principal [Hardt et al., 2016] or to influence which classifier is chosen in the first place [Dekel et al., 2010a]. The main challenge in strategic machine learning, as in this paper, is the potential misalignment between the interests of the agents and the principal. Existing results in this line of work [Chen et al., 2018, Kleinberg and Raghavan, 2019, Haghtalab et al., 2020], often mainly theoretical, consider classifiers of a specific form, say linear, and ways of misreporting or modifying features in that context. Our results are different in that we focus on a specific type of strategic misreporting, i.e., withholding parts of the data, and devise general methods that are robust to this behavior and that, in addition to having theoretical guarantees, can be tested practically. Some experimental results [Hardt et al., 2016] do exist -- but our work is quite different; for instance, we do not need to invent a cost function (as in Hardt et al. [2016]). Another major difference is that we consider generalization in the presence of strategic behavior, while most previous work does not (except for a concurrent paper [Zhang and Conitzer, 2021], which studies the sample complexity of PAC learning in the presence of strategic behavior).
Our problem can also be viewed as an instance of automated mechanism design with partial verification [Green and Laffont, 1986, Yu, 2011, Kephart and Conitzer, 2015, 2016], where it is typically assumed that the feature space (usually called type space in mechanism design) is discrete and has reasonably small cardinality, and a prior distribution is known over the feature space. In contrast, the feature spaces considered in this paper consist of all possible combinations of potentially continuous feature values. Moreover, the population distribution can only be accessed by observing examples. Thus, common methodologies in automated mechanism design do not suffice for our setting. A set of closely related (in particular, to Theorem 1) theoretical results are those of Zhang et al. [2019a,b, 2021b] on the problem of distinguishing "good" agents from "bad" ones (where each produces a different distribution over a sample space, and the agent can misreport the set of n samples that she has drawn). However, our work is different in that we consider the standard classification problem, we focus more on practical aspects, and we do not rely on full knowledge of the input distribution. Our work also finds a happy intersection between strategic machine learning and the literature on classification with missing data [Marlin, 2008]. The problem we study is also connected to adversarial classification [Dalvi et al., 2004, Dekel et al., 2010b]. We discuss these connections in more detail in the Supplement.

We now describe our model and the requisite notation.

Model with strategically withheld features: We have an input space X, a label space Y = {0, 1}, and a distribution D over X × Y which models the population. A classifier f : X → Y maps a combination of features to a label. Let F = [k] = {1, ..., k} be the set of features, each of which a data point may or may not have. For x ∈ X, let x_i denote the value of its i-th feature (x_i = * if x does not have feature i ∈ [k]). For any S ⊆ [k], define x|_S to be the projection of x onto S (i.e., retain features in S and drop those not in S): (x|_S)_i = x_i if i ∈ S, and (x|_S)_i = * otherwise.

We assume that data can be strategically manipulated at prediction (test) time in the following way: an agent whose true data point is x can report any other data point x′ such that x|_S = x′ for some S ⊆ [k]. We use → to denote the relation between any such pair x, x′ (x → x′ if and only if ∃S ⊆ [k] : x|_S = x′). Note that → is transitive, i.e., for any x_1, x_2, x_3 ∈ X, x_1 → x_2 and x_2 → x_3 imply x_1 → x_3. We assume agents prefer label 1 to 0: in response to a classifier f, an agent with data point x will always withhold features to receive label 1 if possible, i.e., the agent will report x′ ∈ argmax_{x′ : x → x′} f(x′). Incorporating such strategic behavior into the loss of a classifier f, we get ℓ_D(f) = Pr_{(x,y)∼D}[max_{x′ : x → x′} f(x′) ≠ y].

Truthful classifiers: We will also be interested in truthful classifiers, which provably eliminate incentives for such strategic manipulation. A classifier f is truthful if for any x, x′ ∈ X where x → x′, f(x) ≥ f(x′). (In practice, f might not be perfectly known, and agents might not be able to best respond. This problem does not arise for our methods, since they are truthful. For other classifiers, their accuracy may go up or down if agents fail to best-respond; but the assumption that agents best-respond is common in many such contexts.) In other words, not withholding any features is always an optimal way to respond to a truthful classifier.
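To make the reporting model concrete, here is a minimal Python sketch (ours, not the authors' code) of the projection x|_S, the reporting relation →, and an agent's best response to a classifier f. Data points are encoded as tuples with None standing in for *, and the helper names are our own.

```python
from itertools import combinations

STAR = None  # stands for a missing feature value (written * in the text)

def project(x, S):
    """x|_S: keep the features in S, replace the rest with *."""
    return tuple(v if i in S else STAR for i, v in enumerate(x))

def reports(x):
    """All x' with x -> x', i.e., all data points obtainable from x by withholding features."""
    present = [i for i, v in enumerate(x) if v is not STAR]
    return {project(x, set(S))
            for r in range(len(present) + 1)
            for S in combinations(present, r)}

def best_response(x, f):
    """Agents prefer label 1, so they report some x' with x -> x' that maximizes f(x')."""
    return max(reports(x), key=f)

# Example: a classifier that accepts whenever the first feature equals 'h'.
f = lambda x: 1 if x[0] == 'h' else 0
for x in [('h', 'l'), ('l', 'h')]:
    print(x, '->', f(best_response(x, f)))   # ('h','l') is accepted, ('l','h') is not
```

For a truthful classifier, reporting x itself is always among the maximizers in best_response, which is exactly the incentive property discussed above.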
As a result, the loss of any truthful classifier f in the presence of strategically withheld features has the standard form: ℓ_D(f) = Pr_{(x,y)∼D}[f(x) ≠ y]. Note that the so-called Revelation Principle -- which states that in the presence of strategic behavior, any classifier f is equivalent to a truthful classifier f′ -- holds in this case because the reporting structure is transitive. In other words, we are guaranteed that, for any classifier f, there exists a truthful classifier f′ such that for any x ∈ X, max_{x′ : x → x′} f(x′) = f′(x). Therefore, we focus on truthful classifiers in our model, without loss of generality.

We first present a method for computing an optimal classifier when the input distribution is fully known. Assuming X is finite, our goal is to characterize a classifier f* which minimizes the loss ℓ_D(·), for a known input distribution D. As shorthand, define, for all x ∈ X, D+(x) := Pr_{(x′,y)∼D}[x′ = x, y = 1] and D−(x) := Pr_{(x′,y)∼D}[x′ = x, y = 0].

The basic idea here is simple: partition X into two sides, one labeled 1 and the other 0, where the error accrued for each x ∈ X is given by D−(x) or D+(x), depending on whether x is labeled 1 or 0. Such a partition should crucially respect the constraints imposed by the strategic behavior of agents: if x → x′, then either x is labeled 1 or x′ is labeled 0.

Definition 2. Given X and D, let G(D, X) be a directed capacitated graph with vertices V = X ∪ {s, t}, where the edges E and edge capacities u are defined as follows:
• For each x ∈ X, there are edges (s, x) and (x, t) in E, with capacities u(s, x) = D−(x) and u(x, t) = D+(x).
• For all pairs x, x′ ∈ X such that x → x′, there is an edge (x, x′) ∈ E with capacity u(x, x′) = ∞.

In terms of the graph defined above, computing the optimal classifier f* we seek is equivalent to finding a minimum s-t cut on G(D, X). The intuition is that the edges from s and to t reflect the value gained from labeling an example 0 or 1, respectively; one of the two edges must be cut, reflecting the loss of not assigning it to the corresponding side. Moreover, if x → x′, then the corresponding edge with infinite capacity prevents assigning 0 to x and 1 to x′.

Theorem 1. If (S, S̄) is a minimum s-t cut of G(D, X) (where S is on the same side as s), then for the classifier f*(x) := 1(x ∈ S̄), we have ℓ_D(f*) = min_f ℓ_D(f).

We note that, consequently, the optimal classifier can be computed in poly(|X|) time. In practice, it is natural to expect that we do not know D exactly, but have a finite number of samples from it. A more practical option is to apply Theorem 1 to the empirical distribution induced by the samples observed, and hope for the classifier computed from that to generalize to the true population distribution D.

Implementing MINCUT: Given a set X̂ of m i.i.d. samples from D, let D̂ be the corresponding empirical distribution over X̂, and let X̄ := X̂ ∪ {x : x → x′ for some x′ ∈ X̂}. The MINCUT classifier is then obtained by applying Theorem 1 to G(D̂, X̂), and extending it to X̄ as and when required. Here, note that MINCUT runs in time poly(m) (and not poly(|X|)), since G(D̂, X̂) has m nodes, and checking whether a test point is in X̄ takes poly(m) time. In light of traditional wisdom, the smaller m is relative to the size of X, the larger the generalization error of MINCUT will be. We do not attempt a theoretical analysis in this regard, but note that when X is large, the generalization error can be extremely large (see Example 2 in the Supplement). The reason for this is two-fold: 1. MINCUT can give complicated decision boundaries. 2. MINCUT is indecisive on samples not in X̄.
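As an illustration (ours, not the authors' implementation), the sketch below builds the graph G(D, X) of Definition 2 with networkx for the distribution of Example 1 and reads the optimal truthful classifier off a minimum s-t cut. It reuses STAR and reports(...) from the earlier sketch; edges added without a capacity attribute are treated by networkx as having infinite capacity.

```python
import networkx as nx
# reuses STAR and reports(...) from the earlier sketch

X = [('h', 'h'), ('h', 'l'), ('l', 'h'), ('l', 'l'),
     ('h', STAR), (STAR, 'h'), ('l', STAR), (STAR, 'l')]
pr_x = {x: 1 / 8 for x in X}                                   # Pr(X = x), Table 1
pr_y1 = {('h', 'h'): .9, ('h', 'l'): .7, ('l', 'h'): .3, ('l', 'l'): .1,
         ('h', STAR): .6, (STAR, 'h'): .6, ('l', STAR): .2, (STAR, 'l'): .2}

G = nx.DiGraph()
for x in X:
    d_plus = pr_x[x] * pr_y1[x]            # error paid if x is labeled 0
    d_minus = pr_x[x] * (1 - pr_y1[x])     # error paid if x is labeled 1
    G.add_edge('s', x, capacity=d_minus)
    G.add_edge(x, 't', capacity=d_plus)
    for x2 in reports(x):                  # x -> x2: forbid labeling x as 0 and x2 as 1
        if x2 != x and x2 in pr_x:
            G.add_edge(x, x2)              # no capacity attribute => infinite capacity

err, (s_side, t_side) = nx.minimum_cut(G, 's', 't')
accepted = sorted((x for x in t_side if x != 't'), key=str)    # f*(x) = 1 iff x is on t's side
print(round(err, 3), accepted)   # expect error 0.275 and accepts (h,h), (h,l), (h,*)
```

The same construction, applied to the empirical distribution D̂ over the observed samples instead of the full distribution, is what the MINCUT classifier in the experiments uses.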
Because of these issues, a suitable discretization of features is sometimes useful (see Section 6). Note that MINCUT is truthful, by virtue of the infinite-capacity edges in Definition 2.

It is easy to see that for any of the commonly used hypothesis spaces -- say H consists of linear hypotheses -- if a truthful classifier f is in H, then so are the components of the MAX Ensemble version of f as in Theorem 2. We have, however, stated Theorem 3 in slightly more general terms.

The HILL-CLIMBING classifier: In what follows, we present a hill-climbing approach with provable convergence and generalization guarantees. The HILL-CLIMBING classifier, henceforth referred to as HC, is of the same form as given by the characterization of truthful classifiers in Theorem 2. Intuitively, the approach works by considering a hierarchy of classifiers, organized by the features involved. For example, consider a setting with k = 3 features. We make a choice as to which classifiers we use -- say f_1 for input of the form (x_1, *, *), f_2 for input of the form (x_1, x_2, *), and f_3 for input of the form (x_1, x_2, x_3). Any agent with features 1 and 2 (but not 3), for example, should be able to report both features so as to be classified by f_2, or feature 1 alone so as to be classified by f_1 instead. So in effect, assuming full knowledge of the classifiers, each agent can check all of the classifiers and choose the most favorable one. Without loss of generality, we assume that when a data point does not have all the features required by a classifier, it is automatically rejected.

Algorithm 1 (excerpt):
    Input: data set X̂ = {(x_i, y_i)}_{i ∈ [m]}, n subsets F_1, F_2, ..., F_n of F.
    Initialize: t ← 0, {f_1^0, ..., f_n^0}.
    while ∆ > 0 do
        for i = 1, 2, ..., n do
            ...

In short, HC (defined formally in Algorithm 1) works as follows: first choose a hypothesis space H, in order for Theorem 3 to apply. Then select n subsets of F (where n is a parameter), say F_1, F_2, ..., F_n. For each F_j, we learn an F_j-classifier, say f_j, from among those in H. Start by initializing these classifiers to any suitable {f_1^0, ..., f_n^0}. In each iterative step, each of the subclassifiers is updated to minimize the empirical loss on the samples that are rejected by all other classifiers. We next show that such an update procedure always converges. To do so, as far as our theoretical analysis goes, we assume we have black-box access to an agnostic learning oracle (Line 6 in Algorithm 1). After convergence, the HC classifier is obtained as the MAX Ensemble of these classifiers. The generalization guarantee of Theorem 3 applies directly to the HC classifier.

Theorem 4. Algorithm 1 converges.

Connection with MINCUT: The HC formulation given above can be thought of as a less complicated version of MINCUT: some of the edge constraints are ignored with respect to learning the individual classifiers, and are instead factored in via the MAX function. Say F_1 ⊂ F_2. For some x, it is possible that f_1(x|_{F_1}) = 1 and f_2(x|_{F_2}) = 0. In other words, the individual classifiers could violate the MINCUT constraints, in order to learn classification functions that generalize well individually, and also collectively thanks to the combined HC training procedure.

Implementing HC: In practice, the classifiers {f_1, f_2, ..., f_n} in HC can be populated with any standard out-of-the-box methods such as logistic regression or neural networks, the choice of which can influence the performance of f. In Section 6, we test HC with a few such options.
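The listing below is a minimal sketch (ours; the helper names and the stopping rule on the training error are not from the paper) of the hill-climbing loop just described, with scikit-learn logistic regressions as the subclassifiers f_1, ..., f_n. Missing features are encoded as NaN, and a subclassifier automatically rejects any point that lacks one of its features, as assumed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def has_features(x, F):
    return all(not np.isnan(x[i]) for i in F)

def sub_predict(clf, F, x):
    """A subclassifier for feature subset F rejects any point missing a feature in F."""
    if clf is None or not has_features(x, F):
        return 0
    return int(clf.predict(np.asarray(x)[list(F)].reshape(1, -1))[0])

def ensemble_predict(clfs, subsets, x):
    """MAX Ensemble: an agent is accepted if some applicable subclassifier accepts."""
    return max(sub_predict(c, F, x) for c, F in zip(clfs, subsets))

def train_hc(X, y, subsets, max_iters=20):
    """Hill-climbing: retrain each subclassifier on the points all the others reject."""
    n = len(subsets)
    clfs, best_err = [None] * n, np.inf
    for _ in range(max_iters):
        for j, F in enumerate(subsets):
            rej = [i for i, x in enumerate(X)
                   if has_features(x, F) and
                   max((sub_predict(clfs[k], subsets[k], x) for k in range(n) if k != j),
                       default=0) == 0]
            if len({y[i] for i in rej}) == 2:     # need both labels to fit a classifier
                clfs[j] = LogisticRegression().fit(
                    np.array([np.asarray(X[i])[list(F)] for i in rej]),
                    [y[i] for i in rej])
        err = np.mean([ensemble_predict(clfs, subsets, x) != yi for x, yi in zip(X, y)])
        if err >= best_err:                       # no improvement: stop (cf. Theorem 4)
            break
        best_err = err
    return clfs
```

The actual Algorithm 1 assumes an agnostic learning oracle for the inner fit and uses the threshold ∆ discussed next; the loop above simply stops when the ensemble's training error no longer improves.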
The assumption of having access to an agnostic learning oracle does not play a crucial role in practice, with standard training methods performing well enough to ensure convergence. Also, HC will converge in at most m (the number of training examples) iterations, because in each iteration the number of correctly classified examples increases by at least one. (An iteration may need to train n individual classifiers.) This also means there is no difference between checking whether ∆ > 0 or ∆ ≥ 1/m. In our experiments, we run HC using ∆ ≥ 10^{-4}, and convergence is achieved fairly quickly (see the Supplement for exact details).

Choosing subsets: Note that we are free to choose any F_1, F_2, ..., F_n to define HC. Its generalization (via Theorem 3) will depend on the choice of n. As more and more subsets of features are included (and as their values are further binned), HC starts behaving more and more like MINCUT. In addition, using a large number of subsets increases the computational complexity of HC. In practice, therefore, the number of subsets must be limited somehow -- we find that some simple strategies like the following work reasonably well: (a) selecting a few valuable features and taking all subsets of those features, or (b) taking all subsets of size smaller than a fixed threshold, say 2. In many practical situations, a few features (possibly with their values placed in just a few bins) are often enough to get close to optimal accuracy, which also improves interpretability (e.g., see Wang and Rudin [2015] or Jung et al. [2017]). The question of devising a more nuanced algorithm for selecting these subsets merits a separate investigation, and we leave this to future work.

As we just mentioned, it is challenging to directly apply HC and MINCUT to a large number of features. As we will see, we can address this challenge in various ways to still get very strong performance with HC. Moreover, HC enjoys remarkable generality, as well as generalization and convergence guarantees. Nevertheless, we would like to have an algorithm that tries to make use of all the available features, while still being truthful. In this section, we present such an approach, which, as we show later in Section 6, indeed performs comparably to -- and in some cases better than -- MINCUT and HC.

Below we present a simple and truthful learning algorithm, Incentive-Compatible Logistic Regression (IC-LR), a truthful variant of classical gradient-based algorithms for logistic regression. Recall that in logistic regression, the goal is to learn a set of coefficients {β_i}, one for each feature i ∈ F, as well as an intercept β_0, such that for each data point (x, y), the predicted label ŷ, given by ŷ = σ(β_0 + Σ_{i∈F} β_i x_i), fits y as well as possible, where σ(t) = 1/(1 + e^{−t}) is the logistic function.

Roughly speaking, IC-LR (formally defined in Algorithm 2) works by restricting the coefficients {β_i} in such a way that dropping a feature (i.e., setting x_i to 0) can never make the predicted label larger. If, without loss of generality, all feature values x_i are nonnegative, then this is equivalent to requiring that, for each feature i ∈ F, the coefficient β_i ≥ 0. IC-LR enforces this nonnegativity constraint throughout the training procedure by applying a projection step after each gradient step, which projects the coefficients to the feasible nonnegative region by setting any negative coefficient to 0 (equivalently, an ℓ_1 projection).
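Below is a minimal numpy sketch (ours; Algorithm 2 itself is given in the paper's supplement) of this projected-gradient training: a standard logistic-regression gradient step on the intercept and coefficients, followed by clipping the feature coefficients (but not the intercept) to zero from below. It assumes feature values are nonnegative and that a withheld feature is encoded as 0.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_ic_lr(X, y, lr=0.1, iters=2000):
    """Logistic regression with beta_i >= 0, trained by projected gradient descent.

    X: (m, k) array of nonnegative feature values, missing features encoded as 0,
       so that withholding a feature can never increase the predicted score.
    y: length-m array of 0/1 labels.
    """
    m, k = X.shape
    beta0, beta = 0.0, np.zeros(k)
    for _ in range(iters):
        p = sigmoid(beta0 + X @ beta)      # predicted Pr(y = 1)
        g = p - y                          # gradient of the logistic loss w.r.t. the logits
        beta0 -= lr * g.mean()             # gradient step on the (unconstrained) intercept
        beta -= lr * (X.T @ g) / m         # gradient step on the coefficients
        beta = np.maximum(beta, 0.0)       # projection onto the nonnegative orthant
    return beta0, beta

def ic_lr_predict(beta0, beta, X):
    return (sigmoid(beta0 + X @ beta) >= 0.5).astype(int)
```

With β_i ≥ 0 and x_i ≥ 0, the score σ(β_0 + Σ_i β_i x_i) is nondecreasing in every feature, so setting any x_i to 0 (withholding it) can only lower the score -- exactly the incentive property claimed above.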
One potential issue with IC-LR is the following: if a certain feature x_i ≥ 0 is negatively correlated with the positive classification label, then IC-LR is forced to ignore it (since it is constrained to use nonnegative coefficients). To make good use of such a feature, we can include an inverted copy x′_i = λ − x_i (where λ is chosen such that x′_i ≥ 0). We could also choose an apt discretization of such features (using cross-validation) and translate the discretized bins into separate binary variables. Such a discretization can account for more complex forms of correlation, e.g., a certain feature's being too high or too low makes the positive label likelier. In practice, we find that the latter method does better. If such transformations are undesirable, perhaps for reasons of complexity or interpretability, HC methods are a safer bet.

In this section, we show that, when strategic withholding is at play, MINCUT, HC and IC-LR perform well and provide a significant advantage over several out-of-the-box counterparts (which do not account for strategic behavior).

Datasets. Four credit approval datasets are obtained from the UCI repository [Dua and Graff, 2017], one each from Australia, Germany, Poland and Taiwan. As is common for credit approval datasets, they are imbalanced to various degrees. In order to demonstrate the performance of classifiers in a standard, controlled setting, we balance them by random undersampling. There is a dedicated community [Chawla et al., 2004] that looks at the issue of imbalanced learning. We do not delve into these issues in our paper, and evaluate our methods on both balanced and imbalanced datasets (see the Supplement for the latter). The datasets are summarized in Table 2; note that there is enough variation in terms of the types of features present. We then randomly remove a fraction ε ∈ {0, 0.1, ..., 0.5} of all feature values in each dataset to simulate data that is missing "naturally" -- i.e., not due to strategic withholding.

Testing. We test all methods under two ways of reporting: "truthful", i.e., all features are reported as is, and "strategic", i.e., some features might be withheld if doing so leads to a better outcome. We measure the test accuracy of each classifier, averaged over N = 100 runs, with randomness over the undersampling and the data that is randomly chosen to be missing (to simulate data missing for non-strategic reasons). Other metrics, and details about implementing and training the classifiers, are discussed in the Supplement. It is important to note that for testing any method, we have to, in effect, compute the best response of each data point toward the classifier. Since the methods we propose are truthful, this is a trivial task. But for other methods, this might not be easy, thereby limiting which baselines can be used.

Classifiers. We evaluate our proposed methods -- MINCUT, HC with logistic regression (HC(LR)) or neural networks (HC(ANN)) as subclassifiers, and incentive-compatible logistic regression (IC-LR) -- against several baseline methods. First, they are compared against three out-of-the-box baseline classifiers: logistic regression (LR), neural networks (ANN) and random forest (RF). We select LR for its popularity in credit approval applications; we select ANN for being the best-performing individual classifier on some credit approval datasets [Lessmann et al., 2015]; and we select RF for being the best-performing homogeneous ensemble on some credit approval datasets [Lessmann et al., 2015], as HC can be viewed as a homogeneous ensemble method.
(For the sake of exposition, we present numbers just for baselines based on LR, as they perform relatively better.) Second, for purposes of comparison, we include MAJ -- predict the majority label if examples with the exact same feature values appeared in the training set, and reject if not -- which can be thought of as a non-strategic counterpart of MINCUT. We also include k-nearest neighbors (KNN) as a baseline, since it is closely related to MAJ. These out-of-the-box classifiers need help dealing with missing data, whether it is missing naturally at training and test time or strategically at test time, and to this end, we employ (a) IMP: mean/mode imputation [Lessmann et al., 2015], and (b) R-F: reduced-feature modeling [Saar-Tsechansky and Provost, 2007], for each of them.

When the dataset has a large number of features, MINCUT and IC-LR can be directly applied. For HC, we assist it in two ways: (a) by selecting 4 features based on the training data, denoted by FS (feature selection), and (b) by choosing a limited number of small subsets (30 with 1 feature and 30 with 2 features), denoted by APP (approximation). Note that since our proposed methods are truthful, we can assume that features are reported as is. However, for all out-of-the-box classifiers except IMP(LR), it is infeasible to simulate strategic withholding of feature values, due to the enormous number of combinations of features. Last but not least, we test all methods with discretization of continuous features (into categorical ones) [Fayyad and Irani, 1993], for reasons given in earlier sections.

For the sake of exposition, we report results only for ε = 0.2. We also limit our exposition of HC, IMP and R-F methods to those based on logistic regression, as these perform better than their ANN/RF/KNN counterparts. For a comprehensive compilation of all results, along with standard deviation numbers, please refer to the Supplement.

With a small number of features (Table 3): As expected, the out-of-the-box baselines perform well under truthful reporting, but not with strategic reporting. Our methods are robust to strategic withholding, and in line with the earlier discussion on the potential issues faced by MINCUT and IC-LR (in Sections 3 and 5), we see that (a) HC(LR) performs most consistently, and (b) in some cases, MINCUT (e.g., Poland) and IC-LR (e.g., Taiwan) do not do well.

With discretization (Table 4): As expected, discretization of numerical features into binary categories improves the performance of MINCUT and IC-LR, for reasons explained in Sections 3 and 5 respectively. We also see some benefit from discretization for HC(LR) when the features are mostly continuous (e.g., Poland), and less so when they are already discrete (e.g., Taiwan).

With a large number of features (Table 5): We see broadly similar trends here, except that in the case with discretization, IC-LR performs much better than before (e.g., Poland). The reason for this is that IC-LR is able to use all the available features once they are discretized into binary categories. However, without discretization, HC methods are more reliable (e.g., Poland and Taiwan).

On the out-of-the-box baselines:
• Imputation-based methods are sensitive vis-à-vis the mean/mode values used. There is an incentive to drop a certain feature if the imputed value is a positive signal. If there are many such features, then these methods perform poorly, as seen in Table 5 (cf. Table 3, Australia).
If the imputed values do not give a clear signal (e.g., when the distribution of each feature value is not skewed), there is a high variance in the performance of these methods (see the Supplement). In some cases, the benchmarks perform as well as, or slightly better than, our incentive-compatible classifiers. For example, in Table 3, for the Australia and Poland data sets, the accuracy of IMP(LR) and that of HC(LR) are within 0.001 of each other. This happens because the imputed values are, in these cases (but not in most of our other cases), negative indicators of the positive label, and therefore there is generally no incentive to strategically drop features.
• Reduced-feature modeling, despite performing well under truthful reporting, allows too many examples to be accepted under strategic reporting, which hurts its performance. This is true especially for smaller ε, as each subclassifier has fewer examples to train on, giving agents several viable options for strategic withholding.

We note here that the variance (in the accuracy achieved) produced by our methods, since they are robust to strategic withholding, is much smaller than that of the baseline methods (exact numbers are deferred to the Supplement).

In this paper, we studied the problem of classification when each agent, at prediction time, can strategically withhold some of its features to obtain a more favorable outcome. We devised classification methods (MINCUT, HC and IC-LR) that are robust to this behavior, and in addition, characterized the space of all possible truthful classifiers in our setting. We tested our methods on real-world data sets, showing that they outperform out-of-the-box methods that do not account for the aforementioned strategic behavior. An immediate question that follows is relaxing the assumption of having access to truthful training data -- for example, one could ask what the best incentive-compatible classifier is given that the training data consists of best responses to a known classifier f; or, one could consider an online learning model where the goal is to bound the overall loss over time. A much broader question for future work is to develop a more general theory of robustness to missing data that naturally includes the case of strategic withholding.

We are thankful for support from NSF under award IIS-1814056.

The methods presented in this paper are geared toward preventing the strategic withholding of data when machine learning methods are used in real-world applications. This will increase the robustness of ML techniques in these contexts: without taking this issue into account, deployment of these techniques will generally result in a rapid change in the distribution of submitted data due to the new incentives faced, causing the techniques to work much more poorly than expected at training time. Thus, there is an AI safety [Amodei et al., 2016] benefit to our work. The lack of strategic withholding also enables the collection of (truthful) quality data. Of course, there can be a downside to this as well if the data is not used responsibly, which could be the case especially if the features that (without our techniques) would have been withheld are sensitive or private in nature. The other issues to consider in our context are those of transparency and fairness. We assume that the classifier is public knowledge, and therefore, agents can appropriately best-respond.
In practice, this might not be the case; however, agents may learn how to best-respond over time if similar decisions are made repeatedly (e.g., in the case of college admissions or loan applications). While US college admission is often a black box, it need not be; many countries have transparent public criteria for university admissions (e.g., the Indian IIT admission system), and the same is true in many other contexts (e.g., Canadian immigration). Of course, transparency goes hand in hand with interpretability, i.e., the classifier must be easily explainable as well, and there could be a trade-off, in principle, between how easy the classifier is to interpret and the accuracy it can achieve. It is also possible that our methods hurt the chances of those with more missing data (similarly to how immigrants without credit history may not be able to get a credit card). This is to some extent inevitable, because if one could get in without any features, everyone could get in by dropping all features. Therefore, the issue of fairness might arise in cases where some groups systematically tend to have more missing data.

Our work can, in a way, be thought of as studying an adversarial classification [see, e.g., Vorobeychik and Kantarcioglu, 2018] problem -- in particular, a decision-time, white-box, targeted attack on binary classifiers, assuming that the only strategy available to the attacker is to remove feature values, and the attacker's goal is to maximize the number of instances classified as positive. In this regard, what we study is similar in spirit to some of the existing literature [Globerson and Roweis, 2006, Syed and Taskar, 2010, Dekel et al., 2010b] on adversarial classification. For example, Globerson and Roweis [2006] consider a problem where, at test time, the attacker can set up to a certain number of features (say pixels in an image) to zero for each instance individually, in a way that is most harmful to the classifier chosen. To be robust to such attacks, they devise convex-programming-based methods that avoid depending on small sets of features to learn the class structure. Our work is different in that we take a more game-theoretic approach to designing classifiers (including ensemble-based ones) that are fully resistant to the strategic withholding of features by agents (who prefer being labeled positively). Moreover, we make no assumptions on the actual structure of the feature space. Our work is also related to the literature on classification with missing data [Batista and Monard, 2003, Marlin, 2008]. We devise methods that can deal with the strategic withholding by agents of some of their features, against a backdrop of missing data caused by natural reasons (e.g., nonstationary distributions of which features are present, or input sensor failures). The HC method can be viewed as an ensemble method for missing data [Conroy et al., 2016] that is strategy-proof against the aforementioned strategic behavior. We also study the performance of other standard, non-strategic classification methods for missing data in the strategic setting, including predictive value imputation and reduced-feature modeling [Saar-Tsechansky and Provost, 2007].

Proof. First observe that any classifier f can be viewed equivalently as a subset of X, given by {x ∈ X : f(x) = 1}. Below, we use these two interpretations of a classifier -- as a function and as a subset -- interchangeably. The loss of a truthful classifier f can then be written as

ℓ_D(f) = Pr_{(x,y)∼D}[f(x) ≠ y] = Σ_{x ∈ f} D−(x) + Σ_{x ∉ f} D+(x),

where the last line follows from Definition 1.
Therefore, our goal is to solve the following optimization problem:

minimize over f ⊆ X:  Σ_{x ∈ f} D−(x) + Σ_{x ∉ f} D+(x),
subject to:  x ∈ f whenever x → x′ and x′ ∈ f (i.e., f is truthful).

Consider the following min-cut formulation (using Definition 2): let G(D, X) be a directed capacitated graph, with vertices V = X ∪ {s, t}, and with edges E and edge capacities u defined as follows:
• For each x ∈ X, there is an edge (s, x) ∈ E with capacity D−(x), and an edge (x, t) ∈ E with capacity D+(x).
• For each pair x, x′ ∈ X where x → x′, there is an edge (x, x′) ∈ E with capacity ∞.
Observe that each finite-capacity s-t cut (S, S̄) corresponds bijectively to a truthful classifier f := 1(x ∈ S̄ \ {t}). Moreover, the capacity of the cut is given precisely by

Σ_{x ∈ S̄ \ {t}} D−(x) + Σ_{x ∈ S \ {s}} D+(x) = ℓ_D(f).

Therefore, any s-t min-cut corresponds to an optimal classifier f*, which can be computed "efficiently" (i.e., in time poly(|X|)) using any efficient max-flow algorithm, given complete knowledge of D.

Proof. Recall that the revelation principle holds in our setting (as mentioned in Section 2; also see Proposition 1). It therefore suffices to characterize all direct-revelation classifiers. For any (not necessarily truthful) classifier f, consider its direct-revelation implementation f′, which maps feature values x to the most desirable label the data point can get by dropping features, i.e., f′(x) = max_{x′ : x → x′} f(x′). We argue below that f′ has the desired form. Observe that, depending on which features a data point x has, f can be decomposed into 2^k subclassifiers, denoted {f_F}_{F ⊆ [k]}. The label of x is then determined in the following way: let F_x be the set of features possessed by x, i.e., F_x = {i ∈ [k] : x_i ≠ *}; then f(x) = f_{F_x}(x). Moreover, observe that (1) f_F effectively depends only on x|_F (i.e., f_F(x) = f_F(x|_F)), since f_F only acts on those data points where all features not in F are missing, and (2) without loss of generality, f_F rejects any data point with a missing feature i ∈ F, since f_F never acts on a data point where such a feature i ∈ F is missing. Now consider how f′ works on a data point x. For any F ⊆ F_x, by dropping all features not in F, x can report x|_F. Moreover, for any such report, the label assigned by f is f_F(x|_F). One can therefore write f′ in the following way: for any x ∈ X, f′(x) = max_{F ⊆ F_x} f_F(x|_F), as desired.

Theorem 3 (restated). For any f ∈ H̄ and any δ > 0, with probability at least 1 − δ over the draw of the m training samples, ℓ_D(f) ≤ ℓ̂(f) + O(√((d_VC(H̄) + log(1/δ)) / m)).

Proof. Recall that H̄ is defined as the set of all classifiers that can be written as the MAX Ensemble of n classifiers in H. Given the classical VC inequality [e.g., Shalev-Shwartz and Ben-David, 2014, Theorem 6.11], we only need to bound the VC dimension of H̄, and show that d_VC(H̄) = O(dn log(dn)), where d is the VC dimension of H. To this end, observe that each f ∈ H̄ is essentially a decision tree with n + 1 leaves, where each leaf is associated with a binary label, and each internal node corresponds to a classifier in H. To be precise, f can be computed in the following way: for any x ∈ X, if f_1(x) = 1, then f(x) = 1; otherwise, if f_2(x) = 1, then f(x) = 1, etc. It is known [see, e.g., Daniely et al., 2015, Section 5.2] that the class of all such decision trees with n + 1 leaves, which is a superset of H̄, has VC dimension O(dn log(dn)). As a result, d_VC(H̄) = O(dn log(dn)), and the theorem follows.

Theorem 4. Algorithm 1 converges.

Proof. Given the current subclassifiers {f_1^t, ..., f_n^t}, consider a single update step for, say, f_1^t. As in Algorithm 1, define S_1 to be the set of training examples rejected by all subclassifiers other than f_1^t, and S_{−1} to be the remaining examples. Then we perform the update f_1^{t+1} ∈ argmin_{f ∈ H} (empirical loss of f on S_1), keeping the other subclassifiers unchanged; let f′ and f″ denote the MAX Ensembles of {f_1^t, ..., f_n^t} and {f_1^{t+1}, f_2^t, ..., f_n^t}, respectively. Now, the empirical loss of f″ is at most that of f′. The inequality follows from the fact that f_1^{t+1} accrues a lower loss on S_1 than f_1^t by definition, and that the classification outcome for any (x, y) ∈ S_{−1} is the same for f′ and f″.
If we treat the empirical loss of the MAX Ensemble as a potential function, we can see that it can only decrease with each step, and therefore the algorithm has to converge at some point.

There are many results in the literature on partial verification regarding the validity of the revelation principle in various settings [Green and Laffont, 1986, Yu, 2011, Kephart and Conitzer, 2015, 2016]. For our purposes, as mentioned in Section 2, when the reporting structure is given by a partial order (i.e., it is transitive, meaning that for any x_1, x_2, x_3, x_1 → x_2 and x_2 → x_3 imply x_1 → x_3), the revelation principle holds. Below we give a quick proof of why this is the case in our setting.

Proposition 1. For any classifier f : X → {0, 1}, there is a truthful classifier f′ such that after misreporting, f and f′ output the same label for all x ∈ X, i.e., max_{x′ : x → x′} f′(x′) = max_{x′ : x → x′} f(x′) for all x ∈ X.

Proof. Below we explicitly construct f′. Let f′ be such that for x ∈ X, f′(x) = max_{x′ : x → x′} f(x′). Clearly f and f′ output the same label after strategic manipulation. We only need to show f′ is truthful, i.e., for any x, x′ ∈ X with x → x′, f′(x) ≥ f′(x′). By transitivity, any x″ with x′ → x″ also satisfies x → x″, so f′(x) = max_{x″ : x → x″} f(x″) ≥ max_{x″ : x′ → x″} f(x″) = f′(x′).

Note that the above proof crucially depends on the transitivity of the reporting structure. In fact, if the reporting structure → is not transitive, then the revelation principle in general does not hold. To see why this is the case, suppose → is not transitive, and let x_1, x_2, x_3 be such that x_1 → x_2, x_2 → x_3, and not x_1 → x_3. Suppose we want to assign label 0 to x_1, and label 1 to x_2 and x_3; then the only way to achieve that is to implement a classifier f where f(x_1) = f(x_2) = 0 and f(x_3) = 1. However, this classifier is not truthful, since x_2 always misreports as x_3 in order to be accepted.

D Other observations

D.1 Regarding MINCUT

Naturally, the test error of MINCUT depends on X and m. For example, if X is discrete and small, one would expect that MINCUT is almost optimal given enough samples. However, when X is large or even infinite, the generalization gap can be extremely large. To see why this is true, consider the following example.

Example 2. Say we are given a feature space with two features, each of which can take any real value between 0 and 1. Let the marginal distribution of D on X be the uniform distribution over X = {(x, y), (x, *), (*, y) | x, y ∈ [0, 1]}. When we see a new data point (x, y), unless we already have (x, y), (x, *) or (*, y) in the set of samples (which happens with probability 0), we know absolutely nothing about the label of (x, y), and therefore by no means can we expect f to predict the label of (x, y) correctly -- in fact, f will always assign label 0 to such a data point.

D.2 Regarding HC

Below, we make a few remarks regarding the generalization bound (Theorem 3) for HC.
• Observe that the generalization gap depends polynomially on the number of subclassifiers n. Without additional restrictions, n can be as large as 2^k, leading to a gap which is exponential in k. This suggests that in practice, to achieve any meaningful generalization guarantee, one has to restrict the number of subclassifiers used. In fact, we do run our algorithm on a small set of features in Section 6.
• Recall that the class of linear classifiers in k-dimensional Euclidean space has VC dimension k + 1. So, if we restrict all subclassifiers to be linear, and require the number of subclassifiers n to be constant, then Theorem 3 implies that with high probability, the generalization gap is Õ(√(k/m)), where k is the number of features, m is the number of samples, and Õ hides a logarithmic factor. Our algorithms are practicable in this kind of regime.
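For concreteness, the following spells out the VC-type bound that the remarks above and the proof of Theorem 3 appeal to; the exact form and constants are our reconstruction rather than a verbatim restatement of the theorem.

```latex
% Classical VC inequality applied to \bar{H}, the class of MAX Ensembles of n classifiers from H:
% with probability at least 1 - \delta over the m training samples, every f \in \bar{H} satisfies
\ell_D(f) \;\le\; \hat{\ell}(f) \;+\; O\!\left(\sqrt{\frac{d_{\mathrm{VC}}(\bar{H}) + \log(1/\delta)}{m}}\right),
\qquad d_{\mathrm{VC}}(\bar{H}) = O\big(dn\log(dn)\big),
% where d is the VC dimension of H. With linear subclassifiers (d = k + 1) and n = O(1),
% the gap becomes \tilde{O}(\sqrt{k/m}), matching the remark above.
```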
In our implementation, we use Python's scikit-learn (0.22.1) package [Pedregosa et al., 2011] for classifiers, and other machine learning packages whenever possible. The categorical features in the datasets are one-hot encoded. To help ensure the convergence of gradient-based classifiers, we then standardize features by removing the mean and scaling to unit variance. For imputation-based classifiers, we use mean/mode imputation: mean for numerical and ordinal features, and mode for categorical features. For reduced-feature-based classifiers, we default to reject if the test data point's set of available features was unseen in the training process. For classification methods involving Fayyad and Irani's MDLP discretization algorithm, we use a modified version of Discretization-MDLPC, licensed under GPL-3.

The performance of each classifier under each setting is evaluated with Nx2-fold cross-validation [Dietterich, 1998]: training on 50% of the data and testing on the remaining 50%, repeated N = 100 times. To tune the parameters of the classifiers, we perform grid search over a subset of the parameter space considered by Lessmann et al. [2015], in a 5-fold cross-validation on every training set of the (outer) Nx2 cross-validation loop.

We evaluate our methods on datasets with 1) 4 selected features, the same for all runs (Tables 6 to 29), 2) 4 selected features, based on the training data at each run (Tables 30 to 53), and 3) all available features (Tables 54 to 77). In addition, we evaluate our methods both with and without balancing the datasets through random undersampling. This is denoted by "balanced datasets" when we undersample before the experiment, and "unbalanced datasets" when we do not. We vary ε from 0 to 0.5 and report classifier accuracy and AUC (when applicable).

Comparing Figures 1 and 3, it appears that there is no significant difference in relative accuracy across the various methods when applied to balanced and imbalanced datasets. However, comparing, for example, Tables 8 and 20, we observe that when the dataset is highly unbalanced, classifiers based on imputation and reduced-feature modeling, when faced with strategic reporting, tend to accept everything and thereby yield a deceptively high accuracy. Many other general issues regarding the use of accuracy as a metric on unbalanced datasets are known [Japkowicz, 2000, Chawla et al., 2003, 2004]. In practice, thresholding methods are sometimes used to determine a proper threshold for binary prediction in such cases [Lessmann et al., 2015, Elkan, 2001]. Therefore, in addition to accuracy, we evaluate our approach with the area under the receiver operating characteristic curve (AUC). AUC becomes a useful metric when doing imbalanced classification because often a balance of false-positive and false-negative rates is desired, and it is invariant to the threshold used in binary classification. For MINCUT, the receiver operating characteristic curve is undefined because it does not output probabilistic predictions; for HILL-CLIMBING, we take the maximum of the probabilistic predictions across all applicable subclassifiers to be the HILL-CLIMBING classifier's probabilistic prediction for a data point, and obtain AUC from that. From Tables 24 to 29, we observe that our three proposed methods generally yield an AUC as good as, if not higher than, imputation- and reduced-feature-based classifiers on imbalanced datasets (see also Figure 4). The same holds for balanced datasets (Figure 2).
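As a rough illustration of the evaluation pipeline described above, here is a condensed sketch (ours) for a single baseline, IMP(LR), under truthful reporting: mean imputation stands in for the mean/mode rule, repeated stratified 50/50 splits play the role of the Nx2 cross-validation, and one-hot encoding, standardization and the grid search are omitted. X is assumed to be a numeric numpy array and y a 0/1 label vector.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_imp_lr(X, y, eps=0.2, n_runs=100, seed=0):
    """IMP(LR) under truthful reporting, with a fraction eps of values removed at random."""
    rng = np.random.default_rng(seed)
    accs, aucs = [], []
    for _ in range(n_runs):
        Xm = X.astype(float).copy()
        Xm[rng.random(Xm.shape) < eps] = np.nan          # simulate naturally missing values
        Xtr, Xte, ytr, yte = train_test_split(
            Xm, y, test_size=0.5, stratify=y, random_state=int(rng.integers(1 << 30)))
        imp = SimpleImputer(strategy="mean").fit(Xtr)
        clf = LogisticRegression(max_iter=1000).fit(imp.transform(Xtr), ytr)
        scores = clf.predict_proba(imp.transform(Xte))[:, 1]
        accs.append(accuracy_score(yte, (scores >= 0.5).astype(int)))
        aucs.append(roc_auc_score(yte, scores))
    return float(np.mean(accs)), float(np.mean(aucs))
```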
For completeness, we also include the performance of classifiers based on imputation and reduced-feature modeling, with discretization. As expected, common classifiers are generally less prone to overfitting than MINCUT, and discretizing the feature space only limits their performance.

The number of iterations the training of HILL-CLIMBING classifiers takes to converge varies by dataset, but is usually no more than 5 passes through all the subclassifiers. For the balanced Australia dataset, for example, HC(LR) takes an average of 4.28 iterations to converge (median 3); HC(LR) w/ disc. takes an average of 2.45 (median 2); HC(ANN) takes an average of 3.56 (median 3); and HC(ANN) w/ disc. takes an average of 2.62 (median 2).