title: Omnipredictors
authors: Gopalan, Parikshit; Kalai, Adam Tauman; Reingold, Omer; Sharan, Vatsal; Wieder, Udi
date: 2021-09-11

Loss minimization is a dominant paradigm in machine learning, where a predictor is trained to minimize some loss function that depends on an uncertain event (e.g., "will it rain tomorrow?"). Different loss functions imply different learning algorithms and, at times, very different predictors. While widespread and appealing, a clear drawback of this approach is that the loss function may not be known at the time of learning, requiring the algorithm to use a best-guess loss function. We suggest a rigorous new paradigm for loss minimization in machine learning where the loss function can be ignored at the time of learning and only be taken into account when deciding an action. We introduce the notion of an $(\mathcal{L}, \mathcal{C})$-omnipredictor, which could be used to optimize any loss in a family $\mathcal{L}$. Once the loss function is set, the outputs of the predictor can be post-processed (a simple univariate data-independent transformation of individual predictions) to do well compared with any hypothesis from the class $\mathcal{C}$. The post-processing is essentially what one would perform if the outputs of the predictor were true probabilities of the uncertain events. In a sense, omnipredictors extract all the predictive power from the class $\mathcal{C}$, irrespective of the loss function in $\mathcal{L}$. We show that such "loss-oblivious" learning is feasible through a connection to multicalibration, a notion introduced in the context of algorithmic fairness. In addition, we show how multicalibration can be viewed as a solution concept for agnostic boosting, shedding new light on past results. Finally, we transfer our insights back to the context of algorithmic fairness by providing omnipredictors for multi-group loss minimization. In machine learning, it is well known that the best classifier may depend heavily on the choice of a loss function, and therefore correctly modeling the loss function is crucial for success in many applications. Modern machine learning libraries such as PyTorch, TensorFlow, and scikit-learn each offer a choice of over a dozen loss functions. However, this poses a challenge in applications where the loss is not known in advance or multiple losses may be used. For motivation, suppose you are training a binary classifier to predict whether a person has a certain medical condition (y = 1), such as COVID-19 or some heart condition, given their attributes x. The cost of misclassification may vary dramatically, depending on the application. For an infectious disease, the question of whether an individual should be allowed to go out for a stroll is different from whether a person should be allowed to go to work in a nursing home. Similarly, deciding whether to advise a daily dose of aspirin carries very different risks than recommending cardiac catheterization. Furthermore, medical interventions that may be developed in the future may carry yet other, unforeseen, risks and benefits that may require retraining with a different loss function. We adopt the following problem setup: we are given a distribution D on X × Y, where X is the domain and Y is the set of labels (for example, Y can be {0, 1} or [0, 1]).
On each x, we take an action t(x) ∈ R, and suffer loss ℓ(y, t(x)), which depends on the label y and the action t(x). We will refer to the function t : X → R, which maps points in the domain to actions, as the hypothesis. Our goal is to find a hypothesis that minimizes E_D[ℓ(y, t(x))] in comparison to some reference class of hypotheses. We will refer to a function f which maps X to probability distributions over Y as a predictor. The goal of a predictor is to model the conditional distribution of labels y | x = x for every point in the domain. A classic example of different optimal hypotheses for different loss functions is the ℓ2 vs. ℓ1 losses, which are minimized by the mean and median respectively. Consider a joint distribution D on X × [0, 1], where x ∈ X is a set of attributes (given) and y ∈ [0, 1] is an outcome. Consider an x such that the corresponding y is uniform in [0.8, 1] with probability 0.6 and 0 with probability 0.4. In this case, to minimize the expected ℓ2 loss ℓ(y, t) = (y − t)^2 you would set t(x) = 0.54 (the mean), whereas to minimize the expected ℓ1 loss ℓ(y, t) = |y − t| you would set t(x) ≈ 0.83 (the median). Not only is t different for the two losses, you cannot learn one from the other. In other words, learning to minimize the ℓ2 loss loses information that is necessary to minimize the ℓ1 loss, and vice versa. The phenomenon that there is no simple way to get one loss-minimizing predictor from another is not unique to ℓ2 vs. ℓ1 losses. Consider the distribution illustrated in Figure 1, which is known as a nested halfspace [Kal07]. Consider a common and simple family of loss functions where ℓ(y, t) = c_y |y − t|, where y ∈ {0, 1} and c_0, c_1 ≥ 0 are the costs of false positives and false negatives respectively. Even for linear classification, as the ratio c_0/c_1 varies, a different direction is optimal. The standard ML approach of minimizing a given loss would require separate classifiers for each loss. There is no clear way to infer the optimal classifier for one set of costs from the classifier for another; applying standard post-processing techniques, such as Platt Calibration [Pla99] or Isotonic Regression [ZE01], to the predictions so that Pr[y = 1 | t = z] ≈ z will not fix the issue, since the optimal direction is different. (Figure 1: the level sets of f*(x) for x ∈ [0.1, 1]^2. As can be seen from the level sets, the direction of the optimal linear classifier varies depending on the cost of false positives and negatives. This example is learned to near-optimal loss, for any loss with fixed costs of false positives and false negatives, by an omnipredictor for the class C = {x_1, x_2}.) This paper advocates a new paradigm for loss minimization: train a single predictor that could later be used for minimizing a wide range of loss functions, without having to further look at the data. Why should such predictors exist, and are they computationally tractable? One source of optimism is that the ground-truth predictor f* does allow exactly that. First consider the case of Boolean labels, and let D denote a distribution on X × {0, 1} (our results also apply to real-valued outcomes). We define f*(x) = E_{y∼D}[y | x] ∈ [0, 1] to be the conditional expectation of the label for x. The value t(x) which minimizes the loss E_{y|x}[ℓ(y, t(x))] depends only on f*(x). Furthermore, as long as ℓ is smooth and easy to compute, t can easily be computed from f*. We denote the univariate post-processing function that optimizes loss given true probabilities by k*_ℓ : [0, 1] → R.
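To make the mean-versus-median example concrete, the following sketch (ours, not from the paper; the Monte Carlo sampling and the scipy-based minimization are illustrative choices) numerically recovers the two optimal actions for the mixture distribution above, and computes k*_ℓ for Bernoulli labels by direct one-dimensional minimization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# y is uniform on [0.8, 1] with probability 0.6, and 0 with probability 0.4.
rng = np.random.default_rng(0)
mask = rng.random(1_000_000) < 0.6
y = np.where(mask, rng.uniform(0.8, 1.0, mask.size), 0.0)

print(y.mean())      # ~0.54: the optimal action for the l2 loss is the mean
print(np.median(y))  # ~0.83: the optimal action for the l1 loss is the median

# For Boolean labels, k*_l(p) minimizes the expected loss under Ber(p).
def k_star(loss, p):
    objective = lambda t: p * loss(1, t) + (1 - p) * loss(0, t)
    return minimize_scalar(objective, bounds=(0, 1), method="bounded").x

l2 = lambda y, t: (y - t) ** 2
l1 = lambda y, t: abs(y - t)
print(k_star(l2, 0.3))  # ~0.3, matching k*_{l2}(p) = p
print(k_star(l1, 0.3))  # ~0.0, matching k*_{l1}(p) = 1(p >= 1/2)
```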
So, for example, for ℓ2(y, t) = (y − t)^2 we have k*_{ℓ2}(p) = p, and for ℓ1(y, t) = |y − t|, k*_{ℓ1}(p) = 1(p ≥ 1/2). For every loss function ℓ, the composition k*_ℓ ∘ f* minimizes the loss ℓ, even conditioned on complete knowledge of D. The connection between perfect predictors and choosing the optimal action is well understood and plays an important role in the statistics literature on proper scoring rules and forecasting (cf. [Sch89]). But learning f* from samples from D is information-theoretically and computationally impossible in general. The natural approach is to learn a model f for f* and then compose k*_ℓ with f. Common instantiations of this approach (such as using logistic regression to model f) do not yield particularly strong guarantees in the realistic non-realizable setting, where f* does not come from the class of model distributions (see the further discussion in Section 2). Our main conceptual contribution is to introduce the notion of omnipredictors, which provide a framework for deriving strong rigorous guarantees using this composition approach even in the non-realizable setting. The goal of an omnipredictor is to learn a predictor f that could replace f* for the purpose of minimizing any loss from a class L compared to some hypothesis class C. For a family L of loss functions and a family C of hypotheses c : X → R, we introduce the notion of an (L, C)-omnipredictor, which is a predictor f : X → [0, 1] with the property that for every loss function ℓ ∈ L, there is a post-processing function k_ℓ such that the expected loss of the composition k_ℓ ∘ f, measured using ℓ, is almost as small as that of the best hypothesis c ∈ C. Omnipredictors can replace perfect predictors for the sake of minimizing losses in L compared with the class C. In this sense they extract all the predictive power of C for such tasks. The key questions are of efficiency and simplicity: how strong a learning primitive do we need to assume to get an omnipredictor for L, C, and how complex is the predictor? The main result of this paper is that one can efficiently learn simple omnipredictors for broad classes of loss functions and hypotheses, from weak learning primitives. Our main result. We show that for any hypothesis class C, a weak agnostic learner for C is sufficient to efficiently compute an (L, C)-omnipredictor, where L consists of all decomposable convex loss functions obeying mild Lipschitz conditions; it includes popular loss functions such as the ℓp losses for all p, the exponential loss, and the logistic loss. Weak agnostic learnability captures a common modeling assumption in practice, and is a well-studied notion in the computational learning literature [BLM01, KMV08, KK09, Fel09]. The weak agnostic learning assumption says that if there is a hypothesis in C that labels the data reasonably well (say with accuracy 0.7), then we can efficiently find one that has a non-trivial advantage over random guessing (say with accuracy 0.51). In essence, our main result derives strong optimality bounds for a broad and powerful class of loss functions starting from a weak optimality condition for the 0-1 loss. Perhaps surprisingly, our results are obtained not via the machinery of convex optimization, but by drawing a connection to work on fairness in machine learning, specifically the notion of multicalibration [HKRR18]. Multicalibration is a notion motivated by the goal of preventing unfair treatment of protected sub-populations in prediction; it does not explicitly consider loss minimization.
We draw a connection to omnipredictors using a covariance-based recasting of the notion of multicalibration that clarifies the connection of multicalibration to the literature on boosting [KS05, Kal04, KK09]. A multicalibrated predictor satisfying this definition can be computed by a branching program, building on existing work in the literature on boosting [MM02, KS05, Kal04] and multicalibration [GRSW21]. This new connection shows that the well-known boosting-by-branching-programs algorithms [KM99, MM02] yield multicalibrated predictors and can in fact be used to derive strong guarantees for a broad family of convex loss functions. The post-processing function k_ℓ used in our positive results is essentially k*_ℓ with small modifications. As the example in Figure 1 demonstrates, even in natural cases, an (L, C)-omnipredictor cannot be a function in C; in other words, the learning task we solve is inherently not proper. Omnipredictors for larger classes. An advantage of our covariance-based notion of multicalibration is that it is closed under linear combinations. We use this to show that any multicalibrated predictor for C is in fact an (L, Lin_C)-omnipredictor, where Lin_C consists of linear combinations of functions in C. We give negative results for slightly larger classes, showing that a multicalibrated predictor for C is not necessarily an omnipredictor for the class Thr_C, which consists of thresholds of functions in C. Similarly, it need not be an omnipredictor for the class C itself once non-convex loss functions are allowed. This shows that the connection between multicalibration and omnipredictors that we present is fairly tight. Omnipredictors for multi-group loss minimization. A very recent line of research defines multi-group notions of loss minimization [BL20, RY21]. Multi-group loss minimization is well motivated from the point of view of fairness, as it guarantees that no sub-population's loss is sacrificed for the sake of global loss minimization. Say we have a collection of sub-populations T and hypotheses P, and we seek to take actions such that for every set T ∈ T, our actions compare with the best hypothesis P ∈ P for that sub-population T. Indeed, one may wish to vary the loss function for various subgroups: say in a medical scenario where different age groups are known to react differently to the same treatment. We derive strong multi-group loss minimization guarantees using the closure of multicalibration under conditioning on subsets in C. We show that in the scenario above, a multicalibrated predictor for T × P = {T · P : T ∈ T, P ∈ P} gives an (L, P)-omnipredictor for every sub-population T ∈ T. Hence, given a sub-population T ∈ T and a loss ℓ ∈ L, the predictions of the omnipredictor can be post-processed to be competitive with the best hypothesis from P for that loss function and sub-population T. Omnipredictors for real-valued labels. We extend the notion of omnipredictors to the setting where the labels come from an arbitrary subset Y ⊆ R. Our primary interest is in multi-class prediction, where Y = [l], and the bounded real-valued setting, where Y = [0, 1]. We show that omnipredictors can be learned in this setting, for similar families of loss functions, again assuming weak learnability of C. We also show a stronger bound for the ℓ2 loss than what the general theorem implies. One of our contributions in this work is to formalize and leverage the connections between multicalibration and the literature on agnostic boosting.
We propose a covariance-based recasting of the notion of multicalibration, inspired by the literature on boosting [Kal04, KK09]. We show that this definition has several advantages; for instance, it implies some general closure properties for multicalibration. In the other direction, our work suggests multicalibration as a solution concept for agnostic boosting. By specializing our main result on omnipredictors to the ℓ1 loss, we derive a new proof of the classic result of [KKMS08] on agnostic learning. We elaborate on these connections in this subsection. A covariance-based formulation of multicalibration. Calibration has been well studied in the statistics literature in the context of forecasting [Daw82]. It was introduced to the algorithmic fairness literature by [KMR17]. In the setting of Boolean labels, we are given a distribution D on X × {0, 1} of labelled examples, and wish to learn a predictor f : X → [0, 1]. The predictor f is (approximately) calibrated if for every value v in its range we have Pr[y = 1 | f(x) = v] ≈ v. This means that the prediction f(x) can be interpreted as a probability that is correct in expectation over individuals in the same level set of f. By itself, calibration is a very weak property, both in terms of fairness and in terms of accuracy. This motivated [HKRR18] to introduce the notion of multicalibration, which asks for f to be calibrated on a rich collection of subgroups C rather than just a few protected sets. We draw a connection to omnipredictors by introducing a covariance-based recasting of the notion of multicalibration that clarifies the connection to the literature on boosting [Kal04, KS05, KK09]. Prior definitions consider multicalibration for a family C of sets, or equivalently Boolean functions c : X → {0, 1}, whereas we allow arbitrary real-valued functions. Rather than work with predictors, we define multicalibration for partitions, inspired by the recent work of [GRSW21] in the unsupervised setting. A partition S = {S_1, . . . , S_m} of the domain is a collection of disjoint subsets whose union is X. Intuitively, we want these sets to be (approximate) level sets for f*. We let D_i denote the distribution D conditioned on x ∈ S_i. The partition S gives a canonical predictor f_S, where for each x ∈ S_i we predict f_S(x) = E_{D_i}[y]. In analogy to boosting (see, e.g., [Kal04, KS05, KK09]), we phrase multicalibration in terms of the covariance between c(x) and y conditioned on each state S_i of the partition. We say that S is α-multicalibrated for C if for every i ∈ [m] and c ∈ C,

|Cov_{D_i}[c(x), y]| ≤ α. (1)

In reality, we will weaken the definition to only hold in expectation under D rather than require it for every state of the partition, but we ignore this distinction for now. While our definition is formulated differently, we show that it matches the original definition when C consists of Boolean functions (see Lemma A.3 and the discussion in Appendix A). This lets us adapt existing algorithms in the literature [MM02, Kal04, GRSW21] to give an efficient procedure for computing multicalibrated partitions, assuming a weak agnostic learner for the class C. Working with covariance, which is bilinear, lets us derive powerful closure properties for multicalibration. For instance, if f is multicalibrated with respect to C, it is also multicalibrated (with some deterioration in parameters) with respect to the class Lin_C of (sparse) linear combinations of C. We also show that conditioning on sets in C preserves multicalibration.
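To fix ideas, here is a minimal sketch (our illustration; the empirical-estimation details are not part of the paper, whose algorithmic treatment appears in Section 9) of what checking the condition in Equation (1) looks like on a finite sample.

```python
import numpy as np

def is_multicalibrated(state_ids, X, y, cs, alpha):
    """Empirically check alpha-multicalibration of a partition.

    state_ids[i] is the state of example i; cs is a list of (vectorized)
    real-valued hypotheses c: X -> R.  We require, for every state i and
    every c in C, that |Cov_{D_i}[c(x), y]| <= alpha.
    """
    for i in np.unique(state_ids):
        in_state = state_ids == i
        yi = y[in_state]
        for c in cs:
            ci = c(X[in_state])
            cov = np.mean(ci * yi) - np.mean(ci) * np.mean(yi)
            if abs(cov) > alpha:
                return False
    return True
```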
Agnostic boosting from multicalibration. The problem of agnostically learning a class C is, given samples from a distribution D on X × {0, 1}, to find a binary classifier f : X → {0, 1} whose classification error (aka 0-1 loss), defined as err(f) = Pr_D[f(x) ≠ y], is not much larger than that of the best classifier from C. The boosting approach to agnostic learning is to start from a weak agnostic learner, which only guarantees some non-trivial correlation with the labels, and boost it to obtain a classifier that agnostically learns C [BLM01, KMV08, Fel09]. Our work suggests multicalibration as a solution concept for agnostic boosting. Indeed, our definition of multicalibration based on covariance parallels that of [Kal04, KS05], who use covariance as a splitting criterion. Hence when the algorithms of [MM02, Kal04, KS05] terminate, they have found a multicalibrated partition. Viewed in this light, our results show that these algorithms give a broad and powerful guarantee beyond just the 0-1 loss: they are competitive with sparse linear combinations over C in optimizing a large family of convex, Lipschitz loss functions (with a simple post-processing step). While AdaBoost and logistic regression are known to minimize the exponential and logistic loss respectively over sparse linear combinations of C [SSBD14], no similar result was known for algorithms based on branching programs [MM02, Kal04, KS05]. Let us now focus on the 0-1 loss. Since the 0-1 loss for Boolean functions equals the ℓ1 loss, our results apply to it. We show that for any multicalibrated partition S, the predictor k_{ℓ1} ∘ f_S is competitive not just with the best classifier in C, but with the best classifier in the larger class H of functions that are approximated (in ℓ1) by linear combinations of C. This lets us re-derive the classic result of [KKMS08] on agnostic learning. While not a new result, we feel that our treatment clarifies and unifies existing results (see Theorem 7.3 and the comments following it). It strengthens known results on the noise-tolerant boosting abilities of such programs [KS05, KMV08]. Further, we show examples where err(k_{ℓ1} ∘ f_S) is markedly better than that of any linear combination of C, showing that multicalibration is a stronger solution concept than those considered previously. Let ℓ : {0, 1} × R → R be a loss function that takes a label y ∈ Y and an action t ∈ R as arguments. A hypothesis t : X → R, which prescribes an action for every point in the domain, suffers a loss of E_D[ℓ(y, t(x))]. Our main technical result states that if S is multicalibrated, then f_S is an (L, C)-omnipredictor, where L consists of all convex losses satisfying some mild Lipschitz conditions. Here, for simplicity, we make the stronger assumption that the loss ℓ(y, t) is convex and Lipschitz everywhere as a function of t. We do not assume anything about the relation between ℓ(0, t) and ℓ(1, t). We sketch how multicalibration leads f_S to be an omnipredictor, emphasizing intuition over rigor. We will argue that the loss E_D[ℓ(y, k*_ℓ(f_S(x)))] is not much more than E_D[ℓ(y, c(x))] for any c ∈ C. We fix a state S_i ∈ S and analyze the loss suffered by c(x) under D_i as follows. 1. Reduction to predicting two values: In general, c(x) could take on many values under D_i. However, since the goal is to minimize E_{D_i}[ℓ(y, c(x))], we can pretend that c takes only two values, E_{D_i|y=0}[c(x)] whenever y = 0 and E_{D_i|y=1}[c(x)] whenever y = 1 (we say "pretend" since the actions taken can only depend on x and not on y).
By the convexity of the loss functions, this can only reduce the expected loss. 2. Reduction to predicting one value: A consequence of multicalibration, which follows from the definition of covariance, is that conditioning on the label y = b does not change the expectation of c(x) much. Formally, for b ∈ {0, 1}, E_{D_i|y=b}[c(x)] ≈ E_{D_i}[c(x)]. Since the loss functions ℓ(b, t) are Lipschitz in t for b ∈ {0, 1}, we can replace the two values with E_{D_i}[c(x)] with only a small increase in the loss. At this point, we have reduced to the case where c predicts the constant value E_{D_i}[c(x)]; but among constant actions on S_i, the action played by our post-processed predictor is (near-)optimal, so it does at least as well. We survey related work in Section 2, and set up notation in Section 3. We define the notion of omnipredictors in Section 4. We introduce our notion of multicalibration in Section 5 and derive closure properties for it in Section 5.2. A detailed discussion of how our definition compares to previous definitions can be found in Appendix A. We prove our main result on (L, C)-omnipredictors for binary labels (Theorem 6.3) in Section 6, along with the extension to Lin_C (Corollary 6.5) and its application to multi-group loss minimization (Corollary 6.6). Section 6.3 shows that multicalibration for C does not yield omnipredictors for thresholds of C or for non-convex losses. Section 7 presents applications of our results to agnostic learning, and an example showing that multicalibration can give a stronger guarantee than OPT(Lin_C). We present the extension to the real-valued setting in Section 8, and derive a stronger bound for the ℓ2 loss in Theorem 8.3. We present an algorithm for computing multicalibrated partitions in Section 9. While the notion of an omnipredictor is introduced in this work, our definitions draw on two previous lines of work. The first is the notion of multicalibration for predictors that was defined in the work of Hébert-Johnson et al. [HKRR18]. A detailed discussion of how our definitions of multicalibration compare to previous definitions appears in Appendix A. The other is work on boosting by branching programs of Mansour-McAllester [MM02], which built on Kearns-Mansour [KM99] and the notion of correlation boosting [Kal04, KS05]. Group fairness and multicalibration. While multicalibration was introduced with the motivation of algorithmic fairness, it has been shown to be quite useful in the context of accuracy when learning in a heterogeneous environment. This was shown both experimentally [KGZ19, BYR+21] and in real-life implementations [BRA+20]. From a theoretical perspective, it was shown in [HKRR18] that post-processing a predictor to make it multicalibrated cannot increase the ℓ2 loss. This was extended to showing some optimality results for multicalibrated predictors with respect to ℓ2 and even log-loss [GKR19, Kim20]. Multicalibrated predictors are also connected to loss minimization through the notion of outcome indistinguishability [DKR+21] in the work of [RY21]. Outcome indistinguishability shows that a multicalibrated predictor is indistinguishable from the true-probabilities predictor in a particular technical sense. In [RY21] this is used to create a predictor that can be used to minimize a rather general and potentially global notion of loss, even when restricted to sub-groups. The proof constructs a family of distinguishers, for a fixed loss function, such that if there exists a subset on which a predictor doesn't minimize the loss function, then one of the distinguishers can distinguish the predictor from the true-probabilities predictor in the sense of outcome indistinguishability.
The main way in which all of these results differ from what we show is that they do not seek to simultaneously allow for the minimization of such a rich family of loss functions. In the case of [RY21], since the result goes through outcome indistinguishability, to minimize a loss with respect to a class C, a predictor needs to be multicalibrated with respect to a different class that relies on the loss function and incorporates the reduction from multicalibration to outcome indistinguishability. In contrast, our result addresses minimization of a family of losses that may not be known at the time of learning, and we assume the learnability of C alone. Convexity of the losses plays a key role in our upper bounds. Agnostic boosting. Our work is closely related to work on boosting via branching programs [KM99, MM02, KS05, Kal04]. Indeed, the splitting criterion used in the work of Kalai [Kal04] is precisely that Cov[c(x), y] ≥ α for some c ∈ C, for the task of learning generalized linear models. The "okay learners" in the work of [KS05] are also based on having non-trivial covariance with the target. Our results show that upon termination, these algorithms yield an approximately multicalibrated partition; hence we can view multicalibration as a solution concept for the output of the boosting-by-branching-programs family of algorithms. It was known that these algorithms have stronger noise-tolerance properties than potential-based algorithms such as AdaBoost; see [KS05, KMV08]. Our results significantly strengthen our understanding of the power of these algorithms, showing that they give guarantees for a broad family of convex loss functions, and not just the 0-1 loss. Naive instantiations of the composition approach. The obvious approach to building omnipredictors would be to learn a model f for f* using a model family F, such as logistic regression, and then compose it with the right post-processing function k*_ℓ for a given loss ℓ. In the context of binary classification, for certain families of non-decomposable accuracy metrics (including F-scores and AUC), it is shown in [NKRD15, DKKN17] that this approach gives the best predictor for the hypothesis class C of binary classifiers derived by thresholding models from F. These results can be seen as a form of omnipredictors, but with strong restrictions on L and C. Their results do not apply to the decomposable convex losses we consider; indeed, it seems unlikely that the output of logistic regression can give reasonable guarantees for, say, the exponential loss or squared loss, even with post-processing. More importantly, our results hold for arbitrary classes C that are weakly agnostically learnable. For any such class, we show how to construct a model f which is an omnipredictor. In contrast, their results prove optimality for a rather limited hypothesis class C derived from the model family F. Conditional density estimation. For real-valued y ∈ R, an omnipredictor solves the problem of Conditional Density Estimation (CDE) [Han94]. While CDE is recognized as an important problem in practice and a number of CDE algorithms have been proposed, it has received little attention in the computational learning theory literature. The notion of omnipredictor is related to the statistical notion of a sufficient statistic, which is a statistic that captures all relevant information about a distribution.
Surrogate loss functions for classification. There is a large literature in statistics showing that a convex surrogate loss function (with certain properties) can be used instead of the hard-to-optimize 0-1 loss, and any hypothesis c ∈ C which minimizes the surrogate loss will also minimize the original 0-1 loss with respect to C [LV+04, Ste05, BJM06]. There are also similar results for the multi-class 0-1 loss [TB07], and in the asymmetric setting where the false-positive and false-negative costs are known [Sco12]. However, this line of work is quite different from ours: crucially, an (L, C)-omnipredictor optimizes not just a single loss but any loss in the family L (such as different false-positive and false-negative costs; we recall that, as Figure 1 shows, this cannot be achieved with a single hypothesis from C). Let D denote a distribution on X × Y. The set X represents points in our space; it could be continuous or discrete. The set Y represents the labels; we will typically consider Y = {0, 1} or Y ⊆ [0, 1]. We use boldface for random variables. For any S ⊂ X, let D(S) = Pr_{x∼D}[x ∈ S]. We will use y ∼ Ber(p) to denote sampling from the Bernoulli distribution with parameter p. Let C = {c : X → R} be a collection of real-valued hypotheses on a domain X, which could be continuous or discrete. The hypotheses in C should be efficiently computable, and the reader can think of them as monomials, decision trees, or neural nets. We will denote ‖C‖_∞ = sup_{c∈C} max_{x∈X} |c(x)|. A loss function ℓ takes a label y ∈ Y and an action t ∈ R and returns a loss value ℓ(y, t). Common examples are the ℓp losses ℓp(y, t) = |y − t|^p and the logistic loss ℓ(y, t) = log(1 + exp(−yt)) (for labels y ∈ {−1, 1}). The problem of minimizing a loss function ℓ is to learn a hypothesis h : X → R minimizing ℓ_D(h) := E_D[ℓ(y, h(x))]. A partition S = {S_1, . . . , S_m} of the domain X is a collection of disjoint subsets whose union equals X; we refer to m as its size. We refer to each S_i as a state in the partition. In the binary setting, a predictor is a function f : X → [0, 1], where f(x) is interpreted as the probability, conditioned on x, that y = 1. We define the ground-truth predictor as f*(x) := E_D[y | x = x]. For general label sets Y, let P(Y) denote the space of probability distributions on Y. Define the ground-truth predictor f* : X → P(Y), where f*(x) is the distribution of y | x. A predictor is a function f : X → P(Y) which is intended to be an approximation of f*. We denote by g ∘ h the composition of functions. We say ℓ : Y × R → R is a convex loss function if ℓ(y, t) is a convex function of t for every y ∈ Y. Note that in the binary setting, our formulation allows for binary classification with different false-positive/negative costs, e.g., ℓ(y, t) = c_y|t − y|, where c_0 and c_1 are the different costs. We say that ℓ(y, ·) is B-Lipschitz on an interval I if |ℓ(y, t) − ℓ(y, t′)| ≤ B|t − t′| for all t, t′ ∈ I. We say the function is B-Lipschitz if the condition holds for I = R. While assuming the loss is B-Lipschitz is sufficient for us, we can work with a weaker notion that only requires the Lipschitz property on a sufficiently large interval. This weaker notion covers most commonly used loss functions, such as the exponential loss, that are not Lipschitz everywhere. Also, we will define loss functions in the setting of labels that come from Y. In this section, though, we will focus on the case Y = {0, 1}. We say that ℓ is (B, ε)-nice, witnessed by an interval I = I_ℓ ⊆ R (we sometimes write (I, B, ε)-nice), if: 1. (B-Lipschitzness) ℓ(y, t) is B-Lipschitz on I for every y ∈ Y. 2. (ε-optimality) For y ∈ Y, I contains an ε-optimal minimizer of ℓ(y, t): inf_{t∈I} ℓ(y, t) ≤ inf_{t∈R} ℓ(y, t) + ε. We use some simple facts about such losses without proof; the first is a simple consequence of ε-optimality and convexity. Here are a few examples of nice loss functions: • Binary classification with different false-positive/negative costs, e.g., ℓ(y, t) = κ_y|y − t|, where κ_0 ≠ κ_1 are the different costs.
Here I = [0, 1], B = max(κ_0, κ_1), ε = 0. • The ℓp losses ℓp(y, t) = |y − t|^p for p ≥ 1. Here I = [0, 1], B = p, ε = 0. • The exponential loss ℓ(y, t) = e^{(1−2y)t}. Here, for any ε > 0, we can take I = [−ln(1/ε), ln(1/ε)] and B = 1/ε. • The logistic loss ℓ(y, t) = log(1 + exp((1 − 2y)t)). Here we take I = R, B = 1 and ε = 0. • The hinge loss ℓ(y, t) = max(0, 1 + (1 − 2y)t). Here we take I = R, B = 1 and ε = 0. It is worth noting that only for the exponential loss did we need ε > 0. In this section, we define our notion of omnipredictors. Our definitions are simpler for the case of binary labels where Y = {0, 1}, hence we present that case first. Recall that for a predictor h, ℓ_D(h) is the expected loss of h under D. Definition 4.1 (Omnipredictor). Let C be a family of functions on X, and let L be a family of loss functions. The predictor f : X → [0, 1] is an (L, C, δ)-omnipredictor if for every ℓ ∈ L there exists a post-processing function k_ℓ : [0, 1] → R such that ℓ_D(k_ℓ ∘ f) ≤ min_{c∈C} ℓ_D(c) + δ. The definition states that for every loss ℓ, there is a simple (univariate) transformation k_ℓ of the predictions f such that the composition k_ℓ ∘ f has loss comparable to the best hypothesis c ∈ C, which is chosen tailored to the loss ℓ. Setting aside efficiency considerations, it is easy to show that f* is an omnipredictor for every C, L. Lemma 4.2. For every C, L, the ground-truth predictor f* is an (L, C, 0)-omnipredictor. Proof. Given an arbitrary loss function ℓ ∈ L, we need to exhibit k_ℓ such that

ℓ_D(k_ℓ ∘ f*) ≤ min_{c∈C} ℓ_D(c). (3)

Define the function k*_ℓ : [0, 1] → R which minimizes expected loss under the Bernoulli distribution: k*_ℓ(p) = arg min_{t∈R} E_{y∼Ber(p)}[ℓ(y, t)]. If there are multiple minima, we break ties arbitrarily. Conditioned on x = x, y ∼ Ber(f*(x)), so k*_ℓ(f*(x)) ∈ R is the value that minimizes the expected loss. Hence for every x ∈ X, E_{y|x}[ℓ(y, k*_ℓ(f*(x)))] ≤ E_{y|x}[ℓ(y, c(x))] for every c ∈ C, and taking expectations over x proves the claim. Note that the function k*_ℓ depends on ℓ but is independent of the distribution D. For instance, for the ℓ1 loss ℓ1(y, t) = |y − t|, we have k*_{ℓ1}(p) = 1(p ≥ 1/2). For the ℓ2 loss ℓ2(y, t) = (y − t)^2, we have k*_{ℓ2}(p) = p. While Definition 4.1 does not place any restrictions on the post-processing function k, in our upper bounds we will choose k_ℓ which is very close to the function k*_ℓ above. Our upper bounds will be efficient for (convex) loss functions for which a good approximation to k*_ℓ can be computed efficiently. Our lower bounds will hold for arbitrary k. Finally, a natural family of predictors arising from partitions plays a key role in our results. The canonical predictor simply predicts the expected label in each state of the partition: given a partition S, define f_S(x) = E_{D_i}[y] for every x ∈ S_i. Since f_S is constant within each state of the partition, it can be viewed as a function f_S : S → R. This view will be useful in our results. Omnipredictors for general Y. Consider the setting where we are given a distribution D on X × Y for Y ⊆ R, hence the labels can take on real values. We are primarily interested in the real-valued setting of Y = [0, 1] and the multi-class setting where Y = [l]. Given a state S_i in a partition S, let P_i ∈ P(Y) denote the distribution of y under D_i. The canonical predictor f_S : X → P(Y) is given by f_S(x) = P_i for all x ∈ S_i. We now define the notion of an omnipredictor. The main difference from the binary case is that the predictor now predicts a distribution in P(Y), so the post-processing function k_ℓ maps distributions to real values. In this section, we present our definitions of multicalibration. We first consider the case of binary labels, and then extend it to the multi-class and real-valued settings. To streamline the presentation, routine proofs are deferred to Appendix C.1. Given real-valued random variables z_1, z_2 from a joint distribution D, we define Cov_D[z_1, z_2] = E_D[z_1 z_2] − E_D[z_1]·E_D[z_2]. We will use the fact that covariance is bilinear.
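Bilinearity is what later yields closure under linear combinations (Lemma 5.9): constants drop out of the covariance, and Cov[Σ_j w_j c_j(x), y] = Σ_j w_j Cov[c_j(x), y], so a bound of α for each c_j gives a bound of α‖w‖_1 for the combination. A quick numerical check of this identity (our illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
c1, c2 = rng.random(100_000), rng.random(100_000)
y = rng.integers(0, 2, 100_000)
w0, w1, w2 = 0.5, 2.0, -3.0

cov = lambda a, b: np.mean(a * b) - np.mean(a) * np.mean(b)
lhs = cov(w0 + w1 * c1 + w2 * c2, y)     # covariance of the linear combination
rhs = w1 * cov(c1, y) + w2 * cov(c2, y)  # the constant term w0 contributes nothing
assert np.isclose(lhs, rhs)
```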
The following identity will be useful for Boolean y:

Cov_D[z, y] = Pr[y = 1]·(E[z | y = 1] − E[z]) = Pr[y = 0]·(E[z] − E[z | y = 0]). (5)

In this section, we define multicalibration for the binary-labels setting where Y = {0, 1}. We build on a recent line of work [HKRR18, JLP+20, KGZ19, GRSW21]. A detailed discussion of how our definitions compare to previous definitions is presented in Appendix A. The following definition is a generalization of the notion of α-multicalibration to real-valued c, and in the setting of partitions. Definition 5.2. Let D be a distribution on X × {0, 1}. The partition S of X is α-multicalibrated for C, D if for every i ∈ [m] and c ∈ C, the conditional distribution D_i satisfies

|Cov_{D_i}[c(x), y]| ≤ α. (6)

A consequence of this definition is that for each D_i, conditioning on y does not change the expectation of c(x) by much. Formally, by Equation (5), for i ∈ [m] and b ∈ {0, 1},

|E_{D_i}[c(x) | y = b] − E_{D_i}[c(x)]| ≤ α / Pr_{D_i}[y = b]. (7)

Definition 5.2 requires a bound on the covariance for every distribution D_i. This might be hard to achieve if D(S_i) is tiny, in which case we hardly see samples from D_i when sampling from D. This motivated a relaxed definition called (α, β)-multicalibration in [HKRR18, GRSW21]. We propose a different definition for which it is also easy to achieve sample efficiency. Rather than requiring the covariance to be small for every i, we only require it to be small on average. Let i ∼ D denote sampling (the index of) a set from the partition S according to D, so that Pr[i = i] = D(S_i). Definition 5.3. The partition S of X is α-approximately multicalibrated for C, D if for every c ∈ C, E_{i∼D}[|Cov_{D_i}[c(x), y]|] ≤ α. The next lemma shows that approximate multicalibration implies closeness to (strict) multicalibration under the distribution D. The proof is by applying Markov's inequality to Definition 5.3. Lemma 5.4. If S is α-approximately multicalibrated for C, D, then for every c ∈ C and β ∈ [0, 1], Pr_{i∼D}[|Cov_{D_i}[c(x), y]| ≥ α/β] ≤ β. This lemma shows that being αβ-approximately multicalibrated is closely related to the notion of (α, β)-multicalibration in [HKRR18, GRSW21], which roughly says that the α-multicalibration condition holds for all but a β fraction of the space X. Conversely, one can show that (α, β)-multicalibration gives (α + β‖C‖_∞)-approximate multicalibration. We find the single-parameter notion of α-approximate multicalibration more elegant. It is also easy to achieve sample efficiency since, as the next lemma shows, it only requires strong conditional guarantees for large states. Lemma 5.5. Let the partition S be such that for all i ∈ [m] where D(S_i) ≥ α/(2m‖C‖_∞), it holds that for every c ∈ C, |Cov_{D_i}[c(x), y]| ≤ α/2. Then S is α-approximately multicalibrated for C, D. Approximate multicalibration implies that for an average state in the partition, conditioning on the label does not change the expectation of c(x) much. The proof follows by plugging Equation (5) into the definition of approximate multicalibration. Corollary 5.6. If S is α-approximately multicalibrated for C, D, then for c ∈ C and b ∈ {0, 1}, E_{i∼D}[Pr_{D_i}[y = b]·|E_{D_i}[c(x) | y = b] − E_{D_i}[c(x)]|] ≤ α. Extension to the multi-class setting. In the multi-class setting Y = [l], so that l = 2 is exactly the Boolean case considered above. We use 1(y = j) to denote the indicator of the event that the label is j. The following definition generalizes Definition 5.3: Definition 5.7. Let D be a distribution on X × [l] where l ≥ 2. The partition S of X is α-approximately multicalibrated for C, D if for every c ∈ C and j ∈ [l], it holds that E_{i∼D}[|Cov_{D_i}[c(x), 1(y = j)]|] ≤ α. Extension to the bounded real-valued case. We now consider the setting where Y is a bounded interval; by scaling we may take Y = [0, 1]. For an interval J = [v, w] ⊆ Y, let 1(y ∈ J) be the indicator of the event that y ∈ J. Definition 5.8. Let D be a distribution on X × [0, 1].
The partition S of X is α-approximately multicalibrated for C, D if for every c ∈ C and every interval J ⊆ [0, 1], it holds that E_{i∼D}[|Cov_{D_i}[c(x), 1(y ∈ J)]|] ≤ α. Computational efficiency. Given these new definitions, a natural question is the computational complexity of computing multicalibrated partitions. Following [HKRR18], this task can be accomplished efficiently given a weak agnostic learner for the class C. We present a formal statement of this result for the multi-class setting in Theorem 9.2. The multi-class setting includes the Boolean-labels setting as a special case. For the purposes of omniprediction, we show in Section 8 that the bounded real-valued setting reduces to the multi-class setting. We defer these results to Section 9 since we do not consider them to be the main contribution of this work: several of the ideas used in the algorithm and its analysis are present in previous work, and they are presented here for completeness. In this section, we prove that approximate multicalibration is closed under two natural operations on the class C and the distribution D: 1. Linear combinations of C: We take sparse linear combinations of the functions c ∈ C. 2. Conditioning D on a subset: We condition the distribution D on a subset T ⊆ X whose indicator lies in the set C. Again we prove these for the case Y = {0, 1}, but the extension to arbitrary Y is routine. Multi-calibration under linear combinations. We will typically start with an approximately multicalibrated partition for a base class of bounded or even Boolean functions, such as decision trees or coordinate functions. Since our definition allows the functions c to be real-valued and possibly unbounded, we can consider functions arising from linear combinations over this base class. The motivation for this comes from boosting algorithms like AdaBoost or logistic regression, where we take a base class of weak learners and then construct a strong learner which is a linear combination of the weak learners [FS97, SF12]. We will denote by Lin_C the set of all linear functions over C. We associate the vector w = (w_0, w_1, . . .) with the function g_w ∈ Lin_C defined by g_w(x) = w_0 + Σ_j w_j c_j(x), where c_j ∈ C, and denote ‖w‖_1 = Σ_{j≥1} |w_j| (note we have excluded w_0). Let

Lin_C(W) = {g_w : ‖w‖_1 ≤ W} (15)

be the set of all W-sparse linear combinations. The following simple claim shows that multicalibration is closed under taking sparse linear combinations. The parameter α degrades with the sparsity. The proof follows from linearity of covariance, and is given in Appendix C.1. Lemma 5.9. For any W > 0, if S is α-approximately multicalibrated for C, D, then it is αW-approximately multicalibrated for Lin_C(W), D. Multi-calibration for sub-populations. Let T ⊆ X be a sub-population such that its indicator function belongs to C. Let D′ denote the distribution D | x ∈ T, where D′(x) = D(x)/D(T) for all x ∈ T. Let S′ = {S_i ∩ T} be the partition of T induced by S. We will use D′_i for the distribution D′ | x ∈ S′_i (which is the same as D | x ∈ S′_i since S′_i ⊆ T), and will denote p′_i = E_{D′_i}[y]. Let C′ ⊆ C denote the subset of functions from C that are supported on T (functions that are 0 outside of T). Note that C′ is nonempty, since the indicator of T lies in it. The proof of the following result for the sub-population T appears in Appendix C.2. Theorem 5.10. If S is α-approximately multicalibrated for C, D, then S′ is α(1 + ‖C‖_∞)/D(T)-approximately multicalibrated for C′, D′. In this section we consider the setting of binary labels, where Y = {0, 1}.
Given a (B, ε)-nice loss function ℓ, there is a natural post-processing of the canonical predictor f_S that we will analyze. Rather than choose the value k*_ℓ(p) ∈ R which minimizes expected loss under Ber(p), we restrict ourselves to the best value from I_ℓ. This restriction only costs us ε, by the ε-optimality property. Definition 6.1. Given a partition S of X, let k_ℓ(p) = arg min_{t∈I_ℓ} E_{y∼Ber(p)}[ℓ(y, t)], and define the ℓ-optimized hypothesis h_S^ℓ : X → I_ℓ as h_S^ℓ(x) = k_ℓ(f_S(x)). Since ℓ(y, t) is convex as a function of t, so is E_{y∼Ber(p)}[ℓ(y, t)] = p·ℓ(1, t) + (1 − p)·ℓ(0, t). Hence computing k_ℓ is a one-dimensional convex minimization problem, a classical problem with several known algorithms [BV04]. Being able to compute an ε′-approximate solution suffices for us: we can absorb the ε′ term into the error ε and pretend that ℓ is (B, ε + ε′)-nice instead. We can view the hypothesis h_S^ℓ as a function mapping S to I_ℓ, since it is constant on each S_i ∈ S and its range is I_ℓ. A simple consequence of the definition is that it is the best function in this class for minimizing expected loss. Corollary 6.2. For every function g : S → I_ℓ, we have ℓ_D(h_S^ℓ) ≤ ℓ_D(g). Proof. We sample i ∼ D and then x, y ∼ D_i, and show that the inequality holds conditioned on every choice of i = i. Since y ∼ D_i is distributed as Ber(p_i), E_{D_i}[ℓ(y, h_S^ℓ(x))] = E_{y∼Ber(p_i)}[ℓ(y, k_ℓ(p_i))] ≤ E_{y∼Ber(p_i)}[ℓ(y, g(S_i))] = E_{D_i}[ℓ(y, g(x))], where the inequality is by Definition 6.1. Our main result in this section is the following theorem. Theorem 6.3. Let L = L(B, ε) be the family of all (B, ε)-nice loss functions. If the partition S is α-approximately multicalibrated for C, D, then f_S is an (L, C, 2αB + ε)-omnipredictor. The following lemma is the key ingredient in our result. Informally, it says that c has limited distinguishing power within each state of the partition; specifically, that c(x) is not much better at minimizing a loss than the function obtained by taking its conditional expectation within each state of the partition S. Lemma 6.4. Given ℓ ∈ L and c ∈ C, define the predictor c̃ : S → I_ℓ by c̃(S_i) = clip(E_{D_i}[c(x)], I_ℓ). We have ℓ_D(c̃) ≤ ℓ_D(c) + 2αB + ε. Proof. By the convexity of ℓ we have

E_{D_i}[ℓ(y, c(x))] ≥ Pr_{D_i}[y = 1]·ℓ(1, E_{D_i|y=1}[c(x)]) + Pr_{D_i}[y = 0]·ℓ(0, E_{D_i|y=0}[c(x)]). (18)

By Lemma 3.2, ℓ(b, t) ≥ ℓ(b, clip(t, I_ℓ)) − ε for every b and t. Plugging this into Equation (18), we get

E_{D_i}[ℓ(y, c(x))] ≥ Σ_{b∈{0,1}} Pr_{D_i}[y = b]·ℓ(b, clip(E_{D_i|y=b}[c(x)], I_ℓ)) − ε. (19)

From the definition of c̃,

E_{D_i}[ℓ(y, c̃(x))] = Σ_{b∈{0,1}} Pr_{D_i}[y = b]·ℓ(b, clip(E_{D_i}[c(x)], I_ℓ)). (20)

Subtracting Equation (19) from Equation (20), we get

E_{D_i}[ℓ(y, c̃(x))] − E_{D_i}[ℓ(y, c(x))] ≤ Σ_{b∈{0,1}} Pr_{D_i}[y = b]·(ℓ(b, clip(E_{D_i}[c(x)], I_ℓ)) − ℓ(b, clip(E_{D_i|y=b}[c(x)], I_ℓ))) + ε. (21)

Since ℓ is B-Lipschitz on I_ℓ and clip(t, I_ℓ) is 1-Lipschitz as a function of t,

ℓ(b, clip(E_{D_i}[c(x)], I_ℓ)) − ℓ(b, clip(E_{D_i|y=b}[c(x)], I_ℓ)) ≤ B·|E_{D_i|y=b}[c(x)] − E_{D_i}[c(x)]|. (22)

Plugging this into Equation (21) gives

E_{D_i}[ℓ(y, c̃(x))] − E_{D_i}[ℓ(y, c(x))] ≤ B·Σ_{b∈{0,1}} Pr_{D_i}[y = b]·|E_{D_i|y=b}[c(x)] − E_{D_i}[c(x)]| + ε. (23)

Taking expectations over i ∼ D and applying Corollary 5.6 to each term in the sum yields ℓ_D(c̃) − ℓ_D(c) ≤ 2αB + ε, where the last inequality follows by the multicalibration condition. As a consequence, we can now prove Theorem 6.3. Proof of Theorem 6.3. Let h_S^ℓ = k_ℓ ∘ f_S be the ℓ-optimized hypothesis. It suffices to show that for any c ∈ C, ℓ_D(h_S^ℓ) ≤ ℓ_D(c) + 2αB + ε. For any c ∈ C, we have ℓ_D(h_S^ℓ) ≤ ℓ_D(c̃) ≤ ℓ_D(c) + 2αB + ε, where the first inequality is by Corollary 6.2, which applies since c̃ is a function mapping S to I_ℓ, and the second is by Lemma 6.4. Consider the family Lin_C(W) of linear combinations over C of weight at most W. By Lemma 5.9, S is αW-approximately multicalibrated for Lin_C(W). Applying Theorem 6.3, we derive the following corollary. Corollary 6.5. Let D be a distribution on X × {0, 1}, let Lin_C(W) be the linear functions over C with ‖w‖_1 ≤ W (see Equation (15)), and let L = L(B, ε) be the family of all (B, ε)-nice loss functions. If the partition S is α-approximately multicalibrated for C, D, then f_S is an (L, Lin_C(W), 2αBW + ε)-omnipredictor. To interpret this, assume we have a (B, ε)-nice loss function and we wish to have a predictor that is within 2ε of any function in Lin_C(W). Corollary 6.5 says that it suffices to have an α-approximately multicalibrated partition where α = ε/(2BW). Note that algorithms for computing such partitions have running time polynomial in 1/α, which translates to running time polynomial in BW/ε. A corollary for sub-populations follows from Theorems 6.3 and 5.10. For two families of functions T, P : X → R, we define their product as T × P = {T · P : T ∈ T, P ∈ P}, where (T · P)(x) = T(x)·P(x). Note that in the case when T ∈ T is binary-valued, T × P contains the restriction of every P ∈ P to the support of T.
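As a toy illustration of this product construction (ours; the stand-in functions below are hypothetical), T × P simply consists of all pointwise products of one function from each family, so when T is a subgroup indicator, T · P is P restricted to that subgroup:

```python
# Hypothetical stand-ins: T holds sub-population indicators, P holds hypotheses.
T = [lambda x: float(x[0] > 0)]            # sub-population: first feature positive
P = [lambda x: x[1], lambda x: x[1] ** 2]  # two toy hypotheses

# T x P = {T . P} with (T . P)(x) = T(x) * P(x); the t=t, p=p defaults
# pin down each (t, p) pair inside the list comprehension.
TxP = [lambda x, t=t, p=p: t(x) * p(x) for t in T for p in P]
```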
To informally instantiate this for a simple case, let T and P be the classes of decision trees of depth d_1 and d_2, respectively. Let the partition S be multicalibrated with respect to decision trees of depth d_1 + d_2. If we now consider any sub-population T identified by a decision tree of depth d_1, then the above result implies that the canonical predictor f_S is an omnipredictor for T, when compared against the class of decision trees of depth d_2 evaluated on T. Corollary 6.5 shows that multicalibration for C gives omnipredictors for Lin_C. It is natural to ask whether we can get omnipredictors for a richer class of functions using multicalibration for C. A natural candidate would be thresholds of functions in C: Thr_C = {1(c(x) ≥ θ) : c ∈ C, θ ∈ R}. Another natural extension would be to relax the convexity condition for loss functions in L. We present a simple counterexample which shows that multicalibration for C is insufficient to give omnipredictors for both these classes. This shows that a significant strengthening of the bound from Corollary 6.5 might not be possible. Lemma 6.7. There exists a distribution D, a family C of functions c : X → R, and a 0-multicalibrated partition S for C, D such that for any δ < 1/4: • f_S is not an (L, Thr_C, δ)-omnipredictor for any L containing the ℓ1 loss function. • f_S is not an (L, C, δ)-omnipredictor for any L containing the (non-convex) loss function ℓ(y, t) = |y − 1(t ≥ 0)|. Proof. Let D be the distribution on {0, 1}^3 × {0, 1} where x is uniform and y = x_1 ⊕ x_2 ⊕ x_3 is the parity of the coordinates, and let C be all affine functions. We claim the trivial partition S = {{0, 1}^3} is 0-multicalibrated for C, D. This is because every x_i is independent of y, so their covariance is 0. By linearity of expectation, the same is true for all functions in C. Thus f_S(x) = 1/2 for every x ∈ {0, 1}^3. Now consider the ℓ1 loss. A simple calculation shows that for every k : [0, 1] → R, E_D[ℓ1(y, k(f_S(x)))] ≥ 1/2, whereas the threshold 1(x_1 + x_2 + x_3 ≤ 1) ∈ Thr_C has ℓ1 loss 1/4, since it gets the two middle layers correct. This proves part (1). To deduce part (2), let ℓ(y, t) = |y − 1(t ≥ 0)| and g(x) = 3/2 − (x_1 + x_2 + x_3) ∈ C, so that E_D[ℓ(y, g(x))] = 1/4. In contrast, for any k : [0, 1] → R, E_D[ℓ(y, k(f_S(x)))] ≥ 1/2. In part (2), the loss function |y − 1(t ≥ 0)| is not Lipschitz or differentiable in t. We can ensure both these conditions by replacing the threshold with a sigmoid function, which still preserves the correlation with parity, at the cost of some reduction in the δ for which the bound holds [Kal04]. In this section we apply our results to the setting of agnostic learning, where the labels are Boolean and the relevant loss is the classification error, aka the 0-1 loss. Definition 7.2. A class H is (ε, W)-approximated by the class C if for every h ∈ H there exists g_w ∈ Lin_C(W) such that E_{x∼D}[|h(x) − g_w(x)|] ≤ ε. For p ∈ [0, 1], the function E_{y∼Ber(p)}[ℓ1(y, t)] = p|1 − t| + (1 − p)|t| is minimized in t by k_{ℓ1}(p) = 1(p ≥ 1/2). Hence the ℓ1-optimized hypothesis h_S^{ℓ1} outputs 1 for x ∈ S_i where p_i ≥ 1/2, and 0 elsewhere. We are now ready to state our main result about agnostic learning via multicalibration. We show that multicalibration gives a predictor that is competitive with the best predictor in a possibly much more powerful class H. Since multicalibration can be achieved using a weak learner for C, this implies a form of agnostic boosting. Theorem 7.3. Let H be a class that is (ε/2, W)-approximated by C. If the partition S is ε/(2W)-approximately multicalibrated for C, D, then err(h_S^{ℓ1}) ≤ OPT(H) + ε. Hence an algorithm that outputs such a partition gives an agnostic learner for H. Proof. Let h = arg min_{h∈H} err(h), so that err(h) = OPT(H). There exists g_w ∈ Lin_C(W) such that E_{x∼D}[|h(x) − g_w(x)|] ≤ ε/2. Hence, by the triangle inequality, E_D[ℓ1(y, g_w(x))] ≤ E_D[ℓ1(y, h(x))] + ε/2 = err(h) + ε/2. The ℓ1 loss function is 1-Lipschitz, and (R, 1, 0)-nice.
By Corollary 6.5 we conclude that, since S is ε/(2W)-approximately multicalibrated for C, D, E_D[ℓ1(y, h_S^{ℓ1}(x))] ≤ E_D[ℓ1(y, g_w(x))] + ε/2. But since h_S^{ℓ1} is a Boolean function, we have E_{(x,y)∼D}[ℓ1(y, h_S^{ℓ1}(x))] = err(h_S^{ℓ1}). Hence err(h_S^{ℓ1}) ≤ err(h) + ε, which proves our claim. Some comments on the connection of Theorem 7.3 to other work: • The agnostic learnability of a class H that is ℓ1-approximated by C was first proved by [KKMS08] using linear programming; see [Fel09, KK09] for a boosting-based approach. Our result reproves this through multicalibration and the algorithm of [MM02]. • Agnostic boosting was introduced in the work of [BLM01]. The agnostic boosting abilities of the [MM02] algorithm are analyzed in [KMV08]; however, they bound the error by OPT(C). Our work shows that this algorithm is even more powerful: one can compare it to OPT(H). • Theorem 7.3 gives an upper bound, but it is not a tight characterization of the error of the hypothesis h_S^{ℓ1}. Indeed, in Subsection 7.1 we give an example showing that the error of the hypothesis can be much smaller than OPT(Lin_C). • For the purposes of agnostic boosting, it is often desirable to keep the same marginal distribution over X. Following Kalai-Kanade [KK09] and Feldman [Fel09], one can reduce multicalibration to weak agnostic learning over a distribution with the same marginal on X, by adding noise to the labels. In the literature on agnostic boosting (specifically [KK09]), a γ-weak agnostic learner for H is defined as an algorithm that outputs c ∈ C such that, for γ > 0,

cor_D(c) ≥ γ · max_{h∈H} cor_D(h), (25)

in terms of the correlation cor_D(h) ∈ [−1, 1], defined as cor_D(h) = E_D[(2y − 1)h(x)]. It is shown that a γ-weak agnostic learner can be used to achieve error within ε of OPT(H) using time and samples poly(1/γ, 1/ε). [KK09] do not explicitly limit what the class H can be relative to C, beyond requiring that it satisfy Equation (25). However, it is not difficult to see that this definition limits agnostic boosting to H ⊆ Lin_C, i.e., linear combinations as defined in Equation (15). Therefore, agnostic boosting (as in [KMV08, KK09]) essentially learns linear combinations of C, while we have shown that multicalibrated predictors also satisfy this goal. Indeed, our proof shows that the marginal distribution on X can be considered fixed; we just require Equation (25) to hold for all h ∈ H and all distributions on the label y. Lemma 7.4. For Equation (25) to hold for every h ∈ H, we must have H ⊆ Lin_C. Proof. For the purpose of contradiction, consider any h ∉ Lin_C. Let g : X → R be the projection of h onto Lin_C, i.e., the function that minimizes E_D[(g(x) − h(x))^2]. Since X and C are finite, g is bounded. Thus consider the residual r(x) = h(x) − g(x): by optimality of the projection, E_D[r(x)c(x)] = 0 for every c ∈ C, while E_D[r(x)h(x)] = E_D[r(x)^2] > 0. In particular, for the distribution D′ which has the same marginal over X but where f*(x) = E_{D′}[y | x] = (1 + r(x)/M)/2 for a sufficiently large constant M, it is not difficult to see that cor_{D′}(c) = 0 for all c ∈ C and yet cor_{D′}(h) > 0, violating the definition of a weak learner. We now show that being approximately multicalibrated is a stronger notion than competing with OPT(Lin_C), in that being multicalibrated with respect to C can yield a loss even lower than that of the best classifier in Lin_C. This says that the upper bound on the classification error in terms of OPT(Lin_C) given by Theorem 7.3 may not always be tight. Theorem 7.5.
For any ε > 0, there exists a distribution D on {0, 1}^2 × {0, 1} and a set of functions C : X → {0, 1} such that for any approximately multicalibrated partition S, we have err(h_S^{ℓ1}) ≤ ε, whereas OPT(Lin_C) ≥ 1/4. Proof. Let X = {0, 1}^2 and Y = {0, 1}, let the marginal distribution over x ∈ X be uniform, and let the conditional distribution of y given x be deterministic up to an ε amount of noise on the points with x_1 = 0. Let C = {x_1, x_2} be the family of two classifiers based on the two coordinates. It is not hard to see that, for this C and for H := Lin_C, which is the most agnostic boosting can cover, OPT(H) = 1/4 is the error of the best classifier, c(x) = x_2. On the other hand, it is not difficult to see that any multicalibrated partition must have separate states for {(0, 1)} and {(0, 0)}. This implies a smaller 0-1 loss of ε/2. Clearly, by reducing the constant ε, the multicalibrated loss can approach 0, but we chose not to set it to exactly 0 because standard boosting could be applied in the noiseless case. In this section we consider two settings: 1. The multi-class setting. Here the labels take values in [l]. For each j ∈ [l], the label j is associated with a loss function ℓ(j, t) which is (B, ε)-nice. This means that there is an interval I_ℓ for which the B-Lipschitz and ε-optimality properties hold for the functions ℓ(j, t) for every j ∈ [l]. The loss functions for different labels could be very different, in analogy to different false-positive and false-negative costs in the Boolean case. 2. The real-valued setting. Here the labels take values in [0, 1]. We will assume that the loss function ℓ(y, t) is B-Lipschitz in y for all t, and that ℓ(y, t) is (B, ε)-nice as a function of t. We show that for nice loss functions, omniprediction in the real-valued setting reduces to the multi-class setting. We take l = B/ε and partition [0, 1] into l disjoint buckets {b_j}_{j=1}^l of width ε/B each. For y ∈ b_j, we use the loss function ℓ(j/l, t) in place of ℓ(y, t). This amounts to discretizing y to ỹ, where |y − ỹ| ≤ ε/B. By the B-Lipschitz property, it follows that for any predictor f : X → R,

|E_D[ℓ(y, f(x))] − E_D[ℓ(ỹ, f(x))]| ≤ ε. (26)

Since ỹ takes on l discrete values, we have reduced to the multi-class setting. Hence we now focus on generalizing our results to the multi-class setting where Y = [l] for l ≥ 2. Recall that the definition of multicalibration in the multi-class setting requires that for all c ∈ C and j ∈ [l], we have E_{i∼D}[|Cov_{D_i}[c(x), 1(y = j)]|] ≤ α. Let P(l) denote the space of probability distributions on [l]. Given a partition S, we associate to each state S_i a probability distribution over labels, P_i ∈ P(l). The canonical predictor f_S : X → P(l) predicts a distribution in P(l) for every x ∈ X and is given by f_S(x) = P_i for all x ∈ S_i. In analogy with the corresponding definitions for the binary classification setting, for every P ∈ P(l) we define k_ℓ(P) ∈ I_ℓ to be the action that minimizes expected loss under P, k_ℓ(P) = arg min_{t∈I_ℓ} E_{y∼P}[ℓ(y, t)], and the ℓ-optimized hypothesis h_S^ℓ : X → I_ℓ by h_S^ℓ(x) = k_ℓ(f_S(x)). Our main theorem is a direct generalization of Theorem 6.3. Theorem 8.1. Let Y = [l] and let L = L(B, ε) be the family of all (B, ε)-nice loss functions. If the partition S is α-approximately multicalibrated for C, D, then f_S is an (L, C, αBl + ε)-omnipredictor. The theorem follows from the next lemma, which generalizes Lemma 6.4 to l ≥ 2. Since the proof follows along similar lines until the last step, we only describe the difference. Lemma 8.2. Given ℓ ∈ L and c ∈ C, define c̃ : S → I_ℓ by c̃(S_i) = clip(E_{D_i}[c(x)], I_ℓ). Then ℓ_D(c̃) ≤ ℓ_D(c) + αBl + ε. Proof. We follow the proof of Lemma 6.4 until Equation (23); this portion does not assume y ∈ {0, 1}. At this point we have established that

E_{D_i}[ℓ(y, c̃(x))] − E_{D_i}[ℓ(y, c(x))] ≤ B·Σ_{j∈[l]} Pr_{D_i}[y = j]·|E_{D_i|y=j}[c(x)] − E_{D_i}[c(x)]| + ε. (29)

The following analog of Corollary 5.6 follows from Equation (5) about covariance with binary random variables: for every j ∈ [l], E_{i∼D}[Pr_{D_i}[y = j]·|E_{D_i|y=j}[c(x)] − E_{D_i}[c(x)]|] = E_{i∼D}[|Cov_{D_i}[c(x), 1(y = j)]|] ≤ α. We use this to bound the expectation over i ∼ D of the RHS of Equation (29) by αBl + ε, which gives the desired bound. One of the most commonly used losses in the real-valued case is the squared loss ℓ2(y, t) = (t − y)^2. For this loss, we can show a stronger bound.
Since ℓ2 is a proper scoring rule, the post-processing function k*_{ℓ2} is just the identity function; hence h_S^{ℓ2}(x) = f_S(x) = E_{y∼D_i}[y] = p_i for x ∈ S_i. We will show the following guarantee comparing it to any linear function. Theorem 8.3. Consider the real-valued prediction problem with y ∈ [0, 1], and let ỹ be the discretization of y into 1/ε buckets (as in Equation (26)). Let S be α-approximately multicalibrated for C, D, and let f_S(x) be the predictor defined above. Let ℓ(y, t) = (t − y)^2 denote the ℓ2 loss. For any function g_w ∈ Lin_C and y ∈ [0, 1], we have

E_D[ℓ(ỹ, f_S(x))] + E_D[(g_w(x) − f_S(x))^2] ≤ E_D[ℓ(ỹ, g_w(x))] + O(α‖w‖_1/ε).

Let us compare this bound with the one implied by Corollary 6.5, which applies since the ℓ2 loss is ([0, 1], 1, 0)-nice. The main difference is the addition of the term E[(g_w(x) − f_S(x))^2] to the LHS. This tells us that when α‖w‖_1/ε is small (say O(ε)), any linear function that is far (in squared distance) from f_S incurs a large ℓ2 loss. The theorem follows from the following Pythagorean bound, which holds for the squared error and is proved in Appendix C.3. Lemma 8.4. In the setting of Theorem 8.3, E_D[ℓ(ỹ, g_w(x))] ≥ E_D[ℓ(ỹ, f_S(x))] + E_D[(g_w(x) − f_S(x))^2] − O(α‖w‖_1/ε). In this section, we show how one can use a weak agnostic learner to compute a multicalibrated partition for the multi-class setting, where we are given a distribution D on X × [l] and our goal is to compute a partition of X satisfying Definition 5.7. We start with the definition of a weak agnostic learner in the setting where the class of hypotheses C can be real-valued. Similar definitions appear in the literature in [Kal04, KMV08, KK09]. We restrict the class of output hypotheses C′ found by the weak learner to be Boolean for simplicity, but this requirement is easy to relax. Definition 9.1. Let C = {c : X → R} be a family of real-valued hypotheses. Let w : (0, 1] → (0, 1] be such that w(α) ≤ α. Let C′ = {c′ : X → {0, 1}} be a family of Boolean hypotheses. A (w, C′)-weak learner for C is given sample access to a distribution D on X × {0, 1}. If there exists c ∈ C such that Cov_D[c(x), y] ≥ α, then with probability 1 − δ the weak learner returns a hypothesis c′ ∈ C′ such that Cov_D[c′(x), y] ≥ w(α). A few comments on our definition: • We have defined the weak agnostic learner to work for every distribution D on X × {0, 1}. One can work with a weaker notion called distribution-specific agnostic learning [KK09, Fel09], where we only have guarantees for a fixed marginal distribution on X. • Ideally, we would like w(α) to be lower bounded by a polynomial in α, and the sample complexity and running time to depend polynomially on 1/w(α), log(1/δ), and the dimensionality of X. Since w(α) ≤ α, this allows a polynomial dependence on 1/α. For simplicity, we will ignore the failure probability δ, since it can be made arbitrarily small by repetition. We will state our results in terms of the number of calls made to the weak learner, through which the running time and sample complexity of our algorithm depend on those of the weak learner. • We can relax the assumption that C′ is Boolean to allow for bounded real-valued functions: if c′ : X → R satisfies Cov[c′(x), y] ≥ w(α), then it is well known [Kal04] that there exists θ such that Cov[1(c′(x) ≥ θ), y] ≥ w(α)/max_{x∈X}|c′(x)|. Such a θ can be found by a simple line search. Theorem 9.2.
Theorem 9.2. Given a (w, C′) weak learner for C, for any α > 0 and any distribution D on X × [l], there is an efficient algorithm to compute an α-approximately multicalibrated partition for C, D of size

m ≤ 2l · (12/(α·w(α/2)^2))^{l−1}.

The partition can be computed by a layered branching program with nodes labelled by hypotheses from C′, of width m and length T, where

T ≤ 2l · (12/(α·w(α/2)^2))^l.

The algorithm makes O((l/w(α/2))^{O(l)}) queries to the weak agnostic learner.

Our algorithm uses ideas and analyses that appear previously in the literature [MM02, Kal04, GRSW21], yet since it differs from each of these works along one or more axes, we present the algorithm and sketch its analysis in Appendix B for completeness. We highlight the similarities to and differences from previous work below.

• The notion of boosting using branching programs was introduced by [MM02], building on the work of [KM99] on boosting via decision trees. The work of [Kal04] shows that the [MM02] algorithm can be used to achieve correlation boosting in the real-valued setting where the labels come from [0,1]. Under the assumption that for any distribution D on X × [0,1] the weak learner can produce a real-valued hypothesis that has non-trivial correlation with the labels, Kalai [Kal04] shows that the [MM02] algorithm produces a hypothesis whose correlation with the labels is close to 1. In contrast, we work with binary labels, and only assume that the weak learner can produce a correlated hypothesis for a specific marginal distribution D_X. The outcome of our algorithm is a multicalibrated partition. Our definition of multicalibration is influenced by the correlation-based notion of weak learning from this work.

• The notion of multicalibrated partitions was introduced in [GRSW21] in the context of computing importance weights. Their work may be thought of as defining approximate multicalibration in the unsupervised setting, which involves some subtle technical issues. Our algorithm and analysis follow the same high-level outline, but there are technical differences.

The work of [HKRR18, JLP+20] and follow-up papers considers the setting of predictors f : X → [0,1] where C is a collection of subsets, or equivalently of Boolean functions. Our work departs from this along two axes:

1. We consider partitions and not predictors. While this is a seemingly minor change, it lets us work with the covariance under the conditional distribution, which proves to be the right notion for numerous technical reasons. This is inspired by the works of [GRSW21], who introduced partitions for multicalibration in the unsupervised setting for the problem of computing importance weights, and [Kal04], who uses covariance as a splitting criterion for boosting.

2. We allow C to consist of arbitrary real-valued functions, not just Boolean functions. Real-valued functions have been considered for multiaccuracy [KGZ19] but, to our knowledge, not for multicalibration. In the Boolean setting, our definition is essentially equivalent to that of [HKRR18, JLP+20], under a suitable setting of parameters.

A predictor is a function f : X → [0,1]. The goal of our algorithm is to produce a predictor that approximates the ground-truth values f* : X → [0,1]. We say that f is α-calibrated for a set S if

|E_D[f(x) − y | x ∈ S]| ≤ α.

One can switch between the notions of predictors and partitions with only a small loss in parameters. We first show how to derive a predictor from a partition and vice versa; a sample-based sketch follows.
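As an illustration of moving between the two notions, here is a minimal sketch. The function names and the array-of-state-indices representation are our own, and finite samples stand in for the distribution D.

```python
import numpy as np

def predictor_from_partition(states, y):
    """Canonical predictor f_S on a sample: on each state, predict the
    empirical mean label; states[i] is the state index of sample i."""
    states, y = np.asarray(states), np.asarray(y, dtype=float)
    means = {s: y[states == s].mean() for s in np.unique(states)}
    return lambda s: means[s]  # maps a state index to its prediction

def partition_from_predictor(f_vals, lam):
    """lambda-canonical partition S_f: points whose predictions fall in
    the same width-lam interval J_i share a state."""
    k = int(np.ceil(1 / lam))
    return np.minimum((np.asarray(f_vals) / lam).astype(int), k - 1)
```

Composing the two maps changes the canonical predictor by at most λ, matching the discussion that follows.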
Our notion of canonical partition is inspired by the notion of λ-discretization in [HKRR18]; however, they discretize the range of f, rather than partition the domain X. Suppose we start from a partition S and let f = f_S be its canonical predictor. The canonical λ-partition S_f for f merges together those states of S for which E[y] lies in the same interval J_i, hence these expectations are all within λ of each other. Hence the canonical predictor for S_f differs from the canonical predictor for S by at most λ. In the other direction, if we start from f, let S = S_f be the λ-canonical partition, and let f_S be the corresponding canonical predictor, then f_S is within λ of f.

Comparing the definitions for Boolean functions. Let us consider the family C to consist of Boolean functions c : X → {0,1}. We can associate each c ∈ C with the subset c^{−1}(1) ⊆ X. The two relevant definitions for us are the notion of multicalibration from [HKRR18] and mean-multicalibration from [JLP+20]. Both definitions apply to families of sets and predictors, so we restate them specialized to the setting of a canonical predictor for a partition. [JLP+20] were interested in the case y ∈ [0,1], since their goal was to define multicalibration for higher moments; in our setting, we consider the Boolean case.

Definition A.2 ([JLP+20]). Let C be a collection of Boolean functions and let S be a partition. The canonical predictor associated with S is α-mean-multicalibrated if for every c ∈ C and every i ∈ [m],

|E_{D_i}[y | c(x) = 1] − E_{D_i}[y]| ≤ α / E_{D_i}[c(x)].

This is a slight variation of the original definition of multicalibration from [HKRR18], which requires the RHS to be bounded by α but then only considers those pairs i, c for which the set {x ∈ S_i : c(x) = 1} is sufficiently large. The two definitions are equivalent up to a reparametrization; the formulation above handles the case of small sets by allowing the guarantee to degrade. See Remark 2.1 of [JLP+20].

In Equation (7) we condition on y and consider the expectation of c(x), whereas (31) conditions on c(x) and takes the expectation of y. Conditioning on y extends naturally even to real-valued functions c and Boolean labels y, where the notion of conditioning on c may not make sense. We also extend it to real-valued y by considering the event 1(y ∈ J) for some interval J. But in fact the two definitions are equivalent when c and y are both Boolean.

Proof. Since c is Boolean, we have E_{D_i}[y · c(x)] = E_{D_i}[y | c(x) = 1] · E_{D_i}[c(x)]. We may assume E_{D_i}[c(x)] ≠ 0, else the condition holds trivially. Multiplying both sides of Equation (31) by E_{D_i}[c(x)], we can rewrite it as

|E_{D_i}[y · c(x)] − E_{D_i}[y] · E_{D_i}[c(x)]| ≤ α.

The quantity on the left is |Cov_{D_i}[c(x), y]|, hence this is identical to Equation (6).

In this section, we sketch the proof of Theorem 9.2. A distribution in P(l) is a vector v ∈ [0,1]^l such that Σ_i v_i = 1. We can partition this space into subcubes, where each subcube is the product of l intervals of length δ. We refer to this collection of subcubes as I_δ. Naively |I_δ| = O(1/δ^l), but we can also bound it by l · O(1/δ^{l−1}): since the coordinates need to sum to 1, fixing the first l − 1 coordinates leaves at most l possible intervals of length δ for the last coordinate.

Our algorithm maintains a partition S of the space X. We have a target bound m = |I_δ| on the number of states, for a δ to be specified later. We will ensure that the number of states never exceeds 2m. We iteratively modify the partition until we reach a multicalibrated partition, using one of two operations, Split and Merge (defined below). Split increases the number of states in the partition by 1. Merge is applied whenever the number of states exceeds 2m and brings the number down to at most m.
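The following Python sketch illustrates the Merge step just described, using the δ-subcube index of each state's empirical label distribution as the merge key; the sample-based interface and names are our own illustration.

```python
import numpy as np

def merge_states(state_of, y, l, delta):
    """Merge sketch: states whose empirical label distributions over [l]
    fall in the same delta-subcube of P(l) are merged into one state,
    so at most |I_delta| states survive."""
    state_of = np.asarray(state_of)
    y = np.asarray(y)  # labels in {0, ..., l - 1}
    new_id, merged = {}, np.empty_like(state_of)
    for s in np.unique(state_of):
        ys = y[state_of == s]
        P = np.bincount(ys, minlength=l) / len(ys)  # label distribution P_i
        key = tuple(np.minimum((P / delta).astype(int),
                               int(np.ceil(1 / delta)) - 1))  # subcube index
        new_id.setdefault(key, len(new_id))
        merged[state_of == s] = new_id[key]
    return merged
```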
The algorithm terminates when neither operation is applicable. This results in a sequence {S_t}_{t=0}^T, starting from the trivial partition S_0 = {X} and ending when the algorithm terminates with S_T, which we will show is multicalibrated. For every partition S_t, we define a corresponding distribution function f_t : S_t → P(l), where f_t(S_i) ∈ P(l) is the distribution over labels [l] conditioned on S_i.

Since the Merge operation reduces the number of states from 2m to m, there are at least m Split operations before the next Merge happens. If we partition time into epochs, each ending with a Merge, then in each epoch the potential increases by at least

m · (α/4m) · w(α/2)^2 − 2δ = (α/4) · w(α/2)^2 − 2δ ≥ δ,

if we choose δ = (α/12) · w(α/2)^2. Since the potential can be at most 1, this implies a bound of 1/δ on the number of epochs. We have

m_T ≤ 2m ≤ 2l/δ^{l−1} ≤ 2l · (12/(α·w(α/2)^2))^{l−1}.

Since the number of epochs is at most 1/δ and a single epoch involves at most 2m Split operations, we have

T ≤ 2m/δ ≤ 2l/δ^l ≤ 2l · (12/(α·w(α/2)^2))^l.

Sample complexity. At each time step, the weak agnostic learner might be invoked at most l · m times. Since we only consider those states S_i where D(S_i) ≥ α/4m, the number of samples from D needed for each call is roughly 4m/α times the sample complexity of the weak agnostic learner. Overall, the multiplicative overhead in sample complexity over the weak agnostic learner is at most T · lm · 4m/α = O(l·m^2·T/α).

When the algorithm terminates, no Split is applicable: the states with D(S_i) < α/4m contribute at most 2m · (α/4m) = α/2 in total to the multicalibration error, while for every remaining state the failure of the weak learner implies |Cov_{D_i}[c(x), 1(y = j)]| < α/2 for all c ∈ C and j ∈ [l]. Summing the two bounds, we conclude that S_T is α-approximately multicalibrated.

Proof of Corollary 5.6. The claim follows directly from Equation (5).

Proof of Lemma 5.9. Let g_w ∈ Lin_C(W). Then for each i ∈ [m], using the linearity of covariance (the constant term contributes nothing),

Cov_{D_i}[g_w(x), y] = Σ_{j≥1} w_j · Cov_{D_i}[c_j(x), y].

Averaging over states, using the triangle inequality, and using approximate multicalibration,

Σ_i D(S_i) · |Cov_{D_i}[g_w(x), y]| ≤ Σ_{j≥1} |w_j| · Σ_i D(S_i) · |Cov_{D_i}[c_j(x), y]| ≤ α · Σ_{j≥1} |w_j| ≤ αW.

In order to prove Theorem 5.10, we will first restate the definition of multicalibration as follows. For a set S ⊆ X, let S(x) = 1 if x ∈ S and 0 otherwise.

Lemma C.1. The partition S of X is α-approximately multicalibrated for C, D iff for every c ∈ C,

Σ_{i∈[m]} |E_D[c(x) · (y − f_S(x)) · S_i(x)]| ≤ α.

Proof. We can rewrite the LHS using E_D[c(x) · (y − f_S(x)) · S_i(x)] = D(S_i) · Cov_{D_i}[c(x), y], since f_S(x) = E_{D_i}[y] on S_i. Now averaging over i ∈ [m] with probability D(S_i) and plugging in l = 1/ε gives the desired inequality.
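As a worked instance of the bounds on δ, m, and T above, the following sketch computes them for given parameters; the function name is our own, and w(α/2) is assumed known for the weak learner at hand.

```python
def partition_size_bounds(alpha, l, w_half_alpha):
    """Evaluate the bounds from the potential argument above."""
    delta = (alpha / 12) * w_half_alpha ** 2   # potential gain per epoch
    m_bound = 2 * l * (1 / delta) ** (l - 1)   # states never exceed 2m
    T_bound = 2 * l * (1 / delta) ** l         # branching-program length
    return delta, m_bound, T_bound

# For alpha = 0.1, l = 2 and w(alpha/2) = 0.025, delta is about 5.2e-6,
# so the partition never has more than roughly 7.7e5 states.
```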
We now define the two operations used by the algorithm of Theorem 9.2.

Split: The Split operation takes two arguments:
1. A state S_i ∈ S_t such that D(S_i) ≥ α/4m.
2. A hypothesis c′ ∈ C′ such that there exists a label j ∈ [l] so that Cov_{D_i}[c′(x), 1(y = j)] ≥ w(α/2).

It replaces S_i with the two states S_i ∩ c′^{−1}(1) and S_i ∩ c′^{−1}(0) to obtain the next partition S_{t+1}. To see if the Split operation can be applied, we iterate over all states S_i which are sufficiently large (D(S_i) ≥ α/4m). For every j ∈ [l], we create a distribution D_{i,j} on X × {0,1} by sampling a pair (x, y) according to D_i and outputting the pair (x, 1(y = j)). We run the weak learner on each such distribution D_{i,j}; if for some j it returns a hypothesis c′ with Cov_{D_{i,j}}[c′(x), y] ≥ w(α/2), we apply the Split operation to S_i and c′.

Merge: The Merge operation is applied whenever the number of states exceeds 2m, and brings it down to at most m. For every subcube I ∈ I_δ, we merge into a single state all states S_i such that f_t(S_i) ∈ I.

Since the squared loss is 1-Lipschitz, by the argument in Equation (26) we can work with an ε-discretization of the interval [0,1] at the cost of at most ε in loss. Therefore, for l = 1/ε, let ỹ = j/l for y ∈ b_j.

References

Addressing bias in prediction models by improving subpopulation calibration.
Advancing subgroup fairness via sleeping experts.
A complexity-theoretic perspective on fairness.
A decision-theoretic generalization of on-line learning and an application to boosting.
A general method for comparing probability assessors.
Agnostic boosting.
Agnostically learning halfspaces.
Autoregressive conditional density estimation.
Boosting: Foundations and Algorithms.
Boosting in the presence of noise.
Boosting using branching programs.
Calibrated asymmetric surrogate losses.
Consistency analysis for binary classification revisited.
Consistency of support vector machines and other regularized kernel classifiers.
Convex Optimization.
Convexity, classification, and risk bounds.
Distribution-specific agnostic boosting.
Inherent tradeoffs in the fair determination of risk scores.
Learning monotonic linear functions.
Learning nested halfspaces and uphill decision trees.
Moment multicalibration for uncertainty estimation. CoRR.
Multiaccuracy: Blackbox post-processing for fairness in classification. AAAI/ACM Conference on AI, Ethics, and Society.
Multicalibrated partitions for importance weights.
Multicalibration: Calibration for the (computationally-identifiable) masses.
Multi-group agnostic PAC learnability.
Objective probability forecasts.
On agnostic boosting and parity learning.
On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics.
On the boosting ability of top-down decision tree learning algorithms.
On the consistency of multiclass classification methods.
Optimal decision-theoretic classification using non-decomposable performance metrics.
Outcome indistinguishability.
Potential-based agnostic boosting.
Preventing fairness gerrymandering: Auditing and learning for subgroup fairness.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
Ran Balicer, and Noa Dagan. Developing a COVID-19 mortality risk prediction model when individual-level data are not available.
Tracking and improving information in the service of fairness.
Understanding Machine Learning: From Theory to Algorithms.