key: cord-0480519-wevy62qg authors: Chen, Guanting; Li, Xiaocheng; Ye, Yinyu title: Fairer LP-based Online Allocation via Analytic Center date: 2021-10-27 journal: nan DOI: nan sha: 8ae34e1ccba335e574ceb7e54b286ba8f674e02e doc_id: 480519 cord_uid: wevy62qg In this paper, we consider an online resource allocation problem where a decision maker accepts or rejects incoming customer requests irrevocably in order to maximize expected reward given limited resources. At each time, a new order/customer/bid is revealed with a request of some resource(s) and a reward. We consider a stochastic setting where all the orders are i.i.d. sampled from an unknown distribution. Such formulation arises from many classic applications such as the canonical (quantity-based) network revenue management problem and the Adwords problem. While the literature on the topic mainly focuses on regret minimization, our paper considers the textit{fairness} aspect of the problem. On a high level, we define the fairness in a way that a fair online algorithm should treat similar agents/customers similarly, and the decision made for similar agents/customers should be consistent over time. To achieve this goal, we define the fair offline solution as the analytic center of the offline optimal solution set, and introduce textit{cumulative unfairness} as the cumulative deviation from the online solutions to the fair offline solution over time. We propose a fair algorithm based on an interior-point LP solver and a mechanism that dynamically detects unfair resource spending. Our algorithm achieves cumulative unfairness on the scale of order $O(log(T))$, while maintains the regret to be bounded without dependency on $T$. In addition, compared to the literature, our result is produced under less restrictive assumptions on the degeneracy of the underlying linear program. Algorithm-based decision-making systems have become widespread in almost every aspect of modern life, such as policing (Rudin, 2013) , hiring (Miller, 2015) , credit lending (Petrasic et al., 2017) , and criminal justice (Huq, 2018) . Meanwhile, the data-driven predictions and decision-making have raised increasing concerns on discriminatory and unfair treatment. The quantification and promotion of fairness for machine learning algorithms are rapidly growing areas of research. Given the abundant research focusing on fair classification and fair regression, this paper focuses on the fairness aspect of the online resource allocation problem. In many practical settings such as COVID-19 vaccine rollout, hiring, and service scheduling, the decision maker has to make decision online with incomplete knowledge of the demand/population distribution. Therefore, fairness concerns arise naturally in online resource allocation problems. Decision makers have to distribute limited supply of resources across multiple groups with different needs. In this paper, we formulate the online resource allocation problem as an online Linear Programming problem, which is the underlying model for various OR/OM problems such as quantity-based network revenue management (Talluri and Van Ryzin, 2006) , service reservations (Stein et al., 2020) , and order fulfillment (Simchi-Levi, 2010) . Our motivations can be summarized from the following example. Emergency rooms are scarce resources because of the time-sensitive feature of emergency surgeries. A hospital has to make decisions on whether to accept or reject a patient. 
It is impossible for the hospital to accept all patients, and the rejected patients will be diverted to other hospitals. Consider a case where there is only one type of resource: the total operating hours of the emergency room. The objective is to minimize the idle time of the emergency room over a period of T days. We assume there are two types of patients. Patients of type A request surgery that takes shorter time and patients of type B request surgery that takes longer time. Those two types of patients have different arrival rates λ 1 and λ 2 , which are unknown to the hospital. The hospital has to make a real-time decision based on the historical information on the arrival rates and the operating time on each type of patient. We assume that there are more patients than the available emergency rooms. As a result, every time the hospital needs to schedule a new surgery, it has to choose from a group of patients consisting of type A and B. Since the hospital's objective is to minimize the idle time of the emergency room, there could be allocation rules that are optimal in terms of minimizing the idle time, but are unfair in some aspects. • Allocation rule a): when the hospital has the availability to schedule a new surgery, it always accepts one patient of type A from the patient group, while rejecting all patients of type B. This is obviously unfair to patients of type B. • Allocation rule b): whenever a decision needs to be made before day T /2, the hospital always accepts one patient of type A, and when the decision needs to be made after day T /2, it always accepts one patient of type B. Although the hospital spends equal operating hours on servicing patients of type A and B, this policy is unfair for patients of type B arriving before day T /2 and patients of type A arriving after day T /2. If the hospital knew the underlying arrival rates λ 1 and λ 2 , a natural fair allocation rule would be to randomly accept one patient of type A with probability λ1 λ1+λ2 and to randomly accept one patient of type B otherwise. It maintains the fairness between patients of type A and type B such that the operating times of the emergency rooms for both types of patients are proportional to their population. Moreover, both patient types will have the same probability of getting accepted across time. For such reasons, we refer to this allocation rule as fair offline decision. In the online environment, the hospital does not know the parameter λ 1 and λ 2 . To ensure fairness, the hospital has to make decisions that are close to the fair offline decision; as time increases, the online decision should converge to the fair offline decision. We also require that the treatment of similar patients across time should be similar, and therefore we expect the deviations from the online decision to the offline decision to be small. We take the fairness notion that "similar individuals should be treated similarly" (Dwork et al., 2012) and formally propose two fairness requirements for online algorithms. Firstly, we require our online solution to converge to the fair offline solution, which is defined as the analytic center (see Luenberger and Ye (2021) ) of the optimal solution set of the underlying LP. Secondly, we want the decision to be consistent across time such that the online decision converges to the fair offline solution in a stable way. To quantify the consistency, we define cumulative unfairness to be the cumulative deviation from the online solutions to the fair offline solution. 
A fair online algorithm should maintain a moderately small cumulative unfairness. It is common for LP problems to have multiple solutions (Mittelmann and Spellucci, 2005) . The analytic center can be viewed as the average of the optimal solutions, because it maximizes the distance to the boundary of the optimal solution set. Consequently, our algorithm ensures certain levels of "group fairness" such that the algorithm will not favor one optimal solution over the other, and the final allocation can be distributed more evenly across different type of agents. Also, the analytic center ensures "individual fairness across time" such that throughout the whole decision horizon, as long as the agents have similar reward and resource consumption patterns, we cannot treat them too differently based on other features such as the time they enter the system, demographics, race, and gender. Next, we analyze the relationship between fairness and regret. The common approach in fair sequential decision making literature is to formulate an optimization problem and put the fairness requirements as extra constraints. Assuming the optimization problem is to maximize the objective value (we refer to it as reward), putting extra constraints will almost always decrease the reward. As a result, it is easier to get a sub-linear upper bound for regret, because regret is defined as the difference between the online reward to the offline reward benchmark, which is lower due to the extra constraint. Instead of putting fairness constraints into the optimization problem, we consider approaches to design our sequential decision making algorithm such that it features a sub-linear upper bound for cumulative unfairness, while keeping the regret low. Our definition of regret is stronger than the definition in previous works because without the extra fairness constraint: we are comparing the online reward with the original offline reward. This "getting fairness guarantee without fairness constraints" approach provides a better insight on the trade-off between regret and efficiency in the online environment. Our solution is an adaptive online linear programming algorithm that uses an interior-point type of solver and dynamically detects unfair resource spending. This algorithm features a O(log(T )) upper bound for cumulative unfairness and a problem-dependent constant upper bound for regret. It implies that from the algorithm design perspective, it is possible to achieve extra fairness without affecting regret. Furthermore, our analysis alleviates the unique and non-degenerate condition that is commonly used in the revenue management literature (Jasin and Kumar, 2012; Jasin, 2015; Chen et al., 2021) . By leveraging properties of analytic center, we require a condition which we name it as "dual non-degeneracy" (non-degeneracy condition for the binding resource constraints. See Assumption 2 for more details), and allow the existence of multiple optimal solutions. To conclude, our contribution can be summarized as • We propose a LP-based online algorithm such that its online decision converges to the fair offline optimal solution in expectation, and its corresponding cumulative unfairness has an upper bound of order O(log(T )). • The fair algorithm achieves a bounded regret that bears no dependency on T under less restrictive non-degeneracy conditions than other works in the literature. The concept of fairness has been extensively studied in machine learning, operations research and economics. 
We will discuss the related literature from different research streams. Literature on fair machine learning Generally speaking, the definitions of fairness in the machine learning literature fall into two categories. • Group fairness (Kamishima et al., 2011; Hardt et al., 2016): also known as statistical parity or disparate impact. It requires that certain statistical measures (for example, the positive classification rate) computed on each demographic group are similar to those of the population as a whole. • Individual fairness (Dwork et al., 2012): also known as disparate treatment. It advocates fairness through the goal that similar people should be treated similarly. There is a prolific literature on promoting group and individual fairness for supervised learning problems, where the algorithm takes training data and applies the same predictive model to every new instance. Generally speaking, existing fairness approaches fall into three categories: • Pre-processing (Zemel et al., 2013): modify the training data. • In-processing (Zafar et al., 2017): modify the learning algorithm. • Post-processing (Hardt et al., 2016): modify the output of the algorithm. Our work is in line with the definition of individual fairness, and the way we design our algorithm to achieve fairness is an in-processing technique. It is also worth mentioning that when the reward is a linear function of the resource consumption vector (as in the example in the introduction), our algorithm also exhibits a group fairness property. Our work differs from the works mentioned above in that, firstly, we are interested in making fair decisions rather than producing fair classifications, and secondly, our decisions are made sequentially under resource constraints rather than by applying a fixed predictive model to each instance. A second related stream concerns fairness in sequential decision making. One line of work studies meritocratic fairness, which requires that less qualified individuals should not be favored over more qualified individuals. This fairness notion has been applied to classic and contextual bandits, yielding algorithms that are fair and achieve sub-linear regret. Jabbari et al. (2017) extend the notion of meritocratic fairness to reinforcement learning, whereby fairness requires the algorithm to prefer one action over another with low probability if the long-term (discounted) reward of choosing the latter action is higher. Other fairness notions have also been proposed. Wang et al. (2021) argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure, and propose an algorithm that enables each arm to receive an amount of exposure proportional to its merit; sub-linear fairness regret and reward regret are guaranteed in this setting. Similarly, Liu et al. (2017) study the stochastic bandit with a smoothness constraint: if two arms have similar quality distributions, the probabilities of selecting them should be similar. Patil et al. (2020) and Chen et al. (2020) study stochastic and contextual bandit problems with the fairness requirement that each arm must be pulled at a minimum rate, and obtain sub-linear regret under this condition. Also see Blum et al. (2018) and Gillen et al. for related settings. In the fair bandit setting, the most relevant work to ours is Gupta and Kamble (2019). The authors propose two fairness notions: fairness-across-time (FT) and fairness-in-hindsight (FH). FT imposes a Lipschitz metric that requires the treatment of individuals to be individually fair relative to the past as well as the future, while FH only requires a one-sided notion of individual fairness defined relative to past decisions.
They provide an algorithm with sub-linear regret under FH, and show that linear regret is inevitable under FT. In our work, we consider a reward maximization problem under resource constraints, and we achieve fairness by controlling the cumulative deviation of the online decisions from the fair decision; unlike Gupta and Kamble (2019), we do not impose the unfairness metric as a constraint. Another paper that shares a similar spirit is Heidari and Krause (2018). Literature on fair resource allocation In the resource allocation literature, proportional fairness and max-min fairness have been proposed as fairness measures, and the efficiency loss incurred by maximizing fairness has been characterized. Donahue and Kleinberg (2020) analyze the fairness notion that individuals from different groups should have (approximately) equal probabilities of receiving the resource, and characterize the trade-off between fairness and utilization in static resource allocation problems. The authors provide upper bounds on the gap between the maximum possible utilization of the resource and the maximum possible utilization subject to this fairness condition. In the online decision making setting, one model that is close to ours is Elzayn et al. (2019). The authors consider a centralized agent allocating a scarce resource among several groups. The agent has to maximize the objective value under the "equality of opportunity" constraint that the probability of an individual receiving the scarce resource is approximately independent of the individual's group. The authors propose a learning algorithm that converges to an optimal fair allocation without distributional knowledge of the population. In contrast, our definition of fairness has the extra requirement that the decisions must be consistent over time. Our model is also more general in that it allows multiple resource constraints. In addition to providing a sub-linear bound for cumulative unfairness, we prove the extra property of bounded regret. Similarly, Sinclair et al. (2020) consider a fairness objective defined as the maximum difference between the algorithm's allocation and the fair offline solution. The authors prove that approximately optimizing this objective gives an allocation rule that achieves approximate Pareto-efficiency and envy-freeness. Ma and Xu (2020) formulate the online resource allocation problem as an online bipartite matching model; they study the fairness objective of maximizing the minimum service rate across all groups, and analyze the competitive ratio of online algorithms. Manshadi et al. (2021) consider fair online rationing in which each arriving agent receives a fair share of resources proportional to its demand, and provide upper bounds on both the expected minimum fill rate and the minimum expected fill rate. In the pricing and revenue management literature, Li and Jain (2016) study the impact of consumers' fairness concerns on a firm's pricing strategy, profits, consumer surplus, and social welfare. Cohen et al. (2019) consider a single-period model with known linear demand, and study the impact of imposing various fairness criteria on the seller's profit, consumer surplus, and social welfare. With a motivation similar to ours, Cohen et al. (2021) study dynamic pricing with unknown demand under two fairness constraints. The price fairness constraint requires similar prices for different customer groups and ensures that the prices over time for each customer group are relatively stable; under this constraint the authors propose a UCB-type algorithm that yields near-optimal regret.
The fairness constraint for demand requires that, by setting prices accordingly, the resulting demand from different customer groups is relatively similar; under this constraint the authors design an algorithm that also achieves near-optimal performance. An important distinction that differentiates our work from the works mentioned previously is that, in the problem formulation, we do not impose an extra fairness constraint on the offline problem; therefore our online algorithm is compared against the unconstrained benchmark. Literature on LP-based revenue management problem We note that all the regret bounds here are problem-dependent bounds that mainly focus on the dependence on the horizon length T but inevitably involve certain parameters related to the underlying distribution/optimization problem.
Figure 1: The dual non-degeneracy condition is a weaker condition than the unique and non-degenerate condition widely assumed in the literature.
There has been a prolific literature on algorithm design and analysis for the online revenue management problem. We provide a short summary of our regret result and the related literature in Table 1. One stream of literature investigates the existence of an online algorithm that achieves bounded regret (not dependent on the number of decision variables or the time horizon). Specifically, under knowledge of the customer arrival distribution, a line of works (Jasin and Kumar, 2012; Wu et al., 2015; Bumpensanti and Wang, 2020) studies the canonical quantity-based network revenue management problem and designs algorithms that achieve bounded regret. Jasin and Kumar (2012) show that the LP-based re-solving algorithm achieves bounded regret, where the algorithm computes a control policy by periodically solving a linear program specified by the known arrival distribution; a critical condition is that the underlying LP should be unique and non-degenerate. A subsequent work (Bumpensanti and Wang, 2020) fully resolves the problem through an infrequent re-solving scheme which updates the control policy only at a few selected points. Another line of works (Vera et al., 2019; Vera and Banerjee, 2019; Banerjee and Freund, 2020) devises algorithms that achieve bounded regret for various problems including dynamic pricing, knapsack, and bin packing. The authors develop a novel and intuitive approach called "compensated coupling" to derive regret upper bounds for an online algorithm: the idea is to bound the cumulative expected loss induced by the decisions of the online algorithm against the hindsight optimal (which has all the future information). As with the aforementioned works on network revenue management, this line of work is also built upon knowledge of the underlying distribution. There are two types of benchmarks in the online revenue management literature: the fluid approximation and the expected hindsight optimal. The fluid approximation benchmark is obtained by taking the expectation of the reward and resource distribution and then solving the optimization problem. The expected hindsight optimal is the expectation of the maximization problem in which the realizations of the arrivals are observed and the optimal value is solved in hindsight. As a result, the fluid benchmark is stronger than the hindsight benchmark, and from Bumpensanti and Wang (2020) the difference can be of order Ω(√T) in a degenerate environment.
In our paper, we consider the fluid approximation LP (also known as Deterministic LP, which we refer to as DLP) as the benchmark for regret analysis. Under a similar setting, Chen et al. (2021) show that bounded regret against the fluid benchmark is achievable for the online revenue management problem if the underlying LP is unique and non-degenerate, and the number of customer/order types are finite. We improve this result by extending the condition to that the DLP is dual non-degenerate. When the DLP is degenerate, both Bumpensanti and Wang (2020) and Vera et al. (2019) show that the gap between the fluid benchmark and the hindsight benchmark is Ω( √ T ), thereby arguing that the fluid benchmark can be too strong to compete against. Our result provides a positive view of the fluid benchmark under a weaker non-degenerate condition. In other words, our work shows that the large gap between the benchmarks and the deterioration of the algorithm performance is fully caused by the degeneracy of the binding constraints in the DLP. When the binding constraints are non-degenerate, bounded regret is still achievable against the fluid benchmark even when the distribution is unknown. The rest of the paper is organized as follows. In section 2, we introduce the model, performance metrics and briefly discuss the fair algorithm. In section 3, we give a more detailed analysis on how we derived the algorithm with fair decisions and low regret. In section 4, we discuss the technical framework for our theoretical analysis. In section 5, we verify our findings in several simulated environments. Extra proofs are left in the appendix. 2 Model, Performance Metric, and Algorithm Consider a resource allocation problem over a horizon of T . The objective is to maximize the cumulative reward over the horizon subject to the budget constraints on m types of resources. At each time period, a customer order/bid/request arrives and it asks for certain amount of each resource. The decision maker needs to decide whether to accept or reject the customer order. Upon the acceptance of the order, we need to satisfy the resource request, and we will collect the reward associated with the order. From the resource viewpoint, we act as a market maker who allocates the resources among all the customer orders. The allocation decisions are made in an online manner and all the past decisions are irrevocable. In this paper, we study a stochastic model where the request-reward pair made by each customer is assumed to be i.i.d. and follows an unknown distribution. In addition, we assume the distribution has a finite support, i.e., there is a finite number of customer types and customers of each type place identical orders. More specifically, our paper analyzes a class of resource allocation problems which take the following LP as its underlying (offline) form: where r t = (r t1 , ..., r tk ) ∈ R k , A t = (a t1 , ..., a tk ) ∈ R m×k , and a ts = (a 1ts , ..., a mts ) ∈ R m , for are revealed, and we need to decide the value of x t instantly. Different from the offline setting, at time t, we do not have the information of the subsequent coefficients to be revealed, i.e., {(r t , A t )} T t =t+1 . The problem (1) in an online setting is often referred to as online linear programming (Agrawal et al., 2014; Kesselheim et al., 2014) . 
With different specification of the input of the LP, the problem encompasses a wide range of applications, including secretary problem (Ferguson et al., 1989) , knapsack problem (Kellerer et al., 2003) , network routing problem (Buchbinder and Naor, 2009) , matching and Adwords problem (Mehta et al., 2005) , combinatorial auction problem (Agrawal et al., 2009), etc. In the rest of the paper, for notational simplicity we focus on the one-dimensional case where k = 1. We note that the results and proof of the general case follow the same idea of the one-dimensional case. For k = 1, the online formulation of (1) reduces to a one-dimensional online LP problem, where a t = (a 1t , ..., a mt ) ∈ R m and the decision variables are x = (x 1 , ..., x T ) ∈ R T . There are m constraints and T decision variables. Throughout the paper, we use i to index constraints and t to index decision variables. In this section we introduce the performance measure. Before doing so, we note that it is necessary to state the assumption on the distribution that governs the arrival of (r t , a t )'s. Assumption 1 (Distribution). We assume (a) Stochastic: The column-coefficient pair (r t , a t )'s are i.i.d. sampled from a distribution P. The distribution P takes a finite support {(µ j , c j )} n j=1 where µ j ∈ R and c j ∈ R m . Specifically, P((r t , a t ) = (µ j , c j )) = p j for j = 1, ..., n. The probability vector p = (p 1 , ..., p n ) is unknown. (b) Positiveness and boundedness: 0 ≤ µ j ≤ 1, c j ≥ 0 and c j ∞ ≤ 1 for j = 1, ..., n. (c) Linear growth: The right-hand-side B = T b for some b = (b 1 , ..., b m ) > 0. Assumption 1 (a) imposes a stochastic assumption for the customer orders. In addition, it states that the support of the distribution is finite and known, but the parameters of the distribution are unknown. In other words, it means that there are n known customer types, and the customer type at each time t follows a multinomial distribution with unknown parameters. Assumption 1 (b) requires all the entries of (µ j , c j ) are between 0 and 1. We remark that all the results in this paper still hold (up to a constant) when this part is violated, and the positiveness and boundedness are introduced only for simplicity of notation. Lastly, the linear growth condition in Assumption 1 (c) is commonly assumed in online resource allocation problems (Asadpour et al., 2019; Li and Ye, 2019; Bumpensanti and Wang, 2020) . In our context, the condition is mild in that if B = o(T ), we can always adjust the time horizon with T T such that B = T b, and consequently the linear growth condition holds for T . Under Assumption 1 a), a commonly considered performance benchmark is the fluid approximation version of the "offline" LP (2), (3) Recall from Assumption 1 that µ j and c j represent the reward and requested resource consumption of the j-th customer type. The right-hand-side b = B/T represents the average resource capacity per time period, and p j is the probability of the j-th customer type. The decision variables y j 's prescribe a "probabilistic" decision rule for the orders, and y j can be interpreted as the proportion of accepted orders (or the probability of accepting orders) for the j-th customer type. The LP (3) can be viewed as a deterministic version obtained by taking expectation of the objective and the left-hand-side of LP (2). For such reason, we refer to (3) as the deterministic LP (DLP). 
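To make the DLP concrete, here is a minimal sketch that sets up and solves (3) with scipy; the instance (the values of p, µ, c, and b) is made up for illustration and is not taken from the paper.

```python
# A minimal sketch of the DLP (3): max sum_j p_j mu_j y_j
# s.t. sum_j p_j c_j y_j <= b and 0 <= y_j <= 1.
# The instance (n = 3 customer types, m = 2 resources) is illustrative.
import numpy as np
from scipy.optimize import linprog

p  = np.array([0.3, 0.5, 0.2])          # arrival probabilities (unknown to the online algorithm)
mu = np.array([0.8, 0.5, 0.9])          # per-type rewards mu_j
C  = np.array([[0.6, 0.2, 0.7],         # consumption vectors c_j stacked as columns
               [0.3, 0.4, 0.5]])
b  = np.array([0.35, 0.30])             # average capacity b = B / T

res = linprog(c=-(p * mu),              # linprog minimizes, so negate the objective
              A_ub=C * p,               # column j scaled by p_j
              b_ub=b,
              bounds=[(0.0, 1.0)] * 3,
              method="highs")
print("OPT_D =", -res.fun, "  an optimal y (typically a vertex):", res.x)
```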
We define y * to be the analytic center (see Luenberger and Ye (2021) ) of the optimal solution set of the DLP (3). Generally speaking, analytic center can be viewed as the average of optimal solutions and can be treated as the fair solution. We define y t = (y 1,t , · · · , y n,t ) to be the probability of accepting corresponding order types at time t. More specifically, if at time t the arriving order is of type j, then the algorithm will accept the arriving order with probability y j,t . For fairness reason, we want the online allocation policy to stay close to y * . Therefore, we define the cumulative unfairness of the online algorithm π as After defining the fairness metric, we begin to define the regret. For the online problem, at each time t we decide the value of x t . x t = 1 means that we accept the order and allocate a t amount of resources to this order accordingly; x t = 0 means that we reject the order. Like the offline problem, we need to conform to the constraints throughout the procedure, i.e., no shorting of the resources is allowed. In this paper, we consider regret as the performance measure, formally defined as where the quantity OPT D represents the optimal objective value of the DLP (3), and (x 1 , ..., x T ) represent the online solution. The superscript π denotes the online algorithm/policy according to which the online decisions are made. The expectation is taken with respect to (r t , a t )'s and the randomness introduced by the algorithm (for example, y t ). In this section we introduce the commonly used adaptive algorithms (also known as re-solving technique (Reiman and Wang, 2008; Jasin and Kumar, 2012; Jasin, 2015; Bumpensanti and Wang, 2020) in the revenue management literature), and introduce modifications to make it a fairer algorithm. We firstly introduce a few additional notations to characterize the constraint consumption process. Define B 1 = B and B t = (B 1t , ..., B mt ) as the remaining resource capacity at the beginning of time t, i.e. Accordingly, we define b t = B t /(T − t + 1) as the average resource capacity at time t. In addition, we use B T +1 to denote the remaining constraint at the end of horizon, and use b 1 = (b 1,1 , ..., b 1,m ) = B 1 /T = B/T = b to denote the initial average resource. Let N j (t) denote the counting process of the j-th customer type, i.e., the number of observations (µ j , c j ) up to time t (inclusively) for j = 1, ..., n, and denote [n] the set {1, ..., n}. Because no shorting is allowed, the remaining constraint vector B t must be element-wise non-negative for all t = 1, ..., T . Since the true probability distribution p = (p 1 , ..., p n ) is unknown, the counts N j (t)'s will be used by the online algorithm to construct empirical estimates for the corresponding probabilities. More specifically, to be the empirical estimation at time t. Now we present the LP-based adaptive algorithm. At each time t, the algorithm solves a sampled linear program to compute y t = (y 1,t , · · · , y n,t ), the probability of acceptance for each customer type (µ j , c j ). The sampled LP (5) takes a similar form as the DLP (3) but differs in two aspects: (i) the probabilities p j 's in (3) are replaced by their empirical estimates since the underlying distribution is assumed unknown; (ii) the right-hand-side b in (3) is replaced with its adaptive counterpart b t . It then uses the sampled LP's optimal solution y * t to determine the online solution x t at time t. 
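A schematic simulation in the spirit of this re-solving policy is sketched below; the instance, the solver wrapper, and the pseudo-count initialization are illustrative choices rather than the paper's implementation.

```python
# Schematic re-solving policy: keep empirical estimates p_hat and the remaining
# budget, re-solve the sampled LP (5) every period, and accept the arriving
# type j with probability y_{j,t}.  All numbers are illustrative.
import numpy as np
from scipy.optimize import linprog

def solve_sampled_lp(p_hat, mu, C, rhs):
    """An optimal solution of the sampled LP (5) with right-hand side rhs."""
    res = linprog(-(p_hat * mu), A_ub=C * p_hat, b_ub=rhs,
                  bounds=[(0.0, 1.0)] * len(p_hat), method="highs")
    return res.x

rng = np.random.default_rng(0)
p, mu = np.array([0.3, 0.5, 0.2]), np.array([0.8, 0.5, 0.9])
C = np.array([[0.6, 0.2, 0.7], [0.3, 0.4, 0.5]])
T = 2000
B = np.array([0.35, 0.30]) * T          # initial budget B = T b
counts = np.ones(len(p))                # pseudo-counts so p_hat stays positive
reward = 0.0
for t in range(T):                      # t is 0-indexed here
    j = rng.choice(len(p), p=p)         # type of the arriving order
    counts[j] += 1
    p_hat = counts / counts.sum()
    b_t = B / (T - t)                   # average remaining capacity B_t / (T - t + 1)
    y_t = solve_sampled_lp(p_hat, mu, C, b_t)
    if rng.random() < y_t[j] and np.all(B - C[:, j] >= 0):   # accept only if budget permits
        B -= C[:, j]
        reward += mu[j]
print("reward collected over the horizon:", reward)
```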
Given that there is enough inventory remaining, this probabilistic decision rule aims to follow the prescription of the optimal solution by accepting the j-th customer type with probability y*_{j,t}. We informally summarize this approach in Algorithm 1: at each time t, update p̂_t and b_t; solve y*_t from the sampled LP (5); allocate resources based on B_t and y*_t; observe new data and proceed as t ← t + 1. The performance of this type of algorithm has been analyzed under different conditions. When the distribution is known, we can replace the sample estimate p̂_t by p when solving (5). Jasin and Kumar (2012) derive bounded regret under a non-degeneracy assumption for the deterministic LP; Bumpensanti and Wang (2020) show that an infrequent re-solving scheme with carefully designed re-solving times also achieves bounded regret in the degenerate case. Without distributional knowledge, Jasin (2015) derives an O(log(T)) upper bound for the regret, and Chen et al. (2021) establish that bounded regret can also be achieved if the distribution is unknown but has finite support. However, this type of algorithm can be unfair if not implemented carefully. In the following, we analyze how to maintain a sub-linear upper bound of order O(log(T)) for the cumulative unfairness defined in (4), while keeping the regret bounded. We make two improvements to the algorithm introduced above. • When solving the sampled LP (5), we need a solver that outputs the analytic center of the sampled LP; for example, an interior-point type algorithm will output the analytic center. • For the sampled LP (5), at each time t we determine the constraint b_t separately for two categories of resources: for a resource i in the "binding" category, we use the adaptive process b_{i,t} = B_{i,t}/(T − t + 1); for a resource i in the "non-binding" category, we use the initial average resource b_i. We choose the analytic center as the solution because the analytic center is more tractable than other types of solutions; we can expect the analytic center of the sampled LP (5) to converge to y*, the analytic center of the DLP (3). As for dynamically adjusting the resource constraint for the "binding" and "non-binding" categories, the intuition is that the existence of a "non-binding" resource suggests that, for this specific resource, the average demand is less than the constraint. As time goes on, the average remaining amount of this resource will increase. Adjusting it back to the initial average resource prevents us from allocating resources inconsistently across time due to the increase of the average remaining resources. Algorithm 2 details our fair allocation scheme, and the next section contains a more detailed analysis of this fair algorithm. In this section (Section 3), we present the theoretical results on the cumulative unfairness and the regret of Algorithm 2. Moreover, we provide a more detailed analysis of the properties of Algorithm 2, especially the effects of the two modifications summarized in the previous section. We first introduce the additional assumption needed for our theoretical results. The assumption concerns non-degeneracy for the binding resources, which we refer to as dual non-degeneracy of the underlying LP. We define the set of binding resources to be B := {i ∈ [m] : (Cy)_i = b_i for every optimal solution y of the DLP (3)}; equivalently, since the analytic center lies in the relative interior of the optimal solution set, B = {i : Σ_j p_j c_{i,j} y*_j = b_i}. The standard form of the dual program for the DLP (3) is (8): min b^⊤λ + Σ_{j=1}^n γ_j subject to p_j c_j^⊤λ + γ_j ≥ p_j µ_j for all j, λ ≥ 0, γ ≥ 0, where λ ∈ R^m and γ ∈ R^n are the decision variables. The dual program can also be simplified as a program (9) with decision variable λ only: min_{λ ≥ 0} b^⊤λ + Σ_{j=1}^n p_j (µ_j − c_j^⊤λ)^+.
Algorithm 2 (Fair Adaptive Allocation Algorithm, informal). At each time t: (i) update the estimate p̂_t and the average remaining capacity b_t; (ii) solve the sampled linear program (6), with decision variables (y_1, ..., y_n) and right-hand side b_t, denote its optimal interior primal solution by y_t = (y_{1t}, ..., y_{nt}), and determine the binding set B_t from y_t; (iii) solve the linear program (7), with decision variables (y_1, ..., y_n) and the modified right-hand side b'_t, and denote its optimal interior primal solution by y*_t = (y*_{1t}, ..., y*_{nt}); (iv) observe (r_t, a_t), identify (r_t, a_t) = (µ_j, c_j) for some j, and set x_t = 1 with probability y*_{jt} and x_t = 0 with probability 1 − y*_{jt} when the remaining constraint permits, and otherwise set x_t = 0; (v) update the counts for p̂_{t+1}. Next, we introduce our second assumption. Assumption 2 (Dual Non-degeneracy). Any optimal solution λ of (9) must satisfy λ_i > 0 for i ∈ B. Assumption 2 is weaker than the unique and non-degenerate assumption widely used in the literature (Jasin and Kumar, 2012; Jasin, 2015; Chen et al., 2021). In practice, many common LP problems are degenerate and have multiple optimal solutions (Mittelmann and Spellucci, 2005). Assumption 2 only requires the non-degeneracy condition for the binding resource constraints and allows degeneracy in the non-binding constraints; moreover, it allows both the primal and the dual program to have multiple optimal solutions. Now, we state the main results. Theorem 1. Under Assumptions 1 and 2, the regret of Algorithm 2 is bounded above by a problem-dependent constant that does not depend on the horizon length T. This theorem implies that with an interior-point type of solver and the modification of the adaptive process b_t, we can still achieve bounded regret under a weaker assumption on the DLP. In terms of the regret analysis, it differs from that in Chen et al. (2021) in two ways. First, without the unique and non-degenerate assumption, the stability analysis of the DLP is different. Second, we have to show that the binding/non-binding detection mechanism introduced in Algorithm 2 does not affect the regret. We also note that the regret upper bound has an extra dependence on n hidden in problem-dependent constants characterizing the continuity/Lipschitz property of the dual program. For more details see the proof in A1. Next, we present the bound for the cumulative unfairness E[Σ_{t=1}^T ||y_t − y*||_2^2]. Proposition 1. Under Assumptions 1 and 2, the cumulative unfairness of Algorithm 2 grows at most logarithmically in T; the hidden constant carries a polynomial (n^9) dependence on the number of customer types, discussed below. The proposition states that the expected cumulative distance of our online decisions to the offline fair decision has a logarithmic upper bound. The choice of y*_t and y* as analytic centers plays an essential role in obtaining this result. To establish the bound, we leverage the fact that the analytic center changes continuously with respect to the amount of perturbation of the DLP (3). If our average inventory process b_t is "stable" with respect to b, and p̂_t is close to p, we can expect the analytic center of the sampled LP (5) to be close to y*. Therefore, in order to bound the cumulative unfairness, it suffices to characterize the distance of b_t to b and of p̂_t to p. Lastly, we remark that the n^9 dependence in this bound is due to the concentration inequality we used, and we believe there is room for improvement by adopting different techniques or parameter choices. The proof of Proposition 1 is deferred to Section A2. For the rest of the section, we give a more comprehensive discussion of the two modifications we made when deriving Algorithm 2.
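Before that discussion, the following sketch illustrates computationally what the analytic center of the optimal solution set means. It is one possible way to compute it (assuming cvxpy is available), not the solver used in the paper: first find the optimal value, then identify which inequality constraints can be slack somewhere on the optimal face, and finally maximize the log-barrier over those slacks while forcing the always-tight constraints to hold with equality. The small instance is constructed so that the optimal face is a segment.

```python
# Sketch: analytic center of the optimal face of a small LP.
# Instance: max y1 + y2 + 0.5*y3  s.t.  y1 + y2 + y3 <= 1, 0 <= y <= 1.
# Corner optima are [1, 0, 0] and [0, 1, 0]; the analytic center of the
# optimal face is [0.5, 0.5, 0].
import cvxpy as cp
import numpy as np

A = np.array([[ 1.,  1.,  1.],     # y1 + y2 + y3 <= 1
              [-1.,  0.,  0.],     # -y_j <= 0, i.e. y_j >= 0
              [ 0., -1.,  0.],
              [ 0.,  0., -1.],
              [ 1.,  0.,  0.],     #  y_j <= 1
              [ 0.,  1.,  0.],
              [ 0.,  0.,  1.]])
u = np.array([1., 0., 0., 0., 1., 1., 1.])
c = np.array([1., 1., 0.5])
tol = 1e-7

y = cp.Variable(3)
opt = cp.Problem(cp.Maximize(c @ y), [A @ y <= u]).solve()

# A constraint is "always tight" on the optimal face if its slack cannot be
# made positive while staying (essentially) optimal.
can_be_loose = []
for i in range(len(u)):
    gap = cp.Problem(cp.Maximize(u[i] - A[i] @ y),
                     [A @ y <= u, c @ y >= opt - tol]).solve()
    can_be_loose.append(gap > 10 * tol)
loose_idx = [i for i, ok in enumerate(can_be_loose) if ok]
tight_idx = [i for i, ok in enumerate(can_be_loose) if not ok]

# Analytic center: maximize the log-barrier over the slacks that can be
# positive, forcing the always-tight constraints to hold with equality.
barrier = cp.sum(cp.log(u[loose_idx] - A[loose_idx] @ y))
constraints = [A[i] @ y == u[i] for i in tight_idx] + [c @ y >= opt - tol]
cp.Problem(cp.Maximize(barrier), constraints).solve()
print("analytic center of the optimal face:", np.round(y.value, 3))   # ~ [0.5, 0.5, 0]
```

In Algorithm 2 this point is not computed by an explicit barrier problem; an interior-point LP solver run to optimality returns (approximately) the same point directly, which is why the paper simply calls it the interior primal solution.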
Since Assumption 2 does not exclude the situation where the DLP (3) has multiple and degenerate solutions, one naturally asks: what would be the fair solution for the DLP? When solving LPs, a basic optimal solution can be viewed as a corner (simplex) solution of the optimal solution set, and it can often favor one group over the other. Consider, for illustration, a three-order-type instance of the DLP whose optimal corner (simplex) solutions are y_1 = [2/3, 2/3, 0] and y_2 = [0, 0, 1/2]. Neither of these solutions is fair, because y_1 allocates all resources to orders 1 and 2, and y_2 allocates all resources to order 3. The analytic center y* = [0.37, 0.37, 0.22] would be a fairer solution because it is not only optimal but also distributes resources across the different order types more evenly. The analytic center is uniquely defined and maximizes the distance to the boundary of the optimal solution set; intuitively, it can be understood as the average of all optimal solutions. Since an interior-point type solver outputs the analytic center as its solution, we call y* the interior primal solution. To compare the difference, we test Algorithm 2 with a simplex solver and with an interior-point type solver, and plot the acceptance probability of the three order types across time.
Figure 2: On a simulated environment, the plots show the acceptance probability of each order type across time. In the first row, the probability is obtained by solving the LP with a simplex type of algorithm; in the second row, it is obtained by solving the LP with an interior-point type of algorithm.
From Figure 2 we can see that the interior-point type of algorithm generates a more evenly distributed allocation rule than the simplex solver. Moreover, the online decisions generated by the interior-point solver converge to the analytic center of the DLP. One could also define a "fair" solution by taking a weighted average of all basic optimal solutions. We argue that the analytic center is better than such weighted averages because it is continuous (indeed Lipschitz) with respect to perturbations of the DLP. It has been shown that if the perturbation of the DLP is small, then the analytic center of the perturbed LP is close to the analytic center of the original LP (Holder et al., 2001). Therefore, ensuring that the sampled LP stays close to the DLP implies that the analytic center of the sampled LP stays close to that of the DLP, and this property makes the analytic center more suitable as the fair solution. To conclude, choosing the interior primal solution at each time t provides extra consistency and convergence guarantees. As will be shown later, the Lipschitz continuity of the analytic center is essential for bounding the cumulative unfairness. Notice that this continuity result is exclusive to the analytic center of the primal solutions; the analytic center of the dual solutions may be discontinuous with respect to the same type of perturbation (Holder et al., 2001). Another source of unfairness comes from the degeneracy of non-binding resources. For a non-binding resource i ∈ N := [m]\B, the inequality Σ_{j=1}^n p_j c_{i,j} y*_j < b_i implies that in expectation the DLP does not use all of resource i, and as the algorithm proceeds the average leftover will accumulate. Therefore, if we run Algorithm 1, where at each time we solve the sampled LP (5) as a proxy for the DLP (3), then as time goes on the average remaining resource b_{i,t} should be larger than the initial average resource b_i.
As a result, as t increases, with high probability we will solve a sampled LP with a larger budget for the non-binding resources. If the constraint on resource i is degenerate, the agents arriving late in the system will be allocated under a larger resource budget, which makes the allocation rule different from that applied to the agents entering the system early. To eliminate such unfairness across time, we need to modify the algorithm so that at every time we solve a linear program whose inventory constraint process is more stable. For example, consider the linear program (10) with two order types A and B and two resources 1 and 2 (corresponding to b_1 and b_2). From the formulation of the LP, a type A order generates more revenue than a type B order. In terms of resource consumption, type A consumes more of resource 1 than of resource 2, and type B consumes more of resource 2 than of resource 1. From these observations, resource 1 is more preferred than resource 2; indeed, the interior primal solution y* confirms this finding. Under the setup b = [1, 2], we have y* = [0.826, 0.226], and the corresponding resource consumption shows that the first resource constraint is binding while the second is not. Notice that this LP satisfies Assumption 2, but it has degeneracy on the second resource. As indicated in Figure 3, if we fix b_1 to be a constant and increase b_2, we find that y*_B increases as b_2 goes up. It is important to notice that the total revenue remains the same as b_2 increases; the problem is that the optimal solution set changes, thereby changing the analytic center.
Figure 3: Even though the second resource in (10) is non-binding, the degeneracy still creates inconsistent allocation results as b_2 increases. Notice that the type B order still gets allocated a tiny amount of resource 1, but receives more of resource 2 as b_2 increases.
In this section we elaborate on the intuition of choosing the "right" inventory process for the fair algorithm. We also compare the performance of Algorithm 1 to our modified algorithm in Figure 4, based on example (10). As mentioned earlier, with high probability the average remaining non-binding resource will increase as the algorithm proceeds. This gives an inconsistent allocation rule across time for the orders that consume more non-binding resources than binding resources. Our solution is to modify the inventory process on the right-hand side of (5). An intuitive way is to replace the non-binding components of the inventory process with the initial inventory process; in this way, we always have a consistent allocation rule for the non-binding resources. However, the obstacle is that we do not know the categorization of binding and non-binding resources, and we have to learn it from the distribution as the algorithm proceeds.
Figure 4: Both algorithms have bounded regret; however, there is a dramatic difference in the cumulative unfairness, due to the existence of a degenerate non-binding resource.
More specifically, at every time t, when we obtain y*_t, the interior primal solution of the sampled LP (5), we check which resources are binding and which are non-binding, and denote the corresponding sets by B_t and N_t = [m]\B_t. Next, we treat all the resources in B_t as binding and all the resources in N_t as non-binding, and define b'_t by b'_{i,t} = b_{i,t} for i ∈ B_t and b'_{i,t} = b_i for i ∈ N_t, where b_t is the regular average inventory process and b is the initial average inventory process. We then re-solve the LP using b'_t instead of b_t as the right-hand side of (5).
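The following schematic shows this extra step; the helper name `solve_sampled_lp_center`, the binding test, and the tolerance are illustrative assumptions rather than the paper's exact implementation.

```python
# Schematic binding / non-binding detection and the modified right-hand side b'_t.
# `solve_sampled_lp_center` stands for an analytic-center LP solve (for instance
# the cvxpy routine sketched earlier); names and the tolerance are illustrative.
import numpy as np

def fair_rhs(y_t, p_hat, C, b_t, b0, tol=1e-6):
    """Return (b'_t, binding mask): keep the adaptive capacity b_t on rows estimated
    as binding, and reset rows estimated as non-binding to the initial average b0."""
    usage = (C * p_hat) @ y_t              # expected per-period consumption under y_t
    binding = usage >= b_t - tol           # estimated binding set B_t
    return np.where(binding, b_t, b0), binding

# Inside the re-solving loop (schematic):
#   y_t       = solve_sampled_lp_center(p_hat, mu, C, b_t)       # first solve, with b_t
#   b_fair, _ = fair_rhs(y_t, p_hat, C, b_t, b0)
#   y_star_t  = solve_sampled_lp_center(p_hat, mu, C, b_fair)    # second solve, drives x_t
```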
Under the sampled LP with different constraint b t , we calculate the interior primal solution y * t , and use it to make the probabilistic allocation at time t. We remark that although we use b t for the decision y * t , we still have to maintain the regular inventory process b t . The reasons are twofolds. First, we have to keep track of the real inventory process to determine the end time for the algorithm. Second, the process b t is essential for proving that the modified algorithm still have bounded regret. Our approach is justified by Corollary 2. We know that once we gain enough sample from the distribution, the sample estimatesp t will concentrate well to p. With good concentration, Corollary 2 ensures that we can correctly learn the binding/non-binding structure with high probability. We compare Algorithm 2 to Algorithm 1 in the previous example ( Figure 4 , we find that both regrets are bounded, and the regret of Algorithm 2 is higher than Algorithm 1. This is expected since there are chances that Algorithm 2 mistakenly takes the binding resource as non-binding and that will terminate the algorithm faster. There is a drastic difference between the cumulative unfairness of Algorithm 1 and Algorithm 2. We find that the cumulative unfairness of Algorithm 1 is linear while its counterpart in Algorithm 2 is logarithmic. In this section we discuss the the general approach for proving the fairness result and the bounded regret. Because the proof of fairness result (Proposition 1) partially relies on the proof of bounded regret (Theorem 1), we discuss the proof of Theorem 1 first. We briefly discuss the main idea of the proof. Since there are two random variablesp t and b t for our algorithm, we want to find a "stable region" for those two processes such that as long as the processes are within the stable region, our regret will remain bounded. Under the Assumption 1 and 2, we determine the stable region in Lemma 2. Next, based on the stable region, we analyze the exit time ofb t out of the region and the concentration behavior ofp t to p. More specifically, Corollary 1 (based on Proposition 3) characterizes the regret as a summation of two components. In Lemma 3, we prove that the first component is bounded based on the notion of stable region and analytic center; The second component is analyzed in Proposition 4. It involves characterizing the exit time of b t out of the stable region. Notice that b t is not a martingale and we approximate it by a martingale-like process and analyze the distance. For the result on cumulative unfairness, we use continuity argument (see Lemma 5) of the analytic center. It turns out that the analytic center of the primal solution is Lipchitz continuous with respect to random perturbation of the sampled LP. Therefore, we can bound the cumulative unfairness by bounding the cumulative deviation of the random process b t andp t . To facilitate the analysis, we introduce the standard form of the DLP (3). With a slight abuse of notation (omitting the probability vector p), we use µ to denote the vector (p 1 µ 1 , ..., p n µ n ) and C to denote the matrix (p 1 c 1 , ..., p n c n ). The standard form of the DLP (3) can be written compactly as where s ∈ R m and z ∈ R n are slack variables. Notice that if we denote I m and I n the corresponding identity matrix of dimension m and n, and defineC ∈ R (m+n)×(m+2n) ,ȳ,μ ∈ R (m+2n) ,b ∈ R (m+n) as then (11) could be reformulated as maxμ ȳ This standard form of LP will be useful when we analyze the cumulative unfairness. 
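For concreteness, the stacked system implied by these dimensions, with slack variables s for the resource constraints and z for the upper bounds y ≤ 1, can be written as follows; this is a reconstruction consistent with the definitions above, and the paper's equation (12) may arrange the blocks differently.

```latex
\max_{\bar y \ge 0}\ \bar\mu^\top \bar y
\quad \text{s.t.} \quad \bar C \bar y = \bar b,
\qquad
\bar C = \begin{pmatrix} C & I_m & 0 \\ I_n & 0 & I_n \end{pmatrix},\quad
\bar y = \begin{pmatrix} y \\ s \\ z \end{pmatrix},\quad
\bar\mu = \begin{pmatrix} \mu \\ 0 \\ 0 \end{pmatrix},\quad
\bar b = \begin{pmatrix} b \\ \mathbf{1}_n \end{pmatrix},
```

so that the two block rows read Cy + s = b and y + z = 1.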
Then, we note that the dual of (12) is the same as the dual (8) of the original DLP. From its alternative form (9), we know that the dual program will always have feasible region and the dual optimal solution exist, which implies the existence of optimal primal solution. Next, we focus on setting up the notations for analyzing the stable region of b t . Recall that y * is the analytic center of the solution in (3). Similarly, we denote y * (b ) as the analytic center of the solution in (3) with right-hand-side replaced by b , where B stands for binding and N stands for non-binding. The intuition of the definition is that we define the resource i to be binding if and only if for any optimal solution y, we always have the i-th inventory constraint being zero, i.e. b i − (Cy) i = 0. Because the analytic center of the optimal phase can be interpreted as the "average" of optimal solutions, it suffices to define the notions in the way above. Moreover, we denote the basic optimal solution set for (9) to be I * such that I * := {λ|λ is the basic optimal solution for (9)} I * (b ) := {λ|λ is the basic optimal solution for (9) with input b }, and most importantly, We have to define the notion of "strict binding" in this way because for the analytic center of the dual (8), λ * i > 0 does not necessarily mean λ i > 0 for any optimal λ, and this notion is quite essential in defining the condition for a stable region. We then state an lemma as an implication of Assumption 2. To see the relationship between the dual formulation and the regret, we define N a i (T ) to be the number of accepted order i during time horizon T . Moreover, we denote the optimal solution set for (9) to be D * such that D * = {λ|λ is the optimal solution for (9)}. Next, we state a proposition for the regret decomposition formula. This proposition is not directly applicable to our conditions, but the main idea is relevant. Proposition 2 (Chen et al. (2021)). Under Assumption 1, for any λ ∈ R m such that λ ∈ D * , we have Here, τ is the stopping time that some resource is depleted, and B τ denotes the remaining resource vector when the algorithm terminates. Thus the first part on the right-hand-side of (13) represents a penalization for wasting resources. In particular, the first term suggests that only wasting the resources such that λ i > 0 will be penalized. As to the second and third terms on the right-hand-side of (13), they capture the costs induced by order acceptance. More specifically, the last two terms in (13) helps to identify the orders into three different categories. For any λ ∈ D * , we can have the following categories of order • All-accepted orders: µ j − c j λ > 0. For these orders, the optimal decision should be to accept all of them. On expectation, we will observe T p j such orders throughout the horizon and aim to have the number of acceptance N a j (T ) to be close to that. • All-rejected orders: µ j − c j λ < 0. On the opposite of the previous case, the optimal decision should be to reject all of these orders. Each acceptance of such order will induce a cost of c j λ − µ j : resources of value c j λ are spent, but only reward of value µ j is received. • Partially-accepted orders: µ j − c j λ = 0. The condition may lead to a partial acceptance of the orders, i.e., 0 ≤ y j ≤ 1. In terms of regret analysis, there is no need to worry about these orders because they do not contribute to the right-hand-side of (13). 
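As a small illustration of these three categories, the snippet below classifies order types against a dual price vector; the price vector and the instance are assumed for illustration and are not computed from the paper's data.

```python
# Classify order types against a dual price vector lambda, mirroring the three
# categories used in the regret decomposition.  All numbers are illustrative.
import numpy as np

mu  = np.array([0.8, 0.5, 0.9])
C   = np.array([[0.6, 0.2, 0.7],
                [0.3, 0.4, 0.5]])
lam = np.array([1.2, 0.5])                 # an assumed dual price vector

margin = mu - C.T @ lam                    # mu_j - c_j' lambda
for j, m in enumerate(margin):
    if m > 1e-9:
        rule = "all-accepted (y_j = 1)"
    elif m < -1e-9:
        rule = "all-rejected (y_j = 0)"
    else:
        rule = "partially-accepted (0 <= y_j <= 1)"
    print(f"type {j}: mu_j - c_j'lambda = {m:+.3f} -> {rule}")
```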
Notice that the optimal primal solution y is responsible for making the decisions, while the optimal dual solution λ is used for prescribing the accept/reject rule to bound the regret. If the solution for (12) is unique and non-degenerate, from complementary slackness condition we know that we must have y * j = 1 (accept with probability one) corresponding to µ j > c j λ * and y * j = 0 (reject with probability one) corresponding to µ j < c j λ * . This relationship is consistent with our characterization of the categories above, and the last two terms of (13) can be bounded. However, once the solution is non-unique and degenerate, if we choose arbitrary optimal primal solution y and optimal dual solution λ, and we will not always get the property that y j = 1 corresponding to µ j > c j λ. For this reason, we present another regret decomposition formula that has the freedom of choosing λ adaptively. We define λ max ∈ R m such that define H t to be the information generated by {(r s , a s )} t s=1 up to t, and state the following regret decomposition formula. Proposition 3. Under Assumption 1, for {λ t } T t=1 such that λ t is H t−1 -measurable and λ t ∈ D * , we have We defer the proof of the proposition to A4. The difference of the above proposition to Proposition 2 is that, here we are using λ t that is adaptive and optimal, instead of fixing an accept/reject rule in Proposition 2 and accumulate the mistakes for each order. By choosing λ t adaptively we can have more flexibility when we need to bound the regret. Our adaptive procedure can be summarized as follows. At time t, we choose an optimal dual solution λ t that gives us an accept/reject rule, and then choose optimal primal solution y t (which generates x t ) that solves the perturbed LP (5) and is consistent with the accept/reject rule. Proposition 3 implies that an appropriate optimal dual price λ t has to satisfy the following properties: • λ t ∈ D * : This is because by setting λ t ∈ D * , the duality gap between the primal and dual of the DLP is zero. Any other dual solution will result in a linear regret hence useless for our analysis. • Consistent with y t : Once λ t is determined, an algorithm features low regret should have y j,t = 1 for the order such that µ j > c j λ t , and y j,t = 0 for the order such that µ j < c j λ t . Next, we introduce the following corollary that separates the upper bound of the regret into two components. For now take τ S to be a stopping time such that τ S ≤ τ < T a.s., and we have Corollary 1. Under Assumption 1, for {λ t } T t=1 such that λ t is H t−1 -measurable and λ t ∈ D * , we have the following inequality holds From Corollary 1, we can see that the first component (14) which is to accept/reject the right order following the accept/reject rule prescribed by λ t . In the following section, we are going to find a "stable region" that helps to define τ S exactly and for x t generated by Algorithm 2. In this section we characterize the stable region K. This region can be viewed as a neighbourhood of p and b, and characterizes ranges for the processesp t and b t such that as long asp t and b t belongs to the region, the sampled LP (5) is "close" to the DLP (3). With this condition we are able to provide a proof of the bounded regret based on the decomposition from Corollary 1. Lemma 2. 
There exists problem dependent constant L > 0 depending on the DLP ( (3) or (12)) such that the region K, as a domain of (b,p) defined bŷ has the property that for every (b t ,p t ) ∈ K, the dual optimal solution set of the sampled LP (5) is a subset of D * . We leave the details of the proof in A5. Lemma 2 characterizes the notion of stable region for b t and p t . The result that the optimal dual solution set of (5) is a subset of D * is equivalent to the statement that the current accept/reject decision prescribed by λ t based on the sampled LP is a correct choice corresponding to the DLP. Notice that the stability analysis is different from that of Chen et al. (2021) . We do not have the unique and non-degenerate condition so we cannot use the similar matrix inequality to get the result. We circumvent this difficulty by analyzing the structure of the dual program, and the fact that the distribution has finite support is essential in the proof. For convenience, we set the constant L to be strictly smaller than ∆ in Lemma 1. Another implication of the stability result is the following corollary. Corollary 2. For the sampled LP (5), given (b t ,p t ) ∈ K, we have that Moreover, L can also meet the condition that if we are solving the sampled LP in (7) with (b t ,p t ) in the "K-projected" region defined asb Corollary 2 states that under the stable region, the sampled LP's solution features the same binding/nonbiding structure for the resource constraints as in the DLP (3). Moreover, the second statement concerns E[a t x t ] = n j=1 p j c j y * t , where x t is based on y * t , the solution of the sampled LP. It ensures that the expected resource consumption for the non-binding resource i ∈ N is lower than b i − L, given that we set the non-binding resources b N ,t to be the initial average resource b N . This condition will be useful in Section 4.4 where the dynamics of the average process b t is studied. The proof can be found in A6. With the results on the stable region, we can work on bounding the regret in the next section. In this section, we first discuss bounding the second component (15) in Corollary 1, and then derive the bound for the first component (14). From Lemma 1 we know that there exists region depending on ∆ such that the bindingness structure of (3) remains the same when . Combined with Lemma 2, without loss of generality we assume that L < ∆ and take We also precisely define the stopping time τ S mentioned in Corollary 1 as the exit time of b t out of Ω Next, we present the result on the bound of (15). Lemma 3. Under Assumption 1 and 2, Algorithm 2 gives the following bound Notice that in Algorithm 2, the choice of interior primal (y * t ) and dual (λ * t ) solution of the sampled LP (5) plays an essential role for the result. With Lemma 2, we can show that the event {(b t ,p t ) ∈ K} will happen with high probability. Under the stable region K, our interior dual solution λ * t will characterize the accept/reject type of order according to the sampled LP (5) at time t, and the interior solution y * t will make sure that we accept the order j such that µ j − c j λ * t > 0 and reject the order j such that µ j − c j λ * t < 0. Moreover, Lemma 2 makes sure that when (b t ,p t ) ∈ K, the characterization of the accept/reject type based on the sampled LP (5) is also a "correct" characterization based on the DLP. Therefore, with high probability, we will make the right accept/reject decision so that (15) will be bounded. 
We defer the details of the proof of the lemma to Appendix A8. With Lemma 3 and Corollary 1, to prove the bounded regret it remains to show that E[T − τ_S] = O(1), which involves analyzing the exit time of b_t out of Ω. The main approach is similar to Appendix C of Chen et al. (2021); the difference is that in this paper we use a fairer inventory constraint for the allocation policy (see (7)). Generally speaking, we define another process b̃_t which behaves similarly to a martingale and is close to the process b_t with high probability. To ensure the closeness, we characterize high-probability events under which the process b_t is well behaved. Then, the exit time of b_t is related to the exit time of b̃_t and some high-probability events that we define below. Now we formally define the event E_t. Let the tolerance sequence ε_t be given as below, where κ is a positive constant that will be specified in Proposition 4. Next, denote the event E_t, where the subscript B denotes the corresponding (binding) dimensions of the vectors. Now, we provide some intuition for the definition of E_t. Ideally, for the binding resources, we hope that the expected resource consumption at each time t stays close to b_t, because this prevents the resources from an early depletion or a left-over at the end of the horizon. As for the condition ||p̂_t − p||_∞ ≤ L, combined with the condition b_t ∈ Ω introduced later, it ensures that we stay in the stable region, so that the binding/nonbinding structure of the sampled LP stays the same as in the DLP (as ensured by Corollary 2). This helps to deal with the effect of steps 8 and 9 in Algorithm 2 on the stability of the process b_t. Here we only define a tolerance level for the binding dimensions, because for the non-binding dimensions the average inventory process will be increasing with high probability, and we can tolerate a larger deviation of the resource consumption as long as it does not sabotage the non-bindingness. In fact, the condition ||p̂_t − p||_∞ ≤ L ensures this requirement. The event E_t characterizes the algorithm behavior in this respect, and it defines a "good" event: at time t, if b_t ∈ Ω, the binding structure of the resources is the same as in the DLP (3), and E[a_{B,t}x_t(b′)|H_{t−1}], the expected binding resource consumption at time t, stays close to b_{B,t} within a range of ε_{t−1}. To justify our choice of ε_t: in the first κT time periods, since we still have a lot of resources, we can tolerate a relatively large deviation, and the event E_t happens almost surely during these periods. For t > κT, we define ε_t = 1/t^{1/4}. Intuitively, at time t the estimation error for the binding resources is on the order of 1/t^{1/2}, so the tolerance ε_t is higher than the typical estimation error. We remark two points on the choice of ε_t: (i) it guarantees that the event E_t happens with high probability; (ii) its value is not too large, so that the adaptive (re-solving) mechanism can still ensure the stability of the process b_t. For completeness, we define E_0 as the whole space. Now we define our approximation process b̃_t. According to the event E_t, we can define the stopping time τ̃ and the approximation process b̃_t, which satisfies b̃_t = b̃_τ̃ for t ≥ τ̃. These two definitions lead to the following decomposition for the exit time. The objective of our analysis is to bound the probability of b_t ∉ Ω. The inequality (17) separates this probability into two components. The first component involves the process b̃_t, which is a very "regular" process in that, if t < τ̃, the fluctuation of b̃_t is subject to the event E_t, and if t ≥ τ̃, the process freezes.
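The tolerance schedule ε_t just described can be written down directly; the following sketch uses placeholder values (ε_t equal to 1 during the first κT periods and κ = 0.1), since the paper fixes κ only in Proposition 4, and checks the deviation condition of the good event E_t on the binding coordinates.

```python
import numpy as np

def tolerance(t, T, kappa=0.1):
    # eps_t: generous during the first kappa*T periods (so E_t holds almost surely),
    # and t**(-1/4) afterwards.  The early value 1.0 and kappa = 0.1 are placeholders.
    return 1.0 if t <= kappa * T else t ** (-0.25)

def in_good_event(expected_binding_consumption, b_binding, t, T, kappa=0.1):
    # Deviation condition of E_t on the binding coordinates:
    # || E[a_{B,t} x_t | H_{t-1}] - b_{B,t} ||_inf <= eps_{t-1}.
    dev = np.max(np.abs(np.asarray(expected_binding_consumption) - np.asarray(b_binding)))
    return dev <= tolerance(t - 1, T, kappa)
```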
The event E_t further regulates the fluctuation of the process b̃_t and makes the process behave roughly like a martingale. This creates much convenience in analyzing the first component of (17). Next, the second component concerns the probabilities of the events Ē_t, which can be analyzed individually for each t. Overall, the inequality (17) disentangles the stability of the process b_t from the estimation error. Its first component concerns the stability of the process given a good estimate, while its second component concerns the probability of obtaining a good estimate of the model parameters. By combining those two estimates, we have the following proposition. Note that b := min_{1≤i≤m} b_i is the smallest value of the initial average constraint level b. Proposition 4. If we execute Algorithm 2 under Assumptions 1 and 2, we have the bound below. We leave the proof in A7. From Proposition 4 we know that E[T − τ_S] = O(1), hence (14) has an O(1) bound and Theorem 1 is true. For more details of the proof of Theorem 1, see A1.
In this section we begin to bound the cumulative unfairness. With the general DLP form (12), we can transform the sampled LP (5) into another LP that has a form similar to (12) and whose perturbation is on the right-hand-side constraint (b̄) only. In contrast, for the sampled LP (5) the perturbations are both in the left-hand side (p̂_t) and the right-hand side (b_t). The reason we formulate the sampled LP in this alternative form is that for the LP (12), ȳ*, the analytic center of the optimal solution set, is continuous in the right-hand-side constraint b̄. Therefore, we can use the cumulative perturbation on b̄ to give an upper bound for the cumulative perturbation on ȳ* over time. The next lemma states the upper bound for the cumulative distance of b_t (the resource constraint we put on the sampled LP (7)) to b (the initial average resource). Lemma 4. If we execute Algorithm 2 under Assumptions 1 and 2, we have E[Σ_{t=1}^T ||b_t − b||_2^2] = O(mn^9 log(T)). The proof can be found in A9. Such a result cannot be attained if we replace the modified constraint by the raw average remaining resource process when non-binding resources exist. However, for the case where every resource is binding, we are able to get the same result for the raw process as well. Also, the n^9 dependence comes from our choice of characterizing the good event (see (22)) as a subset of E_t and from the concentration inequality we use. We note that the n^9 dependence is not optimal and can be reduced by adopting a different characterization of the event or different concentration inequalities. In order to leverage the continuity property, we introduce the LP in standard form. Denote the perturbed reward vector μ̂_t := [p̂_{1,t}μ_1, · · · , p̂_{n,t}μ_n], the perturbed resource matrix Ĉ_t := [p̂_{1,t}c_1, · · · , p̂_{n,t}c_n], and the corresponding general matrix form. From Algorithm 2, we can rewrite the sampled LP (7) in this form. If we denote y′_j = y_j p̂_{j,t}/p_j, the above LP is equivalent to one in the variables y′. By denoting b̄_t = (b_t, ξ_t) and ȳ the decision variable corresponding to y′ in standard form, the final standard form is max μ̄⊤ȳ s.t. C̄ȳ = b̄_t, ȳ ≥ 0. To formalize the continuity argument, we consider the function b̄(θ, Δb) as the right-hand-side constraint of the LP max μ̄⊤ȳ s.t. C̄ȳ = b̄(θ, Δb), ȳ ≥ 0, where b̄(θ, Δb) = b̄ + θΔb is a function of two variables: a scalar θ ∈ R and an arbitrary directional vector Δb ∈ R^{m+n}. We denote the optimal solution of (20) by ȳ*(θ, Δb), which is again a function of θ and Δb. From Holder et al. (2001) we know that ȳ*(θ, Δb) is continuous and differentiable with respect to θ for every fixed Δb.
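Because the fair offline solution is an analytic center, it may help to see the object computed once. The sketch below finds the analytic center of a small hypothetical polytope (the unit square cut by y_1 + y_2 ≤ 1.5) by maximizing the sum of log-slacks; the paper applies the same notion to the optimal face of the standard-form DLP (12), which is a more delicate computation than this generic illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Polytope {y : A y <= b}: the unit square with the extra cut y1 + y2 <= 1.5 (hypothetical).
A = np.array([[ 1.0,  0.0],    # y1 <= 1
              [ 0.0,  1.0],    # y2 <= 1
              [-1.0,  0.0],    # -y1 <= 0, i.e. y1 >= 0
              [ 0.0, -1.0],    # y2 >= 0
              [ 1.0,  1.0]])   # y1 + y2 <= 1.5
b = np.array([1.0, 1.0, 0.0, 0.0, 1.5])

def neg_log_barrier(y):
    slack = b - A @ y
    if np.any(slack <= 0):
        return np.inf                  # outside the polytope
    return -np.sum(np.log(slack))      # analytic center maximizes the sum of log-slacks

res = minimize(neg_log_barrier, x0=np.array([0.4, 0.4]), method="Nelder-Mead")
print("analytic center:", res.x)       # roughly (0.35, 0.35) for this polytope
```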
We present the next lemma, which shows the Lipschitz property. Lemma 5. ||ȳ*(θ, Δb) − ȳ*(0, Δb)||_2 ≤ χ_C̄ ||Δb||_2 |θ|, where χ_C̄ := sup{||(C̄DC̄⊤)^{−1}C̄D||_2 : D is positive diagonal}. We leave its proof in A10. It is worth noticing that ȳ*(0, Δb) is the interior solution of the standard-form DLP (12). Lemma 5 states that the interior primal solution is Lipschitz with respect to perturbations of the right-hand side. We note that such a Lipschitz property holds for the interior primal solution, but does not hold for the interior dual solution. Therefore, to prove convergence of the dual solution, one has to assume that the dual solution is unique. See Li and Ye (2019) for the convergence of the dual solution. Lastly, Proposition 1 can be derived from Lemmas 4 and 5. We already know that b_t is "close" to b from Lemma 4, and it is also easy to show that ξ_{j,t} = p̂_{j,t}/p_j is close to 1 with high probability. Therefore, with high probability b̄_t = (b_t, ξ_t) stays close to b̄, and hence the online decision will be close to the fair solution. The rest of the proof of Proposition 1 is left to A2.
In this section, we evaluate the performance of different heuristics in the online resource allocation problem under different environments. We find that the numerical results are in line with our theoretical findings, and they provide insights on how to choose algorithms under different problem instances. Based on the discussions above, we test the following three algorithms:
• Adaptive Simplex: this is Algorithm 1 with a simplex solver.
• Adaptive Interior: this is Algorithm 1 with an interior-point solver.
• Adaptive Fair: this is Algorithm 2.
We test the three algorithms above under simulated environments.
• In the first environment, there exist both binding and non-binding resources, and one of the non-binding resources is degenerate.
• In the second environment, there exist both binding and non-binding resources, but no non-binding resource is degenerate.
• In the third environment, supply is smaller than demand, so all resources are binding.
Intuitively, from the theoretical results and the design of Algorithms 1 and 2, one should expect that the adaptive fair algorithm has lower cumulative unfairness when a non-binding resource is degenerate, and that it has higher regret than the adaptive simplex and adaptive interior algorithms in all environments. Our tests on those environments confirm this finding. Moreover, we find that Algorithm 2 features lower cumulative unfairness in all environments mentioned above.
Figure 5: Performance evaluation for Section 5.1. AS stands for adaptive simplex, AI stands for adaptive interior, and AF stands for adaptive fair.
The plots above suggest that the adaptive interior algorithm can balance the regret and the cumulative unfairness well. We choose the model parameters such that the corresponding fluid approximation LP is
max 0.3y_1 + 0.15y_2 + 0.15y_3 + 0.15y_4 + 0.225y_5 + 0.015y_6 + 0.02y_7
s.t. 1.05y_1 + 0.525y_2 + 0.6y_3 + 0.525y_4 + 0.975y_5 + 0.06y_6 + 0.07y_7 ≤ 0.5
0.15y_1 + 0.075y_2 + 0.15y_3 + 0.075y_4 + 0.3y_5 + 0.015y_6 + 0.01y_7 ≤ 1
0.15y_1 + 0.3y_2 + 0.075y_3 + 0.03y_4 + 0.075y_5 + 1.5y_6 + 0.7y_7 ≤ 2
We choose this model because the first resource is binding and the third resource is degenerate. Therefore this model can be interpreted as a more complicated version of the model in Section 3.3. As expected, from Figure 5 we find that all the adaptive algorithms have O(1) regret. A direct numerical check of this fluid LP is sketched below.
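The sketch below solves the fluid LP above with SciPy and prints the optimal value, the dual prices, and the constraint slacks, which makes the binding/non-binding structure of the three resources easy to inspect. The box constraints 0 ≤ y_j ≤ 1 are assumed from the DLP formulation; they are not repeated in the display above.

```python
import numpy as np
from scipy.optimize import linprog

obj = np.array([0.3, 0.15, 0.15, 0.15, 0.225, 0.015, 0.02])
A = np.array([[1.05, 0.525, 0.6,   0.525, 0.975, 0.06,  0.07],
              [0.15, 0.075, 0.15,  0.075, 0.3,   0.015, 0.01],
              [0.15, 0.3,   0.075, 0.03,  0.075, 1.5,   0.7 ]])
cap = np.array([0.5, 1.0, 2.0])

# Fluid LP of the first environment; 0 <= y_j <= 1 is assumed from the DLP formulation.
res = linprog(c=-obj, A_ub=A, b_ub=cap, bounds=[(0.0, 1.0)] * 7, method="highs")
print("optimal value:", -res.fun)
print("dual prices  :", -res.ineqlin.marginals)
print("slacks       :", cap - A @ res.x)   # a resource is binding when its slack is ~0
```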
We also notice behavior similar to Section 3.3: the Adaptive Interior (Algorithm 1) exhibits a linear growth of cumulative unfairness, while the Adaptive Fair (Algorithm 2) grows at a logarithmic order.
Figure 6: Performance evaluation for Section 5.2. AS stands for adaptive simplex, AI stands for adaptive interior, and AF stands for adaptive fair.
To modify the environment so that there is no degeneracy in the non-binding resource, we change the reward vector, the resource consumption matrix, and the resource vector such that the corresponding fluid approximation LP is
max 1.05y_1 + 1.05y_2 + 0.975y_3 + 0.48y_4 + 0.825y_5 + 0.75y_6 + 0.35y_7
s.t. 0.3y_1 + 0.15y_2 + 0.15y_3 + 0.15y_4 + 0.225y_5 + 0.15y_6 + 0.1y_7 ≤ 0.5
0.15y_1 + 0.3y_2 + 0.3y_3 + 0.075y_4 + 0.15y_5 + 0.15y_6 + 0.05y_7 ≤ 0.5
0.15y_1 + 0.15y_2 + 0.075y_3 + 0.03y_4 + 0.075y_5 + 0.15y_6 + 0.05y_7 ≤ 1.5
where the first and second resources are binding, and the third resource is non-binding and non-degenerate. From Figure 6 we find that all the adaptive algorithms have O(1) regret. Since there is no degeneracy in the non-binding resource, the Adaptive Interior (Algorithm 1) has a sub-linear growth of cumulative unfairness, although it is still higher than that of the Adaptive Fair algorithm. Lastly, we consider an environment in which every resource has higher demand than supply. We change the reward vector, the resource consumption matrix, and the resource vector such that the corresponding fluid approximation LP is
max 0.6y_1 + 0.6y_2 + 0.675y_3 + 0.405y_4 + 0.45y_5 + 0.45y_6 + 0.28y_7
s.t. 0.3y_1 + 0.15y_2 + 0.15y_3 + 0.15y_4 + 0.225y_5 + 0.15y_6 + 0.1y_7 ≤ 0.5
0.15y_1 + 0.3y_2 + 0.3y_3 + 0.075y_4 + 0.15y_5 + 0.15y_6 + 0.05y_7 ≤ 0.5
0.15y_1 + 0.15y_2 + 0.225y_3 + 0.18y_4 + 0.075y_5 + 0.15y_6 + 0.13y_7 ≤ 0.5
where all resources are binding. From Figure 7 we find that all the adaptive algorithms have O(1) regret, and the performance is similar to the previous example where there is no non-binding degeneracy.
Proof of Section 4
A1 Proof of Theorem 1
Recall the bound from Corollary 1 and Lemma 3, and the bound from Proposition 4. Lastly, to finish the proof of the bounded regret, we notice that the first inequality is because B_t ≤ B_{t−1}, and the second inequality is because of the definition of τ_S. Before we prove Proposition 1, we need to review some notation. At time t, the sampled LP we are solving is max μ̄⊤ȳ s.t. C̄ȳ = b̄_t, ȳ ≥ 0, where y′ = ξ_t ◦ y (◦ stands for entry-wise multiplication) and b̄_t = (b_t, ξ_t). To apply Lemma 5, we set θ = 1 and Δb = b̄_t − b̄. By recalling the formulation in Section 4.5, we have ȳ*(θ, Δb)_{1:n} = ξ_t ◦ y*_t and ȳ*(0, Δb)_{1:n} = y*. From Lemma 5, we obtain a bound on ||ȳ*(θ, Δb) − ȳ*(0, Δb)||_2. Therefore, to calculate ||y*_t − y*||_2^2, we decompose it accordingly. Then, it is easy to observe that for t > 1, Hoeffding's inequality gives the corresponding tail bound, where p = min_{j∈[n]} p_j. Therefore, we conclude the proof by applying Lemma 4 and the fact that the remaining term is O(mn^9 log(T)). We denote by OPT(b′) the optimal value of the dual LP (9) with input b′, and define I(b′) = {λ : λ is a basic optimal solution of (9) with input b′}. We can define ε as the gap between the optimal objective value and the best objective value attained by a non-optimal basic solution, where Ī(b) is the complement set, i.e., the set of non-optimal basic solutions. For the dual program, a change in b does not change the set of basic solutions.
Consequently, for every fixed basic solution λ, b λ + n j=1 p j (µ j − c j λ) + can be viewed as a continuous function with variable b. The function is continuous and it implies the existence of ∆ > 0 such that for ||b − b|| ∞ ≤ ∆, we have The above inequality implies that for a specific b such that ||b − b|| ∞ ≤ ∆, there cannot exists basic solution λ ∈Ī(b) such that λ ∈ I(b ). Therefore we can have the result that I(b ) ⊆ I(b). However there is still one step forward, which is to show To show this argument, we begin with the definition of projection h such that From the definition, we know that ||h(b ) − b|| ∞ ≤ ∆, therefore from previous argument we know For any λ 1 ∈ I(h(b )) and λ 2 ∈Ī(h(b )), we know that From Assumption 2, we know that λ 1 must have non-binding entry being zero. positive only for the non-binding constraint, we have that and arrive at the inequality that Since λ 1 ∈ I(h(b )), the above inequality ensures that we must have I(b ) ⊆ I(h(b )). , from the result that I(b ) ⊆ I(b) we know for any λ ∈ I(b ), it must have property such that λ i > 0 for i ∈ B as well. Therefore We begin with the following decomposition. For any λ ∈ D * For each t, if we choose λ t ∈ D * and being H t−1 -measurable, we can treat λ t as a constant (conditioned on H t−1 ) that belongs to D * , therefore for any fixed λ ∈ D * , we have Therefore, by summing up we have and the last line is equal to the decomposition of Reg π T . We also observe that the first term and we finish the proof by showing The proof follows the similar idea to the proof of Lemma 1, and the difference is that there is another variable p involved. We denote OP T (b , p ) to be the optimal value of the dual LP (9) with the input being b and p , and define I(b , p ) = {λ|λ is the basic optimal solution for (9) with input b , p } such that for λ ∈ I(b , p ) we have Next, for the input b and p, we define the gap between optimal and the best suboptimal objective value as ε := min WhereĪ(b, p) is the complement set, i.e., the non-optimal basic solution. Notice that here the set of basic solution will not change with the change of b and p. To see this one can recall the formation (18), look at the dual of the standard form of the sampled LP (19), and find out that the constraint for the dual variable is the same as the dual of (12). To derive an uniform bound, we note that (µ j − c j λ) + will be uniformly bounded for all basic solutions of λ such that λ ∈ I(b, p) ∪Ī(b, p). Therefore there exists η such that whenever ||p − p|| ∞ < η, we have that for all λ ∈ I(b, p) ∪Ī(b, p), Then, we look at the variable b . Because I(b, p) andĪ(b, p) are finite, there exist L such that whenever Combining with the fact that n j=1 (p j − p j )(µ j − c j λ) + < ε/4, we can arrive at the conclusion that for any ||p − p|| ≤ η and ||b − b|| ≤ L, we have Therefore, we can have that I(b , p ) ⊆ I(b, p) whenever ||p − p|| ≤ η and ||b − b|| ≤ L. The next step is to show the same relationship for ||p − p|| ≤ η and To show this argument, we continue with the definition of projection h such that From the definition, we know that ||h(b ) − b|| ∞ ≤ L, therefore from previous argument we know For any λ 1 ∈ I(h(b ), p ) and λ 2 ∈Ī(h(b ), p ), we know that From Assumption 2 and the previous conclusion I(h(b ), p ) ⊆ I(b, p), we know that λ 1 must have non-binding entry being zero. 
Since b − h(b ) is positive only for the non-binding constraint, we have and arrive at the inequality that Since λ 1 ∈ I(h(b ), p ), the above inequality ensures that we must have Finally, we have shown that for ||p − p|| ≤ η and we have I(b , p ) ⊆ I(b, p). By adjusting L to be the minimum of the original L and η, we arrive at result that if (b t ,p t ) ∈ K, the dual optimal solution set of the sampled LP (5) is a subset of D * . The proof of the first statement trivially follows the fact that the interior dual solution is a subset of D * , the condition of complementary slackness, and the Assumption 2. Therefore we focus on the second statement. Let us first focus on solving the sampled LP with fixedp such that ||p − p|| ≤ L but varying b that belongs tob Because the perturbation of b andp is within the stable region, and the region (21) and ||p − p|| ≤ L are closed intervals, there exists δ such that for all the solution y * from the DLP with input in (21). Having shown such property, by restricting thê p such that ||p − p|| ∞ ≤ δ/2, we can have the result that for the deviation Therefore, by setting L to be min{L, δ/2} we finish proving this corollary. By the definition of E t . When t ≤ κT , we know that E t happens with probability 1. Whenτ > t > κT we know that (b t ,p t ) is in the stable region and from Corollary 2 we know G t defined by is true. G t states that the binding/nonbinding structure of the sampled LP is the same as the DLP, and this implies that Algorithm 2 is the same as Algorithm 1 with modified b t on the right binding/nonbinding index, and an interior-point solver. Therefore, by recalling that we can bound P b s / ∈ Ω for some s ≤ t by the similar approach in Appendix C of Chen et al. (2021) . We denote that the modification b t in Algorithm 2 will change the dynamic of average remaining inventory b t compared with the b t in Chen et al. (2021) . However, under the event G t , we will show that the inventory process b t with modified dynamic will still comply with the characterization of "good" event E t . Therefore from the similar analysis, we can state the lemma below, and its proof is left in A13. Lemma 6. For T ≥ T 1 and t ≤ T − 2 where the constant T 1 is defined as the minimal integer such that In terms of t s=1 P(Ē s ), from the fact that E t happens with probability 1 for t ≤ κT , we know that t s=1 P(Ē s ) = t s=κT +1 P(Ē s ). To see this, for t ≥ 2, let y * t = (y * 1,t , ..., y * n,t ) be the optimal solution of sampled LP (5) with b t = b for some b ∈ Ω. By the algorithm, we have the expected resource consumption at time t Moreover, we know that given the event ∩ n j=1 A (j) t and b ∈ Ω, we are in the stable region K. Therefore, from Corollary 2 we have Then, taking the difference, Next, given the event ∩ n j=1 B (j) where we use the fact that c j ∞ ≤ 1 from Assumption 1. This meets the definition of the event E t and we have shown that ∩ n j=1 A (j) Next, by applying Hoeffding's inequality for A (j) t and B (j) t we have Again recalling the formula we have Lastly, To bound we first observe the following facts. • At time t, we already can choose λ t−1 ∈ H t−1 , and the result for any λ t−1 ∈ H t−1 could be used as an upper bound. • Only one order type arrives, hence it can only be one of the three types : r t −a t λ t > 0, r t −a t λ t = 0, and r t − a t λ t < 0. • For the bound of (15), we can ignore the type r t − a t λ t = 0. • (r t − a t λ t ) + (1 − x t ) + (a t λ t − r t ) + x t > 0 if and only if we reject/accept the wrong type. 
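The last observation in the list above can be packaged as a per-period "mistake" indicator; the small helper below is a direct transcription of that expression and is only meant to make the bookkeeping explicit.

```python
import numpy as np

def mistake_term(r_t, a_t, lam_t, x_t):
    # (r_t - a_t' lam_t)^+ (1 - x_t) + (a_t' lam_t - r_t)^+ x_t:
    # strictly positive exactly when the decision x_t disagrees with the
    # accept/reject rule prescribed by the dual price lam_t (ties contribute zero).
    gap = r_t - float(np.dot(a_t, lam_t))
    return max(gap, 0.0) * (1 - x_t) + max(-gap, 0.0) * x_t
```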
From the observation above, we choose λ * t ∈ H t−1 , the analytic center of the sampled LP (5), and define the event of "making mistake at time t" to be and define event Q t such that We then remark some important observations. First, Q t is H t−1 -measurable and we can already know if Q t is true at time t. Second, At time t we also have already knew the information at (5), which is H t−1 -measurable, and we can calculate y * t and λ * t (we don't have to exactly calculate λ * t since it is only for bounding the expectation for theoretical purpose). Third, Q t ⊆ {(b t ,p t ) ∈ K}, and from Lemma 2, we know that if Q t happens, D * t , the set of optimal dual solution for the sampled LP (5), is a subset of D * . Based on the observations above, we then show that at time t, by setting λ t = λ * t , and using the interior solution y * t of the smapled LP (5), we have In the proof we analyze the binding resource and non-binding resource separately. Without loss of generalization we can permute the order of the resource such that The reason we work with b t instead of b t is that by definition, ||b t − b|| 2 is an upper bound of ||b t − b|| 2 . For the convenience of notation, for now let's denote b = b B . From the update rule we have We also have the geometry propoerty that Because the analysis of each i ∈ B will be the same, we use b t , a t to denote b 1,t , a 1,t . Observe that Since we are dealing with binding dimensions, from ||c j || ∞ ≤ 1 we have b t ≤ 2 before the jump time τ S . This is because for binding resource we must have b ≤ 1, and by setting L ≤ 1 we know b t ≤ 2 before the stopping time. Since T − E[τ S ] = O(1), for bounding the difference we can just treat τ S = T and we know the difference will also be O(1). Next, we have that We want the above inequality to be in a recursive form of E[(b t − b) 2 ], therefore we need to bound The corresponding result is summarized from the following lemma: Lemma 8. Under the binding case, there exists positive constants K, A 1 , A 2 , B 1 , B 2 , C 1 and C 2 such t − 1 ∧ 1 + A 1 e −A2(T −t+1) + B 1 e −B2t + C 1 T 1/2 e −C2T 1/2 . Next, by denoting z t = E[(b t − b) 2 ], form (25) and the above lemma we know that t − 1 ∧ 1 + A 1 e −A2(T −t+1) + B 1 e −B2t + C 1 T 1/2 e −C2T 1/2 + 4 (T − t) 2 . (26) Then we can finish the proof from the following lemma Lastly, let us identify the order forN . By comparing the parameter in (26) and (29) we find that K is of the order n 3 and C 1 /C 3 2 is of the order n 9 . Therefore we have the bound T t=1 E[(b i,t − b i ) 2 ] = O(n 9 log(T )) holds for every i ∈ B, and performing the same analysis for all the binding resources yields E T t=1 ||b t − b|| 2 2 ≤ E T t=1 ||b t − b|| 2 2 ≤ O(mn 9 log(T )). Next, let's denote b t = b N ,t and analyze the non-binding case. Again because each dimension i ∈ N can be treated the same way, we denote b t = b i,t . Like in Lemma 8, we define event t }, we know that the index i ∈ N is non-binding hence Algorithm 2 will set b t = b. Therefore, it suffices to bound the probability of bad events. We have From Proposition 4 and calculation we know that T t=1 P (τ S ≤ t) ≤ 16m L 2 + nT L 2 exp −2L 2 κT + 4n 3 T 3/2 exp − κ 1/2 n 2 T 1/2 T t=1 n j=1 P (Ā (j) t ) ≤ n + T −1 t=1 2n exp −2tL 2 ≤ 2n 1 − exp(−2L 2 ) . Therefore, for i ∈ N we know We prove by contradiction. If instead ||ȳ * (θ, ∆b) −ȳ * (0, ∆b)|| 2 > χC||∆b|| 2 |θ|, From the fact that ||ȳ * (θ, ∆b)−ȳ * (0, ∆b)|| 2 = ||ȳ * (θ, ∆b)−ȳ * (θ/2, ∆b)|| 2 +||ȳ * (θ/2, ∆b)−ȳ * (0, ∆b)|| 2 . 
It must be the case that either ||ȳ * (θ/2, ∆b) −ȳ * (0, ∆b)|| 2 > χC||∆b|| 2 |θ|/2 or ||ȳ * (θ, ∆b) −ȳ * (θ/2, ∆b)|| 2 > χC||∆b|| 2 |θ|/2. Repeating such argument will generate a sequence {θ i } +∞ i=1 such that 2|θ i+1 − θ i | = |θ i − θ i−1 | and ||ȳ * (θ i+1 , ∆b) −ȳ * (θ i , ∆b)|| 2 > χC||∆b|| 2 |θ i+1 − θ i |. This implies that there exists θ * = lim i→+∞ θ i and a vectorb * =b + θ * ∆b such that the left or right derivative on direction ∆b is greater than χC||∆b|| 2 , which contradicts Theorem 3.4 of Holder et al. (2001) . We define event A (j) t = {|p j,t − p j | ≤ L} , and analyze the good event that C t−1 = {b t ∈ Ω} ∩ {∩ n j=1 A (j) t }. Notice that we use C t−1 because we want to stress that C t−1 is H t−1 -measurable. From the definition of Ω, A (j) t , the convention b = b B , and Algorithm 2 we know that b t = n j=1p j,t c j y * j,t and E [a t x t |C t−1 ] = n j=1 p j c j y * j,t . Then Therefore, by taking the one-dimensional convention (a = a 1 and the same for others) we have that For the first term exp − 2(t − 1)y 2 n 2 dy = n 3 4(t − 1) . Therefore, for the first term in (28) we have that Next, for the second term in (28), we have that Lastly, it suffices to give a bound for P (C t−1 ). Notice thatC t−1 ⊆ {τ S ≤ t − 1} ∪ ∪ n j=1Ā (j) t , therefore from Proposition 4 and Hoeffding's inequality we have P (C t−1 ) ≤ P (τ S ≤ t − 1) + n j=1 P Ā (j) t ≤ 2m exp − L 2 (T − t + 1) 8 + n L 2 exp −2L 2 κT + 4n 3 T 1/2 exp − κ 1/2 n 2 T 1/2 + 2n exp −2L 2 (t − 1) , and the last term can be dominated by the previous exponential terms. Therefore, by combining the results above we have √ n 3 √ t − 1 ∧ 1 + 16m exp − L 2 (T − t + 1) 8 + 8n L 2 exp (−2L 2 κT ) + 32n 3 T 1/2 exp − κ 1/2 n 2 T 1/2 . A12 Proof of Lemma 9 For t − 1 ∧ 1 + A 1 e −A2(T −t+1) + B 1 e −B2t + C 1 T 1/2 e −C2T 1/2 + 4 2 (T − t) 2 . From the fact that tc 1 e −c2t ≤ c1 c2e and t 3 c 1 e −c2t ≤ 27c1 c 3 2 e 3 for any t, c 1 , c 2 > 0, we have that for all t ≤ T /2 √ K √ t − 1 ∧ 1 + A 1 e −A2(T −t+1) + B 1 e −B2t + C 1 T 1/2 e −C2T 1/2 If we denoteM = 4 √ K + 2 2A1 eA2 + B1 eB2 + 27C1 e 3 C 3 2 , for t ≤ T /2 we have z t+1 ≤ z t +M (T −t) √ t √ z t + 4 2 (T −t) 2 . DenoteN = 4(M 2 ∨ 16 2 ), and suppose that we have z t ≤N (T −t) 2 t. We know the base case is true since Therefore, if we apply union bound with respect to binding constraint index, we have for t ≤ T − 2 and T ≥ T 1 . Next, for non-binding index i ∈ N , we use the same definition on Y t , X t such that and want to show Since in Algorithm 2 use b t instead for solving the sampled LP, we have to make sure the following is true for t <τ To show the property (32), for t ≤ κT , we have because for t <τ we know b i,t ≥ b − L and ||a t || ∞ ≤ 1 from Assumption 1. Next, for t > κT , since t <τ , from definition of E t , we know (b t ,p t ) ∈ K, and G t is true; From the second statement of Corollary 2, we have we that From the same analysis we know I(t > κT ). Then we apply the same approach. By defining κ ≤ 1 − exp(− L 8(1+L−b) ), and T 1 to be the minimal integer such that T 1 ≥ 1 exp ( L 8(1+L−b) )−1 + 2 and log T1 T 1/4 1 ≤ κ 1/4 L 4 , we know that for T > T 1 , Therefore, with the choice of κ and T ≥ T 1 , we have Next, for t ≤ T − 2. we have that P   s j=1 X j ≤ − L 2 for some s ≤ t   ≤ e − L 2 (T −t−1) . Therefore, for i ∈ N we have that . Summing up and taking the union bound, we know that for all T ≥ T 1 and t ≤ T − 2, we have P b s / ∈ Ω for some s ≤ t ≤ 2me − L 2 (T −t) . 
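As a quick numerical sanity check of the concentration bounds used above for the frequency estimates (the events A_t^{(j)} = {|p̂_{j,t} − p_j| ≤ L}), the following sketch compares the empirical violation frequency with the union-bound version of Hoeffding's inequality, 2n exp(−2tL²); the arrival probabilities, L, and t below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.4, 0.35, 0.25])     # hypothetical arrival probabilities (n = 3 categories)
L, t, trials = 0.1, 200, 20000

arrivals = rng.choice(p.size, size=(trials, t), p=p)
p_hat = np.stack([(arrivals == j).mean(axis=1) for j in range(p.size)], axis=1)
violation = np.mean(np.max(np.abs(p_hat - p), axis=1) > L)   # empirical P(some |p_hat_j - p_j| > L)
bound = 2 * p.size * np.exp(-2 * t * L ** 2)                 # union bound over the n categories
print(f"empirical violation frequency: {violation:.4f}  vs  Hoeffding union bound: {bound:.4f}")
```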
References
A unified framework for dynamic pari-mutuel information market design
A dynamic near-optimal algorithm for online linear programming
Online resource allocation with limited flexibility
Uniform loss algorithms for online stochastic decision-making with applications to bin packing
Equal opportunity in online classification with partial feedback
The price of fairness
On preserving nondiscrimination when combining expert advice
Online primal-dual algorithms for covering and packing. Mathematics of
A re-solving heuristic with uniformly bounded loss for network revenue management
Yinyu Ye. 2021. How linear reward helps in online resource allocation
Heramb Nemlekar, Stefanos Nikolaidis. 2020. Fair contextual multi-armed bandits: Theory and experiments. Conference on Uncertainty in Artificial Intelligence. PMLR
Price discrimination with fairness constraints. Available at SSRN 3459289
Dynamic pricing with fairness constraints. Available at SSRN 3930622
Fairness and utilization in allocating resources with uncertain demand
Fairness through awareness
Fair algorithms for learning in allocation problems
Who solved the secretary problem?
Online learning with an unknown fairness metric
Individual fairness in hindsight
Equality of opportunity in supervised learning
Preventing disparate treatment in sequential decision making
Marginal and parametric analysis of the central optimal solution
Racial equity in algorithmic criminal justice
Fairness in reinforcement learning. International conference on machine learning
Performance of an lp-based control for revenue management with unknown demand parameters
A re-solving heuristic with bounded revenue loss for network revenue management with customer choice
Fairness in learning: classic and contextual bandits
Fairness-aware learning through regularization approach
Knapsack problems
Primal beats dual on online packing lps in the random-order model
Combinatorial sleeping bandits with fairness constraints
Behavior-based pricing: An analysis of the impact of peer-induced fairness
Online linear programming: Dual convergence, new algorithms, and regret bounds
Calibrated fairness in bandits
Linear and nonlinear programming
Group-level fairness maximization in online bipartite matching
Fair dynamic rationing. Available at SSRN 3775895
Adwords and generalized on-line matching
Can an algorithm hire better than a human. The New York Times
To show (24), we have to validate our choice of y*_t and λ*_t. From the property of the primal-dual central path, we know that the interior primal solution ȳ* = [y*, s*, z*] of (11) and the interior dual solution λ* satisfy the strict complementarity condition, and this condition implies the following lemma. Lemma 7. For the interior primal solution y* of the DLP (3) and the interior dual solution λ* of the dual (9), we have y*_i = 1 if and only if c_iλ* < μ_i, y*_i = 0 if and only if c_iλ* > μ_i, and y*_i ∈ (0, 1) if and only if c_iλ* = μ_i. Moreover, the same holds for the interior primal and dual solutions of the perturbed LP (5) and its dual. From Lemma 7 we know that the interior primal solution is consistent with the accept/reject rule given by the interior dual solution. More specifically, if at time t we have (r_t, a_t) = (μ_j, c_j) for category j, Lemma 7 pins down the corresponding value of y*_{j,t}. However, this alone is not enough to show that the bound (15) is small, because the sampled LP might have a different characterization of the accept/reject type from the DLP.
The condition that (b t ,p t ) ∈ K (it suffices to ensure that Q t is true) makes sure sample LP and DLP has the same characterization of the accept/reject type. If we can show (b t ,p t ) ∈ K (which is equivalent to λ * t ∈ D * ) with high probability, then (15) is small.where the last equality is becausez 1 = 0 and the increment is at the order 4 2 (T −t) 2 . Then,Therefore, we have that z t ≤N (T −t) 2 t for all t ≤ T /2. Such bound will be insufficient to generate a log(T ) regret when t gets large. So for t > T /2, we need a different bound and choose to test if z t ≤N T −t . The base case is true from the result on t ≤ T /2, and for larger t we haveTherefore, combining it together yields We first analyze the processb i,t , where i ∈ B. Then we analyze the non-binding process. Lastly we do a union bound and conclude the result. Throughout the proof, we have to use the following theorem.Theorem 2. (Hoeffding's inequality for dependent data (van de Geer, 2002)) Consider a sequence of random variables {X t } T t=1 adapted to the filtration F t 's andThen, the following inequality holds for any b > 0, c > 0 and T ∈ N + ,for t ≥ 1 andIn this way, to analyze the processb i,t , we can equivalently analyze the summation t−1 s=1 Y t . From the definition of the processb t , we know that when t >τ − 1, we havẽand when 1 ≤ t <τ , we havebFrom the fact thatb i,t is H t−1 -measurable, we can bound the absolute value of |x t | such thatfor each t ≤ T − 1. So we can define L t and U t asand the conditions of Theorem 2 are met for the process X t , L t and U t . Then as in Theorem 2,holds for all L > 0 and t ≤ T − 2.With this bound on the summation of X t , we return to analyze the summation of Y t by bounding the difference between these two sequences. By the definition, we havefor 1 ≤ t ≤ T −1. The second line comes from the definition of Y t . The third line comes from the definition of E t and the definition ofτ . For the last line, we denote that since t <τ , for t ≤ κT we have that (7) in the right-hand-side of the sampled LP; when t > κT , since t <τ , we know that G t is true such that we are putting b i,t = b i,t (because i ∈ B) in the right-hand-side of the sampled LP, and by definition of E t we know that E[a B,t x t (b t )|H t−1 ]−b B,t ≤ * t . By taking summation of (31), we have for s ≤ T − 2,The next is to find a proper value of κ such that the equation above is bounded by L 2 . For the first part, we have