Hedging the Drift: Learning to Optimize under Non-Stationarity
Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
March 4, 2019

We introduce general data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. The framework captures applications such as advertisement allocation and dynamic pricing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe that first converts the rate-optimal Upper-Confidence-Bound (UCB) algorithm for a stationary bandit setting into a tuned Sliding Window-UCB algorithm with optimal dynamic regret for the corresponding non-stationary counterpart. Boosted by the novel bandit-over-bandit framework, which adapts automatically to the unknown changing environment, the recipe further attains, in a (surprisingly) parameter-free manner, this optimal dynamic regret when the amount of non-stationarity is moderate to large, and an improved (compared to existing literature) dynamic regret otherwise. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in their online learning processes, which is vital in changing environments. We further conduct extensive numerical experiments on both synthetic data and the CPRM-12-001: On-Line Auto Lending dataset provided by the Center for Pricing and Revenue Management at Columbia University to show that our proposed algorithms achieve superior dynamic regret performances.

Consider the following general decision-making framework: a decision-maker (DM) interacts with an environment by picking actions one at a time sequentially. Upon selecting an action, it instantly receives a reward drawn randomly from a probability distribution tied to this action. The goal of the DM is to maximize its cumulative reward. However, it also faces the following challenges:
• Uncertainty: the reward distribution of each action is initially unknown to the DM, and it thus has to estimate the underlying distributions by interacting with the environment.
• Non-Stationarity: moreover, the environment is non-stationary, and the reward distributions can evolve over time.
• Partial/Bandit Feedback: finally, the DM can only observe the random reward of the selected action in each round, while the rewards of the un-chosen actions remain unknown.
Evidently, the DM faces a trilemma among exploration, exploitation, and adaptation to changes. On one hand, the DM wishes to exploit, and to select the action with the best historical performance so as to earn as much reward as possible; on the other hand, it also wants to explore other actions to obtain more accurate estimates of the reward distributions. The changing environment makes the exploration-exploitation trade-off even more delicate. Indeed, past observations can become obsolete due to changes in the environment, and the DM needs to probe for changes and refrain from exploiting possibly outdated observations. It turns out that many applications fall naturally into this general framework:
1. Advertisement Allocation: an online platform allocates advertisements (ads) to a sequence of users.
For each arriving user, the platform has to deliver an ad to her, and it only observes the user's response to the displayed ad. The platform has full access to the features of the ads and the users. Following (Agrawal and Goyal 2013), we assume that a user's click behavior towards an ad, or simply the click-through rate (CTR) of this ad by a particular user, follows a probability distribution governed by a common, but initially unknown, response function that is linear in the features of the ad and the user. The platform's goal is to maximize the total profit. However, the unknown response function can change over time. For instance, around the time Apple releases a new iPhone, one can expect the popularity of an Apple ad to grow. The adversarial settings could be too pessimistic. Recently, a stream of research works (see Related Works) focuses on MAB problems in a drifting environment, which is a hybrid of a stochastic and an adversarial environment. Although the environment can change dynamically and adversarially, the total change (quantified by a suitable metric) in a T-step problem is upper bounded by B_T (= Θ(T^ρ) for some ρ ∈ (0, 1)), the variation budget. The feedback is corrupted by mean-zero random noise. The aim is to minimize the dynamic regret, which is the optimality gap compared to the sequence of (possibly dynamically changing) optimal decisions, by simultaneously estimating the current environment and hedging against future changes in every time step. Most of the existing works for non-stationary bandits have focused on the somewhat ideal case in which B_T is known. In practice, however, B_T is often not available ahead of time. Though some efforts have been made in this direction (Karnin and Anava 2016, Luo et al. 2018), how to design algorithms with low dynamic regret when B_T is unknown remains largely an open problem. In this paper, we design and analyze novel algorithms for bandit problems in a drifting environment. Our main contributions are listed as follows.
• When the variation budget B_T is known, we characterize a lower bound on the dynamic regret, and develop a tuned Sliding Window Upper-Confidence-Bound (SW-UCB) algorithm whose dynamic regret upper bound matches the lower bound up to logarithmic factors.
• When B_T is unknown, we propose a novel Bandit-over-Bandit (BOB) framework that tunes SW-UCB adaptively. When the amount of non-stationarity is moderate to large, applying BOB to the SW-UCB algorithm recovers the optimal dynamic regret; otherwise, it obtains a dynamic regret bound with the best dependence on T compared to existing literature.
• Our results are general in the sense that, given any Upper-Confidence-Bound (UCB) algorithm with optimal regret for a stationary bandit setting, we are able to convert it into the corresponding SW-UCB algorithm with optimal dynamic regret (when B_T is known) and the BOB algorithm with nearly optimal dynamic regret (when B_T is unknown), respectively.
• Our algorithm design and analysis shed light on the fine balance between exploration, exploitation, and adaptation to changes in dynamic learning environments. We rigorously incorporate the "forgetting principle" into the Optimism-in-Face-of-Uncertainty principle, by demonstrating that the DM should dispose of observations that are sufficiently old. We provide precise criteria for the disposal, and rigorously show the convergence to optimality under these criteria.
• Finally, we point out that a preliminary version of this paper will appear in the 22 nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019) (Cheung et al. 2019 ), and the current paper is a significantly extended version of it. Specifically, the current version provides a substantially refined analysis and thus a greatly improved regret bound for the BOB algorithm when compared to (Cheung et al. 2019) . More importantly, we demonstrate the generality of the established results in the current paper, as exemplified by the extensions to the prevalent d-armed bandit and generalized linear bandit settings. The rest of the paper is organized as follows. In Section 2, we review existing works in stationary and non-stationary environments. In Section 3, we formulate the non-stationary linear bandit model. We note that the choice of linear model is purely for the purpose of illustration, and all the results derived shall be applicable to other settings (Please see Section 8). In Section 4, we establish a minimax lower bound for our problem. In Section 5, we describe the sliding window estimator for parameter estimation under non-stationarity. In Section 6, we develop the sliding window-upper confidence bound algorithm with optimal dynamic regret (when the amount of non-stationarity is known ahead). In Section 7, we introduce the novel bandit-over-bandit framework with nearly optimal regret. In Section 8, we demonstrate the generality of the established results by applying them to other popular bandit settings, such as the multi-armed bandit and generalized linear bandit settings. In Section 9, we conduct extensive numerical experiments with both synthetic and CPRM-12-001: on-line auto lending datasets to show the superior empirical performances of our algorithms. In Section 10, we conclude our paper. MAB problems with stochastic and adversarial environments are extensively studied, as surveyed in Cesa-Bianchi 2012, Lattimore and Szepesvári 2018) . To model inter-dependence relationships among different arms, models for linear bandits in stochastic environments have been studied. In (Auer 2002 , Dani et al. 2008 , Rusmevichientong and Tsitsiklis 2010 , Chu et al. 2011 , Abbasi-Yadkori et al. 2011 , UCB type algorithms for stochastic linear bandits were studied, and Abbasi-Yadkori et al. (Abbasi-Yadkori et al. 2011) possessed the state-of-the-art algorithm for the problem. Thompson sampling algorithms proposed in (Russo and Roy 2014 , Agrawal and Goyal 2013 , Abeille and Lazaric 2017 are able to bypass the high computational complexities provided that one can efficiently sample from the posterior on the parameters and optimize the reward function accordingly. Unfortunately, achieving optimal regret bound via TS algorithms is possible only if the true prior over the reward vector is known. (Besbes et al. 2014 (Besbes et al. , 2018 considered the K-armed bandit in a drifting environment. They achieved the tight dynamic regret boundÕ((KB T ) 1/3 T 2/3 ) when B T is known. Wei et al. (Wei et al. 2016) provided refined regret bounds based on empirical variance estimation, assuming the knowledge of B T . Subsequently, Karnin et al. (Karnin and Anava 2016) considered the setting without knowing B T and K = 2, and achieved a dynamic regret bound ofÕ(B 0.18 T T 0.82 + T 0.77 ). In a recent work, (Luo et al. 
2018 ) considered K-armed contextual bandits in drifting environments, and in particular demonstrated an improved boundÕ(KB 1/5 T T 4/5 ) for the K-armed bandit problem in drifting environments when B T is not known, among other results. (Keskin and Zeevi 2016) considered a dynamic pricing problem in a drifting environment with linear demands. Assuming a known variation budget B T , they proved an Ω(B 1/3 T T 2/3 ) dynamic regret lower bound and proposed a matching algorithm. When B T is not known, they designed an algorithm withÕ(B T T 2/3 ) dynamic regret. In , a general problem of stochastic optimization under the known budgeted variation environment was studied. The authors presented various upper and lower bound in the full feedback settings. Finally, various online problems with full information feedback and drifting environments are studied in the literature (Chiang et al. 2012 , Jadbabaie et al. 2015 . Apart from drifting environment, numerous research works consider the switching environment, where the time horizon is partitioned into at most S intervals, and it switches from one stochastic environment to another across different intervals. The partition is not known to the DM. Algorithms are designed for various bandits, assuming a known S (Auer et al. 2002a , Garivier and Moulines 2011 , Luo et al. 2018 , or assuming an unknown S (Karnin and Anava 2016, Luo et al. 2018) . Notably, the Sliding Window-UCB and the "forgetting principle" is first proposed by Garivier et al. (Garivier and Moulines 2011) , while it is only analyzed under K-armed switching environments. Finally, it is worth pointing out that our Bandit-over-Bandit framework has connections with algorithms for online model selection and bandit corralling, see e.g., (Agarwal et al. 2017 , Besbes et al. 2018 ) and references therein. This and similar techniques have been investigated under the context of non-stationary bandits in (Luo et al. 2018 , Besbes et al. 2018 . Notwithstanding, existing works either obtain sub-optimal dynamic regret bounds or only empirical performance guarantees. In this section, we introduce the notations that will be used throughout the discussions and the model formulation. Throughout the paper, all vectors are column vectors, unless specified otherwise. We define [n] to be the set {1, 2, . . . , n} for any positive integer n. The notation a : b is the abbreviation of consecutive indexes a, a + 1, . . . , b. We use x to denote the Euclidean norm of a vector x ∈ R d . For a positive definite matrix A ∈ R d×d , we use x A to denote the matrix norm √ x Ax of a vector x ∈ R d . We also denote x ∨ y and x ∧ y as the maximum and minimum between x, y ∈ R, respectively. When logarithmic factors are omitted, we use O(·) to denote function growth. In each round t ∈ [T ], a decision set D t ⊆ R d is presented to the DM, and it has to choose an action is revealed. Here, we allow D t to be chosen by an oblivious adversary whose actions are independent of those of the DM, and can be determined before the protocol starts (Cesa-Bianchi and Lugosi 2006) . The vector of parameter θ t ∈ R d is an unknown d-dimensional vector, and η t is a random noise drawn i.i.d. from an unknown sub-Gaussian distribution with variance proxy R. This implies Following the convention of existing bandits literature (Abbasi-Yadkori et al. 
2011, Agrawal and Goyal 2013), we assume there exist positive constants L and S, such that X ≤ L and θ t ≤ S holds for all X ∈ D t and all t ∈ [T ], and the problem instance is normalized so that | X, θ t | ≤ 1 for all X ∈ D t and t ∈ [T ]. Instead of assuming the stochastic environment, where the reward function remains stationary across the time horizon, we allow it to change over time. Specifically, we consider the general drifting environment: the sum of 2 differences of consecutive θ t 's should be bounded by some variation budget B T = Θ(T ρ ) for some ρ ∈ (0, 1), i.e., We again allow the θ t 's to be chosen adversarially by an oblivious adversary. We also denote the set of all possible obliviously selected sequences of θ t 's that satisfies inequality (1) as Θ(B T ). The DM's goal is to design a policy π to maximize the cumulative reward, or equivalently to minimize the worst case cumulative regret against the optimal policy π * , that has full knowledge of θ t 's. Denoting x * t = arg max x∈D t x, θ t , the dynamic regret of a given policy π is defined as where the expectation is taken with respect to the (possible) randomness of the policy. We first provide a lower bound on the the regret to characterize the best achievable regret. Theorem 1. For any T ≥ d, the dynamic regret of any policy π satisfies R T (π) = Ω d The statement then follows. Please refer to Section A for the complete proof. As a preliminary, we introduce the sliding window regularized least squares estimator, which is the key tool in estimating the unknown parameters {θ t } T t=1 . Despite the underlying non-stationarity, we show that the estimation error of this estimator can gracefully adapt to the parameter changes. Consider a sliding window of length w, and consider the observation history during the time window (1 ∨ (t − w)) : (t − 1). The ridge regression problem with regularization parameter λ (> 0) is stated below: Denoteθ t as a solution to the regularized ridge regression problem, and define matrix V t−1 := λI + t−1 s=1∨(t−w) X s X s . The solutionθ t has the following explicit expression: The differenceθ t − θ t has the following expression: The first term on the right hand side of eq. (4) is the estimation inaccuracy due to the nonstationarity; while the second term is the estimation error due to random noise. We now upper bound the two terms separately. We upper bound the first term in the 2 sense. Lemma 1. For any t ∈ [T ], we have Poof Sketch. Our analysis relies on bounding the maximum eigenvalue of . . , t − 1}. Please refer to Section B of the appendix for the complete proof. By applying (Abbasi-Yadkori et al. 2011), we upper bound the second term in the matrix norm sense. Lemma 2 ((Abbasi-Yadkori et al. 2011)). For any t ∈ [T ] and any δ ∈ [0, 1], we have holds with probability at least 1 − δ. From now on, we shall denote for the ease of presentation. With these two lemmas, we have the following deviation inequality type bound for the latent expected reward of any action x ∈ D t in any round t. Theorem 2. For any t ∈ [T ] and any δ ∈ [0, 1], we have with probability at least 1 − δ, Proof Sketch. The proof is a direct application of Lemmas 1 and 2. Please refer to Section C of the appendix for the complete proof. In this section, we describe the Sliding Window Upper Confidence Bound (SW-UCB) algorithm. When the variation budget B T is known, we show that SW-UCB algorithm with a tuned window size achieves a dynamic regret bound which is optimal up to a multiplicative logarithmic factor. 
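The SW-UCB algorithm builds on the sliding-window estimator of eq. (3). As a minimal illustration (our own sketch under stated assumptions, not the authors' implementation; the function name, the list-of-pairs history representation, and the direct linear solve are our choices), the following Python code computes the estimate from the most recent w observations only.

```python
import numpy as np

def sliding_window_estimate(history, t, w, lam, d):
    """Sliding-window regularized least-squares estimate of theta_t (eq. (3)).

    history[s - 1] = (X_s, Y_s) for s = 1, ..., t - 1, with X_s a d-dimensional
    action vector and Y_s the observed reward; only the rounds indexed by
    (1 v (t - w)) : (t - 1) enter the estimate.
    """
    V = lam * np.eye(d)          # V_{t-1} = lambda * I + sum_s X_s X_s^T over the window
    b = np.zeros(d)              # sum_s X_s Y_s over the window
    for X_s, Y_s in history[max(0, t - w - 1): t - 1]:
        X_s = np.asarray(X_s, dtype=float)
        V += np.outer(X_s, X_s)
        b += Y_s * X_s
    theta_hat = np.linalg.solve(V, b)   # hat{theta}_t = V_{t-1}^{-1} sum_s X_s Y_s
    return theta_hat, V
```

A practical implementation would maintain V_{t-1} incrementally, adding the newest rank-one term and removing the one that falls out of the window, rather than rebuilding the sum every round.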
When the variation budget B T is unknown, we show that SW-UCB algorithm can still be implemented with a suitably chosen window size so that the regret dependency on T is optimal, which still results in first order optimality in this case (Keskin and Zeevi 2016) . In the stochastic environment where the reward function is stationary, the well known UCB algorithm follows the principle of optimism in face of uncertainty (Auer et al. 2002b , Abbasi-Yadkori et al. 2011 . Under this principle, the DM selects the action that maximizes the UCB, or the value of "mean plus confidence radius" (Auer et al. 2002b) . We follow the principle by choosing in each round the action X t with the highest UCB, i.e., When the number of actions is moderate, the optimization problem (6) can be solved by an enumeration over all x ∈ D t . Upon selecting X t , we have by virtue of UCB. From Theorem 2, we further have with probability at least 1 − δ, and Combining inequalities (7), (8), and (9), we establish the following high probability upper bound for the expected per round regret, i.e., with probability 1 − δ, The regret upper bound of the SW-UCB algorithm (to be formalized in Theorem 3) is thus If B T is known, the DM can set w = d 2/3 T 2/3 B −2/3 T and achieve a regret upper bound If B T is not known, which is often the case in practice, the DM can set w = (dT ) 2/3 to obtain a regret upper bound O(d 2/3 (B T + 1)T 2/3 ). In this section, we describe the details of the SW-UCB algorithm. Following its design guideline, the SW-UCB algorithm selects a positive regularization parameter λ (> 0), and initializes V 0 = λI. In each round t, the SW-UCB algorithm first computes the estimateθ t for θ t according to eq. 3, and then finds the action X t with largest UCB by solving the optimization problem (6). Afterwards, the corresponding reward Y t is observed. The pseudo-code of the SW-UCB algorithm is shown in Algorithm 1. Algorithm 1 SW-UCB algorithm 1: Input: Sliding window size w, dimension d, variance proxy of the noise terms R, upper bound of all the actions' 2 norms L, upper bound of all the θ t 's 2 norms S, and regularization constant λ. We are now ready to formally state a regret upper bound of the SW-UCB algorithm. Theorem 3. The dynamic regret of the SW-UCB algorithm is upper bounded as When B t is unknown, by taking w = O (dT ) 2/3 , the dynamic regret of the SW-UCB algorithm is Poof Sketch. The proof utilizes the fact that the per round regret of the SW-UCB algorithm is upper bounded by the UCB of the chosen action, and decomposes the UCB into two separated terms according to Lemmas 1 and 2, i.e., regret in round t is equal to regret due to non-stationarity in round t + regret due to estimation error in round t. The first term can be upper bounded by a intuitive telescoping sum; while for the second term, although a similar quantity is analyzed by the authors of (Abbasi-Yadkori et al. 2011) using a (beautiful) matrix telescoping technique under the stationary environment, we note that due to the "forgetting principle" of the SW-UCB algorithm, we cannot directly adopt the technique. Our proof thus makes a novel use of the Sherman-Morrison formula to overcome the barrier. Please refer to Section D of the appendix for the complete proof. Remark 1. When the variation budget B T is known, Theorem 3 recommends choosing the length w of the sliding window to be decreasing with B T . 
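To make the tuning rule and the optimistic action selection of eq. (6) concrete, the sketch below (our own illustrative code, not the paper's pseudo-code; the helper names and the assumption that the decision set is a finite list of vectors are ours) computes the recommended window length and a single SW-UCB action choice.

```python
import numpy as np

def tuned_window(d, T, B_T=None):
    """Window length suggested by Theorem 3, rounded to a positive integer.

    Known budget:   w ~ d^{2/3} T^{2/3} B_T^{-2/3}.
    Unknown budget: w ~ (d T)^{2/3}, which does not depend on B_T.
    """
    if B_T is not None and B_T > 0:
        w = d ** (2.0 / 3) * T ** (2.0 / 3) * B_T ** (-2.0 / 3)
    else:
        w = (d * T) ** (2.0 / 3)
    return max(1, int(round(w)))

def sw_ucb_action(decision_set, theta_hat, V, beta):
    """One SW-UCB step (eq. (6)): maximize <x, theta_hat> + beta * ||x||_{V^{-1}}
    over a finite decision set, where beta is the confidence radius, e.g.
    R * sqrt(d * ln(T * (1 + w * L**2 / lam))) + sqrt(lam) * S."""
    V_inv = np.linalg.inv(V)
    return max(decision_set,
               key=lambda x: float(x @ theta_hat + beta * np.sqrt(x @ V_inv @ x)))
```

When the decision set is of moderate size, the enumeration above is exactly the optimization in eq. (6); for structured action sets one would replace it with a problem-specific maximization oracle.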
The recommendation is in agreement with the intuition that, when the learning environment becomes more volatile, the DM should focus on more recent observations. Indeed, if the underlying learning environment is changing at a higher rate, then the DM's past observations become obsolete faster. Theorem 3 pins down the intuition of forgetting past observation in face of drifting environments, by providing the mathematical definition of the sliding window length w that yields the optimal regret bound. In Section 6, we have seen that, by properly tuning w, the DM can achieve a first order optimal O d 2/3 (B T + 1)T 2/3 regret bound even if the knowledge of B T is not available. However, in the case of an unknown and large B T , i.e., B T = Ω(T 1/3 ), the bound becomes meaningless as it is linear in T. To handle this case, we wish to design an online algorithm that incurs a dynamic regret of and σ ∈ (0, 1), without knowing B T . Note from Theorem 1, no algorithm can achieve a dynamic regret of order o(d 2/3 B 1/3 T T 2/3 ), so we must have σ ≥ 2 3 . In this section, we develop a novel Bandit-over-Bandits (BOB) algorithm that achieves a regret of T T 3/4 ). Hence, (BOB) still has a dynamic regret sublinear in T when B T = Θ(T ρ ) for any ρ ∈ (0, 1) and B T is not known, unlike the SW-UCB algorithm. Reviewing Theorem 3, we know that setting the window length w to a fixed value can give us a O d 2/3 (B T + 1) 1/3 T 2/3 regret bound. But when B T is not provided a priori, we need to also "learn" the unknown B T in order to properly tune w. In a more restrictive setting in which the differences between consecutive θ t 's follow some underlying stochastic process, one possible approach is applying a suitable machine learning technique to learn the underlying stochastic process at the beginning, and tune the parameter w accordingly. In the more general setting, however, this strategy cannot work as the change between consecutive θ t 's can be arbitrary (or even adversarially) as long as the total variation is bounded by B T . The above mentioned observations as well as the established results motivate us to make use of the SW-UCB algorithm as a sub-routine, and "hedge" (Auer et al. 2002a, Audibert and Bubeck 2009) against the (possibly adversarial) changes of θ t 's to identify a reasonable fixed window length. To this end, we describe the main idea of the Bandit-over-Bandits (BOB) algorithm. As illustrated in Fig. 1 To determine H and J, we first consider the regret of the BOB algorithm. As mentioned above, since w * is not necessarily attainable, i.e., by definition in eq. (12), w * = (dT ) 2/3 (B T + 1) −2/3 might be larger than H when B T is small, we hence denote the optimally (over J) tuned window length as w † . By design of the BOB algorithm, its regret can be decomposed as the regret of the SW-UCB algorithm with the optimally tuned window length w i = w † for each block i plus the loss due to learning the value w † with the EXP3 algorithm, i.e., Here, eq. (13) holds as the BOB algorithm restarts the SW-UCB algorithm in each block, and for a round t in block i, X t (w) refers to the action selected in round t by the SW-UCB algorithm with window length w ∧ (t − (i − 1)H − 1) initiated at the beginning of block i. By Theorem 3, the first expectation in eq. (13) can be upper bounded as where is the total variation in block i. We then turn to the second expectation in eq. (13). 
We can easily see that the number of rounds for the EXP3 algorithm is T /H and the number of possible values of w i 's is |J|. Denoting the maximum absolute sum of rewards of any block as random variable Q, the authors of (Auer et al. 2002a) gives the following regret bound. 14 To proceed, we have to give a high probability upper bound for Q. Lemma 3. With probability at least 1 − 2/T, Q does not exceed H + 2R H ln(T / √ H), i.e., Proof Sketch. The proof makes use of the R-sub-Gaussian property of the noise terms as well as the union bound. Please refer to Section E of the for the complete proof. Note that the regret of our problem is at most T, eq. (15) can be further upper bounded as Combining eq. (13), (14), and (16), the regret of the BOB algorithm is Eq. (17) exhibits a similar structure to the regret of the SW-UCB algorithm as stated in Theorem 3, and this immediately indicates a clear trade-off in the design of the block length H : • On one hand, H should be small to control the regret incurred by the EXP3 algorithm in identifying w † , i.e., the third term in eq. (17). • On the others, H should also be large enough to allow w † to get close to w * = (dT ) 2/3 (B T + 1) −2/3 so that the sum of the first two terms in eq. (17) is minimized. A more careful inspection also reveals the tension in the design of J. Obviously, we hope that |J| is small to minimize the third term in eq. (17), but we also wish J to be dense enough so that it forms a cover to the set H. Otherwise, even if H is large enough that w † can approach w * , approximating w * with any element in J can cause a major loss. These observations suggest the following choice of J. for some positive integer ∆, and since the choice of H should not depend on B T , we can set H = d T α with some α ∈ [0, 1] and > 0 to be determined. We then distinguish two cases depending on whether w * is smaller than H or not (or alternatively, whether (B T + 1) is larger than Under this situation, w † can automatically adapt to the nearly optimal window length clip J (w * ) , where clip J (x) finds the largest element in J that does not exceed x. Notice that |J| = ∆ + 1, the regret of the BOB algorithm then becomes Case 2: w * > H or (B T + 1) < d (2−3 )/2 T (2−3α)/2 . Under this situation, w † equals to H, which is the window length closest to w * , the regret of the BOB algorithm then becomes where we have make use of the fact that (B T + 1) < d (2−3 )/2 T (2−3α)/2 . according to eq. (20). Here we have to emphasize that the choice of w † , α, and are purely for the purpose of analysis, while the only parameters that we need to decide are which clearly do not depend on B T . We are now ready to describe the details of the BOB algorithm. With H, ∆ and J defined as eq. (23), the BOB algorithm additionally initiates the parameter γ = min 1, (∆ + 1) ln(∆ + 1) (e − 1) T /H , s j,1 = 1 ∀j = 0, 1, . . . , ∆. for the EXP3 algorithm (Auer et al. 2002a ). The BOB algorithm then divides the time horizon T into T /H blocks of length H rounds (except for the last block, which can be less than H rounds). At the beginning of each block i ∈ [ T /H ] , the BOB algorithm first sets and then sets j i = j with probability p j,i ∀j = 0, 1, . . . , ∆. The selected window length is thus w i = H j i /∆ . 
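The outer loop can be summarized in the following Python skeleton (a sketch under stated assumptions, not the paper's pseudo-code: the callback run_sw_ucb_block, the simplified reward rescaling, and the standard EXP3 forms used for the probabilities and the weight update are our assumptions standing in for eq. (25) and the exact constants).

```python
import numpy as np

def bob(T, H, Delta, run_sw_ucb_block, rng=np.random.default_rng(0)):
    """Skeleton of the Bandit-over-Bandit outer loop.

    run_sw_ucb_block(i, w) is assumed to restart SW-UCB with window length w on
    block i and return the total reward collected in that block.
    """
    J = [int(np.floor(H ** (j / Delta))) for j in range(Delta + 1)]   # candidate windows H^{j/Delta}
    n_blocks = int(np.ceil(T / H))
    # EXP3 mixing rate as initialized by BOB
    gamma = min(1.0, np.sqrt((Delta + 1) * np.log(Delta + 1) / ((np.e - 1) * n_blocks)))
    s = np.ones(Delta + 1)                                            # EXP3 weights s_{j,i}
    total_reward = 0.0
    for i in range(n_blocks):
        p = (1 - gamma) * s / s.sum() + gamma / (Delta + 1)           # selection probabilities p_{j,i}
        j = rng.choice(Delta + 1, p=p)
        w = J[j]                                                      # w_i = H^{j_i / Delta}
        block_reward = run_sw_ucb_block(i, w)                         # restart SW-UCB on block i
        total_reward += block_reward
        # Rescale the block reward to (roughly) [0, 1] before the EXP3 update;
        # the paper's exact rescaling constants also involve R and T.
        scaled = min(max(0.5 + block_reward / (2 * H), 0.0), 1.0)
        s[j] *= np.exp(gamma * scaled / ((Delta + 1) * p[j]))         # importance-weighted EXP3 update
    return total_reward
```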
Afterwards, the BOB algorithm selects actions X t by running the SW-UCB algorithm with window length w i for each round t in block i, and the total collected reward Finally, the rewards are rescaled by dividing 2H + 4R H ln(T / √ H), and then added by 1/2 so that it lies within [0, 1] with high probability, and the parameter s j i ,i+1 is set to while s u,i+1 is the same as s u,i for all u = j i . The pseudo-code of the BOB algorithm is shown in Algorithm 2. Define distribution (p j,i ) ∆ j=0 by eq. (25), and set j t ← j with probability p j,i . w i ← H j t /∆ , V (i−1)H ← λI. 6: for t = (i − 1)H + 1, . . . , i · H ∧ T do 7: Updateθ t ← V −1 t−1 t−1 s=[(i−1)H+1]∨(t−w i ) X s Y s . 8: Pick action X t ← arg max x∈D t x θ t + x V −1 t−1 R d ln (T (1 + w i L 2 /λ)) + √ λS . Observe Y t = X t , θ t + η t . 10: We are now ready to present the regret analysis of the BOB algorithm. Theorem 4. The dynamic regret of the BOB algorithm is upper bounded as follows. • When B T + 1 ≥ d −1/2 T 1/4 , Proof Sketch. The proof of the theorem essentially follows Section 7.2, and please refer to Section F of the appendix for the complete proof. Remark 2. The block structure and restarting the SW-UCB algorithm with a single window length for each block are essential for the correctness of the BOB algorithm. Otherwise, suppose the DM utilizes the EXP3 algorithm to select the window length w t for each round t, and implements the SW-UCB algorithm with the selected window length without ever restarting it. Instead of eq. (13), the regret of the BOB algorithm is then decomposed as Here, with some abuse of notations, SW-UCB refers to in round t, the DM runs the SW-UCB algorithm with window length w † (w t ) and historical data, e.g., (action, reward) pairs, generated by running the SW-UCB algorithm with window length w † (w τ ) for rounds τ = 1, . . . , t − 1. Same as before, the second term of eq. (27) can be upper bounded as a result of Theorem 3. It is also tempting to apply results from the EXP3 algorithm to upper bound the first term. Unfortunately, this is incorrect as it is required by the adversarial bandits protocol (Auer et al. 2002a ) that the DM and its competitor should receive the same reward if they select the same action, i.e., the reward of SW-UCB w † t−1 τ =1 , w t = w in round t and the reward of SW-UCB {w τ } t−1 τ =1 , w t = w † in round t should be the same for every w. Nevertheless, this is violated as running the SW-UCB algorithm with different window length for previous rounds can generate different (action,reward) pairs, and this results in possibly different estimatedθ t 's for the two SW-UCB algorithms even if both of them use the same window length in round t. Hence, the selected actions and the corresponding rewards by these two instances might also be different. By the careful design of blocks as well as the restarting scheme, the BOB algorithm decouples the SW-UCB algorithm for a block from previous blocks, and thus fixes the above mentioned problem, i.e., the regret of the BOB algorithm is decomposed as eq. (13). Remark 3. The assumptions that the decision sets D t 's and the underlying parameters θ t 's are chosen by an oblivious adversary as well as the i.i.d. noise terms are important. With these, the EXP3 algorithm is also facing an obliviously adversarial environment, and thus satisfies the adversarial bandits protocol (Auer et al. 2002a, Audibert and Bubeck 2009) . To see this, assuming additional access to all the i.i.d. 
(and thus independent of the DMs' actions) noise terms in advance, the adversary can determine the total rewards of each block with respect to each window length in J, independently of the EXP3 algorithm, by running the SW-UCB algorithm with that window length as well as the obliviously chosen D t 's and θ t 's, Remark 4. The structure of the BOB algorithmcan be roughly seen as using a EXP3 algorithm to govern the behaviors of many copies of SW-UCB algorithms with different window lengths. As mentioned in Section 2, it has a flavor similar to the technique of bandits corralling/aggregation, see e.g., (Agarwal et al. 2017 , Besbes et al. 2018 ) and references therein. Existing works, such as (Luo et al. 2018 , Besbes et al. 2018 , have tried to apply bandits corralling/aggregation to the non-stationary bandits settings, but when B T is unknown, they can only obtain either sub-optimal dynamic regret bounds (Luo et al. 2018) or empirical performance guarantees (Besbes et al. 2018 ). In this section, we demonstrate the generality of our established results. As illustrative examples, we apply the sliding window UCB algorithm as well as the bandit-over-bandit framework to the prevalent multi-armed bandits (Auer et al. 2002b ) and generalized linear bandit settings. In what follows, we shall derive the SW-UCB algorithm as well as the parameters required by the BOB algorithm, i.e., similar to those defined in eq. (23), for both cases. We note that same steps can be applied to other settings, such as Gaussian process bandits (Srinivas et al. 2010 ) and combinatorial bandits with sem-bandit feedback (Kveton et al. 2015) , to derive similar results. In the d-armed bandits setting, every action set D t is comprised of d actions e 1 , . . . , e d . The i th action e i has coordinate i equals to 1 and all other coordinates equal to 0. Therefore, the reward of choosing action X t = e I t in round t is where θ t,I t is the I th t coordinate of θ t . For a window length w, we also define N i,t−1 as the number of times that action i is chosen within rounds. (t − w), . . . , (t − 1), i.e., for all i ∈ [d], Here 1[·] is the indicator function. Similar to the procedure in Section 5, we set the regularization parameter λ = 0, and compute the sliding window least squares estimateθ t for θ t in each round, i.e., where V * t−1 is Moore-Penrose pseudo-inverse of V t−1 . We can also derive the deviation inequality type bound for the latent expected reward of every action x ∈ D t in any round t. Theorem 5. For any t ∈ [T ] and any i ∈ [d], we have with probability at least 1 − 1/T, Proof Sketch. The proof essentially follows that of Theorem 2, but it further takes advantage of the fact that all the actions are "1-hot" vectors. Please refer to Section G of the appendix for the complete proof. We can now follow the same principle in Section 6 by choosing in each round the action X t with the highest UCB, i.e., and arrive at the following regret upper bound for the SW-UCB algorithm. Theorem 6. The dynamic regret of the SW-UCB algorithm is upper bounded as When B t (> 0) is known, by taking w = O d 1/3 T 2/3 B −2/3 T , the dynamic regret of the SW-UCB algorithm is When B t is unknown, by taking w = O d 1/3 T 2/3 , the dynamic regret of the SW-UCB algorithm is . Proof Sketch. The proof of this theorem is very similar to that of Theorem 3, and is thus omitted. The key difference is that β (defined in eq. (5) for the linear bandit setting) is now set to R 2 ln (2dT 2 ), and this saves the extra √ d factor presented in eq. 
(60). Hence the dynamic regret bounds can be obtained accordingly. Comparing the results obtained in Theorem 6 to the lower bound presented in , we can easily see that the dynamic regret bound is optimal when B T is known. When B T is unknown, we can implement the BOB algorithmwith the following parameters: The regret of the BOB algorithm for the MAB setting is characterized as follows. Theorem 7. The dynamic regret of the BOB algorithm is upper bounded as follows. • . Proof Sketch. The proof of the theorem uses the same regret decomposition and a similar cutoff for B T (except that the exponent of d is halved) as those of Theorem 4, and it is thus omitted. Letμ(·) andμ(·) denote the first derivative and second derivative of µ(·), respectively, we follow (Filippi et al. 2010 ) to make the following assumptions. Assumption 1. There exists a set of d actions a 1 , . . . , a d ∈ D such that the minimal eigenvalue Assumption 2. The link function µ(·) : R → R is strictly increasing, continuously differentiable, Lipschitz with constant k µ , and we define c µ = inf x∈D,θ∈R d : θ ≤Sμ ( x, θ ) . Assumption 3. There exists Y max > 0 such that for any t ∈ Similar to the procedure in Section 5, we compute the maximum quasi-likelihood estimateθ t for θ t in each round t ∈ [T ] by solving the equation Defining β = 2k µ Y max 2d ln(w) ln(2dT 2 ) (3 + 2 ln (1 + 2L 2 /λ 0 )) c µ , we can also derive the deviation inequality type bound for the latent expected reward of every action x ∈ D t in any round t. Theorem 8. For any t ∈ [T ], we have with probability at least 1 − 1/T, Proof Sketch. The proof is a consequence of Proposition 1 of (Filippi et al. 2010 ) and Theorem 2. Please refer to Section H of the appendix for the complete proof. We can now follow the same principle in Section 6 to design the SW-UCB algorithm. Note that in order for V t−1 to be invertible for all t, our algorithm should select the actions a 1 , . . . , a d every w rounds for some window length w. For the rest of the rounds, it chooses in each round the action X t with the highest UCB, i.e., and arrive at the following regret upper bound. Theorem 9. The dynamic regret of the SW-UCB algorithm is upper bounded as When B t is unknown, by taking w = O (dT ) 2/3 , the dynamic regret of the SW-UCB algorithm is . Proof Sketch. The proof of this theorem is similar to that of Theorem 3, and is thus omitted. The only difference is that we need to include the regret contributed by selecting actions a 1 , . . . , a d every w rounds. But these sums to O (dT /w) , which is dominated by the term O (dT / √ w) . Hence the dynamic regret bounds can be obtained similarly as the linear bandit setting. Since the regret expression is the same as the linear bandit setting, we can implement the BOB algorithm with the same set of parameters as eq. (23), and obtain the same dynamic regret bound as specified in Theorem 4 when B T is unknown. Theorem 10. The dynamic regret of the BOB algorithm is upper bounded as follows. • When B T + 1 ≥ d −1/2 T 1/4 , As a complement to our theoretical results, we conduct numerical experiments on synthetic dataset and the CPRM-12-001: On-Line Auto Lending dataset provided by the Center for Pricing and Revenue Management at Columbia University to compare the regret performances of the SW-UCB algorithm and the BOB algorithm with a modified EXP3.S algorithm analyzed in (Besbes et al. 2018 ) as well as the UCB type algorithms for stationary K-armed and linear bandit settings proposed in (Auer et al. 2002b , Abbasi-Yadkori et al. 
2011 . For synthetic dataset, we first evaluate the growth of regret when T increases, under each of the above-mentioned algorithms in Section 9.1.1. We follow the setup of (Besbes et al. 2018) for fair comparisons Then, in Section 9.1.2, we fix T = 10 5 , and evaluate the behavior of these algorithms across time steps. 9.1.1. The Trend of Regret with Varying T We consider a 2-armed bandit setting, and we vary T from 3 × 10 4 to 2.4 × 10 5 with a step size of 3 × 10 4 . We set θ t to be the following sinusoidal process, i.e., ∀t ∈ [T ], θ t = 0.5 + 0.3 sin (5B T πt/T ) 0.5 + 0.3 sin (π + 5B T πt/T ) . The total variation of the θ t 's across the whole time horizon is upper bounded by We also use i.i.d. normal distribution with R = 0.1 for the noise terms. Known Constant Variation Budget. We start from the known constant variation budget case, i.e., B T = 1, to measure the regret growth of the two optimal algorithms, i.e., the SW-UCB algorithm and the modified EXP3.S algorithm, with respect to the total number of rounds. The log-log plot is shown in Fig. 2(a) . From the plot, we can see that the regret of SW-UCB algorithm is only about 20% of the regret of EXP3.S algorithm. Unknown Time-Dependent Variation Budget. We then turn to the more realistic time-dependent variation budget case, i.e., B T = T 1/3 . As the modified EXP3.S algorithm does not apply to this setting, we compare the performances of the SW-UCB algorithm and the BOB algorithm. The log-log plot is shown in Fig. 2(b) . From the results, we verify that the slope of the regret growth of both algorithms roughly match the established results, and the regret of BOB algorithm's is much smaller than that of the SW-UCB algorithm's. We further denote τ 0 = 1, τ 31 = T . After that, we randomly sample 32 random unit length vectors v 0 , . . . , v 31 ∈ R d . Finally, for each t ∈ [T ], we define θ t as the linear interpolation between v s , v s+1 , where τ s ≤ tτ s+1 . More precisely, we have θ t = ((τ s+1 − t)v s + (t − τ s )v s+1 )/(τ s+1 − τ s ). Note that the random reward in each period can be negative. In what follows, we first evaluate the performance of the algorithms by (Besbes et al. 2018) as well as our algorithms in a 2-armed bandit piece-wise linear instance. Then, we evaluate the performance of our algorithms in a linear bandit piece-wise linear instance, where d = 5, and each D t is a random subset of 40 unit length vectors in R d . We do not evaluate the algorithms by (Besbes et al. 2018) in the second instance, since the algorithms by (Besbes et al. 2018 ) are only designed for the non-stationary K-armed bandit setting. For each instance, each algorithm is evaluated 50 times. Two armed bandits. We first evaluate the performance of the modified EXP.3S in (Besbes et al. 2018 ) as well as the performance of the SW-UCB algorithm, BOB algorithmin a randomly generated 2armed bandit instance . Fig 3(a) illustrates the average cumulative reward earned by each algorithm in the 50 trials, and Fig 3(b) depicts the average dynamic regret incurred by each algorithm in the 50 trials. In Figs 3(a), 3(b) , shorthand SW-UCB-opt is the SW-UCB algorithm , where B T is known and w = w opt is set to be regret-optimal (see Appendix I for the expression of w opt ). Shorthand EXP3.S stands for the modified EXP3.S algorithm by (Besbes et al. 2018) , where B T is known and the learning rate is set to be regret-optimal. Shorthand BOB stands for the BOB algorithm. 
Shorthand SW-UCB-obl is the SW-UCB algorithm , where B T is not known, and w = w obl is obliviously set (see Appendix I for the expression of w obl ). Finally, shorthand UCB stands for the UCB algorithm by (Abbasi-Yadkori et al. 2011) , which is designed for the stationary linear bandit problem. Note that B T is known to SW-UCB-opt, EXP3.S, but not to BOB, SW-UCB-obl, UCB. Overall, we observe that SW-UCB-opt is the better performing algorithm when B T is known, and BOB is the best performing when B T is not known. It is evident from Fig 3(a) that SW-UCB-opt, EXP3.S and BOB are able to adapt to the change in the reward vector θ t across time t. We remark that BOB, which does not know B T , achieves a comparable amount of cumulative reward to EXP3.S, which does know B T , across time. It is also interesting to note that UCB, which is designed for the stationary setting, fails to converge (or even to achieve a non-negative total reward) in the long run, signifying the need of an adaptive UCB algorithm in a non-stationary setting. Linear bandits. Next, we move to the linear bandit case, and we consider the performance of SW-UCB-opt, SW-UCB-obl, BOB and UCB, as illustrated in Figs 4(a), 4(b). While the performance of the algorithms ranks similarly to the previous 2-armed bandit case, we witness that UCB, which is designed for the stationary setting, has a much better performance in the current case than the 2-armed case. We surmise that the relatively larger size of the arm space D t here allows UCB to choose an arm that performs well even when the reward vector is changing. We now conduct experiments on the on-line auto lending dataset, which was first studied by (Phillips et al. 2015) , and subsequently used to evaluate dynamic pricing algorithms by (Ban and Similar to (Ban and Keskin 2018) , we use the first T = 5 × 10 4 arrivals that span 276 days for this experiment, and impute the price of a loan as the net present value of future payments (a function of the monthly payment, customer rate, and term approved, please refer to the cited references for more details). The range of price in our experiment is set to [0, 500] with a step size of 10, and we use the feature selection results in (Ban and Keskin 2018) to pick FICO score, the term of contract, the loan amount approved, prime rate, the type of car, and the competitor's rate as the feature vector for each customer. We adopt the commonly used (Li et al. 2010, Besbes and linear model to interpolate the response of each customer: for a customer with feature x, if price p is offered, she accepts the offer with "probability" θ, [x; px] . Although the customers' responses are binary, i.e., whether or not she accepts the loan, theoretically justified that the revenue loss caused by using this misspecified model is negligible. For the changing environment, we assume that the θ t remains stationary in a single day period, but can change across days, and use linear regression method to recover the underlying θ t 's. The resulted B T is 1.9 × 10 2 (≈ T 0.48 ) , which means we are in the moderately non-stationary environment. We measure the dynamic regrets of the SW-UCB algorithm (known B T and unknown B T ), the In this paper, we developed general data-driven decision-making algorithms with state-of-the-art dynamic regret bounds for non-stationary bandit settings. We characterized the minimax dynamic regret lower bound and presented a tuned Sliding Window Upper-Confidence-Bound algorithm with matching dynamic regret. 
We further proposed the parameter-free bandit-over-bandit framework that automatically adapts to the unknown non-stationarity. Finally, we conducted extensive numerical experiments on both synthetic and real-world data to validate our theoretical results. In the proof, we denote B(1) as the unit Euclidean ball, and λ max (M ) as the maximum eigenvalue of a square matrix M . By folklore, we know that λ max (M ) = max z∈B(1) z M z. In addition, recall the definition s=1∨(t−w) X s X s We prove the Lemma as follows: Equality (39) is by the observation that both sides of the equation is summing over the terms (40) is by the triangle inequality. To proceed with the remaining steps, we argue that, for any index subset S ⊆ {1 ∨ (t − w), . . . , t − 1}, the matrix V −1 t−1 s∈S X s X s is positive semi-definite (PSD). Now, let's denote A = s∈S X s X s . Evidently, matrix A is PSD, while matrix V −1 t−1 is positive definite, and both matrices A, V −1 t−1 are symmetric. Matrices have the same sets of eigenvalues, since these matrices have the same characteristics polynomial (with the variable denoted as η below): are symmetric). Altogether, we have shown that Inequality (41) is by the fact that, for any matrix M ∈ R d×d with λ max (M ) ≥ 0 and any vector y ∈ R d , we have M y ≤ λ max (M ) y . Without loss of generality, assume y = 0. Now, it is evident that (1) M z · y = |λ max (M )| · y = λ max (M ) y . Applying the above claim with M = V −1 t−1 p s=1∨(t−w) X s X s , which is PSD, and y = θ p − θ p+1 demonstrates inequality (41). where inequality (43) is by the property that both matrices V −1 t−1 t−1 s=p+1 X s X s , V −1 t−1 are PSD, as we establish previously. Altogether, the Lemma is proved. Fixed any δ ∈ [0, 1], we have that for any t ∈ [T ] and any x ∈ D t , where inequality (44) uses triangle inequality, inequality (45) follows from Cauchy-Schwarz inequality, and inequality (46) are consequences of Lemmas 1, 2. In the proof, we choose λ so that β ≥ 1, for example by choosing λ ≥ 1/S 2 . By virtue of UCB, the regret in Inequality (47) is by an application of our SW-UCB algorithm established in equation (10). Inequality (48) is by an application of inequality (46), which bounds the difference | X t ,θ t − θ t | from above. By the evident fact that X t ,θ t − θ t ≤ 2, we have Summing equation (49) over 1 ≤ t ≤ T , the regret of the SW-UCB algorithm is upper bounded as What's left is to upper bound the quantity 2β t∈ . Following the trick introduced by the authors of (Abbasi-Yadkori et al. 2011), we apply Cauchy-Schwarz inequality to the term By dividing the whole time horizon into consecutive pieces of length w, we have While a similar quantity has been analyzed by Lemma 11 of (Abbasi-Yadkori et al. 2011), we note that due to the fact that V t 's are accumulated according to the sliding window principle, the key eq. (6) in Lemma 11's proof breaks, and thus the analysis of (Abbasi-Yadkori et al. 2011) cannot be applied here. To this end, we state a technical lemma based on a novel use of the Sherman-Morrison formula. Lemma 5. For any i ≤ T /w − 1, Proof of Lemma 5. For a fixed i ≤ T /w − 1, Note that i · w + 1 ≥ 1 and i · w + 1 ≥ t − w ∀t ≤ (i + 1)w, we have Consider any d-by-d positive definite matrix A and d-dimensional vector y, then by the Sherman-Morrison formula, the matrix is positive semi-definite. Therefore, for a given t, we can iteratively apply this fact to obtain Plugging inequality (57) to (54), we have which concludes the proof. From Lemma 5 and eq. (52), we know that Here, eq. 
(59) follows from Lemma 11 of (Abbasi-Yadkori et al. 2011). Now putting these two parts to eq. (50), we have For any block i, the absolute sum of rewards can be written as i·H∧T t=(i−1)H+1 where we have iteratively applied the triangle inequality as well as the fact that | X t , θ t | ≤ 1 for all t. Now by property of the R-sub-Gaussian (Rigollet and Hütter 2018), we have the absolute value of the noise term η t exceeds 2R √ ln T for a fixed t with probability at most 1/T 2 i.e., Applying a simple union bound, we have Therefore, we have The statement then follows. Following the discussions in Section 7.2, the regret of the BOB algorithm can be decomposed as the regret of an algorithm that optimally (over J) tunes the window length w i = w † for each block i plus the loss due to learning the value w † with the EXP3 algorithm, Here, eq. (64) holds as the BOB algorithm restarts the SW-UCB algorithm in each block, and for a round t in block i, X t (w) refers to the action selected in round t by the SW-UCB algorithm with window length w ∧ (t − (i − 1)H − 1) initiated at the beginning of block i. By Theorem 3, the first expectation in eq. (64) can be upper bounded as where B T (i) = (i·H∧t)−1 t=(i−1)H+1 θ t − θ t+1 is the total variation in block i. We then turn to the second expectation in eq. (64). We can easily see that the number of rounds for the EXP3 algorithm is T /H and the number of possible values of w i 's is |J|. Recall that Q is the maximum absolute sum of rewards of any block, the authors of (Auer et al. 2002a) gives the following regret bound. By Lemma 3, and the fact that the regret of our problem is at most T, eq. (66) can be further upper bounded Combining eq. (64), (65), and (67), as well as the choices of H and J in eq. (23), the regret of the BOB algorithm is Therefore, we have that when B T + 1 ≥ d −1/2 T 1/4 , the BOB algorithm is able to converge to the optimal window length i.e., w † = w * , and the dynamic regret of the BOB algorithm is upper bounded as while if B T + 1 < d −1/2 T 1/4 , the BOB algorithm converges to the window length w † = H, and the dynamic regret is G. Proof of Theorem 5 Similar to eq. (4), we can rewrite the differenceθ t − θ t as We then bound the two terms in eq. (71) separately. For the first term, Here, almost all the steps follow exactly the same arguments as those of eq. (39)-(42), except that in inequality (72), we make the direct observation that where N i,p is the number of times that action e i is selected during rounds 1 ∨ (t − w), . . . , p for all i ∈ [d]. As p ≤ t − 1, we have N i,p ≤ N i,t−1 for all i ∈ [d]. Hence, V * t−1 p s=1∨(t−w) X s X s is a diagonal matrix with all entries less than 1, and its maximal eigenvalue is thus upper bounded by 1. For the second term of eq. (71), we consider for any fixed i ∈ [d], where the first step again use the definition of V * t−1 in eq. (74). Now if N i,t−1 = 0, eq. (76) equals to 0; while if N i,t−1 > 0, we can apply the Corollary 1.7 of (Rigollet and Hütter 2018) to obtain that Hence, with probability at least 1 − 1/dT 2 , for any fixed t ∈ [T ] and any fixed i ∈ [d], where inequality (78) applies the triangle inequality, inequality (79) follows from the Cauchy-Schwarz inequality and inequality (76) and (77), and inequality (80) follows from inequality (73). The statement of the theorem now follows immediately by applying union bound over the decision set and the time horizon as well as the simple observation e i V * t−1 = 1/N i,t−1 . From the proof of Proposition 1 in Filippi et al. 
(2010) , we know that for all x ∈ D µ ( x, θ t ) − µ x,θ t ≤ k µ x G −1 where X s X s µ X s , s 0 θ t + (1 − s 0 )θ t   ds 0 By virtue of the maximum quasi-likelihood estimation, i.e., eq. (31) we have and (81) is (µ ( X s , θ t ) − µ ( X s , θ s )) X s + k µ x G −1 Here, inequality (83) is a consequence of the triangle inequality, inequality (84) again follows from Proposition 1 of Filippi et al. (2010) , inequality (85) is the Cauchy-Schwarz inequality, and the last step uses the fact that G t−1 c µ V t−1 . For the firs quantity, we have X sμ X s ,θ p X s (θ p+1 − θ p ) where inequality (86) is an immediate consequence of the triangle inequality, eq. (87) utilizes the mean value theorem (withθ p being some certain linear combination of θ p and θ p+1 for all p), and inequalities (88) and (89) follow from the same steps as the proof of Lemma 1 in Section B. When B T is known , we select w opt that minimizes the explicit regret bound in (60), resulting in w opt = w B 2/3 T , wherew = d 1/3 T 2/3 2 1/3 L 2/3 R d ln (T + T 2 L 2 /λ) + √ λS 2/3 log 1/3 1 + T L 2 dλ 2 . When B T is not known, we select w obl = w , which is independent of B T . Improved algorithms for linear stochastic bandits Linear thompson sampling revisited Corralling a band of bandit algorithms Thompson sampling for contextual bandits with linear payoffs Minimax policies for adversarial and stochastic bandits The nonstochastic multiarmed bandit problem Using confidence bounds for exploitation-exploration trade-offs Finite-time analysis of the multiarmed bandit problem Personalized dynamic pricing with machine learning Stochastic multi-armed bandit with non-stationary rewards Non-stationary stochastic optimization Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards On the (surprising) sufficiency of linear models for dynamic pricing with demand learning Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. 
Foundations and Trends in Machine Learning Prediction, Learning, and Games Learning to optimize under non-stationarity Online optimization with gradual variations Contextual bandits with linear payoff functions Center for pricing and revenue management datasets Stochastic linear optimization under bandit feedback Parametric bandits: The generalized linear case On upper-confidence bound policies for switching bandit problems Online optimization : Competing with dynamic comparators Multi-armed bandits: Competing with optimal sequences Chasing demand: Learning and earning in a changing environments Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies Tight regret bounds for stochastic combinatorial semi-bandits Bandit Algorithms A contextual-bandit approach to personalized news article recommendation Provably optimal algorithms for generalized linear contextual bandits Efficient contextual bandits in non-stationary worlds The effectiveness of field price discretion: Empirical evidence from auto lending High Dimensional Statistics Linearly parameterized bandits Learning to optimize via posterior sampling Gaussian process optimization in the bandit setting: No regret and experimental design Tracking the best expert in non-stationary stochastic environments The authors would like to express sincere gratitude to Xi Chen, Dylan Foster, Yonatan Gur, Akshay Krishnamurthy, Haipeng Luo, Sasha Rakhlin, Vincent Tan, and Kuang Xu for helpful discussions and comments.The authors also gratefully acknowledge Columbia University Center for Pricing and Revenue Management for providing us the dataset on auto loans. A. Proof of Theorem 1 First, let's review the lower bound of the linear bandits setting. The linear bandits setting is almost identical to ours except that the θ t 's do not vary across rounds, and are equal to the same (unknown) θ, i.e., ∀t ∈[T ] θ t = θ.Lemma 4 ((Lattimore and Szepesvári 2018)). For any Moreover, θ t 's remain fixed within a block, and can vary across different blocks, i.e.,We argue that even if the DM knows this additional information, it still incur a regret Ω(d 2/3 B 1/3 T T 2/3 ). Note that different blocks are completely decoupled, and information is thus not passed across blocks. Therefore, the regret of each block is Ω d √ H , and the total regret is at leastIntuitively, if H, the number of length of each block, is smaller, the worst case regret lower bound becomes larger. But too small a block length can result in a violation of the variation budget. So we work on the total variation of θ t 's to see how small can H be. The total variation of the θ t 's can be seen as the total variation across consecutive blocks as θ t remains unchanged within a single block. Observe that for any pair of θ, θ ∈ ± d/4H d , the 2 difference between θ and θ is upper bounded asand there are at most T /H changes across the whole time horizon, the total variation is at mostBy definition, we require that B ≤ B T , and this indicates thatTaking H = (dT )