key: cord-0625440-w4hp201r
title: Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection
authors: Wang, Yining; Chen, Yi; Fang, Ethan X.; Wang, Zhaoran; Li, Runze
date: 2020-09-04

Abstract. We consider the stochastic contextual bandit problem under the high-dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. However, it is very challenging, as we need to balance exploration and exploitation. We propose doubly growing epochs and estimate the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves $\tilde{\mathcal{O}}(s\sqrt{T})$ regret with high probability, which is nearly independent of the "ambient" regression model dimension $d$. We further attain a sharper $\tilde{\mathcal{O}}(\sqrt{sT})$ regret by using the SupLinUCB framework, matching the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to demonstrate the applicability and robustness of our algorithms empirically.

Contextual bandit problems have received significant attention over the past years in different communities, such as statistics, operations research, and computer science. This class of problems studies how to make optimal sequential decisions with new information in different settings, where we aim to maximize our accumulated reward while iteratively improving our policy given newly observed results. It finds many important modern applications such as personalized recommendation [Li et al., 2010], online advertising, cost-sensitive classification [Agarwal et al., 2014], and personalized medicine [Goldenshluger and Zeevi, 2013, Bastani and Bayati, 2015]. In most contextual bandit problems, at each iteration we first obtain some new information; we then take an action based on a certain policy and observe a new reward. Our goal is to maximize the total reward, iteratively improving the policy along the way.

This paper is concerned with the linear stochastic bandit model, one of the most fundamental models in contextual bandit problems. We model the expected reward at each time period as a linear function of some random information depending on our action. This model receives considerable attention [Auer, 2002, Abe et al., 2003, Dani et al., 2008], and, as we will discuss in more detail, it naturally finds applications in optimal sequential treatment regimes. In a linear stochastic bandit model, at each time period $t \in \{1, 2, \cdots, T\}$, we are given some action space $\mathcal{A}_t$. Each feasible action $i \in \mathcal{A}_t$ is associated with a $d$-dimensional contextual covariate $X_{t,i} \in \mathbb{R}^d$ that is known before any action is taken. Next, we take an action $i_t \in \mathcal{A}_t$ and then observe a reward $Y_t \in \mathbb{R}$. We assume that the reward follows the linear model
$$Y_t = \langle X_{t,i_t}, \theta^* \rangle + \varepsilon_t, \qquad (1)$$
where $\theta^* \in \mathbb{R}^d$ is an unknown $d$-dimensional parameter and $\{\varepsilon_t\}_{t=1}^T$ are noises. Without loss of generality, we assume that we aim to maximize the total reward. We measure the performance of the selected sequence of actions $\{i_t\}_{t=1}^T$ by the accumulated regret
$$R_T(\{i_t\}; \theta^*) := \sum_{t=1}^T \max_{i \in \mathcal{A}_t} \langle X_{t,i}, \theta^* \rangle - \langle X_{t,i_t}, \theta^* \rangle. \qquad (2)$$
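To make the protocol concrete, the following minimal sketch simulates one run of model (1) and accumulates the regret (2). It is an illustration only: the uniformly random policy, the Gaussian draws, and all constants are stand-ins rather than the paper's algorithm or experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, s = 100, 60, 500, 5  # illustrative sizes, not the paper's settings

# Sparse unknown parameter theta*: s nonzero coordinates out of d.
theta_star = np.zeros(d)
theta_star[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)

total_regret = 0.0
for t in range(T):
    X_t = rng.normal(size=(k, d))   # random action set A_t, one covariate per arm
    i_t = rng.integers(k)           # placeholder policy: uniform exploration
    y_t = X_t[i_t] @ theta_star + rng.normal()   # observed reward Y_t
    mean_rewards = X_t @ theta_star              # noiseless rewards <X_{t,i}, theta*>
    total_regret += mean_rewards.max() - mean_rewards[i_t]
```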
Essentially, the regret measures the difference between the "best" we could achieve if we knew the true parameter $\theta^*$ and the "noiseless" reward we actually collect. The regret is always nonnegative, and our goal is to minimize it. Meanwhile, owing to advances in technology, there are many modern applications where we encounter the high-dimensionality issue, i.e., the dimension $d$ of the covariate is large. Note that the total number of iterations $T$ is roughly equivalent to the sample size, that is, the number of pieces of information from which we are able to "learn" the true parameter $\theta^*$. Such high-dimensionality issues arise in practice. For example, given the genomic information of some patients, we aim to find a policy that assigns each patient to the best treatment for him/her; it is usually the case that the number of patients is much smaller than the dimension of the covariate.

In this paper, we consider the linear stochastic contextual bandit problem under such a high-dimensional setting, where the parameter $\theta^*$ is of high dimension in the sense that $d \gg T$. We assume that $\theta^*$ is sparse, with at most $s \ll d$ nonzero components. This assumption is commonly imposed in the high-dimensional statistics and signal processing literature [Donoho, 2006, Bühlmann and Van De Geer, 2011]. In addition, for the action spaces $\{\mathcal{A}_t\}_{t=1}^T$, it is known in the literature [Dani et al., 2008] that if there are infinitely many feasible actions at each iteration, the minimax lower bound is of order $\tilde{\mathcal{O}}(\sqrt{sdT})$, which does not resolve the curse of dimensionality. We therefore assume that the action spaces $\{\mathcal{A}_t\}_{t=1}^T$ are finite, small, and random. In particular, we assume that for all $t$, $|\mathcal{A}_t| = k \ll d$, and each action in $\mathcal{A}_t$ is associated with an independently, randomly generated contextual covariate. In most practical applications, this finite action space setting is naturally satisfied; for example, in the treatment example, there are usually only a small number of feasible treatments available. We refer the reader to Section 3.1 for a complete description and discussion of the assumptions.

We briefly review existing work on linear stochastic bandit problems under both low-dimensional and high-dimensional settings. Under the classical low-dimensional setting, Auer [2002] pioneered the use of upper-confidence-bound (UCB) type algorithms for the linear stochastic bandit; these remain among the most powerful and fundamental algorithms for this class of problems and have been considered in many subsequent works. Dani et al. [2008] study linear stochastic bandit problems with large or infinite action spaces and derive corresponding lower bounds. Under the high-dimensional setting, where $\theta^*$ is assumed sparse and the action spaces $\mathcal{A}_t$ are hyper-cubes $[-1,1]^d$, the SETC algorithm attains nearly dimension-independent regret bounds. We point out that this algorithm exploits the unique structure of hyper-cubes and is unlikely to be applicable to general action spaces, including the ones of our interest where the action spaces are finite. Abbasi-Yadkori et al. [2012] consider a UCB-type algorithm for general action sets and obtain a regret upper bound of $\tilde{\mathcal{O}}(\sqrt{sdT})$, which depends polynomially on the ambient dimension $d$. Here the $\tilde{\mathcal{O}}(\cdot)$ notation is the big-O notation that ignores all logarithmic factors. Carpentier and Munos [2012] consider a different reward model and obtain an $\tilde{\mathcal{O}}(\|\theta^*\|_2^2\, s\sqrt{T})$ regret upper bound.
Goldenshluger and Zeevi [2013] and Bastani and Bayati [2015] study a variant of the linear stochastic bandit problem in which only one contextual covariate $X_t$ is observed at each time period $t$, while each action $i_t$ corresponds to a different unknown model $\theta^*_{i_t}$. This model is a special case of our model (1), as discussed in Foster et al. [2018]. Another closely related problem is the online sparse prediction problem [Gerchinovitz, 2013, Foster et al., 2016], in which sequential predictions $\hat{Y}_t$ of $Y_t = \langle X_t, \theta^* \rangle + \varepsilon_t$ are of interest and the regret is measured in mean-square error $\sum_t |\hat{Y}_t - Y_t|^2$. It can be further generalized to online empirical-risk minimization or even the more general derivative-free/bandit convex optimization [Flaxman et al., 2005, Agarwal et al., 2010, Besbes et al., 2015, Bubeck et al., 2017, Wang et al., 2017]. Most existing works along this direction have continuous (infinite) action spaces $\{\mathcal{A}_t\}$, which allow small-perturbation type methods such as estimated gradient descent. We summarize the existing theoretical results on stochastic linear bandit problems in Table 1 (in our setting, the contextual covariates $X_{t,i}$ associated with each $i \in \mathcal{A}_t$ are randomly generated).

From the application perspective of finding the optimal treatment regime, the existing literature focuses on achieving optimality in batch settings. General approaches include model-based methods such as Q-learning [Watkins and Dayan, 1992, Chakraborty et al., 2010, Goldberg and Kosorok, 2012] and A-learning, and model-free policy search methods [Zhang et al., 2012b, Zhao et al., 2012]. These methods are all developed for batch settings, where the whole dataset is used to estimate the optimal treatment regime. This approach is applicable after a clinical trial is completed, or when the observational dataset is fully available. However, the batch approach is not applicable when it is urgent to identify the optimal treatment regime. For a contemporary example, during the recent outbreak of coronavirus disease (COVID-19), it is extremely important to quickly identify the optimal or nearly optimal treatment regime to assign each patient the best treatment among a few choices. Since the disease is novel, there is no or very little historical data, so the batch approaches mentioned above are not applicable. On the other hand, our model naturally provides a "learn while optimizing" alternative that sequentially improves the policy/treatment regime. This is an important motivating application of the proposed method.

In this paper, we propose new algorithms that iteratively learn the parameter $\theta^*$ while optimizing the regret. Our algorithms use the "doubling trick" and modern optimization techniques, carefully balancing randomization for exploration (to fully learn the parameter) against maximizing the reward (to achieve near-optimal regret). In particular, our algorithms fall under the general family of UCB-type algorithms [Lai and Robbins, 1985, Auer et al., 2002]. Briefly, we start with some pure-exploration periods in which we take random actions uniformly and estimate the parameter. Then, for the next stretch of periods, which we call an epoch, we take the action at each time period by optimizing upper confidence bands built from the previous estimator. At the end of each epoch, we renew the estimator using the new information, enter the next epoch with the new estimator, and renew it again at the end of that epoch.
We repeat this until the $T$-th time period. The high-dimensional regime (i.e., $d \gg T$) poses significant challenges in our setting that cannot be resolved by existing works. First, unlike in the low-dimensional regime, where ordinary least squares (OLS) always admits closed-form solutions and error bounds, in the high-dimensional regime most existing methods such as the Lasso or the Dantzig selector [Candes and Tao, 2007] require the sample covariance matrix to satisfy certain "restricted eigenvalue" conditions [Bickel et al., 2009], which do not hold under our setting with sequentially selected covariates. Additionally, our action spaces $\{\mathcal{A}_t\}$ are finite. This rules out several existing algorithms, including the SETC method, which exploits the specific structure of hyper-cube action sets, and finite-difference type algorithms in stochastic sparse convex optimization [Wang et al., 2017, Balasubramanian and Ghadimi, 2018].

We adopt the best subset selection estimator to derive valid confidence bands using only ill-conditioned sample covariance matrices. While the optimization for best subset selection is NP-hard in theory, the tremendous progress of modern optimization makes solving such problems practically efficient, as discussed in Bertsimas et al. [2016]. In addition, the renewed estimator may correlate with the previous one, which decreases efficiency. We let the epoch sizes grow exponentially, which is known as the "doubling trick" [Auer et al., 1995]; this "removes" the correlation between the support sets recovered by best subset regression. Our theoretical analysis is also motivated by known analytical frameworks such as the elliptical potential lemma [Abbasi-Yadkori et al., 2011] and the SupLinUCB framework [Auer, 2002], which we use to obtain sharp regret bounds. We summarize our main theoretical contribution in the following corollary, which is a simplified version of Theorem 2 in Section 3.3.

Corollary 1. Assume that $|\mathcal{A}_t| = k$, $X_{t,i} \overset{\text{i.i.d.}}{\sim} N(0, I_d)$, and $\varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0, 1)$ for each $1 \le t \le T$. Then there exists a policy $\pi$ such that, with probability at least $1-\delta$, for all $\|\theta^*\|_0 \le s$ and $\|\theta^*\| \le r$,
$$R_T(\{i_t\}; \theta^*) \le \sqrt{sT} \cdot \mathrm{polylog}(k, T, d, r, 1/\delta),$$
where polylog denotes a polynomial of logarithmic factors.

Note that this corollary holds even if $T \ll d$. Theorem 2 provides more general results than this corollary and applies to a broader family of distributions for the contextual covariates $\{X_{t,i}\}$. A simpler and more easily implementable algorithm, with a weaker regret guarantee of $\tilde{\mathcal{O}}(s\sqrt{T})$, is given in Algorithm 1 and analyzed in Theorem 1. The form of the regret guarantee in Corollary 1 is particularly similar to its low-dimensional counterparts in Table 1, with the ambient dimension $d$ replaced by the intrinsic dimension $s$.

Notations. Throughout this paper, for an integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. We use $\|\cdot\|_1$, $\|\cdot\|$, and $\|\cdot\|_\infty$ to denote the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms of a vector, respectively. Given a matrix $A$, we use $\|\cdot\|_A$ to denote the $\ell_2$ norm weighted by $A$; specifically, $\|X\|_A := \sqrt{X^\top A X}$. We use $\langle \cdot, \cdot \rangle$ to denote the inner product of two vectors. Given a set $S \subseteq [d]$, $S^c$ is its complement and $|S|$ denotes its cardinality. Given a $d$-dimensional vector $X$, we use $[X]_i$ to denote its $i$-th coordinate, and $\mathrm{supp}(X)$ to denote the support of $X$, i.e., the collection of indices corresponding to nonzero coordinates. Furthermore, we use $[X]_S = ([X]_i)_{i \in S}$ to denote the restriction of $X$ to $S$, which is a $|S|$-dimensional vector.
Similarly, for a $d \times d$ matrix $A = ([A]_{ij})_{i,j \in [d]} \in \mathbb{R}^{d \times d}$, we denote by $[A]_{SS'} = ([A]_{jk})_{j \in S, k \in S'}$ the restriction of $A$ to $S \times S'$, which is a $|S| \times |S'|$ matrix. When $S = S'$, we abbreviate $[A]_S = [A]_{SS}$. In addition, given two sequences of nonnegative real numbers $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$, $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ mean that there exists an absolute constant $0 < C < \infty$ such that $a_n \le C b_n$ and $a_n \ge C b_n$ for all $n$, respectively. We write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ hold simultaneously. We say that a random event $E$ holds with probability at least $1-\delta$ if there exists some absolute constant $C$ such that the probability of $E$ is larger than $1 - C\delta$. Finally, we remark that arm, action, and treatment all refer to actions in different applications. We denote by $i_t$ the action taken in period $t$ and by $X_t = X_{t,i_t}$ the associated covariate.

Paper Organization. The rest of this paper is organized as follows. We present our algorithms in Section 2. We establish our theoretical results in Section 3 under certain technical conditions. We conduct extensive numerical experiments using both synthetic and real datasets to investigate the performance of our methods in Section 4, and we conclude the paper in Section 5.

In this section, we present the proposed methods for the linear stochastic bandit problem, where we aim to minimize the regret defined in (2). In Section 2.1, we first introduce an algorithm called "SPARSE-LINUCB" (SLUCB), summarized in Algorithm 1, which can be efficiently implemented and demonstrates the core idea of our algorithmic design. The SLUCB algorithm is a variant of the celebrated LinUCB algorithm for classical linear contextual bandit problems. It is intuitive and easy to implement; however, we cannot derive the optimal upper bound for its regret. To close this gap, we further propose a more sophisticated algorithm called "SPARSE-SUPLINUCB" (SSUCB) (Algorithm 2) in Section 2.2. In comparison with SLUCB, the SSUCB algorithm constructs the upper confidence bands by sequentially selecting historical data and achieves the optimal rate (up to logarithmic factors).

As mentioned above, our algorithm is inspired by the standard LinUCB algorithm, which balances the tradeoff between exploration and exploitation following the principle of "optimism in the face of uncertainty". In particular, the LinUCB algorithm repeatedly constructs upper confidence bands for the potential rewards of the actions; these upper confidence bands are optimistic estimators, and we pick the action associated with the largest one. This leads to optimal regret under the low-dimensional setting. However, under the high-dimensional setting, directly applying LinUCB incurs suboptimal regret, since we only obtain loose confidence bands in the high-dimensional regime. It is therefore desirable to construct tight confidence bands under the high-dimensional, sparse setting to achieve optimal regret. Inspired by the remarkable success of best subset selection (BSS) in high-dimensional regression problems, we propose to incorporate this powerful tool into the LinUCB algorithm. Since the BSS procedure is computationally expensive, it is impractical to execute it in every time period; as we discuss later, it is also unnecessary. Before presenting our algorithms, we briefly discuss the support of parameters.
Given a $d$-dimensional vector $\theta$, we denote by $\mathrm{supp}(\theta)$ the support set of $\theta$, namely the collection of coordinates of $\theta$ that are nonzero: $\mathrm{supp}(\theta) = \{j \in [d] : [\theta]_j \neq 0\}$. This definition agrees with most of the literature. However, for the BSS procedure, it is desirable to generalize this definition. We propose the concept of "generalized support" as follows.

Definition 1 (Generalized Support). Given a $d$-dimensional vector $\theta$, we call a subset $S \subseteq [d]$ a generalized support of $\theta$, denoted $\mathrm{supp}^+(\theta)$, if $[\theta]_j = 0$ for all $j \notin S$, i.e., $S \supseteq \mathrm{supp}(\theta)$.

The generalized support $\mathrm{supp}^+(\theta)$ is a relaxation of the usual support: any support is a generalized support, but not vice versa. Moreover, the generalized support is not unique; any superset of the support is a valid generalized support. We distinguish between support and generalized support in order to define best subset selection without causing confusion. For example, consider a linear model with $\theta^* \in \mathbb{R}^d$, $X_t \in \mathbb{R}^d$, and $Y_t = \langle X_t, \theta^* \rangle + \varepsilon_t$. Calculating the ordinary least squares estimator restricted to the generalized support $S \subseteq [d]$, denoted $\hat{\theta}$, means that we fit a low-dimensional model using only the information in $S$ and set the coordinates of the estimator outside $S$ to zero. Formally, $[\hat{\theta}]_{S^c} = 0$ and
$$[\hat{\theta}]_S = \arg\min_{u \in \mathbb{R}^{|S|}} \sum_t \big(Y_t - \langle [X_t]_S, u \rangle\big)^2.$$
Since we do not guarantee $[\hat{\theta}]_j \neq 0$ for all $j \in S$, we call $S$ a generalized support instead of a support.

We are now ready to present the details of the SLUCB algorithm, which works as follows. We first apply the "doubling trick", which partitions the whole horizon of $T$ decision periods into several consecutive epochs whose lengths grow geometrically. We implement the BSS procedure only at the end of each epoch, to recover the support of the parameter $\theta^*$. Each epoch starts with a minimal length of $n_0$ periods of pure exploration, which we call the pure-exploration stage. At each time period $t$ of this stage, we pick an action within $\mathcal{A}_t$ uniformly at random. Intuitively, the pure-exploration stage explores the unknown parameter in all possible directions through uniform randomization. From a technical perspective, the pure-exploration stage keeps the smallest sparse eigenvalue of the empirical design matrix bounded away from zero and hence helps us obtain an efficient estimator of $\theta^*$. After the pure-exploration stage, we keep the estimated support from the previous epoch (of, say, $s$ nonzero components) unchanged and treat the problem as an $s$-dimensional regression. Specifically, at each time period we use the ridge estimator with penalty weight $\lambda$ to estimate $\theta^*$ and construct corresponding confidence bands to guide the decision in the next time period. In summary, we partition the time horizon $[T]$ into consecutive epochs $\{\mathcal{E}_\tau\}_{\tau=1}^{\tilde{\tau}}$ with geometrically growing lengths; the length of the last epoch $\mathcal{E}_{\tilde{\tau}}$ may be less than $2^{\tilde{\tau}}$ or $n_0$. By construction, the number of epochs is $\tilde{\tau} \asymp \log(T)$; hence, the SLUCB algorithm runs the BSS procedure only $\tilde{\mathcal{O}}(\log(T))$ times. In our later simulation studies, we find that this is practical for moderately large dimensions.
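As a concrete illustration of the doubling trick, the sketch below builds one possible epoch schedule. The exact lengths used here, $\max(n_0, 2^{\tau})$ truncated at $T$, are an assumption for illustration; the paper only requires geometrically growing epochs, each opening with a pure-exploration stage of length $n_0$.

```python
def epoch_schedule(T, n0):
    """One possible instantiation of the doubly growing epoch partition.

    Epoch tau is given length max(n0, 2**tau); the last epoch is truncated
    at T, so it may be shorter than 2**tau or n0, as in the paper.
    """
    epochs, start, tau = [], 0, 1
    while start < T:
        length = max(n0, 2 ** tau)
        epochs.append(list(range(start, min(start + length, T))))
        start += length
        tau += 1
    return epochs

# The number of epochs grows like log(T), so the expensive BSS step
# runs only O(log T) times over the whole horizon.
print(len(epoch_schedule(10_000, n0=50)))  # 13 epochs for T = 10,000
```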
Next, we introduce the details of constructing upper confidence bands in the SLUCB algorithm. Suppose that at period $t \in \mathcal{E}_\tau$ we pick action $i_t \in \mathcal{A}_t$ and observe the associated covariate $X_{t,i_t}$ and reward $Y_t$; we abbreviate $X_{t,i_t}$ as $X_t$ when there is no confusion. We denote by $\hat{\theta}_{\tau-1,\lambda}$ the BSS estimator of the true parameter $\theta^*$ at the end of the previous epoch $\mathcal{E}_{\tau-1}$, and let $S_{\tau-1} = \mathrm{supp}^+(\hat{\theta}_{\tau-1,\lambda})$ be its generalized support, i.e., the generalized support recovered by epoch $\mathcal{E}_{\tau-1}$. For periods in the pure-exploration stage of $\mathcal{E}_\tau$, we pick arms uniformly at random and do not update the estimator of $\theta^*$. Beyond the pure-exploration stage, we estimate $\theta^*$ by a ridge estimator. Let $\hat{\theta}^{t-1}_{\tau,\lambda}$ be the most recent ridge estimator of $\theta^*$ by time $t \in \mathcal{E}_\tau$, estimated with its generalized support restricted to $S_{\tau-1}$. In particular, all components of $\hat{\theta}^{t-1}_{\tau,\lambda}$ outside $S_{\tau-1}$ are set to zero, and
$$[\hat{\theta}^{t-1}_{\tau,\lambda}]_{S_{\tau-1}} = \arg\min_{u \in \mathbb{R}^{|S_{\tau-1}|}} \sum_{t' \in \mathcal{E}^{t-1}_\tau} \big(Y_{t'} - \langle [X_{t'}]_{S_{\tau-1}}, u \rangle\big)^2 + \lambda \|u\|^2,$$
where $\mathcal{E}^{t-1}_\tau = \{t' : t' \le t-1, t' \in \mathcal{E}_\tau\}$.

Given $\hat{\theta}^{t-1}_{\tau,\lambda}$, we calculate the upper confidence band of the potential reward $\langle X_{t,i}, \theta^* \rangle$ for each possible action $i \in \mathcal{A}_t$. In particular, let the tuning parameters $\alpha$ and $\beta$ be
$$\alpha = \sigma\nu\sqrt{s\tilde{\tau}} \cdot \log\big(\sigma\rho^{-1}kTd/\delta\big), \qquad \beta = r\sigma \cdot \sqrt{s\log(kTd/\delta)},$$
which correspond to the confidence level and an upper estimate of the potential reward, respectively. We then calculate the upper confidence band associated with action $i$ as
$$\mathrm{UCB}_t(i) = \big\langle X_{t,i}, \hat{\theta}^{t-1}_{\tau,\lambda} \big\rangle + \alpha \big\|[X_{t,i}]_{S_{\tau-1}}\big\|_{(\hat{\Gamma}^{t-1}_{\tau-1,\lambda})^{-1}} + \beta\big/\sqrt{|\mathcal{E}_\tau|},$$
where $\hat{\Gamma}^{t-1}_{\tau-1,\lambda} = \lambda I_{|S_{\tau-1}|} + \sum_{t' \in \mathcal{E}^{t-1}_\tau} [X_{t'}]_{S_{\tau-1}} [X_{t'}]_{S_{\tau-1}}^\top$ is the regularized sample covariance matrix restricted to $S_{\tau-1}$. After that, we pick the arm $i_t$ with the largest upper confidence band and observe the corresponding reward $Y_t = \langle X_{t,i_t}, \theta^* \rangle + \varepsilon_t$. We repeat this process until the end of epoch $\mathcal{E}_\tau$, and then run the BSS procedure on all data collected in $\mathcal{E}_\tau$ to recover the support of $\theta^*$. We also enlarge the size of the generalized support by $s$ at each epoch. To be specific, let $S_\tau$ be the generalized support recovered in this step. We require that $S_\tau$ satisfy the constraints $S_\tau \supseteq S_{\tau-1}$ and $|S_\tau| \le |S_{\tau-1}| + s$, and obtain the BSS estimator as
$$\hat{\theta}_{\tau,\lambda} = \mathrm{Proj}_r\Big\{\arg\min_{\theta}\Big\{\sum_{t \in \mathcal{E}_\tau} \big(Y_t - \langle X_t, \theta \rangle\big)^2 + \lambda\|\theta\|^2 : \mathrm{supp}^+(\theta) = S_\tau \text{ for some admissible } S_\tau\Big\}\Big\},$$
where $\mathrm{Proj}_r\{\cdot\}$ denotes the projection onto a centered $\ell_2$-ball with radius $r$, the $\ell_2$-norm bound of $\theta^*$. Note that, in comparison with the standard BSS estimator, we project the minimizer for regularization; the boundedness also simplifies our later theoretical analysis. We add the inclusion restriction $S_\tau \supseteq S_{\tau-1}$ for technical reasons; it does not lead to any fundamental difference. As a result, we need to work with sparsity $\tau s$ instead of $s$, which boosts the probability of recovering the true support. A pseudo-code description of the SLUCB algorithm is presented in Algorithm 1. We remark that although computing the BSS estimator is computationally expensive, many methods exist that solve this problem efficiently in practice; see, for example, Bertsimas et al. [2016] for more details. We also conduct extensive numerical experiments in Section 4 to demonstrate that our proposed algorithm is practical.
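For concreteness, here is a sketch of a single UCB step of SLUCB after the pure-exploration stage, assuming the generalized support $S$ recovered by the previous epoch is given. The confidence parameter `alpha` is taken as an input; since the bias-correction term $\beta/\sqrt{|\mathcal{E}_\tau|}$ is identical for every arm, it cancels in the argmax and is omitted from the code.

```python
import numpy as np

def slucb_step(X_actions, X_hist, y_hist, S, lam, alpha):
    """One UCB step of (a sketch of) SLUCB.

    X_actions : (k, d) covariates of the current action set A_t.
    X_hist, y_hist : data observed so far in the current epoch.
    S : indices of the generalized support from the previous epoch's BSS step.
    Returns the index of the arm with the largest upper confidence band.
    """
    Xs = X_hist[:, S]                                  # restrict designs to S
    gamma = lam * np.eye(len(S)) + Xs.T @ Xs           # regularized Gram matrix
    theta = np.zeros(X_actions.shape[1])
    theta[S] = np.linalg.solve(gamma, Xs.T @ y_hist)   # ridge fit on S, zeros elsewhere
    A = X_actions[:, S]
    widths = np.sqrt(np.einsum("ij,jk,ik->i", A, np.linalg.inv(gamma), A))
    return int(np.argmax(X_actions @ theta + alpha * widths))
```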
Although the SLUCB algorithm is intuitive and easy to implement, we are unable to prove the optimal upper bound for its regret, for technical reasons. Specifically, as discussed in the next section, we can only establish an $\tilde{\mathcal{O}}(s\sqrt{T})$ upper bound for the regret of the SLUCB algorithm, while the optimal regret should be $\tilde{\mathcal{O}}(\sqrt{sT})$; here we omit all constants and logarithmic factors and consider only the dependency on the horizon length and dimension parameters. The obstacle leading to this sub-optimality is the dependency of the covariates on the random noises. Recall that at each period $t \in \mathcal{E}_\tau$, the SLUCB algorithm constructs the ridge estimator $\hat{\theta}^{t-1}_{\tau,\lambda}$ using all historical data, where the designs $\{X_{t'}\}_{t' \in \mathcal{E}^{t-1}_\tau}$ are correlated with the noises $\{\varepsilon_{t'}\}_{t' \in \mathcal{E}^{t-1}_\tau}$ due to the UCB-type policy. This complicated correlation impedes us from establishing tight confidence bands for the predicted rewards, which results in suboptimal regret.

To close the aforementioned gap and achieve optimality, we adapt the seminal SupLinUCB algorithm [Auer, 2002], originally proposed to attain the optimal regret for the classic stochastic linear bandit problem, as a subroutine in our framework, yielding the SPARSE-SUPLINUCB (SSUCB) algorithm. Specifically, we replace the ridge estimator and UCB-type policy with a modified SupLinUCB procedure. The basic idea of SupLinUCB is to separate the dependent designs into several groups such that, within each group, the designs and noises are independent of each other. The ridge estimators of the true parameter are then calculated on each group individually. Thanks to this independence, we can derive tighter confidence bands by applying sharper concentration inequalities, which ultimately yields the optimal regret. Next, we present the details of the SupLinUCB subroutine and show how to embed it in our framework.

For each period $t \in \mathcal{E}_\tau$ in the UCB stage, the SupLinUCB subroutine maintains groups $\{\Psi^{t-1,\zeta}_\tau\}_{\zeta \in [\tilde{\zeta}]}$, where $\tilde{\zeta} = \lceil \log(\beta T) \rceil$. These groups are initialized by evenly partitioning the periods of the pure-exploration stage and are updated sequentially by the rule introduced below. For each period $t$, we screen the groups $\{\Psi^{t-1,\zeta}_\tau\}$ one by one (in ascending order of the index $\zeta$) to either determine the action to take or eliminate some clearly suboptimal actions. Suppose we are at the $\zeta$-th group. Let $N^{t-1,\zeta}_\tau$ be the set of candidate actions still kept by the $\zeta$-th step, initialized as the whole action space $\mathcal{A}_t$ when $\zeta = 1$. We first calculate the ridge estimator $\hat{\theta}^{t-1,\zeta}_{\tau,\lambda}$ restricted to the generalized support $S_{\tau-1}$, using data from group $\Psi^{t-1,\zeta}_\tau$. Then, for each action $i \in N^{t-1,\zeta}_\tau$, we calculate $\omega^{t-1,\zeta}_{\tau,\lambda}(i)$, the width of the confidence band of the potential reward. Specifically, we have
$$\omega^{t-1,\zeta}_{\tau,\lambda}(i) = \gamma\,\big\|[X_{t,i}]_{S_{\tau-1}}\big\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}},$$
where the tuning parameter $\gamma$ corresponds to the confidence level and $\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda}$ is the regularized sample covariance matrix of group $\Psi^{t-1,\zeta}_\tau$ restricted to $S_{\tau-1}$. Our next step depends on the values of $\omega^{t-1,\zeta}_{\tau,\lambda}(i)$.

If $\omega^{t-1,\zeta}_{\tau,\lambda}(i) \le 1/\sqrt{T}$ for all $i \in N^{t-1,\zeta}_\tau$, meaning that the widths of the confidence bands are uniformly small, we pick the action with the largest upper confidence band $\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} \rangle + \omega^{t-1,\zeta}_{\tau,\lambda}(i)$ (recall that $\beta = r\sigma\sqrt{s\log(kTd/\delta)}$ is an upper estimate of the potential rewards). In this case, we discard the data point $(X_t, Y_t)$ and do not update any group, i.e., we set $\Psi^{t,\zeta}_\tau = \Psi^{t-1,\zeta}_\tau$ for all $\zeta \in [\tilde{\zeta}]$. Otherwise, if there exists some $i \in N^{t-1,\zeta}_\tau$ such that $\omega^{t-1,\zeta}_{\tau,\lambda}(i) \ge 2^{-\zeta}\beta$, meaning that some confidence band is not sufficiently narrow, we pick such an action $i$ to play for exploration. In this case, we add the period $t$ to the $\zeta$-th group while keeping all other groups unchanged, i.e., $\Psi^{t,\zeta}_\tau = \Psi^{t-1,\zeta}_\tau \cup \{t\}$ and $\Psi^{t,\zeta'}_\tau = \Psi^{t-1,\zeta'}_\tau$ for $\zeta' \neq \zeta$. Finally, if neither of the above scenarios happens, which implies that $\omega^{t-1,\zeta}_{\tau,\lambda}(i) \le 2^{-\zeta}\beta$ for all $i \in N^{t-1,\zeta}_\tau$, we do not take any action for now. Instead, we eliminate some clearly suboptimal actions and move to the next group $\Psi^{t-1,\zeta+1}_\tau$. In particular, we update the set of candidate arms as
$$N^{t-1,\zeta+1}_\tau = \Big\{ i \in N^{t-1,\zeta}_\tau : \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} \big\rangle + \omega^{t-1,\zeta}_{\tau,\lambda}(i) \ge \max_{j \in N^{t-1,\zeta}_\tau} \Big( \big\langle X_{t,j}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} \big\rangle + \omega^{t-1,\zeta}_{\tau,\lambda}(j) \Big) - 2^{1-\zeta}\beta \Big\}.$$
We repeat the above procedure until an arm is selected. Since the number of groups is $\tilde{\zeta} = \lceil \log(\beta T) \rceil$ and $2^{-\tilde{\zeta}}\beta \le 1/T \le 1/\sqrt{T}$, the SupLinUCB subroutine always terminates. Finally, by replacing the direct ridge regression and UCB-type policy with the SupLinUCB subroutine above, we obtain the SSUCB algorithm; its pseudo-code is presented in Algorithm 2.
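The screening loop can be summarized as below. This is a schematic under stated assumptions: `ridge_fit` and `width` are hypothetical helpers standing in for the per-group restricted ridge estimate and the width $\omega$, and the bookkeeping of group initialization and data discarding is left to the caller.

```python
import numpy as np

def suplinucb_screen(groups, arms, beta, T, ridge_fit, width):
    """Schematic of the SupLinUCB screening for one period.

    groups : list of period-index lists; by construction, group zeta's noises
             are independent of its own design matrix.
    arms   : (k, d) covariates of the candidate actions in A_t.
    Returns (chosen arm, group index to record the period in, or None).
    """
    cand = list(range(len(arms)))
    for zeta, group in enumerate(groups, start=1):
        theta = ridge_fit(group)                      # per-group restricted ridge
        w = np.array([width(group, arms[i]) for i in cand])
        ucb = arms[cand] @ theta + w
        if np.all(w <= 1.0 / np.sqrt(T)):             # case 1: all bands tight
            return cand[int(np.argmax(ucb))], None    # exploit; discard the data
        loose = [c for c, wi in zip(cand, w) if wi >= 2.0 ** -zeta * beta]
        if loose:                                     # case 2: explore a wide arm
            return loose[0], zeta - 1                 # record period in group zeta
        keep = ucb >= ucb.max() - 2.0 ** (1 - zeta) * beta
        cand = [c for c, k_ in zip(cand, keep) if k_]  # case 3: eliminate, descend
    return cand[0], None  # unreachable in theory: 2**-zeta_max * beta <= 1/sqrt(T)
```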
In this section, we present the theoretical results for the SLUCB and SSUCB algorithms. We use the regret, a standard performance measure in the literature, to evaluate our algorithms. We denote by $\{i_t\}_{t \in [T]}$ the action sequence generated by an algorithm. Given the true parameter $\theta^*$ and covariates $\{X_{t,i}\}_{t \in [T], i \in \mathcal{A}_t}$, the regret of the sequence $\{i_t\}_{t \in [T]}$ is defined as
$$R_T(\{i_t\}; \theta^*) = \sum_{t=1}^T \big\langle X_{t,i_t^*} - X_{t,i_t}, \theta^* \big\rangle, \qquad (4)$$
where $i_t^* = \arg\max_{i \in \mathcal{A}_t} \langle X_{t,i}, \theta^* \rangle$ denotes the optimal action under the true parameter. The regret measures the discrepancy in accumulated reward between the actions actually taken and the oracle that knows the true parameter. In what follows, we first introduce some technical assumptions in Section 3.1 to facilitate our discussion; we then study the regrets of the SLUCB and SSUCB algorithms in Sections 3.2 and 3.3, respectively.

We now present the assumptions used in our theoretical analysis and discuss their relevance and implications. We consider finite action spaces $\{\mathcal{A}_t\}_{t=1}^T$ and assume that there exists some constant $k$ such that $|\mathcal{A}_t| = k$ for all $t \in [T]$. We also assume that for each period $t$, the covariates $\{X_{t,i}\}_{i \in \mathcal{A}_t}$ are sampled independently from an unknown distribution $P_0$, on which we impose the following assumptions.

Assumption 1. Let the random vector $X \in \mathbb{R}^d$ follow the distribution $P_0$. Then $X$ satisfies: (A1) (Sub-Gaussianity): $X$ is centered and sub-Gaussian with parameter $\sigma^2 \ge 1$, i.e., $\mathbb{E}[X] = 0$ and $\mathbb{E}[\exp(\langle u, X \rangle)] \le \exp(\sigma^2 \|u\|^2 / 2)$ for all $u \in \mathbb{R}^d$; (A2) (Non-degeneracy): there exists a constant $\rho \in (0, \sigma]$ such that $\mathbb{E}[[X]_j^2] \ge \rho$ for all $j \in [d]$; (A3) (Independent coordinates): the $d$ coordinates of $X$ are independent of each other.

We briefly discuss Assumption 1. First, (A1) is a standard assumption in the literature, with sub-Gaussianity covering a broad family of distributions, including Gaussian, Rademacher, and bounded distributions. Assumption (A2) is a non-degeneracy assumption which, together with (A3), implies that the smallest eigenvalue of the population covariance matrix $\mathbb{E}[XX^\top]$ is lower bounded by a constant $\rho > 0$. Similar assumptions are adopted in the high-dimensional statistics literature to prove "restricted eigenvalue" conditions for sample covariance matrices, which are essential in the analysis of penalized least squares methods [Bickel et al., 2009]. We emphasize, however, that in this paper the covariates indexed by the selected actions $\{i_t\}$ do not in general satisfy the restricted eigenvalue condition, and we therefore need novel, non-standard analyses of high-dimensional M-estimators. Regarding Assumption (A3), at a high level, independence among coordinates enables relatively independent exploration of different dimensions, similar to the key idea behind the SETC method. Technically, (A3) is used to establish the key independence between the sample covariance matrices restricted within and outside the recovered support. Owing to this independence, the rewards in the unexplored directions at each period are independent as well, and can be estimated efficiently.

We also impose the following assumptions on the unknown $d$-dimensional true parameter $\theta^*$.

Assumption 2. (B1) (Sparsity): the true parameter $\theta^*$ is sparse; that is, there exists $s \ll d$ such that $|\mathrm{supp}(\theta^*)| = s$. (B2) (Boundedness): there exists a constant $r \ge 1$ such that $\|\theta^*\| \le r$.

In Assumption 2, (B1) is the key sparsity assumption, stating that only $s \ll d$ components of $\theta^*$ are nonzero. Assumption (B2) is a boundedness condition on the $\ell_2$-norm of $\theta^*$; it is often imposed, either explicitly or implicitly, in contextual bandit problems to derive an upper bound on the rewards [Dani et al., 2008].
Finally, we impose a sub-Gaussian assumption on the noise sequence $\{\varepsilon_t\}_{t=1}^T$, which is standard in most of the statistics and bandit literature.

Assumption 3. (C1) (Sub-Gaussian noise): the random noises $\{\varepsilon_t\}_{t=1}^T$ are independent, centered, and sub-Gaussian with parameter $\nu^2 \ge 1$.

In this section, we analyze the performance of the SLUCB algorithm, measured via the regret defined in (4). We show that with a tailored choice of the pure-exploration-stage length $n_0$ and tuning parameters $\alpha$ and $\beta$, the accumulated regret of the SLUCB algorithm is upper bounded by $\tilde{\mathcal{O}}(s\sqrt{T})$ (up to logarithmic factors) with high probability. Formally, we have the following theorem.

Theorem 1. For any $\delta \in (0,1)$, let
$$n_0 \asymp \rho^{-1} (\nu\sigma\tilde{\tau})^2 s^3 \log^4\big(kTd/(\delta\lambda)\big), \quad \alpha = \nu\Big(\sqrt{s\log\big(kTd/(\delta\lambda)\big)} + \lambda^{1/2} r\Big), \quad \beta = r\sigma\sqrt{s\log(kTd/\delta)}.$$
Under Assumptions 1-3, the regret of the action sequence $\{i_t\}_{t=1}^T$ generated by the SPARSE-LINUCB algorithm is upper bounded by $\tilde{\mathcal{O}}(s\sqrt{T})$ with probability at least $1-\delta$.

If we omit all constants and logarithmic factors, the dominating part of the accumulated regret in Theorem 1 is of order $s\sqrt{T}$. Theorem 1 builds on a nontrivial combination of the UCB-type algorithm and the best subset selection method; to save space, we defer the complete proof to Appendix A.

In comparison with the SLUCB algorithm, the SSUCB algorithm splits the historical data dynamically into several groups. In each period, we sequentially update the ridge estimator and the corresponding confidence bands using data from a single group instead of the whole dataset. The motivation for using a single group of data is to achieve independence between the design matrix and the random noises within each group, which yields tighter confidence bands via a sharper concentration inequality. The tighter upper bounds on the predicted rewards lead to an improved regret. In particular, we have the following theorem.

Theorem 2. For any $\delta \in (0,1)$, let
$$n_0 \asymp \rho^{-1} (\nu\sigma\tilde{\tau})^2 s^3 \log^4\big(kTd/(\delta\lambda)\big), \quad \beta = \sigma r \sqrt{s\log(kTd/\delta)}, \quad \gamma = \sigma(\lambda^{1/2} + \nu + \sigma)\log^{3/2}(kTd/\delta).$$
Then under Assumptions 1-3, the regret of the action sequence $\{i_t\}_{t=1}^T$ generated by the SPARSE-SUPLINUCB algorithm is upper bounded by $\tilde{\mathcal{O}}(\sqrt{sT})$ with probability at least $1-\delta$.

If we omit all constants and logarithmic factors, the dominating part of the regret upper bound in Theorem 2 is of order $\tilde{\mathcal{O}}(\sqrt{sT})$. This improves the rate in Theorem 1 by a factor of $\tilde{\mathcal{O}}(\sqrt{s})$ and achieves the optimal rate (up to logarithmic factors). Theorem 2 builds on a tailored analysis of the SupLinUCB subroutine; the complete proof is deferred to Appendix B.

In this section, we use extensive numerical experiments to demonstrate the efficiency of the proposed algorithms and validate our theoretical results. To simplify implementation, we only test the SLUCB algorithm, which can be implemented efficiently. Theoretically, the SLUCB algorithm achieves only a suboptimal regret guarantee due to technical reasons; however, extensive numerical experiments suggest that it attains near-optimality in practice. Specifically, we first empirically verify the regret guarantees established in Section 3 via simulations under various settings; we then apply the proposed method to a sequential treatment assignment problem using a real medical dataset. In the simulation studies, we sample the contextual covariates from the $d$-dimensional standard Gaussian distribution. For the real dataset, we sample the covariates from the empirical distribution, which may not be Gaussian, and some assumptions in Section 3.1 are violated.
However, in all scenarios our algorithm performs well and achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret, which demonstrates its robustness and practicability.

In this section, we present the details of our simulation studies. We first show the $\tilde{\mathcal{O}}(\sqrt{T})$ growth rate of the regret empirically. We then fix the horizon length $T$ and dimension $d$ and study the dependency of the accumulated regret on the sparsity $s$. To demonstrate the power of best subset selection, we also compare our algorithm's performance with the oracle, in which the decision-maker knows the true support of the underlying parameter. Since the bottleneck of the computing time in our algorithm is the best subset selection step, which requires solving a mixed-integer program, it is appealing to replace this step with other variable selection methods, such as the Lasso and iterative hard thresholding (IHT) [Blumensath and Davies, 2009]; we therefore also test the performance of these variants. We fix the number of actions $k = 60$ in all settings.

In the first experiment, we study the growth rate of the regret. We run two sets of experiments: in the first case, $d = 100$, $T = 1300$, and $s = 5, 10, 15$; in the second case, $d = 300$, $T = 1970$, and $s = 15, 20, 25$. For each setup, we replicate 20 runs and calculate the corresponding mean and 90%-confidence interval. The results are presented in Figure 1. For each fixed $d$ and $s$, the growth rate of the regret is about $\tilde{\mathcal{O}}(\sqrt{T})$, which validates our theory.

In the second experiment, we fix the dimension $d$ and horizon length $T$ and let the sparsity $s$ vary, calculating the accumulated regret at the end of the horizon. We again run two sets of experiments: in the first case, $d = 100$, $T = 1300$, and $s = 5, 8, 11, \cdots, 23$; in the second case, $d = 300$, $T = 1970$, and $s = 5, 8, 11, \cdots, 23$. The results are presented in Figure 2. (Figure 2: accumulated regret versus the sparsity $s$; in panel (a), $d = 100$ and $T = 1300$; in panel (b), $d = 300$ and $T = 1970$.) Although Theorem 1 only provides an $\tilde{\mathcal{O}}(s\sqrt{T})$ regret guarantee for the SLUCB algorithm, the linear dependency of the accumulated regret on $\sqrt{s}$ suggests that it actually attains the optimal $\tilde{\mathcal{O}}(\sqrt{sT})$ rate in practice.

In the third experiment, we compare the performance and computing time of our algorithm with several variants that substitute the best subset selection subroutine with the Lasso or IHT, as well as with the oracle regret where the decision-maker knows the true support of the parameter. In more detail, for the first variant, we use the Lasso to recover the support of the true parameter at the end of each epoch; we tune the $\ell_1$-penalty $\lambda$ so that the size of the support of the estimator is approximately equal to $s$, and then use that support in the next epoch. For the second variant, we apply IHT to estimate the parameter, with the hard threshold set to $s$.
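For reference, this IHT variant can be sketched in a few lines; the fixed step size $1/\|X\|_{\mathrm{op}}^2$ and the iteration count are illustrative choices rather than the tuning used in the experiments.

```python
import numpy as np

def iht(X, y, s, iters=200):
    """Iterative hard thresholding: an s-sparse surrogate for the BSS step."""
    step = 1.0 / (np.linalg.norm(X, ord=2) ** 2 + 1e-12)  # 1 / ||X||_op^2
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - step * (X.T @ (X @ theta - y))    # gradient step on LS loss
        keep = np.argsort(np.abs(theta))[-s:]             # s largest coordinates
        sparse = np.zeros_like(theta)
        sparse[keep] = theta[keep]
        theta = sparse                                    # hard-threshold to s-sparse
    return theta
```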
We run two settings of experiments, corresponding to $d = 100$, $s = 15$, $T = 1300$ and $d = 300$, $s = 15$, $T = 1300$, and again replicate 20 runs in each setting. For the first case, the average computing times are 4.58 seconds for the Lasso, 34.62 seconds for IHT, and 310.61 seconds for best subset selection. For the second case, the average computing times are 9.10 seconds for the Lasso, 39.47 seconds for IHT, and 516.92 seconds for best subset selection. We display the associated regret curves in Figure 3. Although the computing time of the Lasso is the shortest of the three methods, its performance is significantly weaker. The computing time of IHT is slightly longer than that of the Lasso, but it achieves performance similar to best subset selection, which suggests that IHT may be a good alternative in practice when computing resources are limited. Finally, although best subset selection has the longest computing time, its performance is the best: its regret almost matches that of the oracle. This result demonstrates the power of best subset selection in high-dimensional stochastic bandit problems.

In this section, we apply our method to a sequential treatment assignment problem based on the ACTG175 dataset from the R package "speff2trial". The scenario is that patients arrive at a clinic sequentially. Several candidate treatments are available, and different treatments may have different effects. For each patient, the doctor observes his/her health information and then chooses a treatment, with the goal of maximizing the total treatment effect. We first briefly introduce the ACTG175 dataset [Hammer et al., 1996]. This dataset records the results of a randomized experiment on 2139 HIV-infected patients, together with the background information of those patients. The patients are randomly assigned one of four treatments: zidovudine (AZT) monotherapy, AZT + didanosine (ddI), AZT + zalcitabine (ddC), and ddI monotherapy. The treatment effect is measured via the number of CD8 T cells after 20 weeks of therapy. In addition, an 18-dimensional feature vector is provided as background information for each patient, including age, weight, gender, drug history, etc.

As in the simulation studies, we validate our algorithm via the regret. In practice, however, the potential outcomes of unassigned treatments are unavailable for each patient, which prevents us from calculating the regret directly. To overcome this difficulty, we assume a linear model for the treatment effect of each therapy. In particular, we assume that the effect of treatment $i$ (the number of CD8 T cells) is $\langle X, \theta_i^* \rangle + \varepsilon$, where $X$ denotes the background information of the patient, $\varepsilon$ is a standard Gaussian noise, and $\theta_i^*$ is estimated via linear regression. Similar assumptions are commonly adopted in the statistics literature to compute counterfactuals when real experiments are unavailable [Zhang et al., 2012a].

The above sequential treatment assignment problem can easily be transformed into our formulation by considering a joint covariate and action space. Given a patient with background $X$, let $X^{(1)} = (X, 0, 0, 0)$, $X^{(2)} = (0, X, 0, 0)$, $X^{(3)} = (0, 0, X, 0)$, $X^{(4)} = (0, 0, 0, X)$, and $\theta^* = (\theta_1^*, \theta_2^*, \theta_3^*, \theta_4^*)$. Since $\langle X^{(i)}, \theta^* \rangle = \langle X, \theta_i^* \rangle$, assigning a treatment to a patient is equivalent to picking an action from the space $\mathcal{A} = \{X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}\}$. Note that with this formulation, Assumption 1 is violated; nevertheless, the numerical results below still demonstrate the good performance of our algorithm.
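The reduction to the joint action space is mechanical; the sketch below builds the four block covariates $X^{(1)}, \ldots, X^{(4)}$ for a given patient vector (the function and variable names are ours, introduced for illustration):

```python
import numpy as np

def joint_covariates(x, n_treatments=4):
    """Embed a patient covariate x into one zero-padded block per treatment.

    Row i is X^(i) = (0, ..., x, ..., 0), so <X^(i), theta*> = <x, theta_i*>
    and choosing a treatment becomes choosing an arm in one linear bandit
    with the stacked parameter theta* = (theta_1*, ..., theta_4*).
    """
    p = len(x)
    actions = np.zeros((n_treatments, n_treatments * p))
    for i in range(n_treatments):
        actions[i, i * p:(i + 1) * p] = x
    return actions

x = np.random.default_rng(1).normal(size=9)  # stand-in for the 9 clinical covariates
A_t = joint_covariates(x)                    # the action set for this patient
```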
In this experiment, for simplicity, we pick 9 coordinates from the whole dataset as the covariate (age, weight, drugs in history, Karnofsky score, number of days of previously received antiretroviral therapy, antiretroviral history stratification, gender, CD4 T cell count at baseline, and CD8 T cell count at baseline) and denote it by $X$. We then use the real dataset to estimate the true parameters $\theta_i^*$, $i = 1, 2, 3, 4$, i.e., the treatment effect of each therapy. To test our algorithm in the high-dimensional setting, we further concatenate $X$ with a 41-dimensional random noise vector drawn from the standard Gaussian distribution; the parameters corresponding to those injected noise coordinates are zeros. For each patient, we use the estimated parameters to calculate the potential treatment effects and the regret. We choose the total number of patients to be $T = 2600$. As in the simulation studies, we compare our algorithm's performance with the other variants (Lasso, IHT, and oracle). For each algorithm, we repeat 20 runs and calculate the mean regret curve; the results are presented in Figure 4. Compared with the simulation studies, the performance of the Lasso is still the worst, although the discrepancy is much smaller. While there is almost no difference between IHT and best subset selection here, in the real-data tests we occasionally observe instances where the IHT algorithm fails to converge, indicating a lack of robustness. Again, our proposed algorithm exhibits an $\tilde{\mathcal{O}}(\sqrt{T})$ growth rate of regret and stays quite close to the oracle, so the BSS-based algorithm again achieves near-optimality. Since the technical assumptions of Section 3.1 are violated in this experiment, these numerical results demonstrate the robustness of our method.

In this paper, we first propose a method for the high-dimensional stochastic linear bandit problem by combining the best subset selection method with the LinUCB algorithm. It achieves an $\tilde{\mathcal{O}}(s\sqrt{T})$ regret upper bound that is nearly independent of the ambient dimension $d$ (up to logarithmic factors). To attain the optimal regret $\tilde{\mathcal{O}}(\sqrt{sT})$, we further improve our method by adapting the SupLinUCB framework.

Proof of Theorem 1. The proof consists of four major steps. First, we quantify the performance of the BSS estimator obtained at the end of each epoch. In particular, we estimate the $\ell_2$-norm of the true parameter $\theta^*$ outside $S_\tau$, the generalized support recovered by epoch $\mathcal{E}_\tau$; this measures the discrepancy between $S_\tau$ and the true support of $\theta^*$ and hence the quality of the BSS estimator. Second, within epoch $\mathcal{E}_\tau$, for each period $t$, we analyze the error of the restricted ridge estimator $\hat{\theta}^{t-1}_{\tau,\lambda}$; based on this, we upper bound the predictive errors of the potential rewards and build the corresponding upper confidence bounds. Third, similar to the classic analysis of UCB-type algorithms for linear bandit problems, we establish an elliptical potential lemma to upper bound the sum of the confidence-band widths. Finally, we combine these results to show that the regret of the SLUCB algorithm is of order $\tilde{\mathcal{O}}(s\sqrt{T})$. In what follows, we present the details of each step.

Step 1. Upper bound the error of the BSS estimator. In this step, we analyze the error of the BSS estimator obtained at the end of each epoch. Recall that $S_\tau = \mathrm{supp}^+(\hat{\theta}_{\tau,\lambda})$ is the generalized support of the BSS estimator $\hat{\theta}_{\tau,\lambda}$. We have the following lemma.

Lemma 1. If $n_0 \gtrsim \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4\big(kTd/(\delta\lambda)\big)$, then with probability at least $1-\delta$, for each epoch $\tau \in [\tilde{\tau}]$,
$$\big\| [\theta^*]_{S_\tau^c} \big\| = \tilde{\mathcal{O}}\big(\sqrt{s/|\mathcal{E}_\tau|}\big).$$

Note that when $\tau$ is large, $|\mathcal{E}_\tau| \asymp T$. Hence, Lemma 1 states that the $\ell_2$-norm of the unselected part of the true parameter $\theta^*$ decays at a rate of $\tilde{\mathcal{O}}(\sqrt{s/T})$. Recall that in the next epoch $\tau + 1$, we restrict the estimation to $S_\tau$. Failing to identify the exact support of $\theta^*$ thus incurs a regret of roughly $\tilde{\mathcal{O}}(\sqrt{s/T})$ per period, which we interpret as the "bias" part of the total regret. Formally, by Lemma 1 and Proposition 1, for each covariate $X_{t,i}$ we upper bound the inner product of $[\theta^*]_{S_\tau^c}$ and $[X_{t,i}]_{S_\tau^c}$ via the following corollary.

Corollary 2.
Under the condition of Lemma 1, with probability at least $1-\delta$, for all $\tau \in [\tilde{\tau}]$, $t \in \mathcal{E}_\tau$, and $i \in \mathcal{A}_t$, we have
$$\big| \big\langle [X_{t,i}]_{S_\tau^c}, [\theta^*]_{S_\tau^c} \big\rangle \big| = \tilde{\mathcal{O}}\big(\sqrt{s/|\mathcal{E}_\tau|}\big).$$
Essentially, Corollary 2 implies that the regret incurred by inexact recovery of the support of the true parameter at a single period is of order $\tilde{\mathcal{O}}(\sqrt{s/T})$. This result is critical for our later regret analysis.

Step 2. Upper bound the prediction error. In this step, for each time period $t \in \mathcal{E}_\tau$ and arm $i \in \mathcal{A}_t$, we establish an upper bound for the prediction error $|\langle X_{t,i}, \hat{\theta}^{t-1}_{\tau,\lambda} - \theta^* \rangle|$. In particular, we have the following lemma.

Lemma 2. If $n_0 \gtrsim \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4\big(kTd/(\delta\lambda)\big)$, then with probability at least $1-\delta$, for all $\tau \in [\tilde{\tau}]$, $t \in \mathcal{E}_\tau$, and $i \in \mathcal{A}_t$, we have
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1}_{\tau,\lambda} - \theta^* \big\rangle \big| = \tilde{\mathcal{O}}\big(\sqrt{s/|\mathcal{E}_\tau|}\big) + \alpha\, \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1}_{\tau-1,\lambda})^{-1}},$$
where $\hat{\Gamma}^{t-1}_{\tau-1,\lambda} = \lambda I_{|S_{\tau-1}|} + \sum_{t' \in \mathcal{E}^{t-1}_\tau} [X_{t'}]_{S_{\tau-1}} [X_{t'}]_{S_{\tau-1}}^\top$ and $\mathcal{E}^{t-1}_\tau = \{t' : t' \le t-1, t' \in \mathcal{E}_\tau\}$.

The upper bound in Lemma 2 consists of two parts. The first term, of order $\tilde{\mathcal{O}}(\sqrt{s/T})$, corresponds to the bias of the estimator $\hat{\theta}^{t-1}_{\tau,\lambda}$, caused by the inexact recovery of $\mathrm{supp}(\theta^*)$. The second term, determined by the variability of $[X_{t,i}]_{S_{\tau-1}}$ and the sample covariance matrix $\hat{\Gamma}^{t-1}_{\tau-1,\lambda}$, corresponds to the variance.

Step 3. Elliptical potential lemma. In this step, we establish an elliptical potential lemma, which is a powerful tool in bandit problems [Auer, 2002, Filippi et al., 2010], to upper bound the sum of the prediction errors over all periods; specifically, we bound the sum over all periods of the right-hand side of the inequality in Lemma 2. Recall that in each epoch, the arms are picked uniformly at random in the first $n_0$ periods.

Lemma 3. Let two constants $p, q$ satisfy $1 \le p \le q$. Then with probability at least $1-\delta$, the sum over all periods of the truncated widths $\min\{1, \|[X_t]_{S_{\tau-1}}\|_{(\hat{\Gamma}^{t-1}_{\tau-1,\lambda})^{-1}}\}$ is bounded by a term linear in $n_0$ plus a term of order $\tilde{\mathcal{O}}(\sqrt{sT})$.

The upper bound in Lemma 3 consists of two terms. The first is linear in $n_0$, the length of the pure-exploration stage. The second, of order $\tilde{\mathcal{O}}(\sqrt{sT})$, reflects the trade-off between exploration and exploitation. In the proof below, we show that this part leads to the order-$\tilde{\mathcal{O}}(s\sqrt{T})$ regret under a tailored choice of $p$ and $q$.

Step 4. Putting everything together. Based on Lemmas 1, 2, and 3, we are ready to prove Theorem 1. For each $t \in [T]$ and $i \in \mathcal{A}_t$, let $r(X_{t,i}) = \langle X_{t,i}, \theta^* \rangle$ be the true expected reward of arm $i$ at period $t$. By Assumption 3 and Proposition 1, the inequality $r(X_{t,i}) \le r\sigma\sqrt{\log(kTd/\delta)}$ holds for all $t$ and $i$ with probability at least $1-\delta$. Then, under the choice $n_0 \asymp \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4(kTd/(\delta\lambda))$, Lemma 2 guarantees that the upper confidence band
$$\big\langle X_{t,i}, \hat{\theta}^{t-1}_{\tau,\lambda} \big\rangle + \alpha \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1}_{\tau-1,\lambda})^{-1}} + \beta\big/\sqrt{|\mathcal{E}_\tau|} \qquad (7)$$
is an upper bound for $r(X_{t,i})$ for each $t \in [T]$ and $i \in \mathcal{A}_t$ (up to an absolute constant), where $\beta = r\sigma\sqrt{s\log(kTd/\delta)}$ and $\alpha = \nu(\sqrt{s\log(kTd/(\delta\lambda))} + \lambda^{1/2} r)$.

Next, recall that at period $t$ our algorithm picks arm $i_t$, while the optimal action is $i_t^* = \arg\max_{i \in \mathcal{A}_t} \langle X_{t,i}, \theta^* \rangle$. Let $\mathcal{T}_e$ be the set of pure-exploration periods, in which arms are picked uniformly at random, and let $\mathcal{T}_g$ denote the UCB-stage periods, in which the UCB-type policy is employed; then $|\mathcal{T}_e| \lesssim n_0 \log(T)$. Since, with probability at least $1-\delta$, $r(X_{t,i}) \le r\sigma\sqrt{\log(kTd/\delta)}$ holds for all $t$ and $i$, the definition of regret yields a decomposition of $R_T$ into a pure-exploration part and a UCB part; this is inequality (8). For the UCB part of (8), in each period $t \in \mathcal{T}_g$ the picked arm $i_t$ has the largest upper confidence band, so by the definition of $r(X_{t,i})$ in (7) the per-period regret is bounded by the width of the confidence band of $i_t$, and Lemma 3 bounds the resulting sum; this is inequality (9). Finally, combining inequalities (8) and (9), we obtain the claimed bound with probability at least $1-\delta$, which concludes the proof of Theorem 1.
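Schematically, suppressing constants and logarithmic factors, the four steps combine as follows (a paraphrase of the final accounting, not the exact display from the paper):

```latex
R_T \;\lesssim\;
\underbrace{n_0 \log T \cdot r\sigma\sqrt{\log(kTd/\delta)}}_{\text{pure exploration}}
\;+\; \underbrace{T \cdot \tilde{\mathcal{O}}\big(\sqrt{s/T}\big)}_{\text{bias (Lemma 1)}}
\;+\; \underbrace{\alpha \cdot \tilde{\mathcal{O}}\big(\sqrt{sT}\big)}_{\text{variance (Lemmas 2--3)}}
\;=\; \tilde{\mathcal{O}}\big(s\sqrt{T}\big),
```

where $\alpha \asymp \nu\sqrt{s}$ up to logarithmic factors, so the variance term drives the final $s\sqrt{T}$ rate.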
Proof of Theorem 2. The proof follows a routine similar to that of Theorem 1; the main difference lies in the upper bound on the prediction error (Step 2 in the proof of Theorem 1). Recall that in the SLUCB algorithm, we construct the ridge estimator using all historical data. The complicated correlation between the design matrices and the random noises prevents us from applying the standard concentration inequalities for independent random variables that would yield an order-$\tilde{\mathcal{O}}(\sqrt{s})$ upper bound for the prediction error. Instead, we must use the self-normalized martingale concentration inequality, which only gives an order-$\tilde{\mathcal{O}}(s)$ upper bound and hence a suboptimal regret. In contrast, in the SSUCB algorithm, the SPARSE-SUPLINUCB subroutine overcomes this obstacle by splitting the historical data into disjoint groups such that, within each group, the designs and the random noises are independent. Thanks to this independence, we can establish the desired order-$\tilde{\mathcal{O}}(\sqrt{s})$ upper bound for the prediction errors. In particular, we have the following key lemma.

Lemma 4. When $n_0 \gtrsim \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4\big(kTd/(\delta\lambda)\big)$, with probability at least $1-\delta$, for every tuple $(\tau, t, i, \zeta)$ it holds that
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^* \big\rangle \big| \lesssim \gamma \Big( \sqrt{s/|\mathcal{E}_\tau|} + \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}} \Big).$$

The upper bound in Lemma 4 again includes two parts, corresponding to the bias and the variance, respectively. The bias part is of the same order as in Lemma 2, while the variance part is smaller by a factor of $\tilde{\mathcal{O}}(\sqrt{s})$. Note that Lemma 3 bounds the sum of the widths $\|[X_{t,i}]_{S_{\tau-1}}\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}}$ over all periods; this improvement therefore yields an order-$\tilde{\mathcal{O}}(\sqrt{s})$ improvement in the accumulated regret, achieving the optimal rate. The detailed proof of Lemma 4 is in the supplementary document.

As in the last step of the proof of Theorem 1, by Lemma 4 we obtain that, with probability at least $1-\delta$, for all tuples $(\tau, t, i, \zeta)$,
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} \big\rangle - r(X_{t,i}) \big| \lesssim \gamma\sqrt{s/|\mathcal{E}_\tau|} + \omega^{t-1,\zeta}_{\tau,\lambda}(i), \qquad (10)$$
where $r(X_{t,i}) = \langle X_{t,i}, \theta^* \rangle$ and $\gamma = \sigma(\lambda^{1/2} + \nu + \sigma)\log^{3/2}(kTd/\delta)$. Without loss of generality, we assume that inequality (10) holds for all tuples $(\tau, t, i, \zeta)$, an event of probability at least $1-\delta$. We also assume that in period $t \in \mathcal{E}_\tau$, the SupLinUCB subroutine terminates at the $\zeta_t$-th group with selected arm $i_t$. Let $i_t^* = \arg\max_{i \in \mathcal{A}_t} \langle X_{t,i}, \theta^* \rangle$ be the optimal arm for period $t$. If the first scenario of the SupLinUCB subroutine happens, i.e., all widths are at most $1/\sqrt{T}$, the instantaneous regret is at most of order $1/\sqrt{T}$ plus the bias term. Otherwise, since the subroutine did not stop before $\zeta_t$, for all $\zeta < \zeta_t$ we have $\omega^{t-1,\zeta}_{\tau,\lambda}(i) \le 2^{-\zeta}\beta$ for all $i \in N^{t-1,\zeta}_\tau$. Consequently, for all $i \in N^{t-1,\zeta}_\tau$,
$$\big\langle X_{t,i_t^*}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} \big\rangle + \omega^{t-1,\zeta}_{\tau,\lambda}(i_t^*) \gtrsim \langle X_{t,i_t^*}, \theta^* \rangle \ge \langle X_{t,i}, \theta^* \rangle \gtrsim \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} \big\rangle - \omega^{t-1,\zeta}_{\tau,\lambda}(i),$$
so the optimal arm $i_t^*$ survives every elimination step. Moreover, setting $\zeta = \zeta_t - 1$ and combining inequalities (13)-(15), we obtain that the instantaneous regret $r(X_{t,i_t^*}) - r(X_{t,i_t}) = \langle X_{t,i_t^*} - X_{t,i_t}, \theta^* \rangle$ is of order $2^{-\zeta_t}\beta$ up to the bias term; this is inequality (16).

For the accumulated regret $R_T(\{i_t\}; \theta^*)$, similarly to inequality (8) in the proof of Theorem 1, with probability at least $1-\delta$ we decompose it into a pure-exploration part and a UCB part over $\mathcal{T}_g$, the UCB-stage periods (inequality (17)). Then, by inequality (16) and Lemma 4, with probability at least $1-\delta$, the UCB part is controlled by the sum of the widths $\omega^{t,\zeta}_{\tau,\lambda}(i)$ defined in equation (11), where $\Psi^\zeta_\tau$ denotes the $\zeta$-th group of periods at the end of epoch $\mathcal{E}_\tau$; for each fixed $\zeta$, the corresponding sum is of order $\tilde{\mathcal{O}}(\sqrt{sT})$ with probability at least $1-\delta$ (inequality (18)). Finally, combining inequalities (17) and (18) concludes the proof.

Supplementary Material for "Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection". 6 Proof of Auxiliary Lemmas in Theorem 1. 6.1 Proof of Lemma 1. Lemma (Restatement of Lemma 1). If $n_0 \gtrsim \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4(kTd/(\delta\lambda))$, then with probability at least $1-\delta$, for each epoch $\tau \in [\tilde{\tau}]$, $\|[\theta^*]_{S_\tau^c}\| = \tilde{\mathcal{O}}(\sqrt{s/|\mathcal{E}_\tau|})$. Proof.
The proof of Lemma 1 depends on the following lemma, which establishes an upper bound for the estimation error $\hat{\theta}_{\tau,\lambda} - \theta^*$ in the $\ell_2$-norm weighted by the sample covariance matrix $\hat{\Gamma}_{\tau,\lambda}$.

Lemma 5. With probability at least $1-\delta$, we have $\|\hat{\theta}_{\tau,\lambda} - \theta^*\|^2_{\hat{\Gamma}_{\tau,\lambda}} \lesssim \nu^2 s \log^2\big(kTd/(\delta\lambda)\big) + \lambda r^2$.

Similarly, we can establish the same upper bound for the second part of the right-hand side of inequality (19). Combining inequalities (19) and (20) then yields inequality (21). We next turn to the lower bound for the left-hand side of inequality (21). Recall that in the SLUCB algorithm, the selection of actions in epoch $\mathcal{E}_\tau$ does not rely on $[X_t]_{S^c}$. Furthermore, by Assumption 1 (A2, A3), for each $t$ and $j \in S^c$, $[X_t]_j$ is sampled independently from the distribution $P_0$ and satisfies $\mathbb{E}[[X_t]_j^2] \ge \rho$. Subsequently, by the matrix Bernstein inequality, with probability at least $1-\delta$ the sample covariance matrix restricted to $S^c$ is well conditioned. Combining inequalities (20)-(21) and Lemma 5, we then obtain, with probability at least $1-\delta$, a quadratic inequality in $\|[\theta^*]_{S_\tau^c}\|$; solving this inequality yields the claimed bound. Lemma 1 then follows from $[\hat{\theta}_{\tau,\lambda}]_{S^c} = 0$, by the definition of $S$.

Lemma (Restatement of Lemma 2). If $n_0 \gtrsim \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4(kTd/(\delta\lambda))$, then with probability at least $1-\delta$, for all $\tau \in [\tilde{\tau}]$, $t \in \mathcal{E}_\tau$, and $i \in \mathcal{A}_t$, we have
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1}_{\tau,\lambda} - \theta^* \big\rangle \big| = \tilde{\mathcal{O}}\big(\sqrt{s/|\mathcal{E}_\tau|}\big) + \alpha \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1}_{\tau-1,\lambda})^{-1}},$$
where $\hat{\Gamma}^{t-1}_{\tau-1,\lambda} = \lambda I_{|S_{\tau-1}|} + \sum_{t' \in \mathcal{E}^{t-1}_\tau} [X_{t'}]_{S_{\tau-1}} [X_{t'}]_{S_{\tau-1}}^\top$ and $\mathcal{E}^{t-1}_\tau = \{t' : t' \le t-1, t' \in \mathcal{E}_\tau\}$.

Proof. Recall that $\hat{\theta}^{t-1}_{\tau,\lambda}$ is the ridge estimator of $\theta^*$ based on the data $\{(X_{t'}, Y_{t'})\}_{t' \in \mathcal{E}^{t-1}_\tau}$, restricted to the generalized support $S_{\tau-1}$. For each $i \in \mathcal{A}_t$, we have the decomposition (23), which splits the prediction error into the parts inside and outside $S_{\tau-1}$. The first term on the right-hand side of (23) is bounded by Corollary 2: with probability at least $1-\delta$,
$$\big| \big\langle [X_{t,i}]_{S_{\tau-1}^c}, [\theta^*]_{S_{\tau-1}^c} \big\rangle \big| = \tilde{\mathcal{O}}\big(\sqrt{s/|\mathcal{E}_{\tau-1}|}\big). \qquad (24)$$
It remains to upper bound the second part. By the Cauchy-Schwarz inequality, inequality (25) reduces the problem to bounding $\|[\hat{\theta}^{t-1}_{\tau,\lambda} - \theta^*]_{S_{\tau-1}}\|_{\hat{\Gamma}^{t-1}_{\tau-1,\lambda}}$, for which the following lemma applies; its proof relies on the self-normalized martingale concentration inequality (Lemma 9 in Section 9) and is similar to the proof of Lemma 5.

Lemma 8. For each $\tau \in [\tilde{\tau}]$ and $t \in \mathcal{E}_\tau$, with probability at least $1-\delta$, the weighted estimation error $\|[\hat{\theta}^{t-1}_{\tau,\lambda} - \theta^*]_{S_{\tau-1}}\|_{\hat{\Gamma}^{t-1}_{\tau-1,\lambda}}$ is at most of order $\alpha$.

The proof of Lemma 8 is deferred to Section 8.4. We now continue the proof of Lemma 2. Combining inequalities (23)-(25) and Lemma 8, we obtain the claimed bound with probability at least $1-\delta$; since $s \ll T$, the bias term is of the stated order. Finally, applying the union bound over all $t \in \mathcal{E}_\tau$ finishes the proof of Lemma 2.

Lemma (Restatement of Lemma 3). Let two constants $p, q$ satisfy $1 \le p \le q$. Then with probability at least $1-\delta$, the sum over all periods of the truncated widths $\min\{1, \|[X_t]_{S_{\tau-1}}\|_{(\hat{\Gamma}^{t-1}_{\tau-1,\lambda})^{-1}}\}$ is bounded by a term linear in $n_0$ plus a term of order $\tilde{\mathcal{O}}(\sqrt{sT})$.

Proof. For a fixed epoch $\mathcal{E}_\tau$, let $t_1$ be the end of the pure-exploration stage and $t_2$ the end of $\mathcal{E}_\tau$, so that $n_0 \le t_1 \le t_2$. We bound the within-epoch sum in display (26); here the first inequality holds because $q \ge p$ and $n_0 \le t_1$, the second follows from the Cauchy-Schwarz inequality, and the last is due to the basic inequality $\min\{1, u\} \le 2\log(1+u)$ for all $u > 0$. On the other hand, we have the recursion (27) relating $\det(\hat{\Gamma}^{t}_{\tau-1,\lambda})$ across periods. Combining (26)-(27) and taking a telescoping sum, we obtain inequality (28). By Proposition 1, with probability at least $1-\delta$, the covariates $X_{t,i}$ are uniformly bounded up to logarithmic factors for all $t \in [T]$ and $i \in \mathcal{A}_t$. Then, applying the determinant-trace inequality, we have, with probability at least $1-\delta$,
$$\log\det\big(\hat{\Gamma}^{t_2}_{\tau-1,\lambda}\big) \lesssim |S_{\tau-1}| \cdot \big(\lambda + \log(\sigma dTk/\delta)\big). \qquad (29)$$
On the other hand, by definition $\hat{\Gamma}^{t_1}_{\tau-1,\lambda} \succeq \lambda I_{|S_{\tau-1}|}$, which implies $\log\det(\hat{\Gamma}^{t_1}_{\tau-1,\lambda}) \gtrsim |S_{\tau-1}| \log(\lambda)$. Finally, combining inequalities (28)-(29) completes the proof of Lemma 3.

Lemma (Restatement of Lemma 4). When $n_0 \gtrsim \rho^{-1}(\nu\sigma\tilde{\tau})^2 s^3 \log^4(kTd/(\delta\lambda))$, with probability at least $1-\delta$, for every tuple $(\tau, t, i, \zeta)$ it holds that
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^* \big\rangle \big| \lesssim \gamma \Big( \sqrt{s/|\mathcal{E}_\tau|} + \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}} \Big).$$

Proof. To simplify notation, for each fixed tuple $(\tau, t, \zeta)$, we denote by $X^{t-1,\zeta}_\tau \in \mathbb{R}^{m \times d}$, $Y^{t-1,\zeta}_\tau \in \mathbb{R}^m$, and $\varepsilon^{t-1,\zeta}_\tau \in \mathbb{R}^m$ the stacked covariates, rewards, and noises of group $\Psi^{t-1,\zeta}_\tau$, where $m$ is the number of elements in the set $\Psi^{t-1,\zeta}_\tau$.
Recall that $\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda}$ is the sample covariance matrix corresponding to $X^{t-1,\zeta}_\tau$ restricted to $S_{\tau-1}$, and that $\hat{\theta}^{t-1,\zeta}_{\tau,\lambda}$ is the ridge estimator with generalized support $S_{\tau-1}$ computed from the data $(X^{t-1,\zeta}_\tau, Y^{t-1,\zeta}_\tau)$. For each $X_{t,i}$, we can upper bound the prediction error as
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^* \big\rangle \big| \le \big| \big\langle [X_{t,i}]_{S_{\tau-1}^c}, [\theta^*]_{S_{\tau-1}^c} \big\rangle \big| + \big| \big\langle [X_{t,i}]_{S_{\tau-1}}, [\hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^*]_{S_{\tau-1}} \big\rangle \big|. \qquad (31)$$
We can use Corollary 2 to upper bound the first term on the right-hand side of (31) and obtain, with probability at least $1-\delta$,
$$\big| \big\langle [X_{t,i}]_{S_{\tau-1}^c}, [\theta^*]_{S_{\tau-1}^c} \big\rangle \big| \lesssim \sigma\sqrt{\log(kTd/\delta)} \cdot \sqrt{\nu^2 s \log^2\big(kTd/(\delta\lambda)\big) + \lambda r^2}\,\Big/\sqrt{|\mathcal{E}_{\tau-1}|}. \qquad (32)$$
It remains to upper bound the second term. We first note, by the Cauchy-Schwarz inequality and the fact that $\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda} \succeq \lambda I_{|S_{\tau-1}|}$, that
$$\big| \big\langle [X_{t,i}]_{S_{\tau-1}}, [\hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^*]_{S_{\tau-1}} \big\rangle \big| \le \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}} \cdot \big\| [\hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^*]_{S_{\tau-1}} \big\|_{\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda}}. \qquad (33)$$
Next, we claim that for each fixed tuple $(\tau, t, \zeta)$, the random vectors $\varepsilon^{t-1,\zeta}_\tau$ and $X^{t-1,\zeta}_\tau$ are independent of each other. Recall the policies for picking arms and updating $\Psi^{t-1,\zeta}_\tau$ in the SupLinUCB subroutine, specified in detail in Section 2.2. Consider a fixed period $t' \le t-1$ and the $\zeta$-th group. The event that the covariate $X_{t'}$ enters $X^{t-1,\zeta}_\tau$ depends only on the widths of the confidence bands $\{\omega^{t'-1,\zeta'}_{\tau,\lambda}\}_{\zeta' \le \zeta}$. Moreover, since $\{\omega^{t'-1,\zeta'}_{\tau,\lambda}\}_{\zeta' \le \zeta}$ depend only on the design matrices $\{X^{t'-1,\zeta'}_\tau\}_{\zeta' \le \zeta}$ and not on the noise sequence $\{\varepsilon_{t''}\}_{t'' \le t'}$, whether $X_{t'}$ enters $X^{t-1,\zeta}_\tau$ is also independent of $\{\varepsilon_{t'}\}_{t' \in \Psi^{t,\zeta}_\tau}$. This yields the claim.

The independence between $\varepsilon^{t-1,\zeta}_\tau$ and $X^{t-1,\zeta}_\tau$ enables us to establish a tighter upper bound for the second term on the right-hand side of (31) than in Lemma 2. Recall that by Assumption 1, given the information up to epoch $\tau-1$, the random variables $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}}$ and $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}^c}$ are independent; moreover, $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}^c}$ is also independent of $\varepsilon^{t-1,\zeta}_\tau$. Hence each element of $\varepsilon^{t-1,\zeta}_\tau + [X^{t-1,\zeta}_\tau]_{S_{\tau-1}^c} [\theta^*]_{S_{\tau-1}^c}$ is a centered sub-Gaussian random variable with parameter $\nu^2 + \sigma^2 \|[\theta^*]_{S_{\tau-1}^c}\|_2^2$, independent of the covariates $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}}$. Applying the sub-Gaussian concentration inequality and the union bound, we obtain, with probability at least $1-\delta$, the bound (34). We emphasize that inequality (34) is interpreted as a random event given the realization of $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}}$ and holds regardless of the values of $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}}$. This is where the independence of $\varepsilon^{t-1,\zeta}_\tau$ and $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}}$ is indispensable: otherwise, given $[X^{t-1,\zeta}_\tau]_{S_{\tau-1}^c}$, the elements of $\varepsilon^{t-1,\zeta}_\tau$ would not be i.i.d., which would prevent us from establishing a concentration result similar to (34). Based on inequality (34) and the Cauchy-Schwarz inequality, with probability at least $1-\delta$, we obtain (35). Combining inequalities (33)-(35), we have, with probability at least $1-\delta$,
$$\big| \big\langle [X_{t,i}]_{S_{\tau-1}}, [\hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^*]_{S_{\tau-1}} \big\rangle \big| \lesssim \gamma \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}}. \qquad (36)$$
Recall that Lemma 1 shows $\|[\theta^*]_{S_{\tau-1}^c}\| = \tilde{\mathcal{O}}(\sqrt{s/T})$; hence, when $s \ll T$, $\|[\theta^*]_{S_{\tau-1}^c}\| \lesssim 1$. Subsequently, combining inequalities (32) and (36), we obtain with probability at least $1-\delta$ that
$$\big| \big\langle X_{t,i}, \hat{\theta}^{t-1,\zeta}_{\tau,\lambda} - \theta^* \big\rangle \big| \lesssim \sigma(\lambda^{1/2} + \nu + \sigma)\log^{3/2}(kTd/\delta) \cdot \Big( \sqrt{s/|\mathcal{E}_\tau|} + \big\| [X_{t,i}]_{S_{\tau-1}} \big\|_{(\hat{\Gamma}^{t-1,\zeta}_{\tau-1,\lambda})^{-1}} \Big),$$
which concludes the proof of Lemma 4.

Turning to the proof of Lemma 5: the concentration bounds above hold for all possible supports $\tilde{S}$, which further implies that the bound on the first part of inequality (39) holds with probability at least $1-\delta$.
For the second part of the right-hand side of inequality (39), by the Cauchy–Schwarz inequality and the confidence region of the ridge estimator (Lemma 10 in Section 9), we obtain with probability at least $1-\delta$ an upper bound on $2(\theta^*-\bar\theta_{\tau,\lambda})^\top\hat\Gamma_{\tau,\lambda}(\hat\theta_{\tau,\lambda}-\bar\theta_{\tau,\lambda})$ in terms of $\|\theta^*-\bar\theta_{\tau,\lambda}\|_{\hat\Gamma_{\tau,\lambda}}$, which further implies that
$$\max\Big\{\max_{\tilde S}\Lambda_{\tilde S},\ \Lambda_{\bar S^*}\Big\}\;\lesssim\;\nu\sqrt{s}\cdot\log\big(kTd/(\delta\lambda)\big).$$
As a result, by inequality (45), we finally obtain with probability at least $1-\delta$ that $\big\|\hat\theta_{\tau,\lambda}-\theta^*\big\|^2_{\hat\Gamma_{\tau,\lambda}}$ obeys the bound claimed in Lemma 5.

Now we use the concentration inequality for the maximum of sub-Gaussian random variables again: with probability at least $1-\delta$, the corresponding maximal bound holds for each $t$. Hence, we obtain with probability at least $1-\delta$ the bound claimed in Lemma 8, which concludes the proof of Lemma 8.

9 Auxiliary Lemmas

In this section, we list the results established in [Abbasi-Yadkori et al., 2011], which are used extensively in our proofs.

Lemma 9 (Self-normalized martingale concentration inequality). Let $\{X_t\}_{t\ge1}$ be an $\mathbb{R}^d$-valued stochastic process and $\{\eta_t\}_{t\ge1}$ be a real-valued stochastic process. Let $\mathcal{F}_t=\sigma(X_1,\ldots,X_{t+1},\eta_1,\ldots,\eta_t)$ be the natural filtration generated by $\{X_t\}_{t\ge1}$ and $\{\eta_t\}_{t\ge1}$. Assume that for each $t$, $\eta_t$ is centered and sub-Gaussian with variance proxy $R^2$ given $\mathcal{F}_{t-1}$, i.e.,
$$\mathbb{E}[\eta_t\,|\,\mathcal{F}_{t-1}]=0,\qquad \mathbb{E}\big[\exp\{\lambda\eta_t\}\,\big|\,\mathcal{F}_{t-1}\big]\le\exp\{\lambda^2R^2/2\},\quad\forall\lambda\in\mathbb{R}.$$
Also assume that $V$ is a $d\times d$ positive definite matrix. For any $t\ge0$, we define $\bar V_t=V+\sum_{s=1}^{t}X_sX_s^\top$ and $S_t=\sum_{s=1}^{t}\eta_sX_s$. Then for any $\delta>0$, with probability at least $1-\delta$, for all $t\ge0$,
$$\|S_t\|^2_{\bar V_t^{-1}}\;\le\;2R^2\log\Big(\frac{\det(\bar V_t)^{1/2}\det(V)^{-1/2}}{\delta}\Big).$$

Lemma 10 (Confidence region of the ridge regression estimator under dependent design). Consider the stochastic processes $\{(X_t,Y_t,\eta_t)\}_{t\ge1}$. Assume that $\{X_t\}_{t\ge1}$ and $\{\eta_t\}_{t\ge1}$ satisfy the conditions in Lemma 9 and that $Y_t=\langle X_t,\theta^*\rangle+\eta_t$. Also assume that $\|\theta^*\|_2\le r$. For any $t\ge0$, let $\bar\theta_t=\big(X_{1:t}^\top X_{1:t}+\lambda I_d\big)^{-1}X_{1:t}^\top Y_{1:t}$ be the ridge regression estimator, where $X_{1:t}=(X_1,\ldots,X_t)^\top$, $Y_{1:t}=(Y_1,\ldots,Y_t)^\top$, and $I_d$ is the $d$-dimensional identity matrix.
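As a rough illustration of how Lemmas 9 and 10 translate into computation, here is a minimal numpy sketch of the ridge estimator and its confidence radius; the function names (`ridge_confidence`, `ucb_value`) and the choice of inputs are ours, and the radius follows the Abbasi-Yadkori-style bound stated above rather than any tuned implementation from the paper.

```python
import numpy as np

def ridge_confidence(X, Y, lam, R, r, delta):
    """Ridge estimator and a Lemma-10-style confidence radius (a sketch).

    X: (t, d) design matrix; Y: (t,) observed rewards; lam: penalty lambda;
    R: sub-Gaussian variance proxy of the noise; r: bound on ||theta*||_2;
    delta: failure probability.
    """
    t, d = X.shape
    Gamma = lam * np.eye(d) + X.T @ X           # hat Gamma_t = lam*I + X^T X
    theta = np.linalg.solve(Gamma, X.T @ Y)     # ridge regression estimator
    # radius = R * sqrt(2*log(det(Gamma)^{1/2} * det(lam*I)^{-1/2} / delta))
    #          + sqrt(lam) * r
    _, logdet = np.linalg.slogdet(Gamma)
    log_ratio = 0.5 * (logdet - d * np.log(lam)) - np.log(delta)
    beta = R * np.sqrt(2.0 * log_ratio) + np.sqrt(lam) * r
    return theta, beta, Gamma

def ucb_value(x, theta, beta, Gamma):
    """Optimistic reward <x, theta> + beta * ||x||_{Gamma^{-1}} for one arm."""
    width = np.sqrt(x @ np.linalg.solve(Gamma, x))
    return float(x @ theta + beta * width)
```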
Then with probability at least $1-\delta$, for all $t\ge0$,
$$\big\|\bar\theta_t-\theta^*\big\|_{\hat\Gamma_t}\;\le\;R\sqrt{2\log\Big(\frac{\det(\hat\Gamma_t)^{1/2}\det(\lambda I_d)^{-1/2}}{\delta}\Big)}+\lambda^{1/2}r,$$
where $\hat\Gamma_t=\lambda I_d+\sum_{s=1}^{t}X_sX_s^\top$.

References

Contextual learning with online convex optimization: theory and application to chronic diseases.
Contextual Gaussian process bandit optimization.
Asymptotically efficient adaptive allocation rules.
Sparse online learning via truncated gradient.
Linear multi-resource allocation with semi-bandit feedback.
A contextual-bandit approach to personalized news article recommendation.
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms.
Provably optimal algorithms for generalized linear contextual bandits.
Subset selection in regression.
Demystifying optimal dynamic treatment regimes.
Optimal dynamic treatment regimes.
An experimental design for the development of adaptive treatment strategies.
Sparse approximate solutions to linear systems.
Problem complexity and method efficiency in optimization.
Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content.
Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part II: proofs of results.
Sparse learning via Boolean relaxations.
Restricted eigenvalue properties for correlated Gaussian designs.
Estimation and extrapolation of optimal treatment and testing strategies.
Marginal structural models and causal inference in epidemiology.
Linearly parameterized bandits.
On the complexity of bandit and derivative-free stochastic convex optimization.
On the complexity of bandit linear optimization.
Penalized Q-learning for dynamic treatment regimens.
Lower bounds for stochastic linear bandits.
Regression shrinkage and selection via the Lasso.
User-friendly tail bounds for sums of random matrices.
Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso).
Improved algorithms for linear stochastic bandits.
Online-to-confidence-set conversions and application to sparse stochastic bandits.
Reinforcement learning with immediate rewards and linear hypotheses.
Optimal algorithms for online convex optimization with multi-point bandit feedback.
Taming the monster: a fast and simple algorithm for contextual bandits.
Using confidence bounds for exploitation-exploration trade-offs.
Gambling in a rigged casino: the adversarial multi-armed bandit problem.
Finite-time analysis of the multiarmed bandit problem.
Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates.
Online decision-making with high-dimensional covariates.
Best subset selection via a modern optimization lens.
Non-stationary stochastic optimization.
Simultaneous analysis of Lasso and Dantzig selector.
Iterative hard thresholding for compressed sensing.
Kernel-based methods for bandit convex optimization.
Statistics for high-dimensional data: methods, theory and applications.
The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics.
Bandit theory meets compressed sensing for high dimensional stochastic linear bandit.
Inference for non-regular parameters in optimal dynamic treatment regimes.
Contextual bandits with linear payoff functions.
Stochastic linear optimization under bandit feedback.
Compressed sensing.
Parametric bandits: the generalized linear case.
Online convex optimization in the bandit setting: gradient descent without a gradient.
Online sparse linear regression.
Practical contextual bandits with regression oracles.
Sparsity regret bounds for individual sequences in online linear regression.
Q-learning with censored data.
A linear response bandit problem.
A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter.
Stochastic zeroth-order optimization in high dimensions.
Q-learning.
Estimating optimal treatment regimes from a classification perspective.
A robust method for estimating optimal treatment regimes.
Estimating individualized treatment rules using outcome weighted learning.
Doubly robust learning for estimating individualized treatment with censored data.
The proof of Lemma 5 is provided in Section 8.1. We now continue the proof of Lemma 1. To simplify notation, from now on we fix the epoch $\tau$ and let $S=S_\tau=\mathrm{supp}(\hat\theta_{\tau,\lambda})$. Then by definition, $[\hat\theta_{\tau,\lambda}-\theta^*]_{S^c}=-[\theta^*]_{S^c}$. We can decompose the error $\|\hat\theta_{\tau,\lambda}-\theta^*\|^2_{\hat\Gamma_{\tau,\lambda}}$ into its blocks over $S$ and $S^c$, where the resulting inequality follows from the fact that the matrix $[\hat\Gamma_{\tau,\lambda}]_S$ is positive definite.
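The display carrying this decomposition was lost in extraction; the following LaTeX sketch reconstructs the routine block computation it refers to, with $v=\hat\theta_{\tau,\lambda}-\theta^*$ (our rendering, so the paper's exact indexing and constants may differ).

```latex
% Block decomposition of the weighted error; the inequality drops the
% positive term contributed by the positive definite block [\hat\Gamma]_{SS}.
\[
\|v\|^2_{\hat\Gamma_{\tau,\lambda}}
= v_S^\top[\hat\Gamma_{\tau,\lambda}]_{SS}\,v_S
+ 2\,v_S^\top[\hat\Gamma_{\tau,\lambda}]_{SS^c}\,v_{S^c}
+ v_{S^c}^\top[\hat\Gamma_{\tau,\lambda}]_{S^cS^c}\,v_{S^c}
\;\ge\;
2\,v_S^\top[\hat\Gamma_{\tau,\lambda}]_{SS^c}\,v_{S^c}
+ v_{S^c}^\top[\hat\Gamma_{\tau,\lambda}]_{S^cS^c}\,v_{S^c},
\]
% where v_S = [v]_S and v_{S^c} = [v]_{S^c} = -[\theta^*]_{S^c}.
```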
By rearranging the above inequality, we obtain inequality (19), which isolates $[\hat\theta_{\tau,\lambda}-\theta^*]_{S^c}$. Next, we upper bound the right-hand side of inequality (19) and then lower bound its left-hand side, which leads to a quadratic inequality with respect to the estimation error $\|[\hat\theta_{\tau,\lambda}-\theta^*]_{S^c_\tau}\|$. By solving this quadratic inequality, we finish the proof of Lemma 1.

We establish the upper bound first. According to Hölder's inequality, we obtain inequality (20), where the last step follows from the fact that $[\hat\theta_{\tau,\lambda}-\theta^*]_{S^c}$ has at most $(\tau+1)s$ non-zero coordinates. To continue the proof, we also need the following two lemmas, which upper bound $\|[\hat\Gamma_{\tau,\lambda}]_{SS^c}\|_\infty$ and $\|\hat\theta_{\tau,\lambda}-\theta^*\|_1$, respectively.

Lemma 6. With probability at least $1-\delta$, the bound on $\|[\hat\Gamma_{\tau,\lambda}]_{SS^c}\|_\infty$ established in Section 8.2 holds.

Lemma 7. For an arbitrary constant $\varsigma>0$, when $n_0\gtrsim\max\{\varsigma^2\rho^{-1},\ \sigma^2\rho^{-2}\log(d\tilde\tau/\delta)\}$, with probability at least $1-\delta$, we have $\|\hat\theta_{\tau,\lambda}-\theta^*\|_1\lesssim(\nu s/\varsigma)\cdot\sqrt{\tilde\tau}\cdot\log\big(kTd/(\delta\lambda)\big)$.

The proofs of Lemmas 6 and 7 are deferred to Sections 8.2 and 8.3, respectively. We now continue the proof of Lemma 1. By applying Lemmas 6 and 7 with $\varsigma=\nu\sigma\tilde\tau\cdot s^{3/2}\cdot\log^2\big(kTd/(\delta\lambda)\big)$, we obtain the desired upper bound with probability at least $1-\delta$. Note that with such a choice of $\varsigma$, we require $n_0\gtrsim\varsigma^2\rho^{-1}=\rho^{-1}\cdot(\nu\sigma\tilde\tau)^2\cdot s^3\cdot\log^4\big(kTd/(\delta\lambda)\big)$.

8.1 Proof of Lemma 5

Proof. We use the optimality of $\hat\theta_{\tau,\lambda}$ to derive an upper bound on the quadratic form of $\hat\theta_{\tau,\lambda}-\theta^*$. In the following, we use $S^*$ to denote the support of the true parameter $\theta^*$. Let $\bar\theta_{\tau,\lambda}$ be the projected ridge estimator with support $S^*\cup S_{\tau-1}$ using the data $\{X_t,Y_t\}_{t\in\mathcal{E}_\tau}$, where $\mathrm{Proj}_r\{\cdot\}$ denotes the projection operator onto the centered $\ell_2$-ball of radius $r$. We remark that $\bar\theta_{\tau,\lambda}$ is unavailable in practice since $S^*$ is unknown. However, it is not needed by the SLUCB algorithm and serves only as an auxiliary estimator in the theoretical analysis. Recall the definition of the BSS estimator $\hat\theta_{\tau,\lambda}$ in equation (37), where $S_{\tau-1}$ is the generalized support recovered in the $(\tau-1)$-th epoch. Since $\mathrm{supp}(\bar\theta_{\tau,\lambda})=S^*\cup S_{\tau-1}\supseteq S_{\tau-1}$ and $|S^*\cup S_{\tau-1}|\le\tau s$, the estimator $\bar\theta_{\tau,\lambda}$ satisfies the constraints in equation (37). Then, by the optimality of $\hat\theta_{\tau,\lambda}$, we obtain inequality (38); after some algebra, inequality (38) becomes inequality (39). Next, we establish upper bounds for the three parts of the right-hand side of inequality (39) individually.

We first introduce some notation. By its definition, $\hat\theta_{\tau,\lambda}$ solves the ridge-penalized least-squares problem subject to the constraint that there exists some subset $S\subseteq[d]$ with $\mathrm{supp}(\theta)=S$, $S\supseteq S_{\tau-1}$, $|S|\le\tau s$, and $\|\theta\|\le r$, where the optimization is taken over all feasible sets $S$. From now on, for convenience, given a fixed subset $S\subseteq[d]$ with $|S|\le\tau s$, we write $\Lambda_S$ for the quantity defined in equation (41). In particular, let $\tilde S=\tilde S_\tau$ be the union of the supports of $\hat\theta_{\tau,\lambda}$ and $\bar\theta_{\tau,\lambda}$, so that $|\tilde S|\le(\tau+1)s$ by definition.

We are now ready to upper bound the right-hand side of inequality (39). For the first part, note that $\mathrm{supp}(\hat\theta_{\tau,\lambda}-\bar\theta_{\tau,\lambda})\subseteq\tilde S_\tau$. Then, by the Cauchy–Schwarz inequality, we obtain a bound in which the maximization in the last step is taken over all possible $\tilde S$, whose number is at most $d^{(\tau+1)s}$.
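For intuition on the cost of this union bound, the following side calculation (ours, not a numbered display of the paper) shows that enumerating supports only inflates the logarithmic factor:

```latex
\[
\#\bigl\{\tilde S \subseteq [d] : |\tilde S| = (\tau+1)s\bigr\}
= \binom{d}{(\tau+1)s} \;\le\; d^{(\tau+1)s},
\qquad
\log\frac{d^{(\tau+1)s}}{\delta} = (\tau+1)s\log d + \log\frac{1}{\delta}.
\]
```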
For each fixed $\tilde S$, by applying the self-normalized martingale concentration inequality (Lemma 9 in Section 9), we obtain the corresponding self-normalized bound with probability at least $1-\delta$. Since the number of possible $\tilde S$ is upper bounded by $d^{(\tau+1)s}$, applying the union bound argument yields, with probability at least $1-\delta$, a uniform bound over all such $\tilde S$, where $\bar S^*=\mathrm{supp}(\bar\theta_\tau)\cup\mathrm{supp}(\theta^*)$.

For the last part of the right-hand side of inequality (39), since $\theta^*$, $\hat\theta_\tau$, and $\bar\theta_\tau$ all lie in the centered $\ell_2$-ball of radius $r$, we have $\big|2\lambda(\hat\theta_{\tau,\lambda}-\bar\theta_{\tau,\lambda})^\top\theta^*\big|\le4\lambda r^2$. Finally, by combining inequalities (39)-(44) and Lemma 10 (in Section 9), we obtain inequality (45) with probability at least $1-\delta$, where the maximization is taken over all subsets $\tilde S\subseteq[d]$ of cardinality at most $(\tau+1)s$. To conclude the proof of Lemma 5, it remains to establish an upper bound for $\Lambda_{\tilde S}$. By Proposition 1, with probability at least $1-\delta$, for all $t\in\mathcal{E}_\tau$, $i\in\mathcal{A}_t$, and $j\in[d]$, we have $\big|[X_{t,i}]_j\big|\lesssim\sigma\log(kTd/\delta)$. Then, by the determinant-trace inequality and the definition of $\Lambda_{\tilde S}$ in equation (41), we obtain with probability at least $1-\delta$ the required bound, which concludes the proof of Lemma 5.

8.2 Proof of Lemma 6

Lemma (Restatement of Lemma 6). With probability at least $1-\delta$, the stated bound on $\|[\hat\Gamma_{\tau,\lambda}]_{SS^c}\|_\infty$ holds.

Proof. By the definition of the matrix $[\hat\Gamma_{\tau,\lambda}]_{SS^c}$, for every $i\in S$ and $j\in S^c$ we have $[\hat\Gamma_{\tau,\lambda}]_{ij}=\sum_{t\in\mathcal{E}_\tau}[X_t]_i[X_t]_j$. Recall that in epoch $\mathcal{E}_\tau$, the SLUCB algorithm picks arms based only on the information in the coordinates $\mathrm{supp}(\hat\theta_{\tau-1,\lambda})$, and that the SLUCB algorithm requires $S=\mathrm{supp}(\hat\theta_{\tau,\lambda})\supseteq\mathrm{supp}(\hat\theta_{\tau-1,\lambda})$. Since all covariates $X_{t,i}$ have independent coordinates, for each $j\in S^c$ the variables $\{[X_t]_j\}_{t\in\mathcal{E}_\tau}$ are independent centered sub-Gaussian random variables. As a result, conditional on $[X_t]_i$, the random variable $[\hat\Gamma_{\tau,\lambda}]_{ij}$ is sub-Gaussian with parameter $\sigma^2\cdot\sum_{t\in\mathcal{E}_\tau}[X_t]_i^2$. Note that $\|[\hat\Gamma_{\tau,\lambda}]_{SS^c}\|_\infty$ is the maximum of $|S|\cdot|S^c|$ sub-Gaussian random variables whose parameters are uniformly upper bounded by $\sigma^2\cdot\max_{i\in S}\sum_{t\in\mathcal{E}_\tau}[X_t]_i^2$. By applying the concentration inequality for the maximum of sub-Gaussian random variables, we obtain the claimed bound with probability at least $1-\delta$; the symmetric bound follows in the same way, which concludes the proof of Lemma 6.

8.3 Proof of Lemma 7

Lemma (Restatement of Lemma 7). For an arbitrary constant $\varsigma>0$, when $n_0\gtrsim\max\{\varsigma^2\rho^{-1},\ \sigma^2\rho^{-2}\log(d\tilde\tau/\delta)\}$, with probability at least $1-\delta$, we have $\|\hat\theta_{\tau,\lambda}-\theta^*\|_1\lesssim(\nu s/\varsigma)\cdot\sqrt{\tilde\tau}\cdot\log\big(kTd/(\delta\lambda)\big)$.

Proof. To simplify notation, let $\tilde S_\tau$ be the union of $\mathrm{supp}(\hat\theta_{\tau,\lambda})$ and $\mathrm{supp}(\theta^*)$, so that $|\tilde S_\tau|\le s(\tau+1)$. Recall that in the SLUCB algorithm, the first $n_0$ arms in epoch $\mathcal{E}_\tau$ are chosen uniformly at random. Then, by the matrix Bernstein inequality and Assumption 1 (A2), for an arbitrary constant $\varsigma>0$, when $n_0\gtrsim\max\{\varsigma^2\rho^{-1},\ \sigma^2\rho^{-2}\log(d\tilde\tau/\delta)\}$, inequality (47) holds with probability at least $1-\delta$. Since $\mathrm{supp}(\hat\theta_{\tau,\lambda}-\theta^*)\subseteq\tilde S_\tau$, when inequality (47) holds we can relate the $\ell_1$-error to the $\hat\Gamma_{\tau,\lambda}$-weighted error, and hence we obtain the claimed bound with probability at least $1-\delta$, where the last inequality follows from Lemma 5. This finishes the proof of Lemma 7.

8.4 Proof of Lemma 8

Lemma (Restatement of Lemma 8). For each $\tau\in[\tilde\tau]$ and $t\in\mathcal{E}_\tau$, the stated bound holds with probability at least $1-\delta$.

Proof. For each $t'\in\mathcal{E}_\tau$, we have the decomposition $Y_{t'}=\big\langle[X_{t'}]_{S_{\tau-1}},[\theta^*]_{S_{\tau-1}}\big\rangle+\tilde\varepsilon_{t'}$, where $\tilde\varepsilon_{t'}=\big\langle[X_{t'}]_{S^c_{\tau-1}},[\theta^*]_{S^c_{\tau-1}}\big\rangle+\varepsilon_{t'}$. Given the information up to epoch $\tau-1$, for each $t'\in\mathcal{E}_\tau$ and $i\in\mathcal{A}_{t'}$, by Assumption 1 (A3), $[X_{t',i}]_{S^c_{\tau-1}}$ is independent of $[X_{t',i}]_{S_{\tau-1}}$. Recall that in epoch $\tau$, all decisions on picking arms depend only on the covariate information in the coordinates $S_{\tau-1}$. Hence, $\tilde\varepsilon_{t'}$ is independent of the design $[X_{t'}]_{S_{\tau-1}}$.
Moreover, conditioning on the information up to epoch $\tau-1$, the variables $\{\tilde\varepsilon_{t'}\}_{t'\in\mathcal{E}_\tau}$ are independent centered sub-Gaussian random variables with parameter $\nu^2+\sigma^2\|[\theta^*]_{S^c_{\tau-1}}\|_2^2$. Recall the definition of $\hat\theta^{t-1}_{\tau,\lambda}$, the ridge estimator computed from the data indexed by $\mathcal{E}^{t-1}_\tau=\{t':t'\le t-1,\ t'\in\mathcal{E}_\tau\}$. Then, by the confidence region of the ridge regression estimator (Lemma 10 in Section 9), we obtain the corresponding confidence bound with probability at least $1-\delta$. Similar to the proof of Lemma 5, we have
$$\det\big(\hat\Gamma^{t-1}_{\tau-1,\lambda}\big)\;\lesssim\;\Big(\lambda+\frac{T\sigma^2\log^2(kTd/\delta)}{\tau s}\Big)^{\tau s}.$$

Algorithm 1: SPARSE-LINUCB
Input: sequentially arriving covariates $\{X_{t,i}\}_{t\in[T],i\in\mathcal{A}_t}$, length of pure-exploration stage $n_0$, confidence level $\alpha$, estimated upper bound of reward $\beta$, sparsity level $s$, ridge regression penalty $\lambda$.
Output: action sequence $\{i_t\}_{t\in[T]}$.
1. Partition $[T]$ into consecutive epochs $\mathcal{E}_1,\mathcal{E}_2,\ldots,\mathcal{E}_{\tilde\tau}$ such that $|\mathcal{E}_\tau|=\max\{2^\tau,n_0\}$.
2. Initialization: $\hat\theta_{0,\lambda}=0$, $S=\emptyset$.
3. For $\tau=1,2,\ldots,\tilde\tau$:
4.   For the first $n_0$ time periods $t\in\mathcal{E}_\tau$: select $i_t\in\mathcal{A}_t$ uniformly at random and set $\hat\theta^t_{\tau,\lambda}=\hat\theta_{\tau-1,\lambda}$.
5.   For the remaining time periods $t\in\mathcal{E}_\tau$: calculate the matrix $\hat\Gamma^{t-1}_{\tau,\lambda}$, calculate the upper confidence band of the reward for each arm, and select the arm accordingly.

Algorithm 2: SPARSE-SUPLINUCB Subroutine
Input: epoch index $\tau$, sequentially arriving covariates $\{X_{t,i}\}_{t\in\mathcal{E}_\tau,i\in\mathcal{A}_t}$, length of pure-exploration stage $n_0$, confidence level $\gamma$, estimated upper bound of reward $\beta$, support recovered in the previous epoch $S_{\tau-1}$, sparsity level $s$, ridge regression penalty $\lambda$.
Output: action sequence $\{i_t\}_{t\in\mathcal{E}_\tau}$.
1. For the first $n_0$ time periods $t\in\mathcal{E}_\tau$: select $i_t\in\mathcal{A}_t$ uniformly at random.
2. Set $\tilde\zeta=\lceil\log(\beta T)\rceil$ and $S=S_{\tau-1}$, and initialize the sets $\{\Psi^{t,1}_\tau,\ldots,\Psi^{t,\tilde\zeta}_\tau\}$ by evenly partitioning the $n_0$ time periods of the pure-exploration stage.
3. For the remaining time periods $t\in\mathcal{E}_\tau$: initialize $\zeta=1$ and $N^{t-1,\zeta}_\tau=\mathcal{A}_t$; repeat the elimination step over the groups $\zeta$ until an arm $i_t\in\mathcal{A}_t$ is selected.
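To make the doubly growing epoch schedule of Algorithm 1 concrete, here is a minimal runnable Python sketch of the partitioning step alone; the function name `sparse_linucb_epochs` is ours, and the per-period UCB logic inside each epoch is deliberately omitted.

```python
import numpy as np

def sparse_linucb_epochs(T, n0):
    """Partition range(T) into epochs with |E_tau| = max(2**tau, n0),
    as in step 1 of Algorithm 1 (the final epoch is truncated at T)."""
    epochs, t, tau = [], 0, 1
    while t < T:
        length = min(max(2 ** tau, n0), T - t)
        epochs.append(np.arange(t, t + length))
        t += length
        tau += 1
    return epochs

# Example: horizon T = 100 with a pure-exploration stage of n0 = 8 per epoch.
for tau, epoch in enumerate(sparse_linucb_epochs(100, 8), start=1):
    explore = epoch[:8]   # arms drawn uniformly at random
    exploit = epoch[8:]   # arms chosen by the upper confidence band
    print(tau, len(epoch), len(explore), len(exploit))
```

Because the epoch lengths grow geometrically, the parameter is re-estimated only $\tilde\tau=O(\log T)$ times, which is what keeps the best-subset-selection step computationally affordable over the whole horizon.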