title: Second Order Path Variationals in Non-Stationary Online Learning
authors: Baby, Dheeraj; Wang, Yu-Xiang
date: 2022-05-04

We consider the problem of universal dynamic regret minimization under exp-concave and smooth losses. We show that appropriately designed Strongly Adaptive algorithms achieve a dynamic regret of $\tilde O(d^2 n^{1/5} C_n^{2/5} \vee d^2)$, where $n$ is the time horizon and $C_n$ a path variational based on second order differences of the comparator sequence. Such a path variational naturally encodes comparator sequences that are piecewise linear -- a powerful family that tracks a variety of non-stationarity patterns in practice (Kim et al., 2009). The aforementioned dynamic regret rate is shown to be optimal modulo dimension dependencies and poly-logarithmic factors of $n$. Our proof techniques rely on analysing the KKT conditions of the offline oracle and require several non-trivial generalizations of the ideas in Baby and Wang (2021), where the latter work only leads to a slower dynamic regret rate of $\tilde O(d^{2.5} n^{1/3} C_n^{2/3} \vee d^{2.5})$ for the current problem.

Online Convex Optimization (OCO) (Zinkevich, 2003; Hazan, 2016) is a widely studied setup in machine learning that has witnessed a myriad of influential applications such as time series forecasting, building recommendation engines, etc. In this setting, a learner plays an iterative game with an adversary that lasts for n rounds. In each round t ∈ [n] := {1, . . . , n}, the learner makes a decision p_t that belongs to a decision space D ⊂ R^d. Then a convex loss function f_t : R^d → R is revealed by the adversary. The learner suffers a cost of f_t(p_t) at round t for making its decision.
Now, given a benchmark space of decisions W ⊆ D, we aim to study learners that can control their dynamic regret against any sequence of comparators from the benchmark:

R_n(w_{1:n}) := \sum_{t=1}^{n} f_t(p_t) - f_t(w_t),   (1)

where we abbreviate the comparator sequence w_{1:n} := w_1, . . . , w_n and each w_t ∈ W. This is known to be a good metric for characterizing the performance of a learner in non-stationary environments (Zinkevich, 2003; Zhang et al., 2018a; Cutkosky, 2020). The quantity in Eq. (1) is also sometimes referred to as universal dynamic regret (Zhang et al., 2018a) because we do not impose any constraints on the comparator sequence w_{1:n} except that each sequence member must belong to the benchmark set W. This is a different and more powerful way of tackling distribution shifts than other methods that model the environment explicitly (e.g., Besbes et al., 2015; Baby and Wang, 2020). Let us illustrate the point in a weather forecasting application in which f_t(w_t) = ℓ(y_t, x_t^T w_t), where x_t is a feature vector (e.g., humidity and temperature at day t), y_t is the actual precipitation of the next day, and ℓ is a loss function. The underlying distribution of y_t | x_t is determined by nature and could drift over time due to unobserved variables such as climate change. The approach of Besbes et al. (2015); Baby and Wang (2020) would be to assume a model, e.g., y_t = x_t^T w_t^* + noise, and control the regret against w_{1:n}^* in terms of the variation of the true regression coefficients over time. In contrast, a universal dynamic regret approach will not make any assumption about the world, but instead will compete with the best time-varying sequence of comparators that can be chosen in hindsight. In the case when the model is correct, we can choose the comparators to be w_{1:n}^*; otherwise, we can compete with the best sequence of linear predictors that optimally balances the bias and variance.
A bound on R_n(w_{1:n}) is usually expressed in terms of the time horizon n and a path variational that captures the smoothness of the comparator sequence w_{1:n}. Some examples of such path variationals include P(w_{1:n}) = \sum_{t=1}^{n-1} ‖w_t - w_{t+1}‖_2 (Zinkevich, 2003) and more recently TV(w_{1:n}) = \sum_{t=1}^{n-1} ‖w_t - w_{t+1}‖_1 (Baby and Wang, 2021).

Comparator as a discretized function. When we view the sequence of comparators as a function of time, it is natural to describe it as a discretization of (continuous-time) functions residing in some non-parametric function class. We now proceed to expand upon this idea. For a function f : [0, 1] → R that is k times (weakly) differentiable, define the Total Variation (TV) of its k-th derivative f^{(k)} to be:

TV(f^{(k)}) := \sup_{0 = x_0 < x_1 < ... < x_m = 1} \sum_{i=1}^{m} |f^{(k)}(x_i) - f^{(k)}(x_{i-1})|.   (2)

If the function has k + 1 continuous derivatives, then TV(f^{(k)}) is equivalent to \int_0^1 |f^{(k+1)}(x)| dx. Given n, C_n > 0, one may define the function space:

F_k(C_n) := { f : [0, 1] → R | TV(f^{(k)}) ≤ C_n }.   (3)

This space is known to contain functions that have a piece-wise degree-k polynomial structure (Tibshirani, 2014). We can generate interesting comparator sequence families by discretizing such function spaces. First we fix some notations. For a sequence of vectors v_{1:ℓ} := v_1, . . . , v_ℓ, define the first-order discrete difference operation D v_{1:ℓ} := (v_2 - v_1, . . . , v_ℓ - v_{ℓ-1}). For any positive integer k, the k-th order discrete difference D^k of a sequence is obtained by applying the operation D for k times. For a sequence v_{1:ℓ}, we define ‖v_{1:ℓ}‖_1 = \sum_{j=1}^{ℓ} ‖v_j‖_1.

The discrete TV class. Next, we define a sequence class which is the discrete analogue of F_k in Eq. (3) as follows:

TV^{(k)}(C_n) := { w_{1:n} : n^k ‖D^{k+1} w_{1:n}‖_1 ≤ C_n, where each w_t ∈ W }.   (4)

As noted in Baby and Wang (2020), this class features sequences such that along any coordinate j ∈ [d], the sequence w_1[j], . . . , w_n[j] is obtained by sampling a function f_j(x) ∈ F_k(C_{n,j}) at points x = i/n for i ∈ [n], with the property that \sum_{j=1}^{d} C_{n,j} = C_n.
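The discrete difference operator and the path variational of Eq. (4) are simple to compute; the sketch below is an illustrative numpy implementation (the sequence `w` is a hypothetical piecewise-linear example, not data from the paper):

```python
import numpy as np

def discrete_diff(v, k=1):
    """Apply the first-order difference operator D, k times.

    v has shape (n, d): a sequence of n vectors in R^d.
    D maps v_{1:n} to (v_2 - v_1, ..., v_n - v_{n-1}).
    """
    for _ in range(k):
        v = v[1:] - v[:-1]
    return v

def tv_variational(w, k=1):
    """n^k * ||D^{k+1} w_{1:n}||_1, the path variational of Eq. (4)."""
    n = w.shape[0]
    return n ** k * np.abs(discrete_diff(w, k + 1)).sum()

# A piecewise-linear scalar sequence: slope +1/n for 50 steps, then -1/n.
n = 100
w = np.concatenate([np.arange(50), 50 - np.arange(1, 51)]).astype(float)[:, None] / n
# D^2 w is zero except near the single slope change, so the
# TV^(1) variational stays O(1) even as n grows.
print(tv_variational(w, k=1))
```

A purely linear sequence has second differences identically zero, so its TV^(1) variational is exactly 0, matching the intuition that the variational counts slope changes.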
The multiplicative factor of n^k used in Eq. (4) arises naturally while discretizing the continuous Total Variation displayed in Eq. (2) at a resolution 1/n. For instance, if TV(f^{(k)}) is O(1), then n^k ‖D^{k+1} f_{1:n}‖_1 is also O(1), where f_{1:n} = f(1/n), f(2/n), . . . , f(1). We refer the readers to Baby and Wang (2020) for more elaborate explanations on the n^k factor.

Why is this useful? In this paper, we focus on comparators that reside in the TV^{(1)}(C_n) class. Our goal will be to bound the dynamic regret (Eq. (1)) against w_{1:n} ∈ TV^{(1)}(C_n) as a function of n and C_n. We emphasize that our algorithm does not take C_n as an input and can simultaneously compete with the TV^{(1)} family described by any C_n > 0. As discussed earlier, comparators in the TV^{(1)}(C_n) class exhibit a piece-wise linear structure across each coordinate (see Definition 1). The points where the sequence transitions from one linear structure to another can be interpreted as abrupt changes or events in the underlying comparator dynamics. Many real-world time series are known to contain piece-wise linear trends; see for example Fig. 2 or Kim et al. (2009) for more examples. Hence controlling the dynamic regret against comparators from the TV^{(1)}(C_n) class has significant practical value.

Fast rate phenomenon. Sequence classes of the form TV^{(1)}(C_n), or more generally TV^{(k)}(C_n), have gained significant attention and have been the subject of extensive study in the non-parametric regression community for over two decades (van de Geer, 1990; Donoho and Johnstone, 1998; Kim et al., 2009; Tibshirani, 2014; Wang et al., 2016). These works aim to estimate an unknown scalar (i.e., d = 1) sequence θ_{1:n} ∈ TV^{(k)}(C_n) from n noisy observations y_t = θ_t + N(0, σ^2) in an offline setting. They propose algorithms that produce estimates \hat{θ}_t, t ∈ [n], such that the expected total squared error \sum_{t=1}^{n} E[(\hat{θ}_t - θ_t)^2] is controlled.
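The discretization claim above can be checked numerically with a smooth test function of our choosing, here f(x) = x^2 (for which k = 1 and \int_0^1 |f''| = 2):

```python
import numpy as np

# f(x) = x^2 on [0, 1]: f''(x) = 2, so TV(f') = int_0^1 |f''| = 2, an O(1) quantity.
for n in (100, 1000, 10000):
    f_seq = (np.arange(1, n + 1) / n) ** 2   # samples f(1/n), f(2/n), ..., f(1)
    d2 = np.diff(f_seq, 2)                   # the second-order difference D^2 f_{1:n}
    print(n, n * np.abs(d2).sum())           # approaches 2, independently of n
```

Without the n^k factor, the raw quantity ‖D^2 f_{1:n}‖_1 would shrink like 1/n, which is why the rescaling is needed to make Eq. (4) a faithful discrete analogue of Eq. (2).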
In particular, for the case when θ_{1:n} ∈ TV^{(1)}(C_n), a (near) optimal estimation rate of Õ(n^{1/5} C_n^{2/5}) is shown to be attainable for the squared loss (Õ(·) hides poly-logarithmic factors of n). This rate is faster than the typical O(\sqrt{n C_n}) or Õ(n^{1/3} C_n^{2/3}) dynamic regret rates found in the non-stationary online learning literature (see, e.g., Zhang et al. (2018a); Baby and Wang (2021)).

Figure 1: Hierarchy of TV classes for the comparator sequence and the corresponding optimal dynamic regret rates under exp-concave and gradient smooth losses; in particular, TV^{(1)}(C_n) admits the rate Õ*(n^{1/5} C_n^{2/5}) (this work). Here Õ* hides dependencies on d and log n. κ is a constant independent of n and C_n. We assume C_n = Ω(1). BW21 refers to the work of Baby and Wang (2021).

Failure of state-of-the-art. For the purpose of facilitating quick comparisons, assuming C_n = Ω(1), it can be shown that TV^{(1)}(C_n) ⊆ TV^{(0)}(κ C_n) for a constant κ that doesn't depend on n or C_n (see Proposition 32 in the Appendix and Fig. 1). The state-of-the-art result (Baby and Wang, 2021) provides a dynamic regret (Eq. (1)) of Õ(n^{1/3} C_n^{2/3} ∨ 1), which is sub-optimal for comparators residing in the TV^{(1)}(C_n) class.

Figure 2 (caption): In both scenarios we can see that the underlying trend (obtained via an L1 Trend Filter (Kim et al., 2009)) exhibits a weakly differentiable piece-wise linear structure (orange) which belongs to an appropriate TV^{(1)} class.

Here we recall the most relevant works. The work of Baby and Wang (2020) aims at controlling Eq. (1) under squared error when noisy realizations of a sequence in a TV^{(k)} class are revealed sequentially. For this setting, they propose an algorithm, namely AdaVAW, that combines the Vovk-Azoury-Warmuth forecaster with wavelet denoising; it relies strongly on the iid noise assumption and on the losses being squared error. The absence of such stochastic assumptions and the need to handle general exp-concave losses in our setting pose a significant challenge in controlling the dynamic regret.
Overall, we can conclude that the results in the current paper dominate those of AdaVAW for TV order k = 1. As mentioned in Section 1, the work of Baby and Wang (2021) fails to attain the optimal regret rate for the current problem. We refer the readers to Appendix E for a detailed description of why the analysis of Baby and Wang (2021) fails to attain the optimal regret rate in our setting where the comparators reside in the TV^{(1)}(C_n) class. Baby et al. (2021) reported experiments where they use a Strongly Adaptive algorithm for competing against the best linear predictor in each time window for the task of forecasting COVID-19 cases. This method was shown to empirically out-perform state-of-the-art trend forecasting strategies. However, they didn't provide an analysis for this strategy, while our work supplements it with the necessary theoretical grounding, albeit with a slightly different Strongly Adaptive algorithm. An elaborate literature survey is deferred to Appendix A, also detailing connections to other works that are not discussed in Section 1.

Figure 3 (algorithm box): FLH-SIONS. Inputs: exp-concavity factor σ and n SIONS base learners E_1, . . . , E_n initialized with parameters = 2, η = σ and C = 20 (see Fig. 4). In round t, set y_t^j ← E_j(t) for all j ≤ t (the prediction of the j-th base learner at time t); at round t + 1, the weights over the base learners are updated as in standard FLH.

In this section, we formally describe the main algorithm FLH-SIONS (Follow the Leading History - Scale Invariant Online Newton Step) in Fig. 3 and provide intuition on why it can favorably control the dynamic regret against comparators from the TV^{(1)} class. For the sake of simplicity, we capture the intuition in a uni-variate setting where the comparators w_t ∈ W ⊂ R for all t ∈ [n].

Definition 1. Within an interval [a, b], we say that the comparator w_{a:b} is a linear signal or assumes a linear structure if the slope w_{t+1} - w_t is constant for all t ∈ [a, b - 1].
As described in Section 1, we are interested in competing against comparator sequences w_{1:n} that have a piece-wise linear structure (across each coordinate in multi-dimensions). The durations / intervals of [n] where the comparator is a fixed linear signal are unknown to the learner. Suppose that an ideal oracle provides us with the exact locations of these intervals of [n]. Consider an interval [a, b] provided by the oracle where the comparator assumes a fixed linear structure. An effective strategy for the learner is to deploy an online algorithm E_a that starts from time a such that within the interval [a, b] its regret

\sum_{t=a}^{b} f_t(E_a(t)) - f_t(w_t)

is controlled. Here E_a(t) is the prediction of the algorithm E_a at time t. Under exp-concave losses, an O(log n) bound on the above regret can be achieved by the SIONS algorithm (Fig. 4) from (Luo et al. (2016), Theorem 2) run with co-variates x_a^{(t)} := [1, t - a + 1]^T. In practice, the locations of such ideal intervals are unknown to us. So we maintain a pool of n base SIONS experts in Fig. 3, where the expert E_τ starts at time τ with the monomial co-variate x_τ^{(t)} = [1, t - τ + 1]^T. The adaptive regret guarantee of FLH with exp-concave losses (due to Hazan and Seshadhri (2007), Theorem 3.2) keeps the regret with respect to any base expert small. In particular, FLH-SIONS satisfies

\sum_{t=τ}^{j} f_t(p_t) - f_t(E_τ(t)) = O(log n),   (5)

where p_t are the predictions of FLH-SIONS and j ≥ τ for any τ ∈ [n]. Hence for the interval [a, b] given by the ideal oracle, it follows that

\sum_{t=a}^{b} f_t(p_t) - f_t(w_t) = \sum_{t=a}^{b} [f_t(p_t) - f_t(E_a(t))] + \sum_{t=a}^{b} [f_t(E_a(t)) - f_t(w_t)] = O(log n),

where in the last equation, we appealed to the logarithmic static regret of SIONS from Luo et al. (2016). As a minor technical remark, we note that the original results of Luo et al. (2016) assume that the losses are of the form f_j(w) = f_j(x_j^T w) for a uni-variate function f_j. However, we show in Lemma 27 (in the Appendix) that their regret bounds can be straightforwardly extended to handle multivariate losses f_j as in Line 1 of Fig. 4, which is useful in our multi-dimensional setup.
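The per-interval expert idea can be sketched with a plain Online Newton Step run over the monomial co-variates [1, t - a + 1]; this is an illustrative stand-in (SIONS is a scale-invariant refinement with its own parameterization), and the linear trend `ys` below is a hypothetical target, not data from the paper:

```python
import numpy as np

def ons_linear_expert(grad_fns, xs, gamma=2.0, eps=1.0):
    """Online Newton Step over linear predictors theta^T x_t (illustrative sketch).

    grad_fns[t](p) returns the derivative of the t-th scalar loss at prediction p;
    xs[t] is the co-variate at time t.
    """
    d = xs.shape[1]
    A = eps * np.eye(d)                  # curvature accumulator
    theta = np.zeros(d)
    preds = []
    for x, g_fn in zip(xs, grad_fns):
        p = float(theta @ x)
        preds.append(p)
        g = g_fn(p) * x                  # chain rule for a loss of a linear prediction
        A += np.outer(g, g)
        theta = theta - (1.0 / gamma) * np.linalg.solve(A, g)
    return np.array(preds)

# Expert E_a started at time a sees the monomial co-variate [1, t - a + 1]:
a, b = 1, 200
xs = np.array([[1.0, t - a + 1.0] for t in range(a, b + 1)])
ys = 0.01 * np.arange(1, b - a + 2)      # a hypothetical linear trend to track
grad_fns = [(lambda y: (lambda p: 2.0 * (p - y)))(y) for y in ys]
preds = ons_linear_expert(grad_fns, xs)
```

Within an interval where the comparator is a fixed linear signal, the expert's cumulative squared error stays far below that of any constant prediction, which is the behavior the O(log n) static regret bound formalizes.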
Thus, ultimately, the regret of the FLH-SIONS procedure is well controlled within each interval provided by the ideal oracle, allowing us to be competitive against the piece-wise linear comparator sequence from a TV^{(1)} class. We remark that while both FLH and SIONS are well-known existing algorithms, our use of them with monomial co-variates is new. Our dynamic regret analysis is new too, and it uncovers previously unknown properties of a particular combination of these existing algorithmic components using novel proof techniques.

In this section, we explain the assumptions used and the main results of this paper. Then we provide the proof summary for Theorem 3 in a uni-variate setting, highlighting the technical challenges overcome along the way. Following this, we explain how to handle multiple dimensions by constructing suitable reductions that allow us to re-use much of the analytical machinery developed for the uni-variate setting. The following are the assumptions made.

A1. For all t ∈ [n], the comparator w_t belongs to a given benchmark space W ⊂ R^d. Further, we have W ⊆ [-1, 1]^d.

A2. The loss function f_t : R^d → R revealed at time t is 1-Lipschitz in the ‖·‖_2 norm over [-20, 20]^d.

A3. The losses f_t are 1-gradient-Lipschitz over [-20, 20]^d.

A4. The losses f_t are σ exp-concave over [-20, 20]^d.

Assumptions A3 and A4 ensure the smoothness and curvature of the losses, on which we crucially rely to derive fast regret rates. Assumptions about Lipschitzness as in A2 are standard in online learning. In Assumption A1, we consider comparators that belong to a box that is smaller than the boxes in the other assumptions. This is because we allow our algorithms to be improper, in the sense that the decisions of the algorithm may lie outside the benchmark space W.
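To make Assumption A4 concrete: a twice-differentiable univariate loss f is σ-exp-concave on a set precisely when exp(-σ f) is concave there, i.e., when f''(w) ≥ σ f'(w)^2 on that set. The snippet below computes the largest admissible σ for the hypothetical squared loss f(w) = (w - y)^2 over [-20, 20] (an illustration of the definition, not a construction taken from the paper):

```python
# For f(w) = (w - y)^2 we have f'(w) = 2(w - y) and f''(w) = 2, so f is
# sigma-exp-concave on [-20, 20] whenever sigma <= 2 / max_w (2(w - y))^2.
y = 1.0                                          # a hypothetical bounded label, |y| <= 1
ws = [w / 100.0 for w in range(-2000, 2001)]     # grid over [-20, 20]
sigma_max = min(2.0 / (2.0 * (w - y)) ** 2 for w in ws if w != y)
print(sigma_max)   # = 2 / (2 * 21)^2, attained at the far boundary w = -20
```

The exp-concavity factor shrinks as the domain grows, which is one reason the assumptions above fix a bounded box such as [-20, 20]^d.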
We start with a lower bound on the dynamic regret (Eq. (1)), which is obtained by adapting the arguments in Donoho and Johnstone (1998) to the case of bounded sequences as in Assumption A1. See Appendix D for a proof.

Proposition 2. Under Assumptions A1-A4, any online algorithm necessarily suffers sup_{w_{1:n} ∈ TV^{(1)}(C_n)} R_n(w_{1:n}) = Ω(d^{3/5} n^{1/5} C_n^{2/5} ∨ d).

We have the following guarantee for FLH-SIONS.

Theorem 3. Let p_t be the predictions of the FLH-SIONS algorithm with parameters = 2, C = 20 and exp-concavity factor σ. Under Assumptions A1-A4, we have

\sum_{t=1}^{n} f_t(p_t) - f_t(w_t) = Õ(d^2 n^{1/5} C_n^{2/5} ∨ d^2),

for any C_n > 0 and any comparator sequence w_{1:n} ∈ TV^{(1)}(C_n). Here Õ hides poly-logarithmic factors of n and a ∨ b = max{a, b}.

Remark 4. Compared with the lower bound in Proposition 2, we conclude that the regret rate of the above theorem is optimal modulo factors of d and log n. We note that the guarantee of Theorem 3 is truly universal, as no a-priori knowledge of C_n is required.

Proposition 5. The same algorithm FLH-SIONS under the setting of Theorem 25 enjoys optimal rates against comparators from the TV^{(0)}(C_n) class as well. When combined with Theorem 3, we conclude that under Assumptions A1-A4, FLH-SIONS attains an adaptive guarantee of

\sum_{t=1}^{n} f_t(p_t) - f_t(w_t) = Õ(d^2 min{ n^{1/3} ‖D w_{1:n}‖_1^{2/3}, n^{1/5} (n ‖D^2 w_{1:n}‖_1)^{2/5} } ∨ d^2),

for any comparator sequence w_{1:n}. Here Õ hides poly-logarithmic factors of n and a ∨ b = max{a, b}. See Appendix C for a proof.

Remark 6. One may ask if a simpler algorithm such as carefully tuned online gradient descent (OGD) can enjoy these fast rates too. However, Proposition 2 of Baby and Wang (2020) implies that a properly tuned OGD algorithm, which is optimal against comparators in the TV^{(0)} class under convex losses, necessarily suffers a slower dynamic regret of Ω(n^{1/4}) against comparators in the TV^{(1)}(1) class under exp-concave losses (see Lemma 12 in Appendix A).
In what follows, we present several useful lemmas and provide a running sketch of how to chain them to arrive at Theorem 3 in a uni-variate setting (i.e., d = 1). Detailed proofs are deferred to Appendix B.1. Suppose that we need to compete against comparators whose TV1 distance (i.e., n ‖D^2 w_{1:n}‖_1) is bounded by some number C_n. This quantity could be unknown to the algorithm. Consider the offline oracle who has access to the entire sequence of loss functions f_1, . . . , f_n and the TV1 bound C_n. It may then solve for the strongest possible comparator respecting the TV1 bound through the following convex optimization problem:

minimize_{u_{1:n}} \sum_{t=1}^{n} f_t(u_t)
subject to  n ‖D^2 u_{1:n}‖_1 ≤ C_n,   (6a)
            -1 ≤ u_t for all t ∈ [n],   (6b)
            u_t ≤ 1 for all t ∈ [n].   (6c)

Let u_1, . . . , u_n be the optimal solution of the above problem. This sequence will be referred to as the offline optimal hence-forth. Clearly, the regret against any comparator sequence w_{1:n} ∈ TV^{(1)}(C_n) obeys

\sum_{t=1}^{n} f_t(p_t) - f_t(w_t) ≤ \sum_{t=1}^{n} f_t(p_t) - f_t(u_t),

and hence it suffices to bound the right side of the above inequality.

Lemma 7. (KKT conditions) Let u_1, . . . , u_n be the optimal primal variables and let λ ≥ 0 be the optimal dual variable corresponding to the constraint (6a). Further, let γ_t^- ≥ 0, γ_t^+ ≥ 0 be the optimal dual variables corresponding to constraints (6b) and (6c) respectively for all t ∈ [n]. By the KKT conditions, we have:

• stationarity: ∇f_t(u_t) + n λ (s_{t-2} - 2 s_{t-1} + s_t) - γ_t^- + γ_t^+ = 0 for all t ∈ [n], where s_t = sign([D^2 u_{1:n}]_t). Here sign(x) = x/|x| if |x| > 0 and any value in [-1, 1] otherwise. For convenience of notation, we also define s_{-1} = s_0 = s_{n-1} = s_n = 0.

• complementary slackness: (a) λ (n ‖D^2 u_{1:n}‖_1 - C_n) = 0; (b) γ_t^- (u_t + 1) = 0 and (c) γ_t^+ (u_t - 1) = 0 for all t ∈ [n].

Next, we provide a partition of the horizon with certain useful properties. Going forward, the idea is to bound the dynamic regret within each bin in P by an Õ(1) quantity. Then we can add them up across all bins to arrive at the guarantee of Theorem 3 (with d = 1). We pause to remark that even though this high-level idea resembles that of Baby and Wang (2021), the underlying details of our analysis to materialize this idea require highly non-trivial deviations from the path followed by Baby and Wang (2021). First, we need some definitions.
Consider a bin [i_s, i_t] ∈ P with length at least 2. Let's define a co-variate x_j := [1, j - i_s + 1]^T. Let X^T := [x_{i_s}, . . . , x_{i_t}] be the matrix of co-variates and u_{i_s:i_t} := [u_{i_s}, . . . , u_{i_t}]^T. Let β = (X^T X)^{-1} X^T u_{i_s:i_t} be the least-squares fit coefficient computed with co-variates x_j and labels u_j. Define a second moment matrix A = \sum_{j=i_s}^{i_t} x_j x_j^T. Let α := β - A^{-1} \sum_{j=i_s}^{i_t} ∇f_j(β^T x_j) x_j. (A^{-1} is guaranteed to exist when the length of the bin is at least 2.) We remind the reader that ∇f_j(β^T x_j) is a scalar as we consider uni-variate f_j in this section. We connect these quantities via a key regret decomposition as follows:

\sum_{j=i_s}^{i_t} f_j(p_j) - f_j(u_j) = [\sum_{j=i_s}^{i_t} f_j(p_j) - f_j(α^T x_j)] + [\sum_{j=i_s}^{i_t} f_j(α^T x_j) - f_j(β^T x_j)] + [\sum_{j=i_s}^{i_t} f_j(β^T x_j) - f_j(u_j)] =: T_1 + T_2 + T_3.   (7)

It can be shown that |α^T x_j| ≤ 20 = O(1). Hence the term T_1 can be controlled by an O(log n) bound due to the Strong Adaptivity of FLH-SIONS, as described in Section 3, Eq. (5). The quantity α is obtained by moving in a direction reminiscent of Newton's method. This is in sharp contrast to the one-step gradient descent update used in Baby and Wang (2021). More precisely, consider the function F(β) = \sum_{j=i_s}^{i_t} f_j(β^T x_j). Then α = β - A^{-1} ∇F(β). By exploiting the gradient Lipschitzness of f_j, the correction matrix A can be shown to satisfy the Hessian dominance ∇^2 F(β) ⪯ A. This Newton-style update is shown to keep the term T_2 negative through the following generalized descent lemma:

F(α) - F(β) ≤ - (1/2) ∇F(β)^T A^{-1} ∇F(β).

The negative descent term displayed in the above lemma is similar to the standard (squared) Newton decrement (Nesterov, 2004), in the sense that it is also influenced by the local geometry through the norm induced by the inverse correction matrix A^{-1}. We then proceed to show that the negative T_2 can diminish the effect of T_3 by keeping T_2 + T_3 an O(1) quantity. Thus the dynamic regret within the bin [i_s, i_t] ∈ P is controlled to Õ(1). Adding the bound across all bins in P from Lemma 8 yields Theorem 3 in one dimension.
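The quantities β, A, α and the descent inequality are easy to verify numerically. The sketch below uses hypothetical squared losses f_j(z) = (z - y_j)^2 / 2 (for which ∇^2 F = A exactly, so the generalized descent bound holds with equality); the sequences `u` and `y` are synthetic stand-ins for the offline optimal and the loss targets within a bin:

```python
import numpy as np

rng = np.random.default_rng(0)
i_s, i_t = 1, 30
xs = np.array([[1.0, j - i_s + 1.0] for j in range(i_s, i_t + 1)])

u = 0.02 * xs[:, 1] + 0.01 * rng.standard_normal(len(xs))  # stand-in offline optimal
y = 0.03 * xs[:, 1]                                        # targets for f_j(z) = (z - y_j)^2 / 2

beta, *_ = np.linalg.lstsq(xs, u, rcond=None)   # least-squares fit with labels u_j
A = xs.T @ xs                                   # second moment / correction matrix

def F(theta):
    """F(theta) = sum_j f_j(theta^T x_j) for the squared losses above."""
    return 0.5 * ((xs @ theta - y) ** 2).sum()

grad = xs.T @ (xs @ beta - y)                   # = sum_j grad f_j(beta^T x_j) x_j
alpha = beta - np.linalg.solve(A, grad)         # the Newton-style correction

# Generalized descent: F(alpha) - F(beta) <= -(1/2) grad^T A^{-1} grad.
lhs = F(alpha) - F(beta)
rhs = -0.5 * grad @ np.linalg.solve(A, grad)
```

Note that β is fit to the offline optimal u while F is built from the losses, so ∇F(β) is generally non-zero and the Newton correction strictly decreases F, which is exactly what makes T_2 negative in the decomposition above.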
In Appendix E we show that our partitioning scheme leads to larger average bin widths (and hence a fewer number of bins) than the scheme in Baby and Wang (2021). The high-level idea is that the smoothness of the TV^{(1)} sequence class (as described in Section 1) enables us to keep the regret within each bin at Õ(1) despite its larger width, thus leading to faster rates when summed across a fewer number of bins. A major challenge in the analysis is to prove that T_2 + T_3 = O(1) without imposing restrictive assumptions such as self-concordance or Hessian Lipschitzness as in the classical analysis of Newton's method (see, e.g., Nesterov (2004)). In the rest of this section, we outline the arguments leading to this result.

Lemma 10. We have that T_2 + T_3 = O(1), where T_2 and T_3 are as defined in Eq. (7).

Proof Sketch. The main idea is that T_2 + T_3 = O(1) even though |T_2| and |T_3| can be very large individually. Even though this is the same observation as in Baby and Wang (2021), our regret decomposition and the associated proof are more subtle and interesting, as it wasn't a-priori clear that T_2 + T_3 can possibly be bounded by O(1) for the current problem. The key novelty is that we bound T_2 + T_3 by introducing an auxiliary function that is concave in its arguments, which allows us to systematically explore the properties of its maximizers. We proceed to expand upon this proof summary further. For the sake of explaining ideas, we consider a case where the offline optimal within a bin [i_s, i_t] ∈ P doesn't touch the boundary 1 but may touch the boundary -1 at multiple time points. (In the full proof, we show that the partition P can be slightly modified so that in non-trivial cases, the offline optimal can only touch one of the boundaries due to the TV1 constraint within the bins described in Lemma 8.) Then by the complementary slackness of Lemma 7, we conclude that γ_j^+ = 0 for all j ∈ [i_s, i_t].
Our analysis starts by considering a scenario where the offline optimal touches the boundary -1 at precisely two points r, w ∈ [i_s, i_t] with r < w (see Fig. 5). Again via complementary slackness, only γ_r^- and γ_w^- can be potentially non-zero in this case. Through certain careful bounding steps, we show that:

T_2 + T_3 ≤ B(λ, γ_r^-, γ_w^-),   (8)

where B is a function jointly convex in its arguments λ, γ_r^- and γ_w^-. We treat r and w as fixed parameters. The exact form of the function B is presented in Eq. (46) in the Appendix. Then we consider the following convex optimization procedure: first, we perform a partial minimization with respect to γ_r^- and γ_w^-, keeping λ fixed. Note that even though γ_r^- ≥ 0 and γ_w^- ≥ 0 via Lemma 7, we choose to perform an unconstrained minimization with respect to these variables, as doing so can only increase the bound on T_2 + T_3. Let the optimal solutions of the partial minimization procedure be denoted by \bar{γ}_r^- and \bar{γ}_w^-. We find that:

B(λ, \bar{γ}_r^-, \bar{γ}_w^-) ≤ L(λ),   (9a)

where L(λ) is a linear function of λ that doesn't depend on r or w (Eq. (49) in the Appendix). The constrained minimum of this linear function is then found to be attained at λ = 0, and we show that L(0) = O(1), so that T_2 + T_3 = O(1) in this scenario.

This leaves us with an important question of how to handle more than two boundary touches at -1, where many of the γ_j^-, j ∈ [i_s, i_t], can potentially be non-zero. One could perform a similar unconstrained optimization as earlier with respect to all γ_j^-. However, deriving the closed-form expressions for the optimal \bar{γ}_j^- becomes very cumbersome, as it involves solving a complex system of linear equations. In the following, we argue that this general case can be handled via a reduction to the previous setting where only two dual variables γ_r^- and γ_w^- can be potentially non-zero. Specifically, we show that the same auxiliary function B as in Eq. (8) can be used to obtain

T_2 + T_3 ≤ B(λ, \bar{γ}_{\bar r}^-, \bar{γ}_{\bar w}^-),   (10)

where \bar r, \bar w, \bar{γ}_{\bar r}^- and \bar{γ}_{\bar w}^- can be computed from the sequence of dual variables γ_{i_s:i_t}^-.
Now we can proceed to optimize similarly as in Eq. (9a), with the optimization variables being λ, \bar{γ}_{\bar r}^- and \bar{γ}_{\bar w}^-, and use the same arguments as earlier to bound T_2 + T_3 = O(1). We remark that while doing so, it is an extremely fortunate fact that the partially minimized objective in Eq. (10) does not depend on the parameter values r and w. This fact, in hindsight, is what permitted us to fully eliminate the dependence of the bound on all γ_j^-, j ∈ [i_s, i_t], via the method of reduction to the case of two non-zero dual variables considered earlier.

In the rest of this section, we focus on outlining the analysis ideas that facilitated the main result, Theorem 3. The high-level idea is to construct a reduction that helps us re-use much of the machinery developed in Section 4.1. We emphasize that this reduction happens only in the analysis, and we do not run d uni-variate FLH-SIONS algorithms for handling multiple dimensions. The following lemma serves a key role in materializing the desired reduction.

Lemma 11. Let X_j ∈ R^{d×2d} be the block-structured co-variate matrix formed from a vector x_j ∈ R^{2d}: its k-th row carries y_{j,k}^T := x_j[2k-1 : 2k]^T in columns 2k-1 and 2k, and zeros elsewhere.

In multi-dimensions, we also form a partition P of the offline optimal similar to Lemma 8. Then we consider the following regret decomposition for any bin [i_s, i_t] ∈ P:

\sum_{j=i_s}^{i_t} f_j(p_j) - f_j(u_j) = T_1 + T_2 + T_3,   (11)

where T_1, T_2 and T_3 are defined analogously to Eq. (7) with linear predictors X_j α_j and X_j β_j, and we shall shortly describe how to construct the quantities X_j ∈ R^{d×2d}, α_j ∈ R^{2d} and β_j ∈ R^{2d}. For compactness of notation later, let's define α_{j,k} = α_j[2k-1 : 2k] ∈ R^2, β_{j,k} = β_j[2k-1 : 2k] ∈ R^2 and y_{j,k} = x_j[2k-1 : 2k] ∈ R^2 for some x_j ∈ R^{2d} as in Lemma 11. The Hessian dominance in Lemma 11 bounds T_2 coordinate-wise, and further, due to the gradient Lipschitzness of f_j, T_3 admits a similar coordinate-wise bound. Combining the two, we see that T_2 + T_3 in any bin [i_s, i_t] can be bounded coordinate-wise:

T_2 + T_3 ≤ \sum_{k=1}^{d} \sum_{j=i_s}^{i_t} (t_{2,j,k} + t_{3,j,k}),

where t_{2,j,k} and t_{3,j,k} are per-coordinate analogues of the summands of T_2 and T_3 built from α_{j,k}, β_{j,k} and y_{j,k}. This form allows one to bound \sum_{j=i_s}^{i_t} t_{2,j,k} + t_{3,j,k} = O(1) separately for each coordinate by constructing α_{j,k}, β_{j,k} and y_{j,k} similarly to Section 4.1. We then sum across all coordinates to bound T_2 + T_3 = O(d).
We remark that the situation is a bit more subtle here because, in order to handle certain combinatorial structures imposed by the KKT conditions, we had to use a sequence of comparators α_{i_s}, . . . , α_{i_t} (for the linear predictors in Eq. (11)) that switches at most O(d) times. Finally, by appealing to the strong adaptivity of FLH-SIONS, we show that T_1 = Õ(d^2) for each bin [i_s, i_t] ∈ P, and Theorem 3 then follows by adding the Õ(d^2) regret across all O(n^{1/5} C_n^{2/5} ∨ 1) bins in P.

In this work, we derived a universal dynamic regret rate parametrized by a novel second-order path variational of the comparators. Such a path variational naturally captures the piecewise linear structure of the comparators and can be used to flexibly model many practical non-stationarities in the environment. Our results for exp-concave losses achieved an adaptive universal dynamic regret of Õ(d^2 n^{1/5} C_n^{2/5} ∨ d^2), which matches our minimax lower bound up to a factor that depends on d and log n. This is the first result of its kind in the adversarial setting and the first that works with the general exp-concave family of losses. We conjecture that a similar algorithm as in Fig. 3, based on degree-k monomial co-variates [1, t, . . . , t^k], can lead to optimal dynamic regret for comparators from the TV^{(k)} class.

Dheeraj Baby, Xuandong Zhao, and Yu-Xiang Wang. An optimal reduction of TV-denoising to adaptive online learning. AISTATS, 2021.
Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5), 2015.

In this section, we elaborate upon the related works mentioned in Section 2. We inherit all the notations and terminologies introduced in Section 1.
Apart from the works discussed in Section 1, there is a rich body of literature on dynamic regret minimization, such as (Jadbabaie et al., 2015; Yang et al., 2016; Mokhtari et al., 2016; Chen et al., 2018; Zhang et al., 2018a,b; Yuan and Lamperski, 2020; Goel and Wierman, 2019; Baby and Wang, 2019; Zhao et al., 2020; Zhao and Zhang, 2021; Zhao et al., 2022; Chang and Shahrampour, 2021; Baby and Wang, 2022). However, to the best of our knowledge, none of these works are known to attain the optimal dynamic regret rate for our setting. We proceed to survey the literature in more detail.

Dynamic regret against TV^{(1)}(C_n) in the stochastic setting. Perhaps the most relevant to our work is that of Baby and Wang (2020). They consider an online protocol where at each round, the learner makes a prediction \hat{θ}_t ∈ R. Then a label y_t = θ_t + N(0, 1) is revealed. They assume that the ground truth sequence θ_{1:n} ∈ TV^{(k)}(C_n) (see Eq. (4)). The goal of the learner is to control the expected cumulative squared error of the learner, namely \sum_{t=1}^{n} (\hat{θ}_t - θ_t)^2. In this setting, they propose policies that can attain a near-optimal estimation error of Õ(n^{1/(2k+3)} C_n^{2/(2k+3)}) for any k > 0. In this work, we consider the case where the comparators belong to a TV^{(1)}(C_n) class (i.e., with k = 1). Further, the absence of stochastic assumptions in our setting poses a significant challenge in controlling the universal dynamic regret (Eq. (1)).

Restricted dynamic regret minimization. In this line of work, we consider a similar learning setting as mentioned in Section 1. However, the goal of the learner is to control the dynamic regret against the point-wise minimizers:

\sum_{t=1}^{n} f_t(p_t) - f_t(w_t^*),

where w_t^* ∈ argmin_{x ∈ W} f_t(x). When the losses are strongly convex and gradient smooth, Mokhtari et al. (2016) propose algorithms that can attain a restricted dynamic regret of O(1 + C_n^*), where C_n^* := \sum_{t=1}^{n-1} ‖w_t^* - w_{t+1}^*‖_2 is the path length of the point-wise minimizers. However, as noted in Zhang et al. (2018a), such a guarantee can sometimes be overly pessimistic.
For example, in the context of statistical learning where the losses are sampled iid from a distribution, the point-wise minimizers can incur a path length C_n^* = O(n) due to the random perturbations introduced by sampling.

Universal dynamic regret minimization. This is the same framework as considered in the introduction. Obtaining universal dynamic regret guarantees is challenging, since we need to bound the dynamic regret for any comparator sequence from the benchmark set W while automatically adapting to its path length. When the losses are convex, Zhang et al. (2018a); Cutkosky (2020) provide an optimal universal dynamic regret of O(\sqrt{n(1 + P_n)}), where P_n := \sum_{t=1}^{n-1} ‖w_t - w_{t+1}‖_2 is the path length of the comparators. This is a sub-optimal rate when applied to our setting, as expected, since they don't assume the losses are exp-concave. When the losses are in fact exp-concave, one can apply the result of Baby and Wang (2021) to produce a dynamic regret rate of Õ*(n^{1/3} C_n^{2/3} ∨ 1). However, as noted in Section 1, this rate is sub-optimal. In Appendix E, we give an elaborate description of why their analysis leads to a sub-optimal rate in our setting of competing against comparators from the TV^{(1)}(C_n) class.

Dynamic regret based on functional variations. It is also common to measure the non-stationarity of the environment in terms of the variation incurred by the loss function sequence. Define

D_n := \sum_{t=2}^{n} \sup_{x ∈ W} |f_t(x) - f_{t-1}(x)|.

Works such as (Besbes et al., 2015; Yang et al., 2016; Chen et al., 2018) aim at controlling the dynamic (restricted / universal) regret in terms of D_n. Jadbabaie et al. (2015) propose algorithms that can control the dynamic regret simultaneously in terms of D_n and P_n.

Static regret minimization. In classical OCO, a well-known metric is the static regret of an algorithm, namely

\sum_{t=1}^{n} f_t(p_t) - \min_{w ∈ W} \sum_{t=1}^{n} f_t(w).

Algorithms such as Online Gradient Descent (Zinkevich, 2003) can attain an optimal O(\sqrt{n}) static regret when the losses are convex.
When the losses are exp-concave or strongly convex, it is possible to attain logarithmic static regret (Hazan et al., 2007). However, static regret is not a good metric for measuring the performance of a learner in non-stationary environments. Strongly Adaptive (SA) regret minimization. This notion of regret was introduced by Daniely et al. (2015). In this framework, the learner aims to control its static regret in any local time window as a function of the window length (modulo factors of log n). The algorithms in Daniely et al. (2015); Cutkosky (2020) provide a static regret of Õ(√|I|) across any local interval I whenever the losses are convex. When the losses are strongly convex or exp-concave, the algorithms in (Hazan and Seshadhri, 2007; Adamskiy et al., 2016) yield logarithmic static regret in any local time window when the base learners are chosen appropriately. Zhang et al. (2018b) show that SA algorithms can be used to control the dynamic regret in terms of the functional variation D_n. Online non-parametric regression. In Section 1, we modelled the dynamics of the comparator sequence as a member of a non-parametric function class. Rakhlin and Sridharan (2014, 2015) study the minimax rate of learning against a non-parametric function class. They establish the right minimax rate (in terms of dependence on n) using arguments based on sequential Rademacher complexity, in a non-constructive manner. In fact, their results on Besov spaces imply that the minimax dynamic regret rate for our problem is indeed O(n^{1/5}), since the T V^{(1)} class is sandwiched between two Besov spaces having the same minimax rate (see e.g., (DeVore and Lorentz, 1993) and (Donoho and Johnstone, 1998)). There are other lines of work that study the online non-parametric regression problem against other function classes of interest, such as Gaillard and Gerchinovitz (2015) (for Sobolev functions). These function classes do not capture the T V^{(1)} class that we study in this work.
For instance, the (discretized) higher order Hölder and Sobolev classes feature sequences that are more regular than those of the T V^{(1)} class (see for example Baby and Wang (2019)). Further explanations on Remark 6. In the rest of this section, we focus on proving Remark 6. It is sufficient to focus on OGD algorithms with step-size η < 1. Otherwise, if η ≥ 1, one can come up with a sequence of losses that enforces linear regret. An example of this scenario is described as follows. The loss at time t is given by f_t(w) = (w − y_t)², with y_t = −1 at odd rounds and y_t = 1 at even rounds. The decision set is given by D = [−1, 1]. We focus on OGD algorithms which play 0 at the first round, though similar arguments can be given for any valid initialization point. Suppose the step size is η ≥ 1. The iterate at step t + 1, denoted by w_{t+1}, is maintained as w_{t+1} = clip(w_t − η∇f_t(w_t)), where the clip function clips its argument to [−1, 1]. Recall that η ≥ 1. Starting from w_1 = 0, the first update overshoots past −1 and is clipped there; every subsequent gradient step overshoots the opposite endpoint. Thus the iterates oscillate between −1 and 1. However, the best fixed comparator for the sequence of losses is given by 0, which incurs a loss of 1 per round while the oscillating iterates incur a loss of 4 per round from round 2 onward. Hence the static regret is Ω(n). Thus choosing step-size η ≥ 1 can be exploited by the adversary to enforce linear regret even for the case of static comparators. So it suffices to consider OGD algorithms with step size η < 1, as is done in the following lemma. Lemma 12. There exists a choice of loss functions, comparator sequence and decision set such that OGD with step size η < 1 necessarily suffers a dynamic regret of Ω(n^{1/4}) for all n ≥ 35. Proof. Consider a setup where the decision set is D = [−1, 1]. Under this setup, the projected online gradient descent (OGD) with learning rate η < 1 doesn't need to do any projection. This can be seen as follows. Assume that OGD up to step t doesn't project. Let Π denote the projection onto the set D. Then the iterate at time t + 1 (denoted by x_{t+1}) is given by x_{t+1} = Π(z_{t+1}), where z_{t+1} := x_t − η∇f_t(x_t). (14) We have that z_{t+1} ∈ D, where we applied the triangle inequality, summed up the infinite series and used the fact that |y_t| ≤ 1/(4√2.2). So z_{t+1} ∈ D and therefore x_{t+1} = z_{t+1}.
Hence by induction, we have that OGD with learning rate η < 1 doesn't need to project. Looking at Eq. (14) we see that the OGD output at any time is a fixed linear function of the revealed labels y_t. Baby and Wang (2020) call such forecasters linear forecasters. They provide the following proposition (paraphrased here for clarity) about such forecasters: Proposition 13. (Proposition 2 in Baby and Wang (2020) for k = 1) For any online estimator producing estimates θ̂_t which are a fixed linear function of past labels y_{1:t−1}, t ∈ [n], we have Thus we have where line (a) is due to the fact that θ̂_t and ε_t are mutually independent (because of the online nature of the algorithm) and line (b) is due to Eq. (15). Thus we conclude that where the losses f_t are as defined in the beginning of the proof. This concludes the lemma. We start with the analysis in the uni-variate setting, followed by the proof in multi-dimensions. The analysis requires rather tedious algebraic manipulations in certain places. We used Python's open-source symbolic simplification engine SymPy (Meurer et al., 2017) to assist with the algebraic manipulations. A remark. The constants occurring in the proofs may be optimized further, though we haven't aggressively focused on doing so. Lemma 7. (KKT conditions) Let u_1, . . . , u_n be the optimal primal variables and let λ ≥ 0 be the optimal dual variable corresponding to the constraint (6a). Further, let γ⁻_t ≥ 0, γ⁺_t ≥ 0 be the optimal dual variables that correspond to constraints (6b) and (6c) respectively for all t ∈ [n]. By the KKT conditions, we have Proof. By introducing auxiliary variables, we can re-write the offline optimization problem as: The Lagrangian of the optimization problem can be written as Due to the stationarity conditions w.r.t. u_t, we have where we define v_{−1} = v_0 = v_{n−1} = v_n = 0 and, due to the stationarity conditions w.r.t. v_t, we have v_t = λ sign(z_t). Combining the above two equations and the complementary slackness rule now yields the Lemma.
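Before proceeding, the step-size discussion above (before Lemma 12) can be checked numerically. The following sketch is purely illustrative: it assumes the squared loss f_t(w) = (w − y_t)² with alternating labels and clipping to [−1, 1], matching the oscillation behaviour described for η ≥ 1; it is not the paper's code.

```python
# Projected OGD on f_t(w) = (w - y_t)^2 over D = [-1, 1], with labels
# y_t = -1 at odd rounds and y_t = +1 at even rounds. With eta >= 1 the
# iterates oscillate between -1 and +1 and the static regret against the
# best fixed comparator (w = 0) grows linearly.

def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

def run_ogd(n, eta):
    w, iterates, regret = 0.0, [], 0.0
    for t in range(1, n + 1):
        y = -1.0 if t % 2 == 1 else 1.0
        iterates.append(w)
        # regret against the fixed comparator 0
        regret += (w - y) ** 2 - (0.0 - y) ** 2
        grad = 2.0 * (w - y)          # gradient of the squared loss
        w = clip(w - eta * grad)      # projected gradient step
    return iterates, regret

iterates, regret = run_ogd(n=100, eta=1.0)
# From round 2 onward the iterates alternate -1, +1, ... and each such
# round contributes 4 - 1 = 3 to the regret, so regret = 3 * 99 = 297.
```

The same simulation with a small step size (e.g., eta = 0.1) keeps the iterates strictly inside the interval, consistent with the no-projection argument in the proof of Lemma 12.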
Terminology. In what follows, we refer to u_{1:n} from the Lemma above as the offline optimal sequence. Proof. Let the total number of bins formed be M. Consider the case where M > 1. We have that where line (a) follows due to the construction of the partition and line (b) is due to Jensen's inequality applied to the convex function f(x) = 1/x^{3/2} for x > 0. Rearranging and including the trivial case where M = 1 yields the lemma. Proposition 14. In the following analysis we will often use a useful representation: WLOG, the offline optimal within a bin [a, b] can be written as m_a, m_a + m_{a+1}, . . . , Σ_{t=a}^b m_t. We can view this sequence as samples obtained from a piece-wise linear signal that is continuous at every sampling point. Let β = (X^T X)^{−1} X^T u_{a:b} be the least squares fit coefficient computed with labels u_t and co-variates x_t := [1, t − a + 1]^T. Then we have that the residuals satisfy Proof. We follow the notations of Proposition 14 for representing the offline optimal u_a, . . . , u_b. The residual at time i ∈ [a, b] can be computed through straightforward algebra as: where I{·} is the indicator function assuming value 1 if the argument evaluates true and 0 otherwise. Now we note that if all m_k for k ∈ [a + 1, b] are the same, then the residuals u_i − β^T x_i must be zero for all i, as the least squares fit exactly matches the labels in this case. In particular, this implies from Eq. (17) that (1/(ℓ² − 1)) Σ_{j=2} 6(1 + (1 − 2i)/ℓ)(ℓ − j + 1/ℓ)(ℓ + j)/2 Subtracting Eq. (18) from (17) we get, where the last line is due to Hölder's inequality. Further, we have |m_j − m_{a+1}| ≤ Σ_{t=a+2}^{j} |m_t − m_{t−1}| ≤ ‖D² u_{a:b}‖_1 by the definition of the discrete difference operator D². Now applying the triangle inequality and the crude bounds 1 + (1 − 2i)/ℓ ≤ 3, (ℓ − j + 1) ≤ ℓ, (ℓ + j) ≤ 2ℓ, i/ℓ ≤ 1, 2 ≥ 2/ℓ and −2/ℓ ≤ 0, we obtain 6(1 + (1 − 2i)/ℓ)(ℓ − j + 1/ℓ)(ℓ + j)/2 So, where the last line is due to 19ℓ² + 2 ≤ 20ℓ² − 20 for all ℓ ≥ 6. Further we have |M_a| ≤ ‖D² u_{a:b}‖_1 and |M_b| ≤ ‖D² u_{a:b}‖_1 whenever ℓ ≥ 2.
Here the semantics is that each M_t = r_{t+1} − r_t for all t > a and M_a = r_{a+1} − r_a. Any two points r_t and r_{t+1} can be joined using a unique line segment, which in turn defines C_t appropriately. Proof. By gradient Lipschitzness of f we have Now we will focus on bounding the last two terms above. From the construction of bins in Lemma 8, we know that ‖D² u_{a:b}‖_1 ≤ 1/(ℓ√ℓ). Hence we obtain using Lemma 15 that Recall the representation of the residual β^T x_t − u_t = tM_t + C_t mentioned in the lemma statement. Observe that in accordance with Proposition 14 this residual can also be viewed as samples of a piece-wise linear signal that is continuous at every sampled point. In particular, observe that for every t ∈ [a, b] we have: From the KKT conditions of Lemma 7 we have where in line (a) we used Eq. (20) and in line (b) we used the fact that By Hölder's inequality and Lemma 15, we have where the last line is due to ‖D² u_{a:b}‖_1 ≤ ℓ^{−3/2} as assumed in the lemma's statement. Putting everything together completes the proof. Next, we proceed to give useful bounds on |M_a| and |M_{b−1}|. Since M_a = M_{a+1} and C_a = C_{a+1}, we have M_a = (u_{a+1} − β^T x_{a+1}) − (u_a − β^T x_a). So from Eq. (17) we have, where in the last line we used the residual bounds above. Lemma 17. Consider a bin [a, b] ∈ P of length ℓ from Lemma 8. Suppose |u_a| < 1. Then either γ⁻_j = 0 or γ⁺_j = 0 for all j ∈ [a, b]. Proof. We will provide arguments for the case when the offline optimal first hits −1 before hitting 1 at some point in [a, b]. The arguments for the alternate case where it hits 1 first are similar. If the offline optimal hits −1 at some point in [a, b], it can only rise up to at most −1 + 1/√ℓ afterwards. This is due to the constraint ‖D² u_{a:b}‖_1 ≤ 1/ℓ^{3/2}. Since −1 + 1/√ℓ < 1 as ℓ > 1/4, we have that the offline optimal never touches 1 within the bin [a, b]. Consequently, γ⁺_j = 0 for all j ∈ [a, b] by complementary slackness. Definition 18. The slope of the optimal solution at a time point t is defined to be u_{t+1} − u_t for all t ∈ [n − 1]. Proposition 19.
The bins in P can be further refined in such a way that each bin either satisfies the condition in Lemma 17 or has constant slope, meaning the L1 TV distance is zero. Further, in doing so the size of the partition P gets increased by at most 2. Proof. Suppose that for a bin [a, b] ∈ P the offline optimal starts at 1. Then we can split that bin into two bins [a, c] and [c + 1, b] such that u_c > −1 and ‖D² u_{a:c}‖_1 = 0. Similar splitting can also be done for bins that start from −1. Observe that this refinement increases the partition size only by at most 2. Corollary 20. One powerful consequence of Lemma 17 and Proposition 19, when combined with the fact that γ⁻_t and γ⁺_t are both non-negative (Lemma 7), is that in the refined partition of Proposition 19 whenever ‖D² u_{a:b}‖_1 > 0. From here on, WLOG we will assume that the bins [a, b] in the partition P satisfy the conditions: Let A := Σ_{j=a}^b x_j x_j^T. Consider the following update: We have, Further we have the quantity in Eq. (29) bounded as above. Similar expressions can be derived for bins [a, b] that may touch the boundary 1 instead of −1. Proof. We note that due to gradient Lipschitzness of f, where Next we turn to lower bounding the above RHS, where Γ and Γ̃ are as defined in the statement of the lemma. For the sake of brevity let's denote and so ‖g‖²_{A⁻¹} = (2λ² / ((ℓ − 1)ℓ(ℓ + 1))) [ (2ℓ² − 3ℓ + 1)s²_{a−2} + (4ℓ − 4ℓ²)s_{a−2}s_{a−1} − (2ℓ² − 6ℓ + 4)s_{a−2}s_b + (2ℓ² − 2)s_{a−2}s_{b−1} + (2ℓ² + 3ℓ + 1)s²_{a−1} + (2ℓ² − 2)s_{a−1}s_b − (2ℓ² + 6ℓ + 4)s_{a−1}s_{b−1} + · · · ] Using Eq. (23) we get Using gradient Lipschitzness, the triangle inequality and Lemma 15 we have and similarly So continuing from Eq. (25), where we used ‖D² u_{a:b}‖_1 ≤ ℓ^{−3/2}. We have where the last line is obtained by arguments similar to those used for obtaining Eq. (28). By substituting the expression for A^{−1} and simplifying, Using Eq. (23), we obtain Lemma 22. (bounding T_1) Consider a bin [a, b]. Let p_t be the predictions of the FLH-SIONS algorithm with parameters = 2, C = 20 and exp-concavity factor σ.
Suppose α and β are as defined in Lemma 21. For any µ ∈ {α, β}, FLH-SIONS satisfies: Proof. We will derive the guarantee for µ = α. The guarantee for µ = β follows similarly. Let's begin by calculating ‖v‖: where line (a) is obtained via Lipschitzness and Hölder's inequality x^T y ≤ ‖x‖_1 ‖y‖_∞ and the fact that |2ℓ + 1 − 3j| ≤ 2(ℓ − 1) for all j ∈ [1, ℓ]. Similarly where we used |−3(ℓ + 1) + 6j| ≤ 3(ℓ − 1) for all j ∈ [1, ℓ]. Combining Eq. (32) and (33) we conclude that |v^T x_j| ≤ 4 + (j − a + 1) · 6/(ℓ + 1) ≤ 10, where the last line follows due to the fact (j − a + 1) ≤ ℓ. Hence by the triangle inequality we have, Further note that ‖v‖_2 ≤ 8. Notice that β = A^{−1} Σ_{j=a}^b u_j x_j, which has a similar functional form as v. Since |u_j| ≤ B for all j ∈ [n], by following similar arguments used in bounding v we obtain |β^T x_j| ≤ 10 and ‖β‖_2 ≤ 8. Hence |α^T x_j| ≤ 20. Further, ‖α‖_2 ≤ ‖v‖_2 + ‖β‖_2 ≤ 16. Lemma 23. (monotonic slopes) Consider a bin [i_s, i_t] ∈ P such that the slopes are monotonic (i.e., either non-decreasing or non-increasing). Let p_j be the predictions made by the FLH-SIONS algorithm with parameters as set in Lemma 22. Then we have, (1/(2σ)) log(1 + σn/2) + (4/σ) log n + 210408 Proof. We will consider the case of non-decreasing slopes. The alternate case can be handled similarly. Assume that the slope within the bin is not constant; otherwise we trivially get logarithmic regret, as we need only to compete with the best fixed linear fit, which is handled by the static regret of FLH-SIONS in any interval (µ = β in Lemma 22). The optimal solution within a bin of P obtained via Proposition 19 which doesn't have constant slope may touch either −1 or 1, but not both. Consider the case where the optimal touches −1. Then, as the slopes are non-decreasing, once it leaves −1 it never touches −1 again. So we can split the bin [i_s, i_t] into at most 3 bins [a, b], [b + 1, c] and [c + 1, d] such that the optimal touches −1 only within [b + 1, c]. (This bin can be empty if the optimal doesn't touch −1 anywhere within [i_s, i_t].)
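The three-way split used in the proof above can be sketched programmatically. The following is an illustrative reconstruction (the function name and the tolerance are our own, not the paper's); it relies on the touches of the boundary forming one contiguous block, which is exactly what the non-decreasing-slope argument guarantees:

```python
# Illustrative sketch: split a bin [s, t] into at most 3 sub-bins
# [a, b], [b+1, c], [c+1, d] so that the sequence u touches the boundary
# value -1 only inside the middle sub-bin. Assumes the touches of -1 are
# contiguous, as argued for non-decreasing slopes in Lemma 23.

def split_bin(u, s, t, boundary=-1.0, tol=1e-12):
    """Return up to 3 consecutive sub-bins of [s, t]; the boundary is
    touched only within the middle one (which may be absent)."""
    touches = [j for j in range(s, t + 1) if abs(u[j] - boundary) < tol]
    if not touches:                 # the middle bin is empty
        return [(s, t)]
    first, last = touches[0], touches[-1]
    bins = []
    if first > s:
        bins.append((s, first - 1))   # before the boundary is reached
    bins.append((first, last))        # the stretch touching the boundary
    if last < t:
        bins.append((last + 1, t))    # after the boundary is left
    return bins

# Example: a sequence that decreases to -1, stays there, then leaves.
u = [0.5, -0.25, -1.0, -1.0, -0.5, 0.25]
print(split_bin(u, 0, 5))   # -> [(0, 1), (2, 3), (4, 5)]
```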
Now we will bound the regret within bin [a, b]. Suppose that s_{a−1} = 1 and s_b = 1. If this condition is not satisfied, we can refine the bin [a, b] into at most 3 bins [a_1, b_1], [a_2, b_2], [a_3, b_3] such that the optimal has constant slope in the first and last bins and s_{a_2−1} = s_{b_2} = 1. This is possible because the slopes in [a, b] are non-decreasing. Let ∆ := ‖D² u_{a:b}‖_1 and ℓ := b − a + 1. Let p and q be two numbers in [0, 2]. Substituting s_{a−2} = 1 − p, s_{a−1} = 1, s_{b−1} = 1 − q and s_b = 1 into Lemma 16 and using the fact that |jM_j + C_j| ≤ 20ℓ∆ for all j ∈ [a, b] due to Lemma 15, we get where we observed that a term arising from Lemma 16, −M_a + M_{b−1} − Σ_{t=a+1}^{b} |M_t − M_{t−1}|, equals 0 as the slopes are non-decreasing. By making similar sign substitutions in Lemma 21 and noting that h = 0, we get (2ℓ² − 3ℓ + 1)p² + (2ℓ² + 3ℓ + 1)q² + 2(ℓ² − 1)pq where in the last line we used the fact that ℓ − 1 ≥ ℓ/2 and ℓ + 2 ≤ 2ℓ for all ℓ ≥ 2. Now consider the case where p ≥ q. Then, (2ℓ² − 3ℓ + 1)p² + (2ℓ² + 3ℓ + 1)q² + 2(ℓ² − 1)pq ≥ (2ℓ² − 3ℓ + 1)p² ≥ ℓ²p², where the last line holds for all ℓ ≥ 3. (If ℓ ≤ 3, the regret within the bin is trivially O(1) appealing to the Lipschitzness of the losses f_t and the boundedness of the predictions and the comparators (see proof of Lemma 22).) Thus by using ℓ − 1 ≤ ℓ and ℓ + 1 ≤ 2ℓ, we get (−2λ² / (2(ℓ − 1)ℓ(ℓ + 1))) [(2ℓ² − 3ℓ + 1)p² + (2ℓ² + 3ℓ + 1)q² + 2(ℓ² − 1)pq] ≤ −λ²p²/(2ℓ). Combining Eq. (38) and (40) and using the fact that p ≥ q, we have Similarly from (37) using p ≥ q we get Combining Eq. (41) and (42) we have where in the last line we dropped the negative term and used the fact that ∆ ≤ 1/ℓ^{3/2}. For the case of q ≥ p, we have (2ℓ² − 3ℓ + 1)p² + (2ℓ² + 3ℓ + 1)q² + 2(ℓ² − 1)pq ≥ (2ℓ² − 3ℓ + 1)q² ≥ ℓ²q², where the last line holds for all ℓ ≥ 3. This is the same expression as in Eq. (39) with p replaced by q. By replacing p with q in the arguments we detailed for the case of p ≥ q earlier, we arrive at the same conclusion that T_2 + T_3 ≤ 210152 even when q ≥ p.
Similarly, the regret within bin [c + 1, d] is also bounded by the above expression. As the slope within bin [b + 1, c] is constant, the regret incurred within this bin is trivially bounded by 256 + (1/(2σ)) log(1 + σn/2) + (4/σ) log n due to Lemma 22. Adding the regret incurred across the bins [a, b], [b + 1, c] and [c + 1, d] together yields the lemma. Next, we will focus on bounding T_2 + T_3 for general non-monotonic bins in P. Lemma 24. (non-monotonic slopes) Consider a bin [i_s, i_t] ∈ P such that the slopes are not monotonic. Let p_j be the predictions made by the FLH-SIONS algorithm with parameters as set in Lemma 22. Then we have, Proof. Let [a, b] ∈ P be a bin where the slope is not monotonic and not constant. Assume that |s_{a−1}| = |s_b| = 1. Otherwise we can split the original bin into at most 3 bins such that |s_{b_1−1}| = |s_{b_2}| = 1 and the slopes are constant in the other two bins. This is possible because the slope in [a, b] is not constant or monotonic. For a bin [a, b] we define the boundary signs to be s_{a−2}, s_{a−1}, s_{b−1} and s_b. First, we will study the case where the offline optimal touches the boundary −1 at two points r and w with r < w. The case of an arbitrary number of boundary touches will be discussed towards the end. (All arguments can be mirrored appropriately for the case where the optimal touches boundary 1.) In what follows we use the notations in the proof of Lemma 21. From Eq. (22) we have where µ ∈ R² is a vector depending on the boundary signs and the length ℓ := b − a + 1, x_r = [1, r − a + 1]^T and x_w is defined similarly. Since g + h is an affine map of [λ, γ⁻_r, γ⁻_w]^T and since A is positive definite for ℓ ≥ 2, we conclude that ‖g + h‖²_{A⁻¹} is jointly convex in λ, γ⁻_r, γ⁻_w by appealing to the convexity of the squared L2 norm. First let's focus on the case where the boundary signs obey s_{a−1} = 1 and s_b = −1. Let s_{a−2} = 1 − p and s_{b−1} = −1 + q for some p, q ∈ [0, 2].
Making these sign substitutions in Lemma 21, we get: (2 2 − 3 + 1)p 2 + (2 2 + 3 + 1)q 2 − (2 2 − 2)pq + 12( − 1)p − 12( + 1)q + 24 . where r = r − a + 1 and w = w − a + 1. Let ∆ := −3/2 . By using equation (28) and the facts − 1 ≥ /2, + 1 < , p, q ∈ [0, 2] and triangle inequality, we bound where the line (a) is obtained by equation (28) and making the boundary sign substitutions. Line (b) is obtained using the facts − 1 ≥ /2, + 2 ≤ 2 whenever ≥ 2 and p, q ∈ [0, 2] along with triangle inequality. From Eq.(29), by using similar triangle inequality based arguments and the fact that |Γ|≤ |Γ| by Holder's inequality and Corollary 20 in as above we obtain To bound T3 we observe from Lemma 16 where in the last line we used the fact that |(j − a + 1)M j + C j |≤ 20 ∆ from Lemma 15. Recall that ∆ = −3/2 . Combining all the above equations / inequalities above and Eq. (30), define: We have, The expression in Eq.(44) can be compactly written as: , where g + h is as in Eq.(43) (which only depends on the boundary signs and λ, γ − r + γ − w and rγ − r + wγ − w ) and Φ(λ,γ − r ,γ − w ) is a linear function of its arguments namely, Since we have established earlier that g 2 A −1 is convex in λ, γ − r , γ − w we will certainly have T (λ, γ − r , γ − w ) as a function jointly convex in its arguments. The function B referred in main text is defined to be: with r and w in Eq.(44) to be taken as r = r − i s + 1 and w = w − i s + 1 , = i t − i s + 1 and T (λ, γ − r , γ − w ) is as in Eq.(44). So we consider the following convex optimization problem: Note that in the program above we do unconstrained minimization over γ − r and γ − w . Doing so can only further decrease the objective function leading to a valid upper bound on T 2 + T 3 . First we will perform a partial minimization wrt the variables γ − r and γ − w . 
Differentiating the objective wrt γ − r and setting to zero yields: 2(2 2 + 3 + 1) − 12( + 1)r + 12(r ) 2 γ − r + 2(2 2 + 3 + 1) − 6( + 1)(r + w ) + 12r w γ − w = λr (24 + 6 (p − q) − 6(p + q)) − λ 2 2 (2p − q) − 6 q − 4(p + q) + 12( + 1) Similarly differentiating the objective wrt γ − w and setting to zero yields: 2(2 2 + 3 + 1) − 12( + 1)w + 12(w ) 2 γ − w + 2(2 2 + 3 + 1) − 6( + 1)(r + w ) + 12r w γ − r = λw (24 + 6 (p − q) − 6(p + q)) − λ 2 2 (2p − q) − 6 q − 4(p + q) + 12( + 1) + 1220 2 ( 2 − 1) −3/2 . Solving the above two equations yields: Substituting the above two expressions we get: − 5340λ 7.0 p − 5340λ 7.0 q + 797λ 8.0 + 1780λ 9.0 p + 1780λ 9.0 q + 744200 3.5 − 2232600 5.5 + 2232600 7.5 − 744200 9.5 2.5 − 2 4.5 + 6.5 Looking at Eq.(48) we notice that it is a linear function of λ, which defines the function L(λ) mentioned in Section 4.1 of the main text: L(λ) = − 797λ 2.0 − 1780λ 3.0 p − 1780λ 3.0 q + 2391λ 4.0 + 5340λ 5.0 p + 5340λ 5.0 q − 2391λ 6.0 − 5340λ 7.0 p − 5340λ 7.0 q + 797λ 8.0 + 1780λ 9.0 p + 1780λ 9.0 q + 744200 3.5 − 2232600 5.5 + 2232600 7.5 − 744200 9.5 2.5 − 2 4.5 + 6.5 We observe that the leading term (i.e., the term whose magnitude is biggest) in the denominator is a positive quantity namely 6.5 . The leading term in the numerator that contains λ grows as 1780λ 9 (p+q)+797λ 8 . So the unconstrained minimum of this linear function is attained at λ = −∞. Hence the constrained minimum (with constraint λ ≥ 0) of the optimization problem (47) is attained at λ = 0. We calculate the optimal objective to the constrained problem via Eq.(48) as where we consider bins with length ℓ ≥ 14. Since ℓ⁴ ≥ 2ℓ² for all ℓ ≥ 2, we continue from the previous display to obtain: Hence we have where in the last line we used the fact that ℓ − 1 ≥ ℓ/2 is satisfied for all ℓ ≥ 14 and ℓ + 1 > ℓ. Hence continuing from Eq.(45) we conclude that Next, we provide the full regret guarantee in a uni-variate setting. Theorem 25.
Let p_t be the predictions of the FLH-SIONS algorithm with parameters = 2, C = 20 and exp-concavity factor σ. Under Assumptions A1-A4, we have that, for any comparator sequence w_{1:n} ∈ T V^{(1)}(C_n). Here Õ hides poly-logarithmic factors of n and a ∨ b = max{a, b}. Proof. The proof is complete by adding the Õ(1) dynamic regret bound from Lemmas 23 and 24 across O(n^{1/5} C_n^{2/5} ∨ 1) bins in the partition P. The proof of Lemma 9 stated in the main text is similar to the arguments used to derive Eq. (21). We record it for the sake of completeness. Lemma 9. We have that T_2 ≤ −(1/2)‖∇F(β)‖²_{A⁻¹}. Proof. We follow the same notations used in defining Lemma 9 in the main text. Let's begin by calculating ‖v‖: where line (a) is obtained via Lipschitzness and Hölder's inequality x^T y ≤ ‖x‖_1 ‖y‖_∞ and the fact that |2ℓ + 1 − 3j| ≤ 2(ℓ − 1) for all j ∈ [1, ℓ]. Similarly where we used |−3(ℓ + 1) + 6j| ≤ 3(ℓ − 1) for all j ∈ [1, ℓ]. Combining Eq. (50) and (51) we conclude that |v^T x_j| ≤ 4 + (j − a + 1) · 6/(ℓ + 1) ≤ 10, where the last line follows due to the fact (j − a + 1) ≤ ℓ. Hence by the triangle inequality we have Now we bound |β^T x_j| using similar arguments. We have v := A^{−1} Σ_{j=a}^b u_j x_j. Now noting that |u_j| ≤ 1 by Assumption A1 and using similar arguments used to obtain Eq. (52), we conclude that |β^T x_j| ≤ 10. (54) So continuing from Eq. (53) we have |α^T x_j| ≤ 20. For some z = tα + (1 − t)β, t ∈ [0, 1], we have by Taylor's theorem that where in the first inequality we used the fact that ∇²F(z) ⪯ A due to the fact that the functions f_j are 1-gradient-Lipschitz in [−20, 20]^d via Assumption A3. B.2 Multi-dimensional setting splitMonotonic: Inputs-(1) offline optimal sequence (2) Lemma 26. Consider the following convex optimization problem. min over ũ_1, . . . , ũ_n, z_1, . . . , z_{n−1} Let u_1, . . . , u_n, z_1, . . . , z_{n−2} be the optimal primal variables and let λ ≥ 0 be the optimal dual variable corresponding to the constraint (74a).
Further, let γ⁻_t ≥ 0, γ⁺_t ≥ 0 be the optimal dual variables corresponding to the remaining constraints, as in Lemma 7. generateBins: Input-(1) offline optimal sequence. 1. Form consecutive bins [i_s, i_t] such that: // coarse partition based on the TV1 distance where ℓ_{a→b} := b − a + 1. 2. Let the partition of the time horizon be represented as P. 3. Initialize R ← Φ. The proof of the above Lemma is similar to that of Lemma 7 and hence omitted. Lemma 27. (Luo et al., 2016) Consider an online learning setting where at each round t, we are given a feature vector Let the function f(r) be σ exp-concave and G-Lipschitz for r ∈ R^d with ‖r‖_∞ ≤ C. Define K_t := {w ∈ R^{2d} : |x_t^T w[2k − 1 : 2k]| ≤ C ∀k ∈ [d]}. Let K := ∩_{t=1}^T K_t and g_t := ∇f_t(p_t). Consider a variant of the algorithm proposed by (Luo et al., 2016) where the algorithm makes a prediction p̂_{t+1} ∈ R^d at round t + 1 as: where A_t = εI + Σ_{s=1}^t σ g_s g_s^T, with I the identity matrix and ε an input parameter. Then for any w ∈ K we have the regret controlled as where the last line is by the definition of A_t. By using the arguments of Lemma 12 of (Hazan et al., 2007) we have Thus overall we have, where w ∈ ∩_{j=a}^b K_j and f̃ is as defined in Lemma 27. Proof. Since the loss functions f_j are σ exp-concave, by Lemma 3.3 in (Hazan and Seshadhri, 2007) Subtracting f̃_j(w) from both sides and using Lemma 27 now yields the result. Corollary 29. The number of bins M := |P| formed via a call to generateBins(u_{1:n}) is at most O(n^{1/5} C_n^{2/5} ∨ 1). Proof. The proof is similar to that of Lemma 8. Lemma 30. Let [i_s, i_t] ∈ P where P is the partition produced via the generateBins procedure. We have that the dynamic regret of FLH-SIONS within this bin is controlled as where p̂_j ∈ R^d are the predictions of the algorithm, for v ∈ R^{2d}. Next, we proceed to construct the details of a regret decomposition within a bin [i_s, i_t]: where we will construct appropriate y_j, α_j, β_j ∈ R^{2d} and X_j ∈ R^{d×2d} in what follows. AssignCo-variatesAndSlopes1: Inputs-(1) offline optimal sequence (2) 1.
Let β k be the least square fit coefficient computed with labels being u a [k], . . . , u b [k] and co-variates x j := [1, j − a + 1] T so that the fitted value at time j is given byû 3. Set y j [2k − 1 : 2k] ← x j for all j ∈ [a, b]. The alternate case where s a−1 = −1 and s b = 1 with u[k] non-increasing within [b + 1, c] can be handled similarly. All the arguments we explain for the case of offline optimal touching the boundary −1 can be mirrored to handle the case where the offline optimal touches the boundary 1. (The offline optimal can't touch both boundaries simultaneously along a coordinate, see Lemma 17) We will use 1-based indexing. (i.e v[1] denotes the first element of a vector). For each k ∈ [d]: • Call AssignCo-variatesAndSlopes1(u 1:n , [i s , a − 1], k). • Call AssignCo-variatesAndSlopes2(u 1:n , [a, b], k). • Let [b + 1, t 1 − 1], [t 1 , t 2 ], [t 2 + 1, c] be the bins returned by a call to splitMonotonic(u 1:n , [b + 1, c], k). • Call AssignCo-variatesAndSlopes1(u 1:n , [b + 1, t 1 − 1], k). • Call AssignCo-variatesAndSlopes2(u 1:n , [t 1 , t 2 ], k). • Call AssignCo-variatesAndSlopes1(u 1:n , [t 2 + 1, c], k). • Call AssignCo-variatesAndSlopes1(u 1:n , [c + 1, i t ], k). For a vector y we treat y[m : n] = [y[m], . . . , y[n]] T . Define X j ∈ R d×2d as where 0 = [0, 0] T and y j is set according to various calls of AssignCo-variatesAndSlopes1 and AssignCo-variatesAndSlopes2 as done previously. We proceed to bound T 2 + T 3 in Eq.(57). First notice that due to Taylor's theorem, where v = tα j + (1 − t)β j for some t ∈ [0, 1]. Now we use Lemma 31 to obtain, Further, due to gradient Lipschitzness, Looking at Eq. (59) and (60), we see that they decompose across each coordinate k . So we can bound T 2 + T 3 in any bin [i s , i t ] coordinate wise: where in the last line we define: Next, we proceed to bound it j=is t 2,j,k + t 3,j,k for the coordinate k with a structure as mentioned in Paragraph (A1). Recall Next, we focus on the bin [a, b] . 
We note that by construction, α j [2k − 1 : 2k] and β j [2k − 1 : 2k] are fixed for all j ∈ [a, b]. Let's denote these fixed values by α k and β k respectively. For the sake of brevity let's denote We have the relation, By the new compact notations, we have Proceeding similarly to Eq. (26) and (27) by gradient Lipschitzness we obtain, where in the last line we used Lemma 15 coordinate-wise and the fact that D 2 u a: and We observe that Eq.(64),(65),(66) are semantically same as Eq. (61), (26) and (27) respectively in the 1D case. Next, we proceed to setup a similar observation for bounding b j=a t 3,j,k . From KKT conditions in Lemma 26 and proceeding similar to the arguments in Lemma 16 we get, where similar to Lemma 16, we represent β The last line is obtained due to Lemma 15. Further, by using Lemma 15 we obtain, Combining the last two inequalities yields, We observe that the last inequality is semantically similar to Eq.(19) for 1D case. Recall that Eq.(64),(65),(66) are also semantically same as Eq.(61), (28) and (29) respectively in the 1D case. Hence we can proceed to bound b j=a t 2,j,k + t 3,j,k = O(1), using the same arguments as in Lemma 24. Observe that by construction, the slopes across coordinate k are constant in the bins [b + 1, t 1 − 1], [t 2 + 1, c] and [c + 1, i t ]. So by using similar arguments used for handling the bin [i s , a − 1] we obtain, j∈I t 2,j,k + t 3,j,k = 0, where By appealing to our reduction to 1D case facilitated by Eq. (64) and (67) and using similar arguments used to handle the monotonic slopes case as in Lemma 23 we obtain, t2 j=t1 t 2,j,k + t 3,j,k = O(1). So far we have discussed bounding it j=is t 2,j,k + t 3,j,k for a bin with structure across coordinate k as described in Paragraph (A1). We remark that if the slopes across a coordinate k assumes a monotonic structure across [i s , i t ], we can handle it in the same way as we handled the sub-bin [t 1 , t 2 ] above. We pause to remark that Eq. 
(62), (68), (69) and (70) together give a way to bound Σ_{j=i_s}^{i_t} (t_{2,j,k} + t_{3,j,k}) across any coordinate k, as we have comprehensively considered all the possible structures across a coordinate k. (The alternate cases where s_{a−1} = −1 and s_b = 1 with u[k] non-increasing within [b + 1, c] can be handled similarly to the case described in Paragraph (A1). Finally, the case where the offline optimal touches boundary 1 instead of −1 can be handled using similar arguments.) Thus overall we obtain that for any bin [i_s, i_t] ∈ P we have: where T_2 and T_3 are as defined in Eq. (57). Next, we proceed to control T_1. Recall that Let's revisit the bin [i_s, i_t] with structure as described in Paragraph (A1) across coordinate k. First we consider the bin [a, b]. Through the call to AssignCo-variatesAndSlopes2(u_{1:n}, [a, b], k) we set α_k. Further, α_j[2k − 1 : 2k] = α_k for all j ∈ [a, b]. By using similar arguments as in the proof of Lemma 22 which lead to Eq. (35), we have that |y_j[2k − 1 : 2k]^T α_j| ≤ 20. For other bins where the slope of the offline optimal across coordinate k remains constant, we set α_j[2k − 1 : 2k] for j ∈ I to be a constant value obtained as the least squares fit coefficients with co-variates y_j[2k − 1 : 2k] and labels set appropriately via the call to AssignCo-variatesAndSlopes1. Hence in this case also we have |y_j[2k − 1 : 2k]^T α_j[2k − 1 : 2k]| ≤ 10 via the arguments in Lemma 22. The alternate cases are: (i) s_{a−1} = −1 and s_b = 1 with u[k] non-increasing within [b + 1, c] as described in Paragraph (A1); (ii) the offline optimal touches boundary 1 instead of −1; (iii) the offline optimal across coordinate k is non-decreasing within [i_s, i_t]; and (iv) the offline optimal across coordinate k is non-increasing within [i_s, i_t].
In all these cases we can set the quantities α_j[2k − 1 : 2k], y_j[2k − 1 : 2k] by similar calls to AssignCo-variatesAndSlopes1 or AssignCo-variatesAndSlopes2 such that |y_j[2k − 1 : 2k]^T α_j[2k − 1 : 2k]| ≤ 20 for all j ∈ [i_s, i_t]. For example, for case (iii) we can resort to similar arguments used for handling the sub-bin [t_1, t_2], which is again similar to how we handled the bin [a, b] (see Paragraph (A1)). Further, even though we create at most 6 sub-bins across each coordinate for an interval [i_s, i_t] ∈ P (see Paragraph (A1) and the sequence of calls beneath), doing so for each coordinate can result in at most 6d sub-bins of u_{i_s:i_t} overall. However, if we consider any sub-bin [p, q], the corresponding comparator µ satisfies ‖µ‖²_2 ≤ 584d, where the last line is due to Eq. (36) and (71). There is an expert (Fig. 3) that provides the co-variate y_j[2k − 1 : 2k] to all coordinates for j ∈ [p, q]. This expert will have a regret of Õ(d) against µ via Lemma 27. By using Strong Adaptivity from Corollary 28 (set w = µ there and recall that ‖µ‖²_2 ≤ 584d) and adding the regret across all 6d sub-bins of [i_s, i_t], we obtain an Õ(d²) bound on T_1 in Eq. (57). Thus for any bin in P produced by the generateBins procedure, we have its dynamic regret bounded by Õ(d²). Proof of Theorem 3. The proof is now complete by adding the Õ(d²) dynamic regret bound across all O(n^{1/5} C_n^{2/5} ∨ 1) bins in P from Corollary 29. The proof of Lemma 11 is the same as that of the lemma below, albeit with slightly different notations for X_j. Lemma 31. Let X_j be as defined in Eq. (58). Let f̃_j(v) = f_j(X_j v) for some v ∈ R^{2d} and let Σ := X_j^T X_j ∈ R^{2d×2d}. We have, where ⊗ denotes the Kronecker product and ∘ denotes the Hadamard product. Recall that the loss functions f_j are 1-gradient-Lipschitz. So we have that I − ∇²f(b) is Positive Semi-Definite (PSD). The matrices 1 and y_j y_j^T are also PSD. Since both the Kronecker and Hadamard products preserve positive semi-definiteness, we have ∇²f̃_j(v) ⪯ Σ, which proves the lemma. Proposition 32.
Consider the sequence class TV^{(1)}(C_n) as per Eq. (4). Under Assumption A1 (see Section 4) we have that TV^{(1)}(C_n) ⊆ TV^{(0)}(2C_n + 20d). Proof. We start by considering a 1D setting. Consider a sequence w_{1:n} ∈ TV^{(1)}(C_n). We can represent it as a point-wise sum of two sequences, w_{1:n} = p_{1:n} + q_{1:n}, where q_t = β^T x_t with x_t = [1, t]^T and β the least-squares fit coefficients computed using covariates x_t and labels w_t, t ∈ [n]. Here p_{1:n} is the residual sequence obtained by subtracting the least-squares fit sequence from the true sequence. Following the terminology in Lemma 16, we can represent p_t = tM_t + C_t. Further, due to Eq. (20) (with a = 1), we have that p_{t+1} − p_t = M_{t+1}. Applying the triangle inequality to Eq. (73) gives ‖Dw_{1:n}‖₁ ≤ ‖Dp_{1:n}‖₁ + ‖Dq_{1:n}‖₁. Further, ‖Dp_{1:n}‖₁ ≤ n|M_1| + n‖D²p_{1:n}‖₁ =_(a) n|M_1| + n‖D²w_{1:n}‖₁ ≤_(b) 2n‖D²w_{1:n}‖₁, where in line (a) we used the fact that ‖D²p_{1:n}‖₁ = ‖D²w_{1:n}‖₁, as subtracting a linear sequence doesn't affect the TV1 distance, and in line (b) we applied |M_1| ≤ ‖D²w_{1:n}‖₁ as shown in Lemma 16. It remains to bound ‖Dq_{1:n}‖₁. For this we note that |q_t| ≤ 10 for all t ∈ [n] due to Eq. (54). Since q_{1:n} is a monotonic sequence, its variation satisfies ‖Dq_{1:n}‖₁ ≤ 20. Thus overall we obtain ‖Dw_{1:n}‖₁ ≤ 2n‖D²w_{1:n}‖₁ + 20 ≤ 2C_n + 20. For multiple dimensions we apply the same argument across each dimension and add up the contributions to yield the proposition. In this section, we first prove the following result. Theorem 33. Let p_t be the predictions of the FLH-SIONS algorithm with parameters = 2, C = 20 and exp-concavity factor σ. Under Assumptions A1–A4, we have, for any C_n > 0 and any comparator sequence w_{1:n} ∈ TV^{(0)}(C_n), the stated dynamic regret bound. Here Õ hides poly-logarithmic factors of n and a ∨ b = max{a, b}. Proof. The proof follows almost directly from the arguments in Baby and Wang (2021). First, we use the partition P mentioned in Lemma 30 of Baby and Wang (2021). Let u_1, . . . , u_n be the optimal solution to the above problem.
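The two key facts in the proof of Proposition 32 can be sanity-checked numerically: subtracting a least-squares linear fit leaves the second-difference variation ‖D²·‖₁ unchanged, and the first-difference variation obeys the triangle inequality over the decomposition w = p + q. A minimal sketch (the test sequence and its construction are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(x):   # first-difference variation ||D x||_1
    return float(np.abs(np.diff(x)).sum())

def tv2(x):  # second-difference variation ||D^2 x||_1
    return float(np.abs(np.diff(x, n=2)).sum())

n = 200
# an arbitrary smooth-ish test sequence w_{1:n}
w = np.cumsum(np.cumsum(0.01 * rng.standard_normal(n)))

# q: least-squares linear fit with covariates x_t = [1, t]; p: residual
t = np.arange(n)
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, w, rcond=None)
q, p = X @ beta, w - X @ beta

# subtracting a linear sequence does not change ||D^2 . ||_1
assert np.isclose(tv2(p), tv2(w))
# triangle-inequality step of the proof: ||Dw||_1 <= ||Dp||_1 + ||Dq||_1
assert tv(w) <= tv(p) + tv(q) + 1e-9
```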
Let w_j be the prediction of the FLH-SIONS algorithm at time j. Define ū_i = (1/n_i) ∑_{j=i_s}^{i_t} u_j and ũ_i = ū_i − (1/n_i) ∑_{j=i_s}^{i_t} ∇f_j(ū_i). We can use the regret decomposition of Baby and Wang (2021). For any bin [i_s, i_t] ∈ P, we can bound T_{2,i} + T_{3,i} = O(1) by using the arguments in the proof of Theorem 14 of Baby and Wang (2021), since the losses in our case are also gradient-Lipschitz as per Assumption A3. So we only need to consider the term T_{1,i}. Observe that ũ_i is bounded as per Assumptions A1–A2. Further, we can view the comparator ũ_i as a linear predictor with slope zero. The output of this linear predictor is bounded in magnitude by 2, which is less than 20. Hence FLH-SIONS under the setting of the current theorem leads to T_{1,i} = Õ(d). Since M = O(dn^{1/3} C_n^{2/3} ∨ d) for the partition in Lemma 30 of Baby and Wang (2021), adding the regret across all bins results in the theorem. Theorem 33, when combined with Theorem 3, now directly leads to Proposition 5. The result proven in this section is mainly due to the geometric arguments in Donoho et al. (1990); Donoho and Johnstone (1998) (or see Johnstone (2017) for a comprehensive monograph), with an extra technicality of handling the boundedness constraint as per Assumption A1 (in Section 4). In the proof we make extensive use of wavelet theory and refer readers to Johnstone (2017) for the necessary preliminaries. Proposition 2. Under Assumptions A1–A4, any online algorithm necessarily suffers sup_{w_{1:n} ∈ TV^{(1)}(C_n)} R_n(w_{1:n}) = Ω(d^{3/5} n^{1/5} C_n^{2/5} ∨ d). Proof. We consider a uni-variate setting with the losses f_t(w) = (d_t − w)², where d_t = u_t + N(0, 1) with u_{1:n} ∈ TV^{(1)}(C_n). At each step, d_t is revealed to the learner, as doing so can only make learning easier. Let W be the set of whole numbers.
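To see why the gradient-corrected comparator ũ_i is a useful target, consider the simplified case of quadratic losses (an assumption made here purely for illustration; the theorem handles general exp-concave, gradient-Lipschitz losses). For f_j(v) = (v − y_j)²/2, one averaged gradient step from the bin average lands exactly on the bin's least-squares minimizer:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_normal(50)           # per-round targets inside one bin [i_s, i_t]
grad = lambda v: v - y                # gradients of f_j(v) = (v - y_j)^2 / 2

u_bar = 0.3                           # stand-in for the bin average of u_j
u_tilde = u_bar - grad(u_bar).mean()  # u_tilde = u_bar - (1/n_i) sum_j grad f_j(u_bar)

# for these quadratics the corrected comparator equals the bin's least-squares minimizer
assert np.isclose(u_tilde, y.mean())
```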
For the purposes of analysis, we start with the abstract observation model of Eq. (75), where θ_j are the wavelet coefficients in a regularity-three CDJV multi-resolution basis (Cohen et al., 1993) of a function in F₁(C_n) from which the discrete samples u_{1:n} are generated. In what follows we will show that for any procedure estimating the wavelet coefficients (let the estimates be θ̂_j, j ∈ W) we have that ∑_{j∈W} (θ̂_j − θ_j)² = Ω(C^{2/5} ε^{8/5}). Due to Section 15.5 of Johnstone (2017), by taking ε = 1/√n, such a guarantee will then imply a lower bound of Ω(n^{−4/5} C^{2/5}) for (1/n) ∑_{t=1}^n (u_t − û_t)², where û_t is the estimate produced by observing the data d_t (assume C = Ω(1/√n) for now). This rate will finally imply a dynamic regret lower bound in the manner of Eq. (76), where in line (a) we used the bias–variance decomposition and the fact that û_t is independent of d_t for online algorithms. In what follows we use a dyadic indexing scheme for referring to the wavelet coefficients in Eq. (75) as θ_{jk}, which means the k-th wavelet coefficient at resolution j ≥ 0. There are 2^j wavelet coefficients at resolution j. We will also use θ_{j·} to denote the sequence of 2^j wavelet coefficients at resolution j. Let β be the subset of wavelet coefficients at resolutions less than or equal to 2, i.e., β = [θ_{0·}, θ_{1·}, θ_{2·}], which has a length of 7. Define a Besov norm as follows. It is known that A(κC) ⊆ F₁(C) for some constant 0 < κ ≤ 1 (see, e.g., Eq. (33) in Tibshirani (2014) along with Theorem 1 in Donoho and Johnstone (1998)). Since the space A(B) is solid and orthosymmetric (see Section 4.8 in Johnstone (2017)), the risk of estimating coefficients from A is lower bounded by the risk (i.e., ∑_{j≥0} (θ̂_j − θ_j)²) of the hardest rectangular sub-problem, as shown by Donoho et al. (1990). A hyper-rectangle is defined as follows: Θ(τ) = {θ : |θ_j| ≤ τ_j, j ≥ 0}. From Donoho et al.
(1990), the minimax risk over a hyper-rectangle under the observation model in Eq. (75) is known, up to constants, to be R*(τ) = ∑_j min(τ_j², ε²). So all we need to show is an appropriate hyper-rectangle (identified by τ) within A(B) whose minimax risk is sufficiently large. We next proceed to give such a hyper-rectangle. Let j* ∈ W be the smallest number such that 2^{j*} ≥ (κC)^{2/5} ε^{−2/5}. For simplicity, from now onwards, let's assume that j* is an integer that satisfies 2^{j*} = (κC)^{2/5} ε^{−2/5}. Define the hyper-rectangle coordinates by τ_{j*k} = κC 2^{−5j*/2} for all k = 0, 1, . . . , 2^{j*} − 1, and τ_{j·} = 0 for all other resolutions. Note that κC 2^{−5j*/2} = ε. The minimax risk over such a hyper-rectangle then becomes R*(τ) = 2^{j*} ε² = (κC)^{2/5} ε^{8/5}. Now it remains to verify that: 1. the hyper-rectangle in Eq. (77) is indeed in A(κC); and 2. the function produced by the coefficients in that hyper-rectangle is bounded by 1 point-wise in magnitude. First we notice that by taking ε = 1/√n as mentioned earlier, we have 2^{j*} > 4 whenever C > 4^{5/2}/√n. We first consider the case where C is within this regime. For the first item, we have that ‖τ‖_{b^{3/2}_{1,1}} = 2^{3j*/2} · 2^{j*} · κC 2^{−5j*/2} = κC, where we used the fact that j* > 2 in the regime C > 4^{5/2}/√n. Hence Θ(τ) ⊆ A(κC). For the second item, we notice that due to Lemma B.18 in Johnstone (2017), it is sufficient to show that 2^{j*/2} ‖θ_{j*·}‖_∞ = O(1). Taking ε = 1/√n as mentioned earlier, we have that 2^{j*/2} ‖θ_{j*·}‖_∞ = κC 2^{−2j*} = κ^{1/5} C^{1/5} ε^{4/5} = κ^{1/5} C^{1/5} n^{−2/5} ≤ 1 in the non-trivial regime of C ≤ n², where we recall that κ ≤ 1. For the regime where C ≤ 1/√n, the trivial lower bound of Ω(1) estimation error kicks in. Thus overall we have shown that for any online algorithm producing estimates û_t we have that ∑_{t=1}^n E[(û_t − u_t)²] = Ω(n^{1/5} C^{2/5} ∨ 1), thus obtaining a lower bound on the dynamic regret as per Eq. (76).
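The scaling claims in this hyper-rectangle construction can be checked numerically. The sketch below (with illustrative constants; since j* must be rounded up to an integer, the risk only matches (κC)^{2/5} ε^{8/5} up to a constant factor) evaluates R*(τ) = ∑_j min(τ_j², ε²) for the rectangle defined above:

```python
import numpy as np

def hyperrect_risk(tau, eps):
    """Minimax risk over a hyper-rectangle, up to constants: sum_j min(tau_j^2, eps^2)."""
    return float(np.minimum(tau ** 2, eps ** 2).sum())

kappa, C, n = 1.0, 8.0, 10 ** 6           # illustrative constants
eps = 1.0 / np.sqrt(n)

# smallest integer j* with 2^{j*} >= (kappa C)^{2/5} eps^{-2/5}
jstar = int(np.ceil(np.log2((kappa * C) ** 0.4 * eps ** -0.4)))
# tau_{j* k} = kappa C 2^{-5 j*/2} on the 2^{j*} coordinates at resolution j*
tau = np.full(2 ** jstar, kappa * C * 2.0 ** (-2.5 * jstar))

risk = hyperrect_risk(tau, eps)
target = (kappa * C) ** 0.4 * eps ** 1.6  # (kappa C)^{2/5} eps^{8/5}
# rounding j* up shrinks the risk by at most a factor of 2^4 = 16
assert target / 16 <= risk <= target * 1.0001
```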
In multiple dimensions we can consider a similar setup as before, with losses f_t(w) = ‖d_t − w‖₂² and d_t[k] = u_t[k] + N(0, 1), where u_{1:n} ∈ TV^{(1)}(C). We can consider a sequence u_{1:n} such that n‖D²u_{1:n}[k]‖₁ = C/d across each coordinate k ∈ [d]. Applying the uni-variate lower bound within each coordinate and summing across the d coordinates yields Ω(d · n^{1/5} (C/d)^{2/5} ∨ d) = Ω(d^{3/5} n^{1/5} C^{2/5} ∨ d). This completes the proof of the proposition. For the sake of conveying the ideas, we consider a uni-variate setting. First we derive a tighter regret guarantee (than the one implied by Proposition 32) of Õ(n^{1/3} C_n^{2/3} ∨ 1) for the results of Baby and Wang (2021) when applied to our setting. Then we explain the source of sub-optimality in their analysis. Let u_{1:n} be the offline optimal sequence as per Lemma 7. In accordance with the details in Section 3, we can interpret a comparator sequence u_{1:n} ∈ TV^{(1)}(C_n) as a continuous piece-wise linear sequence. Then the dynamic regret can be expressed as follows, where in line (a) we define x_t = [1, t/n]^T and α_t and β_t are chosen such that p_t = α_t^T x_t and u_t = β_t^T x_t. Further, the predictors β_t are chosen to satisfy β_t^T x_t = β_{t+1}^T x_t, so that the sequence u_{1:n} can be interpreted as a piece-wise linear signal that is also continuous at every transition point where the slope changes (see Definition 1). In line (b) we define f̃_t(v) = f_t(v^T x_t). We chose the covariates as x_t = [1, t/n]^T instead of x_t = [1, t]^T so that the losses f̃_t(v) remain Lipschitz and gradient-Lipschitz whenever |v^T x_t| = O(1), which is a requirement for the results in Baby and Wang (2021). Further, we used the fact that the sum of the differences of the slopes (see Definition 1) in the linear representation of u_{1:n} with covariates x_t = [1, t/n]^T is exactly equal to n‖D²u_{1:n}‖₁. Thus overall, we obtain that ∑_{t=1}^{n−1} ‖β_t − β_{t+1}‖₁ ≤ 2n‖D²u_{1:n}‖₁ ≤ 2C_n as u_{1:n} ∈ TV^{(1)}(C_n). Hence by the results of Baby and Wang (2021) we have that R_n(u_{1:n}) = Õ(n^{1/3} C_n^{2/3} ∨ 1). Next, we proceed to explain the source of this sub-optimality in the analysis of Baby and Wang (2021).
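The re-parametrization with covariates x_t = [1, t/n]^T and the resulting bound ∑_t ‖β_t − β_{t+1}‖₁ ≤ 2n‖D²u_{1:n}‖₁ can be verified on a toy sequence. A minimal sketch (the knot locations and slope values are arbitrary illustrative choices):

```python
import numpy as np

n = 100
# per-step slopes m_t of a continuous piecewise-linear sequence u_{1:n}
m = np.zeros(n - 1)
m[:30], m[30:70], m[70:] = 0.02, -0.01, 0.03
u = np.concatenate([[0.0], np.cumsum(m)])     # u_{t+1} - u_t = m_t

tv2 = float(np.abs(np.diff(m)).sum())         # ||D^2 u_{1:n}||_1

# Represent u_t = beta_t^T x_t with x_t = [1, t/n], so beta_t = [a_t, n * m_t].
# The intercepts a_t are pinned down by the continuity constraint
# beta_t^T x_t = beta_{t+1}^T x_t at each transition point.
a = np.zeros(n - 1)
for t in range(1, n - 1):
    a[t] = a[t - 1] + t * (m[t - 1] - m[t])
    assert np.isclose(a[t - 1] + m[t - 1] * t, a[t] + m[t] * t)  # continuity at t

# total path variation of beta_{1:n} in the l1 norm
dbeta = float(np.abs(np.diff(a)).sum() + n * np.abs(np.diff(m)).sum())

# the slope coordinate contributes exactly n * ||D^2 u||_1, and the intercept
# coordinate at most as much again, giving the 2n ||D^2 u||_1 bound
assert dbeta <= 2 * n * tv2 + 1e-9
```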
In Baby and Wang (2021) (Lemma 5), they form a partition P′ of β_{1:n} so that the i-th bin (represented by [i_s, i_t]) satisfies ‖Dβ_{i_s:i_t}[2]‖₁ ≤ 1/√(ℓ_{i_s→i_t}), where we write the bin length as ℓ_{a→b} := b − a + 1. This amounts to saying that ‖D²u_{i_s:i_t}‖₁ ≤ 1/(n √(ℓ_{i_s→i_t})). In contrast, in the partition P that we construct in Lemma 8 we have that ‖D²u_{i_s:i_t}‖₁ ≤ 1/ℓ_{i_s→i_t}^{3/2}. Comparing the previous two inequalities, we conclude that the sequence within each bin of P′ is much smoother than that of P. This will result in the formation of |P′| = O(n^{1/3} C_n^{2/3} ∨ 1) bins overall as per Eq. (78) (see Lemma 5 in Baby and Wang (2021)), which is much larger than the O(n^{1/5} C_n^{2/5} ∨ 1) bins in P. Within each bin [i_s, i_t] ∈ P′, Baby and Wang (2021) use a three-term regret decomposition as in Eq. (79), where β̄ = (1/ℓ_{i_s→i_t}) ∑_{j=i_s}^{i_t} β_j and β̃ = β̄ − (1/ℓ_{i_s→i_t}) ∑_{j=i_s}^{i_t} ∇f̃_j(β̄). Then Baby and Wang (2021) proceed to show that this one-step-gradient-descent-based decomposition is sufficient to keep T_{[i_s,i_t]} = O(1), leading to an overall regret of O(n^{1/3} C^{2/3} ∨ 1) when summed across all bins. In our case the main challenge is to keep T_{[i_s,i_t]} = Õ(1) for [i_s, i_t] ∈ P while dealing with the fact that the sequence within each bin of P is much less smooth than that in P′. We accomplish this via a Newton-step-based decomposition with a careful analysis, as detailed in Section 4.1. (It was found that the one-step gradient descent as in Eq. (79) doesn't keep T_2 negative enough to make T_2 + T_3 = O(1) for bins in P.) Even though the sequence in the bins of P is wigglier than that in the bins of P′, overall a sequence u_{1:n} from the TV^{(1)} class is much smoother than sequences from the TV^{(0)} class, as the former sequence class is formed by discretizing functions that are (weakly) differentiable (see Section 1).
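The effect of the two per-bin smoothness budgets on partition size can be illustrated with a simple greedy partitioner. This is a hypothetical stand-in for the partitioning lemmas, not the paper's exact procedures: growing bins under the looser budget ℓ^{-3/2} can never produce more bins than under the tighter budget 1/(n√ℓ), since the former dominates the latter for every ℓ ≤ n:

```python
import numpy as np

def greedy_bins(u, budget):
    """Greedily grow bins [i_s, i_t] while ||D^2 u_{i_s:i_t}||_1 <= budget(bin length).

    A simplified illustrative procedure, not the paper's exact partitioning scheme.
    """
    n, bins, i_s = len(u), 0, 0
    while i_s < n:
        i_t = i_s + 1
        while i_t < n and np.abs(np.diff(u[i_s:i_t + 1], n=2)).sum() <= budget(i_t - i_s + 1):
            i_t += 1
        bins += 1
        i_s = i_t
    return bins

rng = np.random.default_rng(2)
n = 2000
jumps = np.zeros(n - 1)
jumps[rng.choice(n - 1, size=20, replace=False)] = rng.standard_normal(20) / n
m = np.cumsum(jumps)                        # piecewise-constant slopes
u = np.concatenate([[0.0], np.cumsum(m)])   # piecewise-linear sequence

tighter = lambda L: 1.0 / (n * np.sqrt(L))  # analogue of the budget behind P'
looser = lambda L: L ** -1.5                # analogue of the budget behind P

# a point-wise larger budget can only merge bins, never split them
assert greedy_bins(u, looser) <= greedy_bins(u, tighter)
```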
This extra smoothness property is what allowed us to consider larger (in terms of mean bin width) bins, and hence a smaller partition size (when compared to P′), while still bounding the regret within each bin by Õ(1). Adding this bound across all bins in P then leads to the optimal rate of Õ(n^{1/5} C_n^{2/5} ∨ 1). T_2 + T_3 ≤ (11907200 + 200) = 11907400. The term T_1 can be bounded as T_1 ≤ 256 + (1/(2σ)) log(1 + σn/2) + (4/σ) log n = Õ(1), by Lemma 22. Now suppose that the offline optimal within bin [a, b] touches boundary −1 more than two times. In this case we propose a reduction to the previous type of analysis where only γ⁻_r and γ⁻_w are potentially non-zero. The reduction is facilitated by two observations: 1. While performing the minimization of the function T(λ, γ⁻_r, γ⁻_w) in Eq. (44) via the optimization problem (47), we neither used the fact that r and w are integers nor imposed any bounds on them. 2.
The partially minimized objective in Eq. (48) fortunately depends on neither r nor w. Now let's consider the case where an arbitrary number of the γ⁻_j, j ∈ [a, b], can be non-zero. We can then write the objective in terms of re-defined variables, where we assume that γ̃⁻_w > 0 (otherwise, we fall back to the earlier analysis). With these re-definitions we note that T(λ, γ̃⁻_r, γ̃⁻_w) is jointly convex in its arguments. This can be seen as follows: note that T(λ, γ̃⁻_r, γ̃⁻_w) assumes a form whose inner expression is an affine function of its arguments, where the last line follows due to our re-parametrizations. By following essentially the same arguments as earlier for proving the convexity of T(λ, γ⁻_r, γ⁻_w), we conclude that T(λ, γ̃⁻_r, γ̃⁻_w) is also jointly convex in its arguments. This completes our reduction to the case of two boundary touches, and the rest of the analysis proceeds by minimizing T(λ, γ̃⁻_r, γ̃⁻_w) as earlier. We now consider the case where s_{a−1} = s_b = 1. We can split the original bin [a, b] into two sub-bins [a_1, b_1] and [a_2, b_2] with a_2 = b_1 + 1 such that (i) s_{b_1} = −1 with u_{b_1+1} − u_{b_1} > u_{a_2+1} − u_{a_2}, and (ii) the slopes are non-decreasing within [a_2, b_2]. This can be achieved by picking b_1 as the last point within [a, b] where u_{b_1+1} − u_{b_1} > u_{b_1+2} − u_{b_1+1}. In the bin [a_1, b_1] we apply the previous analysis to bound the regret by Õ(1). For the bin [a_2, b_2] we resort to Lemma 23 to bound the regret by Õ(1). The analysis for the cases of boundary sign assignments s_{a−1} = −1, s_b = 1 as well as s_{a−1} = −1, s_b = −1 can be done similarly. Adding the regret bounds across all newly formed bins due to the potential splitting yields the lemma. We will call this algorithm SIONS (Scale-Invariant Online Newton Step). Proof. First we show that exp-concavity is invariant to affine transforms. Since f_t is σ exp-concave, the corresponding exp-concavity inequality holds. For the sake of brevity, let's introduce the shorthand f̃. With this, we observe that the exp-concavity inequality is preserved under the affine transform, and thus we obtain the affine invariance of exp-concavity. Note that the set K_t is convex.
This can be seen as follows: if v, w ∈ K_t, then we have |x_t^T v[2k − 1 : 2k]| ≤ C and |x_t^T w[2k − 1 : 2k]| ≤ C for all k ∈ [d]. Now for any λ ∈ [0, 1], let z = λv + (1 − λ)w. Then for any k ∈ [d] we have |x_t^T z[2k − 1 : 2k]| ≤ λ|x_t^T v[2k − 1 : 2k]| + (1 − λ)|x_t^T w[2k − 1 : 2k]| ≤ C, where the first inequality is via the triangle inequality. Thus z ∈ K_t, so the set K_t is convex. So by the properties of projection onto convex sets (see, for example, Lemma 16 in Hazan et al. (2007)) and the definition of the algorithm, we have that
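The convexity argument for K_t is easy to check numerically: membership is defined by finitely many constraints of the form |x_t^T v[2k − 1 : 2k]| ≤ C, each of which is preserved under convex combinations. A minimal sketch (the dimension, the covariate, and the sampled points are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, C = 4, 20.0
x = rng.standard_normal(2)      # a 2-dimensional covariate block standing in for x_t

def in_K(v):
    """Membership in K_t: |x^T v[2k-1:2k]| <= C for every coordinate block k."""
    blocks = v.reshape(d, 2)
    return bool(np.all(np.abs(blocks @ x) <= C + 1e-9))

v = rng.uniform(-1, 1, size=2 * d)
w = rng.uniform(-1, 1, size=2 * d)
assert in_K(v) and in_K(w)      # small points are feasible: |x^T b| <= ||x|| ||b|| < C

# every convex combination z = lam*v + (1-lam)*w stays feasible, as in the proof
for lam in np.linspace(0.0, 1.0, 11):
    assert in_K(lam * v + (1 - lam) * w)
```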