Volatility prediction comparison via robust volatility proxies: An empirical deviation perspective
Weichen Wang, Ran An, Ziwei Zhu
2021-10-04

Volatility forecasting is crucial to risk management and portfolio construction. One particular challenge of assessing volatility forecasts is how to construct a robust proxy for the unknown true volatility. In this work, we show that the empirical loss comparison between two volatility predictors hinges on the deviation of the volatility proxy from the true volatility. We then establish non-asymptotic deviation bounds for three robust volatility proxies, two of which are based on clipped data, and the third of which is based on exponentially weighted Huber loss minimization. In particular, in order for the Huber approach to adapt to non-stationary financial returns, we propose to solve a tuning-free weighted Huber loss minimization problem to jointly estimate the volatility and the optimal robustification parameter at each time point. We then inflate this robustification parameter and use it to update the volatility proxy to achieve an optimal balance between the bias and variance of the global empirical loss. We also extend this Huber method to construct volatility predictors. Finally, we exploit the proposed robust volatility proxy to compare different volatility predictors on Bitcoin market data. It turns out that when the sample size is limited, applying the robust volatility proxy gives a more consistent and stable evaluation of volatility forecasts.
Volatility forecasting is a central task for financial practitioners, who care to understand the risk levels of their financial instruments or portfolios. There has been a vast literature on improving volatility modeling for financial time series, including the famous ARCH/GARCH model for better modeling volatility clustering, its many variants and more general stochastic volatility models (Engle, 1982; Bollerslev, 1986; Baillie et al., 1996; Taylor, 1994), and on proposing better volatility predictors under different model settings and objectives (Poon and Granger, 2003; Brailsford and Faff, 1996; Andersen et al., 2005; Brooks and Persand, 2003; Christoffersen and Diebold, 2000). This list of volatility forecasting literature is only illustrative and far from complete, given the large body of research on this topic. The prediction ideas range from the simplest Exponentially Weighted Moving Average (EWMA) (Taylor, 2004), which is adopted by J. P. Morgan's RiskMetrics, to more complicated time series models and volatility models including GARCH (Brandt and Jones, 2006; Park, 2002), to option-based or macro-based volatility forecasting (Lamoureux and Lastrapes, 1993; Vasilellis and Meade, 1996; Christiansen et al., 2012), and to more advanced machine learning techniques such as the nearest neighbor truncation (Andersen et al., 2012) and Recurrent Neural Networks (RNN) (Guo et al., 2016). Correspondingly, the underlying model assumption ranges from only smoothness of nearby volatilities, to different versions of GARCH, to the Black-Scholes model (Black and Scholes, 2019) and its complicated extensions. The data distribution assumption can also vary in whether data are normally distributed, heavy-tailed with a known distribution (e.g. the t-distribution), or generally non-normal.
When data are generally non-normal, researchers have proposed to use quasi maximum likelihood estimation (QMLE) (Bollerslev and Wooldridge, 1992; Charles and Darné, 2019; Carnero et al., 2012) and its robust standard error for inference, but the theoretical results are typically asymptotic. Despite its good theoretical guarantees, industry practitioners seldom apply QMLE and tend to employ the naive approach of truncating the returns at an ad-hoc level and then applying EWMA. In this work, we consider a model assumption requiring only smoothness of volatilities. For simplicity, we also assume the volatility time series is given a-priori and that, after conditioning on the volatilities, return innovations are independent. We choose this simple setting for the following reasons. Firstly, our main focus of study is on building effective robust proxies rather than testing volatility models and constructing fancy volatility predictors. Secondly, although we ignore the weak dependency between return innovations (think of ARMA models (Brockwell and Davis, 2009) for weak dependency), the EWMA predictors and proxies can still have strong temporal dependency, due to data overlapping of a rolling window, so our analysis is still nontrivial. Also note that we allow the return time series to be non-stationary. Thirdly, the motivating example for us is volatility forecasting for the crypto market. Charles and Darné (2019) applied several versions of GARCH models characterized by short memory, asymmetric effect, or long-run and short-run movements and concluded that they all seem inappropriate for modeling Bitcoin returns. Therefore, starting from conditionally independent data without imposing a too detailed model, e.g. GARCH, may be a good general starting point for the study of robust proxies. Besides the naive EWMA predictor as our comparison benchmark, we consider a type of robust volatility predictor for the case where the instrument returns have heavy tails in their distributions. Specifically, we only require the returns to bear a finite fourth moment. We consider weighted Huber loss minimization, which turns out to be a nontrivial extension of equal-weighted Huber loss minimization. To achieve the desired rate of convergence, the optimal Huber truncation level for each sample should also depend on the sample weight. In addition, we apply a tuning-free approach following Wang et al. (2020) to tune the Huber truncation level adaptively and automatically. Unlike QMLE, our results focus on a non-asymptotic empirical deviation bound. Therefore, although the main contribution of the paper is on robust proxy construction, we also claim a separate contribution on applying Huber minimization in the EWMA fashion. Now, given two volatility predictors, the evaluation of their performance is often quite challenging due to two things: (1) selection of loss functions, and (2) selection of proxies, since obviously we cannot observe the true volatilities. The selection of loss functions has been studied by Patton (2011). In Patton's insightful paper, he defined a class of robust losses with the ideal property that for any unbiased proxy, the ranking of two predictors using one of the robust losses will always be consistent in terms of the long-run expectation. This property is desirable, for it tells risk managers to select a robust loss and then not to worry much about designing proxies. As long as the proxy is unbiased, everything should just work out.
Commonly used robust losses include the mean-squared error (MSE) and the quasi-likelihood loss (QL). However, there is one weakness of Patton's approach which has not been emphasized much in previous literature: the evaluation has to be in long-run expectation. The deviation of the empirical loss, which is what people actually use in practice, from the expected loss may still cause a large variance due to a bad choice of volatility proxy. Put another way, his theory does not tell risk managers how much an empirical loss can differ from its expected counterpart. In this work, we hope to convey the main message that besides the selection of a robust loss, the selection of a good proxy also matters for effective comparison of predictors, especially when the sample size T is not large enough. For a single time point, we show that the probability of making a false comparison could be very high. So the natural question is: by averaging the performance comparison over T time points, are we able to get a faithful comparison of two predictors with high probability, so that the empirical loss ranking does reflect the population loss ranking? The answer is that we need robust proxies in order to have this kind of guarantee. We propose three robust proxies and compare them. The first choice uses the clipped squared return at the single current time t as the proxy. This may be the simplest practical choice of a robust proxy. However, it cannot achieve the desired convergence in terms of empirical risk comparison due to the large variance from using only a single time point. The second option mimics the EWMA proxy, so now we clip and average at multiple time points close to t. To find the proper clipping, we first run EWMA tuning-free Huber loss minimization on local data for each time t. This gives a truncation level adaptive to the unknown volatility. The clipping bound is then rescaled to reflect the total sample size. According to the literature on Huber minimization (Catoni, 2012; Fan et al., 2016; Sun et al., 2020), the truncation level needs to scale with the square root of the sample size to balance the bias and variance optimally. Therefore, it is natural to rescale the clipping bound by the square root of the ratio of the total sample size T over the local effective sample size. The third proxy exactly solves the EWMA Huber minimization, again with the rescaled truncation. Compared to the first and second proxies, this gives a further improvement in the deviation bound of the proxy, which depends on the central kurtosis rather than the absolute kurtosis. We will illustrate the above claims in more detail in later sections. Huber loss minimization was proposed by Huber (1964) under Huber's ε-contamination model, and its asymptotic properties were studied in Huber (1973). At that time, the truncation level was set to a fixed value according to the 95% asymptotic efficiency rule, and "robustness" meant achieving minimax optimality under the ε-contamination model (Chen et al., 2018). But recently, Huber's M-estimator has been revisited in the regression setting under the assumption of general heavy-tailed distributions (Catoni, 2012; Fan et al., 2016). Here "robustness" slightly changes its meaning to achieving a sub-Gaussian non-asymptotic deviation bound under the heavy-tailed data assumption. In this setting, the truncation level grows with the sample size and the resultant M-estimator is still asymptotically unbiased even when the data distribution is asymmetric.
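For concreteness, the Huber loss in this heavy-tailed setting takes the standard form below (our notation; it is of the same type as the loss ℓ_τ used later in Section 3.2), with score function ψ_τ and a truncation level that grows with the sample size n:

$$
\ell_\tau(x) \;=\;
\begin{cases}
x^2/2, & |x|\le \tau,\\
\tau|x| - \tau^2/2, & |x| > \tau,
\end{cases}
\qquad
\psi_\tau(x) \;=\; \ell_\tau'(x) \;=\; \min(|x|,\tau)\,\mathrm{sgn}(x),
\qquad
\tau \;\asymp\; \sigma\sqrt{n/z},
$$

where σ² is the variance of the data and z is the deviation parameter controlling the exception probability of the sub-Gaussian bound.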
Huber's estimator fits the goal of robust volatility prediction and robust proxy construction very well, as squared returns indeed have asymmetric distributions. Since Catoni (2012), a growing literature has studied such robust estimators under heavy-tailed data (e.g., Fan et al., 2016; Sun et al., 2020; Wang et al., 2020). Robustness is indeed an important concern for real-data volatility forecasting. It has been widely observed that financial returns have fat tails. When it comes to the crypto markets, e.g. the Bitcoin product (BTC), the issue gets more serious, as crypto traders frequently experience huge jumps in the BTC price. For example, BTC plummeted more than 20% in a single day in March 2020. The lack of government regulation probably leaves the market far from efficient. This poses a stronger need for robust methodology to estimate and forecast volatility for crypto markets. Some recent works include Catania et al. (2018); Trucíos (2019); Charles and Darné (2019). With the BTC returns, we will compare the non-robust EWMA predictor with the robust Huber predictor, with different decays, and evaluate their performance using the non-robust forward EWMA proxy and the robust forward Huber proxy. Both the predictors and proxies will be rolled forward and compared at the end of each day. We apply two robust losses, MSE and QL, to evaluate their performance. Interestingly, we will see that when the sample size T is large, our proposed robust proxy will be very close to the forward EWMA proxy, and both will lead to sensible and similar comparisons. However, when T is small, the non-robust proxy could lead to a higher probability of making wrong conclusions, whereas the robust proxy, which automatically adapts to the total sample size and the time-varying volatilities, can still work as expected. This matches our theoretical findings and provides new insights about applying robust proxies for practical risk evaluation. The rest of the paper is organized as follows. In Section 2, we first review the definition of robust loss by Patton (2011) and explain our strategy for analyzing high-probability bounds of the empirical loss. We bridge the empirical loss and the unconditional expected loss by the conditional expected loss, conditioning on proxies. In Section 3, we propose three robust proxies and prove that they can all achieve the correct ranking with high probability, if measured by the conditional expected loss. However, the proxy based on Huber loss minimization will have the smallest probability of making a false comparison, if measured by the empirical loss. In Section 4, we discuss robust predictors and see why the above claim is true and why comparing robust predictors with non-robust predictors can be a valid thing to do. Simulation studies as well as an interesting case study on BTC volatility forecasting are presented in Section 5. We finally conclude the paper with some discussions in Section 6. All the proofs are relegated to the appendix. In this section, we first review the key conclusions of Patton (2011) on robust loss functions for volatility forecast comparison. We then use examples to see why we also care about the randomness from proxy deviation beyond picking a robust loss. Suppose we have a time series of returns (X_i). Patton (2011) imposes several properties on the loss function L; in particular, requiring L to be homogeneous of some order ensures that the ranking of two predictors is invariant to the re-scaling of data. Define the mean squared error (MSE) loss and quasi-likelihood (QL) loss as MSE(σ̂², h) = (σ̂² − h)² and QL(σ̂², h) = σ̂²/h − log(σ̂²/h) − 1, respectively. Here the QL loss can be viewed, up to an affine transformation, as the negative log-likelihood function of X that follows N(0, h) when we observe that X² = σ̂².
Besides, QL is always positive, and a Taylor expansion gives that QL(σ̂², h) ≈ (σ̂²/h − 1)²/2 when σ̂²/h is around 1. Patton (2011) shows that among many commonly used loss functions, MSE and QL are the only two that satisfy all three properties above. Specifically, Proposition 1 of Patton (2011) says that given that L satisfies property (a) and some regularity conditions, L further satisfies property (b) if and only if L takes the form

L(σ̂², h) = C̃(h) + B(σ̂²) + C(h)(σ̂² − h),

where C(h) is the derivative of C̃(h) and is monotonically decreasing. Proposition 2 in Patton (2011) establishes that MSE is the only proxy-robust loss that depends on σ̂² − h and that QL is the only proxy-robust loss that depends on σ̂²/h. Finally, Proposition 4 in Patton (2011) gives the entire family of proxy-robust and homogeneous loss functions, which includes QL and MSE (MSE and QL are homogeneous of order 2 and 0, respectively). Given such nice properties of MSE and QL, we mainly use MSE and QL to evaluate and compare volatility forecasts throughout this work. Besides selecting a robust loss as Patton (2011) suggested, one also has to nail down the proxy selection for prediction loss computation. The framework of Patton (2011) did not separate the randomness of the predictors from that of the proxies, and the proxy-robust property (b) compares two predictors in long-run unconditional expectation, which averages over both sources of randomness. However, it is not clear from Patton (2011), for a given selected proxy, what the probability is that we end up with a wrong comparison of two predictors. How does the random deviation of a proxy affect the comparison? Can some proxies outperform others in terms of a smaller probability of making mistakes in finite samples? In practice, one has to use the empirical risk to approximate the expected risk to evaluate volatility forecasts. This points to one important issue that property (b) neglects: property (b) concerns only the expected risk and ignores the deviation of the empirical risk from its expectation. Such empirical deviation is further exacerbated by replacing the true volatility with its proxies, jeopardizing accurate evaluation of volatility forecasts. Our strategy of analysis is as follows: we first link the empirical risk to the conditional risk (conditioning on the selected proxy), claiming that they are close with high probability (see formal arguments in Section 4), and then study the relationship between comparisons based on the unconditional risk and the conditional risk. Specifically, we are interested in comparing the accuracy of two series of volatility forecasts {h_{1t}}_{t∈[T]} and {h_{2t}}_{t∈[T]}. For notational convenience, we drop the subscript "t ∈ [T]" when we refer to a time series unless specified otherwise. The empirical loss comparison can be decomposed into the conditional loss comparison and the difference between the empirical loss and the conditional loss. Therefore, we study the following two probabilities for any ε > 0: loosely speaking, I is the probability that the conditional expected losses (conditioning on the proxies), averaged over the T time points, rank the two predictors opposite to their unconditional expected losses, and II is the probability that the empirical average loss deviates from the conditional expected loss by more than ε. We aim to select stable proxies to make I small, so that the probability of obtaining a false ranking of the empirical risk is small. Meanwhile, with a selected proxy, we hope II can be well controlled for the predictors we care to compare. Note that only randomness from the proxy matters in I, so we can focus on proxy design by studying this quantity. Then we would like to make sure the difference between the empirical risk and the conditional risk is indeed small via studying II. The probability in II is with respect to both the proxy and the predictor.
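As a concrete illustration of how these two losses are used to compare predictors empirically, the following sketch (ours, not the authors' code; all names are illustrative) computes the average empirical MSE and QL of two forecast series against a chosen proxy series, which is the quantity whose ranking we ultimately want to trust:

```python
import numpy as np

def mse_loss(proxy_var, h):
    """MSE(sigma_hat^2, h) = (sigma_hat^2 - h)^2."""
    return (np.asarray(proxy_var) - np.asarray(h)) ** 2

def ql_loss(proxy_var, h):
    """QL(sigma_hat^2, h) = sigma_hat^2/h - log(sigma_hat^2/h) - 1."""
    r = np.asarray(proxy_var) / np.asarray(h)
    return r - np.log(r) - 1.0

def empirical_loss(proxy_var, h, loss=mse_loss):
    """Average empirical loss over t = 1, ..., T for one predictor series."""
    return float(np.mean(loss(proxy_var, h)))

def compare_predictors(proxy_var, h1, h2, loss=mse_loss):
    """Empirical loss difference; a negative value ranks h1 above h2."""
    return empirical_loss(proxy_var, h1, loss) - empirical_loss(proxy_var, h2, loss)

# Toy usage: a slowly varying true variance, a squared-return-style proxy,
# and two predictors (the oracle and a rescaled version of it).
rng = np.random.default_rng(0)
true_var = 1e-4 * np.exp(0.3 * np.sin(np.arange(500) / 20.0))
proxy = true_var * rng.chisquare(df=1, size=500)   # conditionally unbiased but noisy
h1, h2 = true_var, 1.5 * true_var
print(compare_predictors(proxy, h1, h2, mse_loss),
      compare_predictors(proxy, h1, h2, ql_loss))
```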
By following this analyzing strategy, we separate the randomness from predictor and proxy and eventually give results on empirical deviation rather than in expectation. To illustrate this issue, we first focus on a single time point t. We compare two volatility forecasts We are interested in the probability of having a reverse rank of forecast precision between h 1t and h 2t , conditioning on the selected Note that this probability is with respect to the randomness of the proxy σ 2 t ; in the sequel, we show that it may not be small for a general proxy. But if we can select a good proxy to control this probability well, we can ensure a correct comparison with high probability. Now consider MSE and QL as the loss functions, so that we can derive explicitly the condition for the empirical risk comparison to be consistent with the expected risk comparison. For simplicity, i.e., the probability of having the forecast rank in conditional expectation be opposite of the rank in unconditional expectation. When L is chosen to be MSE, we have Therefore, For illustration purposes, consider a deterministic scenario where h 1t = σ 2 t is the oracle predictor, and where h 2t = σ 2 t + √ η t (so that (2.6) holds). Then We can see from the two equations above that a large deviation of σ 2 t from σ 2 t gives rise to inconsistency between forecast comparisons based on empirical risk and expected risk. When we choose L to be QL, we have that Similarly, we consider a deterministic setup where h 1t = σ 2 t , and where h 2t = mσ 2 t with a misspecified scale. To ensure that EL( if m > 1, Similarly, we can see that the volatility forecast rank will be flipped once the deviation of σ 2 t from σ 2 t is large. Note again that in the derivation above, the probability of reversing the expected forecast rank is evaluated at a single time point t, which is far from enough to yield reliable comparison between volatility predictors. The common practice is to compute the empirical average loss of the predictors over time for their performance evaluation. Two natural questions arise: Does the empirical average completely resolve the instability of forecast evaluation due to the deviation of volatility proxies? If not, how should we robustify our volatility proxies to mitigate their empirical deviation? 3 Robust volatility proxies Our goal in this section is to construct robust volatility proxies { σ 2 t } to ensure that {h 1t } maintains empirical superiority with high probability, or more precisely, that P Gt We first present our assumption on the data generation process. Assumption 1. Given the true volatility series {σ t }, instrument returns {X t } are independent with EX t = 0 and Var(X t ) = σ 2 t . The central and absolute fourth moments of X t , denoted by Now we introduce some quantities that frequently appear in the sequel. At time t, define the smoothness parameters where w s,t = λ s−t / t+m j=t λ j−t is the forward exponential-decay weight at time s from time t with rate λ, and where ν s,t = λ t−1−s / t−1 s=t−m λ t−1−s is the backward exponential-decay weight with rate λ. These smoothness parameters characterize how fast the distribution of volatility varies as time evolves, and our theory explicitly derives their impact. As we shall see, our robust volatility proxies yield desirable statistical performance as long as these smoothness parameters are small, meaning that the variation of the volatility distribution is slow. 
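As a small illustration (our own sketch; the closed-form effective sample size below is the standard choice (Σ_s w_{s,t})²/Σ_s w²_{s,t}, consistent with the limits reported in Remark 1 below), the forward and backward exponential-decay weights can be computed as follows:

```python
import numpy as np

def forward_weights(lam, m):
    """Forward exponential-decay weights w_{s,t}, s = t, ..., t+m (normalized to sum to one)."""
    w = lam ** np.arange(m + 1)
    return w / w.sum()

def backward_weights(lam, m):
    """Backward exponential-decay weights nu_{s,t}, s = t-m, ..., t-1 (heaviest at s = t-1)."""
    v = lam ** np.arange(m - 1, -1, -1)
    return v / v.sum()

def effective_sample_size(weights):
    """n_eff = (sum_s w_s)^2 / sum_s w_s^2; for forward exponential weights this
    tends to (1 + lam)/(1 - lam) as m -> infinity and to m + 1 as lam -> 1."""
    w = np.asarray(weights)
    return float(w.sum() ** 2 / (w ** 2).sum())

lam = 0.5 ** (1 / 7)                 # decay implied by a 7-day half-life
w_fwd = forward_weights(lam, m=14)   # m set to twice the half-life
print(effective_sample_size(w_fwd))
```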
Besides, define the forward and backward effective sample sizes as respectively, and define the forward and backward exponential-weighted moving average (EWMA) of the central fourth moment as We first review the tuning-free adaptive Huber estimator proposed in Wang et al. (2020) . Define the Huber loss function τ (x) with robustification parameter τ as of Y satisfying that EY = µ and that var(Y ) = σ 2 . The Huber mean estimator µ is obtained by solving the following optimization problem: Fan et al. (2017) show that when τ is of order σ √ n, µ achieves the optimal statistical rate with a sub-Gaussian deviation bound: In practice, σ is unknown, and one therefore has to rely on cross validation (CV) to tune τ , which incurs loss of sample efficiency. Wang et al. (2020) propose a data-driven principle to estimate µ and the optimal τ jointly by iteratively solving the following two equations: where z is the same deviation parameter as in (3.4). Specifically, we start with θ (0) =Ȳ and solve the second equation for τ (1) . We then plug τ = τ (1) into the first equation to get θ (1) . We repeat these two steps until the algorithm converges and use the final value of θ as the estimator for µ. Wang et al. (2020) proved that (i) if θ = µ in the second equation above, then its solution gives τ = σ n/z with probability approaching 1; (ii) if we choose τ = σ n/z in the first equation above, its solution satisfies (3.4), even when Y is asymmetrically distributed with heavy tails. Note that Wang et al. (2020) call the above procedure tuning-free, in the sense that the knowledge of σ is not needed, but we still have the deviation parameter z used to control the exception probability. The paper suggested to use z = log n in practice. In the context of volatility forecast, σ t always varies across time. The well known phenomenon of volatility clustering in the financial market implies that σ t typically changes slowly, so that we can borrow data around time t to help with estimating σ 2 t with little bias. A common practice in quantitative finance is to exploit an exponential-weighted average of {X 2 s } t+m s=t to estimate σ 2 t , thereby discounting the importance of data that are distant from time t. To accommodate such exponential-decaying weights, we now propose a sample-weighted variant of the Huber estimator for volatility estimation as follows: where {w s,t } s∈{t,t+1,...,t+m} are the sample weights. Note that the robustification parameters for the observations can be different: intuitively, the higher the sample weight is, the lower τ should be, so that we can better guard against heavy-tailed deviation of important data points. More technical justification on such choice of robustification parameters is given after Theorem 1. Correspondingly, to adaptively tune τ , we iteratively solve the following two equations for τ t and θ t until convergence: where σ 2 t equals θ t that solves the first equation of (3.7). Remark 1. Given that {w s,t } t+m s=t are forward exponential-decay weights, we have that which converges to (1 + λ)/(1 − λ) as m → ∞, and which converges to m + 1 as λ → 1. Therefore, n † eff → ∞ requires both that m → ∞ and that λ → 1. As n † eff → ∞, when δ 0,t = o( 1/n † eff ) and δ 1,t = o(1/n † eff ), the exception probability is of order e −z+o(z) , which converges to 0 as z → ∞. Remark 2. 
One crucial step of our proof is using Bernstein's inequality to control the derivative of the weighted Huber loss with respect to θ, i.e., L τ (θ; {w s,t }) = t+m s=t w s,t min(|X 2 s − θ|, τ t /w s,t )sgn(X 2 s − θ). Through setting τ t /w s,t as the robustification parameter for the data at time s, we can ensure that the corresponding summand in the derivative is bounded by τ t in absolute value, which allows us to apply Bernstein's inequality. This justifies from the technical perspective our choices of the robustification parameters for different sample weights. Remark 3. Theorem 1 is in fact not limited to exponential-decay weights {w s,t } t+m s=t . Bound (3.8) applies to any sample weight series. Our next theorem provides theoretical justification of the second equation of (3.7). It basically says that the solution to that equation gives an appropriate value of τ t . Theorem 2. On top of Assumption 1, we further assume that δ 1,t n † eff ≤ c 1 κ † t and w s,t 1/m for all t ≤ s ≤ t + m. If θ t = σ 2 t in the second equation of (3.7), we have that as n † eff , z → ∞ with z = o(n † eff ), its solution τ t κ † t n † eff z with probability approaching 1. Remark 4. Define the half-life parameter l := log(1/2)/ log(λ). If we fix m/l = C for a universal constant C, which is common practice in volatility forecast, then we can ensure that {w s,t } t≤s≤t+m are all of order 1/m. We are now in position to evaluate I, which concerns average deviation of the volatility proxies over all the T time points. To illustrate the advantage of the sample-weighted Huber volatility proxy as proposed in (3.6), we first introduce and investigate two benchmark volatility proxies that are widely used in practice. Then we present our average deviation analysis of the sample-weighted Huber proxy. The first benchmark volatility proxy, which we denote by ( σ c ) 2 t , is simply a clipped squared return: where c t is the clipping threshold, and where z is a similar deviation parameter as in (3.7). Here τ t is tuned similarly as in (3.7), except that now the second equation of (3.9) does not depend on σ 2 c and thus can be solved independently. Following Theorem 2, we can deduce that τ t κ † t n † eff z and thus that c t κ † t T z . The main purpose of choosing such a rate of c t is to balance the bias and variance of the average of ( σ c ) 2 t over T time points. The following theorem develops a non-asymptotic bound for the average relative deviation of σ 2 c . Theorem 3. Under Assumption 1, if c t = κ † t T /z in (3.9), for any bounded series {q t } t∈ [T ] such that max t∈[T ] |q t | ≤ Q and any z > 0, we have P 1 T The second benchmark volatility proxy, which we denote by ( σ e ) 2 t , is defined as (3.10) The second equation of (3.10) is the same as that of (3.9). The only difference between σ 2 e and σ 2 c is that σ 2 e exploits not only a single time point, but multiple data points in the near future to construct the volatility proxy. Accordingly, the clipping threshold is updated as τ t T /n eff . The following theorem characterizes the average relative deviation of σ 2 e . Theorem 4. Under Assumption 1, for any (z, c) > 0 satisfying that and any bounded series Remark 5. Let z = log T . To achieve the optimal rate of log T /T , Theorem 4 requires that 1 T T t=1 δ 0,t qt σ 2 t is of order log T /T . We also require n † eff to be at least the order of √ T . Remark 6. 
One technical challenge of proving Theorem 4 lies in the overlap of squared returns that are used to construct neighboring σ 2 t , which leads to temporal dependence across {( σ e ) 2 t } t∈ [T ] . To resolve this issue, we apply a more sophisticated variant of Bernstein's inequality for time series data (Zhang, 2021). See Lemma 1 in the appendix. We now move on to the Huber volatility proxy. At time t, denote the solution to (3.7) by ( θ t , τ t ). Note that τ t is tuned based on just m data points. Given that now our focus is on the average deviation of volatility proxies over T data points, we need to raise our robustification parameters to reduce the bias of our Huber proxies and rebalance the bias and variance of the average deviation. After all, averaging over a large T mitigates the impact of possible tail events, so that we can relax the thresholding effect of the Huber loss. Specifically, let c t = τ t T /n † eff , which is of order κ † t T /(n †2 eff z) according to Theorem 2. Then we substitute τ t = c t into the first equation of (3.7) and solve for σ 2 t therein to obtain the adjusted proxy; that is to say, the final ( σ H ) 2 t satisfies the following: The inflation factor T /n † eff in c t implies that the larger sample size we have, the closer the corresponding Huber loss is to the square loss. This justifies the usage of vanilla EWMA proxy as the most common practice in financial industry when the total evaluation period is long. However, when T is not sufficiently large, the Huber proxy yields more robust estimation of the true volatility. The following theorem characterizes the average relative deviation of the Huber proxies. Theorem 5. Under Assumption 1, for any (c, z) > 0 satisfying that Remark 7. Compared with the previous two benchmark proxies, the main advantage of the Huber volatility proxy is that its average deviation error depends on the central fourth moment of the returns instead of the absolute one. Remark 8. α-strong convexity is a standard assumption that can be verified for Huber loss. For example, Proposition B.1 of Chen and Zhou (2020) shows that the equally weighted Huber loss L τt (θ; {1/(m + 1)} t+m s=t ) enjoys strong convexity for α = 1/4 in the region (σ 2 t − r, σ 2 t + r) where r τ with probability at least 1 − e −t when m ≥ C(1 + t). Such strong convexity paves the way to apply Lemma 1 to the Huber proxies, so that we can establish their Bernstein-type concentration. Please refer to Lemma 2 in the appendix for details. Remark 9. To achieve the optimal log T /T rate of convergence if we choose z log T , we need to additionally assume 1 T T t=1 is of a smaller order, which in practice requires certain volatility smoothness. In this section, we further take into account the randomness from predictors and study bounding II, the difference between empirical risk and conditional risk. We essentially follow (3.6), the sample-weighted Huber mean estimator, to construct robust volatility predictors. The only difference is that now we cannot touch any data beyond time t; we can only look backward at time t. Consider the following volatility predictor based on the past m data points with backward exponential-decay weights: Similarly to (3.7), we iteratively solve the following two equations for h t and τ t simultaneously: where we recall that ν s,t = λ t−1−s / t−1 s=t−m λ t−1−s . Theorem 1 showed concentration of σ 2 t around σ 2 t , i.e. the loss is MSE. More generally, we hope to give results on P(L(σ 2 t , h t )−min h L(σ 2 t , h t ) > ε). 
According to (a) in Section 2.1, min h L(σ 2 t , h t ) = f (σ 2 t ) + B(σ 2 t ) + C(σ 2 t )σ 2 t . Therefore, we hope to bound This should be easy to control if we assume smoothness of the loss fucntion. We give the following theorem. where σ 2 t is the solution of the first equation. Remark 10. Here for notational simplicity, we used the same estimation horizon m, same exponentialdecay λ for constructing both predictors and proxies. But in practice, of course they do not necessarily need to be the same. In our real data example, we will use a slower decay for constructing predictors while using a faster decay for proxies, which seems to be the common practice for real financial data, where we typically use more data for constructing predictors and less data for constructing proxies. We will stick to m equals twice the half-life so that equivalently, we use a longer window for predictors and shorter window for proxies. Remark 11. In addition, here we also simplify the theoretical results by assuming the same z for constructing proxies and predictors. We do not need to use the same z controlling the tail probability for predictors and proxies practically. For predictors, as we focus on local performance, it is more natural to use z = log n ‡ eff following Wang et al. (2020) . For proxies, as we focus on overall evaluation, for a given T we can take z = log T . Sometimes, we want to monitor the risk evaluation as T grows, then a changing z may not be a good choice; we do not want to re-solve the local weighted tuning-free Huber problem every time T changes. Therefore, we recommend to use z = C log n † eff for a slightly larger C, e.g. z = 2 log n † eff as in our real data analysis. Recall a robust loss satisfies E{L( So we further bound II as follows: We wish to show that both ∆ A and ∆ B can achieve the desired rate of log T /T for a broad class of predictors and loss functions. When h t is the proposed robust predictor in Section 4.1, we can obtain sharp rates of ∆ A and ∆ B as expected. Moreover, for the vanilla (non-robust) EWMA predictor h t = t−1 s=t−m ν s,t X 2 s , we are also able to obtain the same sharp rates for ∆ A and ∆ B for Lipschitz robust losses. The third option is to truncate the predictor, i.e. h t = ( t−1 s=t−m ν s,t X 2 s )∧M with some large constant M and we can control the two terms for general robust losses. The bottom line is that for non-robust predictors, we need to make sure the loss does not go too crazy either via shrinking predictor's effect on the loss (bounded Lipschitz) or clipping the predictor directly. The interesting observation is that bounding II, or the difference between empirical risk and conditional risk, requires minimal assumption such that most of the reasonable predictors, say in the M-estimator form, have no problem satisfying the concentration bound with proper choice of a robust proxy, although we do require the loss not to become too wild (see Theorem 7 for details). Technically, the concentration here only cares about controlling variance and does not care about the bias between E[L(σ 2 t , h t )] and L(σ 2 t , σ 2 t ) and between E[C(h t )] and C(h t ). There is no need to carefully choose the truncation threshold to balance variance and bias optimally. Theorem 7. For any t ∈ [T ], L τt (θ; {ν s,t } t−1 s=t−m ) is α-strongly convex for θ ∈ (σ 2 t /2, 3σ 2 t /2). Choose σ t = ( σ e ) t (or ( σ H ) t ) with robustification parameter c t specified in Theorem 4 (or Theorem 5). 
Under the assumptions of Theorem 4 (or Theorem 5), the bound holds for all the following three cases: 1. For the robust predictors proposed in Section 4.1, assume sup |h−σ 2 the above bound holds with C depending on max t B t (σ 2 t /2) and C in Theorem 4 (or Theorem 5). t /σ 4 t , the above bound holds with C depending on max t B t (M ) and C in Theorem 4 (or Theorem 5). 3. For the vanilla non-robust exponentially weighted moving average predictors h t = t−1 s=t−m ν s,t X 2 s , assume |∂L(σ 2 t , h)/∂h| ≤ B 0 and |L(σ 2 t , h)| ≤ M 0 for ∀σ 2 t , h, if |∆ 0,t | n ‡ eff /κ ‡ t +2∆ 1,t n ‡ eff /κ ‡ t < 1/2 and n ‡ eff ≥ 64z max tκ ‡ t /σ 4 t , the above bound holds with C depending on B 0 , M 0 and C in Theorem 4 (or Theorem 5). Remark 12. The proof can be easily extended to more general robust or non-robust predictors of the M-estimator form. Theorem 7 tells us that comparing robust predictors with optimal truncation for a single time point (rather than adjusting the truncation as in constructing proxies) and nonrobust predictors (either with rough overall truncation when using a general loss or without any truncation when using a truncated loss) is indeed a valid thing to do, when we employ proper robust proxies. Remark 13. Although the first proxy achieves the optimal rate of convergence for comparing average conditional loss in Theorem 3, we did not manage to show it is valid for comparing the average empirical loss in Theorem 7. The reason is that single time clipping has no concentration guarantee like Theorem 1 for a single time point, therefore cannot ensure |( σ c ) 2 t − σ 2 t | to be bounded with high probability for all t, which is important to make sure the sub-exponential tail in Bernstein inequality does not dominate the sub-Gaussian tail. Taking into consideration of central fourth moment versus absolute fourth moment (see Remark 7), we would recommend the third proxy as the best practical choice among our three proposals. In this section, we first verify the advantage of the Huber mean estimator over the truncated mean in terms of estimating the variance through simulations. As illustrated by Theorem 5, the statistical error of the Huber mean estimator depends on the central moment, while that of the truncated mean depends on the absolute moment. Then we apply the proposed robust proxies for volatility forecasting comparison, using data from crypto currency market. Specifically, we focus on the returns of Bitcoin (BTC) quoted by Tether (USDT), a stable coin pegged to the US Dollar, in the years of 2019 and 2020, which witness dramatic volatility of Bitcoin. We first examine numerically the finite sample performance of the adaptive Huber estimator (Wang et al., 2020) for variance estimation, i.e., we solve (3.5) iteratively for θ given the data and z until convergence. We first draw an independent sample Y 1 , . . . , Y n of Y that follows a heavy-tailed distribution. We investigate the following two distributions: 1. Log-normal distribution LN(0, 1), that is, log Y ∼ N (0, 1). Given that σ 2 := Var(Y ) = E(Y 2 ) − (EY ) 2 , we estimate E(Y 2 ) and EY separately and plug these mean estimators into the variance formula to estimate σ 2 . 
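A minimal sketch of this plug-in variance estimation (our own bisection-based implementation of the two-equation iteration, not the authors' code; the sample size, z value and helper names are illustrative):

```python
import numpy as np

def solve_tau(y, theta, z, iters=60):
    """Solve sum_i min((y_i - theta)^2, tau^2) / tau^2 = z for tau by bisection."""
    lo = 1e-12
    hi = np.sqrt(np.sum((y - theta) ** 2) / z) + 1e-12   # at hi the left-hand side is <= z
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        val = np.sum(np.minimum((y - theta) ** 2, mid ** 2)) / mid ** 2
        lo, hi = (mid, hi) if val > z else (lo, mid)
    return 0.5 * (lo + hi)

def solve_theta(y, tau, iters=60):
    """Solve the Huber score equation sum_i min(|y_i - theta|, tau) sgn(y_i - theta) = 0."""
    lo, hi = float(np.min(y)), float(np.max(y))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        score = np.sum(np.clip(y - mid, -tau, tau))
        lo, hi = (mid, hi) if score > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def huber_mean(y, z, n_iter=20):
    """Tuning-free Huber mean: alternate the two equations until (numerical) convergence."""
    theta = float(np.mean(y))
    for _ in range(n_iter):
        tau = solve_tau(y, theta, z)
        theta = solve_theta(y, tau)
    return theta

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=2000)          # heavy-tailed LN(0, 1) sample
z = 1.5                                                    # deviation parameter
var_huber = huber_mean(y ** 2, z) - huber_mean(y, z) ** 2  # plug-in variance estimate
print(var_huber, (np.e - 1.0) * np.e)                      # Var(Y) = (e - 1) e for LN(0, 1)
```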
Besides the Huber mean estimator, we investigate two alternative mean estimators as benchmarks: (a) the naïve sample mean; (b) the truncated (clipped) mean. Figure 1 shows that the Huber variance estimator outperforms the optimally tuned truncated method with z roughly between (1.5, 3.5) under the log-normal distribution and between (1.5, 4) under the Student's t distribution. The performance gap is particularly large under the t distribution, where the optimal Huber method achieves around 20% less MSE than the optimal truncated method, and the Huber estimator performs stably for any z ∈ (1, 2) under both distributions of our focus. Together with the z ranges where the Huber approach is superior in terms of MSE, our results suggest that z = 1.5 can be a good practical choice, at least to start with. Such a universal practical choice of z demonstrates the adaptivity of the tuning-free Huber method. We use the BTC/USDT daily returns to demonstrate the benefit of using robust proxies in volatility forecasting comparison. Fig 3 presents the time series, histogram and normal QQ-plot of the daily BTC returns from 2019-01-01 to 2021-01-01. It is clear that the distribution of the returns is heavy-tailed, that the volatility is clustered and that there are extreme daily returns beyond 10% or even 20%. The empirical mean of the returns over this 2-year period is 38 basis points, which is quite close to zero compared with its volatility. We thus assume the population mean of the return is zero, so that the variance of the return boils down to the mean of the squared return. In the sequel, we focus on robust estimation of the mean of the squared returns. Let r_t denote the daily return of BTC from the end of day t−1 to the end of day t. We emphasize that a volatility predictor h_t must be ex-ante. Here we construct h_t based on r_{t−m_b}, ..., r_{t−1} in the backward window of size m_b and evaluate it at the end of day t−1. Our proxy σ̂²_t for the unobserved variance σ²_t of r_t is instead based on r_t, ..., r_{t+m_f} in the forward window of size m_f. We consider two volatility prediction approaches: (i) the vanilla EWA of the backward squared returns, i.e., Σ_{s=t−m_b}^{t−1} ν_{s,t} X²_s; (ii) the exponentially weighted Huber predictor proposed in Section 4.1. Each approach is evaluated with half-lives equal to 7 days (1 week) and 14 days (2 weeks), giving rise to four predictors, which are referred to as EWMA HL7, EWMA HL14, Huber HL7, Huber HL14. We always choose m_b to be twice the corresponding half-life and set z = log n‡_eff for the two Huber predictors. As for volatility proxies, we similarly consider two methods: (i) the vanilla forward EWA proxy, i.e., Σ_{s=t}^{t+m_f} w_{s,t} X²_s; (ii) the robust Huber proxy proposed in Section 3.3. We set the half-life of the exponential decay weights to be always 7 days, m_f = 14 and z = 2 log n†_eff. We evaluate the Huber approach on two time series of different lengths: T = 720 or 180, which imply two different c_t values that are used in (3.11). We refer to the two corresponding Huber proxies as Huber 720 and Huber 180. Given the theoretical advantages of the Huber proxy as demonstrated in Remarks 7 and 13, we do not investigate the first two proxies proposed in Section 3.3. The crypto currency market is traded 24 hours every day nonstop, which gives us 732 daily returns from 2019-01-01 to 2021-01-01. After removing the first 27 days used for predictor priming and the last 13 days used for proxy priming, we then have 691 data points left.
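Before turning to the results, the sketch below (ours; the bisection solvers and all names are illustrative, not the authors' implementation) shows the sample-weighted tuning-free Huber solver that both the Huber predictors of Section 4.1 and the Huber proxies of Section 3.3 rely on; the two uses differ only in whether the weights look backward or forward and whether the fitted τ̂_t is inflated by √(T/n†_eff) before a final re-solve, as in (3.11):

```python
import numpy as np

def weighted_score(x2, theta, tau, w):
    """Derivative of the weighted Huber loss:
    sum_s w_s * min(|x2_s - theta|, tau / w_s) * sgn(x2_s - theta)."""
    return np.sum(w * np.clip(x2 - theta, -tau / w, tau / w))

def solve_theta_weighted(x2, tau, w, iters=60):
    """Solve the weighted Huber score equation for theta by bisection."""
    lo, hi = float(np.min(x2)), float(np.max(x2))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if weighted_score(x2, mid, tau, w) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def solve_tau_weighted(x2, theta, w, z, iters=60):
    """Solve sum_s min(w_s^2 (x2_s - theta)^2, tau^2) / tau^2 = z for tau by bisection."""
    r2 = (w * (x2 - theta)) ** 2
    lo, hi = 1e-16, np.sqrt(np.sum(r2) / z) + 1e-16
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        val = np.sum(np.minimum(r2, mid ** 2)) / mid ** 2
        lo, hi = (mid, hi) if val > z else (lo, mid)
    return 0.5 * (lo + hi)

def weighted_huber_estimate(x2, w, z, inflate=1.0, n_iter=20):
    """Tuning-free weighted Huber estimate of a local variance from squared returns x2.
    inflate = 1 with backward weights mimics the Huber predictor of Section 4.1;
    inflate = sqrt(T / n_eff) with forward weights mimics the adjusted proxy of (3.11)."""
    theta = float(np.sum(w * x2))            # start from the plain weighted average
    for _ in range(n_iter):
        tau = solve_tau_weighted(x2, theta, w, z)
        theta = solve_theta_weighted(x2, tau, w)
    if inflate != 1.0:                       # one final re-solve with the inflated truncation
        theta = solve_theta_weighted(x2, inflate * tau, w)
    return theta

# Example (forward proxy at time t): with w = forward exponential weights over m+1 = 15 days
# and n_eff = 1 / sum(w**2), a Huber-720-style proxy would be
#   weighted_huber_estimate(x2_forward, w, z=2*np.log(n_eff), inflate=np.sqrt(720/n_eff)).
```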
For each day, we compute the four predictors and three proxies described above. We plot the series of squared volatility (variance) proxies in Fig 4. As we can see, the vanilla EWMA proxy (blue line) is obviously the most volatile one, reaching the peak variance of 0.016, or equivalently, a volatility of 12.6%, in March 2020, when the outbreak of COVID-19 in the US sparked a flash crash of the crypto market. In contrast, the Huber proxies react in a much milder manner, and the smaller the T we consider, the stronger the truncation effect on the Huber proxies. With the predictors and proxies computed, we are ready to conduct volatility prediction evaluation and comparison. Now we would like to emphasize one issue that is crucial to the evaluation procedure: the global scale of the predictors. Different loss functions may prefer different global scales of volatility forecasts. For example, QL penalizes underestimation much more than overestimation, so we also consider rescaled predictors of the form βh_t with the optimal global scale β for each loss. Some algebra yields that for MSE, the optimal β_MSE = Σ_t h_t σ̂²_t / Σ_t h²_t, and for QL, the optimal β_QL = T^{−1} Σ_t (σ̂²_t / h_t). Table 1 reports the loss of the four predictors and their optimally scaled versions based on all the 691 time points with the non-robust EWMA proxy and the robust proxy Huber 720. Several interesting observations are in order.
• Using the longer half-life of 14 days gives a smaller QL loss, regardless of whether the predictor is robust or non-robust, original or optimally scaled, and regardless of whether the proxy is robust or non-robust. In terms of MSE, the half-life comparison is mixed: Huber HL7 is slightly better than Huber HL14, but EWMA HL14 is better than EWMA HL7. We only focus on the longer half-life from now on.
• If we look at the original predictors without optimal scaling, it is clear that MSE favors the robust predictor and QL favors the non-robust predictor, regardless of using robust or non-robust proxies. This confirms that different loss functions can lead to very different comparison results.
• However, the above inconsistency between MSE and QL is mostly due to scaling, which is clearly demonstrated by the column of the optimal scaling β. For MSE, the optimal scaling of the EWMA predictor is around 0.5, while that of the Huber predictor is around 1. In contrast, for QL, the optimal scaling needs to be much larger than 1.0, and Huber needs an even larger scaling. If we look at the loss function values with optimally scaled predictors, it is interesting to see that the Huber predictor outperforms the EWMA predictor in terms of both MSE (slightly) and QL (substantially). This means that the Huber predictor is more capable of capturing the relative change of time-varying volatility than the non-robust predictor.
• Last but not least, when the sample size T is large compared with n†_eff (here T/n†_eff = 691/24.87 = 27.78), the difference between the EWMA and Huber proxies is small, which explains why they give consistent comparison results. When T is not large enough, as in the next subsection, we can see that the robust proxies give more sensible conclusions.
Now suppose we only have T = 180 data points to evaluate and compare volatility forecasts. In Fig 5, we present the curve of the 180-day rolling loss difference, i.e., 180^{−1} Σ_{s=t−179}^{t} {L(σ̂²_s, h_{Huber HL14,s}) − L(σ̂²_s, h_{EWMA HL14,s})}, with t ranging from 180 to 691, where σ̂²_s can either be the EWMA proxy or Huber 180. A positive loss difference at t indicates that the EWMA predictor outperforms the Huber predictor in the past 180 days.
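For reference, the optimal global scales and the rolling loss differences used in this comparison can be computed as in the following sketch (ours; array names are illustrative, and mse_loss/ql_loss are as in the earlier sketch):

```python
import numpy as np

def optimal_scale_mse(proxy_var, h):
    """beta_MSE = sum_t h_t * proxy_t / sum_t h_t^2 minimizes the average MSE of beta * h."""
    return float(np.sum(h * proxy_var) / np.sum(h ** 2))

def optimal_scale_ql(proxy_var, h):
    """beta_QL = mean_t(proxy_t / h_t) minimizes the average QL of beta * h."""
    return float(np.mean(proxy_var / h))

def rolling_loss_diff(proxy_var, h_a, h_b, loss, window=180):
    """Trailing rolling average of L(proxy, h_a) - L(proxy, h_b); negative values favor h_a."""
    d = loss(proxy_var, h_a) - loss(proxy_var, h_b)
    return np.convolve(d, np.ones(window) / window, mode="valid")

# e.g. diff_mse = rolling_loss_diff(huber_proxy, h_huber_hl14, h_ewma_hl14, mse_loss)
#      beta = optimal_scale_ql(huber_proxy, h_huber_hl14)  # then evaluate beta * h_huber_hl14
```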
We see that most of the time, Huber HL14 defeats EWMA HL14 (negative loss difference) in terms of MSE, while EWMA HL14 defeats Huber HL14 (positive loss difference) in terms of QL. In terms of MSE, robust proxies tend to yield a more consistent comparison between the two predictors throughout the entire period of time. We can see from the upper panel of Figure 5 that the time period for EWMA HL14 to outperform Huber HL14 is much shorter with the robust proxies (orange curve) than with the EWMA proxies (blue curve). In terms of QL, if we use the EWMA proxy, we can see from the lower panel of Figure 5 that the robust predictor is much worse than the non-robust predictor, especially towards the end of 2020. However, the small MSE difference at the end of 2020 suggests that the EWMA proxy likely overestimates the true volatility and exaggerates the performance gap in terms of QL. With the Huber proxy, however, the loss gap between the two predictors is much narrower, suggesting that the Huber proxy is more robust against huge volatility.
Figure 5: 180-day rolling loss difference between EWMA HL14 and Huber HL14 with robust or non-robust proxies. The upper panel corresponds to MSE and the lower one corresponds to QL.
Figure 6: 180-day rolling loss difference between optimally scaled EWMA HL14 and Huber HL14 with robust or non-robust proxies. The upper panel corresponds to MSE and the lower one corresponds to QL.
Fig 6 presents the 180-day rolling loss differences between the optimally scaled Huber HL14 and EWMA HL14, based on robust and EWMA proxies respectively. For MSE, from the previous subsection, we know that when the optimal scaling is applied, the two predictors do not differ much in terms of the overall loss, and the Huber predictor only slightly outperforms the EWMA predictor. In the upper panel of Fig 6, we see that the robust-proxy-based curve is closer to zero than the EWMA-proxy-based curve, displaying more consistency with our result based on large T. For QL, the loss differences using the robust or the non-robust proxy look quite similar. We also plot β_{Huber,t} and β_{EWMA,t} versus time t based on robust and non-robust proxies in Fig 7. For both MSE and QL losses, using the robust proxy leads to more stable optimal scaling values, which are always preferred by practitioners. In a nutshell, we have seen how our proposed robust Huber proxy can lead to better interpretability and a sensible comparison of volatility predictors. When the total sample size is small compared to the local effective sample size, using a robust proxy is necessary and leads to a smaller probability of misleading forecast evaluation and comparison. When the total sample size is large enough, the proposed robust proxy automatically truncates less and resembles the EWMA proxy. This also provides justification for using a non-robust EWMA proxy when the sample size is large. But we still recommend the proposed robust proxy, which can adapt to the sample size and the time-varying volatilities. Sometimes, even if the robust proxy only truncates data for a very small volatile period, it could still cause a significant difference in terms of risk evaluation. Compared with the literature on modeling volatility and predicting volatility, evaluating volatility prediction has not been given enough attention. Part of the reason is the lack of a good framework for its study, which makes practical volatility forecast comparison quite subjective and less systematic in terms of loss selection and proxy selection.
Patton (2011) is a pioneering work that provides a framework based on long-run expectations and gives guidance on loss selection, while our work gives a new framework based on an empirical deviation perspective and further provides guidance on proxy selection. In our framework, we focus on predictors that can achieve the desired probability bound for II so that the empirical loss is close to the conditional expected loss. Then the correct comparison of the conditional expected losses of two predictors with large probability relies on a good control of I, which imposes requirements on the proxies. Within this framework, we proposed three robust proxies, each of which guarantees a good bound for I when the data only bear a finite fourth moment. Although all three proxies can attain the optimal rate of convergence for bounding I, we recommend the exponentially-weighted tuning-free Huber proxy. It is better than the clipped squared return in that it leverages neighborhood volatility smoothness, and it is better than the proxy based on direct truncation in that it has an improved constant in the deviation bound, depending only on the central moment. To construct this proxy, we need to solve the exponentially-weighted Huber loss minimization, whose truncation level for each sample, perhaps surprisingly, also needs to change with the sample weight. We then applied this proxy to real BTC volatility forecasting comparison and reached some interesting observations. Firstly, robust predictors with better control of the variance may use a faster decay to reduce the approximation bias. Secondly, different losses can lead to drastically different comparisons, so even restricting attention to robust losses, loss selection is still a meaningful topic in practice. Thirdly, predictor rescaling according to the loss function is necessary and can further extract the value of robust predictors. Finally, the proposed robust Huber proxy adapts to both the time-varying volatility and the total sample size. When the overall sample size is much larger than the local effective sample size, the robust Huber proxy barely truncates, which provides justification for even using the EWMA proxy for prediction evaluation. However, the robust Huber proxy in theory still gives a high probability of concluding the correct comparison. There are still limitations of the current work and open questions to be addressed. Assumption 1 excludes the situation when σ²_t depends on previous returns, such as in GARCH models. We require n†_eff → ∞ for the local performance guarantee, leading to a potentially slow decay. However, in practice, it is hard to know how fast the volatility changes. Also, we ignored the auto-correlation of returns and assumed temporal independence of the innovations for simplicity. Extensions of the current framework to time series models and more relaxed assumptions are of practical value to investment managers and financial analysts. Our framework may have nontrivial implications on how to conduct cross-validation with heavy-tailed data, where we use validation data to construct proxies for the unknown variable to be estimated robustly. Obviously, subjectively choosing a truncation level for proxy construction could favor a certain truncation level used by a robust predictor. Motivated by our study, rescaling the optimal truncation level for one data split according to the total (effective) number of sample splits sounds like an interesting idea worth further investigation in the future.
This section provides proof details for all the theorems in the main text. Proof of Theorem 1. The proof follows Theorem 5 of Fan et al. (2016) . Denote φ(x) = min(|x|, 1)sgn(x), Define r(θ) = t+m s=t min(w s,t |X 2 s − θ|, τ t )sgn(X 2 s − θ), so σ 2 t is the solution to r(θ) = 0. Defineσ 2 t = t+m s=t w s,t σ 2 s = σ 2 t + δ 0,t . The RHS can be further bounded by E{exp[r(θ)/τ t ]} ≤ exp (σ 2 t − θ)/τ t + t+m s=t w 2 s,t (κ † t + 2(σ 2 s − σ 2 By Chebyshev inequality, and their population versions for any j ≥ 0 where X t−j is an iid copy of X t−j and {X t } are independent random innovations satisfying Assumption 1. Assume E[Y t ] = 0, |Y t | ≤ M for all t and there exists constant ρ ∈ (0, 1) such that Also assume T ≥ 4 ∨ (log(ρ −1 )/2). We have for y > 0, where C 1 = 2 max{(e 4 − 5)/4, [ρ(1 − ρ) log(ρ −1 )] −1 } · (8 ∨ log(ρ −1 )) 2 , C 2 = max{(c log 2) −1 , [1 ∨ (log(ρ −1 )/8)]} with c = [log(ρ −1 )/8] ∧ (log 2) log(ρ −1 )/4. The proof of Lemma 1 follows closely with Theorem 2.1 of Zhang (2021). The only extension here is that we do not require X t or even X t /σ t to be identically distributed, so there is no assumption on the stationarity of the process Y t . However, we requires a stronger assumption on the maximal perturbation of each Y t = h t (X t , X t−1 , . . . ). The entire proof of Zhang (2021) will go through with this new definition of Y · 2 and γ j . We omit the details of the proof. Proof of Theorem 4. Let σ t = ( σ e ) t and define Y t = qt . It is not hard to see that for j ≤ m, γ 2 j = max t E [(w j,0 X 2 t+j ) ∧ c t − (w j,0 X t+j 2 ) ∧ c t ] 2 q 2 t σ 4 t ≤ max t E (w j,0 X 2 t+j − w j,0 X t+j 2 ) 2 + (w j,0 X 2 t+j − c t ) 2 I(w j,0 X 2 t+j < c t , w j,0 X t+j 2 > c t ) + (w j,0 X t+j 2 − c t ) 2 I(w j,0 X 2 t+j > c t , w j,0 X t+j 2 < c t ) Q 2 σ 4 t ≤ 4w 2 j,0 Q 2 max tκ t+j /σ 4 In addition, we claim |Y t | ≤ CQ Firstly, in bounding ∆ A , we need L(σ 2 Predicting the volatility of cryptocurrency time-series Challenging the empirical mean and empirical variance: a deviation study Volatility estimation for bitcoin: Replication and robustness Robust covariance and scatter matrix estimation under huber's contamination model Robust inference via multiplier bootstrap A comprehensive look at financial volatility prediction by economic variables How relevant is volatility forecasting for financial risk management? Autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation Robust estimation of high-dimensional mean regression Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions Robust online time series prediction with recurrent neural networks Robust estimation of a location parameter P(r(θ) > B + (θ)) ≤ exp{−B + (θ)/τ t }E exp{r(θ)/τ t } = exp{−z + |δ 0,t |/τ t + 2δ 1,t /τ 2 t } Similarly, P(r(θ) < B − (θ)) ≤ exp{−z + |δ 0,t |/τ t + 2δ 1,t /τ 2 t }. Following the same argument with Fan et al. (2016) , we can show that for large enough n † eff such that 8/(n † eff τ t )(κ † t /(n † eff τ t ) + τ t z) ≤ 1, the root θ + of B + (θ) satisfies that θ + ≤ σ 2 t + 2(κ † t /(n † eff τ t ) + τ t z) , and the root θ − of B − (θ) satisfies thatWith the choice of τ t given in Theorem 1, we have P(| σ 2 t −σ 2 t | ≤ 4 κ † t z/n † eff ) ≥ 1−2e −z+|δ 0,t |/τt+2δ 1,t /τ 2 t .And the requirement for the effective sample size is that n † eff ≥ 16z.Proof of Theorem 2. We extend Theorem 2.1 of Wang et al. (2020) to the weighted case. 
Note again we are solving the following equation for τ t : t+m s=t min w 2 s,t |X 2 s − σ 2 t | 2 , τ 2 t τ 2 t − z = 0 .Also defined τ t as the solution of the corresponding population equation:We will first show that (a) τ t κ † t n † eff z and then (b) with probability approaching 1, | τ t /τ t − 1| ≤ c 0 for a small fixed c 0 . To prove (a), it is straightforward to see thatConsider the solution q s (a) to the equation P(|X 2 s − σ 2 t | 2 > qa) = a −1 of the variable q. Note that the solution is unique. Since all w s,t are on the same order, we know that w s,t 1/m, n † eff m. Let a = a s = ( t+m s=t w 2 s,t )/(zw 2 s,t ) m/z, the corresponding solution is q s (a s ). Define q min = min s q s (a s ). So we havewe know that for any s, q s (a s ) a −1 z/m, so is q min . Therefore, we have τ t /w s,t ≥ τ 0 /w s,t mτ 0 q min m/z 1. Write τ t /w s,t ≥ c 2 for some c 2 ≥ 0.So we have shown (a) holds, that is, τ tNext, we need to show (b), so that the solution τ t from the second equation gives us the desired optimal truncation rate. To this end, we still follow the proof of Theorem 1 of Wang et al. (2020) closely. Specifically, define Y s = w s,t |X 2 s − σ 2 t |, using their notations, we defineOne important fact here is that q n (t) = −2t −1 p n (t) and q (t) = −2t −1 p(t), which is key to prove Theorem 1 of Wang et al. (2020) . The only difference of our setting here is that we do not assume Y s 's are identically distributed. So when applying Bernstein's inequality as in (S1.8) of Wang et al.(2020), we need to use the version for non-identically distributed variables, and also bound theSo we can indeed apply Bernstein's inequality on s ζ s . For more details, we refer the interested readers to Wang et al. (2020).To bound the first term, we can apply Bernstein inequality forTo bound the second term, note thatHere we can also choose y = CQ √ zT for a large enough C to make the second probability equal to 0.To prove this, we need the followingThis can be shown following a similar proof as Theorem 1, thus we omit the details. Note that here c t is not chosen to optimize the error bound, as we have another average over T to take care of the extra variance in σ 2 t . So here we only need to choose c t to make sure the error bound to be in the order ofWe require |Y t | ≤ CQ √ z to hold for all time points, so the exception probability for all events isNow conditioning on that |Y t | ≤ CQ √ z, we are ready to call Lemma 1 for Y t . We can choose y = CQ √ zT in Lemma 1 to make the exception probability smaller than e −z . So in total the exception probability is 2e −z .Next, in terms of the bias term E[ σ 2 t ]−σ 2 t = t+m s=t w s,t E[min(X 2 s , c t /w s,t )−σ 2 s ]+( t+m s=t w s,t σ 2 s − σ 2 t ). From the proof of Theorem 3, we know that E[min(X 2 s , c t /w s,t ) − σ 2 s ] ≤ 2w s,tκs /c t . ThereforeThus, the bias term will not affect the total error bound as shown in the theorem.Lemma 2. Assume the weighted Huber loss F (θ) = L ct (θ; {w s,t } t+m s=t ) is α-strongly convex for some α ∈ (0, 1/2) in some local neighborhood Θ around θ * ∈ Θ.F (θ) is the perturbed version of F (θ).If θ * = argmin θ F (θ) andθ * = argmin θF (θ) andθ * ∈ Θ, we haveProof of Lemma 2. Besides strong convexity, we know that F (θ) is β-smooth with β = 1. Thatβ = 1 is obvious given the second derivative of F (θ) is bounded by 1. From Lemma 3.11 of Bubeck (2014), for θ 1 , θ 2 ∈ Θ,Choose θ 1 =θ * , θ 2 = θ * , so F (θ 2 ) = 0 =F (θ 1 ). Thenwhich concludes the proof.Proof of Theorem 5. 
Let σ t = ( σ H ) t and define Y t = qt. In order to apply Lemma 1, we need to employ Lemma 2 to bound the perturbation of Huber loss minimizer via bounding the perturbation of Huber loss derivative. Similar to the proof of Theorem 4, we can show thatz for all t with probability larger than 1 − e −z . We can actually explicitly write out the bound for | σ 2 t − σ 2 t | following the proof of Theorem 4: | σ 2 t − σ 2 t | ≤ 2( κ † t z/T + c κ † t z) < (2c + 2) κ † t z, which means the Huber loss minimizer does fall into the region we have strong convexity by our assumptions in Theorem 5. The bound on | σ 2 t − σ 2 t | also implies that |Y t | ≤ CQ √ z with exception probability of e −z .In addition, we would like to check Y · 2 < ∞. For j > m, γ j = 0 and for j ≤ m,Therefore Y · 2 < ∞ for any fixed ρ ∈ (0, 1). We can indeed apply Lemma 1 to bound the sum of Y t , which is of order CQ √ zT , with exceptional probability of another e −z .Finally, we bound the bias term E[ σ 2 t ]/σ 2 t − 1. Note thatSimilar to the proof of Theorem 3, we know that the s-th component of the third term can be bounded by 2wFurthermore, it is not hard to show that E[(σ 2 t − σ 2 t ) 2 ] is bounded using Lemma 2. Therefore, the bias is indeed of the order z/T with the additional approximation error rate δ 0,t +4δ 1,t σ 2 t . The proof is now complete.Proof of Theorem 6. Following the same proof as Theorem 1, we have thatConditioning on this event, L is Lipchitz in the second argument. Then the error bound is enlarged by B times.Proof of Theorem 7. Recall thatNow let us prove (i) first. We first bound ∆ A . We still apply Lemma 1 to bound the concen-. From the proof of Theorem 6, we know thatwith exception probability of 2e −z/2 . Applying the union bound, we get |h t − σ 2 t | ≤ σ 2 t /2 for all t with probability ≥ 1 − 2T e −z/2 . On this event, we haveSimilar to previous proofs, except that now we look data backward, for j ≥ 0,The last inequality can be shown similar to the proof of Theorem 5 with the assumption that the Huber loss is α-strongly convex locally. Therefore, Y · 2 < ∞ for any fixed ρ ∈ (0, 1). We apply Lemma 1 on Y t and again pick y = C √ zT to make the exceptional probability 2e −z . So we getNow let us try to apply Lemma 1 on ∆Note that since C(·) is non-increasing, so C(h t ) ≤ C(σ 2 t /2) is bounded. If we use the second and third proxies, in the proofs of Theorem 4 and 5, we have showed that | σ 2 t −σ 2 t | ≤ C √ z for all t with exception probability at most e −z . Therefore, we conclude |Y t | ≤ C √ z for all t with exception probability at most e −z . Now for bounding γ j , note that Y t actually is a function of X t−m , . . . , X t−1 , X t , . . . , X t+m with the first m data constructing predictors and the remaining m + 1 data constructing proxies. Hence, for j < m, it is not hard to show γ 2 j ≤ Cν 2 −(m−j),1 for some C > 0 and for m ≤ j ≤ 2m, γ 2 j ≤ Cw 2 j−m,0 . So we have Y · 2 ≤ 2 √ C < ∞. Applying Lemma 1 will again give us P(|∆Combining results for ∆ A and ∆ B and choose ε = C z/T for large enough C, we conclude (i) for bounding II.Next we prove (ii) and (iii). The proof follows the exact same arguments as (i) except for a few bounding details. t , h t ) to be bounded. In (iii), h t is not necessarily bounded, but we directly work with a bounded loss L(σ 2 t , h t ) ≤ M 0 . In (ii), h t ≤ M and we claim h t ≥ σ 2 t /2 with probability ≥ 1 − 2e −z , thus L(σ 2 t , h t ) ≤ B t (M )M . To see why the claim holds, defině h t = t−1 s=t−m ν s,t min(X 2 s , τ t /ν s,t ). 
The robust predictor proposed in Section 4.1 achieves central fourth moment, whileȟ t can achieve the same rate of convergence with absolute fourth moment. This is similar to the difference between the second and third proxy option. Similar to the proof of Theorem 6, we can showSo we know thatȟ t ≥ σ 2 t /2 with probability ≥ 1 − 2e −z . Interestingly, h t ≥ȟ t . So we always have good control of the left tail due to the positiveness of squared data.Secondly, in bounding ∆ A , we need γ 2 j ≤ Cν 2 −j,1 . In (iii), since loss is Lipchitz in the whole region, we can easily see that γ 2 j ≤ 4ν 2 −j,1 B 2 0 max tκt . In (ii), loss is Lipchitz in the local region of σ 2 t /2 ≤ h t ≤ M , but we know with high probability the clipped predictor indeed falls into the region, so we have γ 2 j ≤ 4ν 2 −j,1 max t B 2 t (M ) max tκt . Therefore, we have no problem bounding ∆ A . Thirdly, in bounding ∆ B , we require C(h t ) to be bounded. Note that since h t ≥ σ 2 t with high probability, we have C(h t ) ≤ C(σ 2 t /2) even when we use non-robust predictors in (ii) and (iii). So we can bound ∆ B as desired too.Finally putting everything together and choose ε = C z/T for large enough C, we conclude (ii) and (iii) for bounding II.