Online Debiased Lasso for Streaming Data

Ruijian Han, Lan Luo, Yuanyuan Lin, Jian Huang

June 10, 2021

Abstract. We propose an online debiased lasso (ODL) method for statistical inference in high-dimensional linear models with streaming data. The proposed ODL consists of an efficient computational algorithm for streaming data and approximately normal estimators for the regression coefficients. Its implementation only requires the availability of the current data batch in the data stream and sufficient statistics of the historical data at each stage of the analysis. A dynamic procedure is developed to select and update the tuning parameters upon the arrival of each new data batch so that we can adjust the amount of regularization adaptively along the data stream. The asymptotic normality of the ODL estimator is established under conditions similar to those in an offline setting and mild conditions on the size of data batches in the stream, which provides theoretical justification for the proposed online statistical inference procedure. We conduct extensive numerical experiments to evaluate the performance of ODL. These experiments demonstrate the effectiveness of our algorithm and support the theoretical results. An air quality dataset and an index fund dataset from the Hong Kong Stock Exchange are analyzed to illustrate the application of the proposed method.

1 Introduction

The advent of distributed online learning systems such as Apache Flink (Carbone et al., 2015) has motivated new developments in data analytics for stream processing. Such systems enable efficient analyses of massive streaming data assembled through, for example, mobile or web applications (Jiang et al., 2018), e-commerce purchases (Akter & Wamba, 2016), infectious disease surveillance programs (Choi et al., 2016; Samaras et al., 2020), mobile health consortia (Shameer et al., 2017; Kraft et al., 2020), and financial trading floors (Das et al., 2018). Streaming data refers to a data collection scheme in which observations arrive sequentially and perpetually over time, making it infeasible to store the full data in computer memory for a static analysis. Researchers query such continuous and unbounded data streams in real time to answer questions of interest, including assessing disease progression, monitoring product safety, and validating drug efficacy and side effects. In these scenarios, it is essential for practitioners to process data streams sequentially and incrementally as part of online monitoring and decision-making procedures. Additionally, data streams from fields such as bioinformatics, medical imaging, and computer vision are usually high-dimensional in nature. In this paper, we consider the problem of online statistical inference in high-dimensional linear regression with streaming data and propose an online debiased lasso (ODL) estimator. While substantial advancements have been made in online learning and the associated optimization problems, the existing works focus on online computational algorithms for point estimation and their numerical convergence properties (Langford et al., 2009; Duchi et al., 2011; Tarrès & Yao, 2014; Sun et al., 2020). However, these works did not consider the distributional properties of the online point estimators, which are needed for statistical inference.
Online statistical inference methods have been mostly developed under low-dimensional settings where p ≪ n (Schifano et al., 2016; Luo & Song, 2020). The goal of this paper is to develop an online algorithm and statistical inference procedure for analyzing high-dimensional streaming data. The last decade has witnessed enormous progress on statistical inference in high-dimensional models; see, for example, Zhang & Zhang (2014), van de Geer et al. (2014), Javanmard & Montanari (2014), as well as the review paper Dezeure et al. (2015) and the references therein. Most of these advancements, such as the debiased lasso, have been developed for an offline setting. A major difficulty in an online setting with streaming data is that one does not have full access to the entire dataset, as new data arrive on a continual basis. To tackle the computational and inferential problems due to the evolving nature of a high-dimensional data stream, it is desirable to develop an algorithm and statistical inference procedure in an online mode, updating the regression parameters sequentially with the newly arrived data batch and summary statistics of the historical raw data.

In recent years, there has been an ever-increasing interest in developing online variable selection methods for high-dimensional streaming data. Most of the work is along the line of the lasso (Tibshirani, 1996). For example, Langford et al. (2009) proposed an online ℓ1-regularized method via a variant of the truncated stochastic gradient descent. Fan et al. (2018) adopted diffusion approximation techniques to characterize the dynamics of the sparse online regression process. A comprehensive development of online counterparts of popular offline variable selection algorithms, such as the lasso, the elastic net (Zou & Hastie, 2005), the minimax concave penalty (MCP) (Zhang, 2010), and Feature Selection with Annealing (FSA) (Duchi et al., 2011), has been given by Sun et al. (2020). Nevertheless, variable selection methods focus on point estimation and do not provide any uncertainty assessment. There is no systematic study on statistical inference, including interval estimation and hypothesis testing, with high-dimensional streaming data. Another complication in dealing with high-dimensional streaming data is that the regularization parameter λ that controls the sparsity level can no longer be determined by traditional cross-validation. Instead, small coefficients are rounded to zero with a certain threshold or a pre-specified sparsity level (Sun et al., 2020). Recently, Deshpande et al. (2019) considered a class of online estimators in a high-dimensional auto-regressive model and studied their asymptotic properties via martingale theory. Shi et al. (2020) proposed an inference procedure for high-dimensional linear models via recursive online score estimation. In both works, it is assumed that the entire dataset is available at the initial stage for computing an initial estimator (e.g., the lasso estimator), and the information in the streaming data is used to reduce the bias of the initial estimator. However, the assumption that the full dataset is available at the initial stage is not realistic in an online learning setting. The goal of this work is to develop an online debiased lasso estimator for statistical inference with high-dimensional streaming data. Our proposed ODL differs from the aforementioned works on online inference in two crucial aspects. First, we do not assume the availability of the full dataset at the initial stage.
Second, at each stage of the analysis, we only require the availability of the current data batch and sufficient statistics of the historical data. Therefore, ODL achieves statistical efficiency without accessing the entire dataset. Furthermore, we propose a new approach for tuning parameter selection that is naturally suited to the streaming data structure. In addition, we provide a detailed theoretical analysis of the proposed ODL estimator. The main contributions of the paper are as follows.

• We introduce a new approach for online statistical inference in high-dimensional linear models. Our proposed ODL consists of two main ingredients: online lasso estimation and online debiasing. Instead of re-accessing the entire dataset, we utilize only sufficient statistics and the current data batch.

• We propose a new adaptive procedure to determine and update the tuning parameter λ dynamically upon the arrival of a new data batch, which enables us to adjust the amount of regularization adaptively along with the data accumulation. Our proposed online tuning parameter selector aligns with the online estimation and debiasing procedures and involves summary statistics only.

• We establish the asymptotic normality of the proposed ODL estimator under conditions similar to those in an offline setting and mild conditions on the batch sizes. We show that the asymptotic normality result holds as the cumulative sample size goes to infinity, regardless of finite data batch sizes. These results provide a theoretical basis for constructing confidence intervals and conducting hypothesis tests with approximately correct confidence levels and test sizes, respectively.

• Extensive numerical experiments with simulated data demonstrate that the ODL algorithm is computationally efficient and strongly support the theoretical properties of the ODL estimator. An air pollution dataset is also used to illustrate the application of ODL.

The rest of the paper is organized as follows. Section 2 presents the model formulation and our proposed ODL procedure. Section 3 includes the theoretical properties of the proposed ODL estimator. Simulation experiments are given in Section 4 to evaluate the performance of our proposed ODL in comparison with the offline ordinary least squares estimator. In Section 5.1 we demonstrate the application of the proposed method on an air pollution dataset. Concluding remarks are given in Section 6. Detailed proofs of the theoretical properties are included in the appendix.

Consider a time point b ≥ 2 with a total of N_b samples arriving in a sequence of b data batches D_j = {y^{(j)}, X^{(j)}}, j = 1, ..., b, generated from the linear model

    y^{(j)} = X^{(j)} β_0 + ε^{(j)},  j = 1, ..., b,    (1)

where y^{(j)} = (y^{(j)}_1, ..., y^{(j)}_{n_j})^⊤ is the response vector and X^{(j)} is an n_j × p design matrix, with n_j being the data batch size. Here the regression coefficient β_0 = (β_{0,1}, ..., β_{0,p})^⊤ ∈ R^p is an unknown but sparse vector, and the error terms ε^{(j)}_i, i = 1, ..., n_j, are independent and identically distributed (i.i.d.) with mean zero and finite but unknown variance σ². Throughout the paper, we consider a high-dimensional linear model; in particular, we allow p ≥ N_b ≡ Σ_{j=1}^b n_j. In a streaming data setting where data volume accumulates fast over time, individual-level raw data may not be stored in memory for long, making it impossible to implement the offline debiased lasso algorithms (Zhang & Zhang, 2014; van de Geer et al., 2014; Javanmard & Montanari, 2014) that require access to the entire dataset. To address this issue, we develop an online debiasing procedure for each component of β_0 in model (1).
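To fix ideas, the following minimal sketch (in Python, using numpy) generates a data stream of this form. The dimension, batch sizes, and signal strengths below are illustrative choices echoing the simulation settings of Section 4, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (echoing Section 4), not prescribed values.
p, s0 = 400, 10
batch_sizes = [100] * 12           # n_j for j = 1, ..., b
beta0 = np.zeros(p)
beta0[: s0 // 2] = 1.0             # relatively strong signals
beta0[s0 // 2 : s0] = 0.01         # weak signals
sigma = 1.0

def data_stream():
    """Yield batches D_j = (X_j, y_j) from the sparse linear model
    y_j = X_j beta0 + eps_j, with i.i.d. standard normal rows and errors."""
    for n_j in batch_sizes:
        X_j = rng.standard_normal((n_j, p))
        y_j = X_j @ beta0 + sigma * rng.standard_normal(n_j)
        yield X_j, y_j
```

In the online setting considered below, each batch is discarded after it is processed and only summary statistics of it are retained.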
Without loss of generality, our discussion in the following focuses on the estimation and inference of β_{0,r}, the r-th component of β_0, r = 1, ..., p. When the first data batch D_1 = {y^{(1)}, X^{(1)}} arrives, we start off by applying the offline debiased lasso to obtain the initial estimator. Specifically, let x^{(1)}_r be the r-th column of X^{(1)} and X^{(1)}_{−r} be the sub-matrix of X^{(1)} excluding the r-th column. An initial lasso estimator is given by

    β̂^{(1)}(λ_1) = arg min_{β ∈ R^p} { (1/(2n_1)) ‖y^{(1)} − X^{(1)} β‖_2^2 + λ_1 ‖β‖_1 },    (2)

where λ_1 ≥ 0 is a regularization parameter. Following Zhang & Zhang (2014), to construct a confidence interval for β_{0,r}, a low-dimensional projection z^{(1)}_r, which acts as the projection of x^{(1)}_r onto the orthogonal complement of the column space of X^{(1)}_{−r}, is defined as

    z^{(1)}_r = x^{(1)}_r − X^{(1)}_{−r} γ̂^{(1)}_r,  with  γ̂^{(1)}_r = arg min_{γ ∈ R^{p−1}} { (1/(2n_1)) ‖x^{(1)}_r − X^{(1)}_{−r} γ‖_2^2 + λ_1 ‖γ‖_1 },    (3)

where λ_1 is taken to be the same as in (2) for simplicity. Then, the debiased lasso estimator of β_{0,r}, r = 1, ..., p, based on the first batch is defined as

    β̂^{(1)}_{on,r} = β̂^{(1)}_r + (z^{(1)}_r)^⊤ (y^{(1)} − X^{(1)} β̂^{(1)}) / ((z^{(1)}_r)^⊤ x^{(1)}_r).    (4)

Later on, when the second batch D_2 = {y^{(2)}, X^{(2)}} arrives, the offline debiased lasso algorithm would replace y^{(1)} and X^{(1)} by the augmented full dataset, obtained by stacking y^{(1)}, y^{(2)} and X^{(1)}, X^{(2)}, respectively, in (2)-(4). However, {y^{(1)}, X^{(1)}} may no longer be available in an online setting. To address this issue, we propose an online estimation and inference procedure that utilizes the information in the historical raw data via summary statistics. We present the three main steps of ODL in Subsections 2.1-2.3 below.

Upon the arrival of data batch D_b, if all the historical batches D_1, ..., D_b are available, one can use the offline lasso method that solves the following optimization problem:

    β̂^{(b)}(λ_b) = arg min_{β ∈ R^p} { (1/(2N_b)) Σ_{j=1}^b ‖y^{(j)} − X^{(j)} β‖_2^2 + λ_b ‖β‖_1 },    (5)

where N_b = Σ_{j=1}^b n_j is the cumulative sample size and λ_b is the regularization parameter adaptively chosen for step b. For simplicity, we use β̂^{(b)} to denote β̂^{(b)}(λ_b) except in the discussion on the choice of λ_b in Section 2.4. However, since we only assume the availability of summary statistics of the historical data, we cannot use algorithms such as coordinate descent (Friedman et al., 2007) that require access to the whole dataset. Note that the objective function in (5) depends on the data only through the following summary statistics:

    S^{(b)} = Σ_{j=1}^b (X^{(j)})^⊤ X^{(j)},   U^{(b)} = Σ_{j=1}^b (X^{(j)})^⊤ y^{(j)}.

Therefore, it is reasonable to refer to these summary statistics as sufficient statistics relative to this objective function, or simply sufficient statistics, without confusing them with the concept of a sufficient statistic in a parametric model. Based on these two summary statistics, one can obtain the solution to (5) by a gradient descent algorithm. Let L(β) = Σ_{j=1}^b ‖y^{(j)} − X^{(j)} β‖_2^2 / (2N_b); the gradient of L(β) is given by

    ∂L(β)/∂β = (S^{(b)} β − U^{(b)}) / N_b.

Notably, the gradient depends on the historical raw data only through the summary statistics S^{(b)} and U^{(b)}. We update the solution iteratively by combining a gradient descent step and soft thresholding (Daubechies et al., 2004; Donoho & Johnstone, 1994). Specifically, each iteration consists of the following two steps:

• Step 1 (Gradient descent): compute β̃ = β̂ − η ∂L(β̂)/∂β, where η is the learning rate in the gradient descent;

• Step 2 (Soft thresholding): apply the soft-thresholding operator S(β̃_k, ηλ_b) = sign(β̃_k)(|β̃_k| − ηλ_b)_+ to each component of β̃.

These two steps are carried out iteratively until convergence. In the implementation, the stopping criterion is set as ‖∂L(β)/∂β‖_2 ≤ 10^{−6}. The size of the sufficient statistics does not increase as more data batches arrive. For example, when a new batch D_b arrives, we update the summary statistics incrementally by S^{(b)} = S^{(b−1)} + (X^{(b)})^⊤ X^{(b)} and U^{(b)} = U^{(b−1)} + (X^{(b)})^⊤ y^{(b)}, which are matrices of fixed dimensions even if b → ∞. In addition, a consistent estimator (σ̂²)^{(b)} of σ² is given by the method of moments; it is computable from Σ_{j=1}^b ‖y^{(j)}‖_2^2, U^{(b)}, S^{(b)} and β̂^{(b)}, and will be used in constructing the confidence intervals.
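A minimal sketch of this online lasso step, operating only on the sufficient statistics S^{(b)} and U^{(b)}, is given below. The function names, the step-size rule, and the iterate-change stopping criterion are implementation choices of this sketch rather than the paper's exact specification.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding operator: sign(v) * (|v| - t)_+."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def update_suff_stats(S, U, X_new, y_new):
    """Fold a new batch into the sufficient statistics:
    S <- S + X_new' X_new,  U <- U + X_new' y_new."""
    return S + X_new.T @ X_new, U + X_new.T @ y_new

def online_lasso(S, U, N, lam, max_iter=5000, tol=1e-8, beta_init=None):
    """Minimize (1/(2N)) * sum_j ||y_j - X_j b||^2 + lam * ||b||_1 by proximal
    gradient descent, using only S = sum_j X_j'X_j and U = sum_j X_j'y_j.
    The gradient of the smooth part is (S b - U) / N."""
    p = S.shape[0]
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    # Step size 1/L, with L the largest eigenvalue of S/N (Lipschitz constant).
    eta = 1.0 / max(np.linalg.eigvalsh(S / N)[-1], 1e-12)
    for _ in range(max_iter):
        grad = (S @ beta - U) / N                                # Step 1: gradient
        beta_new = soft_threshold(beta - eta * grad, eta * lam)  # Step 2: soft-threshold
        if np.linalg.norm(beta_new - beta) <= tol:               # iterate-change stopping rule
            return beta_new
        beta = beta_new
    return beta
```

In a streaming run, it is natural to warm-start from the previous estimate by passing beta_init = β̂^{(b−1)}, since consecutive solutions are typically close.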
Next, we obtain an online estimator for the low-dimensional projection. Let

    γ̂^{(b)}_r(λ_b) = arg min_{γ ∈ R^{p−1}} { (1/(2N_b)) Σ_{j=1}^b ‖x^{(j)}_r − X^{(j)}_{−r} γ‖_2^2 + λ_b ‖γ‖_1 },    (10)

where N_b and λ_b are the same as in (5). We can summarize the data information in the following two statistics: S^{(b)}_{−r,−r} and S^{(b)}_{−r,r}, where S^{(b)}_{−r,−r} is the sub-matrix of S^{(b)} excluding the r-th row and the r-th column, and S^{(b)}_{−r,r} is the sub-matrix of S^{(b)} with the r-th row deleted but the r-th column kept. Repeating the same iterative procedure as in the online lasso estimation, we can obtain γ̂^{(b)}_r and further define a low-dimensional projection z^{(b)}_r = x^{(b)}_r − X^{(b)}_{−r} γ̂^{(b)}_r based on the current batch. The low-dimensional projection z^{(b)}_r will be used in constructing the debiased estimator in Subsection 2.3.

When data batch D_b arrives, the ODL estimator for β_{0,r}, r = 1, ..., p, is defined as

    β̂^{(b)}_{on,r} = β̂^{(b)}_r + Σ_{j=1}^b (z^{(j)}_r)^⊤ (y^{(j)} − X^{(j)} β̂^{(b)}) / Σ_{j=1}^b (z^{(j)}_r)^⊤ x^{(j)}_r.    (11)

Although the historical data are still involved in (11), we only need to store the statistics Σ_{j=1}^b (z^{(j)}_r)^⊤ x^{(j)}_r, Σ_{j=1}^b (z^{(j)}_r)^⊤ y^{(j)} and Σ_{j=1}^b (z^{(j)}_r)^⊤ X^{(j)}, rather than the entire dataset, to compute β̂^{(b)}_{on,r}; these statistics keep the same dimensions when new data arrive and can be easily updated. For example, we update Σ_{j=1}^b (z^{(j)}_r)^⊤ X^{(j)} by adding (z^{(b)}_r)^⊤ X^{(b)} to its previous value. Consequently, by substituting the online lasso estimator and the low-dimensional projection into (11), we obtain β̂^{(b)}_{on,r}. Meanwhile, as discussed in Section 3, the estimated standard error is given by

    se^{(b)}_r = σ̂^{(b)} ( Σ_{j=1}^b ‖z^{(j)}_r‖_2^2 )^{1/2} / | Σ_{j=1}^b (z^{(j)}_r)^⊤ x^{(j)}_r |,    (12)

where (σ̂²)^{(b)} is given in (9) in Subsection 2.1.

In an offline setting, the tuning parameter λ can be chosen from a candidate set via cross-validation, where the entire dataset is split into training and testing sets multiple times. However, such a sample-splitting scheme is not applicable in an online setting since we do not have the full dataset at hand. A natural sample-splitting idea that aligns with the streaming data structure originates from forecasting accuracy evaluation in time series; see Figure 1. At time point b, the sequentially arrived data batches up to time point b − 1, denoted by {D_1, ..., D_{b−1}}, serve as the training set, and the current data batch D_b is the testing set. This procedure is also known as "rolling-original-recalibration" (Tashman, 2000). Specifically, with only the first data batch D_1, β̂^{(1)} is a standard lasso estimator with λ_1 selected by classical offline cross-validation. When the b-th data batch D_b arrives, we calculate the prediction error of each candidate estimator β̂^{(b−1)}(λ) on D_b and define λ_b as the minimizer of this prediction error over the candidate set. In such a way, we are able to determine λ_b adaptively upon the arrival of a new data batch D_b, and extract the corresponding lasso estimator β̂^{(b)}(λ_b) as the starting point for ODL.

Figure 1: Illustration of the rolling sample-splitting scheme, where {D_1, ..., D_{b−1}} is the training set and D_b is the testing set.

We now summarize our proposed ODL procedure for the statistical inference of β_{0,r}, r = 1, ..., p, using a flowchart in Figure 2. It consists of two main blocks: one is online lasso estimation and the other is online low-dimensional projection. Outputs from both blocks are used to compute the online debiased lasso estimator as well as to construct confidence intervals in real time. In particular, when a new data batch D_b arrives, it is first sent to the online lasso estimation block, where the summary statistics S^{(b)} and U^{(b)} are updated. These summary statistics facilitate the updating of the lasso estimator β̂^{(b−1)} to β̂^{(b)} at some grid values of the tuning parameter without retrieving the whole dataset. At the same time, regarding the cumulative dataset that produced the old lasso estimate β̂^{(b−1)} as the training set and the newly arrived D_b as the testing set, we can choose the tuning parameter λ_b that gives the smallest prediction error.
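Putting Subsections 2.2-2.4 together, the sketch below illustrates one way the per-batch debiasing update and the rolling choice of λ_b could be organized around a handful of accumulated statistics. The dictionary of accumulated quantities, the helper names, and the standard-error formula follow the reconstruction in (11)-(12) above; they are this sketch's reading of the procedure, not verbatim code from the paper, and γ̂^{(b)}_r is assumed to be obtained with the same proximal-gradient routine as the online lasso, applied to S^{(b)}_{−r,−r} and S^{(b)}_{−r,r}.

```python
import numpy as np
from scipy.stats import norm

def select_lambda(beta_prev_by_lam, X_new, y_new):
    """Rolling-original-recalibration choice of lambda_b: score each candidate's
    previous-step lasso estimate beta^{(b-1)}(lam) by its prediction error on the
    newly arrived batch D_b and keep the minimizer."""
    errs = {lam: np.mean((y_new - X_new @ b) ** 2) for lam, b in beta_prev_by_lam.items()}
    return min(errs, key=errs.get)

def debias_update(stats_r, X_new, y_new, gamma_r, r):
    """Accumulate the projection statistics for coordinate r with the new batch,
    where z_r^{(b)} = x_r^{(b)} - X_{-r}^{(b)} gamma_r^{(b)} uses the current batch only.
    stats_r is a dict initialized as
    {"zx": 0.0, "zy": 0.0, "zX": np.zeros(p), "zz": 0.0}."""
    x_r = X_new[:, r]
    X_minus_r = np.delete(X_new, r, axis=1)
    z_r = x_r - X_minus_r @ gamma_r
    stats_r["zx"] += z_r @ x_r        # sum_j (z_r^{(j)})' x_r^{(j)}
    stats_r["zy"] += z_r @ y_new      # sum_j (z_r^{(j)})' y^{(j)}
    stats_r["zX"] += z_r @ X_new      # sum_j (z_r^{(j)})' X^{(j)}   (length-p vector)
    stats_r["zz"] += z_r @ z_r        # sum_j ||z_r^{(j)}||_2^2
    return stats_r

def odl_estimate(stats_r, beta_lasso, sigma2_hat, r, alpha=0.05):
    """Debiased estimate of beta_{0,r}, a standard error of the form
    sigma_hat * sqrt(sum_j ||z_r^{(j)}||^2) / |sum_j (z_r^{(j)})' x_r^{(j)}|
    (the reconstruction in (12)), and the corresponding confidence interval."""
    correction = (stats_r["zy"] - stats_r["zX"] @ beta_lasso) / stats_r["zx"]
    b_on = beta_lasso[r] + correction
    se = np.sqrt(sigma2_hat * stats_r["zz"]) / abs(stats_r["zx"])
    q = norm.ppf(1 - alpha / 2)
    return b_on, se, (b_on - q * se, b_on + q * se)
```

Here beta_prev_by_lam is a dictionary of previous-step lasso estimates kept at a grid of candidate λ values, matching the flowchart description in which β̂^{(b−1)} is maintained at several grid values of the tuning parameter.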
Now, the selected λ_b is passed to the low-dimensional projection block for the calculation of γ̂^{(b)}_r and z^{(b)}_r. The output from the low-dimensional projection block, together with the lasso estimator β̂^{(b)}(λ_b), is used to compute the debiased lasso estimator β̂^{(b)}_{on,r} and its estimated standard error.

Remark 1. When p is large, the online algorithm presented in Figure 2 requires a sufficiently large storage capacity, since the matrix S^{(b)} requires O(p²) space. To reduce memory usage, we can apply the eigenvalue decomposition (EVD) S^{(b)} = Q_b Λ_b Q_b^⊤, where Λ_b is the diagonal matrix whose diagonal elements are the eigenvalues of S^{(b)}; we only need to store Q_b and Λ_b. Since the rank of S^{(b)} is at most N_b, which is smaller than p in the high-dimensional setting, only the components corresponding to the nonzero eigenvalues need to be stored.

Figure 2: Flowchart of the online debiasing algorithm. When a new data batch D_b arrives, it is sent to the lasso estimation block for updating β̂^{(b−1)} to β̂^{(b)}; at the same time, it is also viewed as a test set for adaptively choosing the tuning parameter λ_b. In the low-dimensional projection block, sub-matrices are extracted from the updated summary statistic S^{(b)} to compute γ̂^{(b)}_r and z^{(b)}_r. The outputs from the two blocks are further used to compute the debiased lasso estimator β̂^{(b)}_{on,r}.

To establish the asymptotic properties of the ODL estimator proposed in Section 2, we first introduce some notation. Consider a random design matrix X with i.i.d. rows. Let Σ be the covariance matrix of each row of X and denote the inverse of Σ by Θ = Σ^{−1}. For r = 1, ..., p, define γ_r := arg min_{γ ∈ R^{p−1}} E‖x_r − X_{−r} γ‖_2^2, and the corresponding residual vector is z_r := x_r − X_{−r} γ_r. Let s_0 = |{j : β_{0,j} ≠ 0}| and s_r = |{k ≠ r : Θ_{k,r} ≠ 0}| be two sparsity levels. The following regularity conditions on the design matrix X and the error terms are imposed to establish the asymptotic results. Specifically, we assume that X has either i.i.d. sub-Gaussian or bounded rows. We first consider the sub-Gaussian case.

Assumption 1. Suppose that
(A1) The design matrix X has i.i.d. sub-Gaussian rows.
(A2) The smallest eigenvalue Λ²_min of Σ is strictly positive and 1/Λ²_min = O(1). In addition, the largest diagonal element of Σ satisfies max_j Σ_{j,j} = O(1).
(A3) The error terms are i.i.d. sub-Gaussian with mean zero and variance σ².

Theorem 1. Assume Assumption 1 holds. For the j-th data batch, suppose that the tuning parameter λ_j satisfies λ_j = C√(log p / N_j), j = 1, ..., b. If the first batch size satisfies n_1 ≥ c s_r log p and the subsequent batch sizes satisfy n_j ≥ c log p, j = 2, ..., b, for some constant c, and a sparsity condition of the form s_0 s_r log p / √N_b = o(1) holds, then, for any r = 1, ..., p and sufficiently large N_b, (β̂^{(b)}_{on,r} − β_{0,r}) / se^{(b)}_r converges in distribution to the standard normal distribution, where se^{(b)}_r is defined in (12).

Remark 2. Similar to the offline debiased lasso estimator (Zhang & Zhang, 2014; van de Geer et al., 2014), Theorem 1 implies that the dimensionality p can grow at an exponential rate of the sample size. However, the problem here is more difficult than that in the offline setting, and the proofs for the properties of the offline debiased estimator do not apply here. Specifically, the low-dimensional projection z_r is constructed sequentially along the data stream rather than from the full dataset, so the Karush-Kuhn-Tucker (KKT) condition used in the offline analysis cannot be applied directly; see the proof of Lemma 5 in the appendix. The requirement on the minimum batch size in Theorem 1 indicates that one may apply the online lasso algorithm once the sample size of the first data batch reaches the order of s_r log p. After that, we update the lasso estimators when the size of the newly arrived batch is of the order log p. The next theorem shows that the order of the subsequent batch size, O(log p), can be relaxed to O(1), at the price of a relatively stronger condition on N_b.

Theorem 2. Assume Assumption 1 holds. When the j-th data batch arrives, suppose that the tuning parameter λ_j satisfies λ_j = C√(log p / N_j). If the first batch size satisfies n_1 ≥ c s_r log p for some constant c and a relatively stronger growth condition on the cumulative sample size N_b holds, then, for any r = 1, ..., p and sufficiently large N_b, (β̂^{(b)}_{on,r} − β_{0,r}) / se^{(b)}_r converges in distribution to the standard normal distribution.
Remark 5. The requirement of the first batch size n_1 ≥ c_1 s_r log p in Theorems 1 and 2 is needed to establish the consistency of the lasso-type estimator (Bühlmann & van de Geer, 2011); otherwise, the error bound of γ̂^{(1)}_r defined in (10) in the first step cannot be controlled, resulting in a large error in z^{(1)}_r. When there are not enough data at the initial stage, e.g., n_1 = log p, the error bound of γ̂^{(1)}_r can also be controlled by considering a bounded parameter space such as {γ : ‖γ‖_1 ≤ C} for some large constant C rather than {γ : γ ∈ R^{p−1}}.

We now consider the case where the covariates are bounded. For a matrix A = (a_{ij}), let ‖A‖_∞ be the largest absolute value of its elements, that is, ‖A‖_∞ = max_{i,j} |a_{ij}|.

Assumption 2. Suppose that
(B1) The covariates are bounded by a finite constant K > 0, namely, ‖X‖_∞ ≤ K, where X is the design matrix.
(B2) The smallest eigenvalue Λ²_min of Σ is strictly positive and 1/Λ²_min = O(1). Moreover, max_j Σ_{j,j} = O(1).
(B3) ‖X_{−r} γ_r‖_∞ = O(K) and max_r E(z^4_{r,1}) = O(K^4), where z_{r,1} is the first element of z_r := x_r − X_{−r} γ_r.

Theorem 3. Assume Assumption 2 holds. When the j-th data batch arrives, suppose that the tuning parameter λ_j satisfies λ_j = C√(log p / N_j). If the first batch size satisfies n_1 ≥ c s_r² log p for some constant c and a condition on the cumulative sample size N_b, weaker than that in Theorem 2, holds, then, for any r = 1, ..., p and sufficiently large N_b, (β̂^{(b)}_{on,r} − β_{0,r}) / se^{(b)}_r converges in distribution to the standard normal distribution, where se^{(b)}_r is defined in (12).

Theorems 2 and 3 are established without specific assumptions on the data batch sizes except for the first batch. Compared with Theorem 2, Theorem 3 requires a relatively stronger condition on n_1, but a more relaxed condition on the cumulative sample size N_b. The proofs of Theorems 1-3 are included in the appendix; there, the leading term W_r in the decomposition of the standardized estimator is shown to be asymptotically normal by verifying the conditions of the Lindeberg central limit theorem. As a result, for any 0 < α < 1, a 100(1 − α)% confidence interval for β_{0,r} is

    ( β̂^{(b)}_{on,r} − Φ^{−1}(1 − α/2) se^{(b)}_r,  β̂^{(b)}_{on,r} + Φ^{−1}(1 − α/2) se^{(b)}_r ),

where se^{(b)}_r is the estimated standard error in (12) with (σ̂²)^{(b)} given by (9), Φ(·) is the cumulative distribution function of the standard normal distribution, and Φ^{−1} is its inverse function.

4 Simulation experiments

In this section, we conduct simulation studies to examine the finite-sample performance of the proposed online debiasing procedure in high-dimensional linear models. We randomly generate a total of N_b samples arriving in a sequence of b data batches, denoted by {D_1, ..., D_b}. Recall that s_0 is the number of nonzero components of β_0. We set half of the nonzero coefficients to be 1 (relatively strong signals) and the other half to be 0.01 (weak signals). We consider settings in which N_b > p at the end of the stream, and include the OLS method, computed with the R package lm, as a benchmark for comparison. The results are reported in Tables 1-4. It can be seen from Tables 1-4 that the estimation bias of the online debiased lasso decreases rapidly as the number of data batches b increases from 2 to 12. Both the estimated standard errors and the average length of the 95% confidence intervals exhibit a similar decreasing trend over time. Even though the coverage probabilities of the confidence intervals by OLS at the end point are around the nominal 95% level, both the estimation bias and the standard errors of the OLS estimator are much larger than those of the online debiased lasso. In particular, the estimation bias of OLS can even be 10 times that of the online debiased lasso when p = 400, as shown in Tables 1 and 2. By comparing across different signal groups, i.e., β_{0,r} = 0, 0.01, 1, we observe that both the average estimated standard error (ASE) and the average confidence interval length (ACL) are quite close across the three groups and even coincide when N_b = 1200, as shown in Tables 3-4.
We believe the similarity across signal groups is reasonable: each column of the design matrix, denoted by x_r ∈ R^{N_b×1}, has the same marginal distribution, and thus the estimated standard errors computed according to (12) are expected to be nearly identical across components r.

(Figure 3: β̂^{(b)}_{on,r} at data batches b = 2, 6, 10. Each row corresponds to a true value of the parameter, i.e., β_{0,r} = 0, 0.01, 1.)

Before applying our online algorithm, we first preprocess the original raw data. We transform the categorical predictors into one-hot vectors. We also include interaction terms, coded as products of all pairs of the original features. As a result, the dimension of the feature vector is p = 296. Since an exponential distribution fits the PM2.5 data well, we use the logarithm of PM2.5 as the response variable. In addition, we split the data into b = 120 batches evenly by chronological order; each batch contains half a month of data with size n_j = 348, j = 1, ..., b.

(Figure 4: estimation results for the air pollution data; the t-statistic is shown in the right panel.)

Next, we focus on another two variables: pressure and dew point. The results are shown in Figure 5. It can be seen that an increase in dew point is associated with an increase in PM2.5 except in the summer. On the contrary, apart from the summer, pressure has a negative impact on PM2.5. This finding also agrees with the study in Liang et al. (2015). The main difference is in the influence of dew point in the summer, which we believe arises from the interaction terms. In fact, the coefficient of the square of the dew point is significantly positive, and its estimated standard error is similar to that in the middle panel of Figure 4; both exhibit a decreasing trend. The values of the t-statistic are also presented in the right panel. For ease of illustration, we also present the traces of the estimates for wind direction, pressure, and dew point to show how the estimates evolve with the influx of new data; the result is presented in Figure 6. Moreover, we identify other significant variables, such as wind speed and some interaction terms, in this analysis. These findings suggest some interesting covariates that warrant further investigation and validation.

We next illustrate the application of ODL with an index fund dataset. This dataset consists of the returns of 1148 stocks listed on the Hong Kong Stock Exchange and the Hang Seng Index (HSI, a free-float-adjusted market-capitalization-weighted stock-market index in Hong Kong) during the period from January 2010 to December 2020. The response variable is the return of the HSI for every three days, and the predictors are the every-three-day returns of the 1148 stocks. We partition the data into batches according to chronological order. Specifically, the first batch consists of a two-year dataset from 2010-2011 to ensure a sufficient sample size at the initial stage, and each subsequent batch contains one year of data. Hence, b = 10, n_1 = 164 and n_j = 82 for j = 2, ..., 10. Similar to Lan et al. (2016), the goal of this study is to identify the most relevant stocks that can be used to create a portfolio for index tracking. The proposed ODL method is applied, and the coefficients of 19 stocks are identified to be significantly different from zero.

6 Concluding remarks

In this paper, we developed an online debiased lasso estimator for statistical inference in linear models with high-dimensional streaming data. The proposed method does not assume the availability of the full dataset at the initial stage and only requires the availability of the current batch of the data stream and the sufficient statistics of the historical data.
A natural dynamic tuning parameter selection procedure that takes advantage of the streaming data structure is developed as an important ingredient of the proposed algorithm. The proposed online inference procedure is justified theoretically under regularity conditions similar to those in the offline setting and mild conditions on the batch sizes. There are several other interesting questions that deserve further study. First, we focused on the problem of making statistical inference about individual regression coefficients; the proposed method can be extended to the case of making inference about a fixed and low-dimensional subvector of the coefficients. Second, we did not address the problem of variable selection in the online learning setting considered here. This is apparently different from the variable selection problem in the offline setting; the main issue is how to recover variables that are dropped at the early stages of the stream but may turn out to be important as more data come in. Third, it would be interesting to generalize the proposed method to generalized linear and nonlinear models. These questions warrant thorough investigation in the future.

Appendix

In the appendix, we prove Theorems 1-3 and include an additional figure of Q-Q plots from the simulation studies in Section 4.2. We first define the notation needed below: for any sequences {a_n} and {b_n}, we write a_n = O(b_n) if a_n ≤ C b_n for some constant C > 0, and a_n = Ω(b_n) if a_n ≥ c b_n for some constant c > 0. Lemmas 1-5 are needed to prove Theorem 1. The first two lemmas show the consistency of the lasso estimators.

Lemma 1. Suppose that Assumption 1 holds and n_1 ≥ c_1 s_r log p for some constant c_1. Then, for any j = 1, ..., b, with probability at least 1 − p^{−3}, the low-dimensional projection estimator γ̂^{(j)}_r defined in (10) with λ_j = c_2 √(log p / N_j) satisfies ‖γ̂^{(j)}_r − γ_r‖_1 ≤ c_3 s_r λ_j for some constant c_3.

Proof of Lemma 1. For notational convenience, we suppress the subscript r of γ in this proof. For any fixed j = 1, ..., b,

    γ̂^{(j)} := arg min_{γ ∈ R^{p−1}} { (1/(2N_j)) Σ_{i=1}^j ‖x^{(i)}_r − X^{(i)}_{−r} γ‖_2^2 + λ_j ‖γ‖_1 },    (15)

where N_j is the cumulative size of the data at step j. Note that S_r = {k ≠ r : Θ_{k,r} ≠ 0} and s_r = |S_r|. To establish the consistency of the lasso estimator, we first show that Σ̂^{(j)}_{−r}/N_j satisfies the compatibility condition for the set S_r: namely, there is a constant φ_j such that, for all γ satisfying ‖γ_{S_r^c}‖_1 ≤ 3‖γ_{S_r}‖_1, it holds that ‖γ_{S_r}‖_1^2 ≤ s_r γ^⊤ Σ̂^{(j)}_{−r} γ / (N_j φ_j^2), where the i-th element of γ_{S_r} is denoted by γ_{S_r,i} = γ_i 1{i ∈ S_r} for i = 1, ..., p − 1. It follows from an extension of Corollary 1 in Raskutti et al. (2010) from the Gaussian case to the sub-Gaussian case that, with probability at least 1 − p^{−4}, the above inequality holds as long as N_j ≥ c_1 s_r log p and the population matrix Σ_{−r} meets the compatibility condition, which hold under the assumption n_1 ≥ c_1 s_r log p and (A2) in Assumption 1, respectively. The remaining proof follows standard arguments, with some additional notation. Recall that γ := arg min_{γ ∈ R^{p−1}} E‖x_r − X_{−r} γ‖_2^2 and that γ̂^{(j)} is the lasso estimator defined in (15). By the basic inequality for the lasso, the estimation error of γ̂^{(j)} is controlled by the maximal correlation between the columns of X_{−r} and the residual z_r, in which each product x_{k,l} z_{r,l} has mean zero and is sub-exponentially distributed, by the definition of γ and (A1) in Assumption 1, respectively. By the Bernstein inequality, we obtain a tail bound for this quantity with a constant c that does not depend on p and N_j. Since the bound holds for any k ≠ r, the Bonferroni inequality yields a uniform bound over k. Choosing λ_j = c_2 √(log p / N_j), it follows that, with probability at least 1 − p^{−4}, (18) holds, where (18) uses the fact that ‖γ_{S_r^c}‖_1 = 0. This further implies (19). Since (19) holds for any j = 1, ..., b, the proof of Lemma 1 is complete.
Lemma 2. Suppose that Assumption 1 holds and N_b ≥ c_1 s_0 log p for some constant c_1. Then, with probability at least 1 − p^{−4}, the lasso estimator β̂^{(b)} in (5) with λ_b = c_2 √(log p / N_b) satisfies ‖β̂^{(b)} − β_0‖_1 ≤ c_3 s_0 λ_b for some constant c_3.

The proof of Lemma 2 is structurally similar to the proof of Lemma 1 by letting j = b; (A3) in Assumption 1 is used to obtain the concentration inequality via the Bernstein inequality, as in Lemma 1. We omit the details here.

The next lemma is used to estimate the cumulative terms in the online learning.

Lemma 3. Recall that n_j and N_j are the batch size and the cumulative sample size, respectively, when the j-th data batch arrives, j = 1, ..., b. Then the bounds (20) and (21) hold.

Proof of Lemma 3. We first prove (20). Repeating the elementary bound with t = n_j / N_{j−1} for j = b − 1, ..., 2, and in view of n_1 = N_1, (20) holds. The remaining step is to prove (21). Similarly, we repeat the corresponding claim by choosing a = N_{j−1} and b = n_j for j = b − 1, ..., 2; given n_1 = N_1, (21) holds.

For any matrix A = (a_{ij}), let ‖A‖_∞ be the largest absolute value of its elements, that is, ‖A‖_∞ = max_{i,j} |a_{ij}|.

Lemma 4. Suppose that the conditions in Lemma 1 hold and the subsequent batch sizes satisfy n_j ≥ c log p, j = 2, ..., b, for some constant c. If s_r² log p {1 + log(N_b/n_1)} = o(N_b), then ‖z_r‖_2 = Ω(√N_b).

Proof of Lemma 4. By the triangle inequality, ‖z_r‖_2 can be compared with the norm of the corresponding population residuals, where z_r = ((z^{(1)}_r)^⊤, ..., (z^{(b)}_r)^⊤)^⊤. Since n_j ≥ c log p, j = 1, ..., b, and ‖Σ_{−r}‖_∞ is bounded, it follows that, for some constant c_5, the accumulated approximation error is bounded by c_5 s_r² log p {1 + log(N_b/n_1)}, where the last inequality is from (20) in Lemma 3. As a result, we have shown that ‖z_r‖_2^2 ≤ c_6 N_b for some constant c_6 in probability. Similarly, using a lower bound of the same form, we can conclude that ‖z_r‖_2^2 ≥ c_7 N_b for some constant c_7. This completes the proof of Lemma 4.

Lemma 5. Under the conditions of Theorem 1, Σ_{k≠r} z_r^⊤ x_k (β̂^{(b)}_k − β_{0,k}) = o_P(√N_b), where x_k denotes the k-th column of the stacked design matrix (X^{(1)⊤}, ..., X^{(b)⊤})^⊤.

Proof of Lemma 5. As mentioned earlier in Remark 2, due to the online construction of z_r, we cannot directly apply the KKT condition here. To see the difference from the proof in the offline debiased lasso, we first separate Σ_{k≠r} z_r^⊤ x_k (β̂^{(b)}_k − β_{0,k}) into two parts: an offline term and an additional term arising from the online algorithm. The upper bound of the former is derived from the KKT condition, while the latter is tackled differently. Consider z_r = ((z^{(1)}_r)^⊤, ..., (z^{(b)}_r)^⊤)^⊤ and write Σ_{k≠r} z_r^⊤ x_k (β̂^{(b)}_k − β_{0,k}) = Π_off + Π_on, where Π_off pertains to an offline term and Π_on pertains to an online error. First, by the Karush-Kuhn-Tucker (KKT) condition and Lemma 2, Π_off = O_P(s_0 log(p)). Second, for the online error, by Lemma 1 we obtain a bound on the accumulated projection errors; in the proof of Lemma 4, we have shown that ‖Σ̂^{(j)}_{−r}‖_∞ is bounded with probability tending to 1, and the last step uses (21) in Lemma 3. Then Π_on = O_P(s_0 s_r log(p)). Since s_0 s_r log(p)/√N_b = o(1), the statement of the lemma follows.

Proof of Theorem 1. Consider z_r = ((z^{(1)}_r)^⊤, ..., (z^{(b)}_r)^⊤)^⊤. Then we write the online debiased estimator in (11) in the following vector form:

    β̂^{(b)}_{on,r} = β̂^{(b)}_r + z_r^⊤ (y − X β̂^{(b)}) / (z_r^⊤ x_r),

where y and X denote the stacked response vector and design matrix. Subtracting the true parameter β_{0,r}, we obtain a decomposition of β̂^{(b)}_{on,r} − β_{0,r} into a leading term involving z_r^⊤ ε and a remainder involving Σ_{k≠r} z_r^⊤ x_k (β̂^{(b)}_k − β_{0,k}). What remains to be shown is that ‖z_r‖_2 = Ω(√N_b) and that the remainder term is o_P(√N_b), as detailed by Lemma 4 and Lemma 5, respectively.

Proof of Theorems 2 and 3. Theorems 2 and 3 can be proved in the same fashion as Theorem 1. Due to limited space, we only point out the main differences. In the proof of Theorem 2, we show that the upper bound of ‖Σ̂^{(j)}_{−r}‖_∞ is O_P(log p), by replacing n_j, j = 2, ..., b, with 1 in Lemma 4 and Lemma 5. For Theorem 3, the major difference is to establish a lemma similar to Lemma 1 with the conclusion ‖γ̂^{(j)}_r − γ_r‖_1 ≤ c s_r² λ_j under Assumption 2.

(Additional figure: Q-Q plots of β̂^{(b)}_{on,r} at data batches b = 2, 6, 10. Each row corresponds to a true value of the parameter, i.e., β_{0,r} = 0, 0.01, 1.)

Acknowledgment. The work of J. Huang is partially supported by the U.S. NSF grant DMS-1916199.

References

Big data analytics in E-commerce: a systematic review and agenda for future research.
Statistics for high-dimensional data: methods, theory and applications.
Apache Flink: stream and batch processing in a single engine.
Online principal component analysis in high dimension: which algorithm to choose?
Web-based infectious disease surveillance systems and public health perspectives: a systematic review.
Real-time sentiment analysis of Twitter streaming data for stock prediction.
An iterative thresholding algorithm for linear inverse problems with a sparsity constraint.
Online debiasing for adaptively collected high-dimensional data.
High-dimensional inference: confidence intervals, p-values and R-software hdi.
Ideal spatial adaptation by wavelet shrinkage.
Adaptive subgradient methods for online learning and stochastic optimization.
Statistical sparse online regression: a diffusion approximation perspective.
Pathwise coordinate optimization.
Confidence intervals and hypothesis testing for high-dimensional regression.
A clickstream data analysis of the differences between visiting behaviors of desktop and mobile users.
Efficient processing of geospatial mHealth data using a scalable crowdsensing platform.
Testing a single regression coefficient in high dimensional linear models.
Sparse online learning via truncated gradient.
Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating.
Renewable estimation and incremental inference in generalized linear models with streaming datasets.
Restricted eigenvalue properties for correlated Gaussian designs.
Syndromic surveillance using web data: a systematic review. Innovation in Health Informatics.
Online updating of statistical inference in the big data setting.
Translational bioinformatics in the era of real-time biomedical, health care and wellness data streams.
Statistical inference for high-dimensional models via recursive online-score estimation.
A novel framework for online supervised learning with feature selection.
Online learning as stochastic approximation of regularization paths: optimality and almost-sure convergence.
Out-of-sample tests of forecasting accuracy: an analysis and review.
Regression shrinkage and selection via the lasso.
On asymptotically optimal confidence regions and tests for high-dimensional models.
Nearly unbiased variable selection under minimax concave penalty.
Confidence intervals for low dimensional parameters in high dimensional linear models.
Regularization and variable selection via the elastic net.