High-dimensional time series segmentation via factor-adjusted vector autoregressive modelling

Haeran Cho, Hyeyoung Maeng, Idris A. Eckley, Paul Fearnhead

April 6, 2022

Abstract. Piecewise stationarity is a widely adopted assumption for modelling non-stationary time series. However, fitting piecewise stationary vector autoregressive (VAR) models to high-dimensional data is challenging as the number of parameters grows quadratically with the dimension. Recent approaches to address this have imposed sparsity assumptions on the parameters of the VAR model, but such assumptions have been shown to be inadequate when datasets exhibit strong (auto)correlations. We propose a piecewise stationary time series model that accounts for pervasive serial and cross-sectional correlations through a factor structure, and only assumes that any remaining idiosyncratic dependence between variables can be modelled by a sparse VAR model. We propose an accompanying two-stage change point detection methodology which fully addresses the challenges arising from not observing either the factors or the idiosyncratic VAR process directly. Its consistency in estimating both the total number and the locations of the change points in the latent components is established under conditions considerably more general than those in the existing literature. We demonstrate the competitive performance of the proposed methodology on simulated datasets and an application to US blue chip stocks data.

1 Introduction

Vector autoregressive (VAR) models have emerged as a popular approach to model cross-sectional and serial correlations in multivariate, possibly high-dimensional time series. See, for example, recent contributions in finance (Barigozzi and Hallin, 2017; Barigozzi and Brownlees, 2019), biology (Shojaie and Michailidis, 2010) and genomics (Michailidis and d'Alché-Buc, 2013). Within such settings, the importance of data segmentation is well-recognised, particularly where datasets are collected in naturally nonstationary environments. Indeed, methods have been proposed for detecting change points under piecewise stationary VAR models in both fixed (Kirch et al., 2015) and high dimensions (Safikhani and Shojaie, 2022; Wang et al., 2019; Bai et al., 2020; Maeng et al., 2022). The dynamics of a VAR process are determined by transition matrices that specify how future values of the series depend on their past. Consequently, VAR modelling quickly becomes a high-dimensional problem since the number of parameters grows quadratically with the dimension. In addition, the stability of VAR processes necessitates strong assumptions on the parameters, requiring either that their magnitude is small or that the transition matrices are highly sparse (Lin and Michailidis, 2020). Therefore, most existing change point methods for high-dimensional, piecewise stationary VAR processes have been developed under the assumption that the transition matrices are sparse over each segment. However, it is debatable whether highly sparse models are appropriate for some applications where VAR models are popularly adopted. In fact, Giannone et al. (2021) note the difficulty of identifying sparse predictive representations for several macroeconomic applications. Factor-adjusted regression modelling has increasingly gained popularity for high-dimensional time series analysis.
See, for example, the recent contributions of Fan et al. (2020, 2021) and Barigozzi et al. (2022). It allows for dominant cross-sectional (auto)correlations to be accounted for by a handful of common factors, which makes it reasonable to assume a sparse VAR model on the remaining component to capture idiosyncratic, variable-specific dependence. Such a modelling assumption is in line with the empirical approach frequently taken in the financial time series analysis literature where, prior to fitting a sparse regression model, the common sources of strong serial and cross-sectional correlations are identified and removed (Barigozzi and Hallin, 2017; Barigozzi and Brownlees, 2019; Guðmundsson and Brownlees, 2021). In this paper, we extend this framework to the piecewise stationarity setting by proposing a new, piecewise stationary factor-adjusted VAR model, and develop an accompanying change point detection methodology. Under our proposed model, the observed time series is decomposed into a factor-driven common component and an idiosyncratic VAR process. Both of these latent components are permitted to undergo changes which, in the case of the latter, are attributed to changes in the VAR parameters. To the best of our knowledge, such a general model, simultaneously permitting the presence of common factors and change points, has not previously been studied in the literature. Accordingly, we are not aware of any change point method that can comprehensively address the problem of multiple change point detection considered in this paper. The proposed data segmentation methodology adopts a moving window-based procedure for multiple change point detection, which has successfully been adopted for a variety of change point problems (Preuss et al., 2015; Yau and Zhao, 2016; Eichinger and Kirch, 2018; Chen et al., 2021). The procedure consists of two stages and fully addresses the challenges arising from neither component being directly observed. In the first stage, a nonparametric method is proposed to detect change points in the common component by contrasting local spectral density matrix estimators from neighbouring moving windows. Then, the second stage detects changes in the idiosyncratic component by locally evaluating the Yule-Walker equation for VAR estimation, which is feasible once the common component change points are estimated and factor adjustment is performed. Our proposed two-stage methodology achieves consistency in estimating the total number and locations of the change points in both of the components. Our theoretical analysis permits dependence across stationary segments and heavy-tailedness of the data, a considerably more general setting than those commonly adopted in the VAR segmentation literature. Under Gaussianity, the rate of localisation achieved by the first-stage method nearly matches the minimax lower bound derived for the simpler, covariance change point detection problem, up to an extra multiplicative factor that accounts for simultaneously considering changes at multiple lags of the autocovariances. The second-stage method, when applied to piecewise stationary VAR processes directly, shows theoretical performance competitive with existing methods, while being potentially much less computationally demanding in high dimensions. The rest of the paper is structured as follows. Section 2 introduces the piecewise stationary time series model.
Sections 3 and 4 detail the methodologies proposed for detecting change points in the common and the idiosyncratic components, respectively, and establish their theoretical consistency. Section 5 demonstrates the good performance of the two-stage methodology on simulated datasets in comparison with competing methods, as well as showing its usefulness in an application to a panel of volatility measures of US blue chip companies. R code implementing our method is available from https://github.com/haeran-cho/fvarseg.

Notation. We denote by A* the transposed complex conjugate of any complex matrix A. The imaginary unit is denoted by ι = √−1. By I, O and 0, we denote an identity matrix, a matrix of zeros and a vector of zeros whose dimensions depend on the context. For a random variable X and ν ≥ 1, denote ‖X‖_ν = (E|X|^ν)^{1/ν}. For a matrix A = [a_{ii'}, 1 ≤ i ≤ m, 1 ≤ i' ≤ n], we define its element-wise ℓ_∞-, ℓ_0-, ℓ_1- and ℓ_2-norms by |A|_∞ = max_{1≤i≤m} max_{1≤i'≤n} |a_{ii'}|, |A|_0 = Σ_{i=1}^m Σ_{i'=1}^n I{a_{ii'} ≠ 0}, |A|_1 = Σ_{i=1}^m Σ_{i'=1}^n |a_{ii'}| and |A|_2 = (Σ_{i=1}^m Σ_{i'=1}^n |a_{ii'}|²)^{1/2}, and its spectral and induced L_1-norms by ‖A‖ and ‖A‖_1 = max_{1≤i'≤n} Σ_{i=1}^m |a_{ii'}|, respectively. Also, for a positive definite matrix A, we denote its minimum eigenvalue by µ_min(A). For two real numbers, set a ∨ b = max(a, b) and a ∧ b = min(a, b). Given two sequences {a_n} and {b_n}, we write a_n ≍ b_n if, for some finite positive constants C_1 and C_2, there exists N ∈ N_0 = N ∪ {0} such that C_1 ≤ a_n b_n^{−1} ≤ C_2 for all n ≥ N. Throughout, L stands for the lag operator, i.e. L^ℓ X_t = X_{t−ℓ} for any integer ℓ ≥ 0.

2 Piecewise stationary factor-adjusted VAR model

In this section, we introduce a piecewise stationary factor-adjusted VAR model in Definition 2.1 under which we develop a data segmentation methodology.

Definition 2.1. We observe a zero-mean process {X_{it}, 1 ≤ i ≤ p, 1 ≤ t ≤ n} satisfying

X_t = χ_t + ξ_t, (1)

where θ_{χ,k}, 1 ≤ k ≤ K_χ, denote the change points in the common component χ_t, and θ_{ξ,k}, 1 ≤ k ≤ K_ξ, the change points in the idiosyncratic component ξ_t, with the notational convention that θ_{χ,0} = θ_{ξ,0} = 0 and θ_{χ,K_χ+1} = θ_{ξ,K_ξ+1} = n. We denote by χ_t^{[k]} (resp. ξ_t^{[k]}) the stationary process associated with the k-th segment of the common (resp. idiosyncratic) component, which are generated as follows.

(A1) For each 0 ≤ k ≤ K_χ, there exists a stationary process {χ_t^{[k]}, t ∈ Z} satisfying

χ_t^{[k]} = B^{[k]}(L) u_t, (2)

where B^{[k]}(L) = Σ_{ℓ=0}^∞ B_ℓ^{[k]} L^ℓ denotes a matrix of square-summable filters with B_ℓ^{[k]} = [B_{ℓ,ij}^{[k]}, 1 ≤ i ≤ p, 1 ≤ j ≤ q] ∈ R^{p×q} and q ≤ p, and B^{[k]}(L) ≠ B^{[k+1]}(L) for all 0 ≤ k ≤ K_χ − 1. The vector process of common factors u_t = (u_{1t}, ..., u_{qt})' is independently and identically distributed (i.i.d.) with E(u_t) = 0 and Cov(u_t) = I_q for all t ∈ Z. The number of factors is permitted to vary from one segment to another as some finite q_k ≤ q, such that B^{[k]}(L) u_t can be written with an appropriate sub-vector u_t^{[k]} ∈ R^{q_k} of u_t and the corresponding sub-matrix filter of B^{[k]}(L).

(A2) For each 0 ≤ k ≤ K_ξ, there exists a stationary VAR(d) process {ξ_t^{[k]}, t ∈ Z} satisfying

ξ_t^{[k]} = Σ_{ℓ=1}^d A_ℓ^{[k]} ξ_{t−ℓ}^{[k]} + (Γ^{[k]})^{1/2} ε_t,

where A_ℓ^{[k]} ∈ R^{p×p}, d is a finite non-negative integer, Γ^{[k]} ∈ R^{p×p} is positive definite with (Γ^{[k]})^{1/2} denoting its symmetric square root matrix, and ε_t = (ε_{1t}, ..., ε_{pt})' is a random vector of i.i.d. innovations with E(ε_{it}) = 0 and Var(ε_{it}) = 1. The transition matrices change over time such that A^{[k]}(L) ≠ A^{[k+1]}(L) for all 0 ≤ k ≤ K_ξ − 1. The VAR order is permitted to vary from one segment to another as some finite d_k ≤ d, such that A_ℓ^{[k]} = O for ℓ ≥ d_k + 1.
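To fix ideas, the following R sketch simulates a toy instance of the model in Definition 2.1. All specifics (q = 2, MA(1) loading filters for χ_t, a sparse VAR(1) for ξ_t, one change point per component) are illustrative choices of ours and not the designs used in the paper's experiments.

```r
## Minimal sketch: simulate a toy piecewise stationary factor-adjusted VAR
## process X_t = chi_t + xi_t as in Definition 2.1 (illustrative, not the
## paper's simulation design).
set.seed(1)
n <- 400; p <- 20; q <- 2
theta_chi <- 200; theta_xi <- 250            # change points in chi_t and xi_t

u   <- matrix(rnorm(n * q), n, q)            # i.i.d. factors, Cov(u_t) = I_q
eps <- matrix(rnorm(n * p), n, p)            # i.i.d. idiosyncratic innovations

# Common component: chi_t = B0 u_t + B1 u_{t-1}, loadings redrawn at the
# change point so that B^[0](L) != B^[1](L)
B0 <- list(matrix(rnorm(p * q), p, q), matrix(rnorm(p * q), p, q))
B1 <- list(matrix(rnorm(p * q), p, q), matrix(rnorm(p * q), p, q))
chi <- matrix(0, n, p)
for (t in 2:n) {
  k <- (t > theta_chi) + 1
  chi[t, ] <- B0[[k]] %*% u[t, ] + B1[[k]] %*% u[t - 1, ]
}

# Idiosyncratic component: sparse VAR(1), the transition matrix rescaled to
# have spectral norm 0.7, with its sign flipped at the change point
A <- matrix(rnorm(p * p) * rbinom(p * p, 1, 3 / p), p, p)
A <- A * 0.7 / max(svd(A)$d)
A_list <- list(A, -A)                        # A^[0](L) != A^[1](L)
xi <- matrix(0, n, p)
for (t in 2:n) {
  k <- (t > theta_xi) + 1
  xi[t, ] <- A_list[[k]] %*% xi[t - 1, ] + eps[t, ]
}

X <- chi + xi                                # observed panel
```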
In Definition 2.1, we do not require that the change points in the common and the idiosyncratic components are aligned, or that K_χ = K_ξ. Our goal is to estimate the total number and locations of the change points separately for the factor-driven common component and the idiosyncratic VAR process. The processes {χ_t^{[k]}, t ∈ Z} (resp. {ξ_t^{[k]}, t ∈ Z}) are allowed to be dependent across k through sharing the common innovations u_t (resp. ε_t). This makes our model considerably more general than those commonly found in the literature on (high-dimensional) time series segmentation under parametric models (Davis et al., 2006; Wang et al., 2019; Safikhani and Shojaie, 2022; Bai et al., 2021b), where the assumption of independence across the segments plays an important role. In (A2), we permit the VAR innovation covariance to change, but it is the changes in the VAR transition matrices which are of interest to detect. The problem of change point detection under factor models is considered in Barigozzi et al. (2018) and Li et al. (2019b), who model the common component using piecewise stationary, static factor models. On the other hand, under (A1), we adopt a piecewise stationary extension of the generalised dynamic factor model (GDFM) first proposed by Forni et al. (2000), which provides the most general approach to factor modelling of high-dimensional time series, and encompasses popularly adopted static factor models, i.e. those where χ_t loads a finite-dimensional factor vector contemporaneously, as special cases. Forni et al. (2015) show that each χ_t^{[k]} admits a representation as a singular VAR process, which gives the model in (1) the interpretation of decomposing X_t into two latent, piecewise stationary VAR processes, one with a low-rank structure and the other with a sparse one; see Appendix B for further discussion of the GDFM. In the remainder of this section, we list the assumptions that enable (asymptotic) identification of the model (1), and control the degree of dependence in χ_t and ξ_t using the functional dependence measures defined in Zhang and Wu (2021). The following assumption places a mild condition on the decay of serial dependence in χ_t; similar but stronger conditions are found in Forni et al. (2015) in a stationary setting.

Assumption 2.1. There exist constants Ξ > 0 and ς > 2 such that for all ℓ ≥ 0, max_{0≤k≤K_χ} |B_ℓ^{[k]}|_∞ ≤ Ξ(1 + ℓ)^{−ς}.

Since χ_t and ξ_t in (1) are latent, we introduce assumptions that ensure their (asymptotic) identifiability. Let Γ_χ^{[k]}(ℓ) denote the lag-ℓ autocovariance matrix of χ_t^{[k]}, Σ_χ^{[k]}(ω) = (2π)^{−1} Σ_{ℓ∈Z} Γ_χ^{[k]}(ℓ) e^{−ιℓω} its spectral density matrix at frequency ω ∈ [−π, π], and µ_{χ,j}^{[k]}(ω), 1 ≤ j ≤ q_k, the real, positive eigenvalues of Σ_χ^{[k]}(ω) ordered by decreasing size. We similarly define Γ_ξ^{[k]}(ℓ), Σ_ξ^{[k]}(ω) and µ_{ξ,j}^{[k]}(ω) for ξ_t^{[k]}.

Assumption 2.2. For each 0 ≤ k ≤ K_χ, the following holds: there exist a positive integer p_0 ≥ 1, pairs of functions ω → α_j(ω) and ω → β_j(ω) for ω ∈ [−π, π] and 1 ≤ j ≤ q_k, and r_{k,j} ∈ (0, 1] satisfying r_{k,1} ≥ ... ≥ r_{k,q_k}, such that for all p ≥ p_0, β_j(ω) ≥ p^{−r_{k,j}} µ_{χ,j}^{[k]}(ω) ≥ α_j(ω) > 0.

If r_{k,j} = 1 for all j, we are in the presence of q_k factors which are equally pervasive across the whole cross-section of χ_t^{[k]}. If r_{k,j} < 1 for some j, we permit the presence of 'weak' factors. Since our primary interest lies in change point analysis, we later introduce a related but distinct condition on the size of change in the common component in Assumption 3.1. Assumption 2.3 characterises the serial and cross-sectional dependence of ξ_t.
Writing ξ_t^{[k]} = Σ_{ℓ=0}^∞ D_ℓ^{[k]} (Γ^{[k]})^{1/2} ε_{t−ℓ} for the VMA(∞) representation of each segment, condition (iv) of Assumption 2.3 requires that there exist constants Ξ > 0 and ς > 2 such that for all ℓ ≥ 0, we have max_{0≤k≤K_ξ} |D_ℓ^{[k]}|_∞ ≤ Ξ(1 + ℓ)^{−ς}. Assumption 2.3 (i) is a standard condition in the literature (Lütkepohl, 2005), and (ii) generalises the commonly found assumption of Γ^{[k]} = I in the VAR segmentation literature. Condition (iv) allows for mild cross-sectional and temporal dependence in ξ_t^{[k]}, and leads to the uniform boundedness of µ_{ξ,1}^{[k]}(ω) as shown in Proposition 2.1 below.

Proposition 2.1. Under Assumption 2.3, uniformly over all ω ∈ [−π, π], there exists some M_ξ > 0, depending only on M_ε, Ξ and ς defined in the assumption, such that max_{0≤k≤K_ξ} µ_{ξ,1}^{[k]}(ω) ≤ M_ξ.

This, together with Assumption 2.3 (iii), establishes the boundedness of the eigenvalues of Σ_ξ^{[k]}(ω), an assumption often found in the high-dimensional VAR modelling literature for the stability of VAR processes (Basu and Michailidis, 2015); we refer to Barigozzi et al. (2022) for a detailed discussion on how this condition relates to the stability of each ξ_t^{[k]}. In fact, under stationarity, Assumption 2.2 and Proposition 2.1 jointly constitute both a necessary and sufficient condition for the process X_t to admit the dynamic factor representation in (1)-(2); see Forni et al. (2000) and Forni and Lippi (2001). For the identification of the two latent components under (1), we exploit the gap between the eigenvalues of the spectral density matrices of χ_t and ξ_t as p → ∞, which follows from Assumption 2.2 and Proposition 2.1 and carries over to a large gap in the eigenvalues of the spectral density matrix of X_t by Weyl's inequality.

Assumption 2.4. (i) The common and idiosyncratic shocks are uncorrelated, i.e. E(u_{jt} ε_{it'}) = 0 for any 1 ≤ i ≤ p, 1 ≤ j ≤ q and t, t' ∈ Z. (ii) We assume either of the following conditions on the distributions of u_t and ε_t: (a) there exists ν > 4 such that µ_ν = max_{1≤j≤q} ‖u_{jt}‖_ν ∨ max_{1≤i≤p} ‖ε_{it}‖_ν < ∞; (b) u_t and ε_t follow Gaussian distributions.

Assumption 2.4 (i) is standard in the factor modelling literature. We operate under either of the two conditions in Assumption 2.4 (ii), where (a) is the weaker one permitting heavy-tailed innovations, while the results derived under (b) are readily compared to those derived in the literature under piecewise stationary Gaussian VAR models.

3 Stage 1: Common component segmentation

Under the model in (A1), the spectral density matrix of the common component satisfies Σ_χ^{[k]}(ω) = (2π)^{−1} B^{[k]}(e^{−ιω})(B^{[k]}(e^{−ιω}))* and thus varies over time, with Σ_χ^{[k]}(ω) ≠ Σ_χ^{[k+1]}(ω) for all 0 ≤ k ≤ K_χ − 1. Assumption 2.2 and Proposition 2.1 indicate the presence of an eigengap in the spectral density matrix of X_t, i.e. the eigenvalues corresponding to the common component diverge with p while the remaining ones are bounded for all p. This motivates us to look for changes in the common component from the behaviour of X_t in the frequency domain, which is further justified by the following example.

Example 3.1. Suppose that there exists a single change point at t = θ_{χ,1} in the common component (K_χ = 1) at which a new factor with loading filter b(L) is introduced. Let Σ_{x,t}(ω) denote the spectral density of X_t at frequency ω which, due to the change points in χ_t and ξ_t, varies over time. Then, from the above observations, the orthogonality between χ_t and ξ_t (Assumption 2.4 (i)) and Proposition 2.1, the change in the spectral density of χ_t is detectable as a change in the time-varying spectral density matrix of X_t in operator norm, provided that (b(e^{−ιω}))* b(e^{−ιω}) diverges with p, which follows from Assumption 2.2 on the strength of factors.
Thus motivated, we propose to detect and locate changes in χ_t in the frequency domain by searching for any large change (in the operator norm) in local estimators of the spectral density matrix of X_t. For this, we propose the following moving window-based methodology. Given a bandwidth G and a time point v ∈ {G, G + 1, ..., n}, we estimate the local spectral density matrix of X_t over the window I_v(G) = {v − G + 1, ..., v} by

Σ̂_{x,v}(ω, G) = (2π)^{−1} Σ_{|ℓ|≤m} K(ℓ/m) Γ̂_{x,v}(ℓ, G) e^{−ιℓω}, (4)

where K(·) denotes the Bartlett kernel, m = G^β the kernel bandwidth for some β ∈ (0, 1), and Γ̂_{x,v}(ℓ, G) the lag-ℓ sample autocovariance matrix computed from {X_t, t ∈ I_v(G)}. Then the following statistic, contrasting the local estimators from I_v(G) and I_{v+G}(G),

T_{χ,v}(ω, G) = p^{−1} ‖Σ̂_{x,v}(ω, G) − Σ̂_{x,v+G}(ω, G)‖, (6)

serves as a good proxy of the difference in local spectral density matrices of χ_t. To make this precise, let Σ_{χ,v}(ω, G) denote a weighted average of the true spectra Σ_χ^{[k]}(ω) from the segments covered by the window t ∈ I_v(G), i.e.

Σ_{χ,v}(ω, G) = G^{−1} Σ_{t∈I_v(G)} Σ_χ^{[L_χ(t)]}(ω), (7)

with L_χ(v) = max{0 ≤ k ≤ K_χ : θ_{χ,k} + 1 ≤ v} denoting the index of the change point nearest to and strictly to the left of a time point v, and set T*_{χ,v}(ω, G) = p^{−1} ‖Σ_{χ,v}(ω, G) − Σ_{χ,v+G}(ω, G)‖. Then, at a given frequency ω, the function v → T*_{χ,v}(ω, G) linearly increases and then decreases around the change points, with a peak of size p^{−1}‖Σ_χ^{[k]}(ω) − Σ_χ^{[k−1]}(ω)‖ formed at v = θ_{χ,k} for all 1 ≤ k ≤ K_χ, provided that the bandwidth G is not too large (in the sense of Assumption 3.1 (ii) below). The statistic T_{χ,v}(ω, G) is designed to approximate T*_{χ,v}(ω, G) when χ_t is not directly observed, and thus is well-suited to detect and locate θ_{χ,k}, 1 ≤ k ≤ K_χ. Algorithm 1 outlines the moving window-based procedure for multiple change point detection in χ_t. We evaluate T_{χ,v}(ω, G) at the Fourier frequencies

ω_l = πl/m, l = 0, 1, ..., m, (8)

and take the maximum over the frequencies at each given location v, which is then compared against some threshold κ_{n,p}. When max_{0≤l≤m} T_{χ,v}(ω_l, G) exceeds κ_{n,p}, there is some evidence for a change point around v. However, care is needed to avoid detecting the same change point more than once, since max_l T_{χ,v}(ω_l, G) is expected to take a large value over an interval containing θ_{χ,k} for each k, with a peak formed near the change point as in the case of T*_{χ,v}(ω, G). Therefore, we regard a time point θ̂ as a change point estimator if it fulfils the following conditions simultaneously with some η ∈ (0, 1): (i) we have max_{0≤l≤m} T_{χ,θ̂}(ω_l, G) > κ_{n,p}, and (ii) θ̂ is a local maximiser of T_{χ,v}(ω(θ̂), G) within an interval of radius ηG centred at θ̂, at the frequency ω(θ̂) = arg max_{ω_l: 0≤l≤m} T_{χ,θ̂}(ω_l, G). These checks correspond to Steps 1 and 2 of Algorithm 1.

Algorithm 1: Change point detection in the common component
Input: Data {X_t}_{t=1}^n, lag window size m, Bartlett kernel K(·), bandwidth G, η ∈ (0, 1], threshold κ_{n,p}
Step 0: Set Θ̂_χ ← ∅.
Step 1: At ω_l, 0 ≤ l ≤ m, compute T_{χ,v}(ω_l, G), G ≤ v ≤ n − G, in (6), and identify I = {G ≤ v ≤ n − G : max_{0≤l≤m} T_{χ,v}(ω_l, G) > κ_{n,p}}.
Step 2: Let θ̂ = arg max_{v∈I} max_l T_{χ,v}(ω_l, G) and ω(θ̂) = arg max_{ω_l: 0≤l≤m} T_{χ,θ̂}(ω_l, G); update Θ̂_χ ← Θ̂_χ ∪ {θ̂}.
Step 3: Update I ← I \ {θ̂ − G + 1, ..., θ̂ + G}.
Step 4: Repeat Steps 2-3 until I is empty.
Output: Θ̂_χ

Once θ̂ is added to the set of final estimators, Step 3 ensures that we detect only a single change point for each θ_{χ,k} by disregarding the interval of radius G centred at θ̂. By Step 4, we iteratively consider local maximisers of T_{χ,v}(ω_l, G) and determine whether they detect change points or not according to (i)-(ii) above. We defer the discussion on the selection of the tuning parameters G, m, η and κ_{n,p} to Appendix C.1-C.2; in the former, we discuss a multiscale extension of Algorithm 1 which allows for a range of bandwidths.
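The following R sketch illustrates the local estimator (4) and the statistic (6), assuming the p^{−1}-scaled operator-norm contrast described above. The function names local_spec and T_chi are ours, not part of the fvarseg package, and the scaling and normalisation conventions are our reading of (4) and (6) rather than the exact implementation.

```r
## Minimal sketch of the Stage 1 statistic: Bartlett-kernel local spectral
## density estimates on adjacent windows I_v(G) and I_{v+G}(G), and the
## (scaled) operator norm of their difference. Illustrative only.
local_spec <- function(X, v, G, m, omega) {
  Xw <- X[(v - G + 1):v, , drop = FALSE]   # window I_v(G)
  # lag-l local sample autocovariance Gamma_hat(l)
  acv <- function(l) crossprod(Xw[(1 + l):G, , drop = FALSE],
                               Xw[1:(G - l), , drop = FALSE]) / G
  S <- acv(0) / (2 * pi)
  for (l in 1:m) {                         # Bartlett kernel weight K(l/m)
    w   <- (1 - l / m) / (2 * pi)
    G_l <- acv(l)                          # Gamma(-l) = t(Gamma(l)) for real X
    S   <- S + w * (G_l * exp(-1i * l * omega) + t(G_l) * exp(1i * l * omega))
  }
  S
}

T_chi <- function(X, v, G, m, omega) {
  d <- local_spec(X, v, G, m, omega) - local_spec(X, v + G, G, m, omega)
  max(svd(d)$d) / ncol(X)                  # operator norm / p
}

## scan over locations at a single frequency, e.g. omega = 0:
# stat <- sapply(G:(nrow(X) - G), function(v) T_chi(X, v, G = 50, m = 4, 0))
```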
Let us denote by ∆_{χ,k}(ω) = Σ_χ^{[k]}(ω) − Σ_χ^{[k−1]}(ω), 1 ≤ k ≤ K_χ, the difference in the spectral density matrices of χ_t from neighbouring segments. As ∆_{χ,k}(ω) is Hermitian, we can always find the j-th largest (in modulus) real-valued eigenvalue of ∆_{χ,k}(ω), which we denote by µ_j(∆_{χ,k}(ω)). In the following assumption, µ_1(∆_{χ,k}(ω)) = ‖∆_{χ,k}(ω)‖ is used to measure the size of change in the common component at the change point θ_{χ,k}. Recall that m = G^β, for some β ∈ (0, 1), denotes the bandwidth used in local spectral density estimation, see (4).

Assumption 3.1. (i) For each 1 ≤ k ≤ K_χ, the following holds: there exist a positive integer p_0 ≥ 1, pairs of functions ω → a_j^{[k]}(ω) and ω → b_j^{[k]}(ω) for ω ∈ [−π, π] and j = 1, 2, and r_{k,1} ∈ (0, 1] and r_{k,2} ∈ [0, 1] satisfying r_{k,1} ≥ r_{k,2}, such that b_j^{[k]}(ω) ≥ p^{−r_{k,j}} µ_j(∆_{χ,k}(ω)) ≥ a_j^{[k]}(ω) for all p ≥ p_0. Besides, we assume that the functions ω → p^{−r_{k,1}} µ_1(∆_{χ,k}(ω)) are Lipschitz continuous with bounded Lipschitz constants, and that min_{1≤k≤K_χ} p^{−1}‖∆_{χ,k}‖ diverges faster than (ψ_n ∨ m^{−1}).

(ii) The bandwidth G = G_n satisfies G_n → ∞ as n → ∞ while fulfilling

min_{0≤k≤K_χ}(θ_{χ,k+1} − θ_{χ,k}) ≥ 2G and min_{0≤k≤K_ξ}(θ_{ξ,k+1} − θ_{ξ,k}) ≥ 2G. (10)

Assumption 3.1 specifies the detection lower bound, which is determined by min_k ‖∆_{χ,k}‖ and min_k(θ_{χ,k+1} − θ_{χ,k}) (through G), for all K_χ change points to be detectable by Algorithm 1. In addition to assuming that min_{1≤k≤K_χ} p^{−1}‖∆_{χ,k}‖ diverges faster than (ψ_n ∨ m^{−1}), Condition (i) requires µ_1(∆_{χ,k}(ω)) to be distinct from the rest; in fact, the remaining eigenvalues of ∆_{χ,k}(ω) are allowed to be exactly zero, which is the case in Example 3.1. It is possible to find the kernel bandwidth m that minimises (ψ_n ∨ m^{−1}), e.g. as m ≍ (G/log(n))^{1/3} under Gaussianity. This rate, roughly speaking, represents the bias-variance trade-off in approximating Σ_{χ,v}(ω, G) with Σ̂_{x,v}(ω, G) uniformly in v and ω (see Proposition D.6), where the rate ψ_n increases for heavier-tailed u_t and ε_t as well as with increasing m. In our theoretical results, we make the presence of m explicit to highlight the role of this tuning parameter. The additional technical condition on min_k(θ_{ξ,k+1} − θ_{ξ,k}) is placed in order that the number of total change points within any interval of radius G is at most one.

Theorem 3.1. Suppose that Assumptions 2.1-2.4 and 3.1 hold, and let κ_{n,p} satisfy the rate condition in (9) for some constant M > 0. Then, there exists a set M_{n,p}^χ with P(M_{n,p}^χ) → 1 as n, p → ∞, such that the following holds for Θ̂_χ = {θ̂_{χ,k}, 1 ≤ k ≤ K̂_χ : θ̂_{χ,1} < ... < θ̂_{χ,K̂_χ}} returned by Algorithm 1, on M_{n,p}^χ for large enough n and p: (a) K̂_χ = K_χ and max_{1≤k≤K_χ} |θ̂_{χ,k} − θ_{χ,k}| ≤ ϵ_0 G for some ϵ_0 ∈ (0, 1/2) with η ∈ (2ϵ_0, 1]; (b) |θ̂_{χ,k} − θ_{χ,k}| ≤ ρ_{n,p}^{[k]} for each 1 ≤ k ≤ K_χ, where the localisation rate ρ_{n,p}^{[k]} takes a sharper form under Assumption 2.4 (ii) (b).

Remark 3.1. (i) Empirically, replacing θ̂ with θ̂ = arg max_{v∈I} avg_l T_{χ,v}(ω_l, G) in Step 2 of Algorithm 1 returns a more stable location estimator, where avg_l denotes the average operator over l = 0, ..., m. We can derive a localisation rate for this estimator similar to that reported in Theorem 3.1 (b), with ∆̄_{χ,k} = π^{−1} ∫_0^π ‖∆_{χ,k}(ω)‖ dω in place of ‖∆_{χ,k}‖, and our numerical results in Section 5.1 are based on this estimator. (ii) In Theorem 3.1 (a), the constant ϵ_0 can be made arbitrarily small as n, p → ∞ due to Assumption 3.1 (i). In Theorem 3.1 (b), ρ_{n,p}^{[k]} reflects the difficulty associated with estimating the individual change point θ_{χ,k}, manifested by (p^{−1}‖∆_{χ,k}‖)^{−2}. In the Gaussian case (Assumption 2.4 (ii) (b)), the localisation rate derived in (b) is always sharper than G due to Assumption 3.1 (i).
Considering the problem of covariance change point detection in independent, sub-Gaussian random vectors in high dimensions, Wang et al. (2021) derive the minimax lower bound on the localisation rate in their Lemma 3.2, and ρ_{n,p}^{[k]} matches this rate up to a factor of m log(n); here, the dependence on the kernel bandwidth m is attributed to the time series segmentation problem considered in this paper, i.e. a change may occur in the autocovariance matrices at lags other than zero. If heavier tails are permitted (Assumption 2.4 (ii) (a)), ρ_{n,p}^{[k]} can be tighter than ϵ_0 G, e.g. when ‖∆_{χ,k}‖ ≍ p, K_χ is fixed and m ≍ G^β for some β ∈ (0, 1 − 4/ν).

Following the detection of change points in the common component, the next step is to perform post-segmentation factor adjustment via dynamic principal component analysis (Brillinger, 1981). Estimating the spectral density of {X_t, θ̂_{χ,k} + 1 ≤ t ≤ θ̂_{χ,k+1}} by Σ̂_x^{[k]}(ω) = Σ̂_{x,θ̂_{χ,k+1}}(ω, θ̂_{χ,k+1} − θ̂_{χ,k}) as defined in (4) (with the same kernel bandwidth m for simplicity), we estimate the spectral density and autocovariance matrices of χ_t^{[k]} via

Σ̂_χ^{[k]}(ω_l) = Ê_x^{[k]}(ω_l) M̂_x^{[k]}(ω_l) (Ê_x^{[k]}(ω_l))*, (11)

where M̂_x^{[k]}(ω_l) ∈ R^{q_k×q_k} is a diagonal matrix with the q_k largest eigenvalues of Σ̂_x^{[k]}(ω_l) on its diagonal, Ê_x^{[k]}(ω_l) ∈ C^{p×q_k} consists of the corresponding q_k eigenvectors, and the Fourier frequencies ω_l are defined in (8); the autocovariance matrix estimators follow by inverse Fourier transformation. The estimators in (11) require the segment-specific factor number q_k. We refer to Hallin and Liška (2007) and Onatski (2009) for consistent estimators of q_k that make use of the postulated eigengap in the spectral density matrix of X_t. In what follows, we treat q_k as known. The following assumption imposes a stronger factor structure than Assumption 2.2.

Assumption 3.2. Assumption 2.2 holds with r_{k,j} = 1 for all 1 ≤ j ≤ q_k and 0 ≤ k ≤ K_χ.

It is possible to work under the weaker Assumption 2.2 and trace the effect of weak factors, given by p^{−1} min_ω µ_{χ,q_k}^{[k]}(ω), in the theoretical analysis; see Barigozzi et al. (2022), where results analogous to Theorem 3.2 below are derived in the stationary setting while permitting weak factors. We choose to work under the stronger Assumption 3.2 as it simplifies the presentation of Theorem 3.2, the results from which carry over to the second-stage change point analysis of the idiosyncratic VAR process investigated in Section 4.

Theorem 3.2. Suppose that Assumption 3.2 holds in addition to the assumptions made in Theorem 3.1, and define ρ̄_{n,p} = max_{1≤k≤K_χ} min(ϵ_0 G, ρ_{n,p}^{[k]}). Then, on M_{n,p}^χ defined in Theorem 3.1, for some finite integer d ∈ N, the autocovariance matrix estimators of the idiosyncratic component obtained after factor adjustment attain a uniform element-wise error rate θ̌_{n,p}, defined in (12), over the lags 0 ≤ ℓ ≤ d.

4 Stage 2: Idiosyncratic component segmentation

We re-write the segment-specific VAR model for ξ_t^{[k]} defined in (A2) as a linear regression of ξ_t^{[k]} on its d lagged values with the parameter matrix β^{[k]} = [A_1^{[k]}, ..., A_d^{[k]}]' ∈ R^{(pd)×p}, for 0 ≤ k ≤ K_ξ. The idiosyncratic component ξ_t is latent under the factor-adjusted VAR model in (1), and therefore existing methods for time-varying VAR segmentation are not directly applicable to our problem. Instead, we make use of the Yule-Walker equation β^{[k]} = (G^{[k]})^{−1} g^{[k]}, where G^{[k]} = [Γ_ξ^{[k]}(ℓ − ℓ'), 1 ≤ ℓ, ℓ' ≤ d] ∈ R^{(pd)×(pd)} and g^{[k]} = [(Γ_ξ^{[k]}(1))', ..., (Γ_ξ^{[k]}(d))']' ∈ R^{(pd)×p}, with each G^{[k]} invertible due to Assumption 2.3 (iii). We propose to utilise this estimating equation for detecting change points in the latent VAR process ξ_t, in combination with a moving window-based procedure. For a given bandwidth G, we denote a weighted average of Γ_ξ^{[k]}(ℓ) over the window I_v(G) by Γ_{ξ,v}(ℓ, G), defined analogously to (7). Consider G_v(G) and g_v(G), which are obtained by replacing Γ_ξ^{[k]}(ℓ) with Γ_{ξ,v}(ℓ, G) in the definitions of G^{[k]} and g^{[k]} over each interval I_v(G), and denote by Ĝ_v(G) (resp. ĝ_v(G)) its estimator.
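A sketch of how the factor adjustment in (11) and the local Yule-Walker matrices above can be assembled is given below. It reuses local_spec from the earlier sketch; the function name yw_local, the symmetric frequency grid and the inverse-Fourier normalisation are our illustrative choices, and the block conventions of G_v(G) and g_v(G) follow our reading of the Yule-Walker equation rather than a verified implementation.

```r
## Minimal sketch: factor adjustment via dynamic PCA as in (11), followed by
## assembly of the local Yule-Walker matrices G_v(G) and g_v(G) of Section 4.
## Treats the number of factors q and the VAR order d as known. Illustrative.
yw_local <- function(X, v, G, m, q, d) {
  p <- ncol(X)
  omegas <- pi * (-m:m) / m                       # symmetric frequency grid
  # factor-adjusted spectrum: remove the q leading dynamic principal components
  S_xi <- lapply(omegas, function(om) {
    S  <- local_spec(X, v, G, m, om)
    eg <- eigen(S, symmetric = TRUE)              # Hermitian: real eigenvalues
    E  <- eg$vectors[, 1:q, drop = FALSE]
    S - E %*% diag(eg$values[1:q], q) %*% Conj(t(E))
  })
  # inverse Fourier transform to autocovariances of xi_t
  gamma_xi <- function(l) {
    out <- matrix(0, p, p)
    for (j in seq_along(omegas))
      out <- out + Re(S_xi[[j]] * exp(1i * l * omegas[j]))
    2 * pi * out / (2 * m + 1)
  }
  G_v <- matrix(0, p * d, p * d); g_v <- matrix(0, p * d, p)
  for (l in 1:d) {
    g_v[((l - 1) * p + 1):(l * p), ] <- gamma_xi(l)
    for (lp in 1:d)
      G_v[((l - 1) * p + 1):(l * p), ((lp - 1) * p + 1):(lp * p)] <-
        gamma_xi(l - lp)
  }
  list(G = G_v, g = g_v)                          # (pd) x (pd) and (pd) x p
}
```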
Then, a change in the vicinity of v is signified if the two matrices Ĝ_v(G)β − ĝ_v(G) and Ĝ_{v+G}(G)β − ĝ_{v+G}(G), obtained from evaluating the Yule-Walker equation on the adjacent intervals I_v(G) and I_{v+G}(G), differ greatly in some matrix norm for a plug-in matrix β ∈ R^{(pd)×p}. This is immediately seen by considering the population counterpart

T*_{ξ,v}(β, G) = ||| (G_v(G)β − g_v(G)) − (G_{v+G}(G)β − g_{v+G}(G)) |||, (14)

where |||·||| denotes some matrix norm. Then T*_{ξ,v}(β, G), as a function of v, linearly increases then decreases with local maxima formed at v = θ_{ξ,k}, 1 ≤ k ≤ K_ξ, for any β satisfying the condition in (15), for an appropriately chosen G (in the sense made precise in Assumption 4.1 (ii) below). Given the high dimensionality of VAR modelling even for moderately large p, we consider an ℓ_1-regularised Yule-Walker estimator in place of β. At a given location t ∈ {G, ..., n}, we solve the constrained ℓ_1-minimisation problem

β̂_t(G) = arg min_{β∈R^{(pd)×p}} |β|_1 subject to |Ĝ_t(G)β − ĝ_t(G)|_∞ ≤ λ_{n,p}, (16)

with some tuning parameter λ_{n,p} > 0. The estimator in (16) modifies the Dantzig selector (Candes and Tao, 2007) proposed for high-dimensional linear regression, and has been considered by Han et al. (2015) in the context of stationary Gaussian VAR(1) modelling, and by Barigozzi et al. (2022) for factor-adjusted network analysis of high-dimensional time series. The ℓ_∞-constraint in (16) motivates the choice of |||·||| = |·|_∞, and we adopt the following statistic as a proxy of T*_{ξ,v}(β, G) for change point analysis of ξ_t:

T_{ξ,v}(β̂, G) = | (Ĝ_v(G)β̂ − ĝ_v(G)) − (Ĝ_{v+G}(G)β̂ − ĝ_{v+G}(G)) |_∞.

Note that the condition in (15), required for the detection of θ_{ξ,k}, calls for the set I_t(G) used in computing β̂ = β̂_t(G) to be fully contained in a single stretch of stationarity.

Algorithm 2: Change point detection in the idiosyncratic component
Input: Data {X_t}_{t=1}^n, change point estimators from the common component Θ̂_χ, λ_{n,p} for (16), bandwidth G, η ∈ (0, 1], threshold π_{n,p}
Step 0: Set Θ̂_ξ ← ∅ and v_• ← G.
Step 1: Compute β̂ = β̂_{v_•}(G) as in (16) and find the first θ̃ ≥ v_• with T_{ξ,θ̃}(β̂, G) > π_{n,p}; if no such θ̃ exists, stop.
Step 2: Find θ̂ = arg max_{θ̃≤v≤min(θ̃+G, n−G)} T_{ξ,v}(β̂, G) and update Θ̂_ξ ← Θ̂_ξ ∪ {θ̂}.
Step 3: Update v_• ← min(θ̃ + 2G, θ̂ + (η + 1)G).
Step 4: Repeat Steps 1-3 until v_• > n − G.
Output: Θ̂_ξ

Assuming that θ_{ξ,1} is at least a distance of G away from the boundaries, we can estimate β^{[0]} by β̂ = β̂_G(G), obtained as in (16) at t = G, which is well-suited to detect θ_{ξ,1}. Then, sequentially scanning the data using T_{ξ,v}(β̂, G) for v ≥ G, the event of T_{ξ,v}(β̂, G) exceeding some threshold, say π_{n,p}, at v = θ̃, signifies that a change has occurred in its neighbourhood, namely θ_{ξ,1} ∈ {θ̃, ..., θ̃ + G}. Reducing the scope of the search for θ_{ξ,1} to the above set, its location is estimated by the local maximiser θ̂_{ξ,1} = arg max_{θ̃≤v≤θ̃+G} T_{ξ,v}(β̂, G). Further operating under the assumption that the change points θ_{ξ,k} are sufficiently distanced from one another, we can use an interval of length G to the right of θ̂_{ξ,1} to obtain an estimator of β^{[1]} via (16), and detect and locate θ_{ξ,k}, k ≥ 2, one by one, by screening the statistic T_{ξ,v}(β̂, G) and updating β̂ until reaching the end of the data sequence. Algorithm 2 outlines the proposed methodology; we defer the discussion of tuning parameter selection, including a multiscale extension of Algorithm 2, to Appendix C.1-C.2.

In this section, we establish the theoretical consistency of Algorithm 2. For this, we set the tuning parameter for the ℓ_1-regularised Yule-Walker estimation problem in (16) as

λ_{n,p} = M (max_{0≤k≤K_ξ} ‖β^{[k]}‖_1 + 1) θ̌_{n,p}, (17)

where θ̌_{n,p} is defined in (12) and M > 0 denotes some constant. The choice of λ_{n,p} reflects the estimation error in the local autocovariance estimators of ξ_t obtained after factor adjustment. The following assumption imposes conditions on the size of the changes in the idiosyncratic component and the minimum spacing between the change points.

Assumption 4.1. (i) The size of the changes, measured by min_{1≤k≤K_ξ} |∆_{ξ,k}|_∞ with ∆_{ξ,k} quantifying the change in the Yule-Walker equations at θ_{ξ,k}, is sufficiently large relative to the estimation error θ̌_{n,p}. (ii) The bandwidth G fulfils (10), i.e. min_{0≤k≤K_ξ}(θ_{ξ,k+1} − θ_{ξ,k}) ≥ 2G.
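The constrained ℓ_1-minimisation in (16) is a linear program and can be solved column by column. The following sketch uses the lpSolve package with the standard positive/negative split of the coefficient vector; the function name dantzig_yw is ours and the inputs are the local matrices produced, for example, by the yw_local sketch above.

```r
## Minimal sketch of (16), solved column by column as a linear program:
## minimise |b|_1 subject to |G_hat b - g_hat|_infty <= lambda. Illustrative.
library(lpSolve)

dantzig_yw <- function(G_hat, g_hat, lambda) {
  # G_hat: (pd) x (pd), g_hat: (pd) x p, from the local Yule-Walker equation
  r   <- nrow(G_hat)
  obj <- rep(1, 2 * r)                     # |b|_1 with b = b_plus - b_minus
  A   <- rbind(cbind(G_hat, -G_hat),       #  G b <= g + lambda
               cbind(-G_hat, G_hat))       # -G b <= -g + lambda
  apply(g_hat, 2, function(g) {
    rhs <- c(g + lambda, -g + lambda)
    sol <- lp("min", obj, A, rep("<=", 2 * r), rhs)
    sol$solution[1:r] - sol$solution[(r + 1):(2 * r)]
  })                                       # returns the (pd) x p estimate
}
```

Since the p columns are decoupled, this step parallelises trivially; as noted in Appendix A, with large probability the program only needs to be solved K_ξ + 1 times over the whole sample.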
In Assumption 4.1 (i), the size of change is measured using the element-wise ℓ_∞-norm of ∆_{ξ,k}. In the related literature, some norm of β^{[k]} − β^{[k−1]}, after sparsity adjustment, is often employed for this purpose. From Assumption 2.3 (iii), a relationship between |∆_{ξ,k}|_∞ and the norm of β^{[k]} − β^{[k−1]} is readily derived (see (18)), showing the connection between the two different approaches to measuring the size of changes. To understand the size of G^{[k]}, see Horn and Johnson (1985), noting that min_{1≤i≤p} |γ_{ξ,ii}(0)| < ∞. Assumption 4.1 (ii) re-states Assumption 3.1 (ii) and enables us to sequentially identify an interval uncontaminated by any change point for the estimation of β^{[k]}, 1 ≤ k ≤ K_ξ, as performed in Step 1 of Algorithm 2.

Theorem 4.1. Suppose that Assumption 4.1 holds in addition to the assumptions made in Theorem 3.2. With λ_{n,p} chosen as in (17), we set the threshold π_{n,p} accordingly. Then, there exists a set M_{n,p}^ξ with P(M_{n,p}^ξ) → 1 as n, p → ∞, such that the claims (a) and (b) below hold on M_{n,p}^ξ for large enough n.

Theorem 4.1 (a) establishes that Algorithm 2 consistently detects all K_ξ change points within a distance of ϵ_0 G where, as noted in Remark 3.1 (ii), ϵ_0 can be made arbitrarily small as n, p → ∞ under Assumption 4.1 (i). Theorem 4.1 (b) shows that the localisation result can be further refined when the θ_{ξ,k} are sufficiently distanced from the change points in the common component. This is due to the sequential nature of the proposed methodology, where the second-stage change point analysis of ξ_t depends on the output from the first. If a change point in ξ_t, say θ_{ξ,k}, lies close to a change point in χ_t, say θ_{χ,k'}, the error in the autocovariance estimators of ξ_t due to the bias in θ̂_{χ,k'} prevents the application of the arguments involved in the refinement to such θ_{ξ,k}. The refined localisation rate is always tighter than G under Gaussianity due to Assumption 4.1 (i). It is of independent interest to consider the cases where (a) χ_t is stationary (i.e. Θ_χ = ∅), and (b) we directly observe the idiosyncratic component (i.e. χ_t = 0). Consistency of our Stage 2 methodology readily extends to such settings with further improved localisation rates; see Corollary A.1 in Appendix A, where we also compare the Stage 2 methodology to existing VAR segmentation methodologies.

We evaluate the performance of the proposed two-stage methodology, referred to as 'FVARseg' hereafter, on simulated datasets (Section 5.1) and in a financial application (Section 5.2). We consider both the case where χ_t ≠ 0 and that where χ_t = 0. For the former, we consider two models for χ_t with the number of factors fixed at q = 2. In the first model, referred to as (C1), χ_t admits a static factor model representation, while in the second model, (C2), it does not. See Appendix C.4 for full details of the two models for common component generation. It is observed that, empirically, the task of factor structure estimation is more challenging under (C2) (Forni et al., 2017; Barigozzi et al., 2022). For the idiosyncratic component, we consider a piecewise stationary VAR(d) process ξ_t with sparse transition matrices, setting the entries A_{ℓ,ii'}^{[k]} = 0 outside a randomly drawn sparse support and rescaling the matrices so that their size is controlled for 1 ≤ ℓ ≤ d via some β ∈ (0, 1], which determines the size of the changes. We refer to Table 1 for a summary of the 24 change point configurations considered in our simulation studies, with varying p, d, β, Θ_χ, Θ_ξ and the data generating process for χ_t; the results under (M1)-(M2) are given in the next section and those under (M3) are reported in Appendix A.3. We report the results from applying FVARseg, with tuning parameters selected as described in Appendix C.1-C.2, to the simulated datasets generated as described in Section 5.1.1.
To the best of our knowledge, there does not exist a methodology that comprehensively addresses the change point problem under (1). Therefore, in Table 2, we compare the Stage 1 methodology of FVARseg with that proposed in Barigozzi et al. (2018), referred to as BCF hereafter, on their performance at detecting changes in the common component. While BCF has a step for detecting change points in the second-order structure of the idiosyncratic component, it achieves this nonparametrically without making use of the parametric structure, which may lead to an unfair comparison. Instead, in Table A.2 in Appendix A.3, we compare the Stage 2 of FVARseg with a block-wise variant of Safikhani and Shojaie (2022) implemented in the R package VARDetect (Bai et al., 2021a), on settings where X_t follows a piecewise stationary VAR model (i.e. χ_t = 0, see (M3) of Table 1). Denoting by Θ̂ and Θ the sets of estimated and true change points, respectively, for either the common or the idiosyncratic component, we report the distributions of K̂ − K (with K̂ = |Θ̂| and K = |Θ|) as well as the Hausdorff distance d_H between Θ̂ and Θ (see (A.2) in Appendix A.3), averaged over 100 realisations. Overall, FVARseg achieves good performance in detecting the correct number of change points and locating them with accuracy, for both the common and the idiosyncratic components. Under the model (C1) that admits a static factor representation, FVARseg shows similar performance to BCF in detecting Θ_χ when the dimension is small (p = 50), but the latter tends to over-estimate the number of change points as p increases. Also, FVARseg outperforms the binary segmentation-based BCF in change point localisation, as evidenced by the reported d_H. BCF requires the number of global factors (say, r), including the ones that are attributed to the change points, as an input, and its performance is sensitive to the choice of r. Under (C1), we have r ≤ 3q(K_χ + 1) (and we supply this upper bound to BCF) while under (C2), χ_t^{[k]} does not admit a static factor representation and accordingly such an r does not exist (we set r = 2q for BCF). This is reflected in the results reported in Table 2, where BCF tends to under-estimate the number of change points. Generally, the task of detecting change points in the idiosyncratic VAR process is aggravated by the presence of change points in the common component, and the Stage 2 methodology performs better when K_χ = 0 compared to when K_χ ≥ 1, both in terms of detection and localisation accuracy; this agrees with the observations made in Corollary A.1 (a). Between the two scenarios (M1) and (M2), the latter poses a more challenging setting for the Stage 2 of FVARseg. This may be attributed to (i) the difficulty posed by the common component generating scenario (C2), which is observed to make the estimation of the idiosyncratic VAR process more difficult (see Section 5 of Barigozzi et al. (2022)), and (ii) the fact that Θ_χ = Θ_ξ; see the discussion below Theorem 4.1 (b).

We consider daily stock prices from p = 72 US blue chip companies across industry sectors between January 3, 2000 and February 16, 2022 (n = 5568 days), retrieved from the Wharton Research Data Services (https://wrds-www.wharton.upenn.edu); the list of companies and their corresponding sectors can be found in Appendix C.5. Following Diebold and Yılmaz (2014), we measure volatility using the daily high-low range as σ_{it}² = 0.361(p_{it}^{high} − p_{it}^{low})², where p_{it}^{high} and p_{it}^{low} denote, respectively, the maximum and the minimum log-price of stock i on day t, and set X_{it} = log(σ_{it}²).
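A one-function sketch of this panel construction is given below. The inputs high and low are hypothetical n × p matrices of daily high and low prices; the constant 0.361 is the range-based variance factor used by Diebold and Yılmaz (2014).

```r
## Minimal sketch of the volatility panel construction described above;
## `high` and `low` are hypothetical n x p matrices of daily high/low prices.
make_volatility_panel <- function(high, low) {
  sigma2 <- 0.361 * (log(high) - log(low))^2   # daily high-low range estimator
  log(sigma2)                                  # X_it = log(sigma^2_it)
}
```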
We apply the proposed FVARseg to detect change points in the panel of volatility measures {X_{it}, 1 ≤ i ≤ p, 1 ≤ t ≤ n}. Noting that each financial year consists of n_0 = 252 trading days, we apply Algorithm 1 with bandwidths G ∈ G_χ chosen as an equispaced sequence between [n_0/4] and 2n_0 of length 4, implicitly setting the minimum distance between two neighbouring change points to be three months. Based on the empirical sample size requirement for the Dantzig selector estimator (see Appendix C.1), we apply Algorithm 2 with G ∈ G_ξ chosen as an equispaced sequence between 2.5p and 2n_0 of length 4. The rest of the tuning parameters are selected as described in Appendix C.2. Algorithm 2 takes the VAR order d as an input and its choice in high dimensions is not a trivial problem; we set d = 5, which corresponds to the number of trading days in each week. See Table 3 for the segmentation results.

Figure 1, whose rows and columns are labelled by the company tickers listed in Appendix C.5, illustrates that the VAR transition matrices vary considerably from one segment to another, using the three consecutive segments covering the Great Financial Crisis; this in turn shows time-varying Granger causal linkages between the p companies. To further validate the change point estimators, we perform a forecasting exercise. We compare two approaches, referred to as (F1) and (F2) below, to producing a forecast of X_t using a sub-sample of {X_u, u ≤ t − 1}, for t ∈ T ⊂ {1, ..., n}. Simply put, (F1) uses only the observations belonging to the same segment as t for constructing the forecast of χ_t (resp. ξ_t), according to the segmentation defined by Θ̂_χ (resp. Θ̂_ξ), while (F2) ignores the presence of the most recent change point estimator. We expect (F1) to give more accurate predictions if the change points reported in Table 3 are valid; if some of the change point estimators are spurious, (F2) is expected to produce better forecasts since it makes use of more observations. More precisely, recall the definition of L̂_χ(v) = max{0 ≤ k ≤ K̂_χ : θ̂_{χ,k} + 1 ≤ v} and define L̂_ξ(v) analogously. We select T such that each t ∈ T fulfils (i) min(L̂_χ(t), L̂_ξ(t)) ≥ 2 and (ii) min(t − θ̂_{χ,L̂_χ(t)}, t − θ̂_{ξ,L̂_ξ(t)}) ≥ n_0, which ensures that there are at least n_0 observations to produce a forecast; we have |T| = 1600. For each t ∈ T, we obtain X̂_t(N) = χ̂_t(N_1) + ξ̂_t(N_2) for some N = (N_1, N_2)', where χ̂_t(N_1) denotes an estimator of the best linear predictor of χ_t given X_{t−ℓ}, 1 ≤ ℓ ≤ N_1, and ξ̂_t(N_2) is defined analogously, using the following two approaches. Barigozzi et al. (2022) propose two methods for estimating the best linear predictors of χ_t and ξ_t under a stationary factor-adjusted VAR model, one based on a more restrictive assumption on the factor structure ('restricted') than the other ('unrestricted'); we refer to the paper for their detailed descriptions. Both estimators are combined with the two approaches (F1) and (F2). We consider two measures of forecasting errors; see Table 4 for the results.
According to all evaluation criteria, (F1) produces forecasts that are more accurate than (F2) regardless of the choice of the forecasting method, which supports the validity of the change point estimators returned by our proposed methodology.

A Further discussions on the methodology for idiosyncratic VAR process segmentation

We consider the performance of the Stage 2 methodology when applied to the special cases under the model (1) where (a) χ_t is stationary (i.e. Θ_χ = ∅) and (b) we directly observe X_t = ξ_t (i.e. χ_t = 0). Note that Theorem 3.1 indicates that in both cases, the Stage 1 methodology returns Θ̂_χ = ∅. The results reported in Theorem 4.1 readily extend to such settings.

Corollary A.1. Suppose that the assumptions of Theorem 4.1 hold, including Assumption 4.1 (i), with λ_{n,p} specified below. Then, with the localisation rates ϵ_{n,p}^{[k]} defined as in Theorem 4.1, there exist a set M_{n,p}^ξ with P(M_{n,p}^ξ) → 1 as n, p → ∞ and constants ϵ_0, c_0 > 0 such that, on M_{n,p}^ξ, we have |θ̂_{ξ,k} − θ_{ξ,k}| ≤ c_0 ϵ_{n,p}^{[k]} for all 1 ≤ k ≤ K_ξ for n large enough, in the following situations. (a) There is no change point in the common component, i.e. K_χ = 0, and we set λ_{n,p} accordingly. (b) The common component does not exist, i.e. X_t = ξ_t for all t, and we set λ_{n,p} = M(max_{0≤k≤K_ξ} ‖β^{[k]}‖_1 + 1) θ̄_{n,p}, with θ̄_{n,p} denoting the correspondingly reduced estimation error rate.

When compared to methods dedicated to the setting corresponding to Corollary A.1 (b), our Stage 2 methodology achieves competitive theoretical performance in terms of the detection rate required on the size of changes for their detection, and the rate of localisation achieved. We provide a comprehensive comparison of the Stage 2 methodology with the existing VAR segmentation methods in the next section, both on their theoretical and computational properties.

There are a few methods proposed for time series segmentation under piecewise stationary, Gaussian VAR models, a setting that corresponds to Corollary A.1 (b) under Gaussianity. In this setting, we compare the Stage 2 methodology outlined in Algorithm 2 with those proposed by Wang et al. (2019) and Safikhani and Shojaie (2022). Table A.1 summarises the comparative study in terms of their theoretical (separation and localisation rates) and computational (complexity) properties. Denoting by ∆̃_k the size of change between the (k − 1)-th and the k-th segments (measured differently for different methods), the separation rate refers to some ν_{n,p} → ∞ such that, if ν_{n,p}^{−1} min_k ∆̃_k² · min_k(θ_{ξ,k+1} − θ_{ξ,k}) → ∞, the corresponding method correctly detects all K_ξ change points; for Stage 2, we set ∆̃_k = |∆_{ξ,k}|_∞ and for the others, ∆̃_k is defined through a sparsity-adjusted norm of β^{[k]} − β^{[k−1]}. The localisation rate refers to some ϱ_{n,p} → ∞ satisfying max_{1≤k≤K_ξ} w_k² |θ̂_k − θ_{ξ,k}| = O_P(ϱ_{n,p}) for the estimators θ̂_k returned by the respective methods. For Algorithm 2, the weights reflect the difficulty associated with locating individual change points, i.e. w_k = ∆̃_k, while for Wang et al. (2019) and Safikhani and Shojaie (2022), the weights are global with w_k = min_k ∆̃_k and w_k = 1, respectively. Safikhani and Shojaie (2022) further assume that min_k |β^{[k]} − β^{[k−1]}|_2 is bounded away from zero. We suppose that max_k ‖β^{[k]}‖_1 = O(1), a sufficient condition for the stability of the segment-specific VAR processes (Basu and Michailidis, 2015, Proposition 2.2), which is required by all the methods in consideration for their theoretical analysis. Although an immediate comparison of the theoretical results is difficult due to the different definitions of ∆̃_k, the observation in (18) implies that the ∆̃_k defined for different methods are comparable.
Then, Table A.1 shows that the proposed Stage 2 methodology is theoretically competitive in terms of the dependence of the separation rate on K_ξ (note that it does not depend on K_ξ) and the localisation rate. In Algorithm 2, the ℓ_1-regularised Yule-Walker estimation problem in (16) can be solved in parallel and, further, it needs to be performed only K_ξ + 1 times with large probability, which makes it more attractive compared to the dynamic programming methodology of Wang et al. (2019), which requires the Lasso estimation to be performed O(n²) times, or the multi-stage procedure of Safikhani and Shojaie (2022), which solves a fused Lasso problem of dimension np²d to obtain pre-estimators of the change points and then exhaustively searches for the final set of estimators, which can be NP-hard in the worst case. In Appendix A.3, we compare the Stage 2 methodology with a block-wise modification of Safikhani and Shojaie (2022) that is implemented in the R package VARDetect (Bai et al., 2021a). Finally, we note that there are methods developed under piecewise stationary extensions of the low-rank plus sparse VAR(1) model proposed in Basu et al. (2019); see Bai et al. (2020, 2021b). While they additionally permit a low-rank structure in the transition matrices, the spectrum of X_t is assumed to be uniformly bounded, which rules out pervasive (serial) correlations in the data and thus is distinguished from the factor-adjusted VAR model considered here.

We compare the Stage 2 methodology of FVARseg (with all tuning parameters selected as described in Appendix C.1-C.2) with a block-wise variation of the methodology proposed by Safikhani and Shojaie (2022), which is implemented in the R package VARDetect (Bai et al., 2021a), on settings corresponding to (M3) in Table 1. Denoting by Θ̂ and Θ the sets of estimated and true change points, respectively, for either the common or the idiosyncratic component, we report the distributions of K̂ − K (with K̂ = |Θ̂| and K = |Θ|) as well as the Hausdorff distance between Θ̂ and Θ, averaged over 100 realisations. In addition to the measures reported in Table 2, Table A.2 reports the average computation time (in seconds). We observe that the Stage 2 method of FVARseg outperforms VARDetect in all criteria considered, such as detection and localisation accuracy as well as computational time, particularly as p increases. In particular, VARDetect struggles to detect any change point when the change is weak (recall that β = 0.6 is used when d = 1, which makes the size of change at θ_{ξ,2} small) or when d = 2. Compared to Table 2, the localisation performance of the Stage 2 method improves in the absence of the common component, even though the size of changes under (M3) is on average smaller than that under (M1)-(M2), which confirms the theoretical findings reported in Corollary A.1 (b). Although not reported here, when both stages of FVARseg are applied to the data generated under (M3), no spurious change point estimators are detected in the common component.

B Generalised dynamic factor model

B.1 Generality of GDFM

Forni and Lippi (2001) show that the necessary and sufficient condition for any p-dimensional, weakly stationary time series to admit the GDFM representation is to have a finite number of the eigenvalues of its spectral density matrix diverge with p (as in Assumption 2.2) while the remaining ones stay bounded for all p.
In other words, the GDFM itself (without the VAR model imposed on ξ_t as in this paper) can be regarded as a representation of high-dimensional time series rather than a model. Therefore, arguably, the GDFM provides the most general framework for high-dimensional time series factor modelling, and it encompasses other factor models found in the literature such as static factor models (Forni et al., 2009). Static factor models are popularly adopted in both stationary (Stock and Watson, 2002; Bai, 2003; Fan et al., 2013; Barigozzi and Cho, 2020) and piecewise stationary (Barigozzi et al., 2018; Li et al., 2019b) time series modelling in high dimensions. Under stationary factor models, the common component permits a representation χ_{it} = λ_i' f_t with some finite-dimensional vector process f_t ∈ R^r as the common factors; here, 'static' refers to the fact that χ_{it} loads f_t contemporaneously, and does not preclude serial dependence therein. The model in (2) includes such a static factor model as a special case (see Remark R of Forni et al. (2009)). On the other hand, some models that have a finite number of factors under (2) cannot be represented with f_t of finite dimension, the simplest example being the case where χ_{it} = a_i(1 − b_i L)^{−1} u_t for some b_i ∈ (−1, 1) (Forni et al., 2015); see also (C2) in Section 5.1.1. Hallin et al. (2018) observe that principal component analysis (PCA), typically accompanying static factor models as an estimation tool, does not enjoy, in the presence of serial correlations, the optimality property behind its success in the i.i.d. case, unlike the dynamic PCA adopted for estimation under the GDFM (see Section 3.3).

Suppose that each filter B_{ij}^{[k]}(L) = Σ_ℓ B_{ℓ,ij}^{[k]} L^ℓ is a ratio of finite-order polynomials in L, i.e. for some finite s_1, s_2 ∈ N, B_{ij}^{[k]}(L) = B_{ij}^{[k,1]}(L)/B_{ij}^{[k,2]}(L) with B_{ij}^{[k,l]}(L) = Σ_{ℓ=0}^{s_l} B_{ℓ,ij}^{[k,l]} L^ℓ, l = 1, 2, for all 1 ≤ i ≤ p, 1 ≤ j ≤ q and 0 ≤ k ≤ K_χ. Furthermore, assume the following: (a) there exists M_χ > 0 uniformly bounding the coefficients; and (b) B_{ij}^{[k,2]}(z) ≠ 0 for all |z| ≤ 1. Under such assumptions, Section 4 of Forni et al. (2015) establishes that, for generic values of the parameters B_{ℓ,ij}^{[k,1]} and B_{ℓ,ij}^{[k,2]} (outside a countable union of nowhere dense subsets), χ_t^{[k]} admits the block-wise singular VAR representation denoted by A_χ^{[k,h]}(L) in (B.1), with its degree s ≤ q_k s_1 + q_k² s_2, and det(A_χ^{[k,h]}(z)) ≠ 0 for all |z| ≤ 1. The representation (B.1) gives the piecewise stationary factor-adjusted VAR model in (1) the interpretation of decomposing high-dimensional time series into two latent VAR processes with time-varying transition matrices, one of low rank (singular) accounting for dominant dependence, and the other modelling individual interdependence between the variables unaccounted for by the former.

The selection of the bandwidth G requires balancing two opposing objectives. On the one hand, a larger value of G gives more information to detect each change point (Assumptions 3.1 (i) and 4.1 (i)). However, if G is too large, we may have windows that contain two or more changes when applying Algorithms 1-2, which violates Assumptions 3.1 (ii) and 4.1 (ii). When the changes are multiscale (a mixture of large changes over short intervals and smaller changes over long intervals), there may not exist a single bandwidth fulfilling the above requirements simultaneously. To remedy this, we propose a multiscale approach whereby we apply Algorithms 1-2 with a range of bandwidths and merge the outputs across the multiple bandwidths using a 'bottom-up' approach that has popularly been adopted with moving window-based data segmentation procedures (Messer et al., 2014; Meier et al., 2020).
Let Θ̂(G) denote the output from Algorithm 1 or 2 with a bandwidth G. Then, given a set of bandwidths G = {G_h, 1 ≤ h ≤ H : G_1 < ... < G_H}, we proceed as follows: starting with h = 1 and Θ̂ = ∅, we accept θ̂ ∈ Θ̂(G_h) into Θ̂ if and only if min_{θ̃∈Θ̂} |θ̂ − θ̃| ≥ ηG_h for some η ∈ (0, 1], and update h ← h + 1 until all Θ̂(G), G ∈ G, are considered; a code sketch of this merge step follows at the end of this section. In simulation studies, we use G_χ = {[n/10], [n/8], [n/6], [n/4]} for Algorithm 1, and for Algorithm 2, G_ξ is an equispaced sequence between [2.5p] and [n/4] of length 4. The choice of G_ξ is motivated by the simulation results of Barigozzi et al. (2022) under stationarity, where the Dantzig selector estimator in (16) was observed to perform well when the sample size exceeds 2p.

We select the tuning parameters of Algorithms 1-2 motivated by the theoretical results. Theorem 3.1 indicates that the choice of the threshold κ_{n,p} in Algorithm 1 involves that of a constant M, which in turn depends on the constants related to the (time-varying) dependence adjusted norms of X_t; see the proof of Lemma D.7. Since such quantities are not accessible or are difficult to estimate in practice, we propose to 'standardise' T_{χ,v}(ω_l, G) using the statistic evaluated at v = G. Denoting such a standardised test statistic by T°_{χ,v}(ω_l, G) = (T_{χ,G}(ω_l, G))^{−1} T_{χ,v}(ω_l, G), we simulate B = 100 time series following (1) with K_χ = K_ξ = 0 using the models considered in Section 5.1.1, and record max_{G≤v≤n−G} T°_{χ,v}(ω_l, G). Repeating the same procedure for varying combinations of (n, p, G), we fit a regression model to the 100(1 − τ)-th percentile of log(max_v T°_{χ,v}(ω_l, G)) with log log(n) and log(G) as regressors, and use the fitted model to derive a threshold for given n and G. All model terms are deemed highly significant according to t-tests, and the model provides a good fit with R²_adj = 0.9651. In the post-segmentation factor adjustment step, we select q_k, the number of factors within each stationary segment, using the information criterion-based approach of Hallin and Liška (2007). For the selection of the threshold π_{n,p} in Algorithm 2, we adopt an approach similar to that taken for the selection of κ_{n,p}, except that we standardise T_{ξ,v}(β̂, G) analogously and additionally consider log log(p) as a regressor. The resultant model has all its terms highly significant according to t-tests, with R²_adj = 0.985. When Algorithm 2 is used solely to detect change points in observed VAR processes, a smaller threshold is recommended (as indicated by Corollary A.1), and we observe that π_{n,p} = 1 works well after the proposed standardisation. The constraint parameter λ_{n,p} for the Dantzig selector problem (16) can be chosen from a grid via cross-validation; we refer to Section 5.1 of Barigozzi et al. (2022) for its description. Selection of the VAR order d in high dimensions is non-trivial and investigation of this problem is beyond the scope of this paper, since e.g. the information criteria developed for fixed-dimensional time series (Lütkepohl, 2005) are not valid as p → ∞. In our simulations (Section 5.1), we regard d as known, while in the real data analysis we consider d ∈ {1, 2, 3, 4, 5}, motivated by the interpretation of d = 5 as the number of trading days per week in the particular application. Throughout the numerical experiments, we set m = max(1, [G^{1/3}]) as the lag window size at a given G, and η = 0.5 for both Algorithms 1-2.
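The following sketch implements the bottom-up merge across bandwidths described at the start of this appendix. The function name merge_bottom_up and the input format are ours; it assumes the per-bandwidth outputs Θ̂(G_h) are supplied as a list of integer vectors.

```r
## Minimal sketch of the bottom-up merge: estimators from the smallest
## bandwidth are accepted first; an estimator from a larger bandwidth is kept
## only if it is at least eta * G_h away from every already accepted one.
merge_bottom_up <- function(est_by_G, bandwidths, eta = 0.5) {
  ord <- order(bandwidths)                 # process G_1 < ... < G_H in order
  accepted <- integer(0)
  for (h in ord) {
    for (theta in est_by_G[[h]]) {
      if (length(accepted) == 0 ||
          min(abs(theta - accepted)) >= eta * bandwidths[h])
        accepted <- c(accepted, theta)
    }
  }
  sort(accepted)
}

## usage: one vector of estimators per bandwidth, e.g.
# merge_bottom_up(list(c(200), c(198, 410)), bandwidths = c(60, 120))
```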
The computational bottleneck of the combined two-stage methodology is the computation of the test statistic T_{χ,v}(ω_l, G) in Algorithm 1, which involves the singular value decomposition (SVD) of a p × p matrix at multiple frequencies ω_l, 0 ≤ l ≤ m, and over time points G ≤ v ≤ n − G. To facilitate the computation, we propose to perform sub-sampling of T_{χ,v}(ω_l, G) at v = G, G + b_n, G + 2b_n, ..., with b_n = [2 log(n)]. This may incur an additional bias of at most b_n/2 ≤ log(n) in the change point location estimators, which is asymptotically ignorable in view of Theorem 3.1, but reduces the number of SVDs to be performed by a factor of b_n.

We provide full details on how the common component is generated for the numerical experiments reported in Section 5.1. Under (C1), χ_t^{[k]} admits a static factor model representation, as χ_{it}^{[k]} = Σ_{j=1}^q Σ_{ℓ=0}^2 B_{ℓ,ij}^{[k]} u_{j,t−ℓ}, where u_{jt} ∼ iid N(0, σ_j²) with (σ_1, σ_2) = (1, 0.5), and the MA coefficients are generated as (B_{0,ij}^{[k]}, B_{1,ij}^{[k]}, B_{2,ij}^{[k]}) ∼ iid N_3(0, I) for all 1 ≤ i ≤ p and 1 ≤ j ≤ q when k = 0. Then, sequentially for k = 1, ..., K_χ, we draw Π^{[k]}, with which the coefficients are updated at each change point. Under (C2), χ_t^{[k]} does not admit a static factor model representation, as χ_{it}^{[k]} = Σ_{j=1}^q a_{ij}(1 − α_{ij}^{[k]} L)^{−1} u_{jt}, where u_{jt} ∼ iid N(0, 1) and the coefficients a_{ij} are drawn uniformly as a_{ij} ∼ iid U[−1, 1], with U[a, b] denoting a uniform distribution. The AR coefficients are generated as α_{ij}^{[k]} ∼ iid U[−0.8, 0.8] when k = 0, and then sequentially for k = 1, ..., K_χ, we draw Π^{[k]} with which they are updated at each change point.

C.5 Information on the real dataset

Table C.1 provides the list of the 72 companies included in the application presented in Section 5.2, along with their tickers and industry classifications.

In the following lemmas, we operate under Assumptions 2.1, 2.2, 2.3 and 2.4. Recall the definitions of L_χ(v) and L_ξ(v), and define Σ_{ξ,v}(ω, G) analogously to Σ_{χ,v}(ω, G) defined in (7). Then, the local spectral density matrix of X_t is defined as Σ_{x,v}(ω, G) = Σ_{χ,v}(ω, G) + Σ_{ξ,v}(ω, G); we define the local autocovariance matrix of χ_t as Γ_{χ,v}(ℓ, G), analogously define that of ξ_t as Γ_{ξ,v}(ℓ, G), and let Γ_{x,v}(ℓ, G) = Γ_{χ,v}(ℓ, G) + Γ_{ξ,v}(ℓ, G). Zhang and Wu (2021) extend the functional dependence measure introduced in Wu (2005) to high-dimensional, locally stationary time series. Denote by G_χ^{[k]}(·) = (g_{χ,1}^{[k]}(·), ..., g_{χ,p}^{[k]}(·))' and G_ξ^{[k]}(·) = (g_{ξ,1}^{[k]}(·), ..., g_{ξ,p}^{[k]}(·))' the R^p-valued measurable functions such that χ_t^{[k]} = G_χ^{[k]}(F_t) and ξ_t^{[k]} = G_ξ^{[k]}(F_t), where F_t = {..., (u_{t−1}, ε_{t−1}), (u_t, ε_t)}, and let F_t' = {..., (u_0', ε_0'), ..., (u_t, ε_t)} denote a coupled version of F_t with an independent copy (u_0', ε_0') replacing (u_0, ε_0). Then, the element-wise functional dependence measure δ_{ℓ,ν,i}, the uniform functional dependence measure δ_{ℓ,ν}, the dependence adjusted norms ‖X_{i·}‖_{ν,α} = sup_{ℓ≥0} (ℓ + 1)^α Σ_{t=ℓ}^∞ δ_{t,ν,i} and ‖|X_·|_∞‖_{ν,α} = sup_{ℓ≥0} (ℓ + 1)^α Σ_{t=ℓ}^∞ δ_{t,ν}, and the overall and the uniform dependence adjusted norms Ψ_{ν,α} and Φ_{ν,α} are defined accordingly.

Lemma D.1. (a) Under Assumption 2.4 (ii) (a), we have Ψ_{ν,α} ≤ C_{ν,Ξ,ς} M_ε^{1/2} p^{2/ν} µ_ν^{1/ν} and ‖|X_·|_∞‖_{ν,α} ≤ C_{ν,Ξ,ς} M_ε^{1/2} log^{1/2}(p) p^{1/ν} µ_ν^{1/ν} for some constant C_{ν,Ξ,ς} > 0 depending only on its subscripts (varying from one occasion to another). (b) Under Assumption 2.4 (ii) (a)-(b), we have Φ_{ν,α} ≤ C_{ν,Ξ,ς} M_ε^{1/2} µ_ν^{1/ν} for any ν for which ‖u_{jt}‖_ν and ‖ε_{it}‖_ν exist.

Proof. By the Minkowski inequality, the functional dependence measure of X_t is bounded by the sum of those of χ_t and ξ_t for all t. Due to the independence of u_{jt}, Assumption 2.1 and Lemma D.3 of Zhang and Wu (2021), there exists C_ν > 0, depending only on ν, bounding the contribution from χ_t. Similarly, from Assumption 2.3 and the independence of ε_{it}, we bound the contribution from ξ_t. Then, setting α ≤ ς − 1, we obtain Φ_{ν,α} ≤ C_{ν,Ξ,ς} M_ε^{1/2} µ_ν^{1/ν}, Ψ_{ν,α} ≤ C_{ν,Ξ,ς} M_ε^{1/2} p^{2/ν} µ_ν^{1/ν} and ‖|X_·|_∞‖_{ν,α} ≤ C_{ν,Ξ,ς} M_ε^{1/2} log^{1/2}(p) p^{1/ν} µ_ν^{1/ν}.

Lemma D.2.
We provide full details of how the common component is generated in the numerical experiments reported in Section 5.1. Two models are considered.

(C1) $\chi_t$ admits a static factor model representation: it is generated as a vector moving average process driven by $u_{jt} \sim_{\mathrm{iid}} \mathcal{N}(0, \sigma_j^2)$ with $(\sigma_1, \sigma_2) = (1, 0.5)$, and the MA coefficients are generated as $(B^{[k]}_{0,ij}, B^{[k]}_{1,ij}, B^{[k]}_{2,ij}) \sim_{\mathrm{iid}} \mathcal{N}_3(0, \mathrm{I})$ for all $1 \le i \le p$ and $1 \le j \le q$ when $k = 0$. Then, sequentially for $k = 1, \ldots, K_\chi$, the coefficients for the $k$-th segment are drawn in an analogous manner.

(C2) $\chi_t$ does not admit a static factor model representation: it is generated by applying AR filters to common shocks $u_{jt} \sim_{\mathrm{iid}} \mathcal{N}(0, 1)$, with the loadings drawn as $a_{ij} \sim_{\mathrm{iid}} \mathcal{U}[-1, 1]$, where $\mathcal{U}[a, b]$ denotes the uniform distribution on $[a, b]$. The AR coefficients are generated as $\alpha^{[k]}_{ij} \sim_{\mathrm{iid}} \mathcal{U}[-0.8, 0.8]$ when $k = 0$ and then, sequentially for $k = 1, \ldots, K_\chi$, they are drawn analogously for each segment.

C.5 Information on the real dataset

Table C.1 provides the list of the 72 companies included in the application presented in Section 5.2, along with their tickers and industry classifications.

In the following lemmas, we operate under Assumptions 2.1, 2.2, 2.3 and 2.4. Recall the definitions of $L_\chi(v)$ and $L_\xi(v)$, and define $\Sigma_{\xi,v}(\omega, G)$ analogously to $\Sigma_{\chi,v}(\omega, G)$ defined in (7); the local spectral density matrix of $X_t$, $\Sigma_{x,v}(\omega, G)$, is defined accordingly. We further define the local autocovariance matrix of $\chi_t$, denoted by $\Gamma_{\chi,v}(\ell, G)$, analogously define that of $\xi_t$, denoted by $\Gamma_{\xi,v}(\ell, G)$, and let $\Gamma_{x,v}(\ell, G) = \Gamma_{\chi,v}(\ell, G) + \Gamma_{\xi,v}(\ell, G)$.

Zhang and Wu (2021) extend the functional dependence measure introduced in Wu (2005) to high-dimensional, locally stationary time series. Denote by $G^{[k]}_\chi(\cdot) = (g^{[k]}_{\chi,1}(\cdot), \ldots, g^{[k]}_{\chi,p}(\cdot))^\top$ and $G^{[k]}_\xi(\cdot) = (g^{[k]}_{\xi,1}(\cdot), \ldots, g^{[k]}_{\xi,p}(\cdot))^\top$ the $\mathbb{R}^p$-valued measurable functions generating the two components from $\mathcal{F}_t = \{(u_1, \varepsilon_1), \ldots, (u_t, \varepsilon_t)\}$, and let $\mathcal{F}_t'$ denote a coupled version of $\mathcal{F}_t$ with an independent copy $(u_0', \varepsilon_0')$ replacing $(u_0, \varepsilon_0)$. Then, the element-wise functional dependence measure $\delta_{t,\nu,i}$ and the uniform functional dependence measure $\delta_{t,\nu}$ are defined by contrasting the process with its coupled counterpart, the dependence adjusted norms as $\|X_{i\cdot}\|_{\nu,\alpha} = \sup_{\ell \ge 0} (\ell+1)^\alpha \sum_{t=\ell}^\infty \delta_{t,\nu,i}$ and $\||X_\cdot|_\infty\|_{\nu,\alpha} = \sup_{\ell \ge 0} (\ell+1)^\alpha \sum_{t=\ell}^\infty \delta_{t,\nu}$, and the overall and the uniform dependence adjusted norms as $\Phi_{\nu,\alpha}$ and $\Psi_{\nu,\alpha}$, respectively.

Lemma D.1. (a) Under Assumption 2.4 (ii) (a), we have $\Psi_{\nu,\alpha} \le C_{\nu,\Xi,\varsigma} M_\varepsilon^{1/2} p^{2/\nu} \mu_\nu^{1/\nu}$ and $\||X_\cdot|_\infty\|_{\nu,\alpha} \le C_{\nu,\Xi,\varsigma} M_\varepsilon^{1/2} \log^{1/2}(p)\, p^{1/\nu} \mu_\nu^{1/\nu}$ for some constant $C_{\nu,\Xi,\varsigma} > 0$ depending only on its subscripts (which may vary from one occasion to another). (b) Under Assumption 2.4 (ii) (a)-(b), we have $\Phi_{\nu,\alpha} \le C_{\nu,\Xi,\varsigma} M_\varepsilon^{1/2} \mu_\nu^{1/\nu}$ for any $\nu$ for which $\|u_{jt}\|_\nu$ and $\|\varepsilon_{it}\|_\nu$ exist.

Proof. By the Minkowski inequality, the functional dependence measure of $X_t$ is bounded by the sum of those of $\chi_t$ and $\xi_t$ for all $t$. Due to the independence of $u_{jt}$, Assumption 2.1 and Lemma D.3 of Zhang and Wu (2021), there exists $C_\nu > 0$ depending only on $\nu$ that bounds the contribution of $\chi_t$. Similarly, from Assumption 2.3 and the independence of $\varepsilon_{it}$, the contribution of $\xi_t$ is bounded. Combining the element-wise bounds and setting $\alpha \le \varsigma - 1$, we obtain $\Phi_{\nu,\alpha} \le C_{\nu,\Xi,\varsigma} M_\varepsilon^{1/2} \mu_\nu^{1/\nu}$, $\Psi_{\nu,\alpha} \le C_{\nu,\Xi,\varsigma} M_\varepsilon^{1/2} p^{2/\nu} \mu_\nu^{1/\nu}$ and $\||X_\cdot|_\infty\|_{\nu,\alpha} \le C_{\nu,\Xi,\varsigma} M_\varepsilon^{1/2} \log^{1/2}(p)\, p^{1/\nu} \mu_\nu^{1/\nu}$.

Lemma D.2. There exist constants $C_{\Xi,\varsigma}, C_{\Xi,\varsigma,\varepsilon} > 0$, depending only on $\Xi$, $\varsigma$ and $M_\varepsilon$ defined in Assumptions 2.1 and 2.3, such that the lag-$h$ autocovariances of $\chi_t$ and $\xi_t$ are uniformly bounded for all $h \ge 0$, at a rate governed by $\varsigma$.

Proof. Suppose that $L_\chi(t - h) = k$ and $L_\chi(t) = l$. Then, for any $h \ge 0$, the claimed bound on the autocovariances of $\chi_t$ holds uniformly in $1 \le i, i' \le p$ and $t$ for some $C_{\Xi,\varsigma} > 0$ depending only on $\Xi$ and $\varsigma$, thanks to Assumption 2.1. Similarly, assuming that $L_\xi(t - h) = k$ and $L_\xi(t) = l$, the corresponding bound on the autocovariances of $\xi_t$ holds uniformly in $1 \le i, i' \le p$ and $t$ for some $C_{\Xi,\varsigma,\varepsilon} > 0$ depending only on $\Xi$, $\varsigma$ and $M_\varepsilon$, from Assumption 2.3 (ii) and (iv).

The following lemma is a direct consequence of Lemma D.2 and Assumption 2.4 (i). We write $\Sigma^{[k]}_\chi(\omega) = [\sigma^{[k]}_{\chi,ii'}(\omega),\, 1 \le i, i' \le p]$ and $\Sigma^{[k]}_\xi(\omega) = [\sigma^{[k]}_{\xi,ii'}(\omega),\, 1 \le i, i' \le p]$ to denote the elements of the spectral density matrices, and adopt analogous notations for the elements of the autocovariance matrices.

Proof. By Lemma D.3, we can find $B_\sigma > 0$, depending only on $\Xi$, $\varsigma$ and $M_\varepsilon$ defined in Assumptions 2.1 and 2.3, that uniformly bounds the elements of the spectral density matrices. Moreover, $\sigma^{[k]}_{\chi,ii'}(\omega) = (2\pi)^{-1} \sum_{\ell \in \mathbb{Z}} \gamma^{[k]}_{\chi,ii'}(\ell) e^{-\iota \ell \omega}$ possesses derivatives of all orders and, for some constant $C_{\Xi,\varsigma} > 0$ not depending on $1 \le i, i' \le p$, $0 \le k \le K_\chi$ or $\omega \in [-\pi, \pi]$, its derivative is uniformly bounded, which entails the bounded variation of $\sigma^{[k]}_{\chi,ii'}(\omega)$. Also, under Assumption 2.3, we can find a constant $M_\xi > 0$, depending only on $M_\varepsilon$, $\Xi$ and $\varsigma$, such that the stated bound holds uniformly over $\omega \in [-\pi, \pi]$ and $0 \le k \le K_\xi$.

D.3 Proof of Theorem 3.1

Proposition D.6. Under the assumptions made in Theorem 3.1, we have $P(\mathcal{E}^{(1)}_{n,p}) \to 1$ as $n, p \to \infty$.

Proof. The claim follows from Lemma D.7.

For ease of notation, define $\widetilde{\widehat\Sigma}_{x,v}(\omega, G) = \widehat\Sigma_{x,v}(\omega, G) - \widehat\Sigma_{x,v+G}(\omega, G)$, and analogously define $\widetilde\Sigma_{x,v}(\omega, G)$, $\widetilde\Sigma_{\chi,v}(\omega, G)$ and $\widetilde\Sigma_{\xi,v}(\omega, G)$. By definition and Assumption 3.1 (ii), $\widetilde\Sigma_{\chi,\theta_k}(\omega, G) = \Delta_{\chi,k}(\omega)$. Also, let $\omega(v) = \arg\max_{0 \le l \le m} T_{\chi,v}(\omega_l, G)$ and $\omega^{\circ[k]} = \arg\max_{0 \le l \le m} \|\Delta_{\chi,k}(\omega_l)\|$. Then, due to the Lipschitz continuity of $p^{-r_{k,1}} \|\Delta_{\chi,k}(\omega)\|$ (see Assumption 3.1 (i)), the grid maximum is within a factor $1 - \epsilon$ of the supremum for some small enough constant $\epsilon \in (0, 1)$.

In what follows, we omit the subscript $\chi$ from $\theta_{\chi,k}$ and $\widehat\theta_{\chi,k}$ for simplicity and, throughout the proof, we operate on the set $\mathcal{M}^\chi_{n,p} = \mathcal{E}^{(1)}_{n,p} \cap \bar{\mathcal{E}}^{(1)}_{n,p}$, where $\mathcal{E}^{(1)}_{n,p}$ is defined with $M$ as in Theorem 3.1, and $\bar{\mathcal{E}}^{(1)}_{n,p}$ is defined in (D.7) below. By Proposition D.6, we have $P(\mathcal{E}^{(1)}_{n,p}) \to 1$ as $n, p \to \infty$ and, similarly, $P(\bar{\mathcal{E}}^{(1)}_{n,p}) \to 1$ by Lemma D.8, such that $P(\mathcal{M}^\chi_{n,p}) \to 1$.

By construction, each element of $\widehat\Theta_\chi$ is a local maximiser of $T_{\chi,v}(\omega(v), G)$ within its $\eta G$-radius. This, combined with how $\mathcal{I}$ is updated at each iteration, ensures that only a single estimator is added to $\widehat\Theta_\chi$ for each change point. From (D.2), (D.3) and Assumption 3.1 (i), for small enough $\epsilon > 0$, the term $I$ is suitably bounded. Also, by (D.4), for all $1 \le k \le K_\chi$ under Assumption 3.1 (i), the term $III$ is bounded, and $II$ is bounded analogously. Combining the bounds on $I$-$III$, we observe the following. First, note that, by (D.3) and by applying the same arguments as those adopted in bounding $I$ and $II$ above, we have $\max(R_{k1}, R_{k2}) \le |F_k|$. Now we turn our attention to $R_{k4}$. Let $\gamma_{x,ii'}(t, \ell) = E(X_{i,t-\ell} X_{i',t})$ when $\ell \ge 0$ and $\gamma_{x,ii'}(t, \ell) = E(X_{it} X_{i',t-|\ell|})$ when $\ell < 0$. Then, by Lemma D.2, there exist constants $C_{\Xi,\varsigma,\varepsilon}, C'_{\Xi,\varsigma,\varepsilon} > 0$, not depending on $i, i'$, that bound the relevant autocovariance differences. Similarly, noting that there is at most one change point within any $(\widehat\theta_k - \theta_k)$-interval, an analogous bound holds on $\mathcal{M}^\chi_{n,p}$. Collecting the bounds on $IV$, $V$ and $VI$, we have, for some constant $C > 0$, the required bound under Assumption 3.1 (i). Also, we observe the corresponding control by Proposition 2.1. Turning our attention to $R_{k3}$, note that it admits a decomposition in terms of $Q^{(r)}_{k,ii'}(\omega, h, H)$, $r = 1, 2$, whose definitions can be found in (D.10).
Then, by Lemma D.8 and Chebyshev's inequality, there exists some constant $c_1 > 0$ such that $P(\bar{\mathcal{E}}^{(1)}_{n,p}) \to 1$, where $\bar{\mathcal{E}}^{(1)}_{n,p}$ (which itself does not depend on $k$) is as in (D.7). On $\bar{\mathcal{E}}^{(1)}_{n,p}$, a bound involving $(\cdot)^{1-2/\nu}$ and $m \log(G K_\chi)$ holds for $c_0$ large enough which, combined with the bounds on $R_{kl}$, $l = 1, 2, 4, 5$, contradicts the first inequality in (D.6). As these statements are deterministic on $\mathcal{E}^{(1)}_{n,p} \cap \bar{\mathcal{E}}^{(1)}_{n,p}$, the above arguments apply to all $1 \le k \le K_\chi$, which concludes the proof.

In what follows, we operate under the assumptions made in Theorem 3.1; recall $\psi_n$ defined in (9).

Proof of Lemma D.7. We first address the first term on the RHS of (D.8). In Lemma D.1, for $\varsigma > 2$ (as assumed in Assumptions 2.1 and 2.3 (iv)), we can always set $\alpha = \varsigma - 1 > 1/2 - 2/\nu$. Then, from the finiteness of $\Phi_{\nu,\alpha}$ shown therein and by Theorems 4.1 and 4.2 of Zhang and Wu (2021), there exist universal constants $C_1, C_2, C_3 > 0$ and constants $C_\alpha, C_{\nu,\alpha} > 0$, depending only on their subscripts, such that the corresponding tail bound holds for any $z > 0$. Noting that $E(Y) = \int_0^\infty P(Y > y)\, dy$ for any positive random variable $Y$, we have $E(\max_v \sup_\omega |\widehat\sigma_{x,v,ii'}(\omega, G) - E(\widehat\sigma_{x,v,ii'}(\omega, G))|^2) \le C \psi_n^2$ for some constant $C > 0$ independent of $i, i'$, thanks to Lemma D.1.

Turning our attention to the second term on the RHS of (D.8), let $\widehat\Gamma_{x,v}(\ell, G) = [\widehat\gamma_{x,v,ii'}(\ell, G),\, 1 \le i, i' \le p]$ and define $\widehat\Gamma_{\chi,v}(\ell, G)$, $\widehat\gamma_{\chi,v,ii'}(\ell, G)$, $\widehat\Gamma_{\xi,v}(\ell, G)$ and $\widehat\gamma_{\xi,v,ii'}(\ell, G)$ analogously. Note that, by Assumption 2.4 (i), the autocovariance of $X_t$ decomposes into those of $\chi_t$ and $\xi_t$. Then, by Lemma D.2, the bias is controlled for all $\ell$, $1 \le i, i' \le p$ and $G \le v \le n - G$. From the bounds on $I$ and $II$ (which hold uniformly over $1 \le i, i' \le p$ and $G \le v \le n$) and since $\varsigma > 2$, there exists $C_{\Xi,\varsigma,\varepsilon} > 0$ such that $III \le C_{\Xi,\varsigma,\varepsilon} G^{-1} = o(m^{-1})$. Also, from Lemma D.3, there exists $C'_{\Xi,\varsigma,\varepsilon} > 0$ bounding $IV$ and $V$. Combining the bounds on $III$-$V$, the proof is complete.

For $H \in \{0, \pm G\}$ and $1 \le k \le K_\chi$, define $Q^{(r)}_{k,ii'}(\omega, h, H)$, $r = 1, 2$, as in (D.10).

Lemma D.8. There exists a constant $C > 0$, not depending on $1 \le i, i' \le p$, such that for some $\delta \in \{m, \ldots, G\}$, the first moment bound holds under Assumption 2.4 (ii) (a), and the second under Assumption 2.4 (ii) (b).

Proof. Under Assumption 2.4 (ii) (a), Proposition 6.2 of Zhang and Wu (2021), combined with the arguments adopted in the proof of their Theorem 4.1 (most notably, their Equation (B.15)) and a Bonferroni correction, yields that there exist universal constants $C_1 > 0$ and $C_{\nu,\alpha}, C_\alpha > 0$, depending only on their subscripts, such that for any $z > 0$, the tail probability is bounded by the sum of a polynomial term of order $C_{\nu,\alpha}\, \delta^{1-\nu/2} z^{-\nu/2}$ and
$$C_1 (4m + 1) G K_\chi \exp\left( -\frac{\delta z^2}{C_\alpha m \Psi^4_{4,\alpha}} \right),$$
thanks to Lemma D.1. Then, as in the proof of Lemma D.7, we can find $C > 0$ independent of $i, i'$ and show the first part of the claim by Lemma D.1. Similarly, under Assumption 2.4 (ii) (b), Lemma D.1 and Theorem 6.3 of Zhang and Wu (2021) show that there exists a universal constant $C_2 > 0$ such that the corresponding tail bound holds, which completes the proof.

We now provide a series of supporting results under the assumptions made in Theorem 3.2, leading to the proof of its claims. In what follows, we operate in $\mathcal{M}^\chi_{n,p}$, and work with $\check\psi_n$ defined analogously to $\psi_n$. Also, let $\delta_{\chi,k} = \theta_{\chi,k+1} - \theta_{\chi,k}$ and $\widehat\delta_{\chi,k} = \widehat\theta_{\chi,k+1} - \widehat\theta_{\chi,k}$ for $0 \le k \le K_\chi$, such that $\Sigma^{[k]}_x(\omega) = \Sigma_{x,\theta_{\chi,k+1}}(\omega, \delta_{\chi,k})$.
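Several steps above (e.g. in the proofs of Lemmas D.7 and D.8) convert a mixed polynomial-exponential tail bound into a second-moment bound via the identity $E(Y) = \int_0^\infty P(Y > y)\, dy$. For concreteness, here is a worked version of this step under a generic tail bound; the constants $A$, $B$, $c$ and the truncation point $a$ are ours for illustration.

```latex
% Generic conversion: if Y >= 0 satisfies P(Y > z) <= A z^{-q} + B e^{-c z^2}
% for some q > 2, then for any truncation point a > 0,
\begin{align*}
  E(Y^2) &= \int_0^\infty 2 z\, P(Y > z)\, dz
    \le a^2 + \int_a^\infty 2 z \left( A z^{-q} + B e^{-c z^2} \right) dz \\
    &= a^2 + \frac{2A}{q - 2}\, a^{2 - q} + \frac{B}{c}\, e^{-c a^2}.
\end{align*}
% Choosing a = psi_n such that the last two terms are O(a^2) yields
% E(Y^2) <= C psi_n^2, which is exactly the form of bound used above.
```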
Proposition D.9. (a) There exists a constant $C > 0$ such that the element-wise second moment bound stated below holds. (b) With $\bar\theta_{n,p}$ defined in (12), the corresponding max-norm bound holds.

Proof of (a). Under Assumption 3.1 (ii), applying Proposition 6.2 and Theorem 6.3 of Zhang and Wu (2021) with their (B.15), there exist universal constants $C_1, C_2 > 0$, not depending on $1 \le i, i' \le p$, and constants $C_\alpha, C_{\nu,\alpha} > 0$, depending only on their subscripts, such that for any $z > 0$,
$$P\left( \max_k \sup_\omega \left| \widehat\sigma_{x,\widehat\theta_{\chi,k+1},ii'}(\omega, \widehat\delta_{\chi,k}) - E\big( \widehat\sigma_{x,\widehat\theta_{\chi,k+1},ii'}(\omega, \widehat\delta_{\chi,k}) \big) \right| \ge z \right) \le \frac{C_{\nu,\alpha} K_\chi (4m+1) m^{\nu/2-1} \Phi^\nu_{\nu,\alpha}}{G^{\nu/2-1} z^{\nu/2}} + C_1 K_\chi (4m+1) \exp\left( -\frac{G z^2}{C_\star} \right),$$
where $C_\star$ involves $m$ and the relevant dependence adjusted norms, so that $E\big( \max_k \sup_\omega | \widehat\sigma_{x,\widehat\theta_{\chi,k+1},ii'}(\omega, \widehat\delta_{\chi,k}) - E( \widehat\sigma_{x,\widehat\theta_{\chi,k+1},ii'}(\omega, \widehat\delta_{\chi,k}) ) |^2 \big) \le C \check\psi_n^2$. Next, we bound the bias term:
$$\max_k \max_{i,i'} \sup_\omega \left| E\big( \widehat\sigma_{x,\widehat\theta_{\chi,k+1},ii'}(\omega, \widehat\delta_{\chi,k}) \big) - \sigma_{x,\theta_{\chi,k+1},ii'}(\omega, \delta_{\chi,k}) \right| \le I + II,$$
where $I$ compares the expectation with the local spectral density evaluated at $(\widehat\theta_{\chi,k+1}, \widehat\delta_{\chi,k})$, and $II$ compares the latter with the local spectral density at $(\theta_{\chi,k+1}, \delta_{\chi,k})$. We can show that $I = O(m^{-1})$ by arguments analogous to those adopted in bounding the RHS of (D.9). Also, on $\mathcal{M}^\chi_{n,p}$, we have for fixed $\ell \ge 0$,
$$\left| \gamma_{x,\widehat\theta_{\chi,k+1},ii'}(\ell, \widehat\delta_{\chi,k}) - \gamma_{x,\theta_{\chi,k+1},ii'}(\ell, \delta_{\chi,k}) \right| \le C\, \frac{\rho^{[k]}_{n,p} + \rho^{[k+1]}_{n,p}}{G}$$
by Lemma D.3, from which we obtain $II = O(\rho_{n,p}/G)$. In summary, we obtain
$$E\left( \max_k \sup_\omega \left| \widehat\sigma_{x,\widehat\theta_{\chi,k+1},ii'}(\omega, \widehat\delta_{\chi,k}) - \sigma_{x,\theta_{\chi,k+1},ii'}(\omega, \delta_{\chi,k}) \right|^2 \right) \le C \left( \check\psi_n \vee \frac{1}{m} \vee \frac{\rho_{n,p}}{G} \right)^2$$
for some constant $C > 0$ that does not depend on $1 \le i, i' \le p$, from which the conclusion follows.

Proof of (b). Under Assumption 3.1 (ii), applying Theorems 6.1 and 6.3 of Zhang and Wu (2021) with their (B.15), there exist universal constants $C_1, C_2, C_3 > 0$ and constants $C_\alpha, C_{\nu,\alpha} > 0$, depending only on their subscripts, such that for any $z > 0$, the tail probability of $\max_k \sup_\omega | \widehat\Sigma_{x,\widehat\theta_{\chi,k+1}}(\omega, \widehat\delta_{\chi,k}) - E( \widehat\Sigma_{x,\widehat\theta_{\chi,k+1}}(\omega, \widehat\delta_{\chi,k}) ) |_\infty$ is bounded by
$$\frac{C_{\nu,\alpha} K_\chi (4m+1) m^{\nu/2-1} (\log^{7/4}(p)\, p^{1/\nu})^\nu}{G^{\nu/2-1} z^{\nu/2}} + C_1 K_\chi (4m+1)\, p^2 \exp\left( -\frac{G z^2}{C_\star} \right),$$
thanks to Lemma D.1, such that $| \widehat\Sigma_{x,\widehat\theta_{\chi,k+1}}(\omega, \widehat\delta_{\chi,k}) - E( \widehat\Sigma_{x,\widehat\theta_{\chi,k+1}}(\omega, \widehat\delta_{\chi,k}) ) |_\infty = O_p(\bar\theta_{n,p})$. We can bound the bias term $\max_k \sup_\omega | E( \widehat\Sigma_{x,\widehat\theta_{\chi,k+1}}(\omega, \widehat\delta_{\chi,k}) ) - \Sigma_{x,\theta_{\chi,k+1}}(\omega, \delta_{\chi,k}) |_\infty$ as in the proof of (a), and the conclusion follows.

We denote by $e^{[k]}_{\chi,j}(\omega)$, $1 \le j \le q_k$, the eigenvectors of $\Sigma^{[k]}_\chi(\omega)$ corresponding to the eigenvalues $\mu^{[k]}_{\chi,j}(\omega)$, and let $M^{[k]}_\chi(\omega) = \mathrm{diag}(\mu^{[k]}_{\chi,j}(\omega),\, 1 \le j \le q_k)$ and $E^{[k]}_\chi(\omega) = [e^{[k]}_{\chi,j}(\omega),\, 1 \le j \le q_k]$ for all $0 \le k \le K_\chi$.

Lemma D.10. There exists a unitary, diagonal matrix $O_k(\omega) \in \mathbb{C}^{q_k \times q_k}$, for each $\omega \in [-\pi, \pi]$ and $0 \le k \le K_\chi$, such that the rotated eigenvector estimators are consistent in the sense stated below. Further, we can find $\{\omega^*_l\}_{l=-m}^{m-1}$ and $\{\omega^\bullet_l\}_{l=-m}^{m-1}$ with $\omega^*_l, \omega^\bullet_l \in [\omega_l, \omega_{l+1}]$, such that the required approximation holds.

Proof. By definition, the quantity of interest admits a decomposition where, from Lemma D.16, we have $I = O_p(\tilde\theta_{n,p})$ with $\tilde\theta_{n,p}$ defined therein; the remaining terms are handled analogously.

In the remainder of this section, we omit $\xi$ from $\theta_{\xi,k}$ and $\widehat\theta_{\xi,k}$ for simplicity. In what follows, we operate on $\mathcal{M}^\chi_{n,p} \cap \mathcal{E}^{(2)}_{n,p} \cap \bar{\mathcal{E}}^{(2)}_{n,p}$, with $\bar{\mathcal{E}}^{(2)}_{n,p}$ defined in (D.22) below which, due to Theorem 3.1, Proposition D.14 and Lemma D.17, satisfies $P(\mathcal{M}^\chi_{n,p} \cap \mathcal{E}^{(2)}_{n,p} \cap \bar{\mathcal{E}}^{(2)}_{n,p}) \to 1$.

Proof of Theorem 4.1 (a). In the first iteration of Algorithm 2 with $v^\bullet = G$, the estimator $\widehat\beta = \widehat\beta_{v^\bullet}(G)$ satisfies the bound in (D.20) for a large enough $c_0$. This, together with the bound on $|R_{k3}|$, shows that the inequality in (D.21) does not hold, and thus the claim follows. Since all the arguments are conditional on $\mathcal{E}^{(2)}_{n,p} \cap \bar{\mathcal{E}}^{(2)}_{n,p}$, which in turn are formulated uniformly over $1 \le k \le K_\xi$, the proof is complete.

Proof of Corollary A.1. For the proof of (a), we first note that, under the stationarity of $\chi_t$, we have $\widehat K_\chi = 0$ on $\mathcal{M}^\chi_{n,p}$, such that $\rho_{n,p} = 0$. Therefore, we have $P(\mathcal{E}_{n,p}) \to 1$ for the correspondingly defined event. Operating on $\mathcal{M}^\chi_{n,p} \cap \mathcal{E}^{(2)}_{n,p} \cap \bar{\mathcal{E}}^{(2)}_{n,p}$, arguments analogous to those adopted in the proof of Theorem 4.1 apply.
For the proof of (b), we proceed similarly to the case of (a), except for the definition of the corresponding event, which again holds with probability tending to one.

In what follows, we operate under the assumptions made in Theorem 4.1. We define $\widehat\Gamma_{\chi,v}(\ell, G)$ analogously to $\widehat\Gamma_{\xi,v}(\ell, G)$, with $\theta_{\chi,k}$ in place of $\theta_{\xi,k}$, and let $\widehat\Gamma_{x,v}(\ell, G) = \widehat\Gamma_{\chi,v}(\ell, G) + \widehat\Gamma_{\xi,v}(\ell, G)$.

Lemma D.16. Recall the definition of $\tilde\theta_{n,p}$ in (A.1). Then, the stated uniform bound holds.

Proof. By Theorems 3.1 and 3.2 of Zhang and Wu (2021), there exist universal constants $C_1, C_2 > 0$ and constants $C_\alpha, C_{\nu,\alpha} > 0$, depending only on their subscripts, such that for any $z > 0$, the tail probability is bounded by the sum of a polynomial term of order
$$\frac{C_{\nu,\alpha}\, n\, d^{\nu/4} \log^{\nu+1}(G) (\log^{3/2}(p)\, p^{1/\nu})^\nu}{(G z)^{\nu/2}}$$
and an exponential term of order $C_1 n p^2 \exp(-G z^2 / C)$, such that $\max_v \max_\ell | \widehat\Gamma_{x,v}(\ell, G) - E(\widehat\Gamma_{x,v}(\ell, G)) |_\infty = O_p(\tilde\theta_{n,p})$, thanks to Lemma D.1. As for the bias term, applying the arguments adopted in the proof of Lemma D.7, it can be shown to be of the required order, which completes the proof.

For $1 \le k \le K_\xi$, $H \in \{0, \pm G\}$ and $\ell \ge 0$, define the relevant quantities as in the display referenced above.

Proof. Applying Theorems 6.4 and 6.5 of Zhang and Wu (2021) with a Bonferroni correction, there exist universal constants $C_1, C_2 > 0$ and constants $C_\alpha, C_{\nu,\alpha} > 0$, depending only on their subscripts, such that for any $z > 0$, the tail probability is bounded by the sum of a polynomial term of order
$$\frac{C_{\nu,\alpha} K_\xi G d^{\nu/4} (p^{1/\nu} \log^{3/2}(p))^\nu}{\delta^{\nu/2-1} z^{\nu/2}}$$
and an exponential term of order $C_1 K_\xi G d p^2 \exp(-\delta z^2 / C)$.

References

Inferential theory for factor models of large dimensions.
Multiple change point detection in structured VAR models: the VARDetect R package.
Multiple change points detection in low rank and sparse high dimensional vector autoregressive models.
Multiple change point detection in reduced rank high dimensional vector autoregressive models.
NETS: network estimation for time series.
Consistent estimation of high-dimensional factor models when the factor number is over-estimated.
Simultaneous multiple change-point and factor analysis for high-dimensional time series.
FNETS: factor-adjusted network estimation and forecasting for high-dimensional time series.
A network analysis of the volatility of high dimensional financial series.
Low rank and structured modeling of high-dimensional vector autoregressions.
Regularized estimation in sparse high-dimensional time series models.
Time series: data analysis and theory.
A constrained $\ell_1$ minimization approach to sparse precision matrix estimation.
The Dantzig selector: statistical estimation when p is much larger than n.
Inference of breakpoints in high-dimensional time series.
Structural break estimation for nonstationary time series models.
On the network topology of variance decompositions: measuring the connectedness of financial firms.
A MOSUM procedure for the estimation of multiple random change points.
Factor-adjusted regularized model selection.
Large covariance estimation by thresholding principal orthogonal complements.
Autoregressive models for gene regulatory network inference: sparsity, stability and causality issues.
Testing hypotheses about the number of factors in large factor models.
Detection of multiple structural breaks in multivariate time series.
Joint structural break detection and parameter estimation in high-dimensional nonstationary VAR models.
Discovering graphical Granger causality using the truncating lasso penalty.
Forecasting using principal components from a large number of predictors.
Optimal covariance change point localization in high dimensions.
Localizing changes in high-dimensional vector autoregressive processes.
Nonlinear system theory: another look at dependence.
Inference for multiple change points in time series via likelihood ratio scan statistics.
A useful variant of the Davis-Kahan theorem for statisticians.
Convergence of covariance and spectral density estimates for high-dimensional locally stationary processes.
Returning to the proof of Theorem 3.1 (a), on $\mathcal{E}^{(1)}_{n,p}$ the stated bound holds for all $G \le v \le n - G$. From (D.2), it follows that, for any $v$ satisfying $\min_{1 \le k \le K_\chi} |v - \theta_k| \ge G$, we have $T_{\chi,v}(\omega(v), G) \le \kappa_{n,p}$, since $\widetilde\Sigma_{\chi,v}(\omega(v), G) = O$ due to Assumption 3.1 (ii), and such $v$ does not belong to $\mathcal{I}$. Also, noting that the statistic is large at each change point by (D.1), (D.2) and the definitions of $\omega(\theta_k)$ and $\omega^{\circ[k]}$, we conclude that at least one change point is detected within distance $G$ of each $\theta_k$, $1 \le k \le K_\chi$. Next, suppose that $\widehat\theta \in \widehat\Theta_\chi$ satisfies $|\widehat\theta - \theta_k| < G$. From $T_{\chi,\widehat\theta}(\omega(\widehat\theta), G) \ge T_{\chi,\theta_k}(\omega(\theta_k), G) \ge T_{\chi,\theta_k}(\omega^{\circ[k]}, G)$, we obtain a lower bound for some small constant $\epsilon_0 \in (0, 1/2)$ and large enough $n$ under Assumption 3.1 (ii), i.e. we detect at least one change point within the stated distance of each $\theta_k$. Then, by (D.2) and the lower bound on $\kappa_{n,p}$, any such spurious $v$ cannot be a local maximiser.

Proof of Theorem 3.1 (b). WLOG, we consider the case where $\widehat\theta_k \le \theta_k$; the following arguments apply analogously to the case where $\widehat\theta_k > \theta_k$. We prove by contradiction that, if the claimed localisation rate were violated on $\mathcal{M}^\chi_{n,p}$, we would have $T_{\chi,\widehat\theta_k}(\omega(\widehat\theta_k), G) < T_{\chi,\theta_k}(\omega(\widehat\theta_k), G)$, and thus $\widehat\theta_k$ could not be the local maximiser of $T_{\chi,v}(\omega(\widehat\theta_k), G)$ within its $\eta G$-environment, as required. From the definition of the test statistic and by (D.1) and (D.2), we have the stated bound for an arbitrarily small constant $\epsilon > 0$. Noting that $\widetilde{\widehat\Sigma}_{x,v}(\omega, G)$ and $\widetilde\Sigma_{\chi,v}(\omega, G)$ are Hermitian (and thus diagonalisable with real eigenvalues), we write the corresponding spectral decompositions. Then, on $\mathcal{E}^{(1)}_{n,p}$, there exist $s_k$ with $|s_k| = 1$ and a constant $c_1 > 0$ such that the required lower bound holds, where the first inequality follows from Corollary 1 of Yu et al. (2015), and the second from Assumption 3.1 (i), (D.2) and (D.3). WLOG, we may fix the sign of $s_k$ accordingly.

Proof of Lemma D.10. By Propositions 2.1 and D.9 (a), we have the required spectral-norm control. Then, by Theorem 2 of Yu et al. (2015), there exist such $O_k(\omega)$ satisfying the stated bound for all $\omega$ and $k$ which, combined with (D.11) and Assumption 3.2, concludes the proof.

Let $\mu^{[k]}_{x,j}(\omega)$ denote the $j$-th largest eigenvalue of $\widehat\Sigma_{x,\widehat\theta_{\chi,k+1}}(\omega, \widehat\delta_{\chi,k})$ (and thus the $j$-th diagonal element of $M^{[k]}_x(\omega)$). As a consequence of (D.11) and Weyl's inequality, for all $1 \le j \le q_k$ and $\omega \in [-\pi, \pi]$, the eigenvalue estimation error is uniformly controlled, so that $(M^{[k]}_x(\omega))^{-1}$ exists with probability tending to one as $n, p \to \infty$. Then, from (D.12), we have, for all $\omega$, the stated bound, where the $O_p$ statement holds uniformly over $\omega$ and $k$.

Let $\varphi_i$ denote the vector whose $i$-th element is one and whose remaining elements are zero, with its length determined by the context.

Proof. By Propositions 2.1 and D.9 (b), we have the required max-norm control. Then, by (D.13), Assumption 3.2 and Lemmas D.4, D.10 and D.11, the claim follows.

Denote by $e^{[k]}_{\chi,j}(\omega)$ (resp. $e^{[k]}_{x,j}(\omega)$) the $j$-th column of $E^{[k]}_\chi(\omega)$ (resp. $E^{[k]}_x(\omega)$), and let $e^{[k]}_{\chi,ij}(\omega)$ (resp. $e^{[k]}_{x,ij}(\omega)$) denote its $i$-th element. The next result shows, in particular, that $e^{[k]}_{x,ij}(\omega) = O_p(1)$.

Proof. Note that, by Assumption 2.4 (i) and the arguments adopted in the proof of Lemma D.4, $\sigma_{x,\theta_{\chi,k+1},ii'}(\omega, \delta_{\chi,k}) = \sigma^{[k]}_{\chi,ii'}(\omega) + \sigma_{\xi,\theta_{\chi,k+1},ii'}(\omega, \delta_{\chi,k})$ and $\max_{k,i,i'} \sup_\omega \sigma_{x,\theta_{\chi,k+1},ii'}(\omega, \delta_{\chi,k}) \le B_\sigma < \infty$. Then, since $\mu^{[k]}_{\chi,j}(\omega) |e^{[k]}_{\chi,ij}(\omega)|^2 \le B_\sigma$, the claim (a) follows. Next, by Proposition D.9 (b) and Lemma D.4, we obtain the corresponding bound which, combined with (D.12), leads to (b).

Proof of Theorem 3.2 (a). First, note that the estimation error admits a decomposition into three terms, $I$-$III$. By Lemmas D.11, D.12, D.13 and the Cauchy-Schwarz inequality, $I$ is bounded as required under Assumption 3.2, and $III$ can be handled analogously. By (D.12) and Lemma D.13, $II$ is bounded accordingly.

Writing the relevant quantity as $\int_{-\pi}^{\pi} (\cdot)\, \sigma^{[k]}_{\chi,ii'}(\omega) e^{\iota \ell \omega}\, d\omega =: I + II$, the term $I$ is handled by Theorem 3.2 (a), and the term $II$ by Theorem 3.2 (b) on $\mathcal{M}^\chi_{n,p}$. The conclusion follows upon noting that $\tilde\theta_{n,p} \vee \bar\theta_{n,p} = O(\vartheta_{n,p})$.

A consequence of Proposition D.14 is that $P(\mathcal{E}^{(2)}_{n,p}) \to 1$ with $\lambda_{n,p}$ chosen as in (17).
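For reference, we recall the variant of the Davis-Kahan theorem invoked above (Corollary 1 and Theorem 2 of Yu et al., 2015), stated here for real symmetric matrices; the proofs above apply a Hermitian analogue with unitary, diagonal rotations $O_k(\omega)$. Adopt the conventions $\lambda_0 = \infty$ and $\lambda_{p+1} = -\infty$.

```latex
% Theorem 2 of Yu, Wang and Samworth (2015), up to notational differences.
% Sigma and hat Sigma are symmetric with eigenvalues lambda_1 >= ... >= lambda_p;
% V (resp. hat V) collects the eigenvectors r, ..., s of Sigma (resp. hat Sigma),
% with d = s - r + 1.
\[
  \big\| \sin \Theta(\widehat V, V) \big\|_{\mathrm F}
  \le \frac{2 \min\big( d^{1/2} \| \widehat\Sigma - \Sigma \|_{\mathrm{op}},
                        \| \widehat\Sigma - \Sigma \|_{\mathrm F} \big)}
           {\min\big( \lambda_{r-1} - \lambda_r,\; \lambda_s - \lambda_{s+1} \big)}.
\]
% Moreover, there exists an orthogonal matrix O with
% || hat V O - V ||_F <= sqrt(2) || sin Theta(hat V, V) ||_F,
% which is how the rotations O_k(omega) arise in Lemma D.10.
```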
Proposition D.15. Under the assumptions made in Theorem 3.2 and with $\lambda_{n,p}$ chosen as in (17), we have, on $\mathcal{M}^\chi_{n,p} \cap \mathcal{E}^{(2)}_{n,p}$, $\|\widehat\beta_v(G)\|_1 \le \|\beta^{[k]}\|_1$ for all $\theta_{\xi,k} + G \le v \le \theta_{\xi,k+1}$ and $0 \le k \le K_\xi$.

Proof. We first note that solving (16) is equivalent to solving the problem column-wise, i.e.
$$\widehat\beta_{v,\cdot j}(G) = \arg\min_{\beta \in \mathbb{R}^{pd}} |\beta|_1 \quad \text{subject to} \quad \big| G_v(G)\beta - g_{v,\cdot j}(G) \big|_\infty \le \lambda_{n,p} \quad \text{for } 1 \le j \le p$$
(see e.g. Lemma 1 of Cai et al. (2011)), where $\beta_{\cdot j}$ denotes the $j$-th column of any $\beta \in \mathbb{R}^{(dp) \times p}$. Next, we show that, for any $\theta_{\xi,k} + G \le v \le \theta_{\xi,k+1}$, $\beta^{[k]}$ is a feasible solution to (16), for all $0 \le k \le K_\xi$. This follows from the fact that $|G_v(G)\beta^{[k]}_{\cdot j} - g_{v,\cdot j}(G)|_\infty \le \lambda_{n,p}$ on $\mathcal{E}^{(2)}_{n,p}$. Then, $|\widehat\beta_{v,\cdot j}(G)|_1 \le |\beta^{[k]}_{\cdot j}|_1$ for $\theta_{\xi,k} + G \le v \le \theta_{\xi,k+1}$ and, consequently, $\|\widehat\beta_v(G)\|_1 \le \|\beta^{[k]}\|_1$. From this and the triangle inequality, we also have $|G_v(G)(\widehat\beta_{v,\cdot j}(G) - \beta^{[k]}_{\cdot j})|_\infty \le 2\lambda_{n,p}$.
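Since each column problem above is a linear program, it can be solved with any LP solver via the standard split $\beta = \beta^+ - \beta^-$. Below is a minimal sketch for a single column; it is a generic Dantzig selector solver under our own toy inputs, not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_column(G_mat, g_vec, lam):
    """Solve min |beta|_1 subject to |G_mat @ beta - g_vec|_inf <= lam
    as a linear program, splitting beta = beta_plus - beta_minus >= 0.

    In the notation above, G_mat and g_vec play the roles of G_v(G)
    and g_{v, .j}(G) for one column j.
    """
    q = G_mat.shape[1]
    c = np.ones(2 * q)                     # objective: |beta|_1
    A_ub = np.block([[G_mat, -G_mat],      #  G beta <= g + lam
                     [-G_mat, G_mat]])     # -G beta <= lam - g
    b_ub = np.concatenate([g_vec + lam, lam - g_vec])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:q] - res.x[q:]

# toy usage: a sparse coefficient vector roughly recovered from
# exact Yule-Walker-type moments
rng = np.random.default_rng(0)
G_mat = rng.standard_normal((20, 20))
G_mat = G_mat @ G_mat.T / 20               # symmetric positive semi-definite
beta0 = np.zeros(20)
beta0[[2, 7]] = (1.0, -0.5)
g_vec = G_mat @ beta0
print(np.round(dantzig_column(G_mat, g_vec, lam=0.05), 2))
```

The split doubles the number of variables but keeps all constraints linear, which is the standard reduction of $\ell_1$ minimisation under an $\ell_\infty$ constraint to an LP; choosing $\lambda_{n,p}$ by cross-validation then amounts to re-solving the LP over a grid of constraint levels.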