Multiple change point detection under serial dependence: Wild contrast maximisation and gappy Schwarz algorithm
Haeran Cho and Piotr Fryzlewicz
2020-11-27

We propose a methodology for detecting multiple change points in the mean of an otherwise stationary, autocorrelated, linear time series. It combines solution path generation based on the wild contrast maximisation principle, and an information criterion-based model selection strategy termed gappy Schwarz algorithm. The former is well-suited to separating shifts in the mean from fluctuations due to serial correlations, while the latter simultaneously estimates the dependence structure and the number of change points without performing the difficult task of estimating the level of the noise as quantified e.g. by the long-run variance. We provide a modular investigation into their theoretical properties and show that the combined methodology, named WCM.gSa, achieves consistency in estimating both the total number and the locations of the change points. The good performance of WCM.gSa is demonstrated via extensive simulation studies, and we further illustrate its usefulness by applying the methodology to London air quality data.

This paper proposes a new methodology for detecting possibly multiple change points in the piecewise constant mean of an otherwise stationary, linear time series. This is a well-known, difficult problem in multiple change point analysis, whose challenge stems from the fact that change points can mask as natural fluctuations in a serially dependent process and vice versa. We briefly review the existing literature on multiple change point detection in the presence of serial dependence and situate our proposed methodology in this context; see also Aue and Horváth (2013) for a review. One line of research extends the applicability of test statistics developed for independent data, such as the CUSUM (Csörgő and Horváth, 1997) and moving sum (MOSUM; Hušková and Slabý, 2001) statistics, to the time series setting. Their performance depends on the estimated level of noise quantified e.g. by the long-run variance (LRV), and estimators of the latter in the presence of multiple change points have been proposed (Tecuapetla-Gómez and Munk, 2017; Eichinger and Kirch, 2018; Dette et al., 2020). The estimation of the LRV, even when mean changes are not present, has long been noted as a difficult problem (Robbins et al., 2011); the popularly adopted kernel estimator of the LRV tends to incur downward bias (den Haan and Levin, 1997; Chan and Yau, 2017), and can even take negative values when the LRV is small (Hušková and Kirch, 2010). The estimation becomes even more challenging in the presence of (possibly) multiple change points, and the estimators may be sensitive to the choice of tuning parameters which are often related to the frequency of change points. Self-normalisation of test statistics avoids direct estimation of this nuisance parameter (Shao and Zhang, 2010; Pešta and Wendler, 2020), but theoretical investigation into its validity is often limited to change point testing, i.e. when there is at most a single change point, with the exceptions of Wu and Zhou (2020) and Zhao et al. (2021), both of which adopt local window-based procedures.
Consistency of the methods utilising penalised least squares estimation (Lavielle and Moulines, 2000) or Schwarz criterion (Cho and Kirch, 2021) constructed without further parametric assumptions, has been established under general conditions permitting serial dependence and heavy-tails, but their consistency relies on the choice of the penalty, which in turn depends on the level of the noise. The second line of research utilises particular linear or non-linear time series models such as the autoregressive (AR) model, and estimates the serial dependence and change point structures simultaneously. AR(1)-type dependence has often been adopted to describe the serial correlations in this context: Chakar et al. (2017) and Romano et al. (2021) propose to minimise the penalised cost function for detection of multiple change points in the mean of AR(1) processes via dynamic programming, and Fang and Siegmund (2020) study a pseudosequential approach to change point detection in the level or slope of the data. Lu et al. (2010) investigate the problem of climate time series modelling by allowing for multiple mean shifts and periodic AR noise. Gallagher et al. (2021) propose to estimate AR parameters from differenced data in the presence of multiple mean shifts, and investigate the performance of change point detection methods developed for i.i.d. noise setting to the residuals. Fryzlewicz (2020b) proposes to circumvent the need for accurate estimation of AR parameters through the use of a multi-resolution sup-norm (rather than the ordinary least squares) in fitting the postulated AR model, but this is only possible because the goal of the method is purely inferential and therefore different from ours. We also mention that Davis et al. (2006 Davis et al. ( , 2008 , Cho and Fryzlewicz (2012) , Bardet et al. (2012) , Chan et al. (2014) , Yau and Zhao (2016) and Korkas and Fryzlewicz (2017) , among others, study multiple change point detection under piecewise stationary, univariate time series models, and Cho and Fryzlewicz (2015) , Safikhani and Shojaie (2020) and Cho and Korkas (2021) under high-dimensional time series models. We now describe our proposed methodology against this literature background and summarise its novelty and main contributions of this paper. 1. The first step of the proposed methodology constructs a sequence of candidate change point models by adopting the Wild Contrast Maximisation (WCM) principle: it iteratively locates the next most likely change point in the data between the previously proposed change point estimators, as the one maximising a given contrast (in our case, the absolute CUSUM statistic) in the data sections over a collection of intervals of varying lengths and locations. It produces a complete solution path to the change point detection problem as a decreasing sequence of max-CUSUMs corresponding to the successively proposed change point candidates. The WCM principle has successfully been applied to the problem of multiple change point detection in the presence of i.i.d. noise (Fryzlewicz, 2014 (Fryzlewicz, , 2020a . We show that it is particularly useful under serial dependence by generating a large gap between the max-CUSUMs attributed to change points and those attributed to the fluctuations due to serial correlations. 
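To make the contrast maximisation concrete, the following is a minimal R sketch of the absolute CUSUM contrast on a generic interval (s, e] and of its maximisation over randomly drawn sub-intervals. It is our own illustration, not code from the wcm.gsa package: the function names are ours, the CUSUM weighting follows the standard form used in the wild binary segmentation literature, and the formal definition used by the methodology appears in Section 2.

```r
# |CUSUM| contrast of x over (s, e] at every admissible split k = s + 1, ..., e - 1
cusum_contrast <- function(x, s, e) {
  len   <- e - s
  k     <- (s + 1):(e - 1)
  left  <- cumsum(x[(s + 1):e])[1:(len - 1)]      # partial sums over (s, k]
  right <- sum(x[(s + 1):e]) - left               # sums over (k, e]
  abs(sqrt((e - k) / (len * (k - s))) * left -
      sqrt((k - s) / (len * (e - k))) * right)
}

# 'wild' step: maximise the contrast over R randomly drawn sub-intervals of (s, e]
wild_max_cusum <- function(x, s, e, R = 100) {
  best <- c(stat = -Inf, s = NA, k = NA, e = NA)
  for (i in seq_len(R)) {
    lr <- sort(sample(s:e, 2))                    # candidate sub-interval (l, r]
    if (lr[2] - lr[1] < 2) next
    cs <- cusum_contrast(x, lr[1], lr[2])
    m  <- which.max(cs)
    if (cs[m] > best[["stat"]])
      best <- c(stat = cs[m], s = lr[1], k = lr[1] + m, e = lr[2])
  }
  best                                            # max-CUSUM and the triplet (s, k, e)
}
```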
This motivates a new, 'gappy' model sequence generation procedure which, by considering only some of the candidate models along the solution path that correspond to large drops in the decreasing sequence of max-CUSUMs as serious contenders, systematically selects a small subset of model candidates. We justify this gappy model sequence generation theoretically and further demonstrate numerically how it substantially facilitates the subsequent model selection step. 2. The second step performs model selection on the sequence of candidate change point models generated in the first step. To this end, we propose a backward elimination strategy termed gappy Schwarz algorithm (gSa), a new application of Schwarz criterion (Schwarz, 1978) constructed under a parametric, AR model assumption on the noise. Information criteria have been widely adopted for model selection in change point problems (Yao, 1988; Kühn, 2001) . However, through its application on the gappy model sequence, our proposal differs from the conventional use of an information criterion in the change point literature which involve its global (Davis et al., 2006; Killick et al., 2012; Romano et al., 2021) or local (Chan et al., 2014; Fryzlewicz, 2014) minimisation. Rather than setting out to minimise Schwarz criterion, the Schwarz algorithm starts from the largest model in consideration and iteratively compares a pair of consecutive models by evaluating the reduction of the cost due to newly introduced change point estimators, offset by the increase of model complexity as measured by Schwarz criterion. This has the advantage over the direct minimisation of the information criterion on a solution path as it avoids the substantial technical challenges linked to dealing with under-specified models in the presence of serial dependence. The two ingredients, WCM-based gappy model sequence generation and model selection via Schwarz algorithm, make up the WCM.gSa methodology. Throughout the paper, we highlight the important roles played by these two components and argue that WCM.gSa offers state-of-the-art performance in the problem of multiple change point detection under serially dependent noise. WCM.gSa is modular in the sense that each ingredient can be combined with alternative model selection or model sequence generation procedures, respectively. We provide separate theoretical analyses of the two steps so that they can readily be fed into the analysis of such modifications, as well as showing that the combined methodology, WCM.gSa, achieves consistency in estimating the total number and the locations of multiple change points. The paper is organised as follows. In Sections 2 and 3, we introduce the two ingredients of WCM.gSa individually, and show its consistency in multiple change point detection in the presence of serial dependence. Section 4 summarises our numerical results and applies WCM.gSa to London air quality datasets. The Supplementary Appendix contains comprehensive simulation studies, an additional data application to central England temperature data, and the proofs of the theoretical results. The R software implementing WCM.gSa is available from https://github.com/haeran-cho/wcm.gsa. We consider the canonical change point model f j · I(t ≥ θ j + 1) + Z t , t = 1, . . . , n. (1) Under model (1), the set Θ := {θ 1 , . . . , θ q } with θ j = θ j,n , contains q change points (with θ 0 = 0 and θ q+1 = n) at which the mean of X t undergoes changes of size f j . 
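As an illustration of model (1), the following R snippet simulates a toy series with a piecewise constant mean and autocorrelated errors. The change point locations, jump sizes and AR(1) coefficient below are arbitrary choices made purely for illustration and do not correspond to any setting in the paper.

```r
# Toy data in the spirit of model (1): piecewise-constant mean plus AR(1) noise.
set.seed(1)
n     <- 1000
theta <- c(300, 600, 800)            # change point locations theta_j
jumps <- c(2, -3, 1.5)               # jump sizes f_j (mean level starts at f_0 = 0)
f <- rep(0, n)
for (j in seq_along(theta)) f[(theta[j] + 1):n] <- f[(theta[j] + 1):n] + jumps[j]
z <- as.numeric(arima.sim(list(ar = 0.5), n = n))   # stationary AR(1) errors Z_t
x <- f + z                                           # observed series X_t
```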
We assume that the number of change points q does not vary with the sample size n, and we allow serial dependence in the sequence of errors {Z t } n t=1 with E(Z t ) = 0. A large number of multiple change point detection methodologies have been proposed for a variant of model (1) in which the errors {Z t } n t=1 are independent. In particular, a popular class of multiscale methods aim to isolate change points for their detection by drawing a large number of sub-samples of the data living on sub-intervals of [1, n] . When a sufficient number of sub-samples are drawn, there exists at least one interval which is well-suited for the detection and localisation of each θ j , j = 1, . . . q, whose location can be estimated as the maximiser of the series of CUSUM statistics computed on this interval. Methods in this category include the Wild Binary Segmentation (WBS, Fryzlewicz; 2014) , the Seeded Binary Segmentation (Kovács et al., 2020) and the WBS2 (Fryzlewicz, 2020a) . All of the above are based on the WCM principle, i.e. the recursive maximisation of the contrast between the means of the data to the left and right of each putative change point as measured by the CUSUM statistic, over a large number of intervals. (We propose the term Wild Contrast Maximisation rather than, say, 'wild CUSUM maximisation' since, in other change point detection problems, the WCM principle can be applied with statistics other than CUSUM, e.g. generalised likelihood ratio tests.) Their theoretical properties have been established assuming i.i.d. (sub-)Gaussianity on {Z t } n t=1 . In the remainder of this paper, we focus on WBS2, whose key feature is that for any given 0 ≤ s < e ≤ n, we identify the sub-interval {s • + 1, . . . , e • } ⊂ {s + 1, . . . , e} and its inner point k • ∈ {s • + 1, . . . , e • − 1}, which obtains a local split of the data that yields the maximum CUSUM statistic. More specifically, let R s,e denote a subset of A s,e := {( , r) ∈ Z 2 : s ≤ < r ≤ e and r − > 1}, selected either randomly or deterministically, with |R s,e | = min(R n , |A s,e |) for some given R n ≤ n(n − 1)/2. Then, we identify (s • , e • ) ∈ R s,e that achieves the maximum absolute CUSUM statistic, as Starting with (s, e) = (0, n), recursively repeating the above operation over the segments We denote by P 0 the output generated by the WBS2: each element of P 0 contains the triplet of the beginning and the end of the interval and the break that returns the maximum contrast (measured as in (2)) at a particular iteration, and the corresponding max-CUSUM statistic. The order of the sorted max-CUSUMs (in decreasing order) provides a natural ordering of the candidate change points, which gives rise to the following solution path P := if X (m) = 0 for some m ≤ |P 0 |, then (s (m) , k (m) , e (m) ) is not associated with any change point and thus such entries are excluded from the solution path P. The WCM principle provides a good basis for model selection, i.e. selecting the correct number of change points, in the presence of serially dependent noise. This is due to the iterative identification of the local split with the maximum contrast, which helps separate the large max-CUSUMs attributed to mean shifts, from those which are not. In the next section, we propose how to utilise this property of the solution path P generated according to the WCM principle. The solution path P consists of a sequence of candidate change point models K 1 ⊂ K 2 ⊂ . . . with K l := {k (1) , . . . , k (l) }. 
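The recursive construction of the solution path P can be sketched as follows, reusing wild_max_cusum() from the earlier sketch. This is only schematic: the interval sampling (random here, possibly deterministic in WBS2), the treatment of very short segments and the bookkeeping in the authors' implementation all differ from this illustration.

```r
# Schematic solution-path generation in the spirit of WBS2.
wbs2_path <- function(x, R = 100, min_len = 3) {
  out <- list()
  recurse <- function(s, e) {
    if (e - s < min_len) return(invisible(NULL))
    b <- wild_max_cusum(x, s, e, R)
    if (!is.finite(b[["stat"]]) || b[["stat"]] <= 0) return(invisible(NULL))
    out[[length(out) + 1]] <<- b
    recurse(s, b[["k"]])               # left of the proposed split
    recurse(b[["k"]], e)               # right of the proposed split
  }
  recurse(0, length(x))
  if (length(out) == 0) return(NULL)
  path <- as.data.frame(do.call(rbind, out))
  path[order(-path$stat), ]            # entries sorted by decreasing max-CUSUM
}
```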
In this section, we propose a 'gappy' model sequence generation step which selects a subset of the above model sequence by discarding candidate models that are not likely to be the final model. More specifically, by the construction of WBS2, which iteratively identifies the local split of the data with the most contrast (max-CUSUM), we expect to observe a large gap between the CUSUM statistics X (m) computed over those intervals (s (m) , e (m) ) that contain change points well within their interior, and the remaining CUSUMs. Therefore, for the purpose of model selection, we can exploit this large gap in X (m) , 1 ≤ m ≤ P , or equivalently, in Y (m) := log(X (m) ); we later show that under some assumptions on the size of changes and the level of noise, the large log-CUSUMs Y (m) attributed to change points scale as log(n) while the rest scale as log log(n). For the identification of the large gap in Y (1) ≥ . . . ≥ Y (P ) , the simplest approach is to look for the largest difference Y (m) − Y (m+1) . However, this largest gap may not necessarily correspond to the difference between the max-CUSUMs attributed to mean shifts and spurious ones attributed to fluctuations in the errors, but simply be due to the heterogeneity in the change points (i.e. some changes being more pronounced and therefore easier to detect than others). Therefore, we identify the M largest gaps from Y (m) − Y (m+1) , 1 ≤ m ≤ P − 1, and denote the corresponding indices by g 1 < . . . < g M such that This returns a sequence of nested models with Θ l = Θ l−1 ∪ {k (g l−1 +1) , . . . , k (g l ) }. Theorem 2.1 below shows that the model sequence in (4) contains one which consistently detects all q change points with high probability. Typically, this gappy model sequence is much sparser than the sequence of all possible models from the solution path and therefore, intuitively, makes our model selection task easier than if we worked with the entire solution path of all nested models. We confirm this point numerically in the simulation studies reported in Appendix D. In this section, we establish the theoretical properties of the sequence of nested change point models obtained from combining WBS2 with the gappy model sequence generation outlined in Sections 2.1-2.2. The following assumptions are, respectively, on the distribution of {Z t } n t=1 and the size of changes under H 1 : q ≥ 1. Assumption 2.1. Let {Z t } n t=1 be a sequence of random variables satisfying E(Z t ) = 0 and Var(Z t ) = σ 2 Z with σ Z ∈ (0, ∞). Also, let P(Z n ) → 1 with ζ n satisfying log(n) = O(ζ n ) and . Also, there exists some c 1 ∈ (0, 1) such that min 1≤j≤q δ j ≥ c 1 n, and for some ϕ > 0, we have ζ 2 n /(min 1≤j≤q (f j ) 2 δ j ) = O(n −ϕ ). Remark 2.1. Assumption 2.1 permits {Z t } n t=1 to have heavier tails than sub-Gaussian such as sub-exponential or sub-Weibull (Vladimirova et al., 2020) . Appendix G shows that linear time series with short-range dependence and sub-exponential innovations satisfy the assumption, using the Nagaev-type inequality derived in Zhang and Wu (2017) . Similar arguments can be made with the concentration inequalities shown in Doukhan and Neumann (2007) for weakly dependent time series fulfilling E(|Z t | k ) ≤ (k!) ν C k for all k ≥ 1 and some ν ≥ 0 and C > 0, or in Merlevède et al. (2011) for geometrically strong mixing sequences with sub-exponential tails. 
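A minimal sketch of the gap-based generation of candidate models described in this section: take logs of the sorted max-CUSUMs, locate the M largest consecutive drops, and keep only the nested models ending at those drops. The function operates on the output of the wbs2_path() sketch above and is again our own illustration rather than the packaged implementation.

```r
# Gappy model sequence of Section 2.2 from a sorted solution path.
gappy_models <- function(path, M = 5) {
  path <- path[path$stat > 0, ]                  # drop entries not tied to any split
  y    <- log(path$stat)                         # Y_(m) = log X_(m), non-increasing
  gaps <- y[-length(y)] - y[-1]                  # Y_(m) - Y_(m+1), m = 1, ..., P - 1
  g    <- sort(head(order(gaps, decreasing = TRUE), M))   # gap indices g_1 < ... < g_M
  lapply(g, function(m) sort(path$k[1:m]))       # nested models Theta_1, ..., Theta_M
}
```

In words, Theta_l collects the candidate change points whose max-CUSUMs sit above the l-th selected gap, so the sequence is typically far shorter than the full nested sequence K_1, K_2, . . . along the solution path.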
Alternatively, under the invariance principle, if there exists (possibly after enlarging the probability space) a standard Wiener process W (·) such that t=1 Z t −W ( ) = O(log κ ( )) a.s. with κ ≥ 1, then Assumption 2.1 holds with ζ n log κ (n) for any κ > κ , where we denote by a n b n to indicate that a n = O(b n ) and b n = O(a n ). Such invariance principles have been derived for dependent data under weak dependence such as mixing (Kuelbs and Philipp, 1980) and functional dependence measure (Berkes et al., 2014) conditions. As remarked in Proposition 2.1 (c.i) of Cho and Kirch (2021) , the thus-derived ζ n usually does not provide the tightest upper bound, but it suits our purpose in controlling the level of noise. Then, on Z n , the following statements hold for n large enough and some c 2 ∈ (0, ∞). (ii) The sorted log-CUSUMs Y (m) satisfy Y (m) = γ m log(n)(1 + o(1)) for m = 1, . . . , q, while Y (m) ≤ κ m log(ζ n )(1 + o(1)) for m ≥ q + 1, where {γ m } q m=1 and {κ m } m≥q+1 are non-increasing sequences with 0 < γ m ≤ 1/2 and 0 ≤ κ m ≤ 1. Theorem 2.1 (i) establishes that for the solution path P obtained according to the WCM principle, the entries corresponding to the q largest max-CUSUMs contain the estimators of all q change points θ j and further, the localisation rate attained by θ j is minimax optimal up to a logarithmic factor (see e.g. Verzelen et al. (2020) ). Statement (ii) shows that the q largest log-CUSUMs are of order log(n) and are thus distinguished from the rest of the log-CUSUMs bounded as O(log log(n)). In summary, Theorem 2.1 establishes that the sequence of nested change point models (4) contains the consistent model Θ[q] as a candidate model provided that M is sufficiently large. We emphasise that Theorem 2.1 is not (yet) a full consistency result for our complete change point estimation procedure -this will be the objective of Section 3. Theorem 2.1 merely indicates that the solution path we obtain contains the correctly estimated model, hence it is in principle possible to extract it with the right model selection tool. Section 3 proposes such a tool. In this section, we discuss how to consistently estimate the number and the locations of change points by choosing an appropriate model from the sequence of nested change point models (4). We propose a new backward elimination-type procedure, referred to as 'gappy Schwarz algorithm' (gSa), which makes use of the Schwarz information criterion constructed under a parametric assumption imposing an AR structure on {Z t } n t=1 . The novelty of gSa is in the new way in which it applies Schwarz criterion, rather than in the formulation of the information criterion itself. We show the usefulness of gSa when change point model selection is performed simultaneously with the estimation of the serial dependence. We assume that {Z t } n t=1 in (1) is a stationary AR process of order p, i.e. where satisfy E(ε t ) = 0 and Var(ε t ) = σ 2 ε ∈ (0, ∞), and are assumed to have no serial correlations; further assumptions on {ε t } n t=1 are made in Assumption 3.1. We denote by µ • j := (1 − p i=1 a i )f θ j +1 the effective mean level over each interval θ j + p + 1 ≤ t ≤ θ j+1 , for j = 0, . . . , q, and by d j = µ • j − µ • j−1 the effective size of change correspondingly. Also recall that δ j = min(θ j − θ j−1 , θ j+1 − θ j ). In the model selection procedure, we do not assume that the AR order p is known, and its data-driven choice is incorporated into the model selection methodology as described later. 
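As a small worked example of the effective quantities just defined: with hypothetical AR(2) coefficients and segment levels (chosen only for illustration), the effective levels and change sizes follow directly from mu_j = (1 - sum_i a_i) f_{theta_j + 1}.

```r
a  <- c(0.4, 0.2)            # hypothetical AR coefficients a_1, a_2
f  <- c(0, 2, -1)            # segment mean levels f_{theta_j + 1}, j = 0, 1, 2
mu <- (1 - sum(a)) * f       # effective levels mu_j = (1 - sum(a_i)) f_{theta_j + 1}
d  <- diff(mu)               # effective change sizes d_j = mu_j - mu_{j-1}
# here d = 0.4 * diff(f) = c(0.8, -1.2): the same order of magnitude as the raw jumps
```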
For now, suppose that it is set to be some integer r ≥ 0, and that a change point model is given by a set of change point candidates A = {k j , 1 ≤ j ≤ m : k 1 < . . . < k m } ⊂ {1, . . . , n}. Then, Schwarz criterion (Schwarz, 1978) is defined as where σ 2 n ({X t } n t=1 , A, r) denotes a measure of goodness-of-fit (its precise definition is given below), and a penalty is imposed on the model complexity determined by the AR order and the number of change points; the requirement on the penalty parameter ξ n in relation to the distribution of {ε t } is discussed in Assumption 3.4. We adopt the residual sum of squares as σ 2 n ({X t } n t=1 , A, r), i.e. For notational convenience, we assume that X 0 , . . . , X −r+1 are available and their means remain constant such that E(X t ) = E(X 1 ) for t ≤ 0; in practice, we can simply omit the first p max observations when constructing Y and X above, where p max denotes a pre-specified upper bound on the AR order. The matrix X is divided into the AR part contained in L(r) and the deterministic part in R(A) under (6). We propose to obtain the estimator of regression parameters denoted by β = β(A, r) = ( α(r) , µ(A) ) via least squares estimation, where α(r) ∈ R r denotes the estimator of the AR parameters and µ(A) ∈ R |A|+1 that of the segment-specific levels. We select the typitcally unknown AR order p as follows: AR models of varying orders r ∈ {0, . . . , p max }, are fitted to the data from which we estimate p by We address the consistency of the estimator p(A) in our theoretical analysis. We first narrow down the model selection problem to that of determining between a given change point model A and the null model without any change points. Suppose that the number and locations of mean shifts are consistently estimated by (a subset of) A in the sense made clear in Assumption 3.2 below, which includes the case of no change point (q = 0) with the trivial subset ∅ ⊂ A. Then, the estimator β(A, p) = ( α( p) , µ(A) ) can be shown to estimate the AR parameters sufficiently well with p = p(A) returned by (9), and the criterion SC({X t } n t=1 , A, p) gives a suitable indicator of the goodnessof-fit of the change point model A offset by the increased model complexity. On the other hand, if any change point is ignored in fitting an AR model, the resultant AR parameter estimators over-compensate for the under-specification of mean shifts. In our numerical experiments (reported in Appendix D.3), this often leads to SC({X t } n t=1 , ∅, p(∅)) having a smaller value than SC({X t } n t=1 , A, p) such that their direct comparison returns the null model even though there are multiple change points present and detected by A. where I − Π 1 denotes the projection matrix removing the sample mean from the rightmultiplied vector. By having the plug-in estimator α( p) from β(A, p) in its definition, SC 0 avoids the above-mentioned difficulty arising when evaluating Schwarz criterion at a model that under-specifies the change points. We conclude that the data is better described by the and if the converse holds, we prefer the null model over the change point model. This Schwarz criterion-based model selection strategy is extended to be applicable with a sequence of nested change point models ∅ = Θ 0 ⊂ Θ 1 ⊂ . . . ⊂ Θ M as in (4) even when M > 1. 
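The following R sketch spells out one way of evaluating the criterion and its null counterpart SC_0. The helper names are ours; the (n/2) log-scale, the parameter count (r + |A| + 1) in the penalty and the dropping of the first p_max observations reflect our reading of (7)-(10) and may differ in constants from the authors' implementation.

```r
# Sketch of the Schwarz-type criterion for a candidate set A and AR order r.
schwarz_fit <- function(x, A, r, xi, pmax = 10) {
  stopifnot(length(x) > pmax + 1, r <= pmax)
  n   <- length(x); t0 <- pmax + 1
  y   <- x[t0:n]
  L   <- if (r > 0) sapply(1:r, function(i) x[(t0 - i):(n - i)]) else NULL  # AR part
  seg <- findInterval(t0:n, c(0, sort(A), n) + 1)   # segment labels induced by A
  R   <- model.matrix(~ 0 + factor(seg))            # piecewise-constant levels
  fit <- lm.fit(cbind(L, R), y)
  rss <- sum(fit$residuals^2)
  list(sc = (length(y) / 2) * log(rss / length(y)) + (r + length(A) + 1) * xi,
       ar = if (r > 0) unname(fit$coefficients[1:r]) else numeric(0))
}

# SC_0 as in (10): plug in the AR estimates from the change point fit,
# remove the sample mean (the projection I - Pi_1), penalise r + 1 parameters.
schwarz_null <- function(x, alpha_hat, xi, pmax = 10) {
  r  <- length(alpha_hat); n <- length(x); t0 <- pmax + 1
  y  <- x[t0:n]
  if (r > 0) {
    L <- sapply(1:r, function(i) x[(t0 - i):(n - i)])
    y <- y - as.numeric(L %*% alpha_hat)
  }
  y <- y - mean(y)
  (length(y) / 2) * log(sum(y^2) / length(y)) + (r + 1) * xi
}

# AR order chosen as in (9): minimise the criterion over r = 0, ..., pmax.
best_order <- function(x, A, xi, pmax = 10)
  which.min(sapply(0:pmax, function(r) schwarz_fit(x, A, r, xi, pmax)$sc)) - 1
```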
Referred to as the gappy Schwarz algorithm (gSa) in the remainder of the paper, the proposed methodology performs a backward search along the sequence from the largest model Θ l with l = M , sequentially evaluating whether the reduction in the goodness-of-fit (i.e. increase in the residual sum of squares) by moving from Θ l to Θ l−1 , is sufficiently offset by the decrease in model complexity. More specifically, let s, e ∈ Θ l−1 ∪ {0, n} denote two candidates satisfying In other words, A contains candidate estimators detected within the local environment {s + 1, . . . , e − 1}, which appear in Θ l but do not appear in the smaller models Θ l , l ≤ l − 1. Then, we compare SC({X t } e t=s+1 , A, p s:e ) against SC 0 ({X t } e t=s+1 , α s:e ( p s:e )) as in (10), with the least squares estimator of the AR parameters α s:e ( p s:e ) and its dimension p s:e obtained locally by minimising SC({X t } e t=s+1 , A, r) over r (see (9)). If SC({X t } e t=s+1 , A, p s:e ) < SC 0 ({X t } e t=s+1 , α s:e ( p s:e )), the change point estimators in A are deemed as not being spurious; if this is the case for all estimators in Θ l \ Θ l−1 , we return Θ l as the final model. In our theoretical analysis, when q ≥ 1, we assume that there exists some 1 ≤ l * ≤ M such that Θ l * correctly detects all change points and nothing else (see Assumption 3.2 below), which is guaranteed by the model sequence generation method described in Section 2. Then with high probability, we have SC({X t } e t=s+1 , A, p s:e ) < SC 0 ({X t } e t=s+1 , α s:e ( p s:e )) simultaneously in all local regions {s + 1, . . . , e} overlapping with Θ l * \ Θ l * −1 while when l > l * , we have SC({X t } e t=s+1 , A, p s:e ) ≥ SC 0 ({X t } e t=s+1 , α s:e ( p s:e )) in all such regions. Therefore, sequentially examining the nested change point models from the largest model Θ M , gSa is expected to return Θ l * as the final model. In its implementation, in the unlikely event of disagreement across the regions containing Θ l \ Θ l−1 , we take a conservative approach and conclude that Θ l contains spurious estimators, and update l → l − 1 to repeat the same procedure until some Θ l , l ≥ 1, is selected as the final model, or the null model Θ 0 = ∅ is reached. The full algorithmic description of gSa is provided in Appendix A.2. In summary, gSa does not directly minimise Schwarz criterion but starting from the largest model, searches for the first largest model Θ l in which all candidate estimators in Θ l \ Θ l−1 are deemed important as described above. By adopting SC 0 for model comparison, it avoids evaluating Schwarz criterion at a model that under-estimates the number of change points (which may lead to loss of power) and achieves model selection consistency as shown in the next section. For the theoretical analysis of gSa, we make a set of assumptions and remark on their relationship to those made in Section 2.3. Assumption 3.1 is imposed on the stochastic part of model (6). Assumption 3.1. (i) The characteristic polynomial a(z) = 1 − p i=1 a i z i has all of its roots outside the unit circle |z| = 1. (ii) {ε t } is an ergodic and stationary martingale difference sequence with respect to an increasing sequence of σ-fields F t , such that ε t and X t are F t -measurable and E(ε t |F t−1 ) = 0. Assumption 3.1 (i)-(iii) are taken from Wei (1982a,b, 1983) , where the strong consistency in stochastic regression problems is established. In particular, Assumption 3.1 (i) indicates that {Z t } n t=1 is a short-memory linear process. 
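A schematic R implementation of the backward search just described, reusing the helpers sketched after (7)-(10). Boundary effects, very short segments and the conservative treatment of mixed verdicts in Appendix A.2 are handled only loosely here; the sketch is meant to convey the structure of the comparisons, not to reproduce the authors' code.

```r
gsa_select <- function(x, models, xi, pmax = 10) {
  for (l in rev(seq_along(models))) {
    prev   <- if (l > 1) models[[l - 1]] else integer(0)
    bounds <- c(0, prev, length(x))
    keep   <- TRUE
    for (v in seq_len(length(bounds) - 1)) {
      s <- bounds[v]; e <- bounds[v + 1]
      A <- setdiff(models[[l]], prev)
      A <- A[A > s & A < e]                     # new candidates inside this segment
      if (length(A) == 0) next
      xs  <- x[(s + 1):e]
      r   <- best_order(xs, A - s, xi, pmax)    # local AR order as in (9)
      cp  <- schwarz_fit(xs, A - s, r, xi, pmax)
      sc0 <- schwarz_null(xs, cp$ar, xi, pmax)
      if (cp$sc >= sc0) { keep <- FALSE; break }   # new candidates judged spurious
    }
    if (keep) return(models[[l]])               # all new candidates survive: stop here
  }
  integer(0)                                    # null model: no change points
}
```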
The bound in Assumption 3.1 (iv) is related to the detectability of change points, and gives a lower bound on the penalty parameter ξ n of Schwarz criterion, see Assumption 3.4. Theorem 1.2A of De la Peña (1999) derives a Bernstein-type inequality for a martingale difference sequence satisfying E(|ε t | k ) ≤ (k!/2)c k ε E(ε 2 t ) for all k ≥ 3 and some c ε ∈ (0, ∞), from which we readily obtain ω n log(n). Under a more stringent condition that {ε t } is a sequence of i.i.d. sub-Gaussian random variables, it suffices to set ω n log(n) (e.g. see Proposition 2.1 (a) of Cho and Kirch (2021)); Remark 3.1 (Links between Assumptions 2.1, 2.2 and 3.1). Assumption 2.1 does not impose any parametric condition on the dependence structure of {Z t } n t=1 . For linear, short memory processes (implied by Assumption 3.1 (i)), Peligrad and Utev (2006) show that the invariance principle for the linear process is inherited from that of the innovations. Then, as discussed for some κ ∈ [1, κ), which in turn leads to ζ n ω n . In view of Assumptions 2.1 and 2.2, the condition that ω 2 n = O(min 1≤j≤q δ j ) is a mild one. We impose the following assumption on the nested model sequence Assumption 3.2. Let M n denote the following event: for a given penalty ξ n , we have for some ρ n satisfying (min 1≤j≤q d 2 j δ j ) −1 ρ n → 0. Then, P(M n ) → 1. By Theorem 2.1, we have the condition (11) satisfied by the gappy model sequence generated as in (4) with ρ n ζ 2 n . We state this result as an assumption so that if gSa were to be applied with an alternative solution path algorithm, our results would be directly applicable if the latter satisfied Assumption 3.2. Since the serial dependence structure is learned from the data by fitting an AR model to each segment, the requirement on the minimum spacing of the largest model Θ M is a natural one and it can be hard-wired into the solution path generation step. Assumption 3.3 is on the effective size of changes under (6), and Assumption 3.4 on the choice of the penalty parameter ξ n . In particular, the choice of ξ n connects the detectability of change points with the level of noise remaining in the data after accounting for the autoregressive dependence structure. Assumption 3.3. max 1≤j≤q |d j | = O(1) and D n := min 1≤j≤q d 2 j δ j → ∞ as n → ∞. Assumption 3.4. ξ n satisfies D −1 n ξ n = o(1) and ξ −1 n max(ω 2 n , ρ n ) = o(1). By Assumption 3.1 (i), the effective change size d j is of the same order as f j since d j = (1 − p i=1 a i )f j . Therefore, Assumption 3.3 on the detection lower bound formulated with d j , together with Assumption 3.4, is closely related to Assumption 2.2 formulated with f j . In fact, we can select ξ n such that Assumption 3.4 follows immediately from Assumption 2.2, recalling that the rate of localisation attained by the latter is ρ n ζ 2 n and ω n = O(ζ n ). for n large enough. We generate 1000 realisations under each setting where ε t ∼ iid N (0, 1). In addition to when f t undergoes mean shifts as described below, we also consider the case where f t = 0 to evaluate the size control performance. (M1) f t undergoes q = 5 change points at (θ 1 , θ 2 , θ 3 , θ 4 , θ 5 ) = (100, 300, 500, 550, 750) with (M2) f t undergoes q = 5 change points θ j as in (M1) with n = 1000 and (f 0 , f 1 , f 2 , f 3 , f 4 , f 5 ) = (0, 5, −3, 6, −7, −3), and {Z t } follows an ARMA(2, 6) model: (M3) f t undergoes q = 15 change points at θ j = nj/16 , j = 1, . . . 
, 15 with n = 2000, where the level parameters f θ j +1 are generated uniformly as Table 1 summarises the simulation results; see Table D .1 in Appendix for the full results where the exact definitions of RMSE and d H can be found. Overall, across the various scenarios, WCM.gSa performs well both when q = 0 and q ≥ 1. In particular, the proportion of the realisations where WCM.gSa detects spurious estimators in the absence of any mean shift is close to 0. Controlling for the size, especially in the presence of serial correlations, is a difficult task and as shown below, competing methods fail to do so by a large margin in some scenarios. When q ≥ 1, WCM.gSa performs well in most scenarios according to a variety of criteria, such as model selection accuracy measured by | q − q| or the localisation accuracy measured by d H . We highlight the importance of the gappy model sequence generation step of Section 2.2: see the results reported under 'no gap' which refers to a procedure that omits this step from WCM.gSa and applies the Schwarz criterion-based model selection procedure directly to the model sequence consisting of consecutive entries from the WBS2-generated solution path. It suffers from having to perform a large number of model comparison steps and tends to over-estimate the number of change points in some scenarios. DepSMUCE occasionally suffers from a calibration issue; in order not to detect spurious change points, it requires α to be set conservatively but for improved detection power, a larger α is better. In addition, the estimator of the LRV proposed therein tends to under-estimate the LRV when it is close to zero as in (M1), or when there are strong autocorrelations as in (M3), thus incurring a large number of falsely detected change points. Similar sensitivity to the choice of α is observable from SNCP. In addition, it tends to return spurious change point estimators when q = 0 in the presence of strong autocorrelations as in (M3), while under-detecting change points generally when q ≥ 1 with the exception of (M1). DeCAFS operates under the assumption that {Z t } n t=1 is an AR(1) process. Therefore, it is applied under model mis-specification in some scenarios, but still performs reasonably well in not returning false positives. The exception is (M3) where, in the presence of strong autocorrelations, it returns spurious estimators over 50% of realisations even though the model is correctly specified in this scenario. Its detection accuracy suffers under model mis-specification in some scenarios such as (M1) and (M2) when compared to WCM.gSa, but DeCAFS tends to attain good MSE. NO x is a generic term for the nitrogen oxides that are the most relevant for air pollution, namely nitric oxide (NO) and nitrogen dioxide (NO 2 ). The main anthropogenic sources of NO x are mobile and stationary combustion sources, and its acute and chronic health effects Table 1 : (M1)-(M11): We report the proportion of returning q ≥ 1 when q = 0 (size) and the summary of estimated change points when q > 1 according to the distribution of q − q, relative MSE (RMSE) and the Hausdorff distance (d H ) over 1000 realisations. Methods that control the size at 0.05, and that achieve the best performance when q > 1 according to different criteria, are highlighted in bold for each scenario. where it is also seen that the thus-transformed data exhibit persistent autocorrelations. 
We analyse the transformed time series from NO 2 and NO x concentrations for change points in the level, with the tuning parameters for WCM.gSa chosen as recommended in Appendix C apart from M , the number of candidate models considered; given the large number of observations (n = 7139), we allow for M = 10 instead of the default choice M = 5. The change points detected by WCM.gSa are plotted in Figure 1 . For comparison, we also report the change points estimated by DepSMUCE and DeCAFS, see Table 2 . Figure 1 shows that a good deal of autocorrelations remain in the data after removing the estimated mean shifts, but the persistent autocorrelations are no longer observed. This supports the hypothesis that the (de-trended and transformed) NO 2 and NO x concentrations over the period in consideration, can plausibly be accounted for by a model with short-range dependence and multiple mean shifts; we refer to Mikosch and Stărică (2004) , Berkes et al. (2006) Yau and Davis (2012) and Norwood and Killick (2018) for discussions on how weakly dependent time series with mean shifts may appear as a long-range dependent time series. In Appendix E.2, we further validate the set of change point estimators detected by WCM.gSa from the NO 2 time series, by attempting to remove the bulk of serial dependence from the data and then applying an existing procedure for change point detection for uncorrelated data. Algorithm 1 provides a pseudo code for the Wild Binary Segmentation 2 (WBS2) algorithm proposed in Fryzlewicz (2020a) . We remark that WBS2 as defined in Fryzlewicz (2020a) For each l ≥ 1, we denote Θ l = { θ l,j , 1 ≤ j ≤ q l : θ l,1 < . . . < θ l, q l }, and adopt the notational convention that θ l,0 = 0 and θ l, q l +1 = n. Initialised with l = M , gSa performs the following steps. Step 1: We identify u ∈ {0, . . . , q l−1 } with { θ l−1,u + 1, . . . , θ l−1,u+1 − 1} ∩ Θ l = ∅; that is, the segment { θ l−1,u + 1, . . . , θ l−1,u+1 − 1} defined by the consecutive elements of Θ l−1 , has additional change points detected in Θ l such that { θ l−1,u + 1, . . . , θ l−1,u+1 − 1} ∩ ( Θ l \ Θ l−1 ) = ∅. By construction, the set of such indices, I l := {u 1 , . . . , u q l }, satisfies |I l | ≥ 1. For each u v , v = 1, . . . , q l , we repeat the following steps with a logical vector of length q l , F ∈ {TRUE, FALSE} q l , initialised as F = (TRUE. . . . , TRUE). Step 1.1: Setting A = { θ l−1,uv + 1, . . . , θ l−1,uv+1 − 1} ∩ Θ l , obtain p that returns the , A, r) over r ∈ {0, . . . , p max } as outlined in (9), and the corresponding AR parameter estimator α( p) via least squares estimation. Step Step 2: If some elements of F satisfy F v = TRUE and l > 1, update l ← l − 1 and go to Step 1. If F v = FALSE for all v = 1, . . . , q l , return Θ l as the set of change point estimators. Otherwise, return Θ 0 = ∅. Theorem 3.1 shows that we have either F v = FALSE for all v = 1, . . . , q l when the corresponding Θ l = Θ l * (see Assumption 3.2 for the definition of Θ l * ), or F v = TRUE for all v when l > l * and thus all Θ l \ Θ l−1 are spurious estimators. In implementing the methodology, we take a conservative approach in the above Step 2, to guard against the unlikely event where the output F contains mixed results. 
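Putting the sketches together, a hypothetical end-to-end run on the toy series simulated earlier might look as follows. This is our own assembly of the illustrative helpers, not a call to the wcm.gsa package; M = 5 and the penalty log^{1.1}(n) are among the choices mentioned in the text, while pmax = 10 is an arbitrary illustrative bound.

```r
n      <- length(x)
path   <- wbs2_path(x, R = 500)
models <- gappy_models(path, M = 5)
cps    <- gsa_select(x, models, xi = log(n)^1.1, pmax = 10)
cps    # estimated change point locations (empty if the null model is selected)
```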
Throughout this section, we condition on the event that Θ[q] is chosen at the model selection step, and discuss how the location estimators can further be refined; consistent model selection based on the estimators of change point locations returned directly by WBS2 (without any additional refinement), is discussed in Section 3. By Theorem 2.1 and Assumption 2.2, each θ j , 1 ≤ j ≤ q, is sufficiently close to the corresponding change point θ j in the sense that | θ j −θ j | ≤ (f j ) −2 ρ n ≤ cδ j for some c ∈ (0, 1/6) with probability tending to one, for n large enough. Defining 1 = 0, r q = n, j = 2 3 θ j−1 + 1 3 θ j , j = 2, . . . , q, and r j = 1 3 θ j + 2 3 θ j+1 , j = 1, . . . , q − 1, we have each interval ( j , r j ) sufficiently large and contain a single change point θ j well within its interior, i.e. Then, we propose to further refine the location estimator θ j byθ j = arg max j θ 5 and σ(t) = We generate 1000 realisations under each model. For each scenario, we additionally consider the case in which f t ≡ 0 (thus q = 0) in order to evaluate the proposed methodology on its size control. On each realisation, we apply the proposed WCM.gSa with the tuning parameters are selected as described in Section C. For comparison, we consider a procedure that omits the gappy model sequence generation step from WCM.gSa: referred to as 'no gap', it applies the SC-based model selection procedure directly to the model sequence consisting of consecutive entries from the WBS2-generated solution path. We include DepSMUCE (Dette et al., 2020), DeCAFS (Romano et al., 2021) , MACE (Wu and Zhou, 2020) and SNCP (Zhao et al., 2021) in the simulation studies. DepSMUCE extends the SMUCE procedure (Frick et al., 2014) proposed for independent data, by estimating the LRV using a difference-type estimator. MACE is a multiscale moving sum-based procedure with self-normalisation-based scaling that accounts for serial correlations. SNCP is a time series segmentation methodology that combines self-normalisation and a nested local window-based algorithm, and is applicable to detect multiple change points in a broad class of parameters. Although not its primary objective, DeCAFS can be adopted for the problem of detecting multiple change points in the mean of an otherwise stationary AR(1) process, and we adapt the main routine of its R implementation (Romano et al., 2020) to change point analysis under (1) as suggested by the authors. For DepSMUCE and MACE, we consider α ∈ {0.05, 0.2} and for SNCP, α ∈ {0.01, 0.05, 0.1} as per the codes provided by the authors. MACE requires the selection of the minimum and the maximum bandwidths in the rescaled time [0, 1] and moreover, the latter, say s max , controls the maximum detectable number of change points to be (2s max ) −1 ; we set s max = min(1/(3q), n −1/6 ) for fair comparison, which varies from one model to another. Other tuning parameters not mentioned here are chosen as recommended by the authors. where f t is the piecewise constant signal constructed with the set of estimated change point locations Θ, and f * t is an oracle estimator constructed with the true θ j , and the Hausdorff distance (d H ) between Θ and Θ: . Without the gappy model sequence generation step, the procedure suffers from having to perform a large number of model comparison steps, and the 'no gap' procedure tends to over-estimate the number of change points when q is large, or in the presence of mild nonstationarities in the noise. 
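For completeness, the two evaluation metrics used in the reported tables can be sketched as follows; the exact normalisation of the relative MSE (which in the paper is taken relative to an oracle fit using the true change points) and of d_H may differ from what is shown here.

```r
# Hausdorff distance between estimated and true change point sets.
hausdorff <- function(est, truth) {
  if (length(est) == 0 || length(truth) == 0) return(NA_real_)
  d1 <- max(sapply(est,   function(k) min(abs(k - truth))))
  d2 <- max(sapply(truth, function(t) min(abs(t - est))))
  max(d1, d2)
}

# MSE of the piecewise-constant fit induced by a set of change points cps;
# the relative MSE divides by the same quantity computed with the true theta_j.
signal_mse <- function(x, cps) {
  seg <- findInterval(seq_along(x), c(0, sort(cps), length(x)) + 1)
  mean((x - ave(x, seg))^2)
}
```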
From this, we conclude that the gappy model sequence generation step plays an important role in final model selection by removing those candidate models that are not likely to be the one correctly detecting all change points from consideration. DepSMUCE performs well for short series (see (M6)) or in the presence of weak serial correlations as in (M8), but generally suffers from a calibration issue. That is, in order not to detect spurious change points under H 0 , it requires the tuning parameter to be set conservatively at α = 0.05; however, for improved detection power, α = 0.2 is a better choice. In addition, the estimator of the LRV proposed therein tends to under-estimate the LRV when it is close to zero as in (M1) DeCAFS operates under the assumption that {Z t } n t=1 is an AR(1) process. Therefore, it is applied under model mis-specification in some scenarios, but still performs reasonably well in not returning false positives under H 0 . The exception is (M3) where, in the presence of strong autocorrelations, it returns spurious estimators over 50% of realisations even though the model is correctly specified in this scenario. Its detection power suffers under model mis-specification in some scenarios such as (M2) and (M9) when compared to WCM.gSa, but DeCAFS tends to attain good MSE. MACE suffers from both size inflation and lack of power, possibly due to its sensitivity to choice of some tuning parameters such as the bandwidths. Table D .1: We report the proportion of rejecting H 0 (by returning q ≥ 1) under H 0 : q = 0 (size) and the summary of estimated change points under H 1 : q > 1 according to the distribution of q − q, relative MSE and the Hausdorff distance (d H ) over 1000 realisations. Methods that control the size under H 0 (according to the specified α for DepSMUCE, MACE and SNCP, and at 0.05 for WCM.gSa and DeCAFS), and that achieve the best performance under H 1 according to different criteria, are highlighted in bold for each scenario. If any change point is ignored in fitting an AR model, the information criterion SC tends to over-compensate for the under-specification of mean shifts, which makes direct minimisation of SC unreliable as a model selection method. To illustrate this and motivate the use of SC 0 in gSa, we present a simulation study with datasets generated under the models (M9) and (M11) in Section D.1. Here, our aim is to compare a change point model Θ 1 (correctly detecting all q change points) and the null model Θ 0 = ∅ using two different approaches -one adopted in gSa comparing SC 0 ({X t } n t=1 , α( p)) and SC({X t } n t=1 , Θ 1 , p) with p = p( Θ 1 ) ('Method 1'), and the other selecting the model minimising SC by comparing SC({X t } n t=1 , Θ 0 , p( Θ 0 )) and SC({X t } n t=1 , Θ 1 , p) ('Method 2'). In both scenarios, the errors do not follow an AR model of a finite order so we select p( Θ 0 ) and p( Θ 1 ) as described in (9). For the choice of Θ 1 , we consider the no bias case Θ 1 = {θ j , 1 ≤ j ≤ q} and the biased case Θ 1 = {θ j + s j · λ j , 1 ≤ j ≤ q}, where s j ∼ iid Uniform{−1, 1} and λ j ∼ iid Poisson(5); the latter case reflects that the best localisation rate in change point problems is O p (1). The result is summarised in Table D E.2 Validating the number of change points detected from the NO 2 time series Table 2 in the main paper shows a considerable variation in the number of detected change points in the NO 2 time series between the competing methods. 
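The comparison of Appendix D.3 above can be sketched with the helpers introduced earlier: Method 1 (used by gSa) plugs the AR estimates from the change point fit into SC_0, while Method 2 compares the two SC values directly. The perturbation function mirrors the 'biased' candidate construction with s_j ~ Uniform{-1, 1} and lambda_j ~ Poisson(5); as elsewhere, this is an illustrative sketch rather than the simulation code used in the paper.

```r
compare_methods <- function(x, theta1, xi, pmax = 10) {
  r1  <- best_order(x, theta1, xi, pmax)
  cp  <- schwarz_fit(x, theta1, r1, xi, pmax)
  r0  <- best_order(x, integer(0), xi, pmax)
  nul <- schwarz_fit(x, integer(0), r0, xi, pmax)
  c(method1_picks_cp = cp$sc < schwarz_null(x, cp$ar, xi, pmax),
    method2_picks_cp = cp$sc < nul$sc)
}

# 'biased' candidate set: true locations perturbed by +/- Poisson(5) noise
perturb <- function(theta)
  theta + sample(c(-1, 1), length(theta), replace = TRUE) * rpois(length(theta), 5)
```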
To run an independent check on the number of change points, we first remove the bulk of the serial dependence in the data by fitting an AR(1) model and working with the empirical residuals from this fit. For this, we set the AR coefficient to 0.5, as suggested by the sample autocorrelation function. On these residuals, we perform change point detection using a method suitable for multiple level-shift detection under serially uncorrelated noise. The method we use is the IDetect technique with the information-criterion-based model selection, as implemented in the R package breakfast. The reason for the selec- This, in our view, represents very good agreement on the whole, especially given that the two methods are entirely different in nature and worked with different time series on input. This result further enhances our confidence in the output of WCM.gSa for this dataset. The Hadley Centre central England temperature (HadCET) dataset (Parker et al., 1992) contains the mean, maximum and minimum daily and monthly temperatures representative of a roughly triangular area enclosed by Lancashire, London and Bristol, UK. We analyse the yearly averages of the monthly mean, maximum and minimum temperatures up to 2019 for change points using the proposed WCM.gSa methodology. The mean monthly data dates back to 1659, while the maximum and the minimum monthly data begin in 1878; we focus on the period 1878-2019 (n = 142) for all three time series. To take into account that the time series are relatively short, we set p_max = 5 (the maximum allowable AR order) for WCM.gSa and the minimum spacing to be 10 (i.e. no change points occur within 10 years of one another), while the rest of the parameters are chosen as recommended in Section C; the results are invariant to the choice of the penalty ξ_n ∈ {log^{1.01}(n), log^{1.1}(n)}. (Reid et al., 2016), which is attributed to anthropogenic warming and a volcanic eruption.

Change points detected from the yearly mean, maximum and minimum temperature series:

Method            Mean          Maximum       Minimum
WCM.gSa           1892, 1988    1892, 1988    1892, 1987
DepSMUCE(0.05)    1987          1988          1956
DepSMUCE(0.2)     1892, 1988    1988          1892, 1987
DeCAFS            1892, 1988    1892, 1988    1892, 1987

[Figure: yearly averages of the monthly mean, maximum and minimum HadCET temperatures, 1878-2019.]

F Proofs

For any square matrix B ∈ R^{p×p}, let λ_max(B) and λ_min(B) denote the maximum and the minimum eigenvalues of B, respectively, and we define the operator norm ‖B‖ = √λ_max(B⊤B). Let 1 denote a vector of ones, 0 a vector of zeros and I an identity matrix whose dimensions are determined by the context. The projection matrix onto the column space of a given matrix A is denoted by Π_A = A(A⊤A)^{-1}A⊤, provided that A⊤A is invertible. Throughout the proofs, we work under the following non-asymptotic bound for some K > 0, which holds for all n ≥ n(K) for some large enough n(K), and which replaces the asymptotic condition in Assumptions 2.2 and (5). The o-notation always refers to K in (F.1) being large enough, which in turn follows for large enough n.

Lemma F.1 (..., 2021). For max(s, θ_{j-1}) < k < θ_j < min(e, θ_{j+1}), it holds that where a_+ = a · I(a ≥ 0). Similarly, for max(s, θ_{j-1}) < θ_j ≤ k < min(e, θ_{j+1}), it holds that

Lemma F.2 (Lemma 2.2 of Venkatraman (1992); Lemma 8 of Wang and Samworth (2018)). For some 0 ≤ s < e ≤ n with e − s > 1, let Θ ∩ [s, e] = {θ•_1, . . . , θ•_m} with m ≤ q, and we adopt the notations θ•_0 = s and θ•_{m+1} = e. If the series F_{s,k,e} is not constantly zero for θ•_j + 1 ≤ k ≤ θ•_{j+1} for some j = 0, . . .
, m, one of the following is true: does not change sign and has strictly increasing absolute values, does not change sign and has strictly decreasing absolute values, (iii) 1 ≤ j ≤ m − 1 and F s,k,e , θ • j + 1 ≤ k ≤ θ • j+1 is strictly monotonic, (iv) 1 ≤ j ≤ m − 1 and F s,k,e , θ • j + 1 ≤ k ≤ θ • j+1 does not change sign and its absolute values are strictly decreasing then strictly increasing. Throughout the proofs, C 0 , C 1 , . . . denote some positive constants. We define the following intervals for each j = 0, . . . , q n , Let (s, e) denote an interval considered at some iteration of the WBS2 algorithm. By construction, the minimum length of the interval obtained by deterministic sampling is given by (e − s)/ K , where K satisfies R n ≤ K( K + 1)/2. Then, R s,e drawn by the deterministic sampling contains at least one interval ( m(j) , r m(j) ) satisfying m(j) ∈ I L,j and r m(j) ∈ I R,j for any θ j ∈ Θ ∩ (s, e) (if Θ ∩ (s, e) is not empty), provided that 3 (e − s)/ K ≤ 2 min 1≤j≤q δ j . This condition in turn is met under (5). Then, it follows from the proof of Proposition B.1 of Cho and Kirch (2021) that there exists a permutation {π(1), . . . , π(q)} of {1, . . . , q} such that for j = 1, . . . , q, by (F.1). From (F.2), the assertion in (i) follows readily. Also consequently, the intervals (s (m) , e (m) ), m = q + 1, . . . , n − 1 meet one of the followings: for some j = 1, . . . , q. Under (a), from Assumption 2.1, , e (m) − θ j ) + 2ζ n ≤ √ ρ n + 2ζ n ≤ C 2 ζ n (F.5) by Lemma F.1; the case when θ j > k (m) is handled analogously. Under (c), we obtain where the first inequality follows from Lemma F.2 and the second inequality from Lemma F.1. From (F.3) and (F.4)-(F.6), and also that X (1) ≤ C 4 √ n due to f j = O(1), we conclude that where {γ m } and {κ m } meet the conditions in (ii). We adopt the following notations throughout the proof: For a fixed integer r ≥ 1 and an arbitrary set A = {k 1 , · · · , k m } ⊂ {1, . . . , n} satisfying min 0≤j≤m (k j+1 − k j ) ≥ r + 1 (with k 0 = 0 and k m+1 = n), we define X = X(A, r) = [L : R] and Y as in (8). Also we set X (j) = [L (j) : 1] for each j = 0, . . . , m, where L (j) has x t = (X t , . . . , X t−r+1 ) , k j ≤ t ≤ k j+1 − 1 as its rows. Sub-vectors of Y and ε corresponding to k j ≤ t ≤ k j+1 − 1 are denoted by Y (j) and ε (j) , respectively. When r = 0, we have X = R and X (j) = R (j) , Besides, we denote the (approximate) linear regression representation of (6) with the true change point locations θ j and AR order p by Correspondingly, X • denotes an n × (p + q + 1)-matrix with its rows given by for 1 ≤ t ≤ n, whereby X • ≡ X(Θ, p). When p = 0, the matrix L • is empty. The following results are frequently used throughout the proof. (8), and also X (j) , L (j) , R (j) and ε (j) , correspondingly, and let N j = k j+1 − k j . Then, under Assumption 3.1 (i)-(iii) and Assumption 3.2, we have the followings hold almost surely for all j = 0, . . . , m and A ⊂ Θ M : otherwise. Therefore, which concludes the proof. Throughout the proofs, C 0 , C 1 , . . . denote some positive constants. In what follows, we operate in E n ∩ M n , and all big-O notations imply that they hold a.s. due to Proposition F.3. We briefly sketch the proof, which proceeds in four steps (i)-(iv) below. We first suppose that Assumption 3.2 holds with M = 1, and also that p is known. 
Then, a single iteration of the gSa algorithm in Section A.2 boils down to choosing between Θ 0 = ∅ and Θ 1 : If SC({X t } n t=1 , Θ 1 , p) < SC 0 ({X t } n t=1 , α(p)), we favour a change point model; if not, we conclude that there is no change point in the data. In (i), when q = 0, we show that a i )f 0 representing the time-invariant overall level, and therefore Y − X β 2 ≈ (I − Π 1 )(Y − L α) 2 which leads to SC 0 ({X t } n t=1 , α(p)) < SC({X t } n t=1 , Θ 1 , p) under Assumption 3.4. In (ii), when q ≥ 1, we show that for some fixed constant C > 0 and thus SC 0 ({X t } n t=1 , α(p)) > SC({X t } n t=1 , Θ 1 , p), provided that Θ 1 meets (11). In (iii), we show the consistency of the proposed order selection scheme. For the general case where M > 1, in (iv), we can repeatedly apply the above arguments for each call of Step 1 of the gSa algorithm: Under Assumption 3.2, when l > l * , any θ l,j / ∈ Θ l * are spurious estimators and thus we have the gSa algorithm proceed to examine Θ l−1 ; when l = l * , any θ l * ,j / ∈ Θ l * −1 are detecting those change points undetected in Θ l * −1 and thus the gSa algorithm returns Θ l * . As outlined above, in the following (i)-(iii), we only consider the case of M = 1 and consequently drop the subscript '1' from Θ 1 and θ 1,j where there is no confusion. For given Θ, recall that X = X( Θ, p) = [L : R] and N j = θ j+1 − θ j . For t = θ j +1, . . . , θ j +p, (i) When q = 0. We first note that such that by Proposition F.3, we have (F.14) We decompose the residual sum of squares as Invoking Proposition F.3 and (F.14), Putting together the bounds on R 11 -R 12 , we conclude that Next, note that By the arguments similar to those adopted in Proposition F.3 and Lemma 1 of Lai and Wei (1982a) , we have |R 21 | = O(log(n)). Also, by Proposition F.3 and (F.14), R 22 ≤ L( α − α • ) 2 = O(log(n)). Next, where the first term is bounded by due to Proposition F.3 and Lemma 1 of Lai and Wei (1982a) , and the second term is bounded by the bound on the first term and R 21 as O(log(n)). Therefore, Combining (F.15) and (F.16) with Assumption 3.1 (ii)-(iii), and noting that log(1 + for n large enough, due to Assumption 3.4. (ii) When q ≥ 1. Recall that in M n , we have q = q. wheref = max 0≤j≤q |f θ j +1 |. We first establish the consistency of µ in estimating µ • . Applying Lemma F.4, we write (Horn and Johnson, 1985, Theorem 4.2. 2)) and thus lim inf n→∞ n −1 λ min (R (I− Π L )R) > 0 by Proposition F.3. Also, since tr(R (I − Π L )R) ≤ n trivially, we obtain |R 32 | = O log(n)/n adopting the same arguments used in the proof of (F.11). Next, by (F.13) and since and therefore |R 31 | 2 = O (qρ n /n). Putting together the bounds on R 31 -R 32 , we obtain Also, note that by Lemma F.4, By Proposition F.3 and (F.20), Also, due to (F.18) and (F.19), and we also obtain R 43 = O (log(n) ∨ qρ n ). By Proposition F.3 and (F.20), while with (F.13), (F.19), Assumption 3.1 and Chebyshev's inequality, on E n . Combining the bounds on R 41 -R 45 , we obtain Next, note that Repeatedly invoking Lemma F.5, we have where R −I 1 denotes a matrix constructed by merging the j-th and the (j + 1)-th columns of R via summing them up for all j ∈ I 1 , while the rest of the columns of R are unchanged, with I 1 denoting a subset of {1, . . . , q} consisting of all the odd indices. For notational simplicity, let C j (·) = C θ j−1 , θ j , θ j+1 (·) where there is no confusion. Note that Without loss of generality, suppose that θ j ≤ θ j . Analogous arguments apply when θ j > θ j . 
By Lemma F.1, Under Assumptions 3.2, 3.3 and 3.4, min(N j−1 , n ρ n → 0 as n → ∞) and thus and R 63 is similarly bounded. Therefore, we conclude Similarly, by (F.13) and Assumption 3.2, we derive Invoking Assumption 3.1 (iv), it is easily seen that on E n , Finally, by (F.17) and (F.20), Next, we note that First, by Assumption 3.1 (iv), R 71 = O( q j=0 N j ω 2 n · N −1 j ) = O(qω 2 n ) on E n . Also, from Proposition F.3 and (F.20), R 72 ≤ L( α − α • ) 2 = O(log(n) ∨ qρ n ). In addition, where the first term is O(qρ n ) as in (F.18). From (F.13) and the definition of R and R • , (recall that θ 0 = θ 0 = 0 and θ q+1 = θ q+1 = n) such that by Assumptions 3.2 and 3.3, we for some constant C 1 > 0, hence R 73 = O(qρ n ). The bounds on R 72 and R 73 imply the from Lemma 1 of Lai and Wei (1982a) , Proposition F.3 and (F.20). Finally, where using the arguments involved in bounding R 45 , we have the first term bounded by O(q(ρ n ∨ ω 2 n )), while the second term is bounded as on E n , recalling (F.28)-(F.29) and by Assumptions 3.1 (iv), 3.2 and 3.3. Therefore, R 76 = O(q(ρ n ∨ ω 2 n )). Collecting the bounds on R 71 -R 76 , we obtain (I − Π R )(Y − L α) 2 = ε 2 + O log(n) ∨ q(ρ n ∨ ω 2 n ) . Y − X β 2 − qξ n =: n 2 log(1 + R 8 ) − qξ n . (F.32) When R 8 ≥ 1, we have the RHS of (F.32) trivially bounded away from zero by Assumption 3.4. When R 8 < 1, note that for g(x) = log(x)/(x − 1), since lim x↓1 g(x) → 1 and from its continuity, there exists a constant C 2 > 0 such that inf 1≤x<2 g(x) ≥ C 2 . Therefore, n 2 log(1 + R 8 ) − qξ n ≥ C 3 qD n + O log(n) ∨ q(ρ n ∨ ω 2 n ) − qξ n > 0, invoking Assumption 3.1 (ii)-(iii), (F.22) and (F.31) for some C 3 > 0. (iii) Order selection consistency. Thus far, we have assumed that the AR order p is known. We show next that for n large enough, the order p is consistently estimated by p obtained as in (9). Recall the notation β( Θ, r) = ( α (r), µ ( Θ)) . Firstly, suppose that r > p while r ≤ p max . Then, by (F.14) when q = 0 or by ( and therefore, we have Y − X( Θ, r) β( Θ, r) 2 + (r − p)ξ n =O log(n) ∨ q(ρ n ∨ ω 2 n ) + (r − p)ξ n > 0 for n large enough, by Assumption 3.4. Next, consider r < p. For notational convenience, let Π(r) = Π X( Θ,r) , and the submatrix of X( Θ, p) containing its columns corresponding to the i-th lags for i = r + 1, . . . , p by X(p|r). Then, [X(p|r) (I − Π(r))X(p|r)] −1 is a sub-matrix of (X( Θ, p) X( Θ, p)) −1 and thus by Theorem 4.2.2 of Horn and Johnson (1985) and Proposition F.3, we have λ max X(p|r) (I − Π(r))X(p|r) ≤ λ max X( Θ, p) X( Θ, p) ≤ tr X( Θ, p) X( Θ, p) = O(n) and similarly, (F.33) λ min X(p|r) (I − Π(r))X(p|r) ≥ λ min X( Θ, p) X( Θ, p) and thus lim inf n→∞ n −1 λ min X(p|r) (I − Π(r))X(p|r) > 0. with some constant C 4 > 0 for n large enough, where the O(log(n)) bound on the RHS of (F.35) is due to (F.33), (F.34) and Lemma 1 of Lai and Wei (1982a) , while the O(qρ n ) bound from (F.18), regardless of whether there are change points or not. Therefore, we have with some constant C 5 > 0 for n large enough, by Assumption 3.4, (F.15) and (F.22). (iv) When M > 1. The above (i)-(iii) completes the proof in the special case when Assumption 3.2 is met with M = 1. In the general case where M > 1, the above proof is readily adapted to prove the claim of the theorem. (a) First, note that for any l ≥ l * , the intervals examined in Step 1 of the gSa algorithm, { θ l−1,uv + 1, . . . , θ l−1,uv+1 − 1}, v = 1, . . . 
(iv) When M > 1. The above (i)-(iii) complete the proof in the special case when Assumption 3.2 is met with M = 1. In the general case where M > 1, the above proof is readily adapted to prove the claim of the theorem. (a) First, note that for any l ≥ l*, the intervals examined in Step 1 of the gSa algorithm, {θ̂_{l-1,u_v} + 1, . . . , θ̂_{l-1,u_{v+1}} - 1}, v = 1, . . . , q_l, correspond to one of the following cases under Assumption 3.2: the null case with no 'detectable' change points, i.e. either Θ ∩ {θ̂_{l-1,u_v} + 1, . . . , θ̂_{l-1,u_{v+1}} - 1} = ∅, or all θ_j ∈ Θ ∩ {θ̂_{l-1,u_v} + 1, . . . , θ̂_{l-1,u_{v+1}} - 1} satisfy d_j² min(θ_j - θ̂_{l-1,u_v}, θ̂_{l-1,u_{v+1}} - θ_j) ≤ ρ_n; or the change point case with Θ ∩ {θ̂_{l-1,u_v} + 1, . . . , θ̂_{l-1,u_{v+1}} - 1} ≠ ∅ and d_j² min(θ_j - θ̂_{l-1,u_v}, θ̂_{l-1,u_{v+1}} - θ_j) ≥ D_n - ρ_n for at least one θ_j ∈ Θ ∩ {θ̂_{l-1,u_v} + 1, . . . , θ̂_{l-1,u_{v+1}} - 1}. In fact, when l = l*, all {θ̂_{l*-1,u_v} + 1, . . . , θ̂_{l*-1,u_{v+1}} - 1}, v = 1, . . . , q_{l*}, correspond to the change point case, while when l ≥ l* + 1, they all correspond to the null case. (b) In the null case, the set A = Θ̂_l ∩ {θ̂_{l-1,u_v} + 1, . . . , θ̂_{l-1,u_{v+1}} - 1} serves the role of the set of spurious estimators, Θ̂, as in (i), with |A| serving as q. Besides, accounting for the possible estimation bias in the boundary points θ̂_{l-1,u_v} and θ̂_{l-1,u_{v+1}} as in the case of q ≥ 1 above, the proof is complete.

For a fixed j = 1, . . . , q, we drop the subscript j and write θ̂ = θ̂_j, ℓ = ℓ_j, r = r_j, θ = θ_j, f = f_j and δ = δ_j. In what follows, we assume that X_{ℓ,θ,r} > 0; otherwise, consider -X_t (resp. -f_t and -Z_t) in place of X_t (resp. f_t and Z_t). Then, on Z_n, we have max_v F_{ℓ,v,r} > 0 for n large enough. Below, we consider the case where θ̂ ≤ θ; the case where θ̂ > θ can be handled analogously. We first establish that

θ - θ̂ ≤ min(θ - ℓ, r - θ)/2. (F.40)

If θ - θ̂ > min(θ - ℓ, r - θ)/2 ≥ δ/4 (due to (B.1)), then by Lemma F.2 and (F.38), we have F_{ℓ,θ,r} - F_{ℓ,θ̂,r} ≥ f² δ/(3√3) while |Z_{ℓ,θ,r} - Z_{ℓ,θ̂,r}| ≤ 2√2 ζ_n, thus contradicting that X_{ℓ,θ̂,r} ≥ X_{ℓ,θ,r} under (F.1). Next, for some ρ_n satisfying f⁻² ρ_n ≤ δ/4, we control the location of the arg max of the contrast. Recall that the coefficients of the linear process Z_t satisfy |b_ℓ| ≤ γ ℓ^{-β} for all ℓ ≥ 1, with some γ > 0 and β > 1. We can show that the dependence adjusted sub-exponential norm ‖Z_·‖_{ψ_ν,0} is bounded from above by some fixed constant C_1 > 0 with ν = 1, where Z_{t,{0}} = Σ_{ℓ=0, ℓ≠t}^∞ b_ℓ ε_{t-ℓ} + b_t ε'_0 with ε'_0 an independent copy of ε_0. Then, by Lemma C.4 of Zhang and Wu (2017), there exists a fixed constant C_2 > 0 such that the probability of the relevant partial sums of Z_t exceeding ζ_n in modulus is bounded by C_2 n(n + 1) exp(-3 ζ_n^{2/3} / (4e ‖Z_·‖_{ψ_1,0})), i.e. we can set ζ_n = C_3 log^{3/2}(n) with a large enough C_3 > 0 (depending only on ‖Z_·‖_{ψ_1,0}) and have P(Z_n) → 1. Using similar arguments and Bernstein's inequality (see e.g. Theorem 2.8.1 of Vershynin (2018)), we have P(E_n) → 1 with ω_n ≍ log(n).
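The quantities X_{ℓ,θ,r}, F_{ℓ,θ,r} and Z_{ℓ,θ,r} in the localisation argument above are the contrast applied to the data, to the signal and to the noise, respectively, over the interval (ℓ, r]. Purely for illustration, and assuming the standard weighted CUSUM form of the contrast (the paper's exact normalisation is not reproduced here), the contrast and its maximiser over an interval can be computed as in the following sketch; the function names are hypothetical.

```python
import numpy as np

def cusum(x, l, theta, r):
    """Weighted CUSUM contrast of the 1-D NumPy array x over (l, r], evaluated
    at theta: sqrt((theta - l) * (r - theta) / (r - l)) times the absolute
    difference of the sample means of x over (l, theta] and (theta, r].
    This standard form is an assumption made for the sketch."""
    left = x[l:theta]        # observations l+1, ..., theta (0-based slicing)
    right = x[theta:r]       # observations theta+1, ..., r
    w = np.sqrt(len(left) * len(right) / (r - l))
    return w * abs(left.mean() - right.mean())

def argmax_cusum(x, l, r):
    """Candidate change point: the location maximising the contrast over (l, r)."""
    return max(range(l + 1, r), key=lambda th: cusum(x, l, th, r))
```

For instance, argmax_cusum(x, 0, len(x)) returns the single split point at which the contrast over the full sample is largest.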
Proof. The results in (F.8)-(F.9) follow from Theorem 3 (ii) of Lai and Wei (1983) and the finiteness of Θ_M. By Corollary 2 of Lai and Wei (1982a), (F.10) follows from the facts that tr(R⊤R) = n and R_{(j)}⊤ R_{(j)} = N_j. By Lemma 1 of Lai and Wei (1982b), we have a bound which, together with (F.8) and (F.10), leads to (F.11).

Lemma F.4 (Lemma 3.1.2 of Csörgő and Horváth (1997)) is stated for any X = [L : R]. The next result concerns the matrix constructed by merging the j-th and the (j + 1)-th columns of R via summing them up, while the rest of the columns of R are unchanged.

Proof. Denote the (j + 1)-th column of R by R_j. Then, by simple calculations and by construction, the claimed identities follow. Note that, on Z_n, due to (B.1) and (F.40), and also by (B.1), the corresponding bounds hold. Then, by (F.38) and (F.1), there exists some c_3 > 0 such that, setting ρ_n = c_3 (ζ_n)², we have

Acknowledgements. We thank Paul Fearnhead and Gaetano Romano for their constructive comments. Haeran Cho was supported by the Leverhulme Trust Research Project Grant (RPG-2019-390).

References

Nitrogen dioxide in the United Kingdom
breakfast: Methods for Fast Multiple Change-Point Detection and Estimation
Detecting multiple generalized change-points by isolating single ones. Preprint.
of weakly dependent random variables, with applications
A MOSUM procedure for the estimation of multiple random change points
Detection and estimation of local signals
Relating and comparing methods for detecting changes in mean
Multiscale change point inference
Wild Binary Segmentation for multiple change-point detection
Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection
Narrowest Significance Pursuit: inference for multiple change-points in linear models
Autocovariance estimation in the presence of changepoints
UK COVID-19 lockdown: 100 days of air pollution reduction
Matrix Analysis
A note on studentized confidence intervals for the changepoint
Permutation tests for multiple changes
Human health effects of air pollution
Optimal detection of changepoints with a linear computational cost
Resampling methods for the change analysis of dependent data
Multiple change-point detection for non-stationary time series using wild binary segmentation
Seeded binary segmentation: A general methodology for fast and optimal change point detection
Almost sure invariance principles for partial sums of mixing B-valued random variables. The Annals of Probability
An estimator of the number of change points based on a weak invariance principle
Asymptotic properties of projections with applications to stochastic regression problems
Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems
Asymptotic properties of general autoregressive models and strong consistency of least-squares estimates of their parameters
Least-squares estimation of an unknown number of shifts in a time series
An MDL approach to the climate segmentation problem
A Bernstein type inequality and moderate deviations for weakly dependent sequences
Nonstationarities in financial time series, the long-range dependence, and the IGARCH effects
Long memory and changepoint models: a spectral classification procedure
A new daily central England temperature series
Invariance principle for stochastic processes with short memory
Nuisance parameters free changepoint detection in nonstationary series
Global impacts of the 1980s regime shift
Mean shift testing in correlated data
DeCAFS: Detecting Changes in Autocorrelated and Fluctuating Signals
Detecting abrupt changes in the presence of local fluctuations and autocorrelated noise