title: Narrowest Significance Pursuit: inference for multiple change-points in linear models  authors: Fryzlewicz, Piotr  date: 2020-09-11

We propose Narrowest Significance Pursuit (NSP), a general and flexible methodology for automatically detecting localised regions in data sequences which each must contain a change-point, at a prescribed global significance level. Here, change-points are understood as abrupt changes in the parameters of an underlying linear model. NSP works by fitting the postulated linear model over many regions of the data, using a certain multiresolution sup-norm loss, and identifying the shortest interval on which the linearity is significantly violated. The procedure then continues recursively to the left and to the right until no further intervals of significance can be found. The use of the multiresolution sup-norm loss is a key feature of NSP, as it enables the transfer of significance considerations to the domain of the unobserved true residuals, a substantial simplification. It also guarantees important stochastic bounds which directly yield exact desired coverage probabilities, regardless of the form or number of the regressors. NSP works with a wide range of distributional assumptions on the errors, including Gaussian with known or unknown variance, some light-tailed distributions, and some heavy-tailed, possibly heterogeneous distributions via self-normalisation. It also works in the presence of autoregression. The mathematics of NSP is, by construction, uncomplicated, and its key computational component uses simple linear programming. In contrast to the widely studied "post-selection inference" approach, NSP enables the opposite viewpoint and paves the way for the concept of "post-inference selection". Pre-CRAN R code implementing NSP is available at https://github.com/pfryz/nsp.

Examining or monitoring data sequences for possibly multiple changes in their behaviour, other than those attributed to randomness, is an important task in a variety of fields. This paper focuses on abrupt changes, or change-points. Having to discriminate between change-points perceived to be significant, or "real", and those attributable to randomness points to the importance of statistical inference in multiple change-point detection problems. In this paper, we propose a new generic methodology for determining, for a given data sequence and at a given global significance level, localised regions of the data that each must contain a change-point. We define a change in the data sequence Y_t on an interval [s, e] as a departure, on this interval, from a linear model with respect to pre-specified regressors. We give below examples of scenarios covered by the proposed methodology; all of them involve multiple abrupt changes, i.e. change-points.

Scenario 1. Piecewise-constant signal plus noise model.

Y_t = f_t + Z_t, t = 1, . . . , T, (1)

where f_t is a piecewise-constant vector with an unknown number N and locations 0 = η_0 < η_1 < . . . < η_N < η_{N+1} = T of change-points, and Z_t is zero-centred noise; we give examples of permitted joint distributions of Z_t below. The location η_j is a change-point if f_{η_j+1} ≠ f_{η_j}, or equivalently if f_t cannot be described as a constant vector when restricted to any interval [s, e] ⊇ [η_j, η_j + 1].

Scenario 2. Piecewise-polynomial (including piecewise-constant and piecewise-linear as special cases) signal plus noise model.
In (1), f_t is a piecewise-polynomial vector, in which the polynomial pieces have a fixed degree q ≥ 0, assumed known to the analyst. The location η_j is a change-point if f_t cannot be described as a polynomial vector of degree q when restricted to any interval [s, e] ⊇ [η_j, η_j + 1] such that e − s ≥ q + 1.

Scenario 3. Linear regression with piecewise-constant parameters. For a given design matrix X = (X_{t,i}), t = 1, . . . , T, i = 1, . . . , p, the response Y_t follows the model

Y_t = X_{t,·} β^{(j)} + Z_t for t = η_j + 1, . . . , η_{j+1}, for j = 0, . . . , N, (2)

where the parameter vectors β^{(j)} = (β^{(j)}_1, . . . , β^{(j)}_p) are such that β^{(j)} ≠ β^{(j+1)}.

Each of these scenarios is a generalisation of the preceding one. To see this, observe that Scenario 3 reduces to Scenario 2 if p = q + 1 and the ith column of X is a polynomial in t of degree i − 1. We permit a broad range of distributional assumptions for Z_t: we cover i.i.d. Gaussianity and other light-tailed distributions, and we use self-normalisation to also handle (not necessarily known) distributions within the domain of attraction of the Gaussian distribution, including under heterogeneity. In addition, in Section 3, we introduce Scenario 4, a generalisation of Scenario 3, which provides a framework for the use of our methodology under regression with autoregression (AR).

The literature on inference and uncertainty evaluation in multiple change-point problems is diverse in the sense that different authors tend to answer different inferential questions. Below we briefly review the existing literature which seeks to make various confidence statements about the existence or locations of change-points in particular regions of the data, or significance statements about their importance (as opposed to merely testing for any change), aspects that are relevant to this work.

In the piecewise-constant signal model, SMUCE (Frick et al., 2014) estimates the number N of change-points as the minimum among all candidate fits f̂_t for which the empirical residuals pass a certain multiscale test at significance level α. It then returns a confidence set for f_t, at confidence level 1 − α, as the set of all candidate signals for which the number of change-points agrees with the thus-estimated number, and for which the empirical residuals pass the same test at significance level α. An issue for SMUCE, discussed e.g. in Chen et al. (2014), is that the smaller the significance level α, the more lenient the test on the empirical residuals, and therefore the higher the risk of underestimating N. This poses problems for the kinds of inferential statements in SMUCE that the authors envisage for it, because for a confidence set of an estimate of f_t to cover the truth, the authors require (amongst other things) that the estimated number of change-points agrees with the truth. The statement in Chen et al. (2014), who write: "Table 5 [in Frick et al. (2014)] shows in the Gaussian example with unknown mean that, even when the sample size is as large as 1500, a nominal 95% confidence set has only 55% coverage; even more strikingly, a nominal 80% coverage set has 84% coverage", is an illustration of this issue. SMUCE is extended to heterogeneous Gaussian noise in Pein et al. (2017) and to dependent data in Dette et al. (2018). Chen et al.
(2014) attempt to remedy this issue by using an estimator of N which does not depend (in the way described above) on the possibly small α, but uses a different significance level instead, which is believed to lead to better estimators of N . This breaks the property of SMUCE that the larger the nominal coverage, the smaller the chance of getting the number of change-points right. However, in their construction, called SMUCE 2 , an estimate of f t is still said to cover the truth if the number of the estimated change-points agrees with the truth. This is a bottleneck, which means that for many challenging signals SMUCE 2 will also be unable to cover the truth with a high nominal probability requested by the user. In the approach taken in this paper, this issue does not arise as we shift the inferential focus away from N . A number of authors approach uncertainty quantification for multiple change-point problems from the point of view of post-selection inference (a.k.a. selective inference). In the piecewise-constant model with i.i.d. Gaussian noise, Hyun et al. (2018a) consider the fused lasso (Tibshirani et al., 2005) solution with k estimated change-points, and test hypotheses of the equality of signal mean of either side of a given thus-detected change-point, conditioning on many aspects of the estimation process, including the k detected locations and the estimated signs of the associated jumps. The same work covers linear trend filtering (Tibshirani, 2014) and gives similar conditional tests for the linearity of the signal at a detected location. For the piecewise-constant model with i.i.d. Gaussian noise, Hyun et al. (2018b) outline similar post-selection tests for detection via binary segmentation (Vostrikova, 1981) , Circular Binary Segmentation (Olshen et al., 2004) and Wild Binary Segmentation (Fryzlewicz, 2014) . In the piecewise-constant model with i.i.d. Gaussian noise, Jewell et al. (2020) cover in addition the case of l 0 penalisation (this includes the option of estimating the number and locations of change-points via the Schwarz Information Criterion, see Yao (1988) ) and avoid Hyun et al. (2018b) 's technical requirement that the conditioning set be a polyhedral, which allows Jewell et al. (2020) to reduce the size of the conditioning set and hence gain power. The definition of the resulting p-value is still somewhat complex. For example, their test which conditions on the least information, that for the equality of the means of the signal to the left and to the right of a given detected change-point η j within a symmetric and non-adaptively chosen window of size 2h, has a p-value defined as (paraphrasing an intentionally heuristic description from Jewell et al. (2020) ) "the probability under the null that, out of all data sets yielding a change-point atη j and for which the 2h − 1-dimensional component independent of the test statistic over the window [η j − h + 1,η j + h] is the same as that for the observed data, the difference in means around η j within that window is as large as what is observed". An additional potential issue is that choosing h based on an inspection of the data prior to performing this test would affect its validity as it explicitly relies on h being chosen in a data-agnostic way. Related work appears in Duy et al. (2020) . 
Notwithstanding their usefulness in assessing the significance of previously estimated changepoints, these selective inference approaches share the following features: (a) they do not explicitly consider uncertainties in estimating change-point locations, (b) they do not provide regions of globally significant change in the data, (c) they define significance for each change-point separately, as opposed to globally across the whole dataset, (d) they rely on a particular base change-point detection method with its potential strengths or weaknesses. Our approach explicitly contrasts with these features; in particular, in contrast to postselection inference, it can be described as enabling "post-inference selection", as we argue later on. A number of authors provide simple consistency results for the number and locations of detected change-points, typically stating that on a set whose probability tends to one with T , for T large enough and under certain model assumptions such as a minimum spacing between consecutive change-points and minimum magnitudes of the parameter changes, N is estimated correctly and the true change-points must lie within certain distances of the estimated change-points. Examples in the piecewise-constant setting are numerous and include Yao (1988) , Boysen et al. (2009 ), Hao et al. (2013 ), Fryzlewicz (2014 , Lin et al. (2017 ), Fryzlewicz (2018 , Wang et al. (2018) , Cho and Kirch (2020) , and Kovács et al. (2020b) . There are fewer results of this type beyond Scenario 1: examples include Baranowski et al. (2019) in the piecewise-linear model (a method that extends conceptually to higher-order polynomials), and Wang et al. (2019) in the linear regression setting. In inferential terms, such results are usually difficult to use in practice, as the probability statements made typically involve unknown constants related to the minimum distance between change-points or the minimum magnitude of parameter change. In addition, the significance level in these types of results is usually understood to converge to 0 with T (at a speed which, even if known in terms of the rate, is often unknown in terms of constants), rather than being fixable to a concrete value by the user. Some authors go further and provide simultaneous asymptotic distributional results regarding the distance between the estimated change-point locations and the truth. For example, this is done, in the linear regression context, in Bai and Perron (1998) , under the assumption of a known number of change-points, and their minimum distance being O(T ). Naturally enough, the distributional limits depend on the unknown magnitudes of parameter change, which, as pointed out in the post-selection literature referenced above, are often difficult to estimate well. Moreover, convergence to a pivotal distribution involving, for each changepoint, an independent functional of the Wiener process, is only possible in an asymptotic framework in which the magnitudes of the shifts converge to zero with T . Some related literature is reviewed in Marusiakova (2009) . Similar results for the piecewise-constant signal plus noise model and estimation via MOSUM appear in Eichinger and Kirch (2018) . Inference in multiple change-point problems is also sometimes posed as control of the False Discovery Rate (FDR). 
In the piecewise-constant signal model, Li and Munk (2016) propose an estimator, constructed similarly to SMUCE, which controls the FDR but with a generous definition of a true discovery, which, as pointed out in Jewell et al. (2020) , in the most extreme case, permits a detection as far as almost T /2 observations from the truth. Hao et al. (2013) and Cheng et al. (2019) show FDR control for their SaRa and dSTEM estimators (respectively) of multiple change-point locations in the piecewise-constant signal model. Control of FDR is too weak a criterion when one wants to obtain regions of prescribed global significance in the data, as we do in this work: FDR is focused on the number of change-points rather than on their locations, and in particular, it permits estimators which frequently, or even always, over-estimate the number of change-points by a small fraction. This makes it impossible to guarantee that with a large global probability, all regions of significance detected by an FDR-controlled estimator contain at least one change-point each. Bayesian approaches to uncertainty quantification in multiple change-point problems are considered e.g. in Fearnhead (2006) and Nam et al. (2012) (see also the monograph Ruanaidh and Fitzgerald (1996) ), and are particularly useful when clear priors, chosen independently of the data, are available about some features of the signal. We now summarise our new approach, then situate it in the context of the related literature, and next discuss its novel aspects. The objective of our methodology, called "Narrowest Significance Pursuit" (NSP), is to automatically detect localised regions of the data Y t , each of which must contain at least one change-point (in a suitable sense determined by the given scenario), at a prescribed global significance level. NSP proceeds as follows. A number M of intervals are drawn from the index domain [1, . . . , T ], with start-and endpoints chosen either uniformly at random, or over an equispaced deterministic grid. On each interval drawn, Y t is then checked to see whether or not it locally conforms to the prescribed linear model, with any set of parameters. This check is performed through estimating the parameters of the given linear model locally via a particular multiresolution sup-norm, and testing the residuals from this fit via the same norm; self-normalisation is involved if necessary. In the first greedy stage, the shortest interval (if one exists) is chosen on which the test is violated at a certain global significance level α. In the second greedy stage, the selected interval is searched for its shortest sub-interval on which a similar test is violated. This sub-interval is then chosen as the first region of global significance, in the sense that it must (at a global level α) contain a change-point, or otherwise the local test would not have rejected the linear model. The procedure then recursively draws M intervals to the left and to the right of the chosen region (with some, or with no overlap), and so on, and stops when no further regions of global significance can be found. The theme of searching for globally significant localised regions of the data containing change appears in different versions in the existing literature. This frequently involves multiscale statistics: operators of the same form applied over sub-samples of the data taken at different locations and of differing lengths. 
Dümbgen and Spokoiny (2001) test locally and at multiple scales for monotonicity or concavity of a curve against a general smooth alternative with an unknown degree of smoothness. Dümbgen and Walther (2008) identify regions of local increases or decreases of a density function. Walther (2010) searches for anomalous spatial clusters in the Bernoulli model using dyadically constructed blocked scan statistics. SiZer (Chaudhuri and Marron, 1999) is an exploratory multiscale data analytic tool, with roots in computer vision, for assessing the significance of curve features for differentiable curves; SiZer for curves with jumps is described in Kim and Marron (2006) . , in the piecewise-constant signal plus i.i.d. Gaussian noise model, approximate the tail probability of the maximum CUSUM statistic over all sub-intervals of the data. They then propose an algorithm, in a few variants, for identifying short, nonoverlapping segments of the data on which the local CUSUM exceeds the derived tail bound, and hence the segments identified must contain at least a change-point each, at a given significance level. present results of similar nature for a Gaussian model with lag-one autocorrelation, linear trend, and features that are linear combinations of continuous, piecewise differentiable shapes. Both these works draw on the last author's extensive experience of the topic, see e.g. Siegmund (1988) . The most important high-level differences between NSP and these two approaches are listed below. (a) While in and , the user needs to be able to specify the significant signal shapes to look for, NSP searches for any deviations from local model linearity with respect to specific regressors. (b) Out of our scenarios, and provide results under our Scenario 1 and Scenario 2 with linearity and continuity. Their results do not cover our Scenario 3 (linear regression with arbitrary X) or Scenario 2 with linearity but not necessarily continuity, or Scenario 2 with higher-than-linear polynomials. (c) The distribution under the null of the multiscale test performed by NSP is stochastically bounded by the scan statistic of the corresponding true residuals Z t , and is therefore independent of the scenario and of the design matrix X used. This means that NSP is ready for use with any user-provided design matrix X, and this will require no new calculations or coding, and will yield correct coverage probabilities. This is in contrast to the approach taken in and , in which each new scenario not already covered would involve new and fairly complicated approximations of the null distribution. (d) Thanks to its double use of the multiresolution sup-norm (in the local linear fit, and then in the test of this fit), NSP is able to handle regression with autoregression practically in the same way as without, and does not suffer from having to estimate the unknown AR coefficients as nuisance parameters to be plugged back in, the way it is done in , who mention the instability of the latter procedure if the current data interval under consideration is used for this purpose. This issue does not arise in NSP and hence it is able to deal with autoregression, stably, on arbitrarily short intervals. This is of importance, as change-point analysis under serial dependence in the data is a known difficult problem, and NSP offers a new approach to it, thanks to this feature. We also mention below other main distinctive features of NSP in comparison with the existing literature. 
(i) NSP is specifically constructed to target the shortest possible significant intervals at every stage of the procedure, and to explore as many intervals as possible while remaining computationally efficient. This is achieved by a two-stage greedy mechanism for determining the shortest significant interval at every recursive stage, and by basing the sampling of intervals on the "Wild Binary Segmentation 2" sampling scheme, which explores the space of intervals much better (Fryzlewicz, 2020) than the older "Wild Binary Segmentation" sampling scheme used in Fryzlewicz (2014) , Baranowski et al. (2019) and mentioned in passing in . (ii) NSP critically relies on what we believe is a new use of the multiresolution sup-norm. On each interval drawn, NSP locally fits the postulated linear model via multiresolution sup-norm minimisation (as opposed to e.g. the more usual OLS or MLE). It then uses the same norm to test the empirical residuals from this fit, which ensures that, under the local null, their maximum in this norm is bounded by that of the corresponding (unobserved) true residuals on that interval. This ensures the exactness of the coverage statements furnished by NSP, at a prescribed global significance level, regardless of the scenario and for any given regressors X. (iii) Thanks to the fact that multiresolution sup-norms can be interpreted as Hölderlike norms on certain function spaces, NSP naturally extends to the cases of unknown or heterogeneous distributions of Z t using the elegant functional-analytic selfnormalisation framework developed in Rackauskas and Suquet (2001) , Rackauskas and Suquet (2003) and related papers. Also, the use of multiresolution sup-norms means that if simulation needs to be used to determine critical values for NSP, then this can be done in a computationally efficient manner. The paper is organised as follows. Section 2 introduces the NSP methodology and provides the relevant coverage theory. Section 3 extends this to NSP under self-normalisation and in the additional presence of autoregression. Section 4 provides extensive numerical examples under a variety of settings. Section 5 describes three real-data case studies. Section 6 concludes with a brief discussion. Complete R code implementing NSP is available at https://github.com/pfryz/nsp. This section describes the generic mechanics of NSP and its specifics for models in which the noise Z t is i.i.d., light-tailed and enough is known about its distribution for self-normalisation not to be required. We provide details for Z t ∼ N (0, σ 2 ) with σ 2 assumed known, and some other light-tailed distributions. We discuss the estimation of σ 2 . NSP under regression with autoregression, and self-normalised NSP, are in Section 3. Throughout the section, we use the language of Scenario 3, which includes Scenarios 1 and 2 as special cases. In particular, in Scenario 1, the matrix X in (2) is of dimensions T × 1 and has all entries equal to 1. In Scenario 2, the matrix X is of dimensions T × (q + 1) and its ith column is given by (t/T ) i−1 , t = 1, . . . , T . Scenario 4 (for NSP in the additional presence of autoregression), which generalises Scenario 3, is dealt with in Section 3.2. We start with a pseudocode definition of the NSP algorithm, in the form of a recursively defined function NSP. 
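To fix ideas before the formal description, the sketch below is a minimal R rendition of the recursion; it follows the verbal account in the next paragraphs, with deviation_from_linearity() standing in for the multiresolution sup-norm check of Section 2.2. All function names, the interval-drawing details and the omission of the second-stage search are simplifying assumptions of this sketch, not features of the implementation available at https://github.com/pfryz/nsp.

# Minimal sketch of the NSP recursion (illustrative only; not the authors' code).
# deviation_from_linearity() is a placeholder here; a concrete version based on
# the multiresolution sup-norm fit is sketched in Section 2.2.
deviation_from_linearity <- function(y, x) 0

nsp_sketch <- function(s, e, Y, X, M, lambda_alpha, tau_L, tau_R, S = list()) {
  if (e - s < 1) return(S)                     # interval too short to test
  # draw M sub-intervals of [s, e] (uniformly at random here; an equispaced
  # deterministic grid is an equally valid choice)
  a <- sample(s:e, M, replace = TRUE)
  b <- sample(s:e, M, replace = TRUE)
  sub <- cbind(pmin(a, b), pmax(a, b))
  dev <- apply(sub, 1, function(ii)
    deviation_from_linearity(Y[ii[1]:ii[2]], X[ii[1]:ii[2], , drop = FALSE]))
  sig <- which(dev > lambda_alpha)             # sub-intervals significant at level alpha
  if (length(sig) == 0) return(S)              # nothing significant on [s, e]
  len <- sub[sig, 2] - sub[sig, 1]
  shortest <- sig[len == min(len)]
  m0 <- shortest[which.max(dev[shortest])]     # shortest, then largest deviation
  # (the full algorithm now searches [s_m0, e_m0] for its shortest significant
  #  sub-interval by re-running the steps above inside it; omitted here)
  s_t <- sub[m0, 1]; e_t <- sub[m0, 2]
  S[[length(S) + 1]] <- c(s_t, e_t)            # add region of significance
  # recurse to the left and right, with optional overlaps tau_L, tau_R
  S <- nsp_sketch(s, s_t + tau_L(s_t, e_t, Y, X), Y, X, M, lambda_alpha, tau_L, tau_R, S)
  S <- nsp_sketch(e_t - tau_R(s_t, e_t, Y, X), e, Y, X, M, lambda_alpha, tau_L, tau_R, S)
  S
}

# no-overlap usage (tau_L = tau_R = 0):
# S <- nsp_sketch(1, length(Y), Y, X, M = 1000, lambda_alpha,
#                 tau_L = function(...) 0, tau_R = function(...) 0)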
In its arguments, [s, e] is the current interval under consideration and at the start of the procedure, we have [s, e] = [1, T]; Y (of length T) and X (of dimensions T × p) are as in the model formula (2); M is the (maximum) number of sub-intervals of [s, e] drawn; λ_α is the threshold corresponding to the global significance level α (typical values for α would be 0.05 or 0.1); and τ_L (respectively τ_R) is a functional parameter used to specify the degree of overlap of the left (respectively right) child interval of [s, e] with respect to the region of significance identified within [s, e], if any. The no-overlap case would correspond to τ_L = τ_R ≡ 0. In each recursive call on a generic interval [s, e], NSP adds to the set S any globally significant local regions (intervals) of the data identified within [s, e] on which Y is deemed to depart significantly (at global level α) from linearity with respect to X. We provide more details underneath the pseudocode below; its final lines, which perform the recursion, read:

21: NSP(s, s̃ + τ_L(s̃, ẽ, Y, X), Y, X, M, λ_α, τ_L, τ_R)
22: NSP(ẽ − τ_R(s̃, ẽ, Y, X), e, Y, X, M, λ_α, τ_L, τ_R)
23: end function

The NSP algorithm is launched by the pair of calls below:

S ← ∅
NSP(1, T, Y, X, M, λ_α, τ_L, τ_R)

On completion, the output of NSP is in the variable S. We now comment on the NSP function line by line. In lines 2-4, execution is terminated for intervals that are too short; clearly, if e = s, then there is nothing to detect on [s, e]. In lines 11-13, each sub-interval [s_m, e_m] is checked to see to what extent the response on this sub-interval (denoted by Y_{s_m:e_m}) conforms to the linear model (2) with respect to the set of covariates on the same sub-interval (denoted by X_{s_m:e_m,·}). For NSP without self-normalisation, described in this section, this check is done by fitting the postulated linear model on [s_m, e_m] using a certain multiresolution sup-norm loss, and computing the same multiresolution sup-norm of the empirical residuals from this fit, to form a measure of deviation from linearity on this interval. This core step of the NSP algorithm will be described in more detail in Section 2.2. In line 14, the measures of deviation obtained in line 12 are tested against the threshold λ_α, chosen to guarantee the global significance level α. How to choose λ_α depends (only) on the distribution of Z_t; this question will be addressed in Sections 2.3-2.4. The shortest sub-interval(s) [s_m, e_m] for which the test rejects the local hypothesis of linearity of Y versus X at global level α are collected in the set M_0. In lines 15-17, if M_0 is empty, then the procedure decides that it has not found regions of significant deviations from linearity on [s, e], and stops on this interval as a consequence. Otherwise, in line 18, the procedure continues by choosing the sub-interval, from among the shortest significant ones, on which the deviation from linearity has been the largest. (Empirically, M_0 often has cardinality one, in which case the choice in line 18 is trivial.) The chosen interval is denoted by [s_{m_0}, e_{m_0}]. In line 19, [s_{m_0}, e_{m_0}] is searched for its shortest significant sub-interval, i.e. the shortest sub-interval on which the hypothesis of linearity is rejected locally at a global level α. Such a sub-interval certainly exists, as [s_{m_0}, e_{m_0}] itself has this property. The structure of this search again follows the workflow of the NSP procedure; more specifically, it proceeds by executing lines 2-18 of NSP, but with s_{m_0}, e_{m_0} in place of s, e. The chosen interval is denoted by [s̃, ẽ].
This two-stage search (identification of [s m 0 , e m 0 ] in the first stage and of [s,ẽ] ⊆ [s m 0 , e m 0 ] in the second stage) is crucial in NSP's pursuit to force the identified intervals of significance to be as short as possible, without unacceptably increasing the computational cost. The importance of this two-stage solution will be illustrated in Section 4.1.2. In line 20, the selected interval [s,ẽ] is added to the output set S. In lines 21-22, NSP is executed recursively to the left and to the right of the detected interval [s,ẽ] . However, we optionally allow for some overlap with [s,ẽ] . The overlap, if present, is a function of [s,ẽ] and, if it involves detection of the location of a change-point within [s,ẽ] , then it is also a function of Y, X. An example of the relevance of this is given in Section 4.1.1. We now comment on a few generic aspects of the NSP algorithm as defined above, and situate it in the context of the existing literature. Length check for [s, e] in line 2. Consider an interval [s, e] with e − s < p. If it is known that the matrix X s:e,· is of rank e − s + 1 (as is the case, for example, in Scenario 2, for all such s, e) then it is safe to disregard [s, e], as the response Y s:e can then be explained exactly as a linear combination of the columns of X s:e,· , so it is impossible to assess any deviations from linearity of Y s:e with respect to X s:e,· . Therefore, if this rank condition holds, the check in line 2 of NSP can be replaced with e − s < p, which (together with the corresponding modifications in lines 5-10) will reduce the computational effort if p > 1. Having p = p(T ) growing with T is possible in NSP, but by the above discussion, we must have p(T ) + 1 ≤ T or otherwise no regions of significance will be found. Sub-interval sampling. Sub-interval sampling in lines 5-10 of the NSP algorithm is done to reduce the computational effort; considering all sub-intervals would normally be too expensive. In the change-point detection literature (without inference considerations), Wild Binary Segmentation (WBS, Fryzlewicz, 2014) uses a random interval sampling mechanism in which all or almost all intervals are sampled at the start of the procedure, i.e. with all or most intervals not being sampled recursively. The same style of interval sampling is used in the Narrowest-Over-Threshold change-point detection (note: not change-point inference) algorithm (Baranowski et al., 2019) and is mentioned in passing in . Instead, NSP uses a different, recursive interval sampling mechanism, introduced in the change-point detection (not inference) context in Wild Binary Segmentation 2 (WBS2, Fryzlewicz, 2020) . In NSP (lines 5-10), intervals are sampled separately in each recursive call of the NSP routine. As argued in Fryzlewicz (2020) , this enables more thorough exploration of the domain {1, . . . , T } and hence better feature discovery than the non-recursive sampling style. We note that NSP can equally use random or deterministic interval selection mechanisms; a specific example of a deterministic interval sampling scheme in a change-point detection context can be found in Kovács et al. (2020b) . Relationship to NOT. The Narrowest-Over-Threshold (NOT) algorithm of Baranowski et al. (2019) is a change-point detection procedure (valid in Scenarios 1 and 2) and comes with no inference considerations. 
The common feature shared by NOT and NSP is that in their respective aims (change-point detection for NOT; locating regions of global significance for NSP) they iteratively focus on the narrowest intervals on which a certain test (a changepoint locator for NOT; a multiscale scan statistic on multiresolution sup-norm fit residuals for NSP) exceeds a threshold, but this is where similarities end: apart from this common feature, the objectives, scopes and modi operandi of both methods are different. Focus on the smallest significant regions. Some authors in the inference literature also identify the shortest intervals (or smallest regions) of significance in data. For example, Dümbgen and Walther (2008) plot minimal intervals on which a density function significantly decreases or increases. Walther (2010) plots minimal significant rectangles on which the probability of success is higher than a baseline, in a two-dimensional spatial model. mention the possibility of using the interval sampling scheme from Fryzlewicz (2014) to focus on the shortest intervals in their CUSUM-based determination of regions of significance in Scenario 1. In addition to NSP's new definition of significance involving the multiresolution sup-norm fit (whose benefits are explained in Section 2.2), NSP is also different from these approaches in that its pursuit of the shortest significant intervals is at its algorithmic core and is its main objective. To achieve it, NSP uses a number of solutions which, to the best of our knowledge, either are new or have not been considered in this context before. These include the two-stage search for the shortest significant subinterval (NSP routine, line 19) and the recursive sampling (lines 5-10, proposed previously but in a non-inferential context by Fryzlewicz (2020)). This section completes the definition of NSP (in the version without self-normalisation) by describing the DeviationFromLinearity function (NSP algorithm, line 12). Its basic building block is a scaled partial sum statistic, defined for an arbitrary input sequence (3) In the feature (including change-point) detection literature, scaled partial sum statistics are used in at least two distinct contexts. In the first type of use, they serve as likelihood ratio statistics, under i.i.d. Gaussianity of the noise, for testing whether a given constant region of the data has a different mean from its constant baseline. For the problem of testing for the existence of such a region or estimating its unknown location (or their locations if multiple), sometimes under the heading of epidemic change-point detection, scaled partial sum statistics are combined across (s, e) in various ways, often into variants of scan statistics (i.e., maxima across (s, e) of absolute scaled partial sum statistics), see Siegmund and Venkatraman (1995) , 2020) for an accessible overview of this problem. In this type of use, scaled partial sum statistics operate directly on the data, so we refer to this mode of use as "direct". The second popular use of scaled partial sum statistics is in estimators that can be represented as the simplest (from the point of view of a certain regularity or smoothness functional) fit to the data for which the empirical residuals are deemed to behave like the true residuals. In this mode of use, scaled partial sum statistics are used as components of a multiresolution sup-norm used to check this aspect of the empirical residuals. SMUCE (Frick et al., 2014) , reviewed previously, is one example of such an estimator. 
Others are the taut string algorithm for minimising the number of local extreme values (Davies and Kovac, 2001), the general simplicity-promoting approach of Davies et al. (2009) and the Multiscale Nemirovski-Dantzig (MIND) estimator of Li (2016). The explicit reference to Dantzig in Li (2016) (see also e.g. Frick et al. (2014)) reflects the fact that the Dantzig selector for high-dimensional linear regression (Candes and Tao, 2007) also follows the "simplicity of fit subject to a sup-norm constraint on the residuals" logic. In this type of use, scaled partial sum statistics do not operate directly on the data but are used in a fit-to-data constraint, so we refer to this mode of use as "indirect".

We now describe the DeviationFromLinearity function and show how its use of scaled partial sum statistics does not strictly fall into the "direct" or "indirect" categories. We define the scan statistic of an input vector y (of length T) with respect to the interval set I as

||y||_I = max_{[s,e] ∈ I} |U_{s,e}(y)|, (4)

where U_{s,e}(y) = (e − s + 1)^{−1/2} Σ_{t=s}^{e} y_t is the scaled partial sum statistic from (3). As in Davies and Kovac (2001), Davies et al. (2009), Frick et al. (2014), Li (2016) and related works, the set I used in NSP contains intervals at a range of scales and locations. Although in principle, the computation of (4) for the set I_a of all subintervals of [1, T] is possible in computational time O(T log T) (Bernholt and Hofmeister, 2006), the algorithm is fairly involved and for computational simplicity we use the set I_d of all intervals of dyadic lengths and arbitrary locations, that is

I_d = {[s, e] ⊆ [1, T] : e − s + 1 = 2^k for some integer k ≥ 0}.

A simple pyramid algorithm of complexity O(T log T) is available for the computation of all U_{s,e}(y) for [s, e] ∈ I_d. We also define restrictions of I_a and I_d to arbitrary intervals [s, e]: I_d[s,e] = {[u, v] ⊆ [s, e] : [u, v] ∈ I_d}, and analogously for I_a. We will be referring to ||·||_{I_d}, ||·||_{I_a} and their restrictions as multiresolution sup-norms (see Nemirovski (1986) and Li (2016)) or, alternatively, multiscale scan statistics if they are used as operations on data. If the context requires this, the qualifier "dyadic" will be added to these terms when referring to the I_d versions. The facts that, for any interval [s, e] and any input vector y (of length T), we have

||y||_{I_d[s,e]} ≤ ||y||_{I_a[s,e]} ≤ ||y||_{I_a}, (5)

will be of use to us later. The DeviationFromLinearity(Y_{s_m:e_m}, X_{s_m:e_m,·}) function proceeds in the following steps.

1. Compute

β_0 = arg min_β ||Y_{s_m:e_m} − X_{s_m:e_m,·} β||_{I_d[s_m,e_m]}. (6)

This fits the postulated linear model between X and Y restricted to the interval [s_m, e_m]. However, we use the multiresolution sup-norm ||·||_{I_d[s_m,e_m]} as the loss function, rather than the more usual L_2 loss. This has important consequences for the exactness of our significance statements, which we explain later below.

2. Compute the same multiresolution sup-norm of the empirical residuals from the above fit,

D_{[s_m,e_m]} = ||Y_{s_m:e_m} − X_{s_m:e_m,·} β_0||_{I_d[s_m,e_m]}. (7)

Steps (6) and (7) can obviously also be carried out in a single step as D_{[s_m,e_m]} = min_β ||Y_{s_m:e_m} − X_{s_m:e_m,·} β||_{I_d[s_m,e_m]}; however, for comparison with other approaches, it will be convenient for us to use the two-stage process (in formulae (6) and (7)) for the computation of D_{[s_m,e_m]}.

3. Return D_{[s_m,e_m]}.

The following important property lies at the heart of NSP.

Proposition 2.1 Let [s, e] ⊆ [1, T] be an interval containing no change-points, i.e. such that Y_t = X_{t,·} β^{(j)} + Z_t on [s, e] for a single j. Then D_{[s,e]} ≤ ||Z||_{I_d[s,e]} ≤ ||Z||_{I_d}.

Proof. We have D_{[s,e]} = min_β ||Y_{s:e} − X_{s:e,·} β||_{I_d[s,e]} ≤ ||Y_{s:e} − X_{s:e,·} β^{(j)}||_{I_d[s,e]} = ||Z||_{I_d[s,e]} ≤ ||Z||_{I_d}, which completes the proof.

This is a simple but valuable result, which can be read as follows: "under the local null hypothesis of no signal on [s, e], the test statistic D_{[s,e]}, defined as the multiresolution sup-norm of the empirical residuals from the same multiresolution sup-norm fit of the postulated linear model on [s, e], is bounded by the multiresolution sup-norm of the true residual process Z_t".
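For illustration, the dyadic multiresolution sup-norm and the resulting deviation measure D on a single interval can be computed as in the R sketch below, with the sup-norm fit in (6) cast as a linear programme via the lpSolve package (which, as noted below, is also the route taken in our code). The function names and the brute-force evaluation of the dyadic norm, rather than the pyramid algorithm mentioned above, are simplifications assumed in this sketch.

library(lpSolve)

# Scaled partial sum U_{u,v}(z) = (v - u + 1)^{-1/2} * sum(z[u:v]), cf. (3).
scaled_sum <- function(z, u, v) sum(z[u:v]) / sqrt(v - u + 1)

# All sub-intervals of [1, n] of dyadic length (the set I_d, indexed from 1).
dyadic_intervals <- function(n) {
  out <- list(); len <- 1
  while (len <= n) {
    starts <- 1:(n - len + 1)
    out[[length(out) + 1]] <- cbind(starts, starts + len - 1)
    len <- 2 * len
  }
  do.call(rbind, out)
}

# Dyadic multiresolution sup-norm ||z||_{I_d}: max over I_d of |U_{u,v}(z)|.
mres_norm <- function(z) {
  iv <- dyadic_intervals(length(z))
  max(abs(apply(iv, 1, function(ii) scaled_sum(z, ii[1], ii[2]))))
}

# Deviation from linearity on one interval: minimise the multiresolution
# sup-norm of the residuals over beta (formula (6)) and return the attained
# norm (formula (7)).  Cast as an LP with beta = beta_plus - beta_minus,
# both parts non-negative, and t bounding the sup-norm.
deviation_from_linearity <- function(y, x) {
  x <- as.matrix(x); p <- ncol(x)
  iv <- dyadic_intervals(length(y)); K <- nrow(iv)
  w <- 1 / sqrt(iv[, 2] - iv[, 1] + 1)
  A <- do.call(rbind, lapply(seq_len(K), function(i)
    colSums(x[iv[i, 1]:iv[i, 2], , drop = FALSE]))) * w
  b <- vapply(seq_len(K), function(i) sum(y[iv[i, 1]:iv[i, 2]]), numeric(1)) * w
  const_mat <- rbind(cbind( A, -A, -1),   #  A beta - t <= b   (so |b - A beta| <= t)
                     cbind(-A,  A, -1))   # -A beta - t <= -b
  sol <- lp(direction = "min",
            objective.in = c(rep(0, 2 * p), 1),   # minimise t only
            const.mat    = const_mat,
            const.dir    = rep("<=", 2 * K),
            const.rhs    = c(b, -b))
  sol$objval   # this is D on the interval
}

# e.g. D on a candidate interval [s_m, e_m], to be compared with lambda_alpha:
# D <- deviation_from_linearity(Y[s_m:e_m], X[s_m:e_m, , drop = FALSE])

In particular, applying mres_norm to simulated N(0, 1) sequences of length T gives a Monte Carlo approximation to the distribution of ||Z||_{I_d}, in line with the simulation-based determination of critical values discussed elsewhere in the paper.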
This bound is achieved because the same norm is used in the linear model fit and in the residual check, and it is important to note that the corresponding bound would not be available if the postulated linear model were fitted with a different loss function, e.g. via OLS. Having such a bound allows us to transfer our statistical significance calculations to the domain of the unobserved true residuals Z_t, which is much easier than working with the corresponding empirical residuals. It is also critical to obtaining global coverage guarantees for NSP, as we now show.

Theorem 2.1 Let S = {S_1, . . . , S_R} be a set of intervals returned by the NSP algorithm. The following guarantee holds:

P(∃ i ∈ {1, . . . , R} : S_i ∩ {η_1, . . . , η_N} = ∅) ≤ P(||Z||_{I_d} > λ_α) ≤ P(||Z||_{I_a} > λ_α).

Proof. The second inequality is implied by (5). We now prove the first inequality. On the set {||Z||_{I_d} ≤ λ_α}, each interval S_i must contain a change-point: if it did not, then by Proposition 2.1, we would have to have

D_{S_i} ≤ ||Z||_{I_d} ≤ λ_α. (8)

However, the fact that S_i was returned by NSP means, by line 14 of the NSP algorithm, that D_{S_i} > λ_α, which contradicts (8). This completes the proof.

Theorem 2.1 should be read as follows. Let α = P(||Z||_{I_a} > λ_α). For a set of intervals returned by NSP, we are guaranteed, with probability of at least 1 − α, that there is at least one change-point in each of these intervals. Therefore, S = {S_1, . . . , S_R} can be interpreted as an automatically chosen set of regions (intervals) of significance in the data. In the no-change-point case (N = 0), the correct reading of Theorem 2.1 is that the probability of obtaining one or more intervals of significance (R ≥ 1) is bounded from above by P(||Z||_{I_a} > λ_α). The following comments are in order.

NSP vs direct use of scan statistics. The use of scan statistics in NSP is different from that in the "direct" approaches described at the beginning of this section, as in NSP they are used on residuals from local linear fits, rather than on the original data.

NSP vs indirect use of multiresolution sup-norms. The use of multiresolution sup-norms in NSP is also different from the "indirect" use in the Dantzig-selector-type estimators in Davies and Kovac (2001), Davies et al. (2009), Frick et al. (2014) and Li (2016). These estimators use other types of fit to the data (ones that maximise certain regularity / simplicity), to be checked, in terms of their goodness-of-fit, via a multiresolution sup-norm. NSP uses a multiresolution sup-norm fit to be checked via the same multiresolution sup-norm. This is a fundamental difference which leads to exact coverage guarantees for NSP with very simple mathematics. We show in Section 4 that SMUCE (Frick et al., 2014) does not have the corresponding coverage guarantees even if it abandons its focus on N as an inferential quantity.

Interpretation of S as unconditional confidence intervals. Traditionally, sets of confidence intervals for change-point locations are constructed (see e.g. Bai and Perron (1998)) conditional on having selected a particular model, i.e. estimating N. Such a conditional approach does not guarantee unconditional global coverage in the sense of Theorem 2.1. By contrast, the set S of intervals returned by NSP is not conditional on any particular estimator of N, and as a result provides unconditional coverage guarantees. Still, the regions of significance in S have a "confidence interval" interpretation in the sense that each must contain at least one change, with a certain prescribed global probability.

(1 − α)100%-guaranteed lower bound on the number of change-points.
A simple corollary of Theorem 2.1 is that for S = {S 1 , . . . , S R }, if the corresponding sets S − i are mutually disjoint (as is the case e.g. if τ L = τ R ≡ 0), then we must have N ≥ R with probability at least 1 − α. It would be impossible to obtain a similar upper bound on N with a guaranteed probability without order-of-magnitude assumptions on spacings between change-points and magnitudes of parameter changes. Such assumptions are typically difficult to verify, and we do not make them in this work. As a consequence, our result in Theorem 2.1 does not rely on asymptotics and has a finite-sample character. Computation of linear fit with multiresolution sup-norm loss. The linear model fit in formula (6) can be computed in a simple and efficient way via linear programming. This is carried out in our code with the help of the R package lpSolve. Irrelevance of accuracy of nuisance parameter estimators. β 0 in formula (6) does not have to be an accurate estimator of the true local β for the bound in Proposition 2.1 to hold; it holds unconditionally and for arbitrary short intervals [s, e] . This is in contrast to e.g. an OLS fit, in which we would have to ensure accurate estimation of the local β (and therefore: suitably long intervals [s, e]) to be able to obtain similar bounds. We return to this important issue in Section 3.2 for comparison with the existing literature. "Post-inference selection" and related new concepts. NSP is not automatically equipped with pointwise estimators of change-point locations. This is an important feature, because thanks to this, it can be so general and work in the same way for any X without a change. If it were to come with meaningful pointwise change-point location estimators, they would have to be designed for each X separately, e.g. using the maximum likelihood principle. (However, NSP can be paired up with such pointwise estimators; examples, and the role of the overlap functions τ L and τ R in such pairings, are given in Sections 4 and 5.) We now introduce a few new concepts, to contrast this feature of NSP with the concept of "post-selection inference" (see e.g. Jewell et al. (2020) for its use in our Scenario 1). • "Post-inference selection". If it can be assumed that an interval S i = [s i , e i ] ∈ S only contains a single change-point, its location can be estimated e.g. via MLE performed locally on the data subsample living on [s i , e i ]. Naturally, the MLE should be constructed with the specific design matrix X in mind, see Baranowski et al. (2019) for examples in Scenarios 1 and 2. In this construction, "inference", i.e. the execution of NSP, occurs before "selection", i.e. the estimation of the change-point locations, hence the label of "post-inference selection". This avoids the complicated machinery of post-selection inference, as we automatically know that the p-value associated with the estimated change-point must be less than α. • "Simultaneous inference and selection" or "in-inference selection". In this construction, change-point location estimation on an interval [s,ẽ] occurs directly after adding it to S. The difference with "post-inference selection" is that this then naturally enables appropriate non-zero overlaps τ L and τ R in the execution of NSP. More specifically, denoting the estimated location within [s,ẽ] byη, we can set, for example, so that lines 21-22 of the NSP algorithm become NSP(s,η, Y, X, M, λ α , τ L , τ R ) NSP(η + 1, e, Y, X, M, λ α , τ L , τ R ). • "Inference without selection". 
This term refers to the use of NSP unaccompanied by a change-point location estimator.

Known vs unknown distribution of ||Z||_{I_a}. By Theorem 2.1, the only piece of knowledge required to obtain coverage guarantees in NSP is the distribution of ||Z||_{I_a} (or ||Z||_{I_d}), regardless of the form of X. This is in contrast with the approach taken in and , in which coverage is guaranteed with the knowledge of distributions which may differ for each X. This property of NSP is attractive because much is known about the distribution of ||Z||_{I_a} for various underlying distributions of Z; see Sections 2.3 and 2.4 for Z Gaussian and following other light-tailed distributions, respectively. Any future further distributional results of this type would only further enhance the applicability of NSP. However, if the distribution of ||Z||_{I_a} is unknown, then an approximation can also be obtained by simulation. This can be done an order of magnitude faster than simulating the maximum of all possible CUSUM statistics, a quantity required to guarantee coverage in the setting of but without the assumption of Gaussianity on Z: on a single dataset, the computation of ||Z||_{I_a} is an O(T^2) operation, whereas the computation of the maximum CUSUM is O(T^3).

Lack of penalisation for fine scales. Instead of using multiresolution sup-norms (multiscale scan statistics) as defined by (4), some authors, including Walther (2010) and Frick et al. (2014), use alternative definitions which penalise fine scales (i.e. short intervals) in order to enhance detection power at coarser scales. We do not pursue this route, as NSP aims to discover significant intervals that are as short as possible, and hence we are interested in retaining good detection power at fine scales. However, some natural penalisation of fine scales is necessary in the self-normalised case; see Section 3.1 for more details.

Upper bounds for p-values on non-detection intervals. By calculating the quantity D_{[s,e]}, defined in (7), on any given interval [s, e], including one on which NSP does not flag a significant departure from linearity, we can bound the p-value of the local hypothesis of linearity on [s, e] from above by P(||Z||_{I_a} ≥ D_{[s,e]}), since under this hypothesis D_{[s,e]} is bounded by ||Z||_{I_a} (Proposition 2.1 combined with (5)).

We now recall distributional results for ||Z||_{I_a}, in the case Z_t ∼ i.i.d. N(0, σ^2) with σ^2 assumed known, which will permit us to choose λ_α = λ_α(T) so that P{||Z||_{I_a} > λ_α(T)} → α as T → ∞. The resulting λ_α(T) can then be used in Theorem 2.1. The assumption of a known σ^2 is common in the change-point inference literature, see e.g. Hyun et al. (2018a), and Jewell et al. (2020). Fundamentally, this is because in Scenarios 1 and 2, in which the covariates possess some degree of regularity across t, the variance parameter σ^2 is relatively easy to estimate (see Section 4.1 of Dümbgen and Spokoiny (2001), and , for overviews of the most common approaches). Fryzlewicz (2020) points out potential issues in estimating σ^2 in the presence of frequent change-points, but they are addressed in Kovács et al. (2020a). See Section 2.5 for the unknown σ^2 case. Results on the distribution of ||Z||_{I_a} are given in Siegmund and Venkatraman (1995) and Kabluchko (2007). We recall the formulation from Kabluchko (2007) as it is slightly more explicit.

Theorem 2.2 (Theorem 1.3 in Kabluchko (2007)) Let Z_t ∼ i.i.d. N(0, 1). With normalising sequences a_T and b_T and a constant H as given explicitly in Kabluchko (2007) (the expressions involve Φ(·), the standard normal cdf), we have, for any fixed γ,

P{max_{[s,e] ⊆ [1,T]} U_{s,e}(Z) ≤ a_T + b_T γ} → exp(−e^{−γ}) as T → ∞.

We use the approximate value H = 0.82 in our numerical work. Using the asymptotic independence of the maximum and the minimum (Kabluchko and Wang, 2014), and the symmetry of Z, we get the following simple corollary:

P{||Z||_{I_a} ≤ a_T + b_T γ} → exp(−2e^{−γ}) (9)

as T → ∞. In light of (9), we obtain λ_α for use in Theorem 2.1 as follows: (a) equate α = 1 − exp(−2e^{−γ}) and obtain γ; (b) form λ_α = σ(a_T + b_T γ).
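In R, steps (a)-(b) amount to the short function below; the normalising sequences a_T and b_T are treated as inputs taken from Theorem 2.2, and the function name is illustrative.

# lambda_alpha from the Gumbel-type limit (9):
#   alpha = 1 - exp(-2 * exp(-gamma))   =>   gamma = -log(-log(1 - alpha) / 2),
#   lambda_alpha = sigma * (a_T + b_T * gamma).
nsp_threshold <- function(alpha, sigma, a_T, b_T) {
  gamma <- -log(-log(1 - alpha) / 2)
  sigma * (a_T + b_T * gamma)
}

# e.g. for a 10% global significance level:
# lambda_alpha <- nsp_threshold(alpha = 0.1, sigma = sigma_hat, a_T = a_T, b_T = b_T)

Alternatively, and in line with the simulation remark above, the (1 − α)-quantile of ||Z||_{I_a} (or ||Z||_{I_d}) obtained by Monte Carlo can be used directly as λ_α.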
Kabluchko and Wang (2014) provide a result similar to Theorem 2.2 for distributions of Z dominated by the Gaussian in a sense specified below. These include, after scaling so that E(Z) = 0 and Var(Z) = 1, the symmetric Bernoulli, symmetric binomial and uniform distributions, amongst others. We now briefly summarise it for completeness. Consider the cumulant-generating function of Z defined by ϕ(u) = log E(e uZ ) and assume that for some σ 0 > 0, we have ϕ(u) < ∞ for all u ≥ −σ 0 . Assume further that for all ε > 0, sup u≥ε ϕ(u)/(u 2 /2) < 1. Finally, assume for some d ∈ {3, 4, . . .} and κ > 0. Typical values of d for non-symmetric and symmetric distributions, respectively, are 3 and 4. Under these assumptions, we have . After simple algebraic manipulations, this result permits a selection of λ α for use in Theorem 2.1, similarly to Section 2.3. We show under what condition Theorem 2.2 remains valid with an estimated variance σ 2 , and give an estimator of σ 2 that satisfies this condition for certain matrices X and parameter vectors β (j) . Similar considerations are possible for the light-tailed distributions from Section 2.4, but we omit them for brevity. With {Z t } T t=1 ∼ N (0, σ 2 ) rather than N (0, 1), the statement of Theorem 2.2 trivially modifies to . With σ estimated via a generic estimatorσ, we ask under what circumstances In light of (10), it is enough to solve for γ T in σ(a In view of the form of a T and b T defined in Theorem 2.2, γ T defined in (12) satisfies γ T −→ T →∞ γ on a set large enough for (11) to hold if After Rice (1984) and Dümbgen and Spokoiny (2001) , definê Define the signal in model (2) by f t = X t,· β (j) for t = η j + 1, . . . , η j+1 , for j = 0, . . . , N . The total variation of a vector {f t } T t=1 is defined by T V (f ) = T −1 t=1 |f t+1 − f t |. As in Dümbgen and Spokoiny (2001) By way of a simple example, in Scenario 1, T V (f ) = N j=1 |f η j − f η j +1 |, and therefore (15) is satisfied if the sum of jump magnitudes in f is o(T 1/2 log −1 T ). Note that if f is bounded with a number of change-points that is finite in T , then T V (f ) = const(T ). Similar arguments apply in Scenario 2, and in Scenario 3 for certain matrices X. Without formal theoretical justifications, we also mention two further estimators of σ 2 (or σ) which we use later in our numerical work. • In Scenarios 1 and 2, we useσ M AD , the Median Absolute Deviation (MAD) estimator as implemented in the R routine mad, computed on the sequence {2 −1/2 (Y t+1 −Y t )} T −1 t=1 . Empirically,σ M AD is more robust thanσ R to the presence of change-points in f t , but is also more sensitive to departures from the Gaussianity of Z t . • In Scenario 3, in settings outside Scenarios 1 and 2, we use the following estimator. In model (2), estimate σ via least squares, on a rolling window basis, using the window of size w = min{T, max([T 1/2 ], 20)}, to obtain the sequence of estimatorsσ 1 , . . . ,σ T −w+1 . Takeσ M OLS = median(σ 1 , . . . ,σ T −w+1 ), where MOLS stands for 'Median of OLS estimators'. The hope is that most of the local estimatorŝ σ 1 , . . . ,σ T −w+1 are computed on change-point-free sections of the data, and therefore the median of these local estimators should serve as an accurate estimator of the true σ. Empirically,σ M OLS is a useful alternative toσ R in settings in which condition (15) is not satisfied. 3.1 Self-normalised NSP Sections 2.3 and 2.4 outline the choice of λ α for Gaussian or lighter-tailed distributions of Z t . 
Kabluchko and Wang (2014) point out that the square-root normalisation used in (3) is not natural for the heavier-tailed than Gaussian sublogarithmic class of distributions, which includes Gamma, negative binomial and Poisson. Siegmund and Yakir (2000) provide the 'right' normalisation for these and other exponential-family distributions, but this involves the likelihood function of Z t and hence requires the knowledge of its full distribution, which may not always be available to the analyst. Similarly, Mikosch and Rackauskas (2010) provide the suitable normalisation for regularly varying random variables with index α RV , which also involves the knowledge of α RV . We are interested in obtaining a universal normalisation in (3) which would work across a wide range of distributions without requiring their explicit knowledge. One such solution is offered by the self-normalisation framework developed in Rackauskas and Suquet (2001) , Rackauskas and Suquet (2003) , Rackauskas and Suquet (2004) and related papers. We now recall the basics and discuss the necessary adaptations to our context. We first discuss the relevant distributional results for the true residuals Z t . In this paper, we only cover the case of symmetric distributions of Z t . For the non-symmetric case, which requires a slightly different normalisation, see Rackauskas and Suquet (2003) . In Rackauskas and Suquet (2003) , the following result is proved. Let This last condition, in particular, is satisfied if θ = 1/2 and ν > 1/2. The function ρ θ,ν,c will play the role of a modulus of continuity. Let Z 1 , Z 2 , . . . be independent and symmetrically distributed with E(Z t ) = 0; note they do not need to be identically distributed. Define in probability as T → ∞. Egorov (1997) shows that (16) is equivalent to the central limit theorem. Therefore, the material of this section applies to a much wider class of distributions than the heterogeneous extension of SMUCE in Pein et al. (2017) , which only applies to normally distributed Z t . Let the random polygonal partial sums process ζ T be defined on [0, 1] as linear interpolation between the knots (V 2 t /V 2 T , S t ), t = 0, . . . , T , where S 0 = V 0 = 0, and let Denote by H ρ θ,ν,c [0, 1] the set of continuous functions x : H ρ θ,ν,c [0, 1] is a Banach space in its natural norm x ρ θ,ν,c = |x(0)| + ω ρ θ,ν,c (x, 1). and, with > 0 and c = exp(1 + 2 ), consider the statistic In the notation and under the conditions listed above, it is a direct consequence of the distributional convergence (17) in the space H 0 ρ θ,ν,c [0, 1] that for any level γ, we have P sup 0≤i p. As [s, e] contains no change-point, we have (Ẑ which guarantees (Ẑ . .+Z 2 j for and j −i suitably large, for a range of distributions of Z t and design matrices X. We now briefly sketch the argument justifying this for Scenario 1; similar considerations are possible in Scenario 2 but are notationally much more involved and we omit them here for brevity. The argument relies again on self-normalisation. From standard least-squares theory (in any Scenario), we have In Scenario 1, (X (i+1):j,· X (i+1):j,· ) −1 = (j − i) −1 , and hence From the above, we obtain In light of the distributional result (18), the relationship between the statistic I ρ 1/2,1/2+ ,c (W, u, v) and Rackauskas and Suquet (2004) 's statistic UI(ρ 1/2,1/2+ ,c ), as well as their Remark 5, we are able to bound sup 0≤i 0, which can be bounded from below by Z (i+1):j Z (i+1):j (1 + ) −2 , uniformly over those i, j for which (j − i) −1 l T log T → 0. 
This justifies (20) and completes the argument. The corresponding minimisation problem, which in turn guarantees the bound (19), is practically equivalent to the multiresolution norm minimisation solved in formula (6), except that it now uses a weighted version of the norm ||·||_{I_a[s,e]}, where the weights are given in the denominator of (22). This weighted problem is solved via linear programming just as easily as (6), the only difference being that the relevant constraints are multiplied by the corresponding weights. We now discuss further practicalities of the self-normalisation.

Estimating V_T. Our empirical experience is that the statistic on the LHS of (19) is fairly robust to the choice of V^2_T, as the latter only enters through the (close to) square-root logarithmic term in the denominator. In addition, over-estimation of V^2_T for use on the LHS of (19) is permitted, as it only strengthens the bound in (19). For these reasons, we do not dwell on the accurate estimation of V^2_T here, but use a rough estimate constructed from the σ̂_t's, the constituents of the σ̂_MOLS estimator from Section 2.5.

Relevant range of i and j. As clarified earlier, the use of (20) requires that small values of j − i do not enter in the computation of the supremum on the LHS of (19). In practice, however, we use all [i + 1, j] ∈ I_d[s,e]. This is because the function I_{ρ_{1/2,1/2+ε},c}(ζ^{se}_T, V^2_i/V^2_T, V^2_j/V^2_T) naturally penalises small scales (i.e. short intervals [i + 1, j]) through the use of the logarithmic term in the denominator. Therefore, in practice, short intervals [i + 1, j] do not tend to achieve the supremum on the LHS of (19), and as a result we have found further exclusion of such short intervals unnecessary.

Choice of ε. We have experimented with ε in the range [0.03, 0.1] and found little difference in practical performance. Our code uses ε = 0.03 as a default.

To accommodate autoregression, we introduce the following additional scenario.

Scenario 4. For a given design matrix X = (X_{t,i}), t = 1, . . . , T, i = 1, . . . , p, the response Y_t follows the model

Y_t = X_{t,·} β^{(j)} + Σ_{k=1}^{r} a^{(j)}_k Y_{t−k} + Z_t for t = η_j + 1, . . . , η_{j+1},

for j = 0, . . . , N, where the regression parameter vectors β^{(j)} = (β^{(j)}_1, . . . , β^{(j)}_p) and the autoregression parameters a^{(j)}_k are such that either β^{(j)} ≠ β^{(j+1)} or a^{(j)}_k ≠ a^{(j+1)}_k for some k (or both types of changes occur). In this work, we treat the autoregressive order r as fixed and known to the analyst.

Change-point detection in the signal in the presence of serial correlation is a known hard problem in change-point analysis, and many methods (see e.g. Dette et al. (2018) for an example and a literature review) rely on the accurate estimation of the long-run variance of the noise, a difficult problem. consider r = 1 and treat the autoregressive parameter as known, but acknowledge that in practice it is estimated from the data; however, they add that "[it] would also be possible to estimate [the autoregressive parameter] from the currently studied subset of the data, but this estimator appears to be unstable". NSP circumvents this instability issue, as explained below.

NSP for Scenario 4 proceeds as follows.

1. Supplement the design matrix X with the lagged versions of the variable Y, or in other words substitute X ← [X, Y_{·−1}, . . . , Y_{·−r}], where Y_{·−k} denotes the respective backshift operation. Omit the first r rows of the thus-modified X, and the first r elements of Y.

2. Run the NSP algorithm of Section 2.1 with the new X and Y (with a suitable modification to line 12 if using the self-normalised version), with the following single difference.
In lines 21 and 22, recursively call the NSP routine on the intervals [s, s̃ + τ_L(s̃, ẽ, Y, X) − r] and [ẽ − τ_R(s̃, ẽ, Y, X) + r, e], respectively.

As each local regression is now supplemented with autoregression of order r, we insert the extra "buffer" of size r between the detected interval [s̃, ẽ] and the subsequent child intervals to ensure that we do not process information about the same change-point in both the parent call and one of the child calls, which prevents double detection. The discussion under the heading of "Guaranteed location of change-points" from Section 2.2 still applies in this case. As the NSP algorithm for Scenario 4 proceeds in exactly the same way as for Scenario 3, the result of Theorem 2.1 applies to the output of NSP for Scenario 4 too.

The NSP algorithm offers a new point of view on change-point analysis in the presence of autocorrelation. This is because, unlike the existing approaches, most of which require the accurate estimation of the autoregressive parameters before successful change-point detection can be achieved, NSP circumvents the issue by using the same multiresolution norm in the local regression fits on each [s, e] and in the subsequent tests of the local residuals. In this way, the autoregression parameters do not have to be estimated accurately for the relevant stochastic bound in Proposition 2.1 to hold; it holds unconditionally and for arbitrarily short intervals [s, e]. Therefore, unlike methods which rely on such estimation, NSP is able to deal with autoregression, stably, on arbitrarily short intervals.

4 Numerical illustrations

4.1 Scenario 1 - piecewise constancy

4.1.1 Low signal-to-noise example

We use the piecewise-constant blocks signal of length T = 2048 containing N = 11 change-points, defined in Fryzlewicz (2014). We contaminate it with i.i.d. Gaussian noise with σ = 10, simulated with the random seed set to 1. This represents a difficult setting from the perspective of multiple change-point detection, with practically all state-of-the-art multiple change-point detection methods failing to estimate all 11 change-points with high probability (Anastasiou and Fryzlewicz, 2020). Therefore, a high degree of uncertainty with regards to the existence and locations of change-points can be expected here.

The NSP procedure with the σ̂_MAD estimate of σ, run with the following parameters: M = 1000, α = 0.1, τ_L = τ_R = 0, and with a deterministic interval sampling grid, returns 7 intervals of significance, shown in the top left plot of Figure 1. We recall that it is not the aim of the NSP procedure to detect all change-points. The correct interpretation of the result is that we can be at least 100(1 − α)% = 90% certain that each of the intervals returned by NSP covers at least one true change-point. We note that this coverage holds for this particular sample path, with exactly one true change-point being located within each interval of significance.

NSP enables the definition of the following concept of a change-point hierarchy. A hypothesised change-point contained in the detected interval of significance [s̃_1, ẽ_1] is considered more prominent than one contained in [s̃_2, ẽ_2] if [s̃_1, ẽ_1] is shorter than [s̃_2, ẽ_2]. The bottom left plot of Figure 1 shows a "prominence plot" for this output of the NSP procedure, in which the lengths of the detected intervals of significance are arranged in order from the shortest to the longest.
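As an illustration of the prominence plot just described, the short R sketch below orders a set of NSP intervals of significance from shortest to longest and displays their lengths. The start- and end-points used here are invented for illustration only and are not the intervals shown in Figure 1.

intervals <- data.frame(start = c(280, 600, 837, 1550),   # hypothetical NSP output
                        end   = c(330, 700, 1303, 1600))
intervals$length <- intervals$end - intervals$start + 1
intervals <- intervals[order(intervals$length), ]          # shortest = most prominent
barplot(intervals$length,
        names.arg = paste0("[", intervals$start, ",", intervals$end, "]"),
        las = 2, ylab = "interval length", main = "Prominence plot")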
It is unsurprising that the intervals returned by NSP do not cover the remaining 4 change-points, as from a visual inspection it appears that all of them are located towards the edges of the data sections situated between the intervals of significance. Executing NSP without an overlap, i.e. with τ_L = τ_R = 0, means that the procedure runs, in each recursive step, wholly on data sections between (and only including the end-points of) the previously detected intervals of significance. Therefore, in light of the close-to-the-edge locations of the remaining 4 change-points within such data sections, and the low signal-to-noise ratio, any procedure would struggle to detect them there. This shows the importance of allowing non-zero overlaps τ_L and τ_R in NSP. We next test the following overlap functions on this example:

τ_L(s̃, ẽ) = (s̃ + ẽ)/2 − s̃, τ_R(s̃, ẽ) = (s̃ + ẽ)/2 + 1 − ẽ,    (24)

so that the recursive calls to the left and to the right of a detected interval [s̃, ẽ] are executed on [s, (s̃ + ẽ)/2] and [(s̃ + ẽ)/2 + 1, e], respectively (rather than on [s, s̃] and [ẽ, e]). The outcome of the NSP procedure with the overlap functions in (24), but otherwise the same parameters as earlier, is shown in the top right plot of Figure 1. This version of the procedure returns 10 intervals of significance, such that (a) each interval covers at least one true change-point, and (b) they collectively cover 10 of the signal's N = 11 change-points, the only exception being η_3 = 307. We briefly remark that one of the returned intervals of significance, [s̃, ẽ] = [837, 1303], is much longer than the others, but this should not surprise given the low signal-to-noise ratio in this part of the data.

For comparison, we also consider a simplified variant of the procedure, denoted NSP(1), in which the first significant interval found is not refined further: in other words, [s_{m_0}, e_{m_0}] is not searched for its shortest sub-interval of significance, but is added to S as it is. The output of NSP(1) on Y_t is shown in the right plot of Figure 2. The intervals of significance returned by NSP(1) include one particularly long detection interval. This issue would not arise in NSP, as NSP would then search this detection interval for its shortest significant sub-interval. From the output of the NSP procedure, we can see that this second-stage search drastically reduces the length of this detection interval. For very long signals, it is conceivable that an analogous three-stage search may be a better option, possibly combined with a reduction in M to enhance the speed of the procedure.

For the NSP procedure, Theorem 2.1 promises that the probability of detecting an interval of significance which does not cover a true change-point is bounded from above by P(‖Z‖_{I^a} > λ_α), regardless of the value of M and of the overlap parameters τ_L, τ_R. In this section, we set P(‖Z‖_{I^a} > λ_α) = α = 0.1. We now show that a similar coverage guarantee is not available in SMUCE, even if we move away from its focus on N as an inferential quantity, thereby obtaining a more lenient performance test for SMUCE. In R, SMUCE is implemented in the package stepR, available from CRAN. For a generic data vector y, the start- and end-points of the confidence intervals for the SMUCE-estimated change-point locations (at significance level α = 0.1) are available in columns 3 and 4 of the table returned by the call jumpint(stepFit(y, alpha=0.1, confband=T)), with the exception of its final row. In this numerical example, we consider again the blocks signal with σ = 10. For each of 100 simulated sample paths, we record a "1" for SMUCE if each interval defined above contains at least one true change-point, and a "0" otherwise. Similarly, we record a "1" for NSP if each interval S_i contains at least one true change-point, where S = {S_1, . . . , S_R} is the set of intervals returned by NSP, and a "0" otherwise.
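The per-sample-path check described above can be sketched in R as follows. The stepR call is the one quoted in the text; the helper covers_all(), and the objects y (a simulated sample path of the blocks signal plus noise) and true.cpt (the true change-point locations), are our own notation and are assumed to be available. Depending on the stepR version, the coercion via as.data.frame() may be unnecessary.

library(stepR)

# TRUE if every interval [starts[k], ends[k]] contains at least one true change-point
covers_all <- function(starts, ends, true.cpt)
  all(mapply(function(s, e) any(true.cpt >= s & true.cpt <= e), starts, ends))

ci <- as.data.frame(jumpint(stepFit(y, alpha = 0.1, confband = TRUE)))
ci <- ci[-nrow(ci), ]                                 # drop the final row, as described above
smuce_ok <- covers_all(ci[, 3], ci[, 4], true.cpt)    # a "1" is recorded if TRUE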
As before, in NSP we use M = 1000, τ_L = τ_R = 0, and a deterministic interval sampling grid. With the random seed set to 1 prior to the simulation of the sample paths, the percentages of "1"s obtained for SMUCE and NSP are given in Table 1. While NSP (generously) keeps its promise of delivering a "1" with probability at least 0.9, the same cannot be said for SMUCE, for which the result of 52% makes the interpretation of its significance parameter α = 0.1 difficult.

Table 1.
method    coverage
SMUCE     52
NSP       100

We consider the continuous, piecewise-linear shortwave2 signal, defined as the first 450 elements of the wave2 signal from Baranowski et al. (2019), contaminated with i.i.d. Gaussian noise with σ = 0.5. The signal and a sample path are shown in Figure 3. In this model, we run the NSP procedure, with no overlaps and with the other parameters set as in Section 4.1.1, (wrongly or correctly) assuming the following, where q denotes the postulated degree of the underlying piecewise polynomial:

q = 0. This wrongly assumes that the true signal is piecewise constant.

q = 1. This assumes the correct degree of the polynomial pieces making up the signal.

q = 2. This over-specifies the degree: the piecewise-linear pieces can be modelled as piecewise quadratic, but with the quadratic coefficient set to zero.

We denote the resulting versions of the NSP procedure by NSP_q for q = 0, 1, 2. The intervals of significance returned by all three NSP_q methods are shown in Figure 3. Theorem 2.1 guarantees that the NSP_1 intervals each cover a true change-point with probability of at least 1 − α = 0.9, and this behaviour takes place in this particular realisation. The same guarantee holds for the over-specified situation in NSP_2, but there is no performance guarantee for the mis-specified model in NSP_0.

The total length of the intervals of significance returned by NSP_q for a range of q can potentially be used to aid the selection of the 'best' q. To illustrate this potential use, note that the total length of the NSP_0 intervals of significance is much larger than that of NSP_1 or NSP_2, and therefore the piecewise-constant model would not be preferred here on the grounds that the data deviates from it over a large proportion of its domain. The total lengths of the intervals of significance for NSP_1 and NSP_2 are very similar, and hence the piecewise-linear model might (correctly) be preferred here as offering a good description of a similar portion of the data, with fewer parameters than the piecewise-quadratic model.

[Figure 4 caption, fragment: ... AR(1) noise with coefficient 0.9 and standard deviation (1 − 0.9^2)^{-1/2}/5 (light grey), NSP intervals of significance (shaded red), true change-points (blue); see Section 4.4 for details.]

We briefly illustrate the performance of the self-normalised NSP. We define the piecewise-constant squarewave signal as taking the values 0, 10, 0, 10, each over a stretch of 200 time points. With the random seed set to 1, we contaminate it with a sequence of independent t-distributed random variables with 4 degrees of freedom, with the standard deviation changing linearly from σ_1 = 2√2 to σ_800 = 8√2. The simulated dataset, showing the "spiky" nature of the noise, is in the left plot of Figure 4. We run the self-normalised version of NSP with the following parameters: a deterministic equispaced interval sampling grid, M = 1000, α = 0.1, ε = 0.03, no overlap; the outcome is in the left plot of Figure 4.
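The simulation setting just described can be reproduced along the following lines (a sketch only; the exact noise construction and plotting details used for Figure 4 may differ). A raw t_4 variable has standard deviation sqrt(2), so scaling it linearly by factors from 2 to 8 yields standard deviations from 2√2 to 8√2 as quoted above.

set.seed(1)
T <- 800
squarewave <- rep(c(0, 10, 0, 10), each = 200)
# scale factors 2, ..., 8 give sd's 2*sqrt(2), ..., 8*sqrt(2) for t_4 noise
y <- squarewave + seq(2, 8, length.out = T) * rt(T, df = 4)
plot(y, type = "l", col = "grey60", xlab = "t", ylab = "")
lines(squarewave, lwd = 2)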
Each true change-point is correctly contained within a (separate) NSP interval of significance, and we note that no spurious intervals get detected despite the heavy-tailed and heterogeneous character of the noise. A typical feature of the self-normalised NSP intervals of significance, exhibited also in this example, is their relatively large width in comparison to the standard (non-self-normalised) NSP. In practice, we rarely came across a self-normalised NSP interval of significance of length below 60. This should not surprise given the fact that the self-normalised NSP is distribution-agnostic, in the sense that the data transformation it uses is valid for a wide range of distributions of Z_t and leads to the same limiting distribution under the null. Therefore, the relatively large width of the self-normalised intervals of significance arises naturally as a protection against mistaking potential heavy-tailed noise for signal. We emphasise that the user does not need to know the distribution of Z_t to perform the self-normalised NSP.

We use the piecewise-constant signal of length T = 1000 from the first simulation setting in Dette et al. (2018), contaminated with Gaussian AR(1) noise with coefficient 0.9 and standard deviation (1 − 0.9^2)^{-1/2}/5. A sample path, together with the true change-point locations, is shown in the right plot of Figure 4. We run the AR version of the NSP algorithm (as outlined in Section 3.2), with the following parameters: a deterministic equispaced interval sampling grid, M = 100, α = 0.1, no overlap, and the σ̂^2_MOLS estimator of the residual variance. The resulting intervals are shown in the right plot of Figure 4; the NSP intervals cover four out of the five true change-points, and there are no spurious intervals.

We simulate from this model 100 times and obtain the following results. In 100% of the sample paths, each NSP interval of significance covers one true change-point (which fulfils the promise of Theorem 2.1). The distribution of the detected numbers of intervals is as in Table 2; we recall that NSP does not promise to detect a number of intervals equal to the number of true change-points in the underlying process.

Table 2.
no. of intervals of significance    2    3    4    5
percentage of sample paths          11   32   42   15

We re-analyse the time series of the US ex-post real interest rate (the three-month treasury bill rate deflated by the CPI inflation rate) considered in Garcia and Perron (1996) and Bai and Perron (2003). The dataset is available at http://qed.econ.queensu.ca/jae/datasets/bai001/. The dataset Y_t, shown in the left plot of Figure 5, is quarterly and the range is 1961:1-1986:3, so t = 1, . . . , T = 103. We first perform a naive analysis in which we assume our Scenario 1 (piecewise-constant mean) plus i.i.d. N(0, σ^2) innovations. This is only so we can obtain a rough segmentation, which we can then use to adjust for possible heteroscedasticity of the innovations in the next stage. We estimate σ^2 via σ̂^2_MAD and run the NSP algorithm (with random interval sampling but having set the random seed to 1, for reproducibility) with the following parameters: M = 1000, α = 0.1, τ_L = τ_R = 0. This returns a set S^0 of two significant intervals; the corresponding fit is in the left plot of Figure 5. We then produce an adjusted dataset, in which we divide Y_{1:47}, Y_{48:82}, Y_{83:103} by the respective estimated standard deviations of these sections of the data. The adjusted dataset Ỹ_t is shown in the right plot of Figure 5 and has a visually homoscedastic appearance.
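The adjustment step just described can be sketched as follows; here y denotes the interest-rate series and, as one possible choice, each section's standard deviation is estimated by the sample standard deviation of that section (the exact estimator used for Figure 5 is not restated here).

sections <- list(1:47, 48:82, 83:103)
y.adj <- y
for (idx in sections) y.adj[idx] <- y[idx] / sd(y[idx])   # divide each section by its estimated sd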
NSP run on the adjusted dataset with the same parameters (random seed 1, M = 1000, α = 0.1, τ_L = τ_R = 0) produces the significant interval set S̃^0 = {[23, 54], [76, 84]}. CUSUM fits on the corresponding data sections Ỹ_{23:54}, Ỹ_{76:84} produce identical estimated change-point locations η̂_1 = 47, η̂_2 = 82. The fit is in the right plot of Figure 5. We could stop here and agree with Garcia and Perron (1996), who also conclude that there are two change-points in this dataset, with locations within our detected intervals of significance. However, we note that the first interval, [23, 54], is relatively long, so one question is whether it could be covering another change-point to the left of η̂_1 = 47. To investigate this, we re-run NSP with the same parameters on Ỹ_{1:47} but find no intervals of significance (not even with the lower thresholds induced by the shorter sample size T_1 = 47 rather than the original T = 103). Our lack of evidence for a third change-point contrasts with Bai and Perron (2003)'s preference for a model with three change-points.

However, the fact that the first interval of significance, [23, 54], is relatively long could also be pointing to model misspecification. If the change of level over the first portion of the data were gradual rather than abrupt, we could naturally expect longer intervals of significance under the misspecified piecewise-constant model. To investigate this further, we now run NSP on Ỹ_t but in Scenario 2, initially in the piecewise-linear model (q = 1), which leads to one interval of significance and an associated estimated change-point location η̂_1 = 73. This raises the prospect of modelling the mean of Ỹ_{1:73} as linear. We produce such a fit, in which in addition the mean of Ỹ_{74:103} is modelled as piecewise-constant, with the change-point location η̂_2 = 79 found via a CUSUM fit on Ỹ_{74:103}. As the middle section of the estimated signal between the two change-points (η̂_1 = 73, η̂_2 = 79) is relatively short, we also produce an alternative fit in which the mean of Ỹ_{1:76} is modelled as linear, and the mean of Ỹ_{77:103} as constant (the starting point for the constant part was chosen to accommodate the spike at t = 77). This is in the right plot of Figure 6 and has a lower BIC value (9.28) than the piecewise-constant fit from the right plot of Figure 5 (10.57). This is because the linear+constant fit uses four parameters, whereas the piecewise-constant fit uses five.

The viability of the linear+constant model for the scaled data Ỹ_t is encouraging because it raises the possibility of a model for the original data Y_t in which the mean of Y_t evolves smoothly in the initial part of the data. We construct a simple example of such a model by fitting the best quadratic on Y_{1:76} (resulting in a strictly decreasing, concave fit), followed by a constant on Y_{77:103}. The change-point location, 77, is the same as in the linear+constant fit for Ỹ_t. The fit is in the left plot of Figure 6. It is interesting to see that the quadratic+constant model for Y_t leads to a lower residual variance than the piecewise-constant model (4.83 vs 4.94). Both models use five parameters. We conclude that more general piecewise-polynomial modelling of this dataset can be a viable alternative to the piecewise-constant modelling used in Garcia and Perron (1996) and Bai and Perron (2003). This example shows how NSP, beyond its usual role as an automatic detector of regions of significance, can also serve as a useful tool in achieving improved model selection.
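The quadratic+constant fit just described can be sketched in R as follows; y denotes the original interest-rate series, and the residual-variance summary shown is one natural way of computing it (the exact divisor used for the figures quoted above is not restated here).

t1 <- 1:76; t2 <- 77:103
fit.quad  <- lm(y[t1] ~ t1 + I(t1^2))        # best quadratic on Y_1:76
fit.const <- mean(y[t2])                     # constant fit on Y_77:103
res <- c(residuals(fit.quad), y[t2] - fit.const)
sum(res^2) / length(y)                       # residual variance of the quadratic+constant fit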
We consider the time series D_t of the daily number of Covid-19-associated deaths in the UK, available from https://coronavirus.data.gov.uk/. The dataset was accessed on 24th July 2020 and runs from 6th March to 23rd July 2020. To eliminate the weekly seasonality and bring the distribution closer to Gaussian with constant variance, we perform the Anscombe transform (Anscombe, 1948) on weekly moving averages of D_t, i.e. we consider D̃_t = 2(D^a_t + 3/8)^{1/2} with D^a_t = (1/7) Σ_{s=t−6}^{t} D_s. The transformed time series D̃_t is shown in Figure 7.

Having set the random seed to 1, we run NSP on D̃_t in Scenario 2, assuming piecewise linearity with N(0, σ^2) innovations. The variance σ^2 is estimated as in Section 5.1. We use random interval sampling, M = 1000, α = 0.1, τ_L = τ_R = 0. The procedure returns five intervals of significance, which are shown in Figure 7, together with estimated change-point locations obtained by running the procedure of Baranowski et al. (2019) within each interval of significance, under continuous piecewise-linearity, with the restriction that the number of change-points within each interval should be one. Although the final three intervals of significance are relatively long, this should not surprise, as the deviations from linearity within them are (visually) somewhat subtle. We note that while the change-point at t = 43 corresponds to an acceleration in the reduction of daily deaths, the opposite is true for the final detected change-point at t = 100. This brief illustration exemplifies the potential use of NSP in answering questions, posed in the context of the Covid-19 pandemic by the mainstream media, of whether perceived trend changes in Covid-19-related time series are spurious or "real".

We consider the average monthly property price P_t in the London Borough of Newham, for all property types, recorded from January 2010 to August 2020 (T = 125) and accessed on 8th September 2020. The data is available from https://landregistry.data.gov.uk/. We use the logarithmic scale Q_t = log P_t and are interested in the stability of the autoregressive model Q_t = b + aQ_{t−1} + Z_t. NSP, run on a deterministic equispaced interval sampling grid, with M = 1000 and α = 0.1, with the σ̂^2_MOLS estimator of the residual variance, and both with no overlap and with an overlap as defined in formula (24), returns a single interval of significance [31, 96], which corresponds to a likely change-point location between August 2012 and December 2017. Table 3 shows the estimated regression coefficients (with their standard errors) over the two sections. A goodness-of-fit analysis (not shown) reveals a satisfactory fit of this single-change-point model with Z_t modelled as i.i.d. Gaussian.

It appears that both the intercept and the autoregressive parameter change significantly at the change-point. In particular, the change in the autoregressive parameter from 1.03 (standard error 0.01) to 0.92 (0.02) suggests a shift from a unit-root process to a stationary one. This agrees with a visual assessment of the character of the process in Figure 8, where it appears that the process is more 'trending' before the change-point than after it, where it exhibits conceivably stationary behaviour. This is perhaps unsurprising given the fact that the first part of the observation period (Jan 2010 - Apr 2015) includes the year 2012, in which the borough served as the main host of the 2012 Olympic Games and experienced the associated regeneration. On the other hand, the second part (May 2015 - Aug 2020) includes the time of Britain's EU membership referendum and the initial stages of the Covid-19 pandemic, both of which can likely be associated with downward pressure on house prices.
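The segment-wise AR(1) fits summarised in Table 3 can be reproduced along the following lines; Q is the log-price series and cp denotes whichever change-point estimate within the detected interval [31, 96] is adopted (its exact value is not restated here).

# least-squares fit of Q_t = b + a * Q_{t-1} + Z_t on a given section q
fit_ar1 <- function(q) {
  d <- data.frame(y = q[-1], ylag = q[-length(q)])
  summary(lm(y ~ ylag, data = d))$coefficients   # intercept b and AR parameter a, with s.e.
}
fit_ar1(Q[1:cp])                 # first section
fit_ar1(Q[(cp + 1):length(Q)])   # second section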
We conclude with a brief discussion of a few speculative aspects of NSP.

Possible use of NSP in online monitoring for changes. NSP can in principle be used in the online setting, in which an 'alarm' should be raised as soon as Y starts deviating from linearity with respect to X. In particular, consider the following simple construction: having observed (Y_t, X_t), t = 1, . . . , T, successively run NSP on the intervals [T − 1, T], [T − 2, T], . . . , until either the first interval of significance is discovered, or [1, T] is reached. This will provide an answer to the question of whether the most recently observed data deviates from linearity and, if so, over what time interval (a minimal sketch of this loop is given at the end of this section).

Using and interpreting NSP in the presence of gradual change. If NSP is used in the absence of change-points but in the presence of gradual change, obtaining a significant interval means that it must (at global significance level α) contain some of the period of gradual change. However, this does not necessarily mean that the entire period of gradual change is contained within the given interval of significance. Note that this is the situation portrayed in Section 4.2, in which the simulation model used is a 'gradual change' model from the point of view of the NSP_0 method, but an 'abrupt change' model from the point of view of NSP_1 and NSP_2.

Possible use of NSP in testing for time series stationarity. It is tempting to ask whether NSP can serve as a tool in the problem of testing for second-order stationarity of a time series. In this problem, the response Y_t would be the time series in question, while the covariates X_t would be the Fourier basis. The performance of NSP in this setting will be reported in future work.

Does the principle of NSP extend to other settings? NSP is an instance of a statistical procedure which produces intervals of significance (rather than point estimators) as its output. It is an interesting open question to what extent this emphasis on "intervals of significance before point estimators" may extend to other settings, e.g. the problem of parameter inference in high-dimensional regression.
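Referring back to the online-monitoring construction above, a minimal R sketch is given below; deviation_significant() is a hypothetical placeholder for a routine (for instance, a single NSP significance check on the supplied interval) which returns TRUE if the linearity of y on x is significantly violated there.

monitor_latest <- function(y, x, deviation_significant) {
  T <- length(y)
  if (T < 2) return(NULL)
  for (s in (T - 1):1)                        # successively examine [T-1, T], [T-2, T], ...
    if (deviation_significant(y, x, s, T)) return(c(s, T))
  NULL                                        # no deviation from linearity found on any [s, T]
}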
References

Detecting multiple generalized change-points by isolating single ones
The transformation of Poisson, binomial and negative-binomial data
Near-optimal detection of geometric objects by fast multiscale methods
Estimating and testing linear models with multiple structural changes
Computation and analysis of multiple structural change models
Narrowest-Over-Threshold detection of multiple change-points and change-point-like features
An algorithm for a generalized maximum subsequence problem
Consistencies and rates of convergence of jump-penalized least squares estimators
The Dantzig selector: Statistical estimation when p is much larger than n
Detection with the scan and the average likelihood ratio
SiZer for exploration of structures in curves
Discussion of 'Multiscale change point inference' by Frick, Munk and Sieling
Multiple testing of local extrema for detection of change points
Localised pruning for data segmentation based on multiscale change point procedures
Local extremes, runs, strings and multiresolution
Nonparametric regression, confidence regions and regularization
Multiscale change point detection for dependent data
Multiscale testing of qualitative hypotheses
Multiscale inference about a density
Computing valid p-value for optimal changepoint by selective inference using dynamic programming
On the asymptotic behavior of self-normalized sums of random variables
A MOSUM procedure for the estimation of multiple random change points
Detection and estimation of local signals
Segmentation and estimation of change-point models: false positive control and confidence regions
Exact and efficient Bayesian inference for multiple changepoint problems
Multiscale change-point inference (with discussion)
Wild Binary Segmentation for multiple change-point detection
Tail-greedy bottom-up data decompositions and fast multiple change-point detection
Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection
An analysis of the real interest rate under regime shifts
Multiple change-point detection via a screening and ranking algorithm
Exact post-selection inference for the generalized lasso path
Post-selection inference for changepoint detection algorithms with application to copy number variation data
Optimal sparse segment identification with application in copy number variation analysis
Testing for a change in mean after changepoint detection
Extreme-value analysis of standardized Gaussian increments. Unpublished
Limiting distribution for the maximal standardized increment of a random walk
SiZer for jump detection
Multidimensional multiscale scanning in exponential families: limit theory and statistical consequences
Seeded intervals and noise level estimation in change point detection: A discussion of Fryzlewicz (2020)
Seeded binary segmentation: A general methodology for fast and optimal change point detection
Variational Estimators in Statistical Multiscale Analysis
FDR-control in multiscale change-point segmentation
A sharp error analysis for the fused lasso, with application to approximate changepoint screening
Tests for multiple changes in linear regression models
The limit distribution of the maximum increment of a random walk with regularly varying jump size distribution
Photonic imaging with statistical guarantees: From multiscale testing to multiscale estimation
Quantifying the uncertainty in change points
Nonparametric estimation of smooth regression functions
Circular binary segmentation for the analysis of array-based DNA copy number data
Heterogeneous change point inference
Invariance principles for adaptive self-normalized partial sums processes
Invariance principle under self-normalization for nonidentically distributed random variables
Hölder norm statistics for epidemic change
Bandwidth choice for nonparametric regression
Exact asymptotics for the scan statistic and fast alternatives
Confidence sets in change-point problems
Using the generalized likelihood ratio statistic for sequential detection of a change-point
Tail probabilities for the null distribution of scanning statistics
Adaptive piecewise polynomial estimation via trend filtering
Sparsity and smoothness via the fused lasso
Detecting 'disorder' in multidimensional random processes
Optimal and fast detection of spatial clusters with scan statistics
Univariate mean change point detection: Penalization, CUSUM and optimality
Statistically and computationally efficient change point localization in regression settings
Estimating the number of change-points via Schwarz' criterion

Acknowledgements

I wish to thank Yining Chen, Zakhar Kabluchko and David Siegmund for helpful discussions.