title: Multiscale change point detection via gradual bandwidth adjustment in moving sum processes authors: Levajkovic, Tijana; Messer, Michael date: 2021-03-01

A method for the detection of changes in the expectation of univariate sequences is provided. Moving sum processes are studied. These rely on the selection of a tuning bandwidth. Here, a framework to overcome bandwidth selection is presented - the bandwidth adjusts gradually. For that, moving sum processes are made dependent on both time and the bandwidth: the domain becomes a triangle. On the triangle, paths are constructed which systematically lead to change points. An algorithm is provided that estimates change points by subsequent consideration of paths. Strong consistency for the number and location of change points is shown. Simulations support estimation precision. A companion R-package mscp is made available on CRAN.

We contribute to the field of change point detection in stochastic sequences. Change point detection applies in various research areas, e.g., climatology (Reeves et al., 2007), speech recognition (Rybach et al., 2009), oceanography (Killick et al., 2010), neuroimaging (Aston and Kirch, 2012), virology (Kass-Hout et al., 2012) etc. We consider T univariate and independent random variables (RVs) X_1, ..., X_T that are piecewise identically distributed, with existing (2 + p)-th moments (p > 0), without parametric assumptions. Multiple change points in expectation form a set C. See Figure 1 (bottom) for an example with T = 200, X_i ~ N(μ, σ²), three change points C = {65, 105, 145}, and thus four sections with parameters μ = 1, 4, 1, −2 and σ = 1, 0.8, 1, 0.5, i.e., changes in σ may additionally occur when μ changes.
There is extensive literature that covers changes in expectation, e.g., methods based on likelihood ratios (Gombay and Horváth, 1994; Fang et al., 2020), empirical processes (Horváth and Shao, 2007; Holmes et al., 2013), U-statistics (Gombay and Horváth, 2002; Horváth and Hušková, 2005; Döring, 2010), least-squares (Lavielle and Moulines, 2000; Harchaoui and Lévy-Leduc, 2010) and many more. We mention methodology based on CUSUM-statistics, e.g., by Page (1954); Hinkley (1971); Berkes et al. (2006); Dehling et al. (2017). For a general overview of change point methods see the textbooks of Csörgő and Horváth (1997); Chen and Gupta (2000); Brodsky (2017). In this paper, we aim to tackle multiple change points that may occur on different time scales, as considered e.g., in Spokoiny (2009); Fryzlewicz (2014); Matteson and James (2014); Pein et al. (2017). We study moving sum processes (MOSUM), see e.g., Chu et al. (1995); Steinebach and Eastwood (1995); Antoch and Hušková (1999); Hušková and Slabý (2001). For that we select a window size (bandwidth) h ∈ {1, ..., ⌊T/2⌋} and define the MOSUM (D_{t,h})_t for index t = h, ..., T − h: for every time t consider two adjacent windows of size h, left {t − h + 1, ..., t} (index ℓ) and right {t + 1, ..., t + h} (index r), and set

D_{t,h} := √h · (μ̂_r − μ̂_ℓ) / (σ̂²_r + σ̂²_ℓ)^{1/2},   (1)

where μ̂_j and σ̂²_j denote the mean and empirical variance of the RVs whose indices lie in the windows, j ∈ {ℓ, r}. D_{t,h} is Welch's t-statistic for two samples of size h. We typically find |D_{t,h}| ≈ 0 if no change is involved, but |D_{t,h}| > 0 if there is a change nearby, t ≈ c ∈ C. Thus, change point estimates may be obtained by argmax-estimation, see e.g., Eichinger and Kirch (2018). A major challenge lies in the choice of the window size h.
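For illustration, the statistic in (1) can be computed directly; a minimal Python sketch (the paper's implementation is the R-package mscp; the function name and 0-based indexing convention here are our own):

```python
import numpy as np

def mosum_stat(x, t, h):
    """Welch-type MOSUM statistic D_{t,h} of Eq. (1): compares the means of
    the left window {t-h+1,...,t} and the right window {t+1,...,t+h},
    scaled by the pooled empirical variances (0-based array indexing)."""
    left = x[t - h:t]
    right = x[t:t + h]
    num = np.sqrt(h) * (right.mean() - left.mean())
    den = np.sqrt(left.var(ddof=1) + right.var(ddof=1))
    return num / den
```

Near a change point the absolute statistic is large, while it fluctuates around zero in stationary stretches.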
An h small enough is sensitive to rapid changes when it does not overlap subsequent change points, while a larger h improves detection power for small effects as more RVs are evaluated. But note that h too large may result in an overlap of subsequent changes and thus in an estimation bias. In order to account for change points that occur on multiple time scales, including rapid changes as well as small effects, methods that combine multiple windows were proposed, see e.g., Messer (2019); Cho and Kirch (2020). They work in two steps: first, change point candidates are generated for every single h, and afterwards all sets are merged giving final estimates. Despite improvements, the methods demand the selection of a window set that best accounts for the location of unknown change points. The aim of this paper is to provide a MOSUM framework that overcomes window selection, but nevertheless exploits multiple windows to address change point occurrences on multiple time scales, denoted the multi-scale change point detection algorithm (MSCP), see Algorithm 4.5. For that we extend the MOSUM perspective: instead of considering (D_{t,h})_t as a process of time t only, we let it depend on both t and h, i.e., (D_{t,h})_{(t,h)}, while the indices (t, h) lie in a triangle Δ_δ ⊂ R², see Figure 1 (top). Δ_δ is a right-angled and isosceles triangle; the hypotenuse is oriented as the lower edge and refers to the smallest window h = δ fixed. A higher horizontal slice refers to a larger h, and the upper vertex describes the largest h = ⌊T/2⌋. The triangular structure follows from the shrinkage of the possible t-indices h, ..., T − h when h increases. In Figure 1, D_{t,h} is color-coded with D_{t,h} ≈ 0 green, > 0 red, and < 0 blue. At c_1 = 65 there is an increase in μ and thus D_{t,h} > 0, while at c_2 = 105 and c_3 = 145 a decrease in μ yields D_{t,h} < 0, at least when h is not too large (h < 40) such that only a single c_u is overlapped.
The upper part of Δ_δ refers to larger h that result in an overlap of multiple change points. The area between c_1 and c_2 is green also for h large, at (t, h) ≈ (85, 70), because the parameters in the first and third section coincide and thus the effects cancel out. In contrast, the area between c_2 and c_3 is dark blue for h large, at (t, h) ≈ (120, 70), as both changes at c_2 and c_3 are negative, which amplifies the effect, and as a consequence simple argmax estimation would be flawed. Importantly, note that for smaller h large values of |D_{t,h}| concentrate around c_u ∈ C. MSCP subsequently acts on subsets of Δ_δ by locally exploiting (D_{t,h})_{(t,h)}. The key ingredient is the construction of a zigzag-path, see Figure 2 (magenta): given a starting value (t_s, h_s) (pink circle), the path leads towards the lower edge of Δ_δ to some point (ĉ, δ), and ĉ functions as a change point estimate. The path evolves according to stepwise local argmax-estimation: in each instance the path moves one step downwards, i.e., h switches to h − 1 non-randomly. Then the path moves either one step left or right, or it stays, i.e., t switches to some value in {t − 1, t, t + 1}, while the choice falls on the value that locally maximizes |D_{t,h}|.

Figure 2: Representation of areas of attraction A_u (blue), inner sets B_u (red), cones K_{c_u} (orange) for three change points c_u, the remainder R (green), and a zigzag-path (magenta).

The idea of MSCP is the following, see also Figure 3: given a set S ⊂ Δ_δ of possible starting values (pink circles), the one maximizing h^{−1/2}·|D_{t,h}| (i.e., the strongest 'signal to noise ratio') is the first starting point considered. Then its path delivers the first change point estimate ĉ. In order to avoid false positives, a breaking criterion is evaluated, which is computed from D_{t,h} on the path. If breaking is not demanded, then ĉ is accepted.
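The stepwise descent just described can be sketched as follows (illustrative Python, not the mscp implementation; `D` is assumed to be a callable evaluating the statistic at integer pairs (t, h)):

```python
def zigzag_path(D, t_start, h_start, delta=1):
    """Zigzag downpath: from (t_s, h_s), decrease h by 1 in every step and
    move t to the local argmax of |D(t, h)| among {t-1, t, t+1}, until the
    smallest bandwidth delta is reached."""
    t, h = t_start, h_start
    path = [(t, h)]
    while h > delta:
        h -= 1                                   # non-random step downwards
        # horizontal move maximizing the absolute statistic
        t = max((t - 1, t, t + 1), key=lambda s: abs(D(s, h)))
        path.append((t, h))
    return path  # path[-1][0] serves as the change point estimate c_hat
```

Since |D| forms a peak around a change point for small h, the descent drifts one step per level towards that peak.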
In order to avoid multiple detections of the same change point, all elements of S whose paths could lead to ĉ are deleted from S ('cut out cone'). Then estimation restarts, and iteratively change points are detected until the breaking criterion applies. In Figure 3 the algorithm estimates c_3, c_1 and c_2, and then the breaking criterion applies for the next candidate. For the process in Figure 4, MSCP yields the estimates 145, 63 and 105, i.e., the number is correct, the smallest estimate at 63 is close to c_1 = 65, and the other two estimates even hit the true c_2 = 105 and c_3 = 145. MSCP is made available in the R-package mscp on CRAN (Levajkovic and Messer, 2021), which includes summary and plotting routines. We stress three important upsides of MSCP: first, the method is non-parametric with weak distributional assumptions and thus allows the analysis of a wide variety of data. The consideration of additional changes in variance could be helpful in practice, as e.g., an increase in the mean might be accompanied by an increase in volatility. Second, the problem of window selection is overcome. Third, the gradual window adjustment improves the detection of change points on multiple time scales and with different effects. Estimation precision is supported in simulation studies by investigating different change point scenarios and effect sizes. Various distributions are considered, including normal, gamma, Poisson and binomial, and their combinations. Also, a reasonable segmentation is shown for the evolution of COVID-19 cases in Zimbabwe in 2020. The paper is organized as follows: In Section 2 we specify the model, Δ_δ and D_{t,h}. In Section 3 we study properties of (D_{t,h})_{(t,h)∈Δ_δ} which are used for change point detection. In Section 4 we construct the zigzag-path and show that it yields proper estimates. We introduce the set of starting points S, define MSCP, and state consistency.
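The full iteration (strongest starting point first, descent, breaking criterion, cone cut-out) can be mimicked in a few lines; a simplified Python sketch, where the cone removal is approximated by an interval condition and all names are illustrative:

```python
def mscp_sketch(D, starts, delta, threshold):
    """Simplified MSCP loop: repeatedly start a zigzag descent from the
    remaining starting point with the largest h^{-1/2}|D(t,h)|, accept the
    endpoint as a change point unless the breaking criterion applies, and
    remove all starting points whose paths could reach the new estimate.
    The true cut-out region is a cone in the triangle; here it is
    approximated by |t_s - c_hat| <= h_s + delta."""
    starts = set(starts)
    estimates = []
    while starts:
        t_s, h_s = max(starts, key=lambda p: abs(D(p[0], p[1])) / p[1] ** 0.5)
        t, h = t_s, h_s
        best = abs(D(t, h))
        while h > delta:                         # zigzag descent
            h -= 1
            t = max((t - 1, t, t + 1), key=lambda s: abs(D(s, h)))
            best = max(best, abs(D(t, h)))
        if best < threshold:                     # breaking: no signal on path
            break
        estimates.append(t)
        starts = {(ts, hs) for (ts, hs) in starts if abs(ts - t) > hs + delta}
    return estimates
```

With a single peaked signal, the loop returns one estimate and then breaks once only remainder-region starting points are left.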
In Section 5 we discuss the tuning parameters (δ, S and the breaking criterion), and present the simulations and the data example. Proofs are given in the Appendix (Section 6).

Let (Ω, A, P) be a probability space and p > 0. We consider a triangular scheme: let (Z_{u,i})_{u,i} for u ∈ {1, 2, ..., |C| + 1} and i = 1, 2, ... be independent RVs in L^{2+p}(Ω, A, P) with E[Z_{u,i}] = 0 and Var(Z_{u,i}) = 1, and let (Z_{u,i})_{i=1,2,...} be an i.i.d. sequence for each u. Further let μ_1, μ_2, ..., μ_{|C|+1} ∈ R with μ_u ≠ μ_{u+1}, and σ²_1, σ²_2, ..., σ²_{|C|+1} positive. For n ∈ {1, 2, ...} set

X_i := μ_u + σ_u · Z_{u, i − ⌊nc_{u−1}⌋}   for ⌊nc_{u−1}⌋ < i ≤ ⌊nc_u⌋,

with c_0 := 0 and c_{|C|+1} := T. Dependence of X_i on n is suppressed for simplicity. The process X := (X_i)_{i=1,2,...,nT} has |C| change points. In the section from ⌊nc_{u−1}⌋ + 1 to ⌊nc_u⌋ it has expectation μ_u and variance σ²_u, which then change to μ_{u+1} and σ²_{u+1}. The case n = 1 is considered the real-time scenario. Throughout, asymptotics are studied letting n → ∞, i.e., the observation regime and all change points increase linearly. Given T, the set of processes X constitutes the model M. For all u, we denote the first two moments μ^1_u := μ_u and μ^2_u := σ²_u + μ²_u.

Local estimators and MOSUM. For (t, h) ∈ Δ_δ we consider indices interpreted as left and right window, I_ℓ^{(n)} := {⌊nt⌋ − ⌊nh⌋ + 1, ..., ⌊nt⌋} and I_r^{(n)} := {⌊nt⌋ + 1, ..., ⌊nt⌋ + ⌊nh⌋}. We set local estimators of the moments for j ∈ {ℓ, r} via

μ̂^k_j := (1/⌊nh⌋) Σ_{i ∈ I_j^{(n)}} X_i^k   and   σ̂²_j := μ̂²_j − (μ̂^1_j)²,

and abbreviate μ̂_j := μ̂^1_j. Dependence on t and h is suppressed to avoid overload. Then, analogously to (1), we define the rescaled MOSUM D^{(n)}_{t,h} based on the windows I_ℓ^{(n)} and I_r^{(n)}.

Figure 5: The windows I_j^{(n)}, j ∈ {ℓ, r} (A); for ⌊nt⌋ and ⌊nh⌋ they appear within a factor-n-enlarged triangle (B), which is rescaled to Δ_δ (C).

For K ⊂ R^k let D_R[K] denote the set of R-valued functions on K, which in case of k = 1 are càdlàg, and in case of k = 2 càdlàg with respect to each component when the other component is fixed. D_R[K] is endowed with the uniform distance ||·||_∞. As asymptotics yield almost surely (a.s.)
continuous limits, there is no need to evoke the Skorokhod topology.

Strong consistency. We state strong convergence of the estimators. We define their limits pointwise for (t, h) ∈ Δ_δ. They are weightings of the theoretical moments depending on the change point locations relative to the window. Consider the left window (t − h, t] and assume that the change points that are overlapped are C_ℓ ⊂ C, i.e., we find t − h < c_{ℓ,1} < c_{ℓ,2} < ... < c_{ℓ,|C_ℓ|} ≤ t, and c_{ℓ,u} denotes the u-th smallest change point in C_ℓ. Set c_{ℓ,0} := t − h and c_{ℓ,|C_ℓ|+1} := t. For u = 1, 2, ..., |C_ℓ| + 1 set the distance d_{ℓ,u} := c_{ℓ,u} − c_{ℓ,u−1} between adjacent change points. The moments related to these sections are abbreviated by μ^k_{ℓ,u} and σ²_{ℓ,u}, see Figure 6 (left). From this we set

μ̃^k_ℓ := (1/h) Σ_{u=1}^{|C_ℓ|+1} d_{ℓ,u} · μ^k_{ℓ,u},   (5)

with μ̃_ℓ := μ̃^1_ℓ. The limits weight the moments with the time d_{ℓ,u} spent in a section relative to the window length h = Σ_u d_{ℓ,u}. Further we set

σ̃²_ℓ := (1/h) Σ_{u=1}^{|C_ℓ|+1} d_{ℓ,u} · σ²_{ℓ,u}   and   σ̄²_ℓ := μ̃²_ℓ − (μ̃^1_ℓ)²,   (6)

and note that σ̄²_ℓ = σ̃²_ℓ + e²_ℓ, i.e., σ̄²_ℓ weights not only the variances σ²_{ℓ,u} as in σ̃²_ℓ, but also picks up an additional error term e²_ℓ := σ̄²_ℓ − σ̃²_ℓ. The error accounts for the violation of change points in the calculation of the mean within the computation of the empirical variance - the mean is calculated from all data in the window, 'violating' unknown change points. It is e²_ℓ ≥ 0. If no change points are overlapped, C_ℓ = ∅, then the limits simplify to the single theoretical moments μ̃^k_ℓ = μ^k_{ℓ,1} and σ̃²_ℓ = σ̄²_ℓ = σ²_{ℓ,1}, and also e²_ℓ = 0. Analogously we define the limits μ̃^k_r and σ̄²_r for the right window (t, t + h] by considering the change points C_r whose ordered elements c_{r,u} fulfill t < c_{r,1} < c_{r,2} < ... < c_{r,|C_r|} ≤ t + h. If C = ∅, then for all (t, h) ∈ Δ_δ we obtain the population parameters μ̃_j = μ, σ̄²_j = σ² etc. In the following we state strong consistency of the estimators.
The result follows from the Marcinkiewicz-Zygmund SLLN; for the rates see Figure 6 (right): for k = 1 we cannot reach 1/2, noting the CLT and the LIL, and for k = 2 the rate improves with the existence of higher moments via p, but only for p < 2 while p/(2 + p) < 1/2, i.e., p ≥ 2 brings no improvement. Any rate for k = 2 is also valid for k = 1.

The systematic component of (D^{(n)}_{t,h}) is obtained by replacing the estimators with their limits: we set d^{(n)}_{t,h} := ⌊nh⌋^{1/2} · (μ̃_r − μ̃_ℓ) / (σ̄²_r + σ̄²_ℓ)^{1/2} and conclude from Lemma 3.1 that n^{−1/2}·(D^{(n)}_{t,h} − d^{(n)}_{t,h}) vanishes a.s. uniformly on Δ_δ. The systematic component has an analogous shape, but smoother, as noise is canceled out. We will construct the zigzag-path from D^{(n)}_{t,h} and deduce its properties from a non-random path based on d^{(n)}_{t,h}.

Considering a single change point. We describe (d^{(n)}_{t,h})_t for fixed h as a function of t. Let c_u ∈ C. We assume h small such that only c_u is overlapped by the double-window when it is near c_u. We formalize the shape of (d^{(n)}_{t,h})_t in Lemma 3.3, which will be used to show that the zigzag-path is pushed towards c_u when it comes close to it; see Figure 8E, in which the blue and red lines indicate a uniform bound for the derivative. We set v_{t,h} := [(σ̄²_r + σ̄²_ℓ)/(σ̃²_r + σ̃²_ℓ)]^{1/2}, recalling σ̄²_j = σ̃²_j + e²_j, see (5) and (6), and denote by ∧ the minimum and by ∨ the maximum.

Lemma 3.3 states that (d^{(n)}_{t,h})_t is strictly monotone towards c_u: for μ_{u+1} > μ_u it is strictly increasing for t ∈ [c_u − h, c_u] and strictly decreasing for t ∈ (c_u, c_u + h] (8), with constants 0 < κ_a < κ_b < ∞ bounding the derivative as in (9). For (v_{t,h}·d^{(n)}_{t,h})_t six cases are differentiated: for μ_{u+1} > μ_u it is, depending on whether σ²_{u+1} = σ²_u, σ²_{u+1} > σ²_u or σ²_{u+1} < σ²_u, linear, strictly convex or strictly concave and strictly increasing for t ∈ [c_u − h, c_u], and linear, strictly convex or strictly concave and strictly decreasing for t ∈ (c_u, c_u + h]. For μ_{u+1} < μ_u, the expressions hold true, but with 'convex' and 'concave' as well as 'increasing' and 'decreasing' interchanged; see Figure 8D and E, which show v_{t,h}·d^{(n)}_{t,h} and d^{(n)}_{t,h} together with lines of slopes ∓(nh)^{1/2}·κ_a (blue) and ∓(nh)^{1/2}·κ_b (red). Figure 8 shows the statistics that factor into (d^{(n)}_{t,h})_{t∈τ_h}. It is μ_2 > μ_1 and σ²_2 > σ²_1. The function μ̃_r − μ̃_ℓ has the shape of a hat (A). It is positive as μ_2 > μ_1. The function σ̃²_r + σ̃²_ℓ is linear (B). It is increasing as σ²_2 > σ²_1.
The error e²_r + e²_ℓ describes two parabolas and vanishes outside the h-neighborhood of c := c_u and also at c (C). The error is maximal at c ∓ h/2, which are the points where either the right or the left window is divided in half and thus equally shares the left and right population. Panel D shows v_{t,h}·d_{t,h}, which is (nh)^{1/2} times (A) divided by the square-root of (B). As μ_2 > μ_1, it is strictly increasing in [c − h, c] and strictly decreasing on (c, c + h]. Panel E shows d^{(n)}_{t,h}, which is bounded in absolute value by v_{t,h}·|d^{(n)}_{t,h}|, as v_{t,h} ≥ 1. Note that in Lemma 3.3 it is 0 < κ_a < κ_b < ∞, and κ_a, κ_b depend only on the population parameters, such that the bounds for the derivative in (9) depend on h but not on t. The upper bound (nh)^{1/2}·κ_b is a Lipschitz constant for (d^{(n)}_{t,h})_t. Figure 8E shows lines with slopes ∓(nh)^{1/2}·κ_a (blue) and ∓(nh)^{1/2}·κ_b (red), i.e., bounds for the slope of (d^{(n)}_{t,h})_t.

Weak convergence. We state weak convergence of (D^{(n)}_{t,h})_{(t,h)∈Δ_δ}, see Proposition 3.4 in the Appendix.

Segmentation of the triangle. We specify regions of Δ_δ. Each c_u ∈ C has an area of attraction A_u, see Figure 2 (blue). For all (t, h) ∈ A_u the associated double window overlaps c_u but no other change point. We show that a path that starts in A_u is systematically pushed towards c_u, such that its endpoint becomes a proper estimate for c_u. We also set the inner sets B_u ⊂ A_u, see Figure 2 (red), for whose elements we find the 't-distance' to c_u smaller than the 'h-distance' (plus 1) to the bottom of Δ_δ. For any c_u we consider the cone K_{c_u}, see Figure 2 (orange). K_{c_u} consists of all (t, h) ∈ Δ_δ for which the double window overlaps c_u and possibly neighboring change points. It holds A_u ⊂ K_{c_u}. Finally, we consider the remainder R := Δ_δ \ ∪_{u=1}^{|C|} K_{c_u}, that consists of the (t, h) ∈ Δ_δ for which no c_u is overlapped by the double window, such that d^{(n)}_{t,h} = 0.

The output of the descent is the zigzag downpath (t_k, h_k)_k with endpoint (t̂^{(n)}_e, δ). We write t_e and accordingly t̂^{(n)}_e, noting that the path based on d is independent of n, as n^{−1/2}·|d^{(n)}_{t,h}| does not depend on n. A path starting in A_u systematically tends towards c_u. Consequently, starting in A_u yields detection of c_u up to a distance of δ − 1.
More precisely, for (t_s, h_s) ∈ B_u we find lim_{n→∞} t̂^{(n)}_e = c_u a.s., and for (t_s, h_s) ∈ A_u \ B_u it is lim_{n→∞} |t̂^{(n)}_e − c_u| ≤ δ − 1 a.s.

Starting points. We call a set S ⊂ N² ∩ Δ_δ a sufficient set of starting points if S ∩ A_u ≠ ∅ for all u = 1, ..., |C|, i.e., for all c_u ∈ C there lies a starting point in A_u. For a mesh size g ∈ N set a grid S_g := (g·N²) ∩ Δ_δ, see Figure 3A (pink dots). Lemma 4.4. Let X ∈ M and δ < δ_C/2. A grid S_g is sufficient if g ≤ ⌊δ_C/2⌋. The proof applies geometry. When a path enters A_u, then c_u is detected up to a distance of δ − 1, see Proposition 4.3. Thus, when considering all paths starting in S, all c ∈ C will be detected. We need to avoid multiple detections of any c as well as false positives.

Change point detection. Set min ∅ := ∞, and for a set F denote by U(F) a uniformly sampled element. The parameters κ and β yield a threshold κ·n^β (in 3b.), and if chosen large then this supports breaking the algorithm. Recall E[|X_1|^{2+p}] < ∞. β is related to p via v, see Figure 6 (right). Small p forces β close to 1/2, and a larger p allows for smaller β. The parameter β is redundant if n = 1. In 1., the maximum is a.s. unique if the RVs of X do not have point masses. In general, uniform sampling yields well-definedness. Note that we allow an error up to δ − 1, and if the distance of t̂^{(n)}_{e,m} to c_u is positive, then a slightly shifted cone was cut. This motivates step 3a. Also note that if all c ∈ C are detected and nearby cones are cut (again mentioning 3a.), then what is eventually left over is the remainder R. Thus, on the next path the objective function will no longer take extreme values which support breaking in 3b., and the previous estimates are returned. Finally, note that the breaking criterion 3c. applies if the distance between estimates falls below the minimal admissible distance between change points, also accounting for estimation precision only up to δ − 1. Asymptotically, this will not occur, so 3c.
is obsolete for theoretical consistency, but may be used in practice. We denote by ĉ^{(n)}_u the u-th smallest element of Ĉ^{(n)}. For well-definedness, if |Ĉ^{(n)}| < |C| set ĉ^{(n)}_u := 0 for all u = |Ĉ^{(n)}| + 1, ..., |C|. The algorithm succeeds:

Theorem 4.6. Let δ_C/2 > δ. In Algorithm 4.5 let the input S be sufficient. Then for the output Ĉ^{(n)} it holds a.s. as n → ∞ that |Ĉ^{(n)}| = |C| and max_{u=1,...,|C|} |ĉ^{(n)}_u − c_u| ≤ δ − 1 eventually.

Location estimation is correct up to δ − 1. The choice δ = 1 states strong consistency. Recall that δ_C/2 > δ is needed for sufficiency of S = S_g in Lemma 4.4. Also, it implies δ_C > 2(δ − 1), i.e., neighboring change points should be separated by more than the worst errors of their estimators. The proof applies Proposition 4.3 for the estimation of C, and Corollary 3.2 for correct breaking.

Parameter choice. We give recommendations for the choice of S, κ and δ in practice, where n = 1. Then the threshold is κ·n^β = κ, i.e., β is redundant.

1. Choice of κ: By Proposition 3.4, under H_0: C = ∅ it holds weakly that sup_{(t,h)∈Δ_δ} |D^{(n)}_{t,h}| → sup_{(t,h)∈Δ_δ} |L_{(t,h)}|, and we choose κ as the (1 − α)-quantile of the limit distribution (e.g., significance level α = 0.01), which can be derived in simulations, compare e.g., Messer et al. (2014). If |Ĉ^{(1)}| > 0, then H_0 is rejected at level ≤ α, as the maximum w.r.t. a path is bounded by the supremum over Δ_δ.

2. Choice of S: Lemma 4.4 states sufficiency of S = S_g if g ≤ ⌊δ_C/2⌋ (in case δ < δ_C/2). Thus, if δ_C is known, it is reasonable to choose g = ⌊δ_C/2⌋, as g large reduces computational complexity. If δ_C is unknown, we set g = δ, or even smaller if computational complexity allows for it.

3. Choice of δ: We recommend δ = 20: by construction, asymptotic considerations imply an increase of the double windows. For n → ∞, weakly D^{(n)}_{t,h} → L_{t,h} ~ N(0, 1) for all (t, h) ∈ Δ_δ, but for n = 1 small h means fewer observations, and the approximation via N(0, 1) is harder to justify.
Practically, for tiny h there is high variability in μ̂_j and σ̂²_j, and thus D_{t,h} is likely to take extreme values even if no change is involved, resulting in false positives. Thus, δ should be bounded from below. Figure 9 shows simulations of the rejection probability under H_0, i.e., P(|Ĉ^{(1)}| > 0) if C = ∅, for both α = 0.05 and α = 0.01, depending on δ. Over a variety of distributions, including normal, exponential, gamma, binomial and Poisson, for δ ≥ 20 the α-level is kept throughout.

Figure 9: Rejection probability under H_0 depending on δ, for T = 1000; left: α = 5%, right: α = 1%. Six distributions color-coded: N(0,1) (magenta), Pois(1) (orange), exp(1) (green), b(10, 1/2) (black), gamma(0.5, 2) (blue) and gamma(2, 2) (red).

In the following we use δ = g = 20, and κ is derived in simulations for α = 0.01.

Simulation studies. We evaluate the performance of MSCP in simulation studies. Throughout it is T = 1000 and |C| = 5. We consider different scenarios of locations c_u ∈ C and parameters μ_u and σ_u, given in Table 10. E.g., 1a states certain μ_u, which are halved in 1c. Also, in 1a it is σ_u = 1 constant, while in 1b the σ_u alternate between 1 and 2. For each scenario we consider different distributions: A. N(μ, σ²) (normal), B. gamma(s, λ) (gamma), C. Poi(λ) (Poisson), D. b(10, p) (binomial, n = 10 fixed), and E. a combination of the previous that changes between the six sections using A., B., C., D., A., B. (mix). Note that for Poi(λ) and b(10, p) the formulation of σ_u can be disregarded as it follows from μ_u; also, if distributions coincide due to equality of μ_u, as e.g., in 1a and 1b, they are presented only once.
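The simulation-based derivation of the threshold κ mentioned above can be sketched as follows (illustrative Python; the paper works with the R-package mscp, and the grid coarsening here is only for speed):

```python
import numpy as np

def welch_stat(x, t, h):
    """MOSUM statistic of Eq. (1) at time t with bandwidth h (0-based)."""
    l, r = x[t - h:t], x[t:t + h]
    return np.sqrt(h) * (r.mean() - l.mean()) / np.sqrt(l.var(ddof=1) + r.var(ddof=1))

def simulate_kappa(T=200, delta=20, alpha=0.05, n_sim=200, seed=1):
    """Monte Carlo (1 - alpha)-quantile of sup_{(t,h)} |D_{t,h}| under H0
    (no change), the supremum taken over a coarse grid of the triangle."""
    rng = np.random.default_rng(seed)
    sups = []
    for _ in range(n_sim):
        x = rng.normal(size=T)
        m = 0.0
        for h in range(delta, T // 2 + 1, 10):      # coarse bandwidth grid
            for t in range(h, T - h + 1, 5):        # coarse time grid
                m = max(m, abs(welch_stat(x, t, h)))
        sups.append(m)
    return float(np.quantile(sups, 1 - alpha))
```

Any estimate |Ĉ| > 0 obtained with this κ then corresponds to a rejection of H_0 at level roughly α, up to grid coarsening and Monte Carlo error.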
Table 10: Simulation scenarios.

Scenario | C                       | μ_u                    | σ_u
1a       | 100, 300, 500, 700, 900 | 1, 4, 1, 8, 1, 4       | 1, 1, 1, 1, 1, 1
1b       | 100, 300, 500, 700, 900 | 1, 4, 1, 8, 1, 4       | 1, 2, 1, 2, 1, 2
1c       | 100, 300, 500, 700, 900 | 0.5, 2, 0.5, 4, 0.5, 2 | 1, 1, 1, 1, 1, 1
2a       | 300, 400, 500, 600, 700 | 1, 4, 1, 8, 1, 4       | 1, 1, 1, 1, 1, 1
2b       | 300, 400, 500, 600, 700 | 1, 4, 1, 8, 1, 4       | 1, 2, 1, 2, 1, 2
2c       | 300, 400, 500, 600, 700 | 0.5, 2, 0.5, 4, 0.5, 2 | 1, 1, 1, 1, 1, 1
3a       | 200, 500, 550, 600, 750 | 1, 4, 1, 8, 1, 4       | 1, 1, 1, 1, 1, 1
3b       | 200, 500, 550, 600, 750 | 1, 4, 1, 8, 1, 4       | 1, 2, 1, 2, 1, 2
3c       | 200, 500, 550, 600, 750 | 0.5, 2, 0.5, 4, 0.5, 2 | 1, 1, 1, 1, 1, 1
3d       | 200, 500, 550, 600, 750 | 0.5, 2, 0.5, 4, 0.5, 2 | 1, 2, 1, 2, 1, 2
3e       | 200, 500, 550, 600, 750 | 1, 2, 4, 8, 4, 2       | 1, 1, 1, 1, 1, 1

We summarize the results. In the well-posed scenarios 1a-c we obtain precise estimates: Ĉ_T is close to 5000 over all scenarios and distributions. Notably, in 1a it is exactly Ĉ_10 = 5000 throughout, and Ĉ_5 = 5000 for three distributions. Also, the strict measure Ĉ_2 reveals precise results. In 1b we consider varying σ_u, but the estimates remain similarly precise. In 1c the jump sizes are halved, but we still detect a high number of change points at reasonable precision over all distributions. Setup 2a-c is again well-posed in the sense that results are strong: 2a and 2b are pretty similar to 1a and 1b, and in 2c there is some loss over 1c. In setup 3a-c performance is reduced a bit as compared to 2a-c due to closer distances. Nevertheless, we obtain about 5000 estimates at precise locations for 3a and 3b, and about 4800 estimates for 3c (Poisson reduced to ≈ 4400). In scenario 3d we reduced the effects w.r.t. both smaller jump sizes and higher variances, and we now see performance losses as we obtain only around 3000 estimates. But note that, e.g., for the change at c_5 it is μ_6 − μ_5 = 1.5 with σ_6 = 2, which is reasonably hard to detect, also taking into account the smaller spacings between c_u.
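The precision measures Ĉ_ε used above (counts of estimates within distance ε of a true change point, aggregated over simulation runs) can be computed per run as, e.g. (illustrative Python; the exact bookkeeping in the paper may differ):

```python
def count_hits(estimates, truth, eps):
    """Number of estimated change points lying within distance eps of
    some true change point (cf. the precision measures C_hat_eps)."""
    return sum(min(abs(e - c) for c in truth) <= eps for e in estimates)
```

Summing this over 1000 runs with |C| = 5 gives the maximal score of 5000 reported in the text.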
Despite the reduced number, note that precision for the detected estimates is kept high. Scenario 3e is another example with different jump sizes resulting in a good overall outcome, but slightly weaker than 3a, as effects are reduced similarly to 3c. Overall, MSCP yields reliable results over different scenarios of change points and effects: both the number and the locations are reliably estimated. Notably, performance is also kept over different distributions and regardless of changes in variance.

We apply MSCP to the daily number of new COVID-19 cases in Zimbabwe between 2020/3/3 and 2020/12/14, yielding T = 269 time stamps, as reported in (Ministry of Health and Child Care of Zimbabwe, 2021). The result is depicted in Figure 14. Four change points were estimated, Ĉ = {68, 110, 173, 241}, yielding parameter estimates μ̂_u ≈ …, 17, 104, 22, 85 and σ̂_u ≈ 1, 18, 83, 16, 35. The segmentation closely aligns with visual inspection. Note that an increase in mean is accompanied by an increase in variance, and vice versa. In each section, serial correlation proves to be small (i.e., ≤ 0.2; in the last section it is ≈ 0.4 for the first time lag and negligible for higher lags).

Proof of Lemma 3.1: W.l.o.g. let j = ℓ. For μ̂^k first let C_ℓ = ∅, where we need to show uniform convergence as in (13). For this, first apply the Marcinkiewicz-Zygmund (MZ)-SLLN to (X_i)_i and (X²_i)_i: for (X_i)_i note that for any α ∈ (0, 2) it is E[|X_i|^α] < ∞, and thus (14) holds for k = 1 a.s. as n → ∞. For the squares (X²_i)_i we choose α ∈ (0, (2 + p)/2] if p < 2 and α ∈ (0, 2) if p ≥ 2. Then E[|X²_i|^α] < ∞, and thus (14) applies for k = 2. As (α − 1)/α ≤ p/(2 + p) if p < 2, and (α − 1)/α < 1/2 if p ≥ 2, we obtain (15) for k = 1, 2 a.s. as n → ∞. For uniform convergence in (13) we discretize time and apply (15), a.s. as n → ∞, as r_n := n^v · sup_{t∈[0,T]} |t − ⌊nt⌋/n| · μ^k ≤ n^v · (1/n) · μ^k → 0 as n → ∞, and for ε > 0 a.s.
for n large enough, as the maximum in the first summand is independent of n, and for the second summand note (m/⌊nT⌋)^{1−v} ≤ 1 and (15) yields that the expression vanishes for almost every ω. This yields the claim in (13) a.s. as n → ∞. We include the factor 1/h ≥ 2/T > 0 and obtain the statement for the windowed means. For C_ℓ ≠ ∅ we decompose the sum into subsections: the u-th summand refers to a subsection that relates to the error sequence (Z_{u,i})_{i=1,2,...} and in which X_i equals μ_u + σ_u·Z_{u,i}. From (16) we conclude a.s. as n → ∞ uniform convergence w.r.t. all subintervals of varying length d. When including the factor d/h ≤ 1 and summing over all u, the expression still vanishes a.s. as n → ∞, and it states an upper bound for n^v · sup_{(t,h)∈Δ_δ} |μ̂^k − μ̃^k|. Averages in the subsection tend to σ²_{ℓ,u} + 0 + (μ̃_ℓ − μ_{ℓ,u})², and summation over subsections yields a.s. as n → ∞ that n^v · sup_{(t,h)∈Δ_δ} |σ̂²_ℓ − σ̄²_ℓ| → 0, as before using the MZ-SLLN and discretization arguments.

Proof of Lemma 3.3: Continuity is inherited from the limits in (5) and (6). W.l.o.g. consider c_1 =: c. As v_{t,h} ≥ 1, with equality at c, it is v_{t,h}·|d^{(n)}_{t,h}| ≥ |d^{(n)}_{t,h}|. μ̃_r − μ̃_ℓ has the shape of a hat: it is zero outside the h-neighborhood of c, it is μ_2 − μ_1 at c, and it is linearly interpolated in between. σ̃²_r + σ̃²_ℓ is 2σ²_1 left of c − h, and 2σ²_2 right of c + h, and linearly interpolated in between. Thus, in the h-neighborhood of c the root (σ̃²_r + σ̃²_ℓ)^{1/2} is constant if σ²_2 = σ²_1, strictly convex if σ²_2 > σ²_1 and strictly concave if σ²_2 < σ²_1. As the numerator is piecewise linear, i.e., of order t, and the denominator of order t^{1/2}, the statements about the curvatures of v_{t,h}·d^{(n)}_{t,h} hold true. We now turn to d^{(n)}_{t,h}. We represent σ̄²_j through σ̃²_j plus errors for j ∈ {ℓ, r}, i.e., we find the error as e²_j := [d_1·d_2/h²]·(μ_2 − μ_1)², where we abbreviated d_u := d_{j,u}. Note that e²_ℓ = e²_r, and set x := d_1/h, such that d_2/h = (h − d_1)/h = 1 − x with x ∈ [0, 1], which yields a representation through the proportions of the window left and right of c.
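The error term e²_j can be checked numerically; a small Python sketch in which both sections are constant (σ = 0), so that the empirical variance of the pooled window consists of the error term x(1 − x)·(μ_2 − μ_1)² only:

```python
import numpy as np

# A window of length h overlapping one change point, with proportion
# x = d1/h left of the change. With constant sections, the empirical
# variance of the pooled window equals e^2 = x(1-x)*(mu2 - mu1)^2 exactly.
h, d1 = 100, 30
mu1, mu2 = 1.0, 4.0
window = np.concatenate([np.full(d1, mu1), np.full(h - d1, mu2)])
x = d1 / h
e2 = x * (1 - x) * (mu2 - mu1) ** 2
print(np.var(window), e2)  # both ≈ 1.89
```

With nonzero section variances, the weighted variances σ̃² simply add on top of this error term, illustrating σ̄² = σ̃² + e².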
Note that the error e²_j = x(1 − x)·(μ_2 − μ_1)² is quadratic and maximal for x = 1/2, thus taking the value (μ_2 − μ_1)²/4, which is plausible as half of the window refers to the left and the other half to the right population. Within both [c − h, c] and (c, c + h] it holds that μ̃_r − μ̃_ℓ is linear in x, and the denominator of d^{(n)}_{t,h} is the root of, first, the function σ̃²_r + σ̃²_ℓ, which is linear in x, plus, second, the quadratic error e²_j. W.l.o.g. we consider t ∈ [c − h, c], where the right window contains c. Then we find d^{(n)}_{t,h} as a function f of x as

f(x) = √(nh) · (μ_2 − μ_1)·x / [(σ²_2 − σ²_1)·x + 2σ²_1 + (μ_2 − μ_1)²·x·(1 − x)]^{1/2}

for all valid h ∈ (δ, T/2], which yields the derivative w.r.t. x

f′(x) = √(nh) · (μ_2 − μ_1) · { 2^{−1}·[(σ²_2 − σ²_1) + (μ_2 − μ_1)²]·x + 2σ²_1 } / [(σ²_2 − σ²_1)·x + 2σ²_1 + (μ_2 − μ_1)²·x·(1 − x)]^{3/2}.   (19)

In (19) we see that the fraction is positive and thus f′(x) > 0 if μ_2 > μ_1 and f′(x) < 0 if μ_2 < μ_1, which gives (8). To bound |f′(x)| we find for the numerator the bound ≥ 2^{−1}·[σ²_2 + 3σ²_1 + (μ_2 − μ_1)²] ∧ 2σ²_1, and for the denominator we get (σ²_2 − σ²_1)·x + 2σ²_1 ≤ 2(σ²_2 ∨ σ²_1) and x(1 − x) ≤ 1/4, such that

|f′(x)| ≥ √(nh) · |μ_2 − μ_1| · { 2^{−1}·[σ²_2 + 3σ²_1 + (μ_2 − μ_1)²] ∧ 2σ²_1 } / [2(σ²_2 ∨ σ²_1) + (μ_2 − μ_1)²/4]^{3/2} ≥ √(nh) · |μ_2 − μ_1| · 2(σ²_2 ∧ σ²_1) / [2(σ²_2 ∨ σ²_1) + (μ_2 − μ_1)²/4]^{3/2} > 0.

In the second inequality we omitted (μ_2 − μ_1)² ≥ 0 and used that 2σ²_1 < 2^{−1}·[σ²_2 + 3σ²_1] iff σ²_1 < σ²_2. This is the lower bound in (9). For the upper bound we find the numerator ≤ 2^{−1}·[σ²_2 + 3σ²_1 + (μ_2 − μ_1)²] ∨ 2σ²_1 ≤ 2(σ²_2 ∨ σ²_1) + (μ_2 − μ_1)², and for the denominator we mention (σ²_2 − σ²_1)·x + 2σ²_1 ≥ 2(σ²_2 ∧ σ²_1), such that

|f′(x)| ≤ √(nh) · |μ_2 − μ_1| · [2(σ²_2 ∨ σ²_1) + (μ_2 − μ_1)²] / [2(σ²_2 ∧ σ²_1)]^{3/2},

which completes (9).

For the breaking behavior it follows that the maximum w.r.t. any path starting in S\R exceeds the minimum w.r.t. all A_u, and the second inequality holds as D^{(n)}_{t,h} = n^{1/2}·d^{(1)}_{t,h} + o_{a.s.}
(n^{1/2}) uniformly on Δ_δ (Corollary 3.2) and |d^{(1)}_{t,h}| > 0 within all A_u (Lemma 3.3), noting that β < 1/2. Thus, a.s. for n large there is no break. Second, for m = |C| + 1 the algorithm now breaks. All c ∈ C are estimated. Thus, possibly after applying criterion 3a, all remaining starting points lie in R. This also covers C = ∅. Note that a path that starts in R remains in R. It holds a.s. for n large that the maximum w.r.t. all paths starting in R falls below the maximum w.r.t. the entire R. For the second inequality we note that within R it is d^{(n)}_{t,h} = 0. Finally, note that 3c. is asymptotically redundant: a.s. for n large, for m ≤ |C| we stated correct estimation up to the error, so neighboring estimates will have at least distance δ_C − 2(δ − 1), and for m = |C| + 1 the algorithm already broke in 3b. □

References:
- Estimators of changes
- Evaluating stationarity via change-point alternatives with applications to fMRI data
- On discriminating between long-range dependence and changes in mean
- Change-point analysis in nonstationary stochastic models
- Parametric statistical change point analysis
- Two-stage data segmentation permitting multiscale change points, heavy tails and dependence
- MOSUM tests for parameter constancy
- Limit theorems in change-point analysis
- Power of change-point tests for long-range dependent data
- Multiple change-point estimation with U-statistics
- A MOSUM procedure for the estimation of multiple random change points
- Segmentation and estimation of change-point models: false positive control and confidence regions
- Wild binary segmentation for multiple change-point detection
- An application of the maximum likelihood test to the change-point problem
- Rates of convergence for U-statistic processes and their bootstrapped versions
- Multiple change-point estimation with a total variation penalty
- Inference about the change-point from cumulative sum tests
- Nonparametric tests for change-point detection à la Gombay and Horváth
- Testing for changes using permutations of U-statistics
- Limit theorems for permutations of empirical processes with applications to change point analysis
- Permutation tests for multiple changes
- Application of change point analysis to daily influenza-like illness emergency department visits
- Detection of changes in variance of oceanographic time-series using changepoint analysis
- Least-squares estimation of an unknown number of shifts in a time series
- mscp: Multiscale Change Point Detection via Gradual Bandwidth Adjustment in Moving Sum Processes (R-package)
- A nonparametric approach for multiple change point analysis of multivariate data
- Bivariate change point detection: joint detection of changes in expectation and variance
- A multiple filter test for the detection of rate changes in renewal processes with varying variance
- Historical data on the daily number of new reported COVID-19 cases in Zimbabwe (Ministry of Health and Child Care of Zimbabwe)
- Continuous inspection schemes
- Heterogeneous change point inference
- A review and comparison of changepoint detection techniques for climate data
- Audio segmentation for speech recognition using segment features
- Multiscale local change point detection with applications to value-at-risk
- On extreme value asymptotics for increments of renewal processes

Proof of Proposition 3.4: Donsker's theorem yields a functional limit in (D_R[0, T], ||·||_∞) as n → ∞ (20). Define a continuous map φ from (D_R[0, T], ||·||_∞) to (D_R[Δ_δ], ||·||_∞) and apply φ to (20). The continuous mapping theorem yields convergence in (D_R[Δ_δ], ||·||_∞) as n → ∞ of the statistic normalized by (2σ²⌊nh⌋)^{1/2}, where the constant μ vanishes. Now, by replacing 2σ² by σ̂²_r + σ̂²_ℓ and using Lemma 3.1 and Slutsky's theorem, weak convergence of (D^{(n)}_{t,h})_{(t,h)∈Δ_δ} follows.

Proof of Lemma 4.2: Within A_u the slope of (n^{−1/2}·|d^{(n)}_{t,h}|)_t is bounded according to Lemma 3.3.

Proof of Proposition 4.3: From Corollary 3.2 we obtain a.s. as n → ∞ uniform closeness of n^{−1/2}·|D^{(n)}_{t,h}| and |d^{(1)}_{t,h}| on Δ_δ (21). On A_u (excluding t = c_u) a lower bound for the derivative of (|d^{(1)}_{t,h}|)_t is given through κ_a·δ^{1/2} > 0, see Lemma 3.3.
Thus, for any ε > 0 the corresponding bound holds, where the convergence follows from (21). From this we bound the distance of the path endpoint to c_u: in the first inequality we use (23), in the second the triangle inequality, and the convergence follows from (21). The estimator t̂^{(n)}_e therefore satisfies the assertion of Proposition 4.3.

Proof of Lemma 4.4: From δ < δ_C/2 conclude that A_u contains a square S with horizontal and vertical edges of length ⌊δ_C/2⌋: choose the center of S as (c_u, δ_C). In fact, the right corners of S may only be adjacent to A_u. Nevertheless, S ∩ S_{⌊δ_C/2⌋} ≠ ∅, and thus S ∩ S_g ≠ ∅ for g ≤ ⌊δ_C/2⌋.

Proof of Theorem 4.6: It holds a.s. for n large that all potential paths that start in the 'upper part' S\R of Δ_δ end close to a c_u ∈ C (25). (25) holds true, as for any (t_s, h_s) ∈ S\R the path eventually enters an A_u, and then Proposition 4.3 states a.s. |t̂^{(n)}_{e,s} − c_u| ≤ δ − 1 for n large. The maximum follows from |S| being finite. Further, a.s. for n large, the minimum of the path objective over starting points in ∪_{u=1,...,|C|} A_u exceeds its maximum over R (26), i.e., as long as the algorithm does not break, starting points are first chosen from S\R ⊃ ∪_u A_u up until all of them are cut out, and after that they are chosen from R. (26) follows from Corollary 3.2 and Lemma 3.3, noting that d^{(n)}_{t,h} = 0 on R. Combining (25) and (26), it follows that a.s. for n large at first all c ∈ C are estimated up to a distance δ − 1 from starting values within S\R, given the algorithm does not break. Possibly step 3a is applied in between. We need to show that the breaking criterion 3b applies appropriately: first, for m ≤ |C| we show that it does not apply. A path starting in S\R must pass an A_u, such that a.s. for n large min_{(t_s,h_s)∈S\R} max_{k=0,1,...,h_s−δ}