Bayesian Changepoint Estimation for Spatially Indexed Functional Time Series
Mengchen Wang, Trevor Harris, Bo Li
2022-01-08

We propose a Bayesian hierarchical model to simultaneously estimate mean-based changepoints in spatially correlated functional time series. Unlike previous methods that assume a shared changepoint at all spatial locations or ignore spatial correlation, our method treats changepoints as a spatial process. This allows our model to respect spatial heterogeneity and exploit spatial correlations to improve estimation. Our method is derived from the ubiquitous cumulative sum (CUSUM) statistic that dominates changepoint detection in functional time series. However, instead of directly searching for the maximum of the CUSUM-based processes, we build spatially correlated two-piece linear models with an appropriate variance structure to locate all changepoints at once. The proposed linear model approach increases the robustness of our method to variability in the CUSUM process, which, combined with our spatial correlation model, improves changepoint estimation near the edges. We demonstrate through extensive simulation studies that our method outperforms existing functional changepoint estimators in terms of both estimation accuracy and uncertainty quantification, under both weak and strong spatial correlation and both weak and strong change signals. Finally, we demonstrate our method using a temperature data set and a coronavirus disease 2019 (COVID-19) study.

In recent years, there has been considerable renewed interest in changepoint detection and estimation in many fields, including climate science (Lund et al., 2007), finance and business (Lavielle and Teyssiere, 2007; Taylor and Letham, 2018), and traffic analysis (Kurt et al., 2018).
The changepoint problem was first studied by Page (1954) for independently and normally distributed time series. Since then, the changepoint literature has grown tremendously, with methods developed for both the at-most-one-change and the multiple-changepoint settings. A vast collection of methodologies is based on the cumulative sum (CUSUM) statistic (e.g., Wald, 1947; Shao and Zhang, 2010; Aue and Horváth, 2013; Fryzlewicz and Rao, 2014), which was first introduced by Page (1954) to detect a shift in the process mean, though other approaches have also been proposed (e.g., Chernoff and Zacks, 1964; MacEachern et al., 2007; Sundararajan and Pourahmadi, 2018). With the proliferation of high-frequency data collection and massive data storage in recent years, functional data have become increasingly common and functional data analysis has become an increasingly valuable toolkit. For instance, the daily temperature record within a specific year can be considered functional data and analyzed using functional data methods. Consequently, functional time series have become prevalent, and they usually contain more information than a single time series. Following the previous example, daily temperature data over, say, 50 years can be treated as a functional time series, which is much more informative than an annual average temperature series with 50 observations. As with univariate time series, changepoint detection and estimation for functional time series have received particular interest owing to the rise of high-dimensional time series.

Within the functional data analysis (FDA) literature, changepoint detection has primarily focused on the at-most-one-change scenario. Berkes et al. (2009) proposed a CUSUM test to detect and estimate changes in the mean of independent functional sequence data. Comprehensive asymptotic properties of their estimator were further studied in Aue et al. (2009). Berkes et al.'s test was then extended to weakly dependent functional data by Hörmann and Kokoszka (2010) and to epidemic changes, for which the observed changes return to baseline at a later time, by Aston and Kirch (2012). Zhang et al. (2011) introduced a test for changes in the mean of weakly dependent functional data using self-normalization to alleviate the use of asymptotic control. Later, Sharipov et al. (2016) developed a sequential block bootstrap procedure for these methods. Recently, Aue et al. (2018) proposed a fully functional method for finding a change in the mean without losing information due to dimension reduction, thus eliminating the restrictions of functional principal component based estimators. Other work on multiple changepoint detection for functional time series includes Chiou et al. (2019), Rice and Zhang (2019), Harris et al. (2020), and Li and Ghosal (2021).

Environmental data often naturally take the form of spatially indexed functional data. Again using our temperature example, if we observe such functional time series at many weather stations in a region, then we have a spatial functional time series. The literature on changepoint estimation for spatially indexed functional time series is relatively scant compared to the abundant literature for data not associated with spatial locations. The possible spatial correlation in spatially indexed data presents both challenges and opportunities. It is often not straightforward to model and estimate spatial correlation in statistical analysis; however, appropriately taking spatial correlation into account can effectively improve the statistical inference drawn from spatial data (Shand et al., 2018). Gromenko et al. (2017) tackled changepoint estimation for spatial functional data by assuming a common break time for all functional time series over the spatial domain.
They developed a test statistic as a weighted average of the projected CUSUM, with the weights defined by the inverse of the covariance matrix of the spatial data. However, the assumption of a common changepoint over the entire spatial domain can be unrealistic when considering functional data over a vast region, such as weather data in a state. Other related work on spatial functional data includes a test for the correlation between two different functional data sets observed over the same region, a test for the equality of the mean function in two samples of spatial functional data, and a nonparametric method to estimate the trend as well as evaluate its significance for spatial functional data (Gromenko and Kokoszka, 2013).

To illustrate the limitation of assuming a common changepoint for a large region, we examine the changepoints of the daily minimum temperature in California from 1971 to 2020, obtained from https://www.ncdc.NOAA.gov/cdo-web/search?datasetid=GHCND. The data are collected over 207 stations, but only 28 stations have sufficiently complete (<15% missing values) time series for meaningful changepoint estimation and are presented here. We first use 21 Fourier basis functions to smooth the daily data and then apply the Fully Functional (FF) method of Aue et al. (2018) to each station. We then test for the existence of changepoints with the FF method and find 16 stations with p < 0.05 after a false discovery rate (FDR) control (Benjamini and Hochberg, 1995). The locations of the stations and the FF changepoint estimates are shown in Figure 1. The changepoint estimates appear asynchronous, though somewhat spatially clustered. Thus, even without accounting for spatial variability, we can see that the break times vary substantially by location. Assuming a single common break time would, therefore, misrepresent the changepoint process and lose information.
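The FDR screening step above can be sketched as follows. This is a generic implementation of the Benjamini–Hochberg step-up procedure (not code from the paper), applied to a hypothetical vector of per-station p-values; the function name is our own.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return a boolean mask of the
    hypotheses rejected (declared significant) at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresholds             # which ordered p-values pass
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest i with p_(i) <= alpha * i / n
        reject[order[: k + 1]] = True          # reject all hypotheses up to rank k
    return reject
```

In the temperature example, the 28 per-station p-values from the FF test would be passed through such a procedure at level 0.05, and only the stations in the rejected set would be carried forward.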
We propose a flexible changepoint estimation method for simultaneously locating at most one change in the mean function at each location of a spatially indexed functional time series. Our method allows both the break time and the amount of change to vary spatially, while taking spatial correlation into account to strengthen the changepoint estimation and respect the inherent spatial continuity. We derive our method from the asymptotic properties of the functional CUSUM squared norm process at each location. Specifically, we propose to fit spatially correlated two-piece linear models to the CUSUM squared norm process across the spatial domain, and we estimate each changepoint as the point where the two pieces meet at that location. All parameters are jointly specified in a Bayesian hierarchical model, which provides a powerful means of parameter estimation and allows us to conveniently quantify the uncertainty of the estimates.

The rest of this paper is organized as follows. In Section 2, we first introduce the notation and the properties of the CUSUM squared norm process, and then present our proposed method. In Section 3, we conduct simulations under different scenarios to evaluate the performance of our proposed model and competing methods. Real data analyses of the California minimum temperatures and the COVID-19 dataset are presented in Section 4. The paper concludes with a brief discussion in Section 5.

Let X_{s,t}(u) be the functional observation at location s ∈ D and time t ∈ Z, where D is a compact subset of R^d. Each X_{s,t}(u) ∈ L²([0, 1]) is a real-valued square integrable function defined, without loss of generality, on the unit interval [0, 1], i.e., u ∈ [0, 1] and ∫₀¹ X²_{s,t}(u) du < ∞. We assume the functional time series at each location s is generated from the following model:

X_{s,t}(u) = μ_s(u) + δ_s(u) 1(t > k*_s) + ε_{s,t}(u), (2.1)

where μ_s(u) is the baseline mean function that is distorted by the addition of δ_s(u) after the break time k*_s ∈ {1, . . . , T} at location s, and 1(A) is an indicator function that equals one only when event A is true and zero otherwise. We assume that the functional data at all locations are observed at the same time points. To simplify notation, we sometimes suppress u from the functional random variables, for instance referring to ε_{s,t}(u) as ε_{s,t}, when there is no risk of confusion. Following Assumption 1 in Aue et al. (2018), we allow the error functions ε_{s,t}(u) ∈ L²([0, 1]) to be weakly dependent in time by assuming they are L^p-m-approximable for some p > 2. Assumption 1 below essentially means that for any location s the error functions ε_{s,t}(u) are weakly dependent.

Assumption 1. For all spatial locations s ∈ D, the error functions (ε_{s,t} : t ∈ Z) satisfy (a) there is a measurable space S and a measurable function g such that ε_{s,t} = g(ζ_t, ζ_{t−1}, . . .) for an i.i.d. sequence (ζ_t : t ∈ Z) taking values in S; and (b) Σ_{m≥1} (E‖ε_t − ε_{t,m}‖^p)^{1/p} < ∞, where ε_{t,m} = g(ζ_t, . . . , ζ_{t−m+1}, ζ*_{t,m,t−m}, ζ*_{t,m,t−m−1}, . . .) with ζ*_{t,m,j} being independent copies of ζ_0, independent of (ζ_t : t ∈ Z).

This assumption covers most commonly used stationary functional time series models, such as functional autoregressive and autoregressive moving average processes. We additionally assume that all error functions are generated from the same distribution, as stated in Assumption 2.

Assumption 2. The errors (ε_{s,t} : s ∈ D, t ∈ Z) are identically distributed random fields.

Assumption 2 indicates that the error functions at all time points and all locations follow the same distribution. Under Model (2.1), the only changes observed in a functional time series are due to δ_s(u), i.e., changes in the mean of the functional sequence. Therefore, all other aspects of the distribution, such as the variance, are required to remain the same. While seemingly restrictive, requiring that the other moments do not change simultaneously is common in functional time series (Gromenko et al., 2017) and is required for identifiability even in univariate changepoint estimation (Horváth, 1993).
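As a concrete illustration, Model (2.1) at a single location can be simulated on a discrete grid of u values. The grid size, noise level, and the particular mean and change functions below are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T, n_u = 50, 101                  # time points and grid size on [0, 1]
u = np.linspace(0.0, 1.0, n_u)    # evaluation grid for the functions

mu = np.sin(2 * np.pi * u)        # baseline mean function mu_s(u) (illustrative)
delta = 0.8 * np.cos(np.pi * u)   # change function delta_s(u) (illustrative)
k_star = 30                       # true break time k*_s

# X_{s,t}(u) = mu_s(u) + delta_s(u) * 1(t > k*_s) + eps_{s,t}(u)
X = np.empty((T, n_u))
for t in range(1, T + 1):
    eps = rng.normal(0.0, 0.3, n_u)   # iid errors here; the paper only requires weak dependence
    X[t - 1] = mu + delta * (t > k_star) + eps
```

Row t of `X` is the discretized curve X_{s,t}; the mean of the curves shifts by `delta` after time `k_star`, which is exactly the change the methods below aim to locate.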
Practically, Assumption 2 also allows us to share variance parameters across spatial locations when estimating the properties of the error functions. Finally, we assume that the error process is stationary and isotropic.

Assumption 3. The errors (ε_{s,t} : s ∈ D, t ∈ Z) form a mean zero, second-order stationary and isotropic random field, whose covariance between locations s and s′ depends only on the Euclidean distance ‖s − s′‖.

Assumption 3 essentially means the covariance between any two observations depends only on the distance between them, regardless of their absolute locations and relative orientation.

Suppose we observe functional time series X_{s,t} at spatial locations s ∈ D and time points t = 1, . . . , T. Changepoint detection, at each location s, can be formulated as the following hypothesis test:

H₀: δ_s = 0 versus H_A: δ_s ≠ 0, (2.2)

where δ_s = 0 means δ_s(u) = 0 for all u ∈ [0, 1], and δ_s ≠ 0 otherwise. Aue et al. (2018) proposed a fully functional approach to testing the hypothesis (2.2) for each location s based on the functional CUSUM,

S_{s,T,k}(u) = T^{−1/2} ( Σ_{t=1}^{k} X_{s,t}(u) − (k/T) Σ_{t=1}^{T} X_{s,t}(u) ), k = 0, . . . , T, (2.3)

for which the two empty sums give S_{s,T,0}(u) = S_{s,T,T}(u) = 0. Noting that the L² norm of the CUSUM statistic, ‖S_{s,T,k}‖, as a function of k tends to be large at the true break date motivates a max-type test statistic for detecting a change in the mean function:

max_{1≤k≤T} ‖S_{s,T,k}‖². (2.4)

If a changepoint is detected, Aue et al. (2018) further estimate the break date by the maximizer of ‖S_{s,T,k}‖² over k. The CUSUM test based on Equation (2.4) allows the functional time series to be m-dependent and requires notably weaker assumptions than functional principal component based methods. The CUSUM statistic has been shown to be powerful in detecting mean shifts in univariate time series (Page, 1954; MacEachern et al., 2007). For functional time series, the CUSUM is also the basis of many other changepoint detection methods (Berkes et al., 2009; Hörmann and Kokoszka, 2010; Aston and Kirch, 2012; Sharipov et al., 2016; Gromenko et al., 2017).
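To make the CUSUM construction concrete, here is a small sketch of the squared-norm process ‖S_{T,k}‖² for one discretized functional series, with the break date estimated as its argmax in the spirit of the FF estimator. The function names and toy data are our own, and the L² norm is approximated by a Riemann sum on a uniform grid.

```python
import numpy as np

def cusum_sq_norm(X, du):
    """Squared L2 norm of the CUSUM S_{T,k}(u) for k = 0..T, where
    S_{T,k}(u) = T^{-1/2} * ( sum_{t<=k} X_t(u) - (k/T) * sum_{t<=T} X_t(u) )."""
    T, n_u = X.shape
    csum = np.vstack([np.zeros(n_u), np.cumsum(X, axis=0)])  # partial sums for k = 0..T
    k = np.arange(T + 1)[:, None]
    S = (csum - (k / T) * csum[-1]) / np.sqrt(T)
    return np.sum(S ** 2, axis=1) * du       # Riemann approximation of the integral over u

# Toy series with a mean shift after k* = 36 out of T = 60
rng = np.random.default_rng(1)
T, n_u = 60, 101
u = np.linspace(0.0, 1.0, n_u)
X = rng.normal(0.0, 0.3, (T, n_u))
X[36:] += 2.0                                # constant change function, strong signal

Y = cusum_sq_norm(X, du=u[1] - u[0])         # the Y_{T,k} sequence, zero at both ends
k_hat = int(np.argmax(Y))                    # argmax-style break date estimate
```

By construction `Y[0]` and `Y[-1]` are exactly zero, matching the empty-sum convention in (2.3), and the argmax lands near the true break when the signal is strong.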
Most previous methods consider changepoint detection in a single functional time series, and thus may have limited power when directly applied to spatially indexed functional data that exhibit spatial correlation. While Gromenko et al. (2017) took spatial correlation into account, their assumption of a simultaneous changepoint can be too restrictive for data observed over a large spatial domain. We aim to develop a flexible and efficient method to estimate the spatially varying break times k*_s jointly for all locations while taking advantage of spatial correlation in the changepoint estimation. Owing to the power of the CUSUM statistic in changepoint detection, our method employs the CUSUM as its building block. Since our method is derived from the asymptotic properties of CUSUM processes for spatially indexed functional time series, this section focuses on studying those properties before introducing our model in Section 2.4. To simplify notation, let

Y_{T,k}(s) = ‖S_{s,T,k}‖², k = 0, . . . , T. (2.5)

The notation Y_{T,k}(s) emphasizes that Y is a spatially varying random process. By definition, Y_{T,k}(s) = 0 when k = 0 and k = T. Since Y_{T,k}(s) largely preserves the changepoint information (Aue et al., 2018), our method is built on Y_{T,k}(s), which reduces the functional sequence X_{s,t}(u) at each location to a time series Y_{T,k}(s), k = 0, . . . , T. The spatial functional sequence thus reduces to a spatiotemporal random process, and we then study the characteristics of this process. Let λ_l and ψ_l(u) be the eigenvalues and eigenfunctions associated with the error process ε_{s,t}(u) in Equation (2.1); the formal definition is deferred to Appendix A. Let q = k/T be the scaled time point.

Lemma 1. Under the null hypothesis of no changepoint at location s, Y_{T,⌊qT⌋}(s) converges in distribution, as T → ∞, to Σ_{l≥1} λ_l B²_l(q), where (B_l : l ≥ 1) are independent standard Brownian bridges on [0, 1].
Proposition 1. Under the alternative hypothesis that there is one changepoint k*_s at location s with change function δ_s(u), Y_{T,k}(s) can be approximated by a random process Z_{T,k}(s) whose mean and variance are given in Equations (2.6) and (2.7), respectively.

Assumption 2 on the error functions implies that both λ_l and ψ_l are invariant across s and t, so all locations share the same parameter a, which represents the long-run variance, whereas b_s depends on the change function and may therefore vary across locations. Proofs of Lemma 1 and Proposition 1 are deferred to Appendix C. The asymptotics in Proposition 1 indicate that we can use the mean and variance of √Z_{T,k}(s) to approximate those of √Y_{T,k}(s) for large T. However, the calculation of the first two moments of √Z_{T,k}(s) is rather involved compared to that for Z_{T,k}(s) due to the square root operator; more details can be found in Appendix C. To bypass this difficulty, we propose to use the mean and variance of Z_{T,k}(s) to approximate those of the Y_{T,k}(s) process. This is not an optimal choice; however, we believe the approximations are reasonable, and at least better than naive choices such as a constant or linearly varying variance. To evaluate how well (2.6) and (2.7) approximate the mean and variance of Y_{T,k}(s), we conduct simulations under four settings composed of two values of T and two signal-to-noise ratio (SNR) values that will be introduced in Section 3.1; details are given in Appendix D. Figure 2 compares the empirical mean and variance from the simulations with their theoretical approximations. For all scenarios considered, the approximations match the empirical results well, especially for the mean function. The expression in Equation (2.6) shows that when T is large, the mean of the Y_{T,k}(s) sequence attains its peak at the changepoint. This is indeed the basis of the test in Aue et al. (2018).
Figure 2 also shows that the Y_{T,k}(s) sequence starts from exactly zero at both ends and peaks at the true changepoint 0.6. Comparing the mean of Y_{T,k}(s) at the two SNR values, we see that when the change signal is stronger, the peak tends to be more pointed. The variance of Y_{T,k}(s) also starts from zero at the two ends and increases toward the center. However, there is no theoretical guarantee that the variance is maximized at the changepoint; indeed, we find the peak of the variance is not necessarily located at the changepoint, though it is in this particular simulation.

The properties of Y_{T,k}(s) suggest estimating the break time by fitting a piecewise linear model with two pieces to the sequence Y_{T,k}(s), 0 ≤ k ≤ T, at each location, with the two pieces expected to join at the break time. Figure 3 illustrates this idea using simulated Y_{T,k}(s) processes. Due to the constraint of being zero at both ends, the two pieces can be modeled with a single slope parameter, and a stronger change signal leads to a steeper slope. Although the mean function in Equation (2.6) suggests a piecewise quadratic model, for simplicity and the robustness of linear models we choose the piecewise linear model, which suffices for our purpose of capturing the peak of the Y_{T,k}(s) process. In order to correctly quantify the uncertainty of the fitted piecewise linear model, and thus the uncertainty of the changepoint estimate, it is important to equip the regression model with an appropriate variance structure. We model the variance of the piecewise linear model following Equation (2.7).

If functional data are observed at nearby locations, their break times are expected to be similar due to spatial dependency, and so are the amounts of change. For the piecewise linear models, this means that the joint locations and the slopes at two neighboring locations tend to be similar.
This suggests borrowing information from neighbors when estimating the changepoint at each location. Given the above considerations, we propose a Bayesian hierarchical model to jointly estimate the changepoints, together with their uncertainty, at all locations that have changepoints. In practice, we can first apply any changepoint detection method at each location and then employ FDR control to adjust the p-values and decide which locations show significant evidence of a changepoint. If the number of spatial locations N is large, the mirror procedure developed by Yun et al. (2020) can be an effective alternative to classic FDR control.

We model the Y_{T,k}(s) process through a Bayesian hierarchical model. Assume changepoints are detected at locations s_1, . . . , s_N. At each of those locations, we fit a two-piece linear model with a single slope parameter to Y_{T,k}(s), k = 1, . . . , T − 1, exploiting the constraint that Y_{T,k}(s) = 0 for k = 0 and k = T. We model the slope parameters and the joints of the two pieces as spatially correlated processes to account for the spatial correlation in the break times and change amounts. Let c(s) = k*_s/T ∈ (0, 1) be the scaled location-specific changepoint. We propose the following model.

Stage I. Likelihood of the Y_{T,k}(s) process: the mean of Y_{T,k}(s) is a two-piece linear function of k/T with its joint at c(s), where β(s) < 0 is the spatially varying slope coefficient, and the error process e_k(s) is assumed to be a zero-mean spatially correlated Gaussian process. We further assume a space-time separable covariance structure for the errors for simplicity, as is widely used in spatiotemporal modeling (Haas, 1995; Hoff, 2011). We denote the entire error process as e = (e_1(s_1), . . . , e_1(s_N), e_2(s_1), . . . , e_2(s_N), . . . , e_{T−1}(s_1), . . . , e_{T−1}(s_N))^T. The variance term Ω follows the theoretical approximation in Equation (2.7) to represent the uncertainty of Y_{T,k}(s).
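A single-location version of this idea, stripped of the spatial and Bayesian machinery, can be sketched as a profile least-squares fit. The "tent" parametrization below (peak height h at joint c, zero at both ends) is our illustrative stand-in for the paper's one-slope two-piece linear mean, whose exact form is not reproduced here.

```python
import numpy as np

def tent(q, c):
    """Piecewise linear 'tent' with value 0 at q = 0 and q = 1 and peak 1 at q = c."""
    return np.where(q <= c, q / c, (1.0 - q) / (1.0 - c))

def fit_two_piece(Y):
    """Profile least squares: for each candidate joint c on the grid k/T, the
    peak height h has a closed form; return the (c, h) minimising the RSS."""
    T = len(Y) - 1                        # Y holds Y_{T,k} for k = 0..T
    q = np.arange(T + 1) / T
    best_c, best_h, best_rss = None, None, np.inf
    for c in np.arange(2, T - 1) / T:     # keep the joint away from the ends
        b = tent(q, c)
        h = (b @ Y) / (b @ b)             # closed-form least-squares height
        rss = np.sum((Y - h * b) ** 2)
        if rss < best_rss:
            best_c, best_h, best_rss = c, h, rss
    return best_c, best_h
```

The Bayesian hierarchical model replaces this per-location grid search with spatially correlated priors on the joints c(s) and slopes, and with the variance structure from Equation (2.7) instead of ordinary least squares.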
The parameters a and b_s are complex functions of unknown eigenvalues, eigenfunctions and change functions. We therefore treat them directly as unknown nuisance parameters in our model. This also gives us the flexibility of being less dependent on the exact form of the approximation while still following its basic structure. The purely temporal correlation matrix Γ_t and purely spatial correlation matrix Γ_s can be governed by any valid correlation function, such as the exponential or Matérn function (Stein, 2012). For simplicity, we assume an exponential correlation function for both matrices,

Γ_t(i, j) = exp(−|i − j|/φ_t) and Γ_s(i, j) = exp(−‖s_i − s_j‖/φ_s),

where φ_t and φ_s are the range parameters for the temporal and spatial correlation, respectively.

As shown by the earlier asymptotic and numerical results, the shape of the piecewise linear model is influenced by the change function and the changepoint. To respect the fact that nearby locations tend to have similar changepoints and change functions, we regulate β = (β(s_1), . . . , β(s_N))^T and c = (c(s_1), . . . , c(s_N))^T by spatially correlated processes. Since b = (b_{s_1}, . . . , b_{s_N})^T also depends on the change functions, it is governed by a correlated process as well. Because the dependency in β, c and b all arises from the spatial dependency in the data, it is not unreasonable to assume these parameters share one correlation matrix Σ(φ), retaining parsimony of the model.

Stage II. Priors: considering the constraints that the slope β(s) is negative, the changepoint c(s) lies between 0 and 1, and the parameters a and b_s in the variance are positive, we construct priors that respect these ranges. The parameters μ_β, μ_c, μ_a and μ_b take values in R, so we choose normal distributions with large variances as their weak hyperpriors. The variance parameters σ²_i, i = β, c, a, b, are each given a conjugate inverse gamma hyperprior; we choose IG(0.1, 0.1) because it provides sufficiently vague hyperpriors for the variances of β, c, a, and b.
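The separable correlation structure can be sketched as follows. The coordinates, range values, and the Kronecker ordering (time-major, matching the vectorization of e with all locations grouped within each time point) are illustrative assumptions of ours.

```python
import numpy as np

def exp_corr(dist, phi):
    """Exponential correlation function: Corr(d) = exp(-d / phi)."""
    return np.exp(-np.asarray(dist, dtype=float) / phi)

def dist_matrix(coords):
    """Pairwise Euclidean distances between locations (rows of coords)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy example: N = 3 locations, T - 1 = 4 time points
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Gamma_s = exp_corr(dist_matrix(coords), phi=5.0)                       # spatial correlation
lags = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
Gamma_t = exp_corr(lags, phi=2.0)                                      # temporal correlation
Gamma_st = np.kron(Gamma_t, Gamma_s)   # separable space-time correlation for the stacked e
```

In the full model this correlation matrix would be combined with the variance terms from Equation (2.7) to form Ω; the Kronecker form keeps the number of correlation parameters at two (φ_s and φ_t).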
The range parameters φ, φ_s and φ_t are positive, so we choose exponential hyperpriors for them, with a different hyperparameter for φ_t given that the spatial and temporal domains have different characteristics.

Stage III. Hyperpriors, in which I_N denotes the N × N identity matrix.

We use a Markov chain Monte Carlo (MCMC) algorithm to obtain posterior samples from the model. Gibbs sampling is used for the posteriors of σ²_β, σ²_c, σ²_a and σ²_b, while a Metropolis–Hastings-within-Gibbs algorithm is implemented for the remaining parameters. The derivation of the posterior distributions can be found in Appendix E.

We conduct simulations to evaluate the accuracy of our changepoint estimation, as well as the coverage and length of the credible intervals. We also explore how the strengths of the spatial correlation and of the change signal influence performance. To further study the properties of our method, we compare it with other competing methods in terms of changepoint estimation. We randomly select N = 50 locations in a 10 × 10 spatial domain as the rejection region D_R resulting from a changepoint detection algorithm adjusted by FDR control. Because our method estimates all locations jointly, false discoveries, i.e., null locations falsely classified as alternatives, may undermine the estimation. To mimic false discoveries at a typical rate of 0.1, we randomly select a cluster of N_0 = 5 of the 50 locations to be the falsely classified null locations. At each location, we consider T = 50 time points and generate T functional observations, X_{s,t}(u), u ∈ [0, 1], for t = 1, . . . , T, as defined in Equation (2.1). At those N_0 locations, the change function δ_s(u) is set to zero. Without loss of generality, we take the mean curves μ_{s_1}, . . . , μ_{s_N} to be zero functions. Thus, the data generation mainly involves simulating the error functions, break times and change functions.
Error functions: Although we allow the error functions to be weakly dependent, using temporally independent error functions in simulation studies is very common (e.g., Aue et al., 2018). In particular, Aue et al. (2018) repeated their simulation with first-order functional autoregressive errors and found the results generally remain the same as those from independent errors, because the Y_{T,k}(s) process is insensitive to the error correlation structure. We therefore adopt temporally independent error functions in our simulation. For each location, we generate the T error functions as

ε_{s,t}(u) = Σ_{l=1}^{L} ξ^l_{s,t} ν_l(u),

where L = 21 is the number of Fourier basis functions, ν_l(u) is the lth Fourier basis function, and ξ^l_{s,t} is the coefficient for ν_l(u) at location s and time point t. Define ξ^l_t = (ξ^l_{s_1,t}, ξ^l_{s_2,t}, . . . , ξ^l_{s_N,t}) for each l between 1 and L, and assume ξ^l_t ∼ N(0_N, (1/2) m^{−3} Σ). To ensure curve smoothness, we set m = 1 if l = 1, m = l/2 if l is even, and m = (l − 1)/2 if l is odd and l ≥ 3. The derivation of m and details of the basis functions are deferred to Appendix F. To make the error functions spatially correlated, we let Σ follow the exponential correlation function with a range parameter φ.

Break time: For the region of the N_a = 45 true alternative locations, D_a := {s_1, . . . , s_{N_a}}, we first generate the scaled break times (k̃*_s : s ∈ D_a) from a truncated multivariate normal distribution such that 0.15 ≤ k̃*_s ≤ 0.85 for every s ∈ D_a, and then set k*_s = [T k̃*_s], where [a] denotes rounding a to its nearest integer. We truncate the scaled break times to ensure a reasonable amount of data both before and after the changepoint. This also keeps the signal-to-noise ratio, defined later in this section, within a normal range.

Change functions: We generate the change functions δ_s, s ∈ D_a, as

δ_s(u) = Σ_{l=1}^{L} η^l_s ν_l(u),

where ν_l is the lth Fourier basis function and η^l_s is the coefficient for ν_l at location s. Define η^l = (η^l_{s_1}, . . . , η^l_{s_{N_a}})^T for each l between 1 and L, and assume η^l ∼ N(ρ m^{−2} 1_{N_a}, (1/10) m^{−3} Σ_a), where m and Σ_a follow the definitions used for the error functions and break times, respectively. The parameter ρ measures the magnitude of the change signal.

To investigate how our model performs under different spatial correlation strengths, we consider both φ = 2 and φ = 5, corresponding to relatively weaker and stronger spatial correlation. It is also of interest to study the influence of the change signal strength on model performance. We adopt the signal-to-noise ratio (SNR) used in Aue et al. (2018) to measure the strength of the change signal. The SNR, the ratio of the magnitude of the change function to that of the error functions, is defined in terms of θ, the scaled date of the changepoint (i.e., k*_s/T in our context), the change function δ, and the long-run covariance matrix C of the error functions as defined in Equation (A.1), through the trace function tr(·). The estimation procedure for the SNR at a single location is detailed in Aue et al. (2018). By setting ρ = 1 and 1.5 we obtain simulated data with mean SNR over all locations in D_a of around 0.5 and 1, corresponding to weaker and stronger signals, respectively.

To evaluate the performance of the proposed method, we examine the root mean squared error (RMSE) of the changepoint estimates, the empirical coverage of the credible intervals (CIs), and the lengths of the CIs. For each setting of spatial correlation and SNR, we run 100 simulations; locations, changepoints and functional data are generated independently in each simulation. For our Bayesian model, to make sure the MCMC chains have converged, we run several chains from different sets of initial values for all parameters and evaluate the difference between the chains with the Gelman–Rubin diagnostic (Gelman and Rubin, 1992). We also apply Geweke's diagnostic (Geweke et al., 1991) to determine the burn-in period.
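The Gelman–Rubin check can be sketched as follows; this is a generic implementation of the classic potential scale reduction factor R̂ from multiple chains (not the paper's code), following Gelman and Rubin (1992).

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat from an (m, n) array holding
    m parallel chains of length n for one scalar parameter."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled posterior variance estimate
    return np.sqrt(var_hat / W)
```

Values of R̂ close to 1 (commonly below 1.1) indicate that the chains started from different initial values have mixed into the same distribution; large values flag non-convergence.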
Through experimentation, we find that 20,000 MCMC iterations, with a 15,000-iteration burn-in period and thinning with step size 10, are sufficient to produce nearly i.i.d. samples from the posterior distribution. We compute 95% credible intervals as the intervals between the 2.5th and 97.5th percentiles of the posterior samples of each parameter.

We compare our method to the recent Fully Functional (FF) method of Aue et al. (2018) and to the method specifically designed for spatial functional data by Gromenko et al. (2017) (hereinafter GKR). The GKR method focuses mainly on changepoint detection and does not provide confidence intervals, so our comparison to GKR is limited to estimation accuracy. The changepoint confidence intervals for the FF method are computed using the R package fChange. Although all methods are applied to the functional data at all 50 locations, the evaluation metrics are calculated only at the 45 true alternative locations.

The RMSE of the changepoint estimates from all three methods is reported in Appendix H. Unsurprisingly, GKR has markedly higher error rates than the other two methods, since it assumes a single changepoint whereas the data are generated with spatially varying changepoints. We therefore focus on FF and our method in Figure 4, as they have comparable error rates. Across all four scenarios, representing weaker and stronger spatial correlation and weaker and stronger SNR, our proposed method outperforms FF by reducing the RMSE of the changepoint estimates. When the change signal is stronger (ρ = 1.5), both FF and our method show smaller and more stable RMSE, as expected. When the spatial correlation is stronger (φ = 5), our method achieves far lower estimation error, especially in the challenging situation with a weaker change signal (ρ = 1). This implies that our method can exploit spatial correlation to improve changepoint estimation.
Curiously, the FF method experiences a slight RMSE reduction in the high correlation regime, which turns out to be an artifact of the randomness in the data generation; details are reported in Appendix H.

Figure 4: Boxplots of the RMSE, the empirical coverage probability of the 95% credible and confidence intervals, and the logarithm of the interval length under four settings. The range parameters φ = 2 and φ = 5 represent weaker and stronger spatial correlation, and ρ = 1 and ρ = 1.5 represent weaker and stronger change signals, respectively. "BH" is our proposed Bayesian hierarchical model and "FF" refers to the fully functional method of Aue et al. (2018).

We further report the empirical coverage probability of our 95% credible intervals against that of the 95% confidence intervals of the FF method, and present the interval lengths of both methods in Figure 4. Narrow credible or confidence intervals with empirical coverage close to the nominal level indicate precise uncertainty quantification. Our credible intervals, based on weakly informative priors, are closer to the nominal level and narrower than the corresponding FF confidence intervals. When the change signal is stronger, both FF and our method improve their uncertainty quantification relative to the weaker change signal scenarios. Again, our method is apparently able to take advantage of the spatial correlation in changepoint estimation, reflected by shorter credible intervals and better coverage as the spatial correlation becomes stronger. This ability is particularly important when the change signal is weak, because in such cases methods like FF that do not account for spatial correlation may face challenges. Figure 5 shows the 95% credible and confidence intervals from randomly chosen simulation runs in two different settings.
Figure 5(a) corresponds to the stronger spatial correlation and the stronger change signal, where both our method and the FF method attain their best performance among all settings in terms of both the accuracy of the changepoint estimates and the uncertainty quantification. In this scenario, the confidence intervals from the FF method perform only slightly worse than our credible intervals, and the RMSE from FF is competitive. Nevertheless, when the true changepoints are closer to the edges, the FF method tends to miss the true values and produces longer confidence intervals, while our method consistently captures all changepoints well regardless of their positions. Moreover, for many locations, even when the FF estimate is close to the true value, its confidence interval is often too long to be informative. Figure 5(b) corresponds to the case with the stronger spatial correlation and the weaker signal. Both methods perform satisfactorily when the true changepoint is near 0.5. However, in this scenario the FF method struggles both to capture the changepoint and to quantify its uncertainty when the true changepoint lies closer to either end. In contrast, our method retains its power in those situations by providing accurate estimates and informative credible intervals. Under the null hypothesis of no changepoint, the variability of the Y_{T,k} process is largest in the middle and peaks at 0.5. Even when a changepoint exists, the variance in the middle still tends to be higher due to the intrinsic properties of Y_{T,k}, though the peak may not occur at the center. Since the FF method simply searches for the maximum of the Y_{T,k}(s) process, it is vulnerable to the large variance that typically dwells around the center of the duration. When the true changepoint is off-center and the signal is weak, high variance near the center can lead to spurious maxima in the Y_{T,k} process.
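A scalar Monte Carlo sketch (ours, not from the paper) illustrates this point: under the null, the variance of the squared CUSUM statistic is far larger near the center of the series than near its edges, which is exactly what makes a pure maximum search fragile.

```python
import numpy as np

def squared_cusum(x):
    """Y_{T,k} = S_{T,k}^2 for a scalar series, with
    S_{T,k} = T^{-1/2} (sum_{t<=k} x_t - (k/T) sum_t x_t)."""
    T = len(x)
    cs = np.cumsum(x)
    k = np.arange(1, T)
    s = (cs[:-1] - k / T * cs[-1]) / np.sqrt(T)
    return s ** 2

rng = np.random.default_rng(1)
T, reps = 100, 2000
Y = np.array([squared_cusum(rng.standard_normal(T)) for _ in range(reps)])
var_k = Y.var(axis=0)   # Monte Carlo variance of Y_{T,k} at each k
# var_k is largest near the middle of the series and tiny at the edges.
```

Theoretically, for iid noise var(Y_{T,k}) ≈ 2{q(1 − q)}² with q = k/T, which peaks at q = 0.5, matching the simulation.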
In contrast, our method identifies the changepoint with a piecewise linear model, which is more robust to this variance. Furthermore, our method borrows neighborhood information to estimate the changepoint, which is particularly helpful in challenging situations such as a weak change signal or changepoints close to the edges. We demonstrate our method on two datasets and, again, compare our results with the FF detector of Aue et al. (2018). The first dataset contains the temperature profiles introduced in Section 1, and the second records COVID-19 positive cases by age in Illinois during the spring of 2021. As described in Section 1, we have daily minimum temperature profiles at 207 locations in California from 1971 to 2020. Due to the high degree of missingness at many sites, we retain only the 28 stations that have at least 85% complete profiles each year. Each profile is then smoothed with 21 Fourier basis functions. We apply the FF method to further subset the stations down to 16, each with a p-value below the 0.05 cutoff after FDR correction. As an example, the daily minimum temperature profile at Los Angeles International Airport in 1980, together with the curve smoothed using the 21 Fourier basis functions, is shown in Figure 6(a). When applying our method to these data, we check MCMC convergence using the same diagnostics as in the simulation, which guides us to run 30,000 iterations with a burn-in of 20,000 and a thinning interval of 10. The changepoint posterior estimates are presented in Figure 6(b). We also show the credible intervals for each station in Figure 6(c), together with the FF estimates and their confidence intervals. The comparison in Figure 6(c) is consistent with our findings in the simulation studies. Our estimates also preserve the spatial continuity of the naturally dependent temperature process, as evidenced by the changepoint locations in Figure 6(b): stations close in space tend to have changepoints close in time.
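Smoothing a raw daily profile with 21 Fourier basis functions amounts to a least-squares projection onto the basis; a minimal sketch on synthetic data (not the station records) follows.

```python
import numpy as np

def fourier_basis(t, n_basis=21):
    """Fourier basis on [0,1]: constant plus sine/cosine pairs,
    scaled by sqrt(2) so each non-constant basis has unit norm."""
    cols = [np.ones_like(t)]
    for l in range(1, (n_basis - 1) // 2 + 1):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * l * t))
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * l * t))
    return np.column_stack(cols)

# Smooth one year of noisy daily minima by least squares.
rng = np.random.default_rng(2)
t = np.linspace(0, 1, 365)
truth = 10 - 8 * np.cos(2 * np.pi * t)          # seasonal cycle
raw = truth + rng.standard_normal(365)
Phi = fourier_basis(t)                          # 365 x 21 design matrix
coef, *_ = np.linalg.lstsq(Phi, raw, rcond=None)
smooth = Phi @ coef
```

Because the 21-dimensional basis absorbs only a small fraction of the noise, the fitted curve `smooth` tracks the underlying seasonal signal closely.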
Accurate changepoint estimates and informative credible intervals can deepen our understanding of climate dynamics and the threat of tipping points in the climate system. The coronavirus initially attacked mainly older adults, but the age distribution of COVID-19 cases was later observed to shift toward younger groups. Champaign County has an adjusted p-value of 0.102, only barely above the 0.1 threshold. Since Champaign County is the 10th largest of the 102 counties in Illinois in terms of population, and it has a large young age group owing to the major public university located there, we also include Champaign in the changepoint estimation. We apply both FF and our method to the data over the counties that are expected to have changepoints. For simplicity, we use an exponential covariance function to model the dependence between county-level parameters and use the county geographic centers to calculate distances, though conditional or simultaneous autoregressive models are more typical for areal aggregated data. We do not expect the results to be sensitive to the choice of covariance model, given how scattered the 29 counties are. Using the same convergence diagnostics as for the temperature dataset, we run the MCMC for 50,000 iterations, take the first 40,000 as burn-in, and thin the remainder with step size 10 to obtain the posterior samples. The changepoint estimates from both methods are illustrated in Figure 7(a), and the 95% credible and confidence intervals are shown in Figure 7(b). Again, our method is able to estimate changepoints close to the boundaries, and our credible intervals are shorter than the FF confidence intervals. We developed a Bayesian hierarchical model for estimating a single mean changepoint for spatially indexed functional time series.
Our method allows each location to have its own changepoint while respecting the fact that changepoints and change functions tend to be similar when locations are close. Simulations show that our model provides more accurate changepoint estimates and shorter yet more informative credible intervals than the FF estimates and their confidence intervals. In particular, our method outperforms the FF method in estimating early or late changepoints. We demonstrated our proposed method on the daily minimum temperatures in California and the COVID-19 cases over age groups in Illinois. Our method builds on the properties of the Y_{T,k} process, a function of the CUSUM statistic that is widely employed for changepoint detection and estimation. Instead of searching for the maximum of the Y_{T,k} process as in the FF method, we proposed a two-piece linear model to capture the peak of the Y_{T,k} process. We carefully built the variance structure of the Y_{T,k} process into the piecewise linear model so that the uncertainty of the linear model fit, and thus of the changepoint estimates, is appropriately quantified. By jointly fitting the spatially correlated piecewise linear models across all locations through a Bayesian hierarchical model, we took the inherent spatial correlation into account in our changepoint estimation. Our method differs significantly from existing methods in two respects. First, we utilize spatial correlation to synthesize information over the whole spatial domain instead of focusing on a single location (e.g., Aue et al., 2018). Second, we allow spatially varying changepoints across locations, instead of assuming a single shared changepoint (Gromenko et al., 2017). We first define the long-run covariance kernel, eigenvalues, and eigenfunctions related to the error functions in Equation (2.1).
Since the error properties are assumed homogeneous across all locations, we drop the spatial index and write the error functions at one spatial location as ε_t, t ∈ Z, in the following. Under Assumption 1, the limiting behavior of Y_{T,k}(s) at location s depends on the long-run covariance kernel of the error terms {ε_t : t ∈ Z}, defined as

C_ε(u, v) = Σ_{h ∈ Z} cov{ε_0(u), ε_h(v)},

which was first considered by Hörmann and Kokoszka (2010), who further studied its estimator and convergence. A positive definite and symmetric Hilbert-Schmidt integral operator c_ε on L²[0, 1] can be defined through C_ε. More formally,

c_ε(g)(u) = ∫_0^1 C_ε(u, v) g(v) dv,

where g ∈ L²([0, 1]). This further defines a non-increasing sequence of non-negative eigenvalues {λ_l : l ∈ N} and the corresponding orthonormal eigenfunctions {ψ_l : l ∈ N}, which satisfy c_ε(ψ_l) = λ_l ψ_l. The eigenvalues and eigenfunctions of c_ε determine the asymptotic mean and variance of the Y_{T,k}(s) time series. Estimates of the eigenvalues and eigenfunctions are provided by the fChange package in R; we use the optimal bandwidth selector in this package to complete the estimation. Let W(t) denote a standard Wiener process (or Brownian motion), i.e., a stochastic process such that for t ≥ 0 the increments W(t) − W(0) are stationary, independent, and normally distributed with E{W(t)} = 0 and var{W(t)} = t. We can further define a Brownian bridge on [0, T] as the process

B(t) = W(t) − (t/T) W(T).

Lemma 2. If B(t) is a Brownian bridge for t ∈ [0, 1], then cov{B²(t), B(t)} = 0.

To simplify notation, we write B(t) as B_t and W(t) as W_t. From the definition and properties of the Brownian bridge, it can be seen that E(B_t) = 0, var(B_t) = t(T − t)/T, and cov(W_s, W_t) = E(W_s W_t) = s for s ≤ t. Furthermore, the intermediate moments follow from these facts; plugging them into Equation (B.3), we get cov(B²_t, B_t) = 0.
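These Brownian-bridge facts are easy to verify by Monte Carlo; the sketch below (ours, for illustration) simulates B_t at t = 0.3 on the unit interval, where t(1 − t) = 0.21.

```python
import numpy as np

# Monte Carlo check of the facts used in Lemma 2 (unit interval, T = 1):
# E(B_t) = 0, var(B_t) = t(1 - t), and cov(B_t^2, B_t) = E(B_t^3) = 0.
# B_t = W_t - t W_1, simulated directly from the Gaussian increments.
rng = np.random.default_rng(3)
reps, t = 200000, 0.3
W_t = np.sqrt(t) * rng.standard_normal(reps)          # W_t ~ N(0, t)
W_1 = W_t + np.sqrt(1 - t) * rng.standard_normal(reps)  # add N(0, 1-t)
B_t = W_t - t * W_1
var_hat = B_t.var()               # should be close to t(1-t) = 0.21
cov_hat = np.mean(B_t ** 3)       # cov(B_t^2, B_t) since E(B_t) = 0
```

Both estimates land within Monte Carlo error of their theoretical values, 0.21 and 0.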
C Proof of Lemma 1 and Proposition 1

To simplify notation, we drop the location index from Equation (2.1) and assume the observations at location s follow

X_t(u) = µ(u) + δ(u) 1(t > k*) + ε_t(u), t = 1, ..., T.

The changepoint detection hypotheses are H_0: δ = 0 versus H_A: δ ≠ 0, and the CUSUM statistic is defined as

S_{T,k}(u) = T^{-1/2} { Σ_{t=1}^k X_t(u) − (k/T) Σ_{t=1}^T X_t(u) }, k = 0, ..., T.

For spatial locations with no changepoint, i.e., null locations, we denote the CUSUM statistic by S^0_{T,k}(u). For locations with a changepoint, i.e., alternative locations, we denote it by S^A_{T,k}(u). Recall Theorem 1.2 of Jirak (2013), which states that, under Assumption 1, there exists a sequence of Gaussian processes Γ_T(q, u) that uniformly approximates the partial sum process S_T(q, u) = T^{-1/2} Σ_{t=1}^{⌊qT⌋} X_t(u). At a null location we have E(X_1) = ··· = E(X_T) = µ, and it is easily seen that S^0_{T,k}(u) = S_T(k/T, u) − (k/T) S_T(1, u), k = 0, ..., T. Writing q = k/T, we have S^0_{T,k}(u) = S^0_{T,Tq}(u). We further define S^0_T(q, u) = S_T(q, u) − q S_T(1, u), q = 0, 1/T, ..., 1, so that S^0_{T,Tq}(u) = S^0_T(q, u). The quantity of interest is ∥S^0_T(q, ·)∥². Following Theorem 1 of Aue et al. (2018) and the proof in their supplementary materials, and using the definition of Γ_T(q, u), calculations show that, for all T, the approximating Gaussian process Γ^0_T(q, u) = Γ_T(q, u) − q Γ_T(1, u) has the same distribution as

Σ_{l=1}^∞ λ_l^{1/2} ψ_l(u) B_l(q),

where the λ_l and ψ_l are defined as in Appendix A, and (B_l : l ∈ N) are independent and identically distributed standard Brownian bridges on [0, 1]. Following the supplement of Aue et al. (2018), it follows that, for all T, ∥Γ^0_T(q, ·)∥² has the same distribution as Σ_{l=1}^∞ λ_l B_l²(q), which, in light of Equation (C.2), Slutsky's theorem, and continuous mapping, implies that for any location s,

∥S^0_{T,Tq}∥² ⇒ Σ_{l=1}^∞ λ_l B_l²(q).

We further explore the mean and variance of the asymptotic distribution Σ_{l=1}^∞ λ_l B_l²(q), q ∈ [0, 1]. Recalling Lemma 2 and the independence of the B_l, we have

E{Σ_l λ_l B_l²(q)} = q(1 − q) Σ_{l=1}^∞ λ_l,     (C.4)

var{Σ_l λ_l B_l²(q)} = 2 q²(1 − q)² Σ_{l=1}^∞ λ_l².     (C.5)

Based on Equations (C.4) and (C.5), the mean and variance are increasing before 0.5 and decreasing after, with the peak at 0.5 and both ends equal to zero.
Under H_A, the observations satisfy E(X_1) = ··· = E(X_{k*}) = µ and E(X_{k*+1}) = ··· = E(X_T) = µ + δ.

a. Before the changepoint. When k ≤ k*, recall the definition of S^0_T(q, u) and write δ_0 = (k/√T) {(T − k*)/T} δ, so that S^A_{T,k} = S^A_{T,Tq} = S^0_T(q, u) − δ_0. Combining this with Equation (C.1) implies that Γ^0_T(q, u) − δ_0 approximates S^A_{T,k}. To simplify notation, we write Z_{T,k} for ∥Γ^0_T(q, ·) − δ_0∥² and Y_{T,k} for ∥S^A_{T,k}∥², so that Y_{T,k} is approximated in distribution by Z_{T,k}. We then calculate the expectation and variance of this approximating distribution based on Equation (C.7) to obtain the conclusion, stated in (C.8), for an alternative location with q ≤ k*/T.

b. After the changepoint. Similarly, when k > k*, we use the definition of S^0_T(q, u) and write δ_1 = (k*/√T) {(T − k)/T} δ, so that S^A_{T,k} = S^A_{T,Tq} = S^0_T(q, u) − δ_1. Replacing δ_0 by δ_1 in the preceding calculations, we obtain a similar result and can, again, use Z_{T,k} to show that under H_A, when q > k*/T, the analogous conclusion (C.10) holds.

Based on (C.8) and (C.10), when T is large, the form of the mean is dominated by the T-dependent terms both before and after the changepoint. Thus, the mean is approximately increasing before the changepoint and approximately decreasing after it, with the peak exactly at the changepoint and both ends equal to zero. Because the theoretical form of the mean is complicated, we use a two-piece piecewise linear model with fixed ends to approximate it, while we keep the exact form of the variance to better capture the uncertainty of the Y_{T,k} process. To better visualize the proposed theoretical form of the variance, we plot it for several changepoint locations in Figure 9; the variance is always relatively large in the middle no matter where the changepoint is.
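The tent shape of the drift can be checked numerically from δ_0 and δ_1 as written above; a small sketch with T = 100 and k* = 70 follows.

```python
import numpy as np

# Drift of S_{T,k} under H_A:
#   delta_0 = (k/sqrt(T)) (T - k*)/T * delta   for k <= k*,
#   delta_1 = (k*/sqrt(T)) (T - k)/T * delta   for k >  k*.
# Over k this is a tent that peaks exactly at the changepoint k*.
T, k_star, delta = 100, 70, 1.0
k = np.arange(1, T)
drift = np.where(k <= k_star,
                 k / np.sqrt(T) * (T - k_star) / T * delta,
                 k_star / np.sqrt(T) * (T - k) / T * delta)
peak = k[np.argmax(drift)]   # located exactly at k* = 70
```

The drift rises linearly up to k*, falls linearly afterwards, and vanishes at both ends, which is what motivates the two-piece linear approximation of the mean.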
Approximation of the Y_{T,k} Process. To illustrate how well the approximation mimics the Y_{T,k}(s) process when a changepoint exists at location s, we simulate T curves with a changepoint at 0.6 on the scaled time domain [0, 1]; i.e., if T = 100, the changepoint is at 60. The noise observations and change functions are generated from the same formulas as in the simulation study. The magnitude ρ of the change function is tuned in each scenario to maintain the target signal-to-noise ratio (SNR), introduced in detail in Section 3.1. For each setting, the simulation is repeated 500 times. We collect Y_{T,k} at each time point and then calculate the mean and variance of the 500 simulated Y_{T,k}'s. In addition, we estimate the eigenvalues and eigenfunctions to obtain the proposed theoretical approximations E{Z_{T,k}(s)} and var{Z_{T,k}(s)} as in (2.6) and (2.7), with estimation procedures following Aue et al. (2018). Figure 2 compares the mean and variance from the simulations with those derived from the theoretical approximation; the first row of the figure shows the mean and the second the variance. When T = 50 and 100, the approximation captures the trend and uncertainty of the Y_{T,k} process very well.

Piecewise Linear Fit of the Mean of the Y_{T,k} Process. To illustrate how we fit the piecewise linear model to the Y_{T,k} process, we consider the case with T = 50 time points and a changepoint at 0.7, with results shown in Figure 3. The noise functional data and change function (ρ = 0.8) are generated according to the simulation procedure. We pick two interesting realizations: one whose Y_{T,k} process peaks right at the changepoint, and one whose peak is slightly off.
For each Y_{T,k} process, to obtain the best two-piece linear fit with two fixed ends, we grid search over the slope and break point and pick the model with the smallest residual sum of squares, shown as the dashed line in the figure. We denote the collection of Y_{T,k} processes over all N locations as Y. To keep the notation concise, we use β⁰, c⁰, a⁰, and b⁰ to represent log(−β), Φ⁻¹(c), log(a), and log(b), and we obtain the joint likelihood of all parameters conditional on the observations. Recall the conjugate result for a multivariate normal vector: if y is an n × 1 vector with y ~ N(µ, σ²Σ) and σ² ~ IG(α₁, α₂), then

σ² | · ~ IG( n/2 + α₁, (y − µ)ᵀ Σ⁻¹ (y − µ)/2 + α₂ ).

We apply this conclusion to the variance parameters σ²_i, i = β, a, b, c, of our model; from Stage III, each σ²_i has an inverse-gamma full conditional of the same form, and similarly for σ²_a. For the transformed parameters β⁰, a⁰, b⁰, c⁰ and the mean parameters µ_β, µ_a, µ_b, µ_c in Stage II, we use Metropolis-Hastings-within-Gibbs with a symmetric proposal distribution. In the following, we take the parameter c⁰ as an example to illustrate the update in iteration j + 1. We use c⁰⁽ʲ⁾ and c⁰* to denote the sample from the jth iteration and the newly proposed sample, and we pick N(c⁰⁽ʲ⁾, σ²) as the proposal distribution, where σ² is a tuning parameter. (i) Generate a random candidate state c⁰* ~ N(c⁰⁽ʲ⁾, σ²). To ensure convergence of the chains, we try several sets of initial values for all parameters and evaluate the differences between the chains with the Gelman-Rubin diagnostic; Geweke's diagnostic is applied to determine the burn-in period. With Fourier basis functions, we generate the functional data according to

ε_{s,t} = Σ_{l=1}^L ξ^l_{s,t} ν_l, s ∈ D_R, t = 1, ..., T,

where ε_{s,t} is the independent curve at location s and time point t, the change function δ_s at an alternative location is generated analogously from the same basis, and, to keep the symbols consistent with our R code, ν_l is the lth basis function in R.
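The grid search described above can be sketched as follows; the data are synthetic and `fit_two_piece` is our illustrative helper, not the paper's implementation.

```python
import numpy as np

def fit_two_piece(q, y, slopes, breaks):
    """Grid search over the slope beta and break point c of the
    two-piece linear model with fixed ends,
        y(q) = beta * {(c - 1) q + (q - c) 1(q >= c)},
    returning the (beta, c, rss) triple minimizing the residual
    sum of squares (the dashed-line fits in Figure 3)."""
    best = (None, None, np.inf)
    for b in slopes:
        for c in breaks:
            fit = b * ((c - 1) * q + (q - c) * (q >= c))
            rss = np.sum((y - fit) ** 2)
            if rss < best[2]:
                best = (b, c, rss)
    return best

# Toy Y_{T,k} process: a noisy tent peaking at q = 0.7 (beta < 0).
rng = np.random.default_rng(4)
q = np.arange(1, 50) / 50
true = -2.0 * ((0.7 - 1) * q + (q - 0.7) * (q >= 0.7))
y = true + 0.02 * rng.standard_normal(q.size)
beta_hat, c_hat, _ = fit_two_piece(q, y, np.linspace(-4, -0.5, 36),
                                   np.arange(1, 50) / 50)
```

With β < 0 the model is a tent that is zero at q = 0 and q = 1 and peaks at q = c, so the recovered break point estimates the changepoint.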
To make the indexing of the coefficients easier to follow, we rewrite the above two equations with A_{t,l} and A_l denoting the coefficients of the cosine-based basis functions for the noise curves and change functions, respectively, and B_{t,l} and B_l those of the sine-based basis functions. When programming in R, each basis function is multiplied by √2 so that its norm is 1; similar results can be obtained by multiplying by an appropriate constant in another programming language. To guarantee the smoothness of the functional data with and without the change function, we expect the coefficients to decay with the frequency of the basis functions, while sine and cosine bases at the same frequency share the same coefficient magnitude. We therefore divide the coefficients into several groups, with coefficients in the same group sharing the same fluctuation. We obtain initial values for the parameters as follows, illustrating the procedure for one location; we ignore the location index and denote the estimates at one location as β̂, ĉ, â, and b̂. For each location, we use the package fChange, which implements the FF method, to obtain the changepoint estimate ĉ. For β̂, we generate a sequence of candidate values, fit the two-piece linear model with fixed ends, y = β{(ĉ − 1)q + (q − ĉ)1(q ≥ ĉ)}, q = k/T, k = 1, ..., T − 1, based on the changepoint estimate ĉ from the FF method, and pick the β that models the Y_{T,k} process with the minimum mean squared error. These estimators need not be very accurate; they merely provide reasonable initial values on roughly the same scale as the true values, which helps the chains converge.

RMSE from the method in Gromenko et al. (2017). Figure 10 shows the RMSE from all three methods.

Figure 10: Boxplots of RMSE from the three methods under four settings. "GKR" indicates the method in Gromenko et al. (2017). The settings and other method labels are the same as those in Figure 4.
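The data-generating mechanism above can be sketched in a few lines; `sample_curves` is our illustrative helper, with the coefficient variances following the decay rates stated in this appendix (r₁/l³ for the noise coefficients, mean ρ/l² and variance r₂/l³ for the change-function coefficients).

```python
import numpy as np

def sample_curves(T, L=5, r1=0.5, rho=1.0, r2=0.1, n_grid=101, seed=0):
    """Draw T smooth noise curves with Fourier coefficients
    A_{t,l}, B_{t,l} ~ N(0, r1 / l^3), plus one change function with
    coefficients ~ N(rho / l^2, r2 / l^3). Coefficients decay with
    frequency, so all curves are smooth."""
    rng = np.random.default_rng(seed)
    u = np.linspace(0, 1, n_grid)
    l = np.arange(1, L + 1)
    sin_b = np.sqrt(2) * np.sin(2 * np.pi * np.outer(l, u))  # L x n_grid
    cos_b = np.sqrt(2) * np.cos(2 * np.pi * np.outer(l, u))
    sd_noise = np.sqrt(r1 / l ** 3)
    A = rng.normal(0.0, sd_noise, (T, L))    # cosine coefficients
    B = rng.normal(0.0, sd_noise, (T, L))    # sine coefficients
    eps = A @ cos_b + B @ sin_b              # T noise curves
    A_c = rng.normal(rho / l ** 2, np.sqrt(r2 / l ** 3))
    B_c = rng.normal(rho / l ** 2, np.sqrt(r2 / l ** 3))
    delta = A_c @ cos_b + B_c @ sin_b        # change function
    return u, eps, delta

u, eps, delta = sample_curves(T=50)
```

Curves at alternative locations after the changepoint are then `eps[t] + delta`, matching the construction of the simulated data.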
In Figure 4, it seems that the FF method also benefits from the stronger spatial correlation, which should not happen in theory. To further explore how BH and FF react to different spatial correlations given the same data generation seed, we examine two types of pairwise differences in which each pair shares the same seed. We first take the pairwise RMSE difference between the FF and BH methods for each parameter setting, as shown in Figure 11(a). When the spatial correlation is stronger (φ = 5), the RMSE reduction from using BH appears more significant than with the weaker spatial correlation (φ = 2). We then take the pairwise RMSE difference between φ = 2 and φ = 5 for each of the FF and BH methods, for both ρ = 1 and ρ = 1.5, as shown in Figure 11(b). Again, the RMSE reduction of BH from having φ = 5 as opposed to φ = 2, for both ρ = 1 and ρ = 1.5, appears more significant than the corresponding RMSE reduction of FF. Both panels show that BH genuinely benefits from the stronger spatial correlation, while there is no clear evidence that FF does.

Figure 11: (a) Boxplots of pairwise differences between the RMSE from FF and the RMSE from BH under four settings. (b) Boxplots of pairwise differences between the RMSE at φ = 2 and at φ = 5 for BH and FF under both the weaker and stronger signal strengths.

COVID-19 Data in Illinois. We use 7 Fourier basis functions to smooth the raw data. An example of the raw data and the smoothed functional time series is shown in Figure 12.

Figure 12: The ratio of cases in each age group in Champaign County on Jan. 5th, 2021 (black line with dots) and the functional time series after smoothing (red). The x-axis labels the age groups, evenly spread between 0 and 1.
References

Detecting and estimating changes in dependent functional data
Estimation of a change-point in the mean function of functional data
Structural breaks in time series
Detecting and dating structural breaks in functional data without dimension reduction
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Detecting changes in the mean of functional observations
Estimating the current mean of a normal distribution which is subjected to changes in time
Identifying multiple changes for a functional data sequence with application to freeway traffic segmentation
Multiple-change-point detection for auto-regressive conditional heteroscedastic processes
Inference from iterative simulation using multiple sequences
Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments
Testing the equality of mean functions of ionospheric critical frequency curves
Nonparametric inference in small data sets of spatially indexed curves with application to ionospheric trend determination
Detection of change in the spatiotemporal mean function
Estimation and testing for spatially indexed curves with application to ionospheric and magnetic field trends
Local prediction of a spatio-temporal process with an application to wet sulfate deposition
Scalable multiple changepoint detection for functional data sequences
Separable covariance arrays via the Tucker product, with applications to multivariate relational data
Weakly dependent functional data
The maximum likelihood method for testing changes in the parameters of normal observations
Estimation of the mean of functional time series and a two-sample problem
On weak invariance principles for sums of dependent random functionals
A Bayesian change point model for detecting SIP-based DDoS attacks
Adaptive detection of multiple change-points in asset price volatility
Bayesian change point detection for functional data
Changepoint detection in periodic and autocorrelated time series
A robust-likelihood cumulative sum chart
Continuous inspection schemes
A review and comparison of changepoint detection techniques for climate data
Consistency of binary segmentation for multiple change-points estimation with functional data
Spatially varying auto-regressive models for prediction of new human immunodeficiency virus diagnoses
Testing for change points in time series
Sequential block bootstrap in a Hilbert space with application to change point analysis

Since the kurtosis of a Gaussian variable is 3, i.e., E(W²_t) = t and E(W⁴_t) = 3t², we can derive the moments needed in Equation (B.2); plugging the result of each item into Equation (B.2), we can obtain var(B²_t) = 2t²(T − t)²/T². In this paper, we only consider the Brownian bridge on the unit interval [0, 1], i.e., T = 1. Therefore, the result is further simplified to var(B²_t) = 2t²(1 − t)². Further, we can obtain the full conditional likelihood of each parameter. For the variance parameters σ²_i, i = β, a, b, c, we use Gibbs sampling, and we use Metropolis-Hastings-within-Gibbs to update the other parameters.

a. Variance parameters. To obtain the posterior distribution of σ²_i, i = β, a, b, c, we first derive the posterior distribution of a variance parameter with a conjugate prior in the general case. Returning to the Metropolis-Hastings update for c⁰: (ii) Compute the acceptance probability

A(c⁰*, c⁰⁽ʲ⁾) = min{ 1, f(c⁰*|·) T(c⁰⁽ʲ⁾|c⁰*) / [ f(c⁰⁽ʲ⁾|·) T(c⁰*|c⁰⁽ʲ⁾) ] },

where T(·|·) denotes the proposal density. (iii) Then generate a random number u from the uniform distribution on [0, 1]. If u ≤ A(c⁰*, c⁰⁽ʲ⁾), we accept the new state and set c⁰⁽ʲ⁺¹⁾ = c⁰*.
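One such c⁰ update can be sketched as a generic random-walk Metropolis step; the standard-normal full conditional below is purely illustrative, and with a symmetric proposal the T(·|·) terms cancel from the acceptance probability.

```python
import math
import random

def mh_step(current, log_target, sigma, rng):
    """One Metropolis-Hastings-within-Gibbs update with a symmetric
    normal random-walk proposal N(current, sigma^2). The proposal
    densities cancel, so the acceptance probability reduces to
    min{1, f(c*|.) / f(c^(j)|.)}, computed on the log scale."""
    proposal = rng.gauss(current, sigma)
    log_alpha = log_target(proposal) - log_target(current)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal        # accept the candidate state
    return current             # otherwise keep the current state

# Toy run: sample from N(0, 1) as the full conditional.
rng = random.Random(5)
log_target = lambda x: -0.5 * x * x
x, draws = 0.0, []
for _ in range(20000):
    x = mh_step(x, log_target, 1.5, rng)
    draws.append(x)
```

The chain's sample mean and variance settle near 0 and 1, the moments of the illustrative target.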
Otherwise, we reject the new state and set c⁰⁽ʲ⁺¹⁾ = c⁰⁽ʲ⁾. Note that because of the symmetry of the proposal distribution, the proposal densities cancel in the acceptance probability. Suppose that in the jth iteration we have updated all the other parameters except the three range parameters φ, φ_s, φ_t and the variance parameters.

c. Range parameters. For the range parameters, we still use a normal random walk as the proposal distribution, but note that the range parameters must be positive. For example, to obtain posterior samples of φ, the random candidate φ* is generated from N(φ⁽ʲ⁾, σ²) subject to φ* > 0, where σ² is a tuning parameter that may differ from the one used to generate c⁰*. Assume that in the (j + 1)th iteration φ is updated right after c⁰, and that the other two range parameters φ_s, φ_t and the variance parameters have not yet been updated. Based on this proposal distribution, the proposal density is q(φ*|φ⁽ʲ⁾) = φ({φ* − φ⁽ʲ⁾}/σ) / {σ Φ(φ⁽ʲ⁾/σ)}, where φ and Φ denote the probability density function and cumulative distribution function of the standard normal distribution, so the proposal terms no longer cancel in the acceptance probability.

Recall the smoothness and decay properties of Fourier coefficients: a piecewise continuous function has Fourier coefficients that decay as 1/n. Also recall a result for stochastic sequences: if (X_n) is a stochastic sequence such that each element has finite variance, and a_n⁻² var(X_n) = var(a_n⁻¹ X_n) is a null sequence for a sequence (a_n) of real numbers, then a_n⁻¹{X_n − E(X_n)} converges to zero in probability by Chebyshev's inequality. To make sure that the independent curves, change functions, and final observations are at least piecewise continuous, the coefficients should satisfy these decay conditions for any s. If A_{t,l}(s), B_{t,l}(s) ~ N(0, r₁/l³) and A_l(s), B_l(s) ~ N(ρ/l², r₂/l³), the conditions above are guaranteed, where r₁, r₂, and ρ are constants.
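The positivity-constrained random walk can be sketched as follows. This is one standard way to implement such an update, not necessarily identical to the paper's scheme, and the Exponential(1) target is purely illustrative: proposals are drawn from a normal truncated to (0, ∞), and the asymmetric proposal contributes the factor Φ(φ⁽ʲ⁾/σ)/Φ(φ*/σ) to the acceptance ratio.

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mh_step_positive(current, log_target, sigma, rng):
    """Random-walk update for a positive parameter (e.g. a range
    parameter phi). The truncated-normal proposal density is
    phi((y - x)/sigma) / {sigma * Phi(x/sigma)}, so the acceptance
    ratio picks up Phi(current/sigma) / Phi(proposal/sigma)."""
    proposal = rng.gauss(current, sigma)
    while proposal <= 0:                 # redraw until positive
        proposal = rng.gauss(current, sigma)
    log_alpha = (log_target(proposal) - log_target(current)
                 + math.log(normal_cdf(current / sigma))
                 - math.log(normal_cdf(proposal / sigma)))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return current

# Toy run: sample phi from an Exponential(1) full conditional.
rng = random.Random(6)
log_target = lambda phi: -phi
phi, draws = 1.0, []
for _ in range(30000):
    phi = mh_step_positive(phi, log_target, 1.0, rng)
    draws.append(phi)
```

Every draw stays strictly positive by construction, and the sample mean settles near 1, the mean of the illustrative Exponential(1) target.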
We choose r₁ = 1/2 and r₂ = 1/10, so the range of the independent curves is well controlled and the change functions at all locations share a similar shape, which is a common situation in spatially correlated real datasets. The spatial correlation structure is the same as the one used to generate the changepoints. For each location s, we obtain estimates of the parameters β(s), c(s), a(s), and b(s), denoted β̂(s), ĉ(s), â(s), and b̂(s). To start the MCMC, we assign initial values for all