Ning, Ning; Ionides, Edward L. Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality. 2021-10-20.

Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al. (2020)) is ineffective and the iterated filtering algorithm (Ionides et al. (2015)) suffers from the COD, while our IBPF algorithm beats the COD consistently across various experiments with different metrics.

We first give the background and motivation in Section 1.1, then state our contributions in Section 1.2, followed by the organization of the paper in Section 1.3. Spatiotemporal data arise when measurements are made through time at a collection of spatial locations. Spatiotemporal inference for epidemiological and ecological systems is arguably the last remaining open problem from the six challenges in time series analysis of nonlinear systems posed by Bjørnstad and Grenfell (2001).
A disease transmission system is stochastic and imperfectly observable, so it is commonly modeled by a partially observed Markov process (POMP), also known as a state space model or hidden Markov model, which consists of a latent Markov process representing the time evolution of the system and a measurement process by which stochastic observations of the latent process are collected at specified time points. Particle filters (PFs), also known as sequential Monte Carlo methods, are recursive algorithms that estimate the likelihood of observed data and the conditional distribution of the latent process given data from a POMP model (Doucet et al. (2001); Cappé et al. (2007); Doucet et al. (2009)). For parameter learning, two iterated filtering (IF) approaches were developed, Ionides et al. (2006) and its subsequent improvement Ionides et al. (2015), referred to as the IF1 and IF2 algorithms respectively, which coerce a particle filter into maximizing the likelihood function over unknown parameters. PF methods and the parameter learning algorithms based on them (such as IF2) are capable of handling highly nonlinear latent processes (King et al. (2008); Ionides et al. (2011)). In epidemiological applications, IF1 and IF2 can considerably increase the accuracy of outbreak predictions while also allowing models whose structure reflects different underlying assumptions to be compared (Dobson (2014)). Unfortunately, PF performance deteriorates rapidly as the model dimension increases (Bengtsson et al. (2008); Snyder et al. (2008)). Rebeschini and Van Handel (2015) rigorously showed that PF suffers from the curse of dimensionality (COD): the upper bound on the algorithmic filter error grows exponentially in the dimension of the state space of the underlying model. As expected, PF-based parameter learning algorithms inherit the COD, limiting their applicability in high-dimensional problems.
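To make the prediction/correction/resampling structure of a PF concrete, the following is a minimal bootstrap particle filter sketch for a generic POMP. It is an illustration under assumed interfaces (the function names `x0_sampler`, `transition_sampler`, `obs_density` and the toy linear-Gaussian model are hypothetical, not the paper's measles implementation); note it only needs a simulator for the latent dynamics, which is the plug-and-play property discussed later.

```python
import numpy as np

def bootstrap_particle_filter(y, x0_sampler, transition_sampler, obs_density, J=1000, rng=None):
    """Minimal bootstrap particle filter for a generic POMP.

    y                     : array of observations y_1..y_N
    x0_sampler(J)         : draws J initial latent states
    transition_sampler(x) : simulates x_n given x_{n-1} (plug-and-play: no density needed)
    obs_density(y_n, x)   : evaluates f(y_n | x_n) for each particle
    Returns a Monte Carlo estimate of the log-likelihood of the data.
    """
    rng = rng or np.random.default_rng(0)
    x = x0_sampler(J)                      # particle swarm at time 0
    loglik = 0.0
    for y_n in y:
        x = transition_sampler(x)          # prediction: simulate the latent dynamics
        w = obs_density(y_n, x)            # correction: weight by the measurement density
        loglik += np.log(w.mean())         # conditional log-likelihood estimate at time n
        idx = rng.choice(J, size=J, p=w / w.sum())  # multinomial resampling
        x = x[idx]
    return loglik

# Toy linear-Gaussian POMP: x_n = 0.9 x_{n-1} + noise, y_n = x_n + noise.
rng = np.random.default_rng(1)
ll = bootstrap_particle_filter(
    y=np.array([0.1, -0.2, 0.3]),
    x0_sampler=lambda J: rng.normal(0.0, 1.0, J),
    transition_sampler=lambda x: 0.9 * x + rng.normal(0.0, 0.5, x.shape),
    obs_density=lambda y_n, x: np.exp(-0.5 * (y_n - x) ** 2) / np.sqrt(2 * np.pi),
)
print(ll)  # a stochastic log-likelihood estimate
```

When the state is a high-dimensional random field, the weights `w` above degenerate, which is the COD phenomenon motivating the block approach of this paper.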
The ensemble Kalman filter (EnKF) is a recursive filter suitable for problems with a large number of variables. EnKF represents uncertainty in the latent state space using a finite collection of state values, and we refer to these ensemble members as particles by analogy with PF. EnKF differs from PF by adopting a Gaussian approximation in the rule used to update the particles when filtering. EnKF methods have been used for geophysical models in data assimilation due to their computational scalability to high dimensions (Houtekamer and Mitchell (2001); Evensen (1994); Katzfuss et al. (2020)). For parameter learning, the iterated EnKF (IEnKF) algorithm extends the IF2 approach for parameter estimation by replacing a PF with an EnKF; it propagates the ensemble members by simulation from the dynamic model and then updates the ensemble to assimilate observations using a Gaussian-inspired rule (Li et al. (2020)). Given that EnKF relies on locally linear and Gaussian approximations, it can be ineffective for highly nonlinear and non-Gaussian systems (Ades and Van Leeuwen (2015); Lei et al. (2010); Miller et al. (1999)). Unsurprisingly, this unsuitability carries over to EnKF-based parameter learning algorithms (such as IEnKF). Block sampling strategies for PF were proposed with temporal blocks in Doucet et al. (2006). Rebeschini and Van Handel (2015) investigated spatial blocks and proved that a block PF (BPF) beats the COD under certain conditions. However, the beautiful work of Rebeschini and Van Handel (2015) is theoretical in nature and was not anticipated to be applicable to real high-dimensional problems (page 2812 therein).
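The Gaussian-inspired update that distinguishes EnKF from a PF can be sketched as follows. This is a generic stochastic ("perturbed observations") EnKF analysis step under an assumed linear observation operator `H`; the function name and the toy example are illustrative, not the IEnKF implementation of Li et al. (2020).

```python
import numpy as np

def enkf_update(X, y, H, R, rng=None):
    """One EnKF analysis step with perturbed observations.

    X : (J, d) forecast ensemble of latent-state particles
    y : (p,) observation vector
    H : (p, d) linear observation operator (the linear/Gaussian approximation
        that distinguishes EnKF from a particle filter)
    R : (p, p) observation noise covariance
    Returns the analysis ensemble after assimilating y.
    """
    rng = rng or np.random.default_rng(0)
    J = X.shape[0]
    Xc = X - X.mean(axis=0)                       # centered ensemble
    P = Xc.T @ Xc / (J - 1)                       # sample forecast covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # Kalman gain from ensemble statistics
    Y_pert = y + rng.multivariate_normal(np.zeros(len(y)), R, size=J)
    # shift each member toward its perturbed observation by the Gaussian update rule
    return X + (Y_pert - X @ H.T) @ K.T

# Toy example: 2-d state, scalar observation of the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (200, 2))
Xa = enkf_update(X, y=np.array([2.0]), H=np.array([[1.0, 0.0]]),
                 R=np.array([[0.25]]), rng=rng)
print(Xa.mean(axis=0))  # first coordinate pulled toward the observation 2.0
```

Because the update is a linear shift derived from ensemble covariances, it scales well with dimension but cannot represent the strongly non-Gaussian conditional distributions arising in the measles model of Section 4.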
In recent years, many efforts have been made to develop practical methods for these problems by building on the "block" concept, including but not limited to the following: Johansen (2015) proposed a method for system identification based on both the block sampling idea and the annealed importance sampling approach; Singh et al. (2017) applied the particle Gibbs algorithm inside a generic Gibbs sampler over temporal blocks to handle long time series; Park and Ionides (2020) proposed a twisted particle filter model (Whiteley and Lee (2014)) with iterated auxiliary PFs (Guarniero et al. (2017)) for inference on moderately high-dimensional spatiotemporal models, in which the particle filtering step corresponds to an adapted version of the block sampling method; Goldman and Singh (2021) proposed a blocked sampling scheme for latent state inference in high-dimensional state space models; Ionides et al. (2021) proposed the bagged filter for partially observed interacting systems and showed that BPF can perform well on practical scientific models. So far there is no high-dimensional parameter learning approach that is generically applicable to partially observed, highly nonlinear, stochastic spatiotemporal processes. The goal of this paper is to develop such an algorithm, one that is generically applicable and able to beat the COD. Unlike EnKF and hence IEnKF, whose theoretical understanding is limited, the proposed algorithm has a rigorous convergence analysis with a precise error bound that makes the dimension dependence explicit. In this paper, we propose the iterated BPF (IBPF) algorithm for high-dimensional parameter learning over partially observed, nonlinear, stochastic spatiotemporal processes. The contributions of the paper fall into four distinct categories: (1) General graphical model structure. In this paper, we consider a general graphical POMP model having the following characteristics: • General state spaces and measures.
The latent state (X_n)_{n≥0} is a Markov chain taking values in a general Polish state space X, endowed with a general reference measure ψ. The observation sequence (Y_n)_{n≥1} is conditionally independent given (X_n)_{n≥1} and takes values in a general Polish state space Y, endowed with a general reference measure φ. The auxiliary Markov chain (Θ_n)_{n≥0}, introduced for parameter learning purposes, has its own general Polish state space Θ and general reference measure λ. • General transition densities. The transition densities of X_n, Y_n, and Θ_n are all time-inhomogeneous and in general form, where only weak standard conditions are required (see "Theoretical contribution" below for details). A model that is defined using a simulator instead of an analytically tractable characterization of an underlying process is said to be implicitly defined (Diggle and Gratton (1984)). Models for complex dynamic systems are sometimes defined implicitly by a computer simulation algorithm and lack analytically tractable transition densities. Inference approaches that are applicable to implicitly defined models are said to possess the plug-and-play property (Bretó et al. (2009); He et al. (2010)), alternatively called equation-free (Kevrekidis et al. (2004); Xiu et al. (2005)) or likelihood-free (Marjoram et al. (2003); Sisson et al. (2007)). We note that our IBPF algorithm is in the class of plug-and-play algorithms: it does not need to evaluate the transition density, merely requiring a simulator as input for the latent dynamic process. • General graph structure. The state of (X_n, Y_n, Θ_n) at each time n is a random field (X^v_n, Y^v_n, Θ^v_n)_{v∈V} indexed by a general finite undirected graph G = (V, E), with V being the set of vertices of the graph and E being the set of edges of the graph. A general partition K separates V into nonoverlapping blocks, such that V = ∪_{K∈K} K and K ∩ K′ = ∅ for K ≠ K′ with K, K′ ∈ K. (2) Innovative methodology.
Now we elaborate our methodological contributions by answering: how the BPF algorithm can be used for parameter learning over general spatiotemporal systems, how it connects to and differs from mainstream algorithms, and how its performance can be guaranteed in terms of graph dimension, time steps, algorithm convergence, and likelihood maximization, respectively, as follows: • The BPF algorithm in Rebeschini and Van Handel (2015) is the first PF algorithm with rigorous guarantees on beating the COD; however, "it is far from clear whether this simple algorithm is of immediate practical utility in the most complex real-world applications" (page 2812 therein). Our IBPF algorithm in Algorithm 1 embeds a BPF algorithm on an extended state space in an iterative scheme that constructs parameter values approaching the maximum of the likelihood function. This algorithm inherits from BPF the property that only observations in each block are used to update predictions of the state variables and parameter variables in that block, which is the key to scalability. • The connections and differences of our IBPF algorithm relative to the IF2 algorithm, the PF algorithm, the BPF algorithm, and the iterated importance sampling algorithm are as follows: when there is no partition of the graph, the IBPF algorithm is essentially the IF2 algorithm; when there is no partition of the graph and no iteration, the IBPF algorithm is the PF algorithm; when there is no parameter variable involved and no iteration, IBPF is the BPF algorithm; when there is no partition of the graph and no particle, IBPF is iterated importance sampling. • First, our IBPF algorithm has rigorous performance guarantees in terms of graph dimension: from Theorem 3.3 we can see that the algorithmic error is exponential in the cardinality card(k) of a subset k of a block, instead of the cardinality card(V) of the whole graph, where card(·) stands for cardinality.
Second, our IBPF algorithm has rigorous performance guarantees in terms of time steps: the upper bound in Theorem 3.3 is uniform over all time steps n. Third, our IBPF algorithm has rigorous performance guarantees in terms of algorithm convergence: since the upper bound in Theorem 3.3 is uniform over the data Y_1, ..., Y_N in any one of the M iterations, and the result of one iteration is simply the initial value of the next, all M iterations together can be represented as a filtering problem for the extended graphical POMP model on M replications of the data, from which convergence as M goes to infinity follows. Fourth, our IBPF algorithm has rigorous performance guarantees in terms of likelihood maximization: given that our IBPF algorithm is the block-wise version of the IF2 algorithm, replacing the PF filtering error bound in the maximum likelihood estimation (MLE) analysis of Ionides et al. (2015) with our Theorem 3.3 yields block MLE under corresponding conditions using the local block metric. (3) Theoretical contribution. In Theorem 3.3, for this generalized model structure and practical methodology, we rigorously show that the algorithmic error can be bounded using the dimension of a local block, uniformly both in time and in the model dimension. Our theoretical contributions are the following: • Theorem 3.3 is obtained under Assumption 3.1, which enforces only weak conditions on the local transition densities. The only conditions on the transition densities f_{X^v_n|X_{n-1}} and f_{Y^v_n|X^v_n} are that they are bounded above and below, as in Theorem 2.1 of Rebeschini and Van Handel (2015); these are vertex-based localized versions of standard assumptions in PF analysis. In this context, we impose the same condition on the transition density f_{Θ^v_n|Θ_{n-1}}. Detailed explanations of Assumption 3.1 are provided in Remark 3.2.
• Theorem 3.3 generalizes the main result of Rebeschini and Van Handel (2015) (Theorem 2.1 therein) to the time-inhomogeneous setting. Rebeschini and Van Handel (2015) introduced the mathematical machinery of a local particle filtering algorithm in high dimension, which had not previously been applied in the study of nonlinear filtering. They analyzed the basic time-homogeneous case, which corresponds to taking our local transition densities f_{X^v_n|X_{n-1}}, f_{Y^v_n|X^v_n}, and f_{Θ^v_n|Θ_{n-1}} to be the same for all times n. However, Rebeschini and Van Handel (2015) "is essentially a proof of concept" (page 2819 therein). Time-homogeneity is a limitation in practical applications; for example, stochastic epidemic models may have a time-varying population size, or other covariates, leading to time-inhomogeneity (see, e.g., Bretó et al. (2009)). Theorem 3.3 is consistent with Rebeschini and Van Handel (2015) in the time-homogeneous setting, but provides a general result in the time-inhomogeneous setting when the parameter learning functionality is ignored (i.e., taking all parameters as constants). • Theorem 3.3 provides theoretical guarantees on algorithm performance in terms of convergence and beating the COD; its proof relies on the decay of correlations (DOC) property. The DOC property arises in statistical mechanics, in investigations of high-dimensional networks (see, e.g., Liu et al. (2018); Ning (2019a,b, 2021)). In the current context, the DOC property means that the effect on the distribution in a block K of a perturbation made in another block K′ decays rapidly in the distance d(K, K′) defined in (2.3). By means of the Dobrushin comparison theorem, we extend this mechanism to the parameter space by showing that the DOC properties of the underlying model are inherited by the IBPF algorithmic filter π̂_n, which is the joint conditional distribution of the state X_n and Θ_n given the new observation Y_n.
• Theorem 3.3 gives precise answers on exactly which sufficient conditions are needed, how well the algorithm is mathematically guaranteed to perform under those conditions, and how to tune the hyper-parameters to find a user-desired trade-off between the conditions and the algorithmic error bound. Theorem 2.1 of Rebeschini and Van Handel (2015) only states that there exists a constant ε_0 ∈ (0, 1) such that the assumed lower bound of X's local transition density exceeds ε_0 uniformly for all times n. In our generalized setting, our sufficient condition is that the product of the assumed lower bound ε_x of X's local transition density and the assumed lower bound ε_θ(σ) of Θ's local transition density exceeds the threshold given in (3.14), where ∆ is the maximal number of vertices that interact with one single vertex in its r-neighborhood and ∆_K is the maximal number of blocks that interact with a single block (including itself). For k a subset of the block K in the partition K, the error bound of Theorem 2.1 of Rebeschini and Van Handel (2015) is ambiguous but concise enough for theoretical purposes, involving constants α, β_1 and β_2 that are positive and finite but not given explicitly. We give a precise bound with each parameter in its exact form in Theorem 3.3 and interpret the results in Remark 3.4. (4) Excellent performance. Our IBPF algorithm is designed for high-dimensional parameter learning over nonlinear and non-Gaussian POMP models with a graphical structure (such as spatiotemporal models), where each vertex (location) is allowed to have its own specific parameters. It is appropriate, and sometimes necessary, for some parameters to vary across locations; for example, in measles transmission modeling, the basic reproduction number R_0, which governs the epidemic transmission speed. In epidemiological theory, R_0 is central because it has interpretations in terms of many quantities of interest.
In Section 4.1, we generalize the spatiotemporal model for measles transmission covered in Park and Ionides (2020) and Ionides et al. (2021) by allowing location-specific parameters, for algorithm utility illustration, performance evaluation, and shedding light on its wide applications. In Section 4.3, we conduct the performance analysis of IEnKF, IF2 and our IBPF on learning initial values of the measles transmission dynamics and nonlinear parameters for each location, over the dataset covered in Section 4.2. We progressively increase the parameter learning difficulty through four cases: in case 1, the goal is to learn four initial values for each location; in case 2, in addition to initial values, the goal is to also learn the nonlinear parameter R_0 for each location; in case 3, in addition to initial values, the goal is to also learn a highly nonlinear parameter for each location; in case 4, the goal is to learn several unit-specific parameters incorporating all the parameters in cases 1-3, which is the hardest of the four cases. To compare the algorithms fairly, each algorithm uses the same number of iterations and the same number of particles. We conduct 10 replicates of all the parameter learning performance comparisons, such that in each replicate all algorithms start from the same initial search values drawn uniformly. We evaluate the best parameters learned in the 10 replicates by each algorithm (e.g., IEnKF by means of the EnKF metric on log-likelihood estimation) with the metrics of the other two algorithms (the PF metric adopted by IF2 and the BPF metric adopted by our IBPF), in all the experiments. We restrict each block to exactly 2 cities in all the experiments, since 2 cities is the minimum (in the 1-city case, our IBPF is IF2); this certainly degrades IBPF's performance but avoids human intervention. Figure 2 reports the log-likelihood estimates per city per time step for various dimensions in cases 1 and 2.
From the results of case 1, we can see that for this initial-value learning problem, which is nonlinear and non-Gaussian but not highly nonlinear, all three algorithms find parameter values with likelihood close to that of the true parameter. From the results of case 2, we can see that IEnKF and IBPF still perform well, but IF2 does not scale well with dimension, confirming the phenomenon that PF does not scale well with dimension (Bengtsson et al. (2008); Snyder et al. (2008)). Figure 3 reports the corresponding results for cases 3 and 4. We can see that even for 2 cities, under all 3 metrics, IEnKF performs very badly in case 3 and fails completely in case 4, all of which confirms the phenomenon that EnKF may not perform well for highly nonlinear and non-Gaussian problems (Ades and Van Leeuwen (2015); Lei et al. (2010); Miller et al. (1999)). The performance of IF2 drops in both case 3 and case 4 under all 3 metrics. In all four cases, our IBPF consistently finds parameter values with likelihood equal to or better than that of the true parameter under all three metrics. The rest of the paper proceeds as follows: in Section 2, we define the model and measures, first describing the general graphical POMP model (X_n, Y_n; Θ_n) on the graph G in Section 2.1, then the partition K that separates the general graph G into independent blocks in Section 2.2, and then the global and local metrics necessary for the analysis throughout the paper in Section 2.3. In Section 3, we provide the main results of this paper, first describing our IBPF algorithm for parameter learning over general graphical POMP models in Section 3.1, then conducting preliminary algorithmic analyses in Section 3.2, and then providing rigorous theoretical guarantees in Section 3.3.
In Section 4, we conduct a performance analysis through the generalized spatiotemporal model for measles covered in Section 4.1, over the dataset covered in Section 4.2, and evaluate the performance of IBPF, IF2 and IEnKF in Section 4.3. In Appendix A, we provide the IF2 algorithm for comparison. In Appendix B, we provide existing technical results needed for the rigorous proofs that follow. In Appendix C, we prepare for the mathematical derivations by properly defining filtering, smoothing, and correlation measurement quantities. We defer all the lemmas and propositions bounding the bias and the variance of the algorithmic error to Appendix D and Appendix E, respectively. Original parameter learning results of the four case studies, without rescaling to account for the spatial and time dimensions, are provided in Appendix F. In this section, we describe the model and measures in greater detail. We first describe the extended POMP model (X_n, Y_n; Θ_n) on the graph G in Section 2.1, then the partition K that separates G into independent blocks in Section 2.2, followed by the global and local metrics necessary for the analysis in Section 2.3. A general POMP model is a Markov chain (X_n, Y_n), where (X_n)_{n≥0} is a Markov chain in a Polish state space X, while (Y_n)_{n≥1} are conditionally independent given (X_n)_{n≥1} in a Polish state space Y. Here, X_n is not directly observable, while Y_n serves as its partial and noisy observation made at time n. Define the reference measure of X_n (resp. Y_n) on its state space X (resp. Y) as ψ (resp. φ). Suppose that there is an unknown auxiliary Markov chain Θ_n for parameter learning, which has its own Polish state space Θ and reference measure λ.
For n ≥ 1, with respect to ψ we define the transition density of X_n as f_{X_n|X_{n-1}}(x_n | x_{n-1}; θ_n), with respect to φ we define the transition density of Y_n as f_{Y_n|X_n}(y_n | x_n; θ_n), and with respect to λ we define the transition density of Θ_n as f_{Θ_n|Θ_{n-1}}(θ_n | θ_{n-1}; σ), where σ is a nonnegative constant. That is, for our extended Markov chain model (X_n, Y_n; Θ_n), the transition probability is given by P(A | (x_{n-1}, y_{n-1}); θ_n) = ∫ 1_A(x_n, y_n) f_{X_n|X_{n-1}}(x_n | x_{n-1}; θ_n) f_{Y_n|X_n}(y_n | x_n; θ_n) ψ(dx_n) φ(dy_n). The state of (X_n, Y_n, Θ_n) at each time n is a random field (X^v_n, Y^v_n, Θ^v_n)_{v∈V} indexed by a finite undirected graph G = (V, E), where V stands for the vertices of the graph and E for the edges. The graph describes the location relationship of the data and the spatial degrees of freedom of the model. Based on the network structure, the state spaces X, Y, and Θ can be written in the product forms X = ∏_{v∈V} X^v, Y = ∏_{v∈V} Y^v, and Θ = ∏_{v∈V} Θ^v. Define the reference measure of X^v_n on its state space X^v as ψ^v, the reference measure of Y^v_n on its state space Y^v as φ^v, and the reference measure of Θ^v_n on its state space Θ^v as λ^v. Similarly, based on the network structure, we have the product-form expressions ψ = ⊗_{v∈V} ψ^v, φ = ⊗_{v∈V} φ^v, and λ = ⊗_{v∈V} λ^v. With respect to ψ^v we define the transition density of X^v_n as f_{X^v_n|X_{n-1}}, with respect to φ^v we define the transition density of Y^v_n as f_{Y^v_n|X^v_n}, and with respect to λ^v we define the transition density of Θ^v_n as f_{Θ^v_n|Θ_{n-1}}; based on the network structure, the joint transition densities factorize as products of these local densities over v ∈ V. We consider a partition K that partitions V into nonoverlapping blocks, i.e., V = ∪_{K∈K} K with K ∩ K′ = ∅ for K ≠ K′. Based on the partition, we can write Ξ_n = (Ξ^K_n)_{K∈K}, where Ξ can be X_n, Y_n or Θ_n, as well as the associated state space X, Y or Θ.
For any set W ⊆ V, we use (X × Θ)^W and X^W × Θ^W interchangeably, and the reference measures of X_n and Θ_n restricted to W can be defined respectively as ψ^W := ⊗_{v∈W} ψ^v and λ^W := ⊗_{v∈W} λ^v. We define the distance d as the length of the shortest path in the graph G connecting two vertices, based on which we define for each vertex v ∈ V the r-neighborhood N(v) := {v′ ∈ V : d(v, v′) ≤ r}. For integers 0 ≤ m ≤ n, denote Ξ_{m:n} := (Ξ_m, Ξ_{m+1}, ..., Ξ_n), where Ξ can be X, Y or Θ. Suppose that, for n ≥ 1, the conditional distribution of X^v_n given X_{0:n-1} depends only on X^{N(v)}_{n-1}; similarly suppose that, for n ≥ 1, the conditional distribution of Θ^v_n given Θ_{0:n-1} depends only on Θ^{N(v)}_{n-1}. Based on this, we can define the collection of blocks that interact with any block K ∈ K in one step of the dynamics as N(K) := {K′ ∈ K : d(K, K′) ≤ r}. Given a set W ⊆ V, denote the r-inner boundary of W, the subset of vertices in W that can interact with vertices outside W, as ∂W := {v ∈ W : d(v, V \ W) ≤ r}. (2.5) We now introduce some quantities which will be used frequently throughout the paper: the maximal size of a single block in the partition K, |K|_∞ := max_{K∈K} card(K), (2.6) where card(K) denotes the cardinality of K; the maximal number of vertices that interact with one single vertex in its r-neighborhood in one step of the dynamics, ∆ := max_{v∈V} card(N(v)); (2.7) and the maximal number of blocks that interact with one single block in one step of the dynamics, ∆_K := max_{K∈K} card{K′ ∈ K : d(K, K′) ≤ r}. (2.8) We assume that the process (X, Y; Θ) is realized on its canonical probability space; denote by P and E the probability measure and expectation on that space, respectively. We use the functional-analytic notation ρ(f) for the integral of a measurable function f with respect to the measure ρ (provided this integral exists). Between two random measures ρ and ρ′ on X, we define the distance |||ρ − ρ′||| := sup_{f ∈ X: |f| ≤ 1} E[(ρ(f) − ρ′(f))²]^{1/2}, (2.9) where X denotes the class of measurable functions f : X → R, and define the local distance, for k ⊆ V, |||ρ − ρ′|||_k := sup_{f ∈ X_k: |f| ≤ 1} E[(ρ(f) − ρ′(f))²]^{1/2}, (2.10) where X_k denotes the class of measurable functions f : X → R such that f(x) = f(x′) whenever x^k = x′^k.
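The graph-theoretic quantities just introduced — the distance d, the r-neighborhood N(v), and the interaction counts ∆ and ∆_K — can be computed mechanically for a concrete graph. The sketch below (all names and the ring-lattice example are illustrative choices, not from the paper) computes them for a ring of 8 vertices partitioned into 4 blocks of 2 adjacent vertices, with r = 1.

```python
import collections

def graph_distances(adj):
    """All-pairs shortest-path lengths d(v, v') by BFS on an undirected graph."""
    dist = {}
    for s in adj:
        d = {s: 0}
        q = collections.deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    q.append(w)
        dist[s] = d
    return dist

# Toy graph: a ring of 8 vertices, partitioned into 4 blocks of 2 adjacent vertices.
V = list(range(8))
adj = {v: [(v - 1) % 8, (v + 1) % 8] for v in V}
blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]
r = 1
dist = graph_distances(adj)

# r-neighborhood N(v) := {v' in V : d(v, v') <= r}
N = {v: [w for w in V if dist[v][w] <= r] for v in V}
Delta = max(len(N[v]) for v in V)  # max vertices interacting with a single vertex

# block distance d(K, K') := min over vertex pairs; Delta_K counts interacting
# blocks within distance r, including the block itself
def block_dist(K1, K2):
    return min(dist[v][w] for v in K1 for w in K2)

Delta_K = max(sum(1 for K2 in blocks if block_dist(K1, K2) <= r) for K1 in blocks)
print(Delta, Delta_K)
```

On the ring, each vertex interacts with itself and its two neighbors (∆ = 3), and each 2-vertex block touches itself and its two adjacent blocks (∆_K = 3); both counts stay bounded as the ring grows, which is exactly the regime where the block error bounds below are dimension-free.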
Similarly, we define the total variation distance between two random measures ρ and ρ′ on X as ‖ρ − ρ′‖ := sup_{f ∈ X: |f| ≤ 1} |ρ(f) − ρ′(f)|, (2.11) and the local total variation distance, for k ⊆ V, as ‖ρ − ρ′‖_k := sup_{f ∈ X_k: |f| ≤ 1} |ρ(f) − ρ′(f)|. (2.12) In this section, we describe our IBPF algorithm for parameter learning over general graphical POMP models in Section 3.1, conduct its preliminary algorithmic analyses in Section 3.2, and then establish its theoretical guarantees on algorithm performance and convergence in Section 3.3. We propose the IBPF algorithm in Algorithm 1. For notational convenience, we set 1:N := {1, 2, ..., N} for N ∈ N throughout the paper. In Algorithm 1, Θ^{F,m}_{n,j} (resp. X^{F,m}_{n,j}) is the j-th particle in the Monte Carlo representation of the m-th iteration of a filtering recursion at time n, where this filtering recursion is coupled with a prediction recursion represented by Θ^{P,m}_{n,j} (resp. X^{P,m}_{n,j}). We note that the initial values of the dynamics can be treated as parameters and then learned, with the help of the initial value function f_{X_0}(θ); see case study 1 in Section 4.3. We list the IBPF algorithm's connections with the IF2 algorithm in Ionides et al. (2015) (also see Appendix A), the PF algorithm, the BPF algorithm, and the iterated importance sampling algorithm: when there is no partition of the graph (card(K) = 1), the IBPF algorithm is essentially the IF2 algorithm; when there is no partition of the graph and no iteration (card(K) = 1 and M = 1), the IBPF algorithm is the PF algorithm; when the parameter is known and fixed (Θ = θ) and there is no iteration (M = 1), the IBPF algorithm is the BPF algorithm; when there is no partition of the graph and a single particle (card(K) = 1 and J = 1), the IBPF algorithm is the iterated importance sampling algorithm. Inspecting the IBPF pseudocode (Algorithm 1), we can see that the same set of observations Y_1, ..., Y_N is used in each of the M iterations. Let us first focus on one of the M iterations, say m = 1, and drop the m superscript/subscript.
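Before the formal operator notation, the inner loop of an IBPF-style update can be sketched informally: perturb the parameter particles, simulate the latent dynamics, then correct and resample each block independently using only that block's observations. The following is a simplified sketch under assumed interfaces (scalar state and parameter per vertex; function names and the toy Gaussian model are hypothetical), not the paper's Algorithm 1 itself.

```python
import numpy as np

def ibpf_block_step(X, Theta, y, sigma, simulate, local_obs_density, blocks, rng=None):
    """One time step of the inner block-particle-filter loop of an IBPF-style
    algorithm (sketch): parameter perturbation, prediction, and block-wise
    correction/resampling.

    X, Theta : (J, V) latent-state and parameter swarms (one scalar per vertex)
    y        : (V,) observation at the current time
    sigma    : parameter perturbation scale for the current iteration
    blocks   : list of lists of vertex indices forming the partition K
    """
    rng = rng or np.random.default_rng(0)
    J = X.shape[0]
    Theta = Theta + rng.normal(0.0, sigma, Theta.shape)  # parameter perturbation
    Xp = simulate(X, Theta)                              # prediction (plug-and-play)
    Xf, Thetaf = np.empty_like(Xp), np.empty_like(Theta)
    for K in blocks:
        # block-local weights: only this block's observations are used
        w = np.prod([local_obs_density(y[v], Xp[:, v]) for v in K], axis=0)
        idx = rng.choice(J, size=J, p=w / w.sum())
        Xf[:, K] = Xp[idx][:, K]        # resample state and parameter components
        Thetaf[:, K] = Theta[idx][:, K] # independently block by block
    return Xf, Thetaf

# Toy run: 4 vertices in 2 blocks, Gaussian local dynamics and measurements.
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (500, 4))
T0 = rng.normal(0.0, 1.0, (500, 4))
Xf, Tf = ibpf_block_step(
    X0, T0, y=np.array([1.0, 1.0, -1.0, -1.0]), sigma=0.1,
    simulate=lambda X, T: 0.9 * X + rng.normal(0.0, 0.3, X.shape),
    local_obs_density=lambda y_v, x_v: np.exp(-0.5 * (y_v - x_v) ** 2),
    blocks=[[0, 1], [2, 3]], rng=rng,
)
print(Xf[:, :2].mean(), Xf[:, 2:].mean())  # block means pulled toward their own observations
```

Because each block draws its resampling indices separately, the joint dependence between blocks is broken at every step; this is the bias that the blocking operator B introduces and that the DOC analysis below controls.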
Given the observations Y_1, ..., Y_n, we aim to approximate the joint nonlinear filter, for n ≥ 1, defined in (3.1). The filter π_n for n ≥ 1, which we also call the true filter to differentiate it from the IBPF approximated filter, can be expressed recursively as π_n = F_n π_{n-1}, π_0 = δ_x δ_θ, where δ_x stands for the point mass at x, with F_n = C_n P_n (3.2) evolving as follows: P_n is defined as the prediction operator (3.3) and C_n is defined as the correction operator (3.4), for any probability measure ρ on X × Θ.

Algorithm 1 (The IBPF algorithm). Input:
Initial value function f_{X_0}(θ)
Simulator for f_{X_n|X_{n-1}}(x_n | x_{n-1}; θ), n ∈ 1:N
Evaluator for f_{Y_n|X_n}(y_n | x_n; θ), n ∈ 1:N
Data, y_{1:N}
Number of iterations, M
Number of particles, J
Partition, K
Initial parameter swarm, {Θ^0_j, j ∈ 1:J}
Perturbation density, f_{Θ_n|Θ_{n-1}}(θ | ϑ; σ), n ∈ 1:N
Perturbation sequence, σ_{1:M}

To facilitate the analysis, we define an intermediate filter π̃_n, which can be expressed recursively as π̃_n = F̃_n π̃_{n-1}, with F̃_n = C_n B P_n (3.6) evolving as follows: for any measure ρ on X × Θ, B is defined as the blocking operator (3.7), with B^K ρ being the marginal of ρ on (X × Θ)^K. Before we can explicitly show the effect of B^K on F̃_n ρ for any n ≥ 1 and any measure ρ on X × Θ, we first need to define the block versions of the prediction operator P_n given in (3.3) and the correction operator C_n given in (3.4), as follows: define P^K_n as the prediction operator specific to block K for any measure ρ_1 on (X × Θ)^{∪_{K′∈N(K)} K′}, and define C^K_n as the correction operator specific to block K for any measure ρ_2 on (X × Θ)^K. (3.9) Then F̃_n ρ can be written block by block in terms of these operators, for any measure ρ on X × Θ. The IBPF approximated filter, denoted π̂_n, can be expressed recursively as π̂_n = F̂_n π̂_{n-1}, π̂_0 = δ_x δ_θ, (3.11) with F̂_n = C_n B S^J P_n (3.12), where S^J is the sampling operator (3.13). The following assumption is enforced in obtaining our main theoretical result (Theorem 3.3): Assumption 3.1.
For any v ∈ V, x_{n-1}, x_n ∈ X, y_n ∈ Y, θ_{n-1}, θ_n ∈ Θ, and n ≥ 1, we impose the following conditions: (1) suppose there exists ε_x > 0 such that ε_x ≤ f_{X^v_n|X_{n-1}}(x^v_n | x_{n-1}; θ_n) ≤ ε_x^{-1}; (2) suppose there exists ε_y > 0 such that ε_y ≤ f_{Y^v_n|X^v_n}(y^v_n | x^v_n; θ_n) ≤ ε_y^{-1}; (3) suppose there exists ε_θ(σ) > 0 such that ε_θ(σ) ≤ f_{Θ^v_n|Θ_{n-1}}(θ^v_n | θ_{n-1}; σ) ≤ ε_θ(σ)^{-1}. Remark 3.2. In Assumption 3.1, (1) and (2) are the same as the conditions enforced in Theorem 2.1 of Rebeschini and Van Handel (2015), which are localized versions of standard assumptions routinely employed in the analysis of PFs, and (3) is the same condition imposed on the transition density of Θ. Analogous to the global mixing assumption implying that the underlying Markov chain is strongly ergodic, a local transition density bounded above and below, as a local counterpart of the global mixing assumption, implies that the underlying Markov chain is strongly ergodic in a local sense. Also, just as the standard assumptions in PF analysis are usually relaxed in practical applications, the same can be expected of our localized Assumption 3.1. We note that the highly nonlinear stochastic spatiotemporal process covered in Section 4.1 does not have the product-form transition densities (2.1) and does not meet Assumption 3.1, yet the excellent performance of IBPF is consistent across different scenarios. Recall that |K|_∞ defined in (2.6) is the maximal size of a single block in the partition K, ∆ defined in (2.7) is the maximal number of vertices that interact with one single vertex in its r-neighborhood, and ∆_K defined in (2.8) is the maximal number of blocks that interact with a single block (including itself). In the following theorem, we bound the error of our IBPF algorithmic filter π̂_n defined in (3.11) relative to the true (non-algorithmic) filter π_n defined in (3.1), uniformly both in time n and in the model dimension card(V): Theorem 3.3. With ε_x, ε_θ(σ), and ε_y satisfying Assumption 3.1, when condition (3.14) holds, for every n ≥ 0, K ∈ K and k ⊆ K, the bound (3.15) holds. Interpretations of Theorem 3.3 and its proof idea are provided in the following remark: Remark 3.4.
With the standard PF algorithm, where all observations are used to update the filtering distribution, the algorithmic error is exponential in the dimension of the model under the global metric ||| · ||| defined in (2.9). With the IBPF algorithm, only the observations in a block, say K, are used to update the filtering distribution in that block. One can then expect the algorithmic error to be exponential merely in the dimension of the block, card(K), under the local metric ||| · |||_k defined in (2.10). Theorem 3.3 confirms that intuition and provides a guarantee on beating the COD. The two terms in the upper bound result from the algorithmic bias and the algorithmic variance respectively, and were obtained separately in the proof. An intermediate filter without Monte Carlo sampling, defined in (3.5), was used to separate the bias term and the variance term. To control the bias generated by the blocking operator, the decay of correlations (DOC) property was established with the help of the Dobrushin comparison theorem. This property means that the influence of a perturbation on the marginal distribution at a vertex v ∈ K decays exponentially in the distance from v to the boundary ∂K of the block. This idea is revealed in the e^{−βd(k,∂K)} factor in the first term of the bound. The second term of the bound quantifies the error due to the variance of the Monte Carlo sampling of the IBPF algorithm. As in standard PF analysis, Monte Carlo sampling provides the 1/√J factor, here under the local metric in this local-update setting. Given that each block interacts with at most ∆_K neighboring blocks in the previous time step, the ∆_K factor in the second term is expected. Proof of Theorem 3.3. With ||| · |||_k defined in (2.10), by the triangle inequality, the filter error decomposes into a bias term, comparing the true filter with the intermediate filter, plus a variance term, comparing the intermediate filter with the IBPF filter, as in (3.16). In the following, we bound the bias term in Step 1 and the variance term in Step 2, and sum them up in Step 3. Step 1.
Bounding ||| π n − π n ||| k . Let us firstly use the local total variation distance · k defined in (2.12) to bound π n − π n k . By (3.1), we have that By (3.5), we have that Given that π 0 = π 0 , we can bound π n − π n k by means of error decomposition We display our discussions as follows: • When s = n in (3.17), by Proposition D.4, we have By Proposition D.6, we have that for every s ≥ 1, K ∈ K, and v ∈ K , With the condition , by Assumption 3.1 and the fact that ∆, ∆ K ≥ 1, we have Then the definition of β given in (3.15) and the fact that r ≥ 1 yield By (3.21) and the fact that we have By the definition of ∂K given in (2.5) which is the subset of vertices in K that can interact with vertices outside K , we have ∂K ⊆ K which together with the fact that v ∈ K and ∂K ⊆ K yield Then (3.22) and (3.23) give Hence, Plugging (3.18) for the s = n case and (3.24) for the s ∈ {1, . . . , n − 1} case, into the error decomposition equation (3.17), we have which can be simplified, using the sum of geometric series due to the fact that e −β < 1 from (3.21), as follows: Since there is no random sampling in π n − π n k , we can replace it with ||| π n − π n ||| k in the above inequality and then obtain Step 2. Bounding ||| π n − π n ||| k . By (3.11), we have that Given that π 0 = π 0 , we can bound ||| π n − π n ||| k by means of error decomposition We display our discussions as follows: • When s = n in (3.26), by the definition of ||| · ||| k in (2.10), we can see that for k ⊆ K, Recalling that P n is the prediction operator defined in (3.3), C K n is the correction operator defined in (3.9), B K is the blocking operator defined in (3.7), and S J is the sampling operator defined in (3.13), by the expressions of F n given in (3.6) and F n given in (3.12), we have Furthermore, by Theorem B.3, we have F n π n−1 − F n π n−1 K ≤2 −2card(K) y P n π n−1 − S J P n π n−1 . 
Noting that by Lemma 4.7 in Van Handel (2008) for ρ being any probability measure and then we have (3.27) • When s = n − 1 in (3.26), by the definition of · k given in (2.12) and the definition of ||| · ||| k given in (2.10), for k ⊆ K we have where the last inequality follows by Proposition E.1. • When s ∈ {1, . . . , n − 2} in (3.26), by the definitions of · k and ||| · ||| k , we have Then firstly by Proposition E.3 we have , and then by Proposition E.1 we have (3.29) Plugging (3.27) for the s = n case, (3.28) for the s = n − 1 case, and (3.29) for the s ∈ {1, . . . , n − 2} case, into the error decomposition equation (3.26), we have Since card(K) ≤ |K| ∞ for any K ∈ K, using the sum of geometric series due to the fact that e −β < 1 from (3.21), the above expression can be simplified as follows: (3.30) Step 3. Summing up. Now plugging (3.25) and (3.30) into (3.16), we have Since the above bound is uniform on all k, we complete the proof. Since the same set of data Y 1 , . . . , Y N is used in any one of the M iterations and the upper bound in Theorem 3.3 is uniform on data, given the result of the m-th iteration for m ∈ [M ] is simply the initial value of the (m + 1)-th iteration according to the design of IBPF in Algorithm 1, we can see that all M iterations together can be represented as a filtering problem for this extended graphical POMP model on M replications of the data as follows: Hence, the convergence and the error bound as M goes to infinity follow. Recall that our IBPF algorithm is the iterated version of the BPF algorithm, the IF2 algorithm in Ionides et al. (2015) is the iterated version of the PF algorithm, and the BPF algorithm is the block version of the PF algorithm. We can see that our IBPF algorithm is the block version of the IF2 algorithm. 
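The shared iterated structure of IF2 and IBPF (perturb the parameter swarm at each observation time, weight by the measurement density, resample, and shrink the perturbation scale across iterations) can be sketched on a toy problem. Everything below (the scalar Gaussian model, the swarm size, the geometric cooling schedule) is a hypothetical illustration, not the measles model of Section 4:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.0, size=50)       # toy data; the MLE is the sample mean

J, M = 300, 30                          # swarm size (particles) and iterations
theta = rng.uniform(-5.0, 5.0, size=J)  # initial parameter swarm
for m in range(M):
    sigma_m = 0.5 * 0.9 ** m            # geometrically decreasing perturbation scale
    for y_n in y:
        # perturb the swarm, playing the role of f_{Theta_n | Theta_{n-1}}
        theta = theta + sigma_m * rng.standard_normal(J)
        w = np.exp(-0.5 * (y_n - theta) ** 2)      # measurement weights
        w /= w.sum()
        theta = theta[rng.choice(J, size=J, p=w)]  # resample the swarm

estimate = theta.mean()                 # the swarm concentrates near the MLE
```

IBPF replaces the single global weighting-and-resampling step with independent per-block steps, which is what removes the exponential-in-dimension factor from the error bound.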
As indicated in Remark 3.4, with the standard PF algorithm, where all observations are used to update the filtering distribution, the algorithmic error is exponential in the dimension of the model under the global metric ||| · |||; with the IBPF algorithm, where only the observations in a block are used to update the filtering distribution in that block, the algorithmic error is exponential merely in the dimension of the block under the local metric ||| · |||_k. Hence, the MLE analysis in Ionides et al. (2015) is expected to guarantee the block MLE of our IBPF algorithm as σ goes to zero, under the local metric ||| · |||_k. We note that our Assumption 3.1 (3) violates the (B5) condition on page 3 of Ionides et al. (2015), but in fact only a weaker version of their (B5) condition is needed in the asymptotic analysis as σ tends to zero: specifically, the compact support assumption in (B5) can be relaxed by replacing 0 with o(σ). With this modification, and with the filtering error bound generated from the PF algorithm in the MLE analysis of Ionides et al. (2015) replaced by our Theorem 3.3, that MLE analysis guarantees the block MLE of our IBPF algorithm as σ goes to zero, under the local metric ||| · |||_k. In this section, we illustrate how IBPF can be used and compare its performance with that of IF2 and IEnKF, using the spatiotemporal model covered in Section 4.1 over the dataset covered in Section 4.2. We gradually increase the parameter learning difficulty in 4 cases, with consistent fairness in all experiments, measuring all three algorithms with their specific likelihood metrics, in Section 4.3. We implement IF2 and IEnKF through the spatPomp package. Measles is a highly contagious infectious disease caused by the measles virus; it spreads easily from one person to the next through the coughs and sneezes of infected people.
In this section, we consider a generalized spatiotemporal model for the disease transmission dynamics of measles within and between multiple cities. A compartment modeling framework for spatiotemporal population dynamics divides the population at each spatial location into compartments. Specifically, measles transmission at each location is modeled according to an SEIR model with 6 compartments: (S) susceptible individuals who have not yet been infected but may experience infection later; (E) exposed individuals carrying a latent infection; (I) infectious individuals who can transmit the infection to others; (R) recovered individuals who are no longer infectious and are immune; (B) the birth of individuals; and (D) the death of individuals. Park and Ionides (2020) generalized the compartment model presented by He et al. (2010) to the spatiotemporal modeling setting, which has been analyzed in other work such as Ionides et al. (2021). We further generalize that spatiotemporal compartment model by allowing the dynamics at each spatial location to have their own location-specific parameters, including different disease transmission speeds across locations. By demonstrating methodology that scales to vertex-specific parameters, we open up new possibilities for spatiotemporal inference, though we focus here on testing statistical tools and so do not engage directly in the scientific debates. The numbers of individuals in compartments S, E, I, and R of city v at time t are denoted by the integer-valued random variables S_v(t), E_v(t), I_v(t), and R_v(t) respectively. Let N^v_{ij}(t) denote the counting process enumerating cumulative transitions from compartment i to compartment j, where i, j ∈ {B, S, E, I, R, D} and i ≠ j, in city v up to time t. We model the 40 largest cities in the UK, ordered in decreasing size with v = 1 corresponding to London.
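The compartment flows just described can be sketched with a minimal stochastic SEIR step for a single city. This is a simplified, hypothetical discretization (binomial approximations to the transition counting processes, with births, deaths, and between-city travel omitted), not the paper's full model:

```python
import numpy as np

rng = np.random.default_rng(4)

def seir_step(S, E, I, R, beta, mu_EI=52.0, mu_IR=52.0, dt=1.0 / 365.0):
    """One Euler step: draw binomial counts for the S->E, E->I, I->R transitions."""
    N = S + E + I + R
    p_SE = 1.0 - np.exp(-beta * I / N * dt)  # per-susceptible infection probability
    p_EI = 1.0 - np.exp(-mu_EI * dt)         # end of latency
    p_IR = 1.0 - np.exp(-mu_IR * dt)         # recovery
    n_SE = rng.binomial(S, p_SE)
    n_EI = rng.binomial(E, p_EI)
    n_IR = rng.binomial(I, p_IR)
    return S - n_SE, E + n_SE - n_EI, I + n_EI - n_IR, R + n_IR

state = (100000, 100, 100, 900000)           # hypothetical city of ~1 million
for _ in range(365):                          # simulate one year of daily steps
    state = seir_step(*state, beta=400.0)
```

With no births or deaths in this sketch, the total population is conserved at every step, so the R compartment is implied by the other three, mirroring the implicit definition of R_v(t) in the full model.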
Our model is described by the following system of stochastic differential equations, for v ∈ {1, . . . , 40} (4.1). The total population of each city is calculated by smoothing census data and is treated as known. Hence, by (4.1), the number of recovered individuals R_v(t) in city v is defined implicitly. The birth process N^v_{BS}(t) is a time-inhomogeneous Poisson process with rate μ^v_{BS}(t) given by interpolated census data. The transition processes N^v_{EI}(t), N^v_{IR}(t) and N^v_{•D}(t) are modeled as conditional Poisson processes with per-capita rates μ_{EI}, μ_{IR} and μ_{•D} respectively. The transition process N^v_{SE}(t) is modeled as a negative binomial death process following Bretó et al. (2009) and Bretó and Ionides (2011), with over-dispersion parameter σ^v_{SE} and rate given by (4.2). Here, β_v(t) models seasonality driven by high contact rates between children at school, described by (4.3), where p is the proportion of the year taken up by the school terms, R^v_0 is the annual average basic reproductive ratio, and θ_a measures the reduction of transmission during school holidays. In (4.2), α_v is a mixing exponent modeling inhomogeneous contact rates within city v, and ι models the immigration of infected individuals, which is appropriate when analyzing a subset of cities that cannot be treated as a closed system. In (4.2), the number of travelers from city v to city v′ is denoted by θ_{vv′}, constructed as fixed through time and symmetric between any two cities using the gravity model of Xia et al. (2004), where d(v, v′) denotes the distance between city v and city v′, P_v is the average population of city v across time, P̄ is the average population across cities, and d̄ is the average distance between a randomly chosen pair of cities.
imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021
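The gravity coupling can be sketched numerically. The populations, coordinates, and constant G below are hypothetical, and the scaling G · (d̄/P̄²) · P_v P_{v′} / d(v, v′) is one common form of the Xia et al. (2004) specification, assumed here rather than quoted from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
pop = rng.uniform(1e4, 1e6, size=n)          # hypothetical city populations P_v
xy = rng.uniform(0.0, 100.0, size=(n, 2))    # hypothetical city coordinates
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

P_bar = pop.mean()                           # average population across cities
d_bar = dist[np.triu_indices(n, 1)].mean()   # average distance between city pairs
G = 400.0                                    # hypothetical gravitation constant

theta = np.zeros((n, n))                     # travel volumes theta_{v v'}
i, j = np.nonzero(~np.eye(n, dtype=bool))    # all ordered pairs with v != v'
theta[i, j] = G * (d_bar / P_bar**2) * pop[i] * pop[j] / dist[i, j]
```

By construction theta is symmetric and fixed through time, matching the constraints stated above.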
To complete the model specification, the measurement process modeling the partial observability is defined as follows: with the count of removed infected individuals in the n-th reporting interval, suppose that cases are quarantined once they are identified, so that reported cases comprise a fraction ρ of these removal events; the case report y^v_n is then modeled as a realization of a discretized conditionally Gaussian random variable, where Φ( · ; μ, σ²) is the cumulative distribution function of Normal(μ, σ²) and ψ models overdispersion relative to the binomial distribution. Figure 1 shows a simulation (Plot A) from the model covered in Section 4.1 and the real measles data (Plot B). We note that the spatiotemporal model considered in Park and Ionides (2020) and Ionides et al. (2021) is a special case of ours, obtained by taking the location-specific parameters α_v = 1 (in equation (4.2)), R^v_0 = 30 (in equation (4.3)) and σ^v_{SE} = 0.15 year^{1/2} the same for all locations, and taking all the initial compartment values at time 0 the same for all locations. In our simulation, for each location v, we draw the corresponding variables according to uniform distributions over the [0.99, 1.0355]-scaled range of those in Ionides et al. (2021). We take the other parameters as the fixed values used by Ionides et al. (2021). From Figure 1, we can see that the simulation shares the biennial pattern, with most cities locked in phase most of the time. It has been hard to understand disease transmission dynamics between locations that have important levels of interaction yet are not locked in synchrony (Becker et al. (2020)), and some such systematic features are captured by our simulation trajectories. In this section, we conduct a performance analysis of IEnKF, IF2 and our IBPF on learning the initial values of the measles transmission dynamics and the nonlinear parameters for each location, over the dataset covered in Section 4.2, using the spatiotemporal model covered in Section 4.1.
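The discretized conditionally Gaussian report density described above can be sketched as follows. The mean ρC and the overdispersed variance ρ(1−ρ)C + (ψρC)² follow the He et al. (2010) specification, and the particular values of the reporting rate ρ and overdispersion ψ are hypothetical:

```python
from math import erf, sqrt

def Phi(x, mu, var):
    """Normal(mu, var) cumulative distribution function."""
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2.0 * var)))

def dmeasure(y, C, rho=0.5, psi=0.15):
    """Probability of a case report y given C removals: a Gaussian with
    mean rho*C and overdispersed variance, discretized onto the integers."""
    mu = rho * C
    var = rho * (1.0 - rho) * C + (psi * mu) ** 2
    return Phi(y + 0.5, mu, var) - Phi(y - 0.5, mu, var)

lik = dmeasure(48, 100)   # likelihood of observing 48 reported cases from 100 removals
```

Summing the density over all integer reports recovers (essentially) total probability one, which is the point of the ±0.5 discretization.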
We gradually increase the parameter learning difficulty in 4 cases. In the first case, the goal is to learn the initial values of the measles transmission dynamics for each location: S_v(0), E_v(0), I_v(0), and R_v(0). In the second case, in addition to these 4 initial values, the goal is to also learn the nonlinear parameter R^v_0 for each location. In the third case, in addition to these 4 initial values, the goal is to also learn the highly nonlinear parameter α_v for each location. In the fourth case, the goal is to learn all the location-specific parameters. Fairness is maintained across all experiments for all three algorithms, as follows: • Each algorithm uses the same number of iterations M = 100 (see Algorithm 1 for notation) and the same number of particles N = 80000. • We conduct 10 replicates of all the parameter learning performance comparisons, such that in each replicate all algorithms start with the same initial search values, drawn uniformly over specified ranges, where the initial values are further normalized to satisfy the SEIR model requirement. • Each algorithm has its own metric of log-likelihood: IEnKF uses the EnKF log-likelihood estimate, denoted l_e; IF2 uses the PF log-likelihood estimate, denoted l_p; our IBPF uses the BPF log-likelihood estimate, denoted l_b. We evaluate the best parameters learned using each algorithm with the metrics of the other two algorithms, in all the experiments. For example, we evaluate the best parameters learned using the IEnKF algorithm through its own l_e metric and also with the other two metrics (l_p and l_b). • One additional setup choice is needed for our IBPF algorithm: to reduce human intervention, we simply require each block to contain exactly 2 cities in all the experiments, since 2 cities is the minimum (with 1 city per block, our IBPF reduces to IF2). That is, the first block is city 1 and city 2, the second block is city 3 and city 4, and so on.
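The pairing scheme just described is straightforward to write down:

```python
cities = list(range(1, 41))                       # 40 cities; v = 1 is London
blocks = [cities[i:i + 2] for i in range(0, len(cities), 2)]
# blocks = [[1, 2], [3, 4], ..., [39, 40]]: 20 blocks of exactly 2 cities each
```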
Hence, the number of blocks card(K) = number of cities / 2. We note that the last item above can be seen as unfair to our IBPF algorithm: users may choose any appropriate number of cities per block instead of the simplest setup (each block having 2 cities) used in all experiments. The performance of our IBPF algorithm could therefore match or exceed that reported in these 4 cases, which is already good enough to demonstrate its strength clearly. The best parameter learning results from 10 replicates of the experiments of all three algorithms are as good as the true parameter under the EnKF metric and the BPF metric, while the best parameter learning results consistently outperform the true parameter under the PF metric. We can see that for nonlinear and non-Gaussian but not highly nonlinear problems (both case 1 and case 2), the IEnKF algorithm performs very well; IF2 scales well at least up to 20 cities for the simpler initial-value learning problem in case 1, while its performance starts to drop from 4 cities under all metrics in case 2, which confirms the phenomenon that PF may not scale well with dimension (Bengtsson et al. (2008); Snyder et al. (2008)); our IBPF is consistently as good as the true parameter under the EnKF metric and the BPF metric, and consistently much better than the true parameter under the PF metric, in both case 1 and case 2. Figure 3 reports the log-likelihood estimates per city per time step for various dimensions in cases 3 and 4, with the same transformation as in Figure 2 applied to the corresponding original parameter learning results reported in Tables 3 and 4 in Appendix F. Recall that the goal of case 3 is to learn the initial values (S_v(0), E_v(0), I_v(0), R_v(0)) and the highly nonlinear parameter α_v.
In equation (4.2), we can see that through α_v the dynamics of one city interact directly with those of other cities, which is why the model under investigation is called an interacting particle system in mathematical terminology. Case 4 is the hardest of the 4 cases: its goal is to learn all the location-specific parameters. From Figure 3, we can see that, even for 2 cities and under all 3 metrics, IEnKF performs very badly in case 3 and fails completely in case 4, which confirms the phenomenon that EnKF may not perform well for highly nonlinear and non-Gaussian problems (Ades and Van Leeuwen (2015); Lei et al. (2010); Miller et al. (1999)); the performance of IF2 drops in both cases under all three metrics; and the performance of our IBPF in both cases is consistently as good as that of the true parameter under the EnKF metric and the BPF metric, and consistently much better than that of the true parameter under the PF metric. In Algorithm 2, we provide the IF2 pseudocode of Ionides et al. (2015) for comparison. The following Dobrushin comparison theorem can be found as Theorem 3.1 in Rebeschini and Van Handel (2015) and Theorem 8.20 in Georgii (2011). Theorem B.1 (Dobrushin comparison theorem). Let I be a finite set. Let S = ∏_{i∈I} S_i, where S_i is a Polish space for each i ∈ I. Define the coordinate projections X_i : x → x_i for x ∈ S and i ∈ I.
For any probability measures ρ and ρ′ on S, define the matrix C = (C_ij)_{i,j∈I} and the vector b = (b_i)_{i∈I} as in Theorem 3.1 of Rebeschini and Van Handel (2015). If the Dobrushin condition max_{i∈I} Σ_{j∈I} C_ij < 1 holds, then for every J ⊆ I the comparison bound on ρ − ρ′ stated therein holds. Algorithm 2 (The IF2 algorithm). Input: simulator for f_{X_0}(x_0; θ); simulator for f_{X_n|X_{n−1}}(x_n | x_{n−1}; θ), n ∈ 1:N; evaluator for f_{Y_n|X_n}(y_n | x_n; θ), n ∈ 1:N; data, y_{1:N}; number of iterations, M; number of particles, J; initial parameter swarm, {Θ_j^0, j ∈ 1:J}; perturbation density, f_{Θ_n|Θ_{n−1}}(θ | ϑ; σ), n ∈ 1:N; perturbation sequence, σ_{1:M}. For each iteration m and each time step n: draw X^{P,m}_{n,j} ∼ f_{X_n|X_{n−1}}(x_n | X^{F,m}_{n−1,j}; Θ^{P,m}_{n,j}) for j ∈ 1:J; compute w^m_{n,j} = f_{Y_n|X_n}(y_n | X^{P,m}_{n,j}; Θ^{P,m}_{n,j}) for j ∈ 1:J; draw s_{1:J} with Prob(s_j = i) = w^m_{n,i} / Σ_{u=1}^{J} w^m_{n,u}; set X^{F,m}_{n,j} = X^{P,m}_{n,s_j} and Θ^{F,m}_{n,j} = Θ^{P,m}_{n,s_j} for j ∈ 1:J. At the end of each iteration, set Θ^m_j = Θ^{F,m}_{N,j} for j ∈ 1:J. Theorem B.2 (Lemma 4.1 of Rebeschini and Van Handel (2015)). Let probability measures ν, ν′, F, F′ and ε > 0 be such that ν(A) ≥ ε F(A) and ν′(A) ≥ ε F′(A) for every measurable set A. Then the bound of that lemma on ν − ν′ holds. Theorem B.3 (Rebeschini and Van Handel (2015)). Let ρ and ρ′ be probability measures and let Λ be a bounded and strictly positive measurable function. Define the corresponding reweighted probability measures; then the stated bound on their difference holds. Theorem B.4 (Lemma 4.3 of Rebeschini and Van Handel (2015)). Let I be a finite set and let m be a pseudometric on I. Let C = (C_ij)_{i,j∈I} be a matrix with nonnegative entries. Suppose that max_{i∈I} Σ_{j∈I} e^{m(i,j)} C_ij ≤ c < 1. Then the matrix D = Σ_{n≥0} C^n satisfies max_{i∈I} Σ_{j∈I} e^{m(i,j)} D_ij ≤ 1/(1−c). Theorem B.5 (Rebeschini and Van Handel (2015)). Let µ = µ_1 ⊗ · · · ⊗ µ_d and ν = ν_1 ⊗ · · · ⊗ ν_d be product probability measures on S = S_1 × · · · × S_d, and let Λ : S → R be a bounded and strictly positive measurable function. Let µ_Λ and ν_Λ be the correspondingly reweighted probability measures. Suppose there exists a constant ε > 0 such that the following holds: for every i = 1, · · · , d, there is a measurable function Λ_i : S → R, with the required bounds in terms of ε, such that Λ_i(x) = Λ_i(x̃) whenever x_{{1,...,d}\{i}} = x̃_{{1,...,d}\{i}}. Then the stated bound on µ_Λ − ν_Λ holds. Theorem B.6 (Corollary 4.21 of Rebeschini and Van Handel (2015)).
For any subset of blocks L ⊆ K, we have for every probability measure µ on X . imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 For any probability measure µ s−1 on X × Θ at time s − 1 for any integer s ≥ 1, any vertex v ∈ V , any block K ∈ K, and any set we define the following quantities which will be used frequently in the proofs: imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 . Based on the above quantities, with v ∈ V we define and with β being a finite positive constant we further define Then, for F s defined in (3.1) and F s defined in (3.5), any probability measure ν s−1 on X × Θ at time s − 1 for any integer s ≥ 1, and any set A ⊆ X × Θ, we have According to the definition of µ v s given in (C.1), for any K ∈ K and v ∈ K, we have . By (C.12) and (C.13), according to the definition of µ v s|s+1 given in (C.2), for any K ∈ K and v ∈ K, we have . By (C.12) and (C.13), according to the definition of µ v s|s+1 given in (C.3), for any K ∈ K and v ∈ K, we have . By (C.12) and (C.13), according to the definition of µ v,K s|s+1 given in (C.4), for any K, K ∈ K and v ∈ K, we have . By (C.12) and (C.13), according to the definition of µ v,K s|s+1 given in (C.5), for any K, K ∈ K and v ∈ K, we have . Lemma D.1. Under condition (3.14), for β defined in (3.15), the following holds: Proof. By the definition of β, we have That is, . Since r, ∆, ∆ K ≥ 1 and 0 ≤ x , θ (σ) ≤ 1, we have Lemma D.2. Under Assumption 3.1, for any n ≥ 0, we have where Corr( · , β) is defined in (C.9) and β is given in (3.15). Proof. Since π 0 = π 0 = δ x δ θ which is non-random, by the definition of Corr( · , β) given in (C.9), we have Corr( π 0 , β) = 0. In the following, we prove by the method of induction and assume that for n ≥ 1 Corr( π n−1 , β) < 1 9 . . Then according to the definitions of ( F s ν s−1 ) v,K s|s+1 in (C.19) and ( F s ν s−1 ) v,K s|s+1 in (C.21) for s ≥ 1, we have n,v) . 
In order to use Theorem B.1 (Dobrushin comparison theorem) to bound ρ − ρ (n,v) , we need to bound C ij and b i with i = (k, t) and j = (k , t ). Set , whose definitions are given in Theorem B.1. Note that . We display our discussions as follows: • When k = n − 1, we have . We can see that ρ i = π t,K n−1|n , by the definition of µ v,K s−1|s given in (C.4). Therefore, if k = n − 1, by the definition of C πn−1 tt given in (C.7), we know that C ij ≤ C πn−1 tt . Note that n |Θn−1 (θ ω n | θ n−1 ; σ) π t n−1 (dx t n−1 , dθ t n−1 ) , and then we have C ij ≤ 1 − 2 x [ θ (σ)] 2 if k = n and v ∈ N (t) by Theorem B.2, and C ij = 0 otherwise. Recalling that at the beginning of this proof , and ρ i (A) imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 . • When k = n, we have . Then we have C ij ≤ 1 − 2 x [ θ (σ)] 2 if k = n − 1 and t ∈ N (v) by Theorem B.2, and C ij = 0 otherwise. Note that . otherwise. Define the matrix (C ij (v)) i,j∈I whose entries are given below: In sum, we have that where the third inequality is by the assumption of the induction method and Lemma D.1. Next, applying Theorem B.1 (Dobrushin comparison theorem) we obtain that By the definition of C µs−1 vv in equation (C.7), we have By the definition of Corr( · , β) in equation (C.9), we have where the second inequality holds by (D.1) and Theorem B.4, and the last inequality holds by Lemma D.1. Lemma D.3. Under Assumption 3.1, when condition (3.14) holds, for any n ≥ 0, we have where Corr( · , β) is defined in (C.8) and β is given in (3.15) . Proof. For µ v s−1|s defined in (C.2) and µ v,K s−1|s defined in (C.4), we have . , and θ n , θ n ∈ Θ be such that defined in (C.5), by Theorem B.3, we have Note that and then by Theorem One can establish the inequality Corr( π n , β) Under condition (3.14) that Proposition D.4. Under Assumption 3.1, when condition (3.14) holds, for every n ≥ 1, K ∈ K and k ⊆ K, we have that . 
Then for any k ⊆ K and K ⊆ V , we have F n π n−1 − F n π n−1 k = ρ − ρ {n}×k . In order to use Theorem B.1 (Dobrushin comparison theorem) to bound ρ − ρ {n}×k , we need to bound C ij and b i with i = (k, v) and j = (k , v ). Set xn,θn) and ρ i = ρ i (xn−1,θn−1,xn,θn) , whose definitions are given in Theorem B.1. We display our discussions as follows: • When k = n − 1, we have . We can see that ρ i = π v n−1|n , according to the definition of µ v s−1|s defined in (C.2). Therefore, if k = n − 1, by the definition of C µs−1 vv in equation (C.6), we know that C ij ≤ C πn−1 vv . If k = n, since by Theorem B.2, and C ij = 0 otherwise. Hence, by the definition of Corr(µ s−1 , β) in (C.8), we have (D.2) Next we take care of b i . When k = n − 1, we have . imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 • When k = n, we have . Hence, . Summing up (D.2) for k = n − 1 and (D.3) for k = n, we have that 2 )e β(r+1) ∆. Furthermore, by Lemma D.3 we have that Corr( π n , β) < 1 3 , and by Lemma D.1 we have that Next, applying Theorem B.1 (Dobrushin comparison theorem) and Theorem B.4, we obtain that Proposition D.5. Under Assumption 3.1, when condition (3.14) holds, we have that for every k ⊆ K, K ∈ K and s ∈ {1, . . . , n − 1}, where β is given in (3.15). Proof. Define I = {s, . . . , n} × V and S = (X × Θ) n−s+1 . Define ρ = P Fs πs−1 [X s , X s+1 , . . . , X n ∈ · , Θ s , Θ s+1 , . . . , Θ n ∈ · | Y s+1 , . . . , Y n ], ρ = P Fs πs−1 [X s , X s+1 , · · · , X n ∈ · , Θ s , Θ s+1 , · · · , Θ n ∈ · | Y s+1 , . . . , Y n ]. Then we have In order to use Theorem B.1 (Dobrushin comparison theorem) to bound ρ − ρ {n}×k , we need to bound C ij and b i with i = (k, v) and j = (k , v ). Set xs,...,xn,θs,...,θn) and ρ i = ρ i (xs,...,xn,θs,...,θn) , whose definitions are given in Theorem B.1. We display our discussions as follows: • When k = s, we have according to the definition of µ v s−1|s given in (C.2). 
Therefore, when k = s, by the definition of C πs vv in equation (C.6), we know that C ij ≤ C πs vv . When by Theorem B.2, and C ij = 0 otherwise. Hence, by the definition of Corr(µ s−1 , β) in (C.8), imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 • When k ∈ {s + 1, . . . , n − 1}, we have . Note that . . • When k = n, note that x v n only depends on x v n−1 where v ∈ N (v), and Corr( π s , β) < 1 3 , and by Lemma D.1 we have that imsart-generic ver. where we used (D.7) in the last inequality. Proposition D.6. Under Assumption 3.1, when condition (3.14) holds, we have that for every s ≥ 1, K ∈ K, and v ∈ K , where β is given in (3.15). Proof. According to the expressions of (F s ν s−1 ) v s|s+1 given in (C.14) and ( F s ν s−1 ) v s|s+1 given in (C.15), we have . probability measures on S as follows: . Then we have s,v ) . In order to use Theorem B.1 (Dobrushin comparison theorem) to bound ρ − ρ (s,v ) , we need to bound C ij and b i with i = (k, v) and j = (k, v). Set and ρ i = ρ i (xs−1,θs−1,x v s ,θ v s ) , whose definitions are given in Theorem B.1. We display our discussions as follows: • When k = s − 1, we have where π v s−1 is defined according to the definition of µ v s−1 given in (C.1). Therefore, by the definition of µ v s−1|s given in (C.2), we have ρ i = π v s−1|s . Furthermore, by the definition of C µs−1 vv in equation (C.6), we know that , by Theorem B.2, and C ij = 0 otherwise. Hence, by the definition of Corr(µ s−1 , β) in (C.8), • When k = s, we have By Theorem B.1 (Dobrushin comparison theorem) and Theorem B.4, we have which is uniform for all x s , x s+1 ∈ X and all θ s , θ s+1 ∈ Θ. where |K| ∞ is the maximal size of a block in K defined in (2.6). Proof. For any K ∈ K, by Theorem B.3, we have For ψ K (dx K · ) and λ K (dθ K · ) defined in (2.2), we have where N (K) is defined in (2.4) as the collection of blocks that interact with the block K. 
By equation (8.1) in Georgii (2011), we have Therefore, by Minkowski's integral inequality, By Assumption 3.1 and the fact that is a transition density, we have that Similarly, by Assumption 3.1 and the fact that f Θ v s+1 |Θs (θ v s+1 | θ s ; σ) is a transition density, we have that Furthermore, by Assumption 3.1, we have Hence, by Theorem B.3 and Assumption 3.1, Set τ = P s π s−1 andτ = S J P s π s−1 . Suppose N (K) = {K 1 , . . . , K d }, then for any bounded function f : (X × Θ) d → R, we have Therefore, we have and then plugging in equation (E.2) we have where the last inequality is by Theorem B.6. Plugging in equation (E.1), the proof is complete. Lemma E.2. Under Assumption 3.1, when condition (3.14) holds, one has that for every K ∈ K, k ⊆ K and s ∈ {1, . . . , n}, F n · · · F s δ x δ θ − F n · · · F s δ x δ θ k ≤ 12 5 e −β(n−s+1) card(k), . Proof. We first tackle the s = 1 case that F n · · · F 1 δ x δ θ − F n · · · F 1 δ x δ θ k and then generalize to all s ∈ {1, . . . , n}. Recalling that N (K) is defined in (2.4) as the collection of blocks that interact with the block K in one time step, we define a "block" tree T as where K n = K. That is, [K u · · · K n ] is the block K u at time u that has interactions with the block K at time n after n − u time steps; note that this type of representation describes the block-wised interaction trace from time u up to time n. By the effect of block operator B K on F n ρ for any n ∈ N and any measure ρ on X × Θ, given in (3.10), we can write where P K n is defined in (3.8) and C K n is defined in (3.9). The vertex set of the tree are defined as Clearly, the following equivalence holds: Define the index set of leaves of the tree T as and then define the set of non-leaf vertices of the tree as and for any [K u · · · K n ]v ∈ I and any measure ρ define imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 We define two probability measures on S as and , where δ(x, θ) is defined in (3.1). 
Therefore, In the following, we are going to use Theorem B.1 (Dobrushin comparison theorem) to bound ρ − ρ [Kn]k . We will bound C ij and b i with i = [K u · · · K n ]v and j = [K u · · · K n ]v where K n = K n = K and 0 ≤ u, u ≤ n. Set whose definitions are given in Theorem B.1. We display our discussions as follows: • When u = 0, we have imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 • When u ∈ {1, . . . , n − 1}, we have . We can see that ρ i (A) = ρ i (A) and then b i = 0. Next we take care of C ij . Note that when j ∈ c(i) we have , when i ∈ c(j) we have , and when j ∈ ∪ l∈I+:i∈c(l) c(l) we have if j ∈ ∪ l∈I+:i∈c(l) c(l), and C ij = 0 otherwise. Hence, • When u = n, we have . We can see that ρ i (A) = ρ i (A) and then b i = 0. When j ∈ c(i) we have . By Theorem B.2, we have C ij ≤ 1 − 2 x [ θ (σ)] 2 if j ∈ c(i), and C ij = 0 otherwise. Hence, Under Assumption 3.1, when condition (3.14) holds, since ∆ K ≥ 1 and ∆ ≥ 1, we have 1 By Theorem B.1 (Dobrushin comparison theorem) and Theorem B.4, we obtain F n · · · F 1 δ x δ θ − F n · · · F 1 δ x δ θ k = ρ − ρ [Kn]k ≤ 2 × 1 1 − 1 6 e −βn card(k) = 12 5 e −βn card(k), which is a special case of F n · · · F s δ x δ θ − F n · · · F s δ x δ θ k for s = 1. Note that the above bound holds uniformly in the sequence of Y , we can generalize to all s ∈ {1, . . . , n}, F n · · · F s δ x δ θ − F n · · · F s δ x δ θ k ≤ 12 5 e −β(n−s+1) card(k). Proposition E.3. Under Assumption 3.1, when condition (3.14) holds, for any two product measures µ = K∈K µ K and ν = K∈K ν K , one has that for every s ∈ {1, . . . , n − 2}, k ⊆ K, and K ∈ K, E F n · · · F s+2 µ − F n · · · F s+2 ν 2 k 1/2 where β is given in (3.15). imsart-generic ver. 2014/10/16 file: ms_upload_version.tex date: October 22, 2021 and then we can write . 
Further define the measure ζ, so that we can write the corresponding identity; analogously define the measure ς, so that we can write the analogous identity. Recall the definition of the local total variation distance. Since osc(f ) ≤ 2 for |f | ≤ 1, by (E.7) and (E.8) we obtain the corresponding bound. Since A is the filter obtained when the initial condition is a point mass on the leaves of the computation tree, Lemma E.2 gives osc(A) ≤ (6/5) e −βn card(k). Also, by the local total variation distance definition of this paper and by equation (8.1) in Georgii (2011), we have the corresponding inequality. Plugging in equation (E.9), we have

‖F n · · · F 1 µ − F̃ n · · · F̃ 1 ν‖ k ≤ (6/5) e −βn card(k) ‖ζ − ς‖. (E.10)

Note that (x T 0 0 , θ T 0 0 ) depends on (x t 0 , θ t 0 ) for t ∈ T 0 through the terms with c(i) ∩ t ≠ ∅. Write t = [K 0 · · · K n ]; then c(i) ∩ t ≠ ∅ requires i ∈ [K 1 · · · K n ], and therefore

card{i ∈ I + : c(i) ∩ t ≠ ∅} ≤ card(K 1 ) ≤ |K| ∞ .

Define the product of the corresponding factors over i ∈ I + with c(i) ∩ t ≠ ∅; then we obtain the associated factorization. By Theorem B.5, we have the resulting bound, and plugging into equation (E.10), we find that ‖F n · · · F 1 µ − F̃ n · · · F̃ 1 ν‖ k is at most (12/5) e −βn card(k) multiplied by the mixing constant of Assumption 3.1 raised to the power −2|K| ∞ . Since the branching factor of T 0 is at most ∆ K in each of its n layers, we conclude the claimed bound on E[‖F n · · · F 1 µ − F̃ n · · · F̃ 1 ν‖ 2 k ] 1/2 .

Table 1: Parameter learning results in terms of log-likelihood for case 1. Three log-likelihood metrics, l e (the EnKF metric), l p (the PF metric), and l b (the BPF metric), were applied to the best parameters learned by IEnKF, IF2, and IBPF, as well as to the true parameter θ.

Table 2: Parameter learning results in terms of log-likelihood for case 2. Three log-likelihood metrics, l e (the EnKF metric), l p (the PF metric), and l b (the BPF metric), were applied to the best parameters learned by IEnKF, IF2, and IBPF, as well as to the true parameter θ.

Table 3: Parameter learning results in terms of log-likelihood for case 3.
Three log-likelihood metrics, l e (the EnKF metric), l p (the PF metric), and l b (the BPF metric), were applied to the best parameters learned by IEnKF, IF2, and IBPF, as well as to the true parameter θ.

Table 4: Parameter learning results in terms of log-likelihood for case 4. Three log-likelihood metrics, l e (the EnKF metric), l p (the PF metric), and l b (the BPF metric), were applied to the best parameters learned by IEnKF, IF2, and IBPF, as well as to the true parameter θ.

References

The equivalent-weights particle filter in a high-dimensional system.
spatPomp: R package for statistical inference for spatiotemporal partially observed Markov processes.
Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems.
Coexisting attractors in the context of cross-scale population dynamics: Measles in London as a case study.
An overview of existing methods and recent advances in sequential Monte Carlo.
Monte Carlo methods of inference for implicit statistical models.
Mathematical models for emerging disease.
Efficient block sampling strategies for sequential Monte Carlo methods.
An introduction to sequential Monte Carlo methods.
A tutorial on particle filtering and smoothing: Fifteen years later. In Handbook of Nonlinear Filtering.
Sequential data assimilation with a nonlinear quasigeostrophic model using Monte Carlo methods to forecast error statistics.
Gibbs measures and phase transitions.
Spatiotemporal blocking of the bouncy particle sampler for efficient inference in state space models.
The iterated auxiliary particle filter.
Plug-and-play inference for disease dynamics: measles in large and small populations as a case study.
A sequential ensemble Kalman filter for atmospheric data assimilation.
Bagged filters for partially observed spatiotemporal systems.
Iterated filtering.
Inference for nonlinear dynamical systems.
Inference for dynamic and latent variable models via iterated, perturbed Bayes maps.
On blocks, tempering and particle MCMC for systems identification.
Ensemble Kalman methods for high-dimensional hierarchical dynamic space-time models.
Equation-free: The computer-aided analysis of complex multiscale systems.
Inapparent infections and cholera dynamics.
Comparison of ensemble Kalman filters under non-Gaussianity.
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2).
The tightness of the Kesten-Stigum reconstruction bound of symmetric model with multiple mutations.
Information reconstruction on an infinite tree for a 4 × 4-state asymmetric model with community effects.
Large degree asymptotics and the reconstruction threshold of the asymmetric binary channels.
Phase transition of the reconstructability of a general model with different in-community and out-community mutations on an infinite tree.
Markov chain Monte Carlo without likelihoods.
Data assimilation into nonlinear stochastic models.
Inference on high-dimensional implicit dynamic models using a guided intermediate resampling filter.
Can local particle filters beat the curse of dimensionality?
Blocking strategies and stability of particle Gibbs samplers.
Sequential Monte Carlo without likelihoods.
Obstacles to high-dimensional particle filtering.
Lecture notes: Hidden Markov models.
Twisted particle filters. The Annals of Statistics.
Measles metapopulation dynamics: A gravity model for epidemiological coupling and dynamics.
An equation-free, multiscale approach to uncertainty quantification.

Acknowledgements. This research project was supported by NSF grant DMS-1761603. Code and data reproducing our results are available upon request from the authors and will be made publicly accessible upon acceptance for publication.

In Tables 1-4, we provide the original parameter learning results in terms of log-likelihood for cases 1-4, respectively.
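As context for the BPF metric used in Tables 1-4, the following is a minimal sketch of one block particle filter update step (a toy Gaussian example under assumed names; `bpf_step`, `local_loglik`, and the block structure are illustrative, not the paper's IBPF implementation). Each block of the state is weighted and resampled independently using only its own local observation, which is the mechanism behind the dimension-free bounds above.

```python
import numpy as np

def bpf_step(particles, blocks, local_loglik, y, rng):
    """One block particle filter update: weight and resample each block
    independently using only that block's local observation.

    particles : (J, d) array of J particles over a d-dimensional state
    blocks    : list of index arrays partitioning range(d)
    local_loglik(x_block, y_block) -> per-particle log-weights
    y         : observation vector, indexed the same way as the state
    """
    J = particles.shape[0]
    new = np.empty_like(particles)
    for K in blocks:
        logw = local_loglik(particles[:, K], y[K])
        w = np.exp(logw - logw.max())     # normalize for stability
        w /= w.sum()
        idx = rng.choice(J, size=J, p=w)  # resample within this block only
        new[:, K] = particles[idx][:, K]
    return new

# Toy Gaussian example with two blocks of a 4-dimensional state.
rng = np.random.default_rng(0)
particles = rng.normal(size=(500, 4))
blocks = [np.array([0, 1]), np.array([2, 3])]
y = np.zeros(4)
local_loglik = lambda xb, yb: -0.5 * ((xb - yb) ** 2).sum(axis=1)
updated = bpf_step(particles, blocks, local_loglik, y, rng)
print(updated.shape)  # (500, 4)
```

Resampling block by block breaks the joint dependence across blocks (the source of the bias controlled in the analysis above) but keeps the per-block weight variance bounded independently of the total state dimension.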